Data Deduplication
Posted on 19/10/21
What is deduplication?
Deduplication is the process by which redundant, replicated data is erased to save storage space and prevent it from being processed multiple times. Deduplication’s primary objective is cost-saving – the fewer times a copy of the same file has to be retained or processed, the more time and storage costs are saved. Poor management and storage of duplicate data is estimated to cost companies $3.1 trillion a year in the US alone (Gluent, 2020).
Data deduplication is usually carried out by analysing a dataset, identifying duplicated files or data, and deleting the redundant copies. This allows a single file to be stored rather than potentially tens or thousands of copies containing the same information. Applied across an entire dataset, this practice can save gigabytes of storage.
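The scan-identify-delete process described above can be sketched with a simple content-hash approach – a minimal illustration only, assuming files on disk and SHA-256 grouping (this is not a description of any particular product’s internals):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def dedupe(directory):
    """Group files by a SHA-256 hash of their content, then delete
    every file but one instance from each duplicate group.
    Returns the number of redundant files removed."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            # Hashing the content means duplicates are caught even
            # when the file names differ.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)

    removed = 0
    for paths in groups.values():
        for redundant in paths[1:]:  # keep the first instance
            redundant.unlink()
            removed += 1
    return removed
```

Because files are matched on content rather than name, a renamed copy of an attachment is still detected as a duplicate – the same principle that makes deduplication effective on mailbox exports.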
Deduplication and Subject Access Requests
Whilst not a requirement, deduplication has a key role to play in the Data Subject Access Request (DSAR) process. Due to the time-sensitive nature of DSARs, any method or tool that can accelerate the process is valuable for ensuring deadlines are met. Removing redundant files from a subject’s dataset before the review and redaction stages saves significant person-hours and, therefore, expense.
A former employee’s mailbox, for example, is likely to contain huge numbers of duplicate attachments, footer images, email threads, etc. By retaining only a single instance of each file, review and redaction time is significantly shortened. During internal testing, we saw dataset-size reductions of between 19% and 55% via deduplication – e.g. a .pst file containing 1,222 files was cut down to just 548 documents.
Deduplication in Smartbox.ai
Smartbox automatically identifies and indexes duplicate files during its AI analysis. Even if duplicate files do not share a name, the content and its position on a page are used to determine whether a file is an exact copy. Duplicate files are indicated by an icon in the UI.
During or after data upload, users can run the ‘auto-deduplication’ function – this tool automatically selects one instance of each duplicate file to retain and deletes the rest, dealing with redundant files quickly and with minimal user input. Alternatively, duplicates can be handled manually in the Duplicates Report, where users select which duplicates – and which instances of them – are retained and which are deleted.
The flexibility of these tools gives users a choice over how to manage their duplicate data – both in bulk and on an individual basis. Combined with the bulk redaction functionality, an entire dataset can be deduplicated and redacted with only a handful of clicks.