[Review] Data Deduplication: Definition/Necessity/Functions/Types [MiniTool Wiki]
What Is Data Deduplication?
In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Applied successfully, it improves storage utilization and, by reducing the total amount of storage media needed to hold the data, lowers cost. It also reduces the time and bandwidth needed for network data transfers.
Data deduplication is implemented in some file systems (e.g., ZFS and Write Anywhere File Layout) and in various disk array models. On Windows Server, it is available as a service on both NTFS and ReFS.
Windows Server Data Deduplication
This section and the next apply to Windows Server 2016, 2019, and 2022.
Data deduplication, often called Dedup for short, is a feature that reduces the impact of redundant data on storage costs. When enabled, it optimizes free space on the target volume/partition by examining the data there and looking for duplicated portions.
Duplicated portions of the dataset are saved once and (optionally) compressed for extra savings. Data deduplication optimizes redundancies without compromising data integrity or fidelity.
Why Is Data Deduplication Needed?
Large datasets often contain a lot of duplication, which increases the cost of storing the data. Data deduplication helps storage admins reduce the costs associated with duplicated data. The following are some examples of workloads that generate duplication.
- Backup snapshots might have minor differences from day to day.
- User file shares may have a lot of copies of the same or similar files.
- Virtualization guests might be almost identical from one virtual machine (VM) to the next.
The amount of space that you can reclaim with data deduplication depends on the dataset or workload on the target volume. Datasets with high duplication can reach optimization rates of up to 95%, which corresponds to a 20x reduction in storage utilization.
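The link between a savings percentage and an "N-times" reduction is simple arithmetic: if a fraction s of the space is saved, only 1 - s of the original size remains, so the reduction factor is 1 / (1 - s). The sketch below works through that conversion. The function names are illustrative only and are not part of any Windows Server tooling.

```python
def reduction_factor(savings: float) -> float:
    """Convert a space-savings fraction (0 <= savings < 1) into an N-x reduction factor."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in the range [0, 1)")
    return 1.0 / (1.0 - savings)


def savings_from_sizes(logical_bytes: int, stored_bytes: int) -> float:
    """Fraction of space saved when logical_bytes of data occupy stored_bytes on disk."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 1.0 - stored_bytes / logical_bytes


if __name__ == "__main__":
    # 95% savings leaves 5% of the data on disk, i.e. a 20x reduction.
    print(f"{reduction_factor(0.95):.0f}x")  # prints 20x
    # 50% savings is only a 2x reduction.
    print(f"{reduction_factor(0.50):.0f}x")  # prints 2x
```

Because the relationship is not linear, the last few percentage points matter most: going from 90% to 95% savings doubles the reduction factor from 10x to 20x.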
How Does Data Deduplication Work?
The deduplication process works by comparing “chunks” of data (also called “byte patterns”), which are unique, contiguous blocks of data. During analysis, these chunks are identified and stored, and they are then compared with other chunks within the existing data.
Whenever a match happens, the redundant chunk is replaced with a small reference that points to the saved chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
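The process described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product implements it: it assumes fixed-size 4 KiB chunks and uses a SHA-256 digest to detect matching chunks, both of which are choices made here for simplicity. The names `deduplicate` and `rebuild` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4096  # assumed fixed chunk size; real systems may chunk differently


def deduplicate(data: bytes, chunk_size: int = CHUNK_SIZE) -> Tuple[Dict[str, bytes], List[str]]:
    """Split data into chunks, store each distinct chunk once, and
    return (chunk store, ordered list of references to stored chunks)."""
    store: Dict[str, bytes] = {}   # digest -> the single saved copy of the chunk
    recipe: List[str] = []         # small references, in original order
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # save the chunk only if it is new
        recipe.append(digest)            # a match becomes just a reference
    return store, recipe


def rebuild(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Reassemble the original data by following the references."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    data = b"A" * (3 * CHUNK_SIZE) + b"B" * CHUNK_SIZE
    store, recipe = deduplicate(data)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
    assert rebuild(store, recipe) == data
```

In this example four chunks are referenced but only two are stored, and the original bytes are recovered exactly from the store plus the list of references.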
One data deduplication technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. This is distinct from modern approaches to data deduplication, which can operate at the segment or sub-block level.
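To make the whole-file granularity concrete, here is a small sketch of single-instance storage. It is an illustration under assumptions: files are modeled as an in-memory mapping of names to contents, identical files are detected with a SHA-256 digest, and the function name `single_instance_store` is hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def single_instance_store(files: Dict[str, bytes]) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Keep one shared copy per distinct file content.

    Returns (blobs, index): blobs maps a content digest to the single stored
    copy, and index maps each file name to the digest of its content.
    """
    blobs: Dict[str, bytes] = {}
    index: Dict[str, str] = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        blobs.setdefault(digest, content)  # identical files share one copy
        index[name] = digest
    return blobs, index


if __name__ == "__main__":
    report = b"quarterly report " * 1000
    files = {
        "alice/report.doc": report,
        "bob/report.doc": report,                 # exact copy: shares storage
        "carol/report.doc": report + b"(draft)",  # nearly identical: stored in full
    }
    blobs, index = single_instance_store(files)
    print(len(files), "files,", len(blobs), "stored copies")
```

The nearly identical third file gets no benefit here, because a whole-file comparison only matches exact copies. That limitation is what finer-grained deduplication addresses.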
Data Deduplication vs Data Compression
Data deduplication algorithms differ from data compression algorithms such as LZ77 and LZ78. Compression algorithms identify redundant data inside individual files and encode it more efficiently, whereas deduplication inspects large volumes of data, identifies large sections that are identical (entire files or large parts of files), and replaces them with a shared copy.
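The difference in scope can be shown with a small experiment. The sketch below is an illustration under assumptions: it uses Python's `zlib` module (DEFLATE, an LZ77-based compressor with a 32 KiB search window) as the stand-in for compression, and a simple fixed-size, hash-based chunk comparison as the stand-in for deduplication. The sample data is two identical 64 KiB blocks of random bytes, so the repetition lies farther apart than the compressor's window can reach. All function names are hypothetical.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 64 * 1024  # size of the repeated block (larger than zlib's 32 KiB window)
CHUNK_SIZE = 4096       # assumed fixed chunk size for the deduplication side


def make_sample(seed: int = 0) -> bytes:
    """Two identical copies of one block of pseudo-random bytes."""
    rng = random.Random(seed)
    block = rng.getrandbits(8 * BLOCK_SIZE).to_bytes(BLOCK_SIZE, "big")
    return block + block


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(data, 9))


def deduplicated_size(data: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Total bytes of distinct chunks, i.e. what a chunk store would need to keep."""
    unique = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return sum(len(chunk) for chunk in unique.values())


if __name__ == "__main__":
    data = make_sample()
    print("original:    ", len(data))
    print("compressed:  ", compressed_size(data))
    print("deduplicated:", deduplicated_size(data))
```

Under these assumptions the compressor barely shrinks the data, since random bytes have no local redundancy and the second copy is out of its reach, while the chunk comparison stores only one of the two blocks. The two techniques are complementary, which is why deduplicated chunks are often compressed as well.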
Types of Data Deduplication
There are several kinds of data deduplication.
- Post-process deduplication
- In-line deduplication
- Source deduplication
- Target deduplication