The very old and good rsync can be key to WAN deduplication.
WAN Deduplication purpose is to reduce the amount of traffic on the WAN links by removing duplicate date. It is also called as DRE (Data Redundancy Elimination).
'rsync' is utility provided in Unix variants for a long time. It is mainly used to mirror the data across multiple servers and also is used to for backing up the data. It recursively goes through all files and directories and updates the data in the backup or mirror server. One feature that is interesting to WAN deduplication is its ability to do 'delta encoding'. 'rsync' has feature to send only 'differences' to the destination machine which inturn creates a new file from the existing file and the 'delta' information it gets from origin server. 'rsync' also can compress the data using 'zlib' and thereby saving even more WAN bandwidth.
To understand the algorithms used by 'rsync' for delta encoding, check this technical report. it uses mechanism called 'rolling checksum' (Alder32 algorithm) to figure out the differences between the file the mirror machine has and the file the origin sever has. Why is this 'rolling checksum' required? Note that when file gets updated by anybody in the origin server, the changes could be anywhere in the file. It can be in the beginning of the file, middle of the file or at the end of the file. 'Delta' generation should work and not duplicate any common data irrespective of placement of changes the original file has undergone. Mirror server breaks down the file it has into multiple chunks of some size (typically S = 1K), calculates both rolling checksum and MD5 checksum on the chunks. Then it sends them to the origin server. Origin Server does rolling checksum for chunks of size S. Since the file might have undergone changes, origin server creates rolling checksum for each byte offset. Note that mirror server generates rolling checksum for non-overlapping chunks. Since origin server generates large number checksums, rolling checksum algorithm needs to be very fast. Alder32 algorithm has property of generating checksum without going through the all bytes of the chunk and hence it is very fast. It can generate checksum incrementally. Then the received checksums are compared with the checksums generated locally by origin server. If any checksum matches, then it verifies by comparing with MD5 checksum. If MD5 checksum also matches, then origin server assumes that mirror server has this chunk. Once it finds out the all duplicate data, it only sends the matching block information and any new or modified data for non-matching chunks. For detailed information on this algorithm, please check this link.
Other utilities which are useful for WAN deduplication : 'rdiff'. 'rdiff' uses the rsync algorithm to generate delta file with the difference from old file and new file. Then this can be applied to old file at other machine to get the new file.