Monday, December 28, 2009

librsync usage

For background on the rsync algorithm, see my earlier blog post on rsync. librsync is a library that can be linked with any application. It has all the features required for de-duplication. Any de-duplication scheme requires that the new file content be generated from the old content plus delta information. This requires the following features:

  • Generation of signatures by the entity that is expected to receive the new file. Signatures are simply the weak checksum (a rolling checksum in the style of Adler-32) and the strong checksum of every non-overlapping block of the file it already has. Note that librsync uses MD4 as its strong checksum; MD5 is what newer versions of rsync itself use.
  • The entity that sends the new file must be able to take the signature file and the new file it has, and generate a delta file. The delta file consists of a set of instructions of two kinds: COPY a block from the old file (the delta file only holds a reference to the block), or use LITERAL content carried in the delta file itself for data that the old file does not have.
  • The ability to merge the delta file with the old content file to generate the new file. This is typically done by the entity that receives the delta file.
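The three features above can be sketched in a few lines of Python. This is only an illustration of the idea, not librsync's implementation: the function names, the tiny block size, and the tuple-based delta format are assumptions made for the example, `zlib.adler32` is recomputed at every offset rather than rolled incrementally, and MD5 stands in for the strong checksum.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size so the example is easy to trace (assumption)


def make_signature(old: bytes, block: int = BLOCK):
    """Receiver side: weak and strong checksum of each non-overlapping block."""
    table = {}
    for index, start in enumerate(range(0, len(old), block)):
        chunk = old[start:start + block]
        weak = zlib.adler32(chunk)
        strong = hashlib.md5(chunk).digest()
        # Several blocks may share a weak checksum, so keep a list per key.
        table.setdefault(weak, []).append((strong, index))
    return table


def make_delta(signature, new: bytes, block: int = BLOCK):
    """Sender side: emit ("COPY", block_index) or ("LITERAL", bytes)."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + block]
        match = None
        # The cheap weak checksum filters candidates; the strong one confirms.
        for strong, index in signature.get(zlib.adler32(window), []):
            if strong == hashlib.md5(window).digest():
                match = index
                break
        if match is None:
            literal.append(new[pos])  # no match here: slide forward one byte
            pos += 1
        else:
            if literal:
                delta.append(("LITERAL", bytes(literal)))
                literal = bytearray()
            delta.append(("COPY", match))
            pos += len(window)
    if literal:
        delta.append(("LITERAL", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta, block: int = BLOCK) -> bytes:
    """Receiver side: rebuild the new content from old content plus delta."""
    out = bytearray()
    for op, arg in delta:
        if op == "COPY":
            out += old[arg * block:(arg + 1) * block]
        else:
            out += arg
    return bytes(out)


old = b"the quick brown fox"
new = b"the quick red fox"
delta = make_delta(make_signature(old), new)
assert apply_delta(old, delta) == new
```

Only the bytes that differ travel as literal data; everything else is a short reference into the old file, which is where the bandwidth saving comes from.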
The librsync library provides all of these capabilities. If you look at the whole.c file in the librsync source, you will find the following functions:
  • rs_sig_file: Generates a signature file. It is called by the entity that has the old content file.
  • rs_loadsig_file: Loads the contents of a signature file into an in-memory hash list. It is called by the entity that has the new content file.
  • rs_delta_file: Creates a delta file from the previously loaded signatures and the new content file. It is called by the entity that has the new content file.
  • rs_patch_file: Generates the new content file from the delta file and the existing content file from which the signature file was created. It is called by the receiver of the content.
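To make the "signature file" and "in-memory hash list" steps more concrete, here is a rough Python sketch of writing a signature file and loading it back into a lookup table. The on-disk layout (a 4-byte block length followed by fixed-size records), the function names, and the use of `zlib.adler32` and MD5 are all assumptions chosen for illustration; they do not reproduce librsync's real file format or API.

```python
import hashlib
import struct
import zlib

HEADER = struct.Struct(">I")     # hypothetical header: block length
RECORD = struct.Struct(">I16s")  # hypothetical record: weak sum, 16-byte strong sum


def write_signature_file(old_path, sig_path, block_len=2048):
    """Run by the side holding the old file: one record per block."""
    with open(old_path, "rb") as old, open(sig_path, "wb") as sig:
        sig.write(HEADER.pack(block_len))
        while True:
            chunk = old.read(block_len)
            if not chunk:
                break
            weak = zlib.adler32(chunk)
            strong = hashlib.md5(chunk).digest()
            sig.write(RECORD.pack(weak, strong))


def load_signature_file(sig_path):
    """Run by the side holding the new file: build weak -> [(strong, index)]."""
    table = {}
    with open(sig_path, "rb") as sig:
        (block_len,) = HEADER.unpack(sig.read(HEADER.size))
        index = 0
        while True:
            raw = sig.read(RECORD.size)
            if len(raw) < RECORD.size:
                break
            weak, strong = RECORD.unpack(raw)
            table.setdefault(weak, []).append((strong, index))
            index += 1
    return block_len, table
```

Keying the table by the weak checksum means the delta step can reject most offsets with a single dictionary lookup and only compute the expensive strong checksum when the weak one matches.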
The librsync library keeps all of its state information local to a control block called a 'job'. Because of this, librsync can run multiple operations at the same time: each operation has its own job and therefore has no impact on the others. This makes it suitable for proxy applications in which a single user process handles many connections simultaneously.
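The value of per-operation state is easier to see with a small sketch. The `SignatureJob` class below is hypothetical and only loosely modelled on the 'job' idea; it is not librsync's job API. Each instance buffers its own partial input, so a single process can feed chunks from several connections in any interleaving.

```python
import hashlib
import zlib


class SignatureJob:
    """Hypothetical per-operation state holder (not librsync's API)."""

    def __init__(self, block_len):
        self.block_len = block_len
        self.pending = bytearray()  # input that does not yet fill a block
        self.signatures = []        # (weak, strong) pairs emitted so far

    def feed(self, data: bytes):
        """Accept the next chunk of input, however it happens to be split."""
        self.pending += data
        while len(self.pending) >= self.block_len:
            self._emit(self.pending[:self.block_len])
            del self.pending[:self.block_len]

    def finish(self):
        """Flush the final short block, if any, and return all signatures."""
        if self.pending:
            self._emit(self.pending)
            self.pending.clear()
        return self.signatures

    def _emit(self, block):
        block = bytes(block)
        weak = zlib.adler32(block)
        strong = hashlib.md5(block).digest()
        self.signatures.append((weak, strong))


# Two independent operations interleaved in one process.
job_a, job_b = SignatureJob(4), SignatureJob(4)
for chunk_a, chunk_b in zip([b"abc", b"defg", b"h"], [b"12", b"345", b"6789"]):
    job_a.feed(chunk_a)
    job_b.feed(chunk_b)
sigs_a, sigs_b = job_a.finish(), job_b.finish()
```

Because nothing is stored in globals, the result for each job depends only on the bytes fed to that job, regardless of how the chunks arrive or how the two streams are interleaved.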

1 comment:

ag said...

Thanks a lot for such a deep yet straightforward explanation. There are few good guides for librsync on the internet. Yours is one of them.
:)