Client-Side Deduplication (new backup format)

Deduplication is a technique that stores each unique piece of data only once and reuses it wherever the same data appears again.

This functionality is not supported for the legacy backup format.

The new backup format uses client-side deduplication. This approach brings the following benefits:

  • Client-side deduplication is much faster than server-side deduplication.
  • Deduplication is not affected by internet connection issues.
  • Internet traffic is reduced.
  • A server-side deduplication database grows constantly, which can significantly increase costs. Client-side deduplication uses only local resources.

How It Works

Regardless of the backup type, the first backup is always a full backup. Once backups run on a regular basis, each run mostly captures data updates, so subsequent backup jobs are usually incremental and depend on the full backup and on the previous incremental backups.

The new backup format is designed so that backup plans are fully independent of each other: each backup plan has its own deduplication database. Moreover, each backup plan generation also has its own deduplication database.
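The separation described above can be illustrated with a minimal Python sketch. It assumes a deduplication database can be modeled as a dictionary keyed by a block hash; the names `databases` and `is_new_block` are hypothetical and are not part of the product.

```python
from collections import defaultdict

# Hypothetical model: one deduplication database per (plan, generation) pair.
# Each database maps a block hash to a block ID.
databases = defaultdict(dict)


def is_new_block(plan_name, generation, block_hash):
    """Return True if the block is unknown to this plan generation's database."""
    db = databases[(plan_name, generation)]
    if block_hash in db:
        return False
    db[block_hash] = len(db)  # assign the next block ID
    return True


first = is_new_block("plan-A", 1, "h1")       # True: first time in plan-A, generation 1
repeat = is_new_block("plan-A", 1, "h1")      # False: already recorded
other_plan = is_new_block("plan-B", 1, "h1")  # True: plan-B has its own database
other_gen = is_new_block("plan-A", 2, "h1")   # True: generation 2 has its own database
```

In this model, a block already recorded for one plan or generation is still treated as new by another, which is the price of keeping plans independent.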

When a backup plan runs, the application reads the backup data in batches that are a multiple of the block size. Each block that is read is compared with the records in the deduplication database. If the block is not found, it is uploaded to storage and assigned a block ID, which becomes a new deduplication database record. Scanning then continues; if a block matches an existing deduplication database record, that block is skipped and is not uploaded again.

This approach significantly decreases the backup size, especially in virtual environments with a large number of identical blocks.
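The block-scanning procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's actual implementation: blocks are compared by SHA-256 hash, the deduplication database and the storage are plain dictionaries, and the block size is tiny so the example stays readable. All names (`backup_stream`, `dedup_db`, `storage`, `BLOCK_SIZE`) are hypothetical.

```python
import hashlib
import io

BLOCK_SIZE = 4  # hypothetical block size in bytes, tiny for demonstration


def backup_stream(stream, dedup_db, storage, block_size=BLOCK_SIZE):
    """Read the stream block by block and store only blocks not seen before.

    dedup_db: dict mapping block hash -> block ID (stand-in for the database)
    storage:  dict mapping block ID -> block bytes (stand-in for backup storage)
    Returns the ordered list of block IDs that make up the stream.
    """
    manifest = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest = hashlib.sha256(block).hexdigest()  # assumption: hash comparison
        block_id = dedup_db.get(digest)
        if block_id is None:
            # Unknown block: store it and record it in the database.
            block_id = len(storage)
            storage[block_id] = block
            dedup_db[digest] = block_id
        # Known block: nothing is stored, only the existing ID is referenced.
        manifest.append(block_id)
    return manifest


dedup_db, storage = {}, {}
manifest = backup_stream(io.BytesIO(b"AAAABBBBAAAACCCC"), dedup_db, storage)
print(manifest)      # [0, 1, 0, 2]
print(len(storage))  # 3
```

The source data contains four blocks, but only three are stored because the block `AAAA` appears twice.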

If a deduplication database is deleted or corrupted, a full backup is always forced.

Deduplication does not work for some types of files. Archives, some media files, and database files that are treated as changed are not deduplicated.
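The fallback to a full backup can be sketched as a simple check. The sketch assumes, purely for illustration, that the deduplication database is a JSON file on the local disk; the real on-disk format is not described here, and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_dedup_db(path):
    """Return the database as a dict, or None if it is missing or unreadable."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):  # missing file, or content that is not valid JSON
        return None
    return data if isinstance(data, dict) else None


def choose_backup_kind(db_path):
    """Force a full backup whenever the database cannot be loaded."""
    return "incremental" if load_dedup_db(db_path) is not None else "full"


with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "dedup.json"  # hypothetical file name
    missing = choose_backup_kind(db_path)    # the file does not exist yet
    db_path.write_text('{"abc": 0}')
    healthy = choose_backup_kind(db_path)    # valid database
    db_path.write_text('{"abc": ')
    corrupted = choose_backup_kind(db_path)  # truncated, unreadable database

print(missing, healthy, corrupted)  # full incremental full
```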

For the image-based backup type, the approach is slightly different. Instead of reading every cluster, the application reads the Master File Table (MFT) and then checks which files have been modified. This significantly reduces the amount of source data that has to be read.
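The idea of selecting modified files from file metadata can be sketched as below. This is a simplified model only: it does not parse a real MFT, and it assumes that a change is detected by comparing last-modified timestamps between two snapshots of the file table. The function name and the sample paths are hypothetical.

```python
def modified_files(previous, current):
    """Return paths that are new or whose modification time has changed.

    previous, current: dicts mapping file path -> last-modified timestamp,
    standing in for two snapshots of the file table.
    """
    return sorted(
        path for path, mtime in current.items()
        if previous.get(path) != mtime
    )


previous = {"C:/a.txt": 100, "C:/b.txt": 200}
current = {"C:/a.txt": 100, "C:/b.txt": 250, "C:/c.txt": 300}
changed = modified_files(previous, current)
print(changed)  # ['C:/b.txt', 'C:/c.txt']
```

Only the files in `changed` would then need to be read, while unchanged files such as `C:/a.txt` are skipped without touching their data.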
