Paper
Compression for data archiving and backup revisited
3 September 2009
Abstract
Data deduplication is a simple, dictionary-based compression method that has become very popular for storage archiving and backup. It offers direct, random access to any piece ("chunk") of a file in a single table lookup, which is not the case with differential file compression, the other common storage archival method. The compression efficiency (chunk matching) of deduplication improves as chunk sizes shrink; however, the sequence of hashes that replaces the deduplicated object (file) then grows significantly. Within the sequence of chunks into which an object is decomposed, sub-sequences of adjacent chunks tend to repeat. We exploit this insight first in an online scheme that reduces the amount of hash metadata generated. With each newly created entry in the chunk repository we add a "chronological" pointer linking that entry to the next new entry, in time order. When the hashes produced by the chunker follow the chronological pointers, we encode them as a "sequence of hashes" by specifying only the first hash in the sequence and the length of the sequence. The resulting metadata is orders of magnitude smaller than what a customary compression algorithm (gzip) achieves, and the reduction has a significant impact on overall deduplication efficiency when relatively small chunks are used. A second scheme optimizes the chunk sizes by joining repeated sub-sequences of small chunks into new "super chunks", under the constraint of preserving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence. As a result, fewer chunks are used to represent a file, which reduces fragmentation (i.e., the number of disk accesses needed to reconstruct the file) and requires fewer entries in the chunk dictionary and fewer hashes to encode a file. As the experimental results show, this method provides more than a tenfold reduction in fragmentation and more than a fivefold reduction in the number of repository entries, while achieving similar or slightly better overall deduplication ratios.
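To make the first scheme concrete, the following is a minimal sketch in Python of a chunk repository whose entries are linked in creation ("chronological") order, together with an encoder that collapses runs of hashes that follow those pointers into (first hash, run length) pairs. The names (ChunkRepo, encode_hash_stream) and the use of SHA-1 are illustrative assumptions, not taken from the paper.

```python
import hashlib

class ChunkRepo:
    """Chunk dictionary whose entries are linked in creation ("chronological") order."""
    def __init__(self):
        self.next_new = {}      # hash -> hash of the entry created immediately after it
        self.known = set()      # all chunk hashes seen so far
        self.last_new = None    # most recently created entry

    def add(self, chunk: bytes) -> str:
        h = hashlib.sha1(chunk).hexdigest()
        if h not in self.known:
            self.known.add(h)
            if self.last_new is not None:
                self.next_new[self.last_new] = h   # chronological pointer
            self.last_new = h
        return h

def encode_hash_stream(repo: ChunkRepo, hashes: list[str]) -> list[tuple[str, int]]:
    """Collapse runs of hashes that follow the chronological pointers into
    (first hash, run length) pairs."""
    encoded = []
    i = 0
    while i < len(hashes):
        start, length = hashes[i], 1
        while (i + length < len(hashes)
               and repo.next_new.get(hashes[i + length - 1]) == hashes[i + length]):
            length += 1
        encoded.append((start, length))
        i += length
    return encoded
```

The second scheme looks for repeated sub-sequences of adjacent chunks and merges them into "super chunks". A naive suffix-array / LCP scan over the sequence of chunk identifiers, sketched below under the same assumptions, illustrates the idea; the paper does not prescribe this exact construction, and a production version would use an O(n log n) suffix-array algorithm rather than sorting full suffixes.

```python
def repeated_subsequences(chunk_ids: list[int], min_len: int = 2) -> set[tuple[int, ...]]:
    """Return repeated runs of adjacent chunks (length >= min_len), candidates
    to be merged into super chunks. Illustrative sketch only."""
    n = len(chunk_ids)
    sa = sorted(range(n), key=lambda i: chunk_ids[i:])   # naive suffix array
    candidates = set()
    for a, b in zip(sa, sa[1:]):                          # compare adjacent suffixes
        lcp = 0
        while a + lcp < n and b + lcp < n and chunk_ids[a + lcp] == chunk_ids[b + lcp]:
            lcp += 1
        if lcp >= min_len:
            candidates.add(tuple(chunk_ids[a:a + lcp]))
    return candidates
```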
© (2009) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Corneliu Constantinescu "Compression for data archiving and backup revisited", Proc. SPIE 7444, Mathematics for Signal and Information Processing, 74440C (3 September 2009); https://doi.org/10.1117/12.825361
CITATIONS
Cited by 3 scholarly publications and 1 patent.
KEYWORDS: Sensor fusion, Associative arrays, Computer programming, Data storage, Data archive systems, Data communications, Data processing