
Category: Communication

Description: Computing checksums to guard against bit rot is accepted wisdom in the digital preservation community. Yet other domains approach this wisdom quite differently. New hashing algorithms continue to be developed in the cryptography community, typically with very different use cases in mind, prioritizing encryption and security over integrity or identification. Checksumming is also a key feature of modern filesystems, whose implementers concern themselves with block-level integrity rather than with ‘files’ or objects in the way digital preservation systems do. Cloud-based object storage systems likewise compute checksums, providing integrity guarantees as part of the service. And then there is the blockchain: distributed peer-to-peer systems in which hashing is fundamental. How do we reconcile these different approaches to bit-level preservation using checksums? Can we compare their costs, in compute resources or time? Is there a way to verify the accepted wisdom of the digital preservation community and reconcile it with the diverse and expanding approaches to checksum validation? This paper describes how checksumming functionality is understood and implemented in modern filesystems. A cost analysis is presented comparing different approaches to data integrity: pure CPU checksumming with tools such as md5sum, the block-level metadata used by filesystems such as ZFS, and the integrity checking performed by cloud providers’ object storage services. From this analysis we describe the benefits of developing a new standard for mapping the block-level metadata produced by filesystem checksum-reporting tools onto the file-centered checksum reporting and validation that current digital preservation best practices require. By better understanding these approaches to data integrity, it is possible to make better use of the computer hardware dedicated to digital preservation, taking advantage of the greater computational efficiency of filesystem-level checksumming. This work closes a gap between current best practices in digital preservation and in high-performance computing. Sample code paths for working with and validating block checksums are also demonstrated.
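The description above mentions sample code paths for checksum validation. As an illustrative sketch only, not the paper's own sample code, the following Python fragment contrasts the two models the cost analysis compares: streaming a whole file through a hash function, as md5sum does, versus delegating verification to the block checksums ZFS already maintains by triggering a pool scrub. The file path and the pool name ("tank") are hypothetical placeholders.

```python
import hashlib
import subprocess


def file_checksum(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """File-level fixity: read the entire file and hash it, as md5sum does.

    The full read-and-hash cost is paid on the CPU every time the
    checksum is (re)verified.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def zfs_scrub(pool: str) -> None:
    """Block-level fixity: ask ZFS to verify every block checksum in the pool.

    `zpool scrub` returns immediately; the scrub itself runs
    asynchronously in the background.
    """
    subprocess.run(["zpool", "scrub", pool], check=True)


def zfs_scrub_status(pool: str) -> str:
    """Return the `zpool status` report, which includes scrub progress
    and any checksum errors found."""
    result = subprocess.run(
        ["zpool", "status", pool], check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file and pool names, for illustration only.
    print(file_checksum("fixity-test.bin"))
    zfs_scrub("tank")
    print(zfs_scrub_status("tank"))
```

The contrast the paper's cost analysis draws is visible here: the file-level loop recomputes a digest per file on demand, while the scrub verifies the whole pool in one sequential pass against checksums ZFS stores alongside each block. The mapping problem the paper raises is that the scrub reports results per block and per pool, not per file, which is the granularity digital preservation audits expect.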

License: CC-By Attribution 4.0 International


Components

312. Storage Organization and Integrity

The two papers in Session 312 explore the theme of Storage Organization and Integrity with recent examples of adva...
