Ad Hoc: Lightning Talks

212.12 LT24 Harvesting the web menagerie: Wrangling Federal Files for Analysis of Legacy Files





Category: Communication

Description: The Library of Congress web archiving (LCWA) program has been selecting and harvesting websites since 2000. From the current corpus, we are building datasets of various media types that could interest researchers and others who work with tools for file characterization and validation, among other digital preservation tasks. We have restricted the files to those harvested from .gov domains, which offers a unique opportunity to make these files available as government works. The long timeframe of collecting ensures that the files represent a wide range of versions and uses of many types of content. We are exploring the development and distribution of these file sets as one aspect of our research work, in response to the persistent need across digital content management fields for test sets of real-world files to use in benchmarking and testing digital content management tools and services. Content from the LCWA can supply such real-world test sets for use internally and externally to advance digital content management research and practice. In this lightning talk, we will present a brief overview of an initial sample of 1,000 PDFs. This analysis was undertaken with basic tools and offers a first view into the possibilities for identifying, extracting, and using this web archive data in applied research.

License: CC-By Attribution 4.0 International

