Ad Hoc: Lightning Talks
212.12 LT24 Harvesting the web menagerie: Wrangling Federal Files for Analysis of Legacy Files
Category: Communication
Description: The Library of Congress web archiving (LCWA) program has been selecting and harvesting websites since 2000. From the current corpus, we are building datasets of various media types that may interest researchers and others who work with tools for file characterization, validation, and other digital preservation tasks. We have restricted the files to those harvested from .gov domains, which offers a unique opportunity to make them available as government works. The long collecting timeframe means the files represent a wide range of versions and uses of many types of content. We are exploring the development and distribution of these file sets as one aspect of our research, in response to a persistent need across digital content management fields for test sets of real-world files for benchmarking and testing digital content management tools and services. Content from the LCWA can supply such test sets for use internally and externally to advance digital content management research and practice. In this lightning talk, we will present a brief overview of an initial sample of 1,000 PDFs. The analysis was undertaken with basic tools and offers a first view into the possibilities for identifying, extracting, and using this web archive data in applied research.