212.12 LT24 Harvesting the web menagerie: Wrangling Federal Files for Analysis of Legacy Files

Jesse A. Johnston

Ad Hoc: Lightning Talks

Contributors: Jesse A. Johnston

Category: Communication

Description: The Library of Congress web archiving (LCWA) program has been selecting and harvesting websites since 2000. From the current corpus, we are working to build datasets of various media types that could be of interest to researchers and others working with tools for file characterization and validation, among other digital preservation tasks. We have restricted the files to those harvested from .gov domains, which offers a unique opportunity to make these files available as government works. The long timeframe of collecting ensures that the files represent a wide range of versions and uses of many types of content. We are exploring the development and distribution of these file sets as one aspect of our research work, in response to the persistent need across digital content management fields for test sets of real-world files to use in benchmarking and testing digital content management tools and services. Content from the LCWA makes it possible to produce such test sets for use internally and externally to advance digital content management research and practice. In this lightning talk, we will present a brief overview of an initial sample of 1,000 PDFs. This analysis was undertaken with basic tools and offers a first view into the possibilities for identifying, extracting, and using this web archive data in applied research.

License: CC-By Attribution 4.0 International
