402.2 PDF Mayhem: Is Broken Really Broken?

Heikki Helin; Kimmo Koivunen; Johan Kylander; Juha Lehtonen

doi:10.17605/OSF.IO/FZXC9

Category: Communication

Description: In this paper, we focus on the quality of PDF files. We are interested in the errors that validators report during the validation process: how accurate are these errors, and can we build easy workarounds to avoid or even fix the underlying problems? We present our findings from a pilot experiment in which we validated more than 200,000 PDF files from well-known corpora with different validators and found several thousand problematic files. We then devised a process for reconstructing the invalid files and analyzing the converted data. Our results show that there are potentially workable methods for avoiding problems during PDF validation, and that these methods can significantly reduce the workload of the preservation specialists who are responsible for the quality of the data. Our further aim is to master and manage PDF validation so that we can build an automated workflow that is able to migrate most PDF files to PDF/A during ingest into a digital preservation repository. To achieve this in a reliable manner, further studies are needed to build on what we have presented here.

License: CC-By Attribution 4.0 International

Components

402. Formats

Gordon

The two papers in Session 402 explore the issues and topics pertaining to the theme of Formats, with examples of good practice and an associated discu...
