# Project Goals

1. Digitize everything.
2. Rehouse materials in archival boxes.
3. Make content available via the Stanford Digital Repository (SDR).
4. Convert select data from PDF to an actionable format (tabular text).

These concise goals involve a deceptive amount of human effort. This will be a long, iterative process.

# Anticipated Challenges

1. **Volume and heterogeneity of materials**: forty-four 3-ring binders; 4-D data. Digitization alone will be a huge effort, but metadata creation and data curation will be an even bigger job.
2. **Hand-written data & metadata**: an overwhelming amount. It can't be OCR'd (or can it?), so significant effort is required to transcribe and QC the content. Is crowd-sourcing an option? The good news: some of the data have been typed and presented in reports, which may make automated extraction of data from the PDFs possible.
3. **Metadata**: How much is needed? Which format(s): XML, EML, readmes? How granular? We need good models to look at; the existing CalCOFI time-series may be a start. Expert curation is necessary for quality metadata creation; this is not a job that can be outsourced.
4. **Conversion from PDF to tabular format** so the data can be reused easily; this is what we are most interested in. We can't do it all by hand (well, we can, but...). How much can be automated? Which tools are best suited for tabular data extraction? Advice welcome!

# Making a Plan

- **Inventory**: Take a detailed inventory. What do we have? Data, reports, correspondence. How much do we have? What kinds do we have?
- **Organize**: How should we group items? By cruise, station, variable, year? We need to standardize dates, stations, variables, cruise names...
- **Appraise**: Are there duplicates? Is anything missing? Prioritize: what is most valuable or in the worst shape?
- **Create metadata**: Create descriptive & administrative metadata to guide the digitization process: sub-collection titles, file names, etc.
- **Digitization**: Stanford University Libraries has a well-equipped lab for systematic digitization & deposit into the SDR.
- **Create metadata**: Data need readme files and item- & data-level metadata to facilitate understanding & reuse.
- **Make actionable**: Conversion from PDF to actionable tabular data is critical for enabling reuse of the data. How do we make it happen at scale?

# Tools for extraction of data from PDFs

1. [Tabula][2]
2. A [review of tools][3] by the Open Knowledge Foundation

See [this nice poster][4] on legacy data curation: Clark, Lynn. 'Unlocking GATE: Gaining Access to Analog Data in a Digital World'. 2013.

This content was lifted directly from my Sept. 2016 [poster][5] at the RDA Rescue of Data at Risk meeting.

[1]: https://searchworks.stanford.edu/view/2229625
[2]: http://tabula.technology/
[3]: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
[4]: https://opensky.ucar.edu/islandora/object/dcerc:20/
[5]: https://osf.io/dp9pe/
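Even after a tool like Tabula exports a table to CSV, a cleanup pass is needed before the data are truly actionable: stray whitespace from PDF layout, inconsistent cruise names, two-digit years, and illegible values all have to be handled. A minimal sketch of such a pass, using only the Python standard library; the column names (`cruise`, `date`, `station`, `temp_c`) and formats are hypothetical stand-ins, not this project's actual schema:

```python
import csv
import io
import re
from datetime import datetime

def clean_rows(csv_text):
    """Normalize rows exported by a PDF table extractor (hypothetical schema).

    - collapses stray whitespace left over from PDF layout
    - rewrites dates like '7/14/69' to ISO 8601, assuming the 1900s
    - coerces numeric fields, leaving '' where the scan was unreadable
    """
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cruise = re.sub(r"\s+", " ", row["cruise"]).strip().upper()
        date = datetime.strptime(row["date"].strip(), "%m/%d/%y")
        if date.year > 1999:   # '%y' parses '54' as 2054; shift back a century
            date = date.replace(year=date.year - 100)
        try:
            temp = float(row["temp_c"])
        except ValueError:
            temp = ""          # illegible or missing in the original scan
        cleaned.append({
            "cruise": cruise,
            "date": date.date().isoformat(),
            "station": row["station"].strip(),
            "temp_c": temp,
        })
    return cleaned

sample = "cruise,date,station,temp_c\n cc  6907,7/14/69, 93.3 ,16.42\n"
print(clean_rows(sample))
```

The point of routing every row through one function like this is that each normalization decision (century pivot, missing-value marker, name casing) is recorded in code rather than applied ad hoc by hand, which matters when the same rules must hold across decades of cruises.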