An effort to get machine-readable data management plans, since they're useful in #datalib research (and useful for many others, I'm sure!).

DMPTool

Files:

  • This JSON file contains all public data management plans, obtained using DMPTool's API; specifically, this call.
  • The folder labeled "2017-03-10_DMPTool-DMPs-Text" contains all the public data management plans in plain text (.txt) format.

Steps for obtaining machine-readable PDFs:

  1. Use this script to download all public data management plans to your machine.
  2. OCR the directory using this script, which comes from this repository on GitHub: Image-To-Text. The script outputs two folders: jpg and text.
  3. Navigate to the text folder. You will see that each page in each PDF has its own text file. To combine all the OCR docs (.txt) that share the same prefix, navigate to the "text" folder on the command line and execute the following command (tested only on Ubuntu Linux):
    • mkdir -p out; for i in *.txt; do echo "$i" | sed 's/^\([0-9]\+\)-[0-9]\+\.txt$/\1/'; done | sort -u | while read -r i; do cat "$i"-*.txt > "out/$i.txt"; done
  4. To find keywords within these text files, use the command below (tested only on Ubuntu Linux):
    • grep -rnw '/PATH/TO/OCR/TEXT/' -e "container"
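
Steps 3 and 4 can also be sketched as a small script. This is only a rough sketch, not the method used to build the files above: it assumes the OCR output is named <plan>-<page>.txt (as the sed pattern in step 3 implies), and the function names merge_pages and list_matches are made up for illustration. Unlike a plain shell glob, which would put page 10 before page 2 when page numbers are not zero-padded, it sorts the pages numerically before concatenating them.

```shell
#!/usr/bin/env bash
# Sketch only: merge_pages and list_matches are hypothetical helper names.
# Assumes OCR output files are named <plan>-<page>.txt, e.g. 123-2.txt.

# Merge the per-page files in text_dir into one file per plan in out_dir.
merge_pages() {
  local text_dir="$1" out_dir="$2"
  local f prefix page
  mkdir -p "$out_dir"
  # Collect the unique numeric prefixes (one per plan).
  for f in "$text_dir"/*.txt; do
    [ -e "$f" ] || continue
    basename "$f" | sed -n 's/^\([0-9][0-9]*\)-[0-9][0-9]*\.txt$/\1/p'
  done | sort -u | while read -r prefix; do
    # Sort on the page number (the field after "-") numerically,
    # so that page 10 comes after page 9 rather than after page 1.
    ls "$text_dir" | grep -E "^${prefix}-[0-9]+\.txt$" | sort -t- -k2,2n |
      while read -r page; do cat "$text_dir/$page"; done > "$out_dir/$prefix.txt"
  done
}

# List the files under dir that contain keyword as a whole word.
# -r recurse, -i ignore case, -l print file names only, -w whole-word match.
list_matches() {
  local keyword="$1" dir="$2"
  grep -rilw -e "$keyword" "$dir" | sort
}
```

For example, after sourcing the script, merge_pages text merged writes one file per plan into a "merged" folder, and list_matches container merged prints the names of the merged plans that mention "container".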

DMPOnline
