Creating a Harvester

<h1>Creating a SHARE Harvester with scrAPI</h1>
<p>There are two ways to create a harvester for the SHARE project.</p>
<p>If you're creating a harvester for an OAI-PMH feed, see the section on creating an OAI-PMH harvester using classes in the scrapi library.</p>
<p>If you're creating a harvester for a custom data format, see the section on creating a custom harvester.</p>
<p>For more information, see the wiki section on <a href="https://osf.io/wur56/wiki/scrAPI/" rel="nofollow">scrAPI</a>.</p>
<hr>
<h2>Making a class-based harvester</h2>
<p>If your provider exposes an OAI-PMH service, you can build the harvester from scrapi classes that harvest the data and send normalized data through the SHARE pipeline.</p>
<p>You can automate this process with the <a href="https://github.com/erinspace/autooai" rel="nofollow">auto oai tool</a>.</p>
<p>Your harvester will live in the <a href="https://github.com/CenterForOpenScience/scrapi/tree/develop/scrapi/harvesters" rel="nofollow">scrapi harvesters directory</a> along with the other harvesters.</p>
<p>A class-based harvester calls the specified OAI-PMH service using the ListRecords verb and the oai_dc metadata prefix, with a date range of one day in the past.</p>
<p>You can find the base class definition for the OAI-PMH class in the scrapi code, <a href="https://github.com/CenterForOpenScience/scrapi/blob/develop/scrapi/base/__init__.py" rel="nofollow">available here</a>. 
</p>
<p>To create a class-based harvester, follow these steps:</p>
<ol>
<li> <p>Fork the <a href="https://github.com/CenterForOpenScience/scrapi/" rel="nofollow">scrapi</a> repo, and create your own harvester in a folder with the same name under the <a href="https://github.com/fabianvf/scrapi/tree/develop/scrapi/harvesters" rel="nofollow">scrapi/harvesters directory</a>.</p>
<ul> <li>See the <a href="https://help.github.com/articles/fork-a-repo" rel="nofollow">GitHub help page on forking</a> for detailed instructions.</li> </ul> </li>
<li> <p>Follow the setup instructions in the <a href="https://github.com/CenterForOpenScience/scrapi/" rel="nofollow">scrapi</a> repo README.</p> </li>
</ol>
<p>Set your local settings to the baseline by copying the distributed settings file: <code>cp scrapi/settings/local-dist.py scrapi/settings/local.py</code></p>
<ol> <li>To see a harvester run, use <code>invoke harvester [harvester name here]</code></li> </ol>
<p>To run a harvester over a certain number of days back, run<br> <code>invoke harvester [harvester name here] --days=[number of days back]</code></p>
<p>See the <a href="https://github.com/CenterForOpenScience/SHARE/wiki/Provider-Names" rel="nofollow">list of provider names</a> or the <a href="https://github.com/CenterForOpenScience/scrapi/tree/develop/scrapi/harvesters" rel="nofollow">names of the current providers</a> in scrapi for examples of harvesters you can invoke.</p>
<ol> <li> <p>Within your new harvester folder, create a file named <code>yourharvester.py</code> where you will create an instance of the harvester class. 
</p>
<p>Your <code>yourharvester.py</code> will have 3 main parts:</p>
<ul>
<li>The imports section at the top, where you'll import the base OAI harvester class.</li>
<li>The schema transformer, which defines each main element and where in the source API that item can be found.</li>
<li>Your instance of the harvester class, with some key areas defined:
<ul>
<li>the name of your provider (as it will show up in the source field). <em>Note: This is the official name of your provider, and the name you will use to invoke it later when running! It has to be unique, and must not collide with any other provider already in the system.</em></li>
<li>the base url where you will make your OAI requests. It should include everything before the ? in the request url.</li>
<li>a property list of elements that don't fit into the set schema. See more in the "Property List" section below.</li>
<li>a list of "approved sets". If your provider has a certain set of items with a particular "setSpec" entry that should make their way into the notification service, list the approved "setSpec" items here. Only entries in the approved setSpec list will be normalized and sent to the notification service.</li>
<li>timeout: time in seconds to wait between subsequent requests to gather resources.</li>
<li>timezone_granularity: how much time detail to include in the OAI request. Setting timezone_granularity to True will add 'T00:00:00Z' to the date request.</li>
</ul>
</li>
</ul>
</li> </ol>
<p>Here's an example of what your <code>yourharvester.py</code> file might look like:</p>
<pre class="highlight"><code class="language-python">&quot;&quot;&quot;
A harvester for Calhoun: The NPS Institutional Archive for the SHARE project

An example API call: http://calhoun.nps.edu/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc
&quot;&quot;&quot;

from __future__ import unicode_literals

from scrapi.base import OAIHarvester


class CalhounHarvester(OAIHarvester):
    # Unique provider name, also used with `invoke harvester`
    short_name = 'calhoun'
    long_name = 'Calhoun: Institutional Archive of the Naval Postgraduate School'
    url = 'http://calhoun.nps.edu'
    verify = False

    # Everything before the '?' in the OAI request url
    base_url = 'http://calhoun.nps.edu/oai/request'

    # Elements that don't fit into the base schema
    property_list = [
        'type', 'source', 'format',
        'setSpec', 'date', 'rights'
    ]

    # Only records whose setSpec is listed here are normalized
    approved_sets = [
        'com_10945_7075',
        'com_10945_6',
        'col_10945_17'
    ]</code></pre>
<ol>
<li> <p>Add your provider's favicon to the <a href="https://github.com/CenterForOpenScience/scrapi/tree/develop/img/favicons" rel="nofollow">favicon folder</a>.</p> </li>
<li> <p>From the root directory, run <code>invoke provider_map</code></p> </li>
<li> <p>Test your harvester locally by running <code>invoke harvester harvester_short_name_here</code></p>
<p>The short name should be the (unique) name that you gave your harvester in the "short_name" variable when creating the harvester.</p> </li>
<li> <p>Create a pull request to add your new harvester to the <a href="https://github.com/CenterForOpenScience/scrapi/" rel="nofollow">scrapi</a> repo.</p> </li>
</ol>
<h3>Creating an OAI PMH Harvester Property List</h3>
<p>Property lists are created from the elements that don't match the <a 
href="https://osf.io/wur56/wiki/Schema/" rel="nofollow">base schema</a>. Also include elements that can occur several times in a record but for which the base schema keeps only one value. For example, many sources have more than one description field or identifier field. To make sure this metadata is still captured, include it in the property list so that all elements show up in the normalized data. This way, the first description or identifier is saved in the primary schema field, and the others are included in the otherProperties field.</p>
<p>Also make sure to include items in the header that might not fit into our standard schema, such as 'setSpec' for OAI harvesters.</p>
<p>Here's an example of a property list:</p>
<p><code>property_list = ['date', 'identifier', 'setSpec', 'description']</code></p>
<p>If you're creating an OAI-PMH harvester, the <a href="https://github.com/erinspace/autooai" rel="nofollow">auto oai tool</a> will automatically create a property list out of items that don't match the base schema.</p>
<h2>Making a Custom Harvester</h2>
<p>Many harvesters for the SHARE project are written for providers with an OAI-PMH endpoint, and can be written very quickly by creating an instance of an OAI harvester class. However, many other providers have a custom data output that requires a more custom implementation.</p>
<p>Here's how to create a custom harvester using tools provided within <a href="https://github.com/CenterForOpenScience/scrapi" rel="nofollow">scrapi</a>. For more information about scrapi, see the <a href="https://github.com/CenterForOpenScience/scrapi" rel="nofollow">GitHub repo</a>.</p>
<p>To create a harvester, first fork the <a href="https://github.com/CenterForOpenScience/scrapi" rel="nofollow">scrapi repo</a>. You'll add your new harvester in the <a href="https://github.com/CenterForOpenScience/scrapi/tree/develop/scrapi/harvesters" rel="nofollow">harvesters folder</a>. 
</p>
<p>Here's what a typical custom harvester looks like:</p>
<pre class="highlight"><code class="language-python">&quot;&quot;&quot;
A CrossRef harvester for the SHARE project

Example API request: http://api.crossref.org/v1/works?filter=from-pub-date:2015-02-02,until-pub-date:2015-02-02&amp;rows=1000
&quot;&quot;&quot;

from __future__ import unicode_literals

import json
import logging
from datetime import date, timedelta

from six.moves import xrange
from nameparser import HumanName

from scrapi import requests
from scrapi import settings
from scrapi.base import JSONHarvester
from scrapi.linter.document import RawDocument
from scrapi.base.helpers import build_properties, compose, datetime_formatter

logger = logging.getLogger(__name__)


def process_contributor(author, orcid):
    # Split the full name into its parts for the SHARE contributor format
    name = HumanName(author)
    ret = {
        'name': author,
        'givenName': name.first,
        'additionalName': name.middle,
        'familyName': name.last,
        'sameAs': [orcid] if orcid else []
    }
    return ret


def process_sponsorships(funder):
    sponsorships = []
    for element in funder:
        sponsorship = {}
        if element.get('name'):
            sponsorship['sponsor'] = {
                'sponsorName': element['name']
            }
        if element.get('award'):
            sponsorship['award'] = {
                'awardName': ', '.join(element['award'])
            }
        if element.get('DOI'):
            # A funder can carry a DOI without an award, so create the
            # 'award' entry on demand instead of assuming it exists
            sponsorship.setdefault('award', {})['awardIdentifier'] = 'http://dx.doi.org/{}'.format(element['DOI'])
        sponsorships.append(sponsorship)
    return sponsorships


class CrossRefHarvester(JSONHarvester):
    short_name = 'crossref'
    long_name = 'CrossRef'
    url = 'http://www.crossref.org'

    DEFAULT_ENCODING = 'UTF-8'

    record_encoding = None

    @property
    def schema(self):
        # Keys are SHARE schema fields; values say where to find the data
        # in a CrossRef record and how to format it
        return {
            'title': ('/title', lambda x: x[0] if x else ''),
            'description': ('/subtitle', lambda x: x[0] if (isinstance(x, list) and x) else x or ''),
            'providerUpdatedDateTime': ('/issued/date-parts', compose(datetime_formatter, lambda x: ' '.join([str(part) for part in x[0]]))),
            'uris': {
                'canonicalUri': '/URL'
            },
            'contributors': ('/author', compose(lambda x: [
                process_contributor(*[
                    '{} {}'.format(entry.get('given'), entry.get('family')),
                    entry.get('ORCID')
                ]) for entry in x
            ], lambda x: x or [])),
            'sponsorships': ('/funder', lambda x: process_sponsorships(x) if x else []),
            'otherProperties': build_properties(
                ('journalTitle', '/container-title'),
                ('volume', '/volume'),
                ('tags', ('/subject', '/container-title', lambda x, y: [tag.lower() for tag in (x or []) + (y or [])])),
                ('issue', '/issue'),
                ('publisher', '/publisher'),
                ('type', '/type'),
                ('ISSN', '/ISSN'),
                ('ISBN', '/ISBN'),
                ('member', '/member'),
                ('score', '/score'),
                ('issued', '/issued'),
                ('deposited', '/deposited'),
                ('indexed', '/indexed'),
                ('page', '/page'),
                ('issue', '/issue'),
                ('volume', '/volume'),
                ('referenceCount', '/reference-count'),
                ('updatePolicy', '/update-policy'),
                ('depositedTimestamp', '/deposited/timestamp')
            )
        }

    def harvest(self, start_date=None, end_date=None):
        start_date = start_date or date.today() - timedelta(settings.DAYS_BACK)
        end_date = end_date or date.today()

        # The doubled braces survive the first format() call and become
        # the rows/offset placeholders filled in per request below
        base_url = 'http://api.crossref.org/v1/works?filter=from-pub-date:{},until-pub-date:{}&amp;rows={{}}&amp;offset={{}}'.format(start_date.isoformat(), end_date.isoformat())
        total = requests.get(base_url.format('0', '0')).json()['message']['total-results']
        logger.info('{} documents to be harvested'.format(total))

        doc_list = []
        # Page through the results 1000 records at a time
        for i in xrange(0, total, 1000):
            records = requests.get(base_url.format(1000, i)).json()['message']['items']
            logger.info('Harvested {} documents'.format(i + len(records)))

            for record in records:
                doc_id = record['DOI']
                doc_list.append(RawDocument({
                    'doc': json.dumps(record),
                    'source': self.short_name,
                    'docID': doc_id,
                    'filetype': 'json'
                }))

        return doc_list</code></pre>
<p><strong>scrapi</strong><br> Scrapi has a few custom tools to help with making requests, mapping your provider's schema onto the SHARE schema, and linting the resulting documents to make sure that the normalized results match the SHARE schema.</p>
<p><strong>HarvesterClass(FormatHarvester)</strong><br> Your harvester class will inherit from a base harvester type, either a JSONHarvester or an XMLHarvester. The harvester specifies the schema used to transform the provider's records into the SHARE schema.</p>
<p><strong>HarvesterClass Methods</strong></p>
<p><strong>schema()</strong> Return a dictionary, where the outer keys are the elements of the SHARE schema, and the values are the equivalent entries in the target provider, together with any functions that should be run on the result of that lookup to format it properly. Use xpath statements for XML target schemas, and the keyword itself for JSON schemas.</p>
<p>For more information, see the section on <a href="https://osf.io/wur56/wiki/Transformers/" rel="nofollow">schema transformers</a>.</p>
<p><strong>harvest()</strong> This function will be very similar for all custom harvesters. Request a group of records from your provider, and add each of those records to a list of RawDocuments (a type defined by scrAPI). A raw document consists of the raw record, stored in doc; the source, which is the name of the source; a docID, which is a unique identifier for the document; and the filetype returned by the provider API.</p>
<p><strong>helper functions()</strong> Each harvester will have smaller helper functions that format return values of the provider API. 
These are passed to the schema transformer to properly format the information received from the provider. These functions can include:</p>
<ul>
<li>process_contributors()</li>
<li>process_tags()</li>
<li>Any other helper functions that go in the schema transformer</li>
</ul>
<h2>Creating Tests</h2>
<p>Tests are mostly auto-generated, but you do have to make a slight modification to generate the test for the first time.</p>
<p>Inside of <code>scrapi/tests/test_harvesters.py</code>, change the 'record_mode' on line 22 to 'once'. It should now read:</p>
<pre class="highlight"><code>with vcr.use_cassette('tests/vcr/{}.yaml'.format(harvester_name), match_on=['host'], record_mode='once'):</code></pre>
<p>Run your test with: <code>py.test tests/test_harvesters.py::test_harvester\[your_shortname_here\]</code></p>
<p>There is a chance that your automatically created test will fail when run for the first time. If that's the case, you can record a new vcr file that will hopefully work:</p>
<p>Delete the old vcr file at <code>scrapi/tests/vcr/shortname.yaml</code></p>
<p>Change the date within the "freeze time" decorator above def test_harvester() to a date where you know the harvester had results (don't forget to import freeze_time from freezegun). For example: <code>@freeze_time("2014-03-15")</code></p>
<p>Inside of <code>scrapi/tests/test_harvesters.py</code>, make sure the record mode is still 'once'.</p>
<p>Re-run your test with <code>py.test tests/test_harvesters.py::test_harvester\[your_shortname_here\]</code></p>
<p>To run a test for a specific harvester, run <code>invoke one_test shortname</code></p>
<p>Make sure not to commit these changes to <code>test_harvesters.py</code>!</p>
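<p>The step of deleting an old vcr file can be scripted. The following is a minimal sketch and is not part of scrapi: <code>reset_cassette</code> is a hypothetical helper, and it assumes the cassette is stored at the path template passed to <code>vcr.use_cassette</code> in the test file (<code>tests/vcr/{short_name}.yaml</code>), relative to the directory you run it from.</p>

```python
import os


def reset_cassette(short_name, vcr_dir=os.path.join('tests', 'vcr')):
    """Delete the recorded cassette for one harvester, if it exists.

    Hypothetical helper, not part of scrapi. Returns a tuple of the
    cassette path and a flag saying whether a file was removed.
    """
    # Assumption: same template as vcr.use_cassette in test_harvesters.py
    cassette = os.path.join(vcr_dir, '{}.yaml'.format(short_name))
    removed = os.path.exists(cassette)
    if removed:
        os.remove(cassette)
    return cassette, removed
```

<p>For example, calling <code>reset_cassette('calhoun')</code> from the directory that contains <code>tests/</code> removes the old recording, so the next test run with <code>record_mode='once'</code> records a fresh one.</p>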