Installing and Running the Harvesters
======
Harvesters for the SHARE project are written in Python and run through scrAPI, also known as SHARE core.
If you'd like to install scrapi on your own machine and run a few of the harvesters, follow these steps!
## Getting started
- To run absolutely everything, you will need to:
    - Install requirements
    - Install Elasticsearch
    - Install Cassandra
    - Install harvesters
    - Install RabbitMQ (optional)
- To run only the harvesters locally, you do not have to install RabbitMQ.
### Requirements
- Make sure you have Python installed and working properly.
- Create and enter a [virtual environment](http://virtualenv.readthedocs.org/en/latest/virtualenv.html) for scrapi (a minimal setup sketch follows at the end of this section), and go to the top-level project directory. From there, run
```bash
$ pip install -r requirements.txt
```
Or, if you'd like some nicer testing and debugging utilities in addition to the core requirements, run
```bash
$ pip install -r dev-requirements.txt
```
This will also install the core requirements like normal.
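If you haven't set up a virtual environment before, here is a minimal sketch, assuming ```virtualenv``` is installed and scrapi is cloned to ```~/scrapi``` (both paths are illustrative):
```bash
# Create a virtual environment for scrapi and activate it (paths are illustrative)
$ virtualenv ~/.virtualenvs/scrapi
$ source ~/.virtualenvs/scrapi/bin/activate

# From the top-level project directory, install the requirements as above
$ cd ~/scrapi
$ pip install -r requirements.txt
```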
### Installing Cassandra and Elasticsearch
_note: JDK 7 must be installed for Cassandra and Elasticsearch to run._
_note: As long as you don't include Cassandra or Elasticsearch in your processing settings and set RECORD_HTTP_TRANSACTIONS to ```False``` in your local.py, you shouldn't need to have them installed to get at least basic functionality working._
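If you're on Ubuntu and don't already have Java 7, one common route at the time of writing was Oracle's installer from the WebUpd8 PPA. This is only a sketch, not the project's official instructions; the PPA and package name are assumptions about your setup:
```bash
# Add the WebUpd8 PPA and install the Oracle Java 7 installer package
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
```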
#### Mac OS X
```bash
$ brew install cassandra
$ brew install elasticsearch
```
#### Ubuntu
##### Install Cassandra
1. Check which version of Java is installed by running the following command:
```bash
$ java -version
```
Use the latest version of Oracle Java 7 on all nodes.
2. Add the DataStax Community repository to /etc/apt/sources.list.d/cassandra.sources.list.
```bash
$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
```
3. Add the DataStax repository key to your APT trusted keys.
```bash
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
```
4. Install the packages.
```bash
$ sudo apt-get update
$ sudo apt-get install dsc20=2.0.11-1 cassandra=2.0.11
```
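On Ubuntu, the DataStax package typically starts Cassandra as a service right away. A quick sanity check, assuming default settings (the local node should report status ```UN```, Up/Normal):
```bash
# Confirm the service is running and the local node is healthy
$ sudo service cassandra status
$ nodetool status
```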
##### Install Elasticsearch
1. Download and install the Public Signing Key.
```bash
$ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
```
2. Add the Elasticsearch repository to your /etc/apt/sources.list.
```bash
$ sudo add-apt-repository "deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main"
```
3. Install the package.
```bash
$ sudo apt-get update
$ sudo apt-get install elasticsearch
```
__Now, just run__
```bash
$ cassandra
$ elasticsearch
```
Or, if you'd like Cassandra to run in the foreground, bound to your current terminal session, run:
```bash
$ cassandra -f
```
and you should be good to go.
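To confirm that both services are up, you can hit Elasticsearch's root endpoint and open a CQL shell (both commands assume the default ports):
```bash
# Elasticsearch responds on port 9200 with cluster metadata
$ curl http://localhost:9200/

# cqlsh opens an interactive shell against the local Cassandra node
$ cqlsh
```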
(Note: if you're developing locally, you do not have to run RabbitMQ!)
### RabbitMQ (optional)
#### Mac OS X
```bash
$ brew install rabbitmq
```
#### Ubuntu
```bash
$ sudo apt-get install rabbitmq-server
```
### Settings
You will need a local copy of the settings. Copy local-dist.py to your own local.py:
```bash
$ cp scrapi/settings/local-dist.py scrapi/settings/local.py
```
If you installed Cassandra and Elasticsearch earlier, you will want to add the following configuration to your local.py:
```python
RECORD_HTTP_TRANSACTIONS = True # Only if cassandra is installed
NORMALIZED_PROCESSING = ['cassandra', 'elasticsearch']
RAW_PROCESSING = ['cassandra']
```
Otherwise, you will want to make sure your local.py has the following configuration:
```python
RECORD_HTTP_TRANSACTIONS = False
NORMALIZED_PROCESSING = ['storage']
RAW_PROCESSING = ['storage']
```
This will save all harvested/normalized files to the directory ```archive/<source>/<document identifier>```.
_note: Be careful with this: if you harvest too many documents with the storage module enabled, you could start running into inode errors, since every harvested document is saved as its own set of files._
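You can keep an eye on inode usage with ```df```; pointing it at the ```archive/``` directory reports on the filesystem that holds it:
```bash
# Show inode usage (IUse%) for the filesystem containing the archive directory
$ df -i archive/
```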
If you'd like to be able to run all harvesters, you'll need to [register for a PLOS API key](http://api.plos.org/registration/).
Add the following line to your local.py file:
```python
PLOS_API_KEY = 'your-api-key-here'
```
### Running the scheduler (optional)
- From the top-level project directory, run:
```bash
$ invoke beat
```
to start the scheduler, and
```bash
$ invoke worker
```
to start the worker.
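Both commands block the terminal they run in, so either use two terminals or background them. A sketch (the log file names are arbitrary):
```bash
# Run the scheduler and the worker in the background, logging to files
$ invoke beat > beat.log 2>&1 &
$ invoke worker > worker.log 2>&1 &
```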
### Harvesters
Run all harvesters with
```bash
$ invoke harvesters
```
or, just one with
```bash
$ invoke harvester harvester-name
```
Note: ```harvester-name``` is the same as the harvester's defined "shortname".
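For example, assuming a harvester whose shortname is ```plos``` (the shortname here is hypothetical; see the list of harvesters below for real ones):
```bash
$ invoke harvester plos
```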
Invoke a harvester a certain number of days back with the ```--days``` argument. For example, to run a harvester 5 days in the past, run:
```bash
$ invoke harvester harvester-name --days=5
```
Invoke a harvester for a certain start date with the ```--start``` or ```-s``` argument. Invoke a harvester for a certain end date with the ```--end``` or ```-e``` argument.
For example, to run a harvester between the dates of March 14th and March 16th 2015, run:
```bash
$ invoke harvester harvester-name --start 2015-03-14 --end 2015-03-16
```
Either ```--start``` or ```--end``` can also be used on its own. Supplying neither will default to starting the number of days specified in ```settings.DAYS_BACK``` in the past and ending on the current date.
If ```--end``` is given with no ```--start```, start will default to the number of days specified in ```settings.DAYS_BACK``` before the given end date.
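For example, assuming ```settings.DAYS_BACK``` is 5, the following run would harvest 2015-03-11 through 2015-03-16:
```bash
$ invoke harvester harvester-name --end 2015-03-16
```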
### Testing
- To run the tests for the project, just type
```bash
$ invoke test
```
and all of the tests in the ```tests/``` directory will be run.
To run the tests for a specific harvester, run ```invoke one_test shortname```.
See the [names of the current providers](https://github.com/fabianvf/scrapi/tree/develop/scrapi/harvesters) in scrapi for examples of harvesters you can invoke.
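For example, assuming the PLOS harvester's shortname is ```plos```:
```bash
$ invoke one_test plos
```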
### View Locally
To view what you have harvested locally, visit the following URL:
http://localhost:9200/share_v2/
To view the results for a specific harvester, visit the following URL:
http://localhost:9200/share_v2/_search?q=shareProperties.source:[replace shortname here]
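The same query can be run from the command line; here with a hypothetical ```plos``` shortname, using Elasticsearch's ```pretty``` flag for readable output:
```bash
# Search the share_v2 index for documents from a single source
$ curl 'localhost:9200/share_v2/_search?q=shareProperties.source:plos&pretty'
```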
To remove all local results, run:
```bash
$ invoke clear
```
To remove the results for one harvester, replace ```*``` below with the harvester shortname.
```bash
$ curl -XDELETE 'localhost:9200/share_v2/*'
```
To remove both the ```share``` and ```share_v2``` indices from Elasticsearch:
```bash
$ curl -XDELETE 'localhost:9200/share*'
```