Installing and Running the Harvesters
======

Harvesters for the SHARE project are written in python and are run through scrAPI (SHARE core). If you'd like to install scrapi on your own machine and run a few of the harvesters, follow these steps!

## Getting started

- To run absolutely everything, you will need to:
    - Install requirements
    - Install Elasticsearch
    - Install Cassandra
    - Install harvesters
    - Install rabbitmq (optional)
- To only run harvesters locally, you do not have to install rabbitmq.

### Requirements

- Make sure you have python installed and working properly.
- Create and enter a [virtual environment](http://virtualenv.readthedocs.org/en/latest/virtualenv.html) for scrapi, and go to the top level project directory. From there, run

```bash
$ pip install -r requirements.txt
```

Or, if you'd like some nicer testing and debugging utilities in addition to the core requirements, run

```bash
$ pip install -r dev-requirements.txt
```

This will also install the core requirements like normal.

### Installing Cassandra and Elasticsearch

_note: JDK 7 must be installed for Cassandra and Elasticsearch to run_

_note: As long as you don't specify Cassandra or Elasticsearch in your processing settings and set RECORD_HTTP_TRANSACTIONS to ```False``` in your local.py, you shouldn't need to have them installed to get at least basic functionality working_

#### Mac OSX

```bash
$ brew install cassandra
$ brew install elasticsearch
```

#### Ubuntu

##### Install Cassandra

1. Check which version of Java is installed by running the following command:

    ```bash
    $ java -version
    ```

    Use the latest version of Oracle Java 7 on all nodes.

2. Add the DataStax Community repository to /etc/apt/sources.list.d/cassandra.sources.list:

    ```bash
    $ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
    ```

3. Add the DataStax repository key to your aptitude trusted keys:

    ```bash
    $ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
    ```

4. Install the packages:

    ```bash
    $ sudo apt-get update
    $ sudo apt-get install dsc20=2.0.11-1 cassandra=2.0.11
    ```

##### Install Elasticsearch

1. Download and install the public signing key:

    ```bash
    $ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
    ```

2. Add the Elasticsearch repository to your /etc/apt/sources.list:

    ```bash
    $ sudo add-apt-repository "deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main"
    ```

3. Install the package:

    ```bash
    $ sudo apt-get update
    $ sudo apt-get install elasticsearch
    ```

__Now, just run__

```bash
$ cassandra
$ elasticsearch
```

Or, if you'd like your cassandra session to be bound to your current terminal session, run:

```bash
$ cassandra -f
```

and you should be good to go. (Note: if you're developing locally, you do not have to run rabbitmq!)

### Rabbitmq (optional)

#### Mac OSX

```bash
$ brew install rabbitmq
```

#### Ubuntu

```bash
$ sudo apt-get install rabbitmq-server
```
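Before configuring scrapi, it can be worth confirming that the services you just installed are actually running. The checks below are only a sketch and are not part of scrapi itself; they assume default ports and that the service binaries are on your PATH.

```bash
# Elasticsearch should answer with a small JSON status document on its default port (9200)
$ curl -s localhost:9200

# Cassandra should report the local node as Up/Normal (UN)
$ nodetool status

# rabbitmq, only if you installed it
$ sudo rabbitmqctl status
```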
### Settings

You will need to have a local copy of the settings. Copy local-dist.py into your own version of local.py:

```
cp scrapi/settings/local-dist.py scrapi/settings/local.py
```

If you installed Cassandra and Elasticsearch earlier, you will want to add the following configuration to your local.py:

```python
RECORD_HTTP_TRANSACTIONS = True  # Only if cassandra is installed

NORMALIZED_PROCESSING = ['cassandra', 'elasticsearch']
RAW_PROCESSING = ['cassandra']
```

Otherwise, you will want to make sure your local.py has the following configuration:

```python
RECORD_HTTP_TRANSACTIONS = False

NORMALIZED_PROCESSING = ['storage']
RAW_PROCESSING = ['storage']
```

This will save all harvested/normalized files to the directory ```archive/<source>/<document identifier>```.

_note: Be careful with this: if you harvest too many documents with the storage module enabled, you could start experiencing inode errors_

If you'd like to be able to run all harvesters, you'll need to [register for a PLOS API key](http://api.plos.org/registration/). Add the following line to your local.py file:

```
PLOS_API_KEY = 'your-api-key-here'
```

### Running the scheduler (optional)

From the top-level project directory, run:

```bash
$ invoke beat
```

to start the scheduler, and

```bash
$ invoke worker
```

to start the worker.

### Harvesters

Run all harvesters with

```bash
$ invoke harvesters
```

or just one with

```bash
$ invoke harvester harvester-name
```

Note: harvester-name is the same as the harvester's defined "shortname".

Invoke a harvester a certain number of days back with the ```--days``` argument. For example, to run a harvester over the past 5 days, run:

```bash
$ invoke harvester harvester-name --days=5
```

Invoke a harvester for a certain start date with the ```--start``` or ```-s``` argument, and for a certain end date with the ```--end``` or ```-e``` argument. For example, to run a harvester between the dates of March 14th and March 16th 2015, run:

```bash
$ invoke harvester harvester-name --start 2015-03-14 --end 2015-03-16
```

Either --start or --end can also be used on its own. Not supplying arguments will default to starting ```settings.DAYS_BACK``` days before the current date and ending on the current date. If --end is given with no --start, start will default to ```settings.DAYS_BACK``` days before the given end date.

### Testing

To run the tests for the project, just type

```bash
$ invoke test
```

and all of the tests in the 'tests/' directory will be run.

To run the tests for a specific harvester, run ```invoke one_test shortname```.

See the [names of the current providers](https://github.com/fabianvf/scrapi/tree/develop/scrapi/harvesters) in scrapi for examples of harvesters you can invoke.

### View Locally

To view which harvesters you have scraped locally, visit the following url:

http://localhost:9200/share_v2/

To view the results for a specific harvester, visit the following url:

http://localhost:9200/share_v2/_search?q=shareProperties.source:[replace shortname here]

To remove all local results, run:

```bash
$ invoke clear
```

To remove the results for one harvester, replace * below with its shortname:

```bash
$ curl -XDELETE 'localhost:9200/share_v2/*'
```

To remove both the ```share``` and ```share_v2``` indices from elasticsearch:

```bash
$ curl -XDELETE 'localhost:9200/share*'
```
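Putting it all together, a typical first local run might look something like the sketch below. Here ```harvester-name``` is only a stand-in for a real shortname from the harvesters directory, and the last two commands assume the storage and Cassandra/Elasticsearch configurations from the Settings section, respectively.

```bash
# Harvest the last two days of documents from one provider
$ invoke harvester harvester-name --days=2

# With the storage module enabled, the harvested/normalized files land on disk
$ ls archive/

# With Cassandra/Elasticsearch processing enabled, query the index instead
$ curl 'localhost:9200/share_v2/_search?q=shareProperties.source:harvester-name'
```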