![Logo of Te Kāhui Roro Reo, the New Zealand Institute of Language, Brain & Behaviour.][1] ![Logo of the UC Arts Digital Lab.][2]

# Creating Specialised Corpora from Digitised Historical Newspaper Archives: An Iterative Bootstrapping Approach

### Joshua Wilson Black

*UC Arts Digital Lab*

*The New Zealand Institute of Language, Brain & Behaviour*

This repository provides access to the code required to generate specialised corpora from METS/ALTO digitised historical newspaper archives and to data generated in the reported test case of the method. The code is shared via OSF's integration with GitHub, while the data from the test case is stored on OSF's storage service. Details of the GitHub repository can be found on its [page]().

The data stored here consists of:

- `corpora`: Candidate corpora at each stage, saved as pickled pandas dataframes.
  - `candidate_corpus_3_widenet.tar.gz` contains the corpus generated by the classifier after the third iteration with a threshold of 0.4, rather than 0.5. This corpus is not reported in the paper but is provided for the reader to experiment with.
- `labels`: Pandas dataframes and csv files of the labels for each iteration of the corpus construction process.
- `processed_data`: All 'article' items (i.e. _not_ advertisements) from the Papers Past Open Data Pilot, stored as a series of pandas dataframes.
- `meta`: Metadata concerning newspaper codes, items, and their location within the processed data.
- `wiki_images`: Logo images for this page.
- `corpus-3`: `csv` and plain text versions (zipped) of the third candidate corpus.

An interactive dashboard to explore the results over the iterations of the corpus construction process is available [here](). The code for this dashboard, which can be run locally, can be found [here]().

[1]:
[2]:
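To illustrate what the lower threshold behind `candidate_corpus_3_widenet.tar.gz` means in practice, here is a minimal sketch of how a classifier threshold decides which items enter a candidate corpus. It is not the repository's code: the function `select_candidates`, the item ids, and the scores are all hypothetical and invented for illustration, and the use of `>=` (rather than `>`) at the boundary is an assumption.

```python
from typing import Dict, List


def select_candidates(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the sorted ids of items whose classifier score meets the threshold.

    Hypothetical helper: the real pipeline may compare scores differently.
    """
    return sorted(item_id for item_id, score in scores.items() if score >= threshold)


# Toy scores, invented for illustration only (not taken from the test case).
toy_scores = {"item_a": 0.91, "item_b": 0.47, "item_c": 0.12, "item_d": 0.55}

standard = select_candidates(toy_scores, threshold=0.5)  # reported setting
widenet = select_candidates(toy_scores, threshold=0.4)   # "wide net" setting

print(standard)
print(widenet)
```

Lowering the threshold can only add items, so the wide-net selection is always a superset of the standard one. The trade-off is that the extra items are those the classifier was less sure about.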