This project contains data and code for the following datasets. Graph types in these datasets were assessed using the automated screening tool Barzooka.

**Validation datasets** (folder 'validation_datasets'):

1. **Internal validation dataset:** Internal validation was performed on a set of 3,812 pages that were collected together with the training dataset but not used for training.
2. **bioRxiv external validation dataset:** External validation was performed on a set of 1,107 preprints posted on bioRxiv in May 2019.
3. **Charité external validation dataset:** External validation for two graph types that were uncommon in the bioRxiv dataset (flow charts, pie charts) was performed on 1,000 randomly selected articles published between 2015 and 2019 by authors affiliated with Charité – Universitätsmedizin Berlin or the Berlin Institute of Health at Charité.

For each validation dataset there are two CSV files, containing the results of the manual assessment and the predictions from Barzooka. Additionally, there is an R script for each validation dataset that uses the manual and predicted results to calculate the performance metrics for the class predictions.

**Examining the effects of field and time** (folder 'study_results'):

This dataset was derived from the subset of open access articles deposited in PubMed Central. We selected up to 1,000 articles per field per year, for each of 23 fields between 2010 and 2020 (n = 227,998). Full details of the sampling strategy are described in the preprint associated with this project. The folder contains the CSV file with the screening results and a data dictionary describing the variables in the dataset. Additionally, the R script 'visualizations.R' creates the publication plots from the dataset.
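
The comparison of manual and predicted results for the validation datasets can be illustrated with a minimal sketch. It is written in Python for illustration and is not the project's R script. The column names (`paper_id`, `pie`), the toy data, the helper names, and the choice of sensitivity, specificity, and precision as metrics are all assumptions; the actual files and scripts in 'validation_datasets' may differ. The sketch also assumes that both files hold one binary (0/1) value per item and graph class.

```python
import csv
import io


def load_labels(csv_text, id_col, class_col):
    """Map each item id to a 0/1 label for one graph class (hypothetical layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: int(row[class_col]) for row in reader}


def class_metrics(manual, predicted):
    """Compare manual and predicted 0/1 labels for the items present in both."""
    shared = sorted(set(manual) & set(predicted))
    tp = sum(1 for k in shared if manual[k] == 1 and predicted[k] == 1)
    fp = sum(1 for k in shared if manual[k] == 0 and predicted[k] == 1)
    fn = sum(1 for k in shared if manual[k] == 1 and predicted[k] == 0)
    tn = sum(1 for k in shared if manual[k] == 0 and predicted[k] == 0)

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero when a class never occurs.
        return numerator / denominator if denominator else None

    return {
        "n": len(shared),
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Toy stand-ins for the manual-assessment and prediction CSV files (invented data).
MANUAL_CSV = """paper_id,pie
a,1
b,0
c,1
d,0
e,0
"""

PREDICTED_CSV = """paper_id,pie
a,1
b,0
c,0
d,1
e,0
"""

manual = load_labels(MANUAL_CSV, "paper_id", "pie")
predicted = load_labels(PREDICTED_CSV, "paper_id", "pie")
metrics = class_metrics(manual, predicted)
print(metrics)
```

To apply the same idea to real files, the in-memory strings would be replaced by the contents of the two CSV files for a validation dataset, and the function would be called once per graph class.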