# General instructions

Here we provide the source code created to automatically correct and validate the raw data provided by https://doi.org/10.6084/m9.figshare.7100525.v4. The code checks, corrects and validates the content of each column of each dataset. Dataset names and brief descriptions are given below:

* **sheet1.csv**: CSV file with 22 columns and 439 rows (header=TRUE) containing the complete species traits data (body size, flight distance, known distribution, new occurrence records, crop pollination, sociality and nest location). Corresponding final file name, after validation: **Brazilian_bees_traits.csv**.
* **sheet2.csv**: CSV file with 11 columns and 933 rows (header=TRUE) containing the complete list of crops pollinated by the bee species included in this dataset. Corresponding final file name, after validation: **Brazilian_bees_crop_pollinators.csv**.
* **sheet3.csv**: CSV file with 19 columns and 1531 rows (header=TRUE) containing the complete measured-specimen data (location, label data, specimen collection code, new occurrence data and sex information). Corresponding final file name, after validation: **Brazilian_bees_specimens_data.csv**.

Validations were performed individually for each dataset, considering the restrictions of each column. Below we provide the validation rules and the corresponding code used to validate each column.

> **Prerequisites**: Before running these validations, please download all files provided [here][1] and ensure that Python 3.x is installed. These command lines were designed to be run on Linux; technical changes are required to run them on a different OS.

## Changes and validation on sheet1.csv

### Automatic identification and correction of errors

As mentioned above, *sheet1.csv* contains 22 columns, which are listed below along with the validation rules used for each of them.

1. **Id:** integer column ranging from 1 to 438
2. **Family:** categorical column (valid values stored in *values_p1_1.csv*)
3. **Tribe:** categorical column (valid values stored in *values_p1_2.csv*)
4. **Genus:** categorical column (valid values stored in *values_p1_3.csv*)
5. **Subgenus:** categorical column (valid values stored in *values_p1_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p1_5.csv*)
7. **Scientific name authorship:** categorical column (valid values stored in *values_p1_6.csv*)
8. **Body size class:** categorical column (valid values stored in *values_p1_7.csv*)
9. **ITDmeasured:** continuous column
10. **Mhd:** continuous column
11. **Thd:** continuous column
12. **Mfd:** continuous column
13. **Mcd:** continuous column
14. **Location of measured specimen:** categorical column (valid values stored in *values_p1_13.csv*)
15. **Known distribution:** categorical column (no specific validation rules)
16. **New record:** categorical column (valid values stored in *values_p1_15.csv*)
17. **Locality:** categorical column (valid values stored in *values_p1_16.csv*)
18. **Crop pollinator:** categorical column (valid values stored in *values_p1_17.csv*)
19. **Sociality:** categorical column (valid values stored in *values_p1_18.csv*)
20. **Nest location:** categorical column (valid values stored in *values_p1_19.csv*)
21. **Level:** categorical column (valid values stored in *values_p1_20.csv*)
22. **Ref.:** categorical column (no specific validation rules)

All the aforementioned validations can be run together using the following command line (run it from the folder where the downloaded files are located):

```console
$ ./p1.sh
```

After running this command, the output file will be stored in the *data* folder, named *sheet1.csv* (the raw data is kept as *sheet1_v0.csv* in the same folder). Within the *check* folder, all changes performed by the automatic validation are stored in plain-text files named *sheet1_X.log*, where *X* corresponds to the number of the column (the first column is numbered here as 0). Each line of a log file contains 4 comma-separated fields: the row number (starting at 0) where the corresponding change was made, the *Id* of the respective observation, the old column value and the new value that replaced it.
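For illustration only, the sketch below reproduces this kind of whitelist check for one categorical column in R. The input paths, the assumption that *values_p1_1.csv* holds one valid value per line, and the `"<corrected>"` placeholder are ours; the actual scripts behind *p1.sh* are written in Python and may proceed differently.

```{r}
# Hypothetical sketch of the whitelist check for the Family column (column 1,
# counting from 0). Paths and the one-value-per-line whitelist layout are
# assumptions; p1.sh implements the real correction logic.
sheet1 <- read.csv("data/sheet1_v0.csv")
valid  <- readLines("values_p1_1.csv")
bad    <- which(!(sheet1$Family %in% valid))
for (i in bad) {
  # Emit a log line in the format described above: row number (row 0 is the
  # header, so data row i in R is file row i), Id, old value, new value. The
  # real scripts derive the new value from their correction step.
  cat(sprintf("%d,%s,%s,%s\n", i, sheet1$Id[i], sheet1$Family[i], "<corrected>"),
      file = "check/sheet1_1.log", append = TRUE)
}
```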
### Validations of resulting dataset

The final dataset was validated using the *R* package *validate*, applying the rules listed above. To reproduce the validation routine, run the following R script (for further details about *validate*, click [here][2]).

> Note: the corresponding final file (*Brazilian_bees_traits.csv*) must be downloaded before running this validation (and the next two). The repository address is given in the *General instructions* section.

```{r}
library(validate)
sheet1 <- read.csv("Brazilian_bees_traits.csv")
v <- validator(length(unique(paste0(sheet1$Genus, sheet1$Specific.epithet))) == 328,
               Id >= 1 & Id <= 438,
               Body.size.class %in% c("medium", "small", "large"),
               ITDmeasured >= 0.6 & ITDmeasured <= 8.7,
               Mhd >= 0.01184 & Mhd <= 63.01179,
               Thd >= 0.00652 & Thd <= 25.2884,
               Mfd >= 0.07122 & Mfd <= 25.88857,
               Mcd >= 0.03468 & Mcd <= 42.30383,
               Location.of.measured.specimen %in% c("MPEG", "UFMG"),
               New.record %in% c("yes", "no") | is.na(New.record),
               Locality %in% c("bocaina", "canaa dos carajas", "carajas", "nova lima", "parauapebas") | is.na(Locality),
               Crop.pollinator %in% c("yes", "no"),
               Sociality %in% c("Cleptoparasitic", "Eusocial", "Solitary"),
               Nest.location %in% c("soil", "termite", "cavity", "soil/termite", "exposed", "cavity/human_Made", "soil/cavity/human_Made", "soil/cavity/termite", "exposed/cavity", "ant") | is.na(Nest.location),
               Level %in% c("genus", "species", "subgenus", "tribe") | is.na(Level))
summary(confront(sheet1, v))
```
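`summary()` reports only pass/fail counts per rule. To see which records actually fail, recent versions of *validate* provide `violating()`; the sketch below is our addition, not part of the original routine, and applies only to record-wise rules, which is why the validator is subset.

```{r}
# List the records that fail a record-wise rule, rather than relying only on
# the pass/fail counts from summary(). Rule 2 of the validator above (the Id
# range check) is used as an example.
head(violating(sheet1, v[2]))
```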
## Changes on sheet2.csv

### Automatic identification and correction of errors

These are the columns of the *sheet2.csv* dataset and their corresponding validation rules:

1. **Id:** integer column ranging from 1 to 932
2. **Scientific name:** categorical column (valid values stored in *values_p2_1.csv*)
3. **Interaction Type:** categorical column (valid values stored in *values_p2_2.csv*)
4. **Family:** categorical column (valid values stored in *values_p2_3.csv*)
5. **Genus:** categorical column (valid values stored in *values_p2_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p2_5.csv*)
7. **Plant:** categorical column (valid values stored in *values_p2_6.csv*)
8. **Scientific name authorship:** categorical column (valid values stored in *values_p2_7.csv*)
9. **Vernacular name:** categorical column (valid values stored in *values_p2_8.csv*)
10. **English vernacular name:** categorical column (valid values stored in *values_p2_9.csv*)
11. **Pollinators ref:** categorical column (no specific validation rules)

These validations can be run together using the following command line (as in the previous section, it writes log files, named *sheet2_X.log*, into the *check* folder):

```console
$ ./p2.sh
```

### Validations of resulting dataset

The final dataset was validated using the following R script:

```{r}
library(validate)
sheet2 <- read.csv("Brazilian_bees_crop_pollinators.csv")
v <- validator(length(unique(Scientific.name)) <= 328,
               Id >= 1 & Id <= 932,
               Interaction.Type %in% c("Pollinates"))
summary(confront(sheet2, v))
```

## Changes on sheet3.csv

### Automatic identification and correction of errors

These are the columns of the *sheet3.csv* dataset and their corresponding validation rules:

1. **Id:** integer column ranging from 1 to 1530
2. **Genus:** categorical column (valid values stored in *values_p3_1.csv*)
3. **Subgenus:** categorical column (valid values stored in *values_p3_2.csv*)
4. **Specific epithet:** categorical column (valid values stored in *values_p3_3.csv*)
5. **Location of measured specimen:** categorical column (no specific validation rules)
6. **Country:** categorical column (valid values stored in *values_p3_5.csv*)
7. **New record:** categorical column (valid values stored in *values_p3_6.csv*)
8. **State:** categorical column (valid values stored in *values_p3_7.csv*)
9. **Municipality:** categorical column (valid values stored in *values_p3_8.csv*)
10. **Day:** categorical column (valid values stored in *values_p3_9.csv*)
11. **Month:** categorical column (valid values stored in *values_p3_10.csv*)
12. **Year:** categorical column (no specific validation rules)
13. **Location:** categorical column (valid values stored in *values_p3_12.csv*)
14. **Sampling point:** categorical column (valid values stored in *values_p3_13.csv*)
15. **ITD:** continuous column
16. **ITD average:** continuous column
17. **Sex:** categorical column (valid values stored in *values_p3_16.csv*)
18. **Collector:** categorical column (no specific validation rules)
19. **Collection ID:** categorical column (no specific validation rules)

These validations can be run together using the following command line (as in the previous sections, it writes log files, named *sheet3_X.log*, into the *check* folder):

```console
$ ./p3.sh
```

### Validations of resulting dataset

After the automatic correction, manual inspection and adjustment were required for this file, as described here:

- Row 0, 7th column, replaced by: “New record”
- Row 0, 11th column, replaced by: “Month”
- Row 0, 18th column, replaced by: “Collector”
- Row 276, 15th column, replaced by: “5”

Then the final dataset was validated using the following R script (for further details about *validate*, click [here][2]). Note that the NA checks on *Location of measured specimen* and *ITD* admit a missing value only for the observation with Id 40.

```{r}
library(validate)
sheet3 <- read.csv("Brazilian_bees_specimens_data.csv")
v <- validator(length(unique(paste0(sheet3$Genus, sheet3$Specific.epithet))) == 328,
               Id >= 1 & Id <= 1530,
               Location.of.measured.specimen %in% c("MPEG", "UFMG") | (is.na(Location.of.measured.specimen) & Id == 40),
               New.record %in% c("yes", "no") | is.na(New.record),
               (ITD >= 0.6 & ITD <= 8.8) | (is.na(ITD) & Id == 40),
               Sex %in% c("Male", "Female") | is.na(Sex))
summary(confront(sheet3, v))
```

[1]: ../../files/
[2]: https://cran.r-project.org/web/packages/validate/vignettes/introduction.html
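For reproducibility, the manual adjustments above can also be scripted. The sketch below applies them in R; the input path (*data/sheet3.csv*, i.e. the output of the automatic step) and the indexing convention (row 0 is the header and columns are counted from 1, as in the change list) are our assumptions.

```{r}
# Hypothetical sketch of the manual fixes listed above. Row 0 is the header,
# so file row 276 corresponds to data row 276 in R; columns follow the
# 1-based numbering of the column list.
raw <- read.csv("data/sheet3.csv", check.names = FALSE)
names(raw)[7]  <- "New record"  # row 0, 7th column
names(raw)[11] <- "Month"       # row 0, 11th column
names(raw)[18] <- "Collector"   # row 0, 18th column
raw[276, 15]   <- 5             # row 276, 15th column (ITD), set to "5"
write.csv(raw, "Brazilian_bees_specimens_data.csv", row.names = FALSE)
```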