A dataset of multi-functional ecological traits of Brazilian bees (data validation)

doi:10.17605/OSF.IO/CMJVX

# General instructions

Here we provide the source code created to automatically correct and validate the raw data provided by https://doi.org/10.6084/m9.figshare.7100525.v4. The code checks, corrects and validates the content of each column of all datasets. Dataset names and brief descriptions are given below:

* **sheet1.csv**: CSV file with 22 columns and 439 rows (header=TRUE) containing the complete species traits data (body size, flight distance, known distribution, new occurrence records, crop pollination, sociality and nest location). Corresponding final file name, after validation: **Brazilian_bees_traits.csv**.
* **sheet2.csv**: CSV file with 11 columns and 933 rows (header=TRUE) containing the complete list of crops pollinated by the bee species included in this dataset. Corresponding final file name, after validation: **Brazilian_bees_crop_pollinators.csv**.
* **sheet3.csv**: CSV file with 19 columns and 1531 rows (header=TRUE) containing the complete measured specimens data (location, label data, specimen collection code, new occurrences data and sex information). Corresponding final file name, after validation: **Brazilian_bees_specimens_data.csv**.

Validations were performed individually for each dataset, considering the restrictions of each column. Below we provide the validation rules and the corresponding code used to validate each column.

> **Prerequisites**: Before running these validations, please download all files provided [here][1] and ensure that Python 3.x is installed. These command lines were designed to be run on Linux. Technical changes are required to run them on a different OS.

## Changes and validation on sheet1.csv

### Automatic identification and correction of errors

As mentioned before, *sheet1.csv* contains 22 columns, which are presented below along with the validation rules used for each of them.

1. **Id:** integer column ranging from 1 to 438
2. **Family:** categorical column (valid values stored in *values_p1_1.csv*)
3. **Tribe:** categorical column (valid values stored in *values_p1_2.csv*)
4. **Genus:** categorical column (valid values stored in *values_p1_3.csv*)
5. **Subgenus:** categorical column (valid values stored in *values_p1_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p1_5.csv*)
7. **Scientific name authorship:** categorical column (valid values stored in *values_p1_6.csv*)
8. **Body size class:** categorical column (valid values stored in *values_p1_7.csv*)
9. **ITDmeasured:** continuous column
10. **Mhd:** continuous column
11. **Thd:** continuous column
12. **Mfd:** continuous column
13. **Mcd:** continuous column
14. **Location of measured specimen:** categorical column (valid values stored in *values_p1_13.csv*)
15. **Known distribution:** categorical column (no specific validation rules)
16. **New record:** categorical column (valid values stored in *values_p1_15.csv*)
17. **Locality:** categorical column (valid values stored in *values_p1_16.csv*)
18. **Crop pollinator:** categorical column (valid values stored in *values_p1_17.csv*)
19. **Sociality:** categorical column (valid values stored in *values_p1_18.csv*)
20. **Nest location:** categorical column (valid values stored in *values_p1_19.csv*)
21. **Level:** categorical column (valid values stored in *values_p1_20.csv*)
22. **Ref.:** categorical column (no specific validation rules)

All the aforementioned validations can be run together using the following command line (run it from the folder where the downloaded files are located):

```console
$ ./p1.sh
```

After running this command line, the output file is stored in the *data* folder as *sheet1.csv* (the raw data is named *sheet1_v0.csv* in the same folder). Within the *check* folder, all changes performed by the automatic validation are stored in plain text files named *sheet1_X.log*, where *X* is the number of the column (the first column is numbered 0). Log files contain comma-separated lines, each with 4 fields: the row number (starting at 0) where the change was made, the *Id* of the respective observation, the old column value and the new column value that replaced it.

### Validations of resulting dataset

The final dataset was validated using the *R* package *validate*, considering the rules already mentioned above. To reproduce the validation routine, run the following R script (for further details about *validate*, click [here][2]).

> Note: Download the corresponding final file (*Brazilian_bees_traits.csv*) before running this validation (and the next two). The repository address is given in the *General instructions* section.

```{r}
library(validate)
sheet1 <- read.csv("Brazilian_bees_traits.csv")
v <- validator(
  length(unique(paste0(sheet1$Genus, sheet1$Specific.epithet))) == 328,
  Id >= 1 & Id <= 438,
  Body.size.class %in% c("medium", "small", "large"),
  ITDmeasured >= 0.6 & ITDmeasured <= 8.7,
  Mhd >= 0.01184 & Mhd <= 63.01179,
  Thd >= 0.00652 & Thd <= 25.2884,
  Mfd >= 0.07122 & Mfd <= 25.88857,
  Mcd >= 0.03468 & Mcd <= 42.30383,
  Location.of.measured.specimen %in% c("MPEG", "UFMG"),
  New.record %in% c("yes", "no") | is.na(New.record),
  Locality %in% c("bocaina", "canaa dos carajas", "carajas", "nova lima", "parauapebas") | is.na(Locality),
  Crop.pollinator %in% c("yes", "no"),
  Sociality %in% c("Cleptoparasitic", "Eusocial", "Solitary"),
  Nest.location %in% c("soil", "termite", "cavity", "soil/termite", "exposed", "cavity/human_Made", "soil/cavity/human_Made", "soil/cavity/termite", "exposed/cavity", "ant") | is.na(Nest.location),
  Level %in% c("genus", "species", "subgenus", "tribe") | is.na(Level)
)
summary(confront(sheet1, v))
```

## Changes on sheet2.csv

### Automatic identification and correction of errors

These are the columns of the *sheet2.csv* dataset and their corresponding validation rules:

1. **Id:** integer column ranging from 1 to 932
2. **Scientific name:** categorical column (valid values stored in *values_p2_1.csv*)
3. **Interaction Type:** categorical column (valid values stored in *values_p2_2.csv*)
4. **Family:** categorical column (valid values stored in *values_p2_3.csv*)
5. **Genus:** categorical column (valid values stored in *values_p2_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p2_5.csv*)
7. **Plant:** categorical column (valid values stored in *values_p2_6.csv*)
8. **Scientific name authorship:** categorical column (valid values stored in *values_p2_7.csv*)
9. **Vernacular name:** categorical column (valid values stored in *values_p2_8.csv*)
10. **English vernacular name:** categorical column (valid values stored in *values_p2_9.csv*)
11. **Pollinators ref:** categorical column (no specific validation rules)

These validations can be run together using the following command line (as in the previous section, this produces log files named *sheet2_X.log* in the *check* folder):

```console
$ ./p2.sh
```

### Validations of resulting dataset

The final dataset validation was performed using the following R script:

```{r}
library(validate)
sheet2 <- read.csv("Brazilian_bees_crop_pollinators.csv")
v <- validator(
  length(unique(Scientific.name)) <= 328,
  Id >= 1 & Id <= 932,
  Interaction.Type %in% c("Pollinates")
)
summary(confront(sheet2, v))
```

## Changes on sheet3.csv

### Automatic identification and correction of errors

These are the columns of the *sheet3.csv* dataset and their corresponding validation rules:

1. **Id:** integer column ranging from 1 to 1530
2. **Genus:** categorical column (valid values stored in *values_p3_1.csv*)
3. **Subgenus:** categorical column (valid values stored in *values_p3_2.csv*)
4. **Specific epithet:** categorical column (valid values stored in *values_p3_3.csv*)
5. **Location of measured specimen:** categorical column (no specific validation rules)
6. **Country:** categorical column (valid values stored in *values_p3_5.csv*)
7. **new record:** categorical column (valid values stored in *values_p3_6.csv*)
8. **State:** categorical column (valid values stored in *values_p3_7.csv*)
9. **Municipality:** categorical column (valid values stored in *values_p3_8.csv*)
10. **Day:** categorical column (valid values stored in *values_p3_9.csv*)
11. **Month:** categorical column (valid values stored in *values_p3_10.csv*)
12. **Year:** categorical column (no specific validation rules)
13. **Location:** categorical column (valid values stored in *values_p3_12.csv*)
14. **Sampling point:** categorical column (valid values stored in *values_p3_13.csv*)
15. **ITD:** continuous column
16. **ITD average:** continuous column
17. **Sex:** categorical column (valid values stored in *values_p3_16.csv*)
18. **Collector:** categorical column (no specific validation rules)
19. **Collection ID:** categorical column (no specific validation rules)

These validations can be run together using the following command line (as in the previous sections, this produces log files named *sheet3_X.log* in the *check* folder):

```console
$ ./p3.sh
```

### Validations of resulting dataset

After automatic correction, manual inspection and adjustments were required and performed for this file, as described here:

- Row 0, 7th column, replaced by: “New record”
- Row 0, 11th column, replaced by: “Month”
- Row 0, 18th column, replaced by: “Collector”
- Row 276, 15th column, replaced by: “5”.

Then, the final dataset was validated using the following R script (for further details about *validate*, click [here][2]).

```{r}
library(validate)
# The first rule below refers to sheet1, so load it as well
# to make this script runnable on its own.
sheet1 <- read.csv("Brazilian_bees_traits.csv")
sheet3 <- read.csv("Brazilian_bees_specimens_data.csv")
v <- validator(
  length(unique(paste0(sheet1$Genus, sheet1$Specific.epithet))) == 328,
  Id >= 1 & Id <= 1530,
  Location.of.measured.specimen %in% c("MPEG", "UFMG") | is.na(Location.of.measured.specimen) | (if (is.na(Location.of.measured.specimen)) Id == 40),
  New.record %in% c("yes", "no") | is.na(New.record),
  ITD >= 0.6 & ITD <= 8.8 | (if (is.na(ITD)) Id == 40),
  Sex %in% c("Male", "Female") | is.na(Sex)
)
summary(confront(sheet3, v))
```

[1]: ../../files/
[2]: https://cran.r-project.org/web/packages/validate/vignettes/introduction.html
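
### Inspecting the correction logs (illustrative sketch)

The correction logs written to the *check* folder are described above as comma-separated lines with four fields (row number starting at 0, *Id*, old value, new value). The following Python snippet is a minimal sketch of how such a log could be read back and summarized. It is not part of the original scripts: the names `Change`, `read_changes` and `summarize` are hypothetical, the sample content is made up, and it assumes that fields containing commas, if any, are quoted in the usual CSV way.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class Change(NamedTuple):
    row: int      # 0-based row number where the change was made
    obs_id: str   # value of the Id column for that observation
    old: str      # value before correction
    new: str      # value after correction


def read_changes(handle):
    """Parse a correction log with four comma-separated fields per line."""
    changes = []
    for fields in csv.reader(handle):
        if len(fields) != 4:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        row, obs_id, old, new = fields
        changes.append(Change(int(row), obs_id, old, new))
    return changes


def summarize(changes):
    """Count how often each (old, new) replacement occurred."""
    return Counter((c.old, c.new) for c in changes)


# Made-up log content for illustration only; real logs live in the check folder.
sample = io.StringIO("3,3,Medium,medium\n17,17,smal,small\n42,42,Medium,medium\n")
changes = read_changes(sample)
for (old, new), n in summarize(changes).most_common():
    print(f"{old!r} -> {new!r}: {n}")
```

To apply the same functions to a real log, open the file with `open(path, newline="")` and pass the handle to `read_changes`; the path would follow the naming scheme described above, for example a file such as *check/sheet1_7.log* for the column numbered 7.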