# General instructions
Here we provide the source code written to automatically correct and validate the raw data available at https://doi.org/10.6084/m9.figshare.7100525.v4.
The code checks, corrects and validates the content of each column of every dataset. Dataset names and brief descriptions are given below:
* **sheet1.csv**: CSV file with 22 columns and 439 rows (header=TRUE) containing the complete species traits data (body size, flight distance, known distribution, new occurrence records, crop pollination, sociality and nest location). Corresponding final file name, after validation: **Brazilian_bees_traits.csv**.
* **sheet2.csv**: CSV file with 11 columns and 933 rows (header=TRUE) containing the complete list of crops pollinated by the bee species included in this dataset. Corresponding final file name, after validation: **Brazilian_bees_crop_pollinators.csv**.
* **sheet3.csv**: CSV file with 19 columns and 1531 rows (header=TRUE) containing the complete data for the measured specimens (location, label data, specimen collection code, new occurrence data and sex information). Corresponding final file name, after validation: **Brazilian_bees_specimens_data.csv**.
Validations were performed individually for each dataset, considering the restrictions of each column. Below we provide the validation rules and the corresponding code used to validate each column.
> **Prerequisites**: Before running these validations, download all files provided [here][1] and ensure that Python 3.x is installed. The commands were designed to run on Linux; running them on another operating system requires adaptation.
## Changes and validation on sheet1.csv
### Automatic identification and correction of errors
As mentioned before, *sheet1.csv* contains 22 columns, which are presented below along with the validation rules used for each of them.
1. **Id:** integer column ranging from 1 to 438
2. **Family:** categorical column (valid values stored in *values_p1_1.csv*)
3. **Tribe:** categorical column (valid values stored in *values_p1_2.csv*)
4. **Genus:** categorical column (valid values stored in *values_p1_3.csv*)
5. **Subgenus:** categorical column (valid values stored in *values_p1_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p1_5.csv*)
7. **Scientific name authorship:** categorical column (valid values stored in *values_p1_6.csv*)
8. **Body size class:** categorical column (valid values stored in *values_p1_7.csv*)
9. **ITDmeasured:** continuous column
10. **Mhd:** continuous column
11. **Thd:** continuous column
12. **Mfd:** continuous column
13. **Mcd:** continuous column
14. **Location of measured specimen:** categorical column (valid values stored in *values_p1_13.csv*)
15. **Known distribution:** categorical column (no specific validation rules)
16. **New record:** categorical column (valid values stored in *values_p1_15.csv*)
17. **Locality:** categorical column (valid values stored in *values_p1_16.csv*)
18. **Crop pollinator:** categorical column (valid values stored in *values_p1_17.csv*)
19. **Sociality:** categorical column (valid values stored in *values_p1_18.csv*)
20. **Nest location:** categorical column (valid values stored in *values_p1_19.csv*)
21. **Level:** categorical column (valid values stored in *values_p1_20.csv*)
22. **Ref.:** categorical column (no specific validation rules)
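To make the categorical rules concrete, the sketch below shows one way a column could be checked against its list of valid values. It is not the code run by *p1.sh*: the function names, the one-value-per-line layout assumed for the *values_p1_X.csv* files, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def load_valid_values(handle):
    """Read valid values, assuming one value per line (hypothetical layout)."""
    return {row[0].strip() for row in csv.reader(handle) if row}


def find_invalid(rows, column_index, valid_values):
    """Return (row_number, value) pairs whose value is not in valid_values.

    Row 0 is the header, so data rows start at 1.
    """
    invalid = []
    for row_number, row in enumerate(rows):
        if row_number == 0:
            continue  # skip the header row
        value = row[column_index].strip()
        if value not in valid_values:
            invalid.append((row_number, value))
    return invalid


# Made-up sample standing in for sheet1.csv and a list of valid Sociality values.
sample_sheet = io.StringIO(
    "Id,Sociality\n"
    "1,Solitary\n"
    "2,eusocial\n"
    "3,Eusocial\n"
)
sample_values = io.StringIO("Cleptoparasitic\nEusocial\nSolitary\n")

valid = load_valid_values(sample_values)
rows = list(csv.reader(sample_sheet))
print(find_invalid(rows, 1, valid))
```

The comparison is case-sensitive, so a value such as `eusocial` is reported; how the actual scripts decide on the replacement value is not covered by this sketch.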
All of the above validations can be run together with the following command (run it from the folder containing the downloaded files):
```console
$ ./p1.sh
```
After running this command, the output file is stored in the *data* folder as *sheet1.csv* (the raw data is kept as *sheet1_v0.csv* in the same folder). All changes made by the automatic validation are recorded in the *check* folder as plain text files named *sheet1_X.log*, where *X* is the column number (the first column is numbered 0). Each line of a log file holds four comma-separated fields: the row number (starting at 0) where the change was made, the *Id* of the corresponding observation, the old column value, and the new value that replaced it.
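The four-field log format can be read back with a few lines of Python, for instance to review what the automatic correction changed. This is a minimal sketch: the `Change` type and `read_changes` function are hypothetical names, the sample lines are invented, and it assumes that values containing commas are quoted in the standard CSV way, which may not match the real log files.

```python
import csv
import io
from typing import NamedTuple


class Change(NamedTuple):
    row: int        # row number in the sheet, starting at 0
    record_id: str  # value of the Id column for that observation
    old: str        # value before the automatic correction
    new: str        # value that replaced it


def read_changes(handle):
    """Parse log lines into Change records, ignoring malformed lines."""
    changes = []
    for fields in csv.reader(handle):
        if len(fields) != 4:
            continue  # not a four-field log line
        changes.append(Change(int(fields[0]), fields[1], fields[2], fields[3]))
    return changes


# Invented log content in the described format: row, Id, old value, new value.
sample_log = io.StringIO(
    "12,12,apidae,Apidae\n"
    "57,57,Apidae ,Apidae\n"
)

changes = read_changes(sample_log)
for change in changes:
    print(f"row {change.row} (Id {change.record_id}): {change.old!r} -> {change.new!r}")
```

With a real file, the `io.StringIO` object would be replaced by `open("check/sheet1_1.log", newline="")` or the log of whichever column is of interest.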
### Validations of resulting dataset
The final dataset was validated with the *R* package *validate*, using the rules listed above. To reproduce the validation, run the following R script (for further details about *validate*, click [here][2]).
> Note: Download the corresponding final file (*Brazilian_bees_traits.csv*) before running this validation; the same applies to the next two validations. The repository address is given in the *General instructions* section.
```{r}
library(validate)
sheet1 <- read.csv("Brazilian_bees_traits.csv")
v <- validator(
  length(unique(paste0(sheet1$Genus, sheet1$Specific.epithet))) == 328,
  Id >= 1 & Id <= 438,
  Body.size.class %in% c("medium", "small", "large"),
  ITDmeasured >= 0.6 & ITDmeasured <= 8.7,
  Mhd >= 0.01184 & Mhd <= 63.01179,
  Thd >= 0.00652 & Thd <= 25.2884,
  Mfd >= 0.07122 & Mfd <= 25.88857,
  Mcd >= 0.03468 & Mcd <= 42.30383,
  Location.of.measured.specimen %in% c("MPEG", "UFMG"),
  New.record %in% c("yes", "no") | is.na(New.record),
  Locality %in% c("bocaina", "canaa dos carajas", "carajas", "nova lima",
    "parauapebas") | is.na(Locality),
  Crop.pollinator %in% c("yes", "no"),
  Sociality %in% c("Cleptoparasitic", "Eusocial", "Solitary"),
  Nest.location %in% c("soil", "termite", "cavity", "soil/termite", "exposed",
    "cavity/human_Made", "soil/cavity/human_Made", "soil/cavity/termite",
    "exposed/cavity", "ant") | is.na(Nest.location),
  Level %in% c("genus", "species", "subgenus", "tribe") | is.na(Level)
)
summary(confront(sheet1, v))
```
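For a quick check of a single numeric rule without R, the range conditions in the script above can be approximated in Python. The sketch below is an illustration only and does not replace the *validate* run: the function name `out_of_range` and the sample rows are made up, and it assumes missing values appear as empty cells or as `NA`.

```python
import csv
import io


def out_of_range(handle, column, low, high):
    """Return the Id of every row whose `column` value lies outside [low, high].

    Missing values (empty cells or "NA") are skipped rather than reported.
    """
    offenders = []
    for record in csv.DictReader(handle):
        cell = record[column].strip()
        if cell in ("", "NA"):
            continue
        if not low <= float(cell) <= high:
            offenders.append(record["Id"])
    return offenders


# Made-up rows; the bounds 0.6 and 8.7 are those used for ITDmeasured above.
sample = io.StringIO(
    "Id,ITDmeasured\n"
    "1,2.4\n"
    "2,9.1\n"
    "3,\n"
    "4,0.6\n"
    "5,NA\n"
)
flagged = out_of_range(sample, "ITDmeasured", 0.6, 8.7)
print(flagged)
```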
## Changes on sheet2.csv
### Automatic identification and correction of errors
The columns of the *sheet2.csv* dataset and their validation rules are:
1. **Id:** integer column ranging from 1 to 932
2. **Scientific name:** categorical column (valid values stored in *values_p2_1.csv*)
3. **Interaction Type:** categorical column (valid values stored in *values_p2_2.csv*)
4. **Family:** categorical column (valid values stored in *values_p2_3.csv*)
5. **Genus:** categorical column (valid values stored in *values_p2_4.csv*)
6. **Specific epithet:** categorical column (valid values stored in *values_p2_5.csv*)
7. **Plant:** categorical column (valid values stored in *values_p2_6.csv*)
8. **Scientific name authorship:** categorical column (valid values stored in *values_p2_7.csv*)
9. **Vernacular name:** categorical column (valid values stored in *values_p2_8.csv*)
10. **English vernacular name:** categorical column (valid values stored in *values_p2_9.csv*)
11. **Pollinators ref:** categorical column (no specific validation rules)
These validations can be run together with the following command (as in the previous section, it writes log files, named *sheet2_X.log*, to the *check* folder):
```console
$ ./p2.sh
```
### Validations of resulting dataset
The final dataset validation was performed using the following R script:
```{r}
library(validate)
sheet2 <- read.csv("Brazilian_bees_crop_pollinators.csv")
v <- validator(
  length(unique(Scientific.name)) <= 328,
  Id >= 1 & Id <= 932,
  Interaction.Type %in% c("Pollinates")
)
summary(confront(sheet2, v))
```
## Changes on sheet3.csv
### Automatic identification and correction of errors
The columns of the *sheet3.csv* dataset and their validation rules are:
1. **Id:** integer column ranging from 1 to 1530
2. **Genus:** categorical column (valid values stored in *values_p3_1.csv*)
3. **Subgenus:** categorical column (valid values stored in *values_p3_2.csv*)
4. **Specific epithet:** categorical column (valid values stored in *values_p3_3.csv*)
5. **Location of measured specimen:** categorical column (no specific validation rules)
6. **Country:** categorical column (valid values stored in *values_p3_5.csv*)
7. **new record:** categorical column (valid values stored in *values_p3_6.csv*)
8. **State:** categorical column (valid values stored in *values_p3_7.csv*)
9. **Municipality:** categorical column (valid values stored in *values_p3_8.csv*)
10. **Day:** categorical column (valid values stored in *values_p3_9.csv*)
11. **Month:** categorical column (valid values stored in *values_p3_10.csv*)
12. **Year:** categorical column (no specific validation rules)
13. **Location:** categorical column (valid values stored in *values_p3_12.csv*)
14. **Sampling point:** categorical column (valid values stored in *values_p3_13.csv*)
15. **ITD:** continuous column
16. **ITD average:** continuous column
17. **Sex:** categorical column (valid values stored in *values_p3_16.csv*)
18. **Collector:** categorical column (no specific validation rules)
19. **Collection ID:** categorical column (no specific validation rules)
These validations can be run together with the following command (as in the previous sections, it writes log files, named *sheet3_X.log*, to the *check* folder):
```console
$ ./p3.sh
```
### Validations of resulting dataset
After automatic correction, this file still required manual inspection and adjustment. The following changes were made by hand:
- Row 0, 7th column, replaced by: “New record”
- Row 0, 11th column, replaced by: “Month”
- Row 0, 18th column, replaced by: “Collector”
- Row 276, 15th column, replaced by: “5”
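The four manual adjustments can also be written down as data and applied programmatically, which makes them easier to repeat. The sketch below is an assumption about how that could look, not the procedure actually used: `apply_fixes` is a hypothetical helper, rows are counted from 0 (with row 0 being the header, as in the log files) and columns from 1 (so the 7th column is the one listed as *new record*), and the table is a placeholder rather than the real *sheet3.csv*.

```python
# (row number starting at 0, column ordinal starting at 1, replacement value)
MANUAL_FIXES = [
    (0, 7, "New record"),
    (0, 11, "Month"),
    (0, 18, "Collector"),
    (276, 15, "5"),
]


def apply_fixes(rows, fixes):
    """Return a copy of `rows` with every (row, column ordinal, value) fix applied."""
    patched = [list(row) for row in rows]
    for row_number, column_ordinal, value in fixes:
        patched[row_number][column_ordinal - 1] = value
    return patched


# Placeholder table with 19 columns: one header row plus 276 data rows.
header = [f"col{i}" for i in range(1, 20)]
rows = [header] + [["x"] * 19 for _ in range(276)]

patched = apply_fixes(rows, MANUAL_FIXES)
print(patched[0][6], patched[0][10], patched[0][17], patched[276][14])
```

With the real file, `rows` would come from `csv.reader` and `patched` would be written back with `csv.writer`; the input list is left unmodified so the original values remain available for comparison.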
The final dataset was then validated using the following R script (for further details about *validate*, click [here][2]).
```{r}
library(validate)
# The first rule refers to sheet1, so the traits file is loaded here as well
# to let the script run on its own.
sheet1 <- read.csv("Brazilian_bees_traits.csv")
sheet3 <- read.csv("Brazilian_bees_specimens_data.csv")
v <- validator(
  length(unique(paste0(sheet1$Genus, sheet1$Specific.epithet))) == 328,
  Id >= 1 & Id <= 1530,
  Location.of.measured.specimen %in% c("MPEG", "UFMG") |
    is.na(Location.of.measured.specimen) |
    (if (is.na(Location.of.measured.specimen)) Id == 40),
  New.record %in% c("yes", "no") | is.na(New.record),
  ITD >= 0.6 & ITD <= 8.8 | (if (is.na(ITD)) Id == 40),
  Sex %in% c("Male", "Female") | is.na(Sex)
)
summary(confront(sheet3, v))
```
[1]: ../../files/
[2]: https://cran.r-project.org/web/packages/validate/vignettes/introduction.html