# Analysis steps This project contains the materials to replicate the analyses in: Orme et al. (2019) Nature Ecology and Evolution XX: XX - XX DOI: 10.1038/s41559-019-0889-z Note that the code in this document is not a functional master script - some of the steps are run using a cluster so there are file transfers and other steps that are needed but this document shows the order of the steps and the scripts used. ## Missing data A number of the datasets used in the study are freely available for research but require registration to access. We have not duplicated them here. The missing files are: **Bird_maps_2018/\BOTW\BOTW.gdb**: See http://datazone.birdlife.org/. Used to provide range limits for study species and the wider Atlantic Forest avifauna. **SOS Mata Atlântica forest cover maps**: See https://www.sosma.org.br/projeto/atlas-da-mata-atlantica/. Used to provide broad scale maps of forest cover for extrapolation (2013-2014 assessment) and for calculation of forest cover when surveys were conducted (2010-2011 assessment). **Instituto Florestal forest cover maps** See http://iflorestal.sp.gov.br/. Higher resolution forest cover maps used to calculate forest cover around some survey sites. ## Preparing the study grid and calculating forest cover These steps generate a grid of the proportion forest cover across the study area to be used for later modelling and to define the region of interest. 1. Use Python to create a pickle file of the most recent SOS maps of the ACF fragments cd projection_python python3 extract_fragment_geometries.py Note that this script expects to find the 2013-2014 SOS Mata Atlantica maps in the GIS_layers folder. This dataset requires permission for research use so is not provided here. The script expects to find the set of state by state shapefiles at GIS_layers/Atlantic Forest cover maps/2013-2014/*.shp. 2. Use Python to create a high resolution regular grid over the study area with ~ 9 million 1km cells across the region python3 make_grid.py 3. The outputs of those scripts are: * ACF_fragments_pickle (fragment geometries and a spatial index) * grid.pickle (shapely rectangular grid geometries) * grid_centroids.pickle (shapely Points of grid centroids) * grid_centroids_XY.pickle (coordinates of centroid points) We used a cluster to calculate forest cover on the fine grid from those outputs. The details will change depending on the cluster available but we used a cluster running PBS to batch the calculation. The command below submits an array job to the queue to split the cells into 40 subjobs and calculates the cover in parallel. This is just repeatedly calling the script calculate_cover_grid.py using different starting offsets. It takes about 2.5 hours per subjob (so would be about 4 days of unparallelised runtime). qsub calculate_cover_grid_pbs.sh 4. That step populates a folder cover_grid with the 40 separate outputs. That folder needs to be recovered from the cluster and then stitched back together using: python3 compile_cover_grid.py That script compiles a raster of proportion forest cover for modelling, some binary rasters of forest presence at 1km, 10km and 20km resolutions and vector representations of those binary rasters ## Identification of Atlantic Forest avifauna These steps identify the set of species to be studied from the global avifauna maps. We have a set of species recorded from the field in the study sites, but also want to project the models for the whole avifauna, so we cut down the Birdlife Birds of the World maps to the species that intersect the forest cover vectors from the previous step. 1. A Python script that uses the 5km resolution forest cover polygons to subset the BOTW geodatabase cd ../Bird_maps_2018 python3 subset_botw.py Again, this step uses the Birds of the World distribution dataset, which requires registration for use from Birdlife International and so is not provided here. The script expects to find the geodatabase at: Bird_maps_2018\BOTW\BOTW.gdb. 2. Compile (~ 1 hour runtime) the bird species ranges into a shapefile (cleaned_ranges.shp) that: * only includes the Atlantic Forest birds we want to model (AllSpecies2018.csv), * has a single feature per species, and * passes validity checks. R CMD BATCH --vanilla make_cleaned_ranges.R 3. Get the coastline buffer to be used to remove coastal range edges. This is based on GSHHS but needs some minor modifications to remove clearly coastal features from the ranges that occur inland according to GSHHS. The folder here includes two excerpts from the GSHHS: the full dataset is large and updated (https://www.ngdc.noaa.gov/mgg/shorelines/) so we have saved the specific subset used in this study here along with shapefiles containing specific modifications to the coastline required for this study. cd ../Coastline R CMD BATCH --vanilla create_buffered_coastline.R The script creates new_world_continental_coastline.shp and new_world_continental_coastline_buffered.shp. 4. Combine (~ 15 minute runtime) the cleaned ranges and the coastline buffer to extract multipolylines showing the continental range extents of each species. This validates the species matching against the occupancy data. cd ../Bird_maps_2018 R CMD BATCH --vanilla extract_continental_margins.R ## Calculation of forest cover at sample sites Calculate the forest cover for the sample sites using appropriate high resolution data. This requires a set of high resolution GIS data found in the GIS_layers folder and creates the file forest_cover_hi_res.csv cd ../forest_cover_data R CMD BATCH --vanilla forest_cover_calculation_hi_res.R A second script is also provided that calculates forest area, proportion forest cover and buffer overlap between sites for differing buffer distances. These files are used to explore the influence of buffer distance on model parameters in the statistical analysis below. Note that both script expect to find the 2010-2011 SOS Mata Atlantica maps in the GIS_layers folder and the Instituto Florestal 2005 forest cover map, both of which will require the data to be requested as noted above. ## Compilation of modelling data Compile the model data - species presence/absence, forest cover and distance to range edge for study sites and study species and create a datetime stamped file model_data_YYYYMMDD.csv in the root directory. cd ../Bird_maps_2018 R CMD BATCH --vanilla compile_model_data.R ## Statistical analysis ### Cross validation The following steps run the cross validation on the main GLMER model, looking at leave-one-out cross validation for each study and single site models. The first script fits all the models and, because it takes a reasonably long time to run (~ 5.5 hours), it saves the resulting models in cross_validation_models.Rdata. cd Statistical_analysis/cross_validation R CMD BATCH --vanilla fit_cross_validation_models.R The next script loads the models, checks for convergence and compiles a data frame of cross validation scores in cross_validation_scores.csv for use in generating Figure S2. R CMD BATCH --vanilla get_cross_validation_results.R ### Buffer radius The paper includes analysis of the impact of choice of buffer radius on the models. These steps perform this analysis. This is a lengthy analysis as it involves fitting the main GLMER repeatedly with different choices of buffer distance and the recalculation of forest cover for those buffers. We again used a cluster running PBS to run sets of models in parallel. To replicate this, you will need to setup the required files and folder structure on your own cluster. First, a batch array job is submitted to the cluster that runs buffer_radius.R for blocks of 5 models. qsub buffer_radius_pbs.sh That creates the output_models directory and populates it with the models with different buffer distances. Once those have been retrieved from the cluster, the following command compiles the contents of those models and generates plots. R CMD BATCH --vanilla buffer_model_compiler.R ### Main statistical analysis The main statistical analyses presented in the paper are described in Statistical_analysis/Script_Orme_etal.r. This creates the .Rdata files and the glmer_params.csv file in that directory. ### Extrapolate model to Atlantic Forest Avifauna These steps project the model onto all of the terrestrial ACF species. We use a fine grid to get distances from range edge across all species ranges and then calculate probability of occurrence under the model parameters. 1. Calculate maps of species distance to range edge across the Atlantic Coastal Forest to model probabilities of occurrence. This step is set up to run on a cluster to reduce processing time which is considerable: ~ 18 hours * 40 blocks of ~ 20 species for basically a month of data crunching. The max block run time (genera with big ranges) is getting close to 24 hours. The PBS file calculate_distances_pbs.sh runs the python file calculate_distances.py. You will need to look in this file to find the required files and directory structure to set it up on a cluster. qsub ./calculate_distances_pbs.sh This creates a directory distances containing a set of gzipped files showing distance to range edge for the study region for each species. These are very large (~30GB in total) and are not duplicated here. 2. Generate map surfaces: the following script extracts the distances to edge from the previous step and combines them with the model parameters to generate a species richness map, a predicted sum of incidence map and some other raster data sets. python3 compile_occupancy.py ## Figures The Statistical analysis folder contains R code to generate the figures from the model outputs and calculated data.