SDM with R

1st QCBS R Symposium

Outline

This tutorial illustrates how to build, evaluate and project species distribution models within R. The main steps we will go through, described bellow, are the following:

  1. R environment preparation
  2. Data preparation
    • Study area and manipulating spatial information
    • Environmental data
    • Species occurrence data
      • Presence-and-absence data
      • Presence-only and pseudo-absences data
  3. Model fitting, prediction, and evaluation
  4. Projecting models
  5. Multi-species modelling
  6. DIY: Create a SDM yourself!
  7. Bonus

R environment preparation

Download the .zip file from here and extract files to your favorite QCBS R Symposium workshop folder.

To run this workshop, please use the latest version of R. A certain number of libraries are also required and will be downloaded. Install and load all required R libraries:

install.packages("rgdal")
install.packages("raster")
install.packages("rgeos")
install.packages("dismo")
install.packages("letsR")
install.packages("biomod2")
install.packages("biogeo")
install.packages("maptools")
install.packages("stringr")
install.packages("ggplot2")
install.packages("plotly")
install.packages("xtable")
install.packages("tidyr")
library(rgdal)
library(raster)
library(rgeos)
library(dismo)
library(letsR)
library(biomod2)
library(biogeo)
library(maptools)
library(stringr)
library(ggplot2)
library(plotly)
library(xtable)
library(tidyr)

Set your working directory to the same directory of the folder workshopSDM using the function setwd() or your prefered method.

Finally, we recommend to save and create a new working environment for this tutorial.

# save all the working space
save.image("QCBS_SDMTutorial.RData")
# free the working space
rm(list = ls())
# and get it back
load("QCBS_SDMTutorial.RData")

Data preparation

Selecting your study area and manipulating spatial files

Most species distribution models are done using spatial grid information. In the next step, we will learn how to create a polygon grid of a given region, which will be used within our species distribution model framework.

For this tutorial, we have chosen the Neotropical zoogeographical region as our study area. Load the polygon shapefile of it using the readOGR function:

neotropical_shape <- rgdal::readOGR("data/shapefiles/neotropical.shp")
OGR data source with driver: ESRI Shapefile 
Source: "data/shapefiles/neotropical.shp", layer: "neotropical"
with 1 features
It has 3 fields
plot(neotropical_shape)

Now that we have our study region, we need to create a grid, intersect our shapefile and create a new grid of that region. For this, we will use the modified GridFilter function (originally from Hidasi-Neto, J.). It allows us to input a polygon shapefile, specify a resolution and the proportion of overlap for each cell to be considered.

We will apply the function to the polygon neotropical_shape you have loaded.

GridFilter <- function(shape, resol = 1, prop = 0) {
    grid <- raster(extent(shape))
    res(grid) <- resol
    proj4string(grid) <- proj4string(shape)
    gridpolygon <- rasterToPolygons(grid)
    drylandproj <- spTransform(shape, CRS("+proj=laea"))
    gridpolproj <- spTransform(gridpolygon, CRS("+proj=laea"))
    gridpolproj$layer <- c(1:length(gridpolproj$layer))
    areagrid <- gArea(gridpolproj, byid = T)
    gridpolproj <- gBuffer(gridpolproj, byid = TRUE, width = 0)
    dry.grid <- intersect(drylandproj, gridpolproj)
    areadrygrid <- gArea(dry.grid, byid = T)
    info <- cbind(dry.grid$layer, areagrid[dry.grid$layer], areadrygrid)
    dry.grid$layer <- info[, 3]/info[, 2]
    dry.grid <- spTransform(dry.grid, CRS(proj4string(shape)))
    dry.grid.filtered <- dry.grid[dry.grid$layer >= prop, ]
}

# Create a spatial polygon grid for the Neotropical region, with 5 degrees x
# 5 degrees

neotropical_grid <- GridFilter(neotropical_shape, resol = 5, prop = 0.5)

## Create an ID field for it, which will be useful in the future
neotropical_grid$ID <- 1:(length(neotropical_grid))
# Export your resulting polygon grid
writeOGR(neotropical_grid, dsn = paste(getwd(), "/data/shapefiles", sep = ""), 
    layer = "neotropical_grid_5", driver = "ESRI Shapefile", overwrite_layer = T)

This is our resulting polygon grid, where each cell refers to a site (and a row) in our data.

neotropical_grid <- readOGR("data/shapefiles/neotropical_grid_5.shp")
OGR data source with driver: ESRI Shapefile 
Source: "data/shapefiles/neotropical_grid_5.shp", layer: "neotropical_grid_5"
with 63 features
It has 5 fields
plot(neotropical_grid)

Importing environmental variables

In species distribution modeling, predictor variables are typically extracted from raster (grid) type files. Each ‘raster’ should represent a given variable of interest, which can be climatic, soil, terrain, vegetation, land use, and other types variables.

The getData function from the dismo package is able to retrieve useful climate, elevation and administrative data.

bioClimVars <- getData("worldclim", var = "bio", res = 10)
bioClimVars

You can also load your predictor variables from any directory within your computer using the raster (single file) and stack (loads multiple files) functions:

rasterFiles <- list.files("data/rasters/bio_10m_bil", pattern = "bil$", full.names = TRUE)
rasterFiles
 [1] "data/rasters/bio_10m_bil/bio1.bil" 
 [2] "data/rasters/bio_10m_bil/bio10.bil"
 [3] "data/rasters/bio_10m_bil/bio11.bil"
 [4] "data/rasters/bio_10m_bil/bio12.bil"
 [5] "data/rasters/bio_10m_bil/bio13.bil"
 [6] "data/rasters/bio_10m_bil/bio14.bil"
 [7] "data/rasters/bio_10m_bil/bio15.bil"
 [8] "data/rasters/bio_10m_bil/bio16.bil"
 [9] "data/rasters/bio_10m_bil/bio17.bil"
[10] "data/rasters/bio_10m_bil/bio18.bil"
[11] "data/rasters/bio_10m_bil/bio19.bil"
[12] "data/rasters/bio_10m_bil/bio2.bil" 
[13] "data/rasters/bio_10m_bil/bio3.bil" 
[14] "data/rasters/bio_10m_bil/bio4.bil" 
[15] "data/rasters/bio_10m_bil/bio5.bil" 
[16] "data/rasters/bio_10m_bil/bio6.bil" 
[17] "data/rasters/bio_10m_bil/bio7.bil" 
[18] "data/rasters/bio_10m_bil/bio8.bil" 
[19] "data/rasters/bio_10m_bil/bio9.bil" 
# Stack the bioclimatic variables
bioClimVars <- stack(rasterFiles)
bioClimVars
class       : RasterStack 
dimensions  : 900, 2160, 1944000, 19  (nrow, ncol, ncell, nlayers)
resolution  : 0.1666667, 0.1666667  (x, y)
extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs 
names       :  bio1, bio10, bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19,  bio2,  bio3,  bio4,  bio5, ... 
min values  :  -269,   -97,  -488,     0,     0,     0,     0,     0,     0,     0,     0,     9,     8,    72,   -59, ... 
max values  :   314,   380,   289,  9916,  2088,   652,   261,  5043,  2159,  4001,  3985,   211,    95, 22673,   489, ... 

You can also plot your variables to take a look at them:

plot(bioClimVars/10)  # WorldClim data is multiplied by 10. Let's take a look at the real information.

Variable extraction

Before extracting variables, it is really important that your rasters have the same projection (i.e. type of coordinate reference system) as your other spatial files. See ?CRS for help in finding CRS definitions.

crs.geo <- CRS("+proj=longlat +ellps=WGS84 + datum=WGS84")  # geographical, datum WGS84

neotropical_grid <- spTransform(neotropical_grid, crs.geo)  # define projection system of grid data
proj4string(bioClimVars) <- crs.geo  # and of our bioclimatic rasters

Now, we can extract the predictor values for each cell of our grid!

myExpl <- data.frame(bio_1 = numeric(length(neotropical_grid)), bio_2 = numeric(length(neotropical_grid)), 
    bio_3 = numeric(length(neotropical_grid)))

# bio_1_ext is a list of cells that contains all values for each cell, for
# each predictor
bio_1_ext <- raster::extract(bioClimVars$bio1, neotropical_grid)
bio_2_ext <- raster::extract(bioClimVars$bio2, neotropical_grid)
bio_3_ext <- raster::extract(bioClimVars$bio3, neotropical_grid)

myExpl$bio_1 <- unlist(lapply(bio_1_ext, function(x) if (!is.null(x)) mean(x, 
    na.rm = TRUE) else NA))

myExpl$bio_2 <- unlist(lapply(bio_2_ext, function(x) if (!is.null(x)) mean(x, 
    na.rm = TRUE) else NA))

myExpl$bio_3 <- unlist(lapply(bio_1_ext, function(x) if (!is.null(x)) mean(x, 
    na.rm = TRUE) else NA))

write.table(myExpl, "data/matrices/NT_EnvVar_5.txt")

# Let's take a look at our table:
head(myExpl)
     bio_1     bio_2    bio_3
1 215.9267 138.49084 215.9267
2 236.4301 114.60676 236.4301
3 241.8685  99.48434 241.8685
4 239.1822  91.49031 239.1822
5 235.9922  99.52013 235.9922
6 269.0044 101.91778 269.0044

Importing species data

Using expert drawn maps

One way of acquiring species distribution information is from expert-drawn maps, such as the ones from the IUCN Red List for Threatened Species database. These maps Tare usually done used multiple methods: only on occurrence data (e.g. a polygon around known occurrences), whereas others incorporate multiple degrees of knowledge, such as habitat requirements or elevational limits of the species, in essence using an informal distribution modelling approach.

Richness maps derived from expert-drawn range maps may overestimate local richness (‘error of commission’), in relation to point-to-grid richness maps. This because they are generally drawn to include all areas where a species is known to occur without excluding areas in between where the species may not exist. They tend to map the ‘extent of occurrence’ of species that includes the, perhaps much smaller, ‘area of occupancy’ (Loislle et al., 2003; Habib et al., 2004; Hurlbert & White, 2005; Graham & Hijmans, 2006).

We advise caution when using them.

# Load your species map file
speciesGeoDist <- readShapePoly("data/shapefiles/NT_TERRESTRIAL_MAMMALS_subset.shp", 
    proj4string = crs.geo)

plot(speciesGeoDist)

Create a presence-absence matrix of species’ geographic ranges within your polygon grid shapefile. For this, we will use the function lets.presab.grid from the letsR package.

crs.geo <- CRS("+proj=longlat +ellps=WGS84 + datum=WGS84")  # geographical, datum WGS84

proj4string(neotropical_grid) <- crs.geo  # define projection system of grid data
proj4string(speciesGeoDist) <- crs.geo  # and of our bioclimatic rasters

abspres.matrix <- lets.presab.grid(shapes = speciesGeoDist, grid = neotropical_grid, 
    sample.unit = "ID", remove.sp = TRUE, presence = NULL, origin = NULL, seasonal = NULL)

# Export your presence-absence matrix
write.table(abspres.matrix$PAM, "data/matrices/NT_PAM_5.txt")

You can now visualize a map of your species richness within your polygon grid:

richnessCalc <- rowSums(abspres.matrix$PAM) + 1
colPalette <- colorRampPalette(c("#fff5f0", "#fb6a4a", "#67000d"))
colors <- c("white", colPalette(max(richnessCalc)))
plot(abspres.matrix$grid, border = "gray40", col = colors[richnessCalc])
map(add = TRUE)

We also need the coordinates of our centroids to proceed with the data analysis:

# Extract coordinates from the cell centroids
myRespCoord <- as.data.frame(coordinates(abspres.matrix$grid))
colnames(myRespCoord) <- c("Longitude_X", "Latitude_Y")

# Export it for the future
write.table(myRespCoord, "data/matrices/NT_coords_5.txt")

Using occurrence records

We can also use occurrence records available online, or collected in the field. Here, we will use information from the Global Biodiversity Information Facility and import it using the function gbif:

D_youngi.GBIF <- gbif("Diaemus", "youngi")
# get data frame with spatial coordinates (points)
Dyoungi.Occ <- subset(D_youngi.GBIF, select = c("country", "lat", "lon"))
head(Dyoungi.Occ)  # a simple data frame with coordinates
              country      lat       lon
1           Nicaragua 11.11199 -85.76016
2 Trinidad and Tobago 10.54603 -61.48618
3            Suriname  1.99449 -56.09211
4            Suriname  1.99000 -56.09000
5            Suriname  1.99400 -56.09200
6            Suriname       NA        NA

Issues with occurrence points

Occurrence data from museum and herbarium collections are very valuable for mapping biodiversity patterns, and for species distribution modelling. But these datasets contain a lot of errors and are susceptible to many issues relating data quality. Many of these errors are extensively described in “Biogeo: an R package for assessing and improving data quality of occurrence record datasets”, from Robertson et al., 2016

Let’s take a look at a few of them:

Description of errors detected in point data with explanations of the likely cause of the error. Yesson et al. (2007) described several errors, which we give in brackets in the Error column; copied from Robertson et al. 2016
Error Probable.Reason
Points plotted at zero degrees latitude and longitude. No coordinates were available in the original dataset but values of zero assigned to the coordinates
Points in sea for terrestrial species or on land for aquatic species, obvious geographical outliers. Transposed latitude and longitude coordinates; incorrect sign on the decimal degrees of the latitude or longitude coordinate; degrees and minutes were transposed before the coordinate was converted to decimal degrees; imprecise locality description used to assign coordinates; the specimen was incorrectly identified by the collector or the incorrect name was applied to the species when the data were digitized.
Point in sea but close to coast for terrestrial species, or on land but close to coast for marine species. Low precision coordinates e.g. only degrees were available or the data were originally collected on a coarse-scale grid. Imprecise locality description used to assign coordinates.
Point plotted along the prime meridian or equator. Missing coordinate for latitude or for longitude that was incorrectly assigned a value of zero.
Country name given in the record does not correspond with country where point is plotted. Likely to be the same errors as for b) above.
Elevation given in the record does not correspond with elevation obtained from a digital elevation model where point is plotted. Likely to be the same errors as for b) above, or the spatial resolution of the digital elevation model is too coarse.

One way of working around these issues is by using the package biogeo, which has many functions for quality assessment and coordinate conversion.

Data cleaning

Let’s take a look at our species matrix. What is going when we look at the sum of the columns (i.e. number of presences)?

# Load our species data
DataSpecies <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)

colSums(DataSpecies)
            Panthera.onca             Pecari.tajacu 
                       50                        55 
      Peromyscus.leucopus   Phyllostomus.latifolius 
                        3                        10 
        Philander.opossum     Phyllomys.nigrispinus 
                       39                         3 
     Peromyscus.mexicanus       Phyllomys.dasythrix 
                        3                         2 
   Oxymycterus.amazonicus      Oxymycterus.hispidus 
                        8                         5 
      Peropteryx.macrotis      Pattonomys.carrikeri 
                       47                         1 
        Oxymycterus.josei   Peropteryx.pallidoptera 
                        1                         5 
        Oxymycterus.hiska        Peromyscus.aztecus 
                        3                         3 
        Peromyscus.beatae          Philander.deltae 
                        3                         1 
      Phylloderma.stenops         Peromyscus.furvus 
                       43                         1 
        Oxymycterus.rufus        Peromyscus.grandis 
                        7                         2 
    Phaenomys.ferrugineus       Peropteryx.kappleri 
                        1                        44 
    Phyllomys.blainvillii    Phyllomys.brasiliensis 
                        3                         2 
       Peromyscus.levipes       Philander.mondolfii 
                        1                         5 
       Philander.frenatus Phyllomys.mantiqueirensis 
                        7                         1 
   Phyllostomus.elongatus       Oxymycterus.nasutus 
                       41                         5 
     Peromyscus.hylocetes       Oxymycterus.hucucha 
                        1                         2 
      Oxymycterus.roberti   Oxymycterus.dasytrichus 
                        7                         3 
    Oxymycterus.angularis          Phyllomys.medius 
                        3                         3 
      Peromyscus.megalops       Pattonomys.occasius 
                        2                         3 
    Phyllostomus.hastatus           Phyllomys.kerri 
                       44                         1 
     Oxymycterus.quaestor     Peropteryx.leucoptera 
                        9                        28 
        Phyllomys.thomasi    Peromyscus.yucatanicus 
                        1                         2 
   Ozotoceros.bezoarticus         Peromyscus.gratus 
                       21                         1 
      Philander.andersoni     Pearsonomys.annectens 
                       12                         2 
      Oxymycterus.delator      Philander.mcilhennyi 
                        5                         4 
   Peromyscus.maniculatus       Peromyscus.gymnotis 
                        1                         1 
       Perognathus.flavus    Oxymycterus.paramensis 
                        1                         7 
         Oxymycterus.inca     Phyllostomus.discolor 
                        5                        46 
    Peromyscus.difficilis      Peromyscus.melanotis 
                        1                         1 
        Phyllomys.lamarum   Pattonomys.semivillosus 
                        3                         1 
          Phyllomys.lundi       Peromyscus.stirtoni 
                        1                         2 
        Phyllomys.pattoni         Phyllomys.sulinus 
                        4                         3 
   Peromyscus.melanophrys      Oxymycterus.caparoae 
                        2                         2 

We can work on that by removing species that occur in less than four sites.

# Remove all columns that have less than 4 presences
to.remove <- colnames(DataSpecies)[colSums(DataSpecies) <= 4]
`%ni%` <- Negate(`%in%`)
DataSpecies <- subset(DataSpecies, select = names(DataSpecies) %ni% to.remove)

Sometimes, we also have problems with the separators between species names. We can use the gsub() function to solve this type of issue.

# Replace '.' per '.'
names(DataSpecies) <- gsub(x = names(DataSpecies), pattern = "\\.", replacement = ".")

write.table(DataSpecies, "data/matrices/NT_PAM_5.txt")

Species distribution modelling

Initialising and formatting direct input data

First, input data needs to be rearranged for usage in biomod2 using BIOMOD_FormatingData(). Its most important arguments are:

  1. resp.var contains species data (for a single species) in binary format (ones for presences, zeros for true absences and NA for indeterminated) that will be used to build the species distribution models. It may be a vector or a spatial points object.
  2. expl.var contains a matrix, data.frame, SpatialPointsDataFrame or RasterStack containing your explanatory variables.
  3. resp.xy two columns matrix containing the X and Y coordinates of resp.var (only consider if resp.var is a vector)
  4. resp.name contains the response variable name (species).
# Recall your variables again:

## Load prepared species data
myResp <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)

## Load environmental variables
myExpl <- read.table("data/matrices/NT_EnvVar_5.txt", header = TRUE)

# Define the species of interest
myRespName <- names(myResp)[1]

# Load coordinates
myRespCoord <- read.table("data/matrices/NT_coords_5.txt", header = TRUE)
myBiomodData <- BIOMOD_FormatingData(resp.var = as.data.frame(myResp[,myRespName]),
                                     expl.var = myExpl,
                                     resp.xy = as.data.frame(myRespCoord),
                                     resp.name = myRespName,
                                     #PA.nb.rep = 2,
                                     #PA.nb.absences = 200,
                                     #PA.strategy = 'random',
                                     na.rm = TRUE)

-=-=-=-=-=-=-=-=-=-=-= Panthera.onca Data Formating -=-=-=-=-=-=-=-=-=-=-=

> No pseudo absences selection !
      ! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Now, you may check if your data is correctly formatted by printing and plotting the myBiomodData object.

myBiomodData

-=-=-=-=-=-=-=-=-=-=-=-=-= 'BIOMOD.formated.data' -=-=-=-=-=-=-=-=-=-=-=-=-=

sp.name =  Panthera.onca

     50 presences,  13 true absences and  0 undifined points in dataset


     3 explanatory variables

     bio_1            bio_2            bio_3       
 Min.   : 54.06   Min.   : 80.49   Min.   : 54.06  
 1st Qu.:182.15   1st Qu.: 99.89   1st Qu.:182.15  
 Median :236.43   Median :117.75   Median :236.43  
 Mean   :211.69   Mean   :115.79   Mean   :211.69  
 3rd Qu.:256.98   3rd Qu.:128.26   3rd Qu.:256.98  
 Max.   :269.00   Max.   :162.52   Max.   :269.00  

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
plot(myBiomodData)

Building models

Defining algorithm parameters using default options

We are ready for running our set of models on our species. This step is the core of the modeling procedure.

First, we define the of different options for each modeling technique using the function BIOMOD_ModelingOptions(). If we do not change its parameters, it will take the default values (run Print_Default_ModelingOptions() to see them) defined within biomod2.

For example, you could change the parameters of the GLM function to compute quadratic a GLM and select best model with ‘BIC’ criterium:

myBiomodOptions <- BIOMOD_ModelingOptions(GLM = list(type = "quadratic", interaction.level = 0, 
    myFormula = NULL, test = "BIC", family = "binomial", control = glm.control(epsilon = 1e-08, 
        maxit = 1000, trace = FALSE)))

You could also set your own GLM formula.

# you can prefer to establish your own GLM formula
myBiomodOptions <- BIOMOD_ModelingOptions(
                    GLM = list(myFormula = formula(Panthera.onca ~ bio3 + log(bio1) + poly(bio2,2) + bio4 + bio1:bio4))

But, for the sake of simplicity, we will keep all options default.

myBiomodOption <- BIOMOD_ModelingOptions()

# The created object is then given to BIOMOD_Modeling in the next step.

Computing models

Now, we proceed to running the set of models on our species. We may choose between 12 difeerent algorithms (‘GLM’, ‘GBM’, ‘GAM’, ‘CTA’, ‘ANN’, ‘SRE’, ‘FDA’, ‘MARS’, ‘RF’, ‘MAXENT.Phillips’ and ‘MAXENT.Tsuruoka’).

As we do not have evaluation data, we will make 3-fold cross-validation (number controlled by NbRunEval argument) of our models by randomly splitting our data set into 2 subsets DataSplit.

Take a look below:

myBiomodModelOut <- BIOMOD_Modeling(
    # BIOMOD_FormatingData()-class object
    data = myBiomodData,
    
    # The set of models to be calibrated on the data
    models = c('GBM', 'GLM', 'MAXENT.Tsuruoka', 'RF', 'MAXENT.Phillips'), 
    
    # BIOMOD.models.options object 
    models.options = myBiomodOption, 
    
    # Number of evaluation runs
    NbRunEval=3, 
    
    # % of data used to calibrate the models, the remaining part will be used for testing
    DataSplit=70,
    
    # If this argument is kept to NULL, each observation (presence or absence) has the same weight (independent of the number of presences and absences).
    Prevalence=0.5, 
    
    # Number of permutations to estimate variable importance
    VarImport=5,
    
    # Types of evaluations methods to be used. The complete list is within ?BIOMOD_Modeling()
    models.eval.meth = c('TSS','ROC'),
    
    # Keep all results and outputs on hard drive or not. Keep it always TRUE
    SaveObj = TRUE,
    
    # If true, all model predictions will be scaled with a binomial GLM
    rescal.all.models = FALSE,
    
    # If true, models calibrated and evaluated with the whole dataset are done.
    # Building models with all information available may be usefull in some particular 
    # cases (i.e. rare species with few presences points).
    do.full.models = FALSE,
    
    # character, the ID (=name) of modeling procedure. A random number by default
    modeling.id = paste(myRespName,"CurrentClim",sep="")) # We call it 'CurrentClim' because the bioclimatic variables we are working with are from the present climate.


Loading required library...

Checking Models arguments...

! java software seems not be corectly installed
  > MAXENT.Phillips modelling was switched off!
Creating suitable Workdir...

    > Automatic weights creation to rise a 0.5 prevalence


-=-=-=-=-=-=-=-=-=-=-= Panthera.onca Modeling Summary -=-=-=-=-=-=-=-=-=-=-=

 3  environmental variables ( bio_1 bio_2 bio_3 )
Number of evaluation repetitions : 3
Models selected : GBM GLM MAXENT.Tsuruoka RF 

Total number of model runs : 12 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


-=-=-=- Run :  Panthera.onca_AllData 


-=-=-=--=-=-=- Panthera.onca_AllData_RUN1 

Model=Generalised Boosting Regression 
     2500 maximum different trees and  3  Fold Cross-Validation
    Evaluating Model stuff...
    Evaluating Predictor Contributions... 

Model=GLM ( quadratic with no interaction )
    Stepwise procedure using AIC criteria
    selected formula : Panthera.onca ~ I(bio_1^2) + I(bio_2^2) + bio_1
<environment: 0x0000000020f37ca0>

    Evaluating Model stuff...
    Evaluating Predictor Contributions... 

Model evaluation

Extracting evaluation scores

We are able now to capture our model evaluations using the get_evaluations function. In this tutorial, we have focused on two evaluations methods: Receiving Operator Curves and True Skill Statistic.

# get all models evaluation
myBiomodModelEval <- get_evaluations(myBiomodModelOut)  #contains model evaluation scores

# print the dimnames of this object
dimnames(myBiomodModelEval)
[[1]]
[1] "TSS" "ROC"

[[2]]
[1] "Testing.data" "Cutoff"       "Sensitivity"  "Specificity" 

[[3]]
[1] "GBM"             "GLM"             "MAXENT.Tsuruoka" "RF"             

[[4]]
[1] "RUN1" "RUN2" "RUN3"

[[5]]
Panthera.onca_AllData 
            "AllData" 
hist(myBiomodModelEval)

write.table(myBiomodModelEval, file = "data/results/EvalAll_Model_Eval")

Plotting evaluation scores

We can also plot evaluation scores for sensitivity and specifity for all models.

# print the Sensitivity of all models --> TSS and ROC scores for SENSITIVITY
mySensitivity_All_Models <- myBiomodModelEval[, "Sensitivity", , , ]
mySensitivity_All_Models

mySensAll <- as.data.frame(t(as.data.frame(mySensitivity_All_Models)))

MySensHist.TSS <- ggplot(mySensAll, aes(x = TSS)) + geom_histogram(aes(y = ..density..), 
    binwidth = density(mySensAll$TSS)$bw) + geom_density(fill = "red", alpha = 0.2)

MySensHist.ROC <- ggplot(mySensAll, aes(x = ROC)) + geom_histogram(aes(y = ..density..), 
    binwidth = density(mySensAll$ROC)$bw) + geom_density(fill = "red", alpha = 0.2)

MySensHist.TSS
MySensHist.ROC

# save results
write.table(mySensitivity_All_Models, file = "data/results/Sensitivity_All_Models")
, , RUN1

    GBM GLM MAXENT.Tsuruoka  RF
TSS 100 100             100 100
ROC 100 100             100 100

, , RUN2

    GBM GLM MAXENT.Tsuruoka RF
TSS  75 100             100 75
ROC  75 100             100 75

, , RUN3

    GBM GLM MAXENT.Tsuruoka RF
TSS 100 100             100 75
ROC 100 100             100 75

# Proportion of PRESENCES correctly predicted (true positive rate)
sensitivity <- read.table("data/results/Sensitivity_All_Models")

summary(sensitivity)
    GBM.RUN1        GLM.RUN1  MAXENT.Tsuruoka.RUN1    RF.RUN1     
 Min.   :86.67   Min.   :80   Min.   :100          Min.   :73.33  
 1st Qu.:86.67   1st Qu.:80   1st Qu.:100          1st Qu.:73.33  
 Median :86.67   Median :80   Median :100          Median :73.33  
 Mean   :86.67   Mean   :80   Mean   :100          Mean   :73.33  
 3rd Qu.:86.67   3rd Qu.:80   3rd Qu.:100          3rd Qu.:73.33  
 Max.   :86.67   Max.   :80   Max.   :100          Max.   :73.33  
    GBM.RUN2        GLM.RUN2     MAXENT.Tsuruoka.RUN2    RF.RUN2  
 Min.   :86.67   Min.   :86.67   Min.   :86.67        Min.   :80  
 1st Qu.:86.67   1st Qu.:86.67   1st Qu.:86.67        1st Qu.:80  
 Median :86.67   Median :86.67   Median :86.67        Median :80  
 Mean   :86.67   Mean   :86.67   Mean   :86.67        Mean   :80  
 3rd Qu.:86.67   3rd Qu.:86.67   3rd Qu.:86.67        3rd Qu.:80  
 Max.   :86.67   Max.   :86.67   Max.   :86.67        Max.   :80  
    GBM.RUN3     GLM.RUN3  MAXENT.Tsuruoka.RUN3    RF.RUN3  
 Min.   :80   Min.   :80   Min.   :80           Min.   :80  
 1st Qu.:80   1st Qu.:80   1st Qu.:80           1st Qu.:80  
 Median :80   Median :80   Median :80           Median :80  
 Mean   :80   Mean   :80   Mean   :80           Mean   :80  
 3rd Qu.:80   3rd Qu.:80   3rd Qu.:80           3rd Qu.:80  
 Max.   :80   Max.   :80   Max.   :80           Max.   :80  
str(sensitivity)
'data.frame':   2 obs. of  12 variables:
 $ GBM.RUN1            : num  86.7 86.7
 $ GLM.RUN1            : int  80 80
 $ MAXENT.Tsuruoka.RUN1: int  100 100
 $ RF.RUN1             : num  73.3 73.3
 $ GBM.RUN2            : num  86.7 86.7
 $ GLM.RUN2            : num  86.7 86.7
 $ MAXENT.Tsuruoka.RUN2: num  86.7 86.7
 $ RF.RUN2             : int  80 80
 $ GBM.RUN3            : int  80 80
 $ GLM.RUN3            : int  80 80
 $ MAXENT.Tsuruoka.RUN3: int  80 80
 $ RF.RUN3             : int  80 80
sensitivity  #table of sensitivity by TSS or ROC per run per model
    GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS   86.667       80                  100  73.333   86.667   86.667
ROC   86.667       80                  100  73.333   86.667   86.667
    MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS               86.667      80       80       80                   80
ROC               86.667      80       80       80                   80
    RF.RUN3
TSS      80
ROC      80
sensTSS.plot <- ggplot(sensData, aes(x = model1, y = TSS, fill = model1)) + 
    geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using TSS scores") + 
    theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1), 
        panel.background = element_rect(fill = NA)) + labs(x = "Model Type", 
    y = "TSS (%)")

sensTSS.plot <- ggplotly(sensTSS.plot)

sensROC.plot <- ggplot(sensData, aes(x = model2, y = ROC, fill = model1)) + 
    geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using ROC scores") + 
    theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1), 
        panel.background = element_rect(fill = NA)) + labs(x = "Model Type", 
    y = "ROC (%)")

sensROC.plot <- ggplotly(sensROC.plot)

sensTSS.plot
sensROC.plot

Discussion: How do the models compare by their sensitivity scores?

##### plot SPECIFICITY #####

# Proportion of ABSENCES correctly predicted (true negative rate)

specificity <- read.table("data/results/Specificity_All_Models")
summary(specificity)
    GBM.RUN1      GLM.RUN1   MAXENT.Tsuruoka.RUN1    RF.RUN1   
 Min.   :100   Min.   :100   Min.   :100          Min.   :100  
 1st Qu.:100   1st Qu.:100   1st Qu.:100          1st Qu.:100  
 Median :100   Median :100   Median :100          Median :100  
 Mean   :100   Mean   :100   Mean   :100          Mean   :100  
 3rd Qu.:100   3rd Qu.:100   3rd Qu.:100          3rd Qu.:100  
 Max.   :100   Max.   :100   Max.   :100          Max.   :100  
    GBM.RUN2     GLM.RUN2   MAXENT.Tsuruoka.RUN2    RF.RUN2  
 Min.   :75   Min.   :100   Min.   :100          Min.   :75  
 1st Qu.:75   1st Qu.:100   1st Qu.:100          1st Qu.:75  
 Median :75   Median :100   Median :100          Median :75  
 Mean   :75   Mean   :100   Mean   :100          Mean   :75  
 3rd Qu.:75   3rd Qu.:100   3rd Qu.:100          3rd Qu.:75  
 Max.   :75   Max.   :100   Max.   :100          Max.   :75  
    GBM.RUN3      GLM.RUN3   MAXENT.Tsuruoka.RUN3    RF.RUN3  
 Min.   :100   Min.   :100   Min.   :100          Min.   :75  
 1st Qu.:100   1st Qu.:100   1st Qu.:100          1st Qu.:75  
 Median :100   Median :100   Median :100          Median :75  
 Mean   :100   Mean   :100   Mean   :100          Mean   :75  
 3rd Qu.:100   3rd Qu.:100   3rd Qu.:100          3rd Qu.:75  
 Max.   :100   Max.   :100   Max.   :100          Max.   :75  
str(specificity)
'data.frame':   2 obs. of  12 variables:
 $ GBM.RUN1            : int  100 100
 $ GLM.RUN1            : int  100 100
 $ MAXENT.Tsuruoka.RUN1: int  100 100
 $ RF.RUN1             : int  100 100
 $ GBM.RUN2            : int  75 75
 $ GLM.RUN2            : int  100 100
 $ MAXENT.Tsuruoka.RUN2: int  100 100
 $ RF.RUN2             : int  75 75
 $ GBM.RUN3            : int  100 100
 $ GLM.RUN3            : int  100 100
 $ MAXENT.Tsuruoka.RUN3: int  100 100
 $ RF.RUN3             : int  75 75
specificity  #table of specificity by TSS or ROC per run per model
    GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS      100      100                  100     100       75      100
ROC      100      100                  100     100       75      100
    MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS                  100      75      100      100                  100
ROC                  100      75      100      100                  100
    RF.RUN3
TSS      75
ROC      75
specTSS <- specificity[1, ]
specTran <- t(specTSS)
modelNames <- rownames(specTran)
model1 <- substr(modelNames, 1, 3)
specPlot1 <- data.frame(specTran, model1)

specROC <- specificity[2, ]
specTran <- t(specROC)
modelNames <- rownames(specTran)
model2 <- substr(modelNames, 1, 3)
specPlot2 <- data.frame(specTran, model2)

specData <- cbind(specPlot1, specPlot2)

Let’s take a look at our plots for specificity:

specTSS.plot <- ggplot(specData, aes(x = model1, y = TSS, fill = model1)) + 
    geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using TSS scores") + 
    theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1), 
        panel.background = element_rect(fill = NA)) + labs(x = "Model Type", 
    y = "TSS (%)")

specROC.plot <- ggplot(specData, aes(x = model2, y = ROC, fill = model1)) + 
    geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using ROC scores") + 
    theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1), 
        panel.background = element_rect(fill = NA)) + labs(x = "Model Type", 
    y = "ROC (%)")

# Make them interactive!
specTSS.plot <- ggplotly(specTSS.plot)
specROC.plot <- ggplotly(specROC.plot)

# Plot them
specTSS.plot
specROC.plot

We can also print the variable importance for all models

# print variable importances
get_variables_importance(myBiomodModelOut)
, , RUN1, AllData

        GBM   GLM MAXENT.Tsuruoka    RF
bio_1 0.922 0.919           0.238 0.297
bio_2 0.072 0.472           0.054 0.090
bio_3 0.000 0.000           0.279 0.337

, , RUN2, AllData

        GBM   GLM MAXENT.Tsuruoka    RF
bio_1 0.929 0.961           0.238 0.307
bio_2 0.059 0.072           0.065 0.035
bio_3 0.000 0.000           0.283 0.234

, , RUN3, AllData

        GBM   GLM MAXENT.Tsuruoka    RF
bio_1 0.908 0.895           0.244 0.233
bio_2 0.029 0.000           0.070 0.028
bio_3 0.000 0.000           0.295 0.274

This is another common way of saving these variables in disk for later use.

### save models evaluation scores and variables importance on hard drive
capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName, 
    paste(myRespName, "_formal_models_eval.txt", sep = "")))

capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName, 
    paste(myRespName, "_formal_models_var_imp.txt", sep = "")))

Finally, we can also see a few response plots

# 4. Plot response curves

# 4.1 Load the models for which we want to extract the predicted response
# curves
myGLMs <- BIOMOD_LoadModels(myBiomodModelOut, models = c("GLM"))

# 4.2 plot 2D response plots
myRespPlot2D <- response.plot2(models = myGLMs, Data = get_formal_data(myBiomodModelOut, 
    "expl.var"), show.variables = get_formal_data(myBiomodModelOut, "expl.var.names"), 
    do.bivariate = FALSE, fixed.var.metric = "median", col = c("blue", "red"), 
    legend = TRUE, data_species = get_formal_data(myBiomodModelOut, "resp.var"))

# 4.2 plot 3D response plots here only for a lone model
myRespPlot3D <- response.plot2(models = myGLMs[1], Data = get_formal_data(myBiomodModelOut, 
    "expl.var"), show.variables = get_formal_data(myBiomodModelOut, "expl.var.names"), 
    do.bivariate = TRUE, fixed.var.metric = "median", data_species = get_formal_data(myBiomodModelOut, 
        "resp.var"), display_title = FALSE)

### all the values used to produce this plot are stored into the returned
### object you can redo plots by yourself and customised them

Projecting your outputs

### Make projections on current variable
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl, 
    xy.new.env = myRespCoord, proj.name = "current", selected.models = "all", 
    binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")

-=-=-=-=-=-=-=-=-=-=-=-=-= Do Models Projections -=-=-=-=-=-=-=-=-=-=-=-=-=

    > Building clamping mask

*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
    > Projecting Panthera.onca_AllData_RUN1_GBM ...
    > Projecting Panthera.onca_AllData_RUN1_GLM ...
myCurrentProj <- get_predictions(myBiomodProj)

myCurrentProj
, , RUN1, AllData

      GBM GLM MAXENT.Tsuruoka   RF
 [1,] 921 850             853  930
 [2,] 861 869             935  998
 [3,] 753 809             956  674
 [4,] 650 670             961  894
 [5,] 774 715             951  936
 [6,] 947 987             974 1000
 [7,] 931 907             965 1000
 [8,] 649 212             926  912
 [9,] 899 932             973  992
[10,] 947 970             969 1000
[11,] 946 967             969 1000
[12,] 899 962             974  992
[13,] 897 908             974  852
[14,] 786 559             941  936
[15,] 899 953             973  992
[16,] 899 932             973  992
[17,] 899 968             979  852
[18,] 899 974             978  992

 [ reached getOption("max.print") -- omitted 45 row(s) and 2 matrix slice(s) ]
# Plot your predictions like this
plot(myBiomodProj)

Ensemble of forecasting

One powerful feature from biomod2comes with BIOMOD_EnsembleModeling, which combines individual models to build an ensemble of forecast. In the following example, we exclude all models having a TSS score lower than 0.5.

  ### Building ensemble-models
  myBiomodEM <- BIOMOD_EnsembleModeling(
    # a "BIOMOD.models.out" returned by BIOMOD_Modeling
    modeling.output = myBiomodModelOut,
    
    # a character vector (either 'all' or a sub-selection of model names) that defines the models kept for building the ensemble model
    chosen.models = 'all',
    
    # The value chosen for this parameter will control the number of ensemble models built.
    # 'all' means a total consensus model will be built
    em.by='all',
    
    # Evaluation metric used to build ensemble models
    eval.metric = 'all',
    
    # If not NULL, the minimum scores below which models will be excluded of the ensemble-models building
    eval.metric.quality.threshold = c(0.7, 0.5),
    
    # Estimates the ... across predictions
    prob.mean = T,
    prob.cv = T,
    prob.ci = T,
    prob.ci.alpha = 0.05,
    prob.median = T,
    committee.averaging = T,
    prob.mean.weight = T,
    prob.mean.weight.decay = 'proportional')

-=-=-=-=-=-=-=-=-=-=-=-=-= Build Ensemble Models -=-=-=-=-=-=-=-=-=-=-=-=-=

   ! all models available will be included in ensemble.modeling
   > Evaluation & Weighting methods summary :
      TSS over 0.7
      ROC over 0.5


  > mergedAlgo_mergedRun_mergedData ensemble modeling
   ! Models projections for whole zonation required...
    > Projecting Panthera.onca_AllData_RUN1_GBM ...
### Make ensemble-models projections on current variable
myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj, binary.meth = "TSS", 
    total.consensus = TRUE, EM.output = myBiomodEM)
# Plot your results
plot(myBiomodEF)

Multi-species modelling

Now, that we know how to do single species modelling, what changes if we want to model the distribution of n species?

biomod2 has no function that allows multi-species to be modelled. But, we can use the help of other R functions to deal with this! For example, we can create a for loop that will execute the modelling parts for each species within our species data frame. See below, (but do not execute it yet!)

Another (faster, and perhaps, more elegant) way for doing multi-species SDM using biomod2, is to use a apply family function. Like this one:

# Define your species names
sp.names <- names(myResp)

MyBiomodSF <- function(sp.n) {
    myRespName = sp.n
    cat("\n", myRespName, "modelling...")
    
    ### definition of data i.e keep only the column of our species
    myResp <- as.numeric(myResp[, myRespName])
    myRespCoord = myRespCoord
    
    ### Initialisation
    myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl, 
        resp.xy = myRespCoord, resp.name = myRespName)
    
    ### Options definition
    myBiomodOption <- BIOMOD_ModelingOptions()
    
    ### Modelling
    myBiomodModelOut <- BIOMOD_Modeling(myBiomodData, models = c("RF", "MAXENT.Tsuruoka"), 
        models.options = myBiomodOption, NbRunEval = 3, DataSplit = 80, Prevalence = 0.5, 
        VarImport = 3, models.eval.meth = c("TSS", "ROC"), SaveObj = TRUE, rescal.all.models = FALSE, 
        do.full.models = FALSE, modeling.id = paste(myRespName, "CurrentClim", 
            sep = ""))
    
    ### save models evaluation scores and variables importance on hard drive
    capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName, 
        paste(myRespName, "_formal_models_eval.txt", sep = "")))
    capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName, 
        paste(myRespName, "_formal_models_var_imp.txt", sep = "")))
    
    ### Building ensemble-models
    myBiomodEM <- BIOMOD_EnsembleModeling(modeling.output = myBiomodModelOut, 
        chosen.models = "all", em.by = "all", eval.metric = c("TSS"), eval.metric.quality.threshold = c(0.5), 
        prob.mean = T, prob.cv = T, prob.ci = T, prob.ci.alpha = 0.05, prob.median = T, 
        committee.averaging = T, prob.mean.weight = T, prob.mean.weight.decay = "proportional")
    
    ### Make projections on current variable
    myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl, 
        xy.new.env = myRespCoord, proj.name = "current", selected.models = "all", 
        binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")
    
    ### Make ensemble-models projections on current variable
    myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj, 
        binary.meth = "TSS", total.consensus = TRUE, EM.output = myBiomodEM)
}

And then you use lapply() and apply our MyBiomodSF function to all sp.names:

sp.names <- names(myResp)

myLapply_SFModelsOut <- lapply(sp.names, MyBiomodSF)

Produce a richness map

### Make ensemble-models projections on current variable load the first speces
### binary maps which will define the mask

alphaMap <- reclassify(subset(myExpl, 1), c(-Inf, Inf, 0))

alphaMap <- get(load(paste(getwd(), "/", myRespName[1], "/proj_current/", "proj_current_", 
    myRespName[1], "_ensemble_TSSbin.RData", sep = "")))[, 1]

DIY: Create your own SDM!

Hopefully, now you can create your own SDM!

Thank you for coming!

Pedro Henrique Braga and Julia Nordlund

April 23, 2017