SDM with R
SDM with R
1st QCBS R Symposium
Outline
This tutorial illustrates how to build, evaluate and project species distribution models within R. The main steps we will go through, described bellow, are the following:
- R environment preparation
- Data preparation
- Study area and manipulating spatial information
- Environmental data
- Species occurrence data
- Presence-and-absence data
- Presence-only and pseudo-absences data
- Model fitting, prediction, and evaluation
- Projecting models
- Multi-species modelling
- DIY: Create a SDM yourself!
- Bonus
R environment preparation
Download the .zip file from here and extract files to your favorite QCBS R Symposium workshop folder.
To run this workshop, please use the latest version of R. A certain number of libraries are also required and will be downloaded. Install and load all required R libraries:
install.packages("rgdal")
install.packages("raster")
install.packages("rgeos")
install.packages("dismo")
install.packages("letsR")
install.packages("biomod2")
install.packages("biogeo")
install.packages("maptools")
install.packages("stringr")
install.packages("ggplot2")
install.packages("plotly")
install.packages("xtable")
install.packages("tidyr")
library(rgdal)
library(raster)
library(rgeos)
library(dismo)
library(letsR)
library(biomod2)
library(biogeo)
library(maptools)
library(stringr)
library(ggplot2)
library(plotly)
library(xtable)
library(tidyr)
Set your working directory to the same directory of the folder workshopSDM
using the function setwd()
or your prefered method.
Finally, we recommend to save and create a new working environment for this tutorial.
# save all the working space
save.image("QCBS_SDMTutorial.RData")
# free the working space
rm(list = ls())
# and get it back
load("QCBS_SDMTutorial.RData")
Data preparation
Selecting your study area and manipulating spatial files
Most species distribution models are done using spatial grid information. In the next step, we will learn how to create a polygon grid of a given region, which will be used within our species distribution model framework.
For this tutorial, we have chosen the Neotropical zoogeographical region as our study area. Load the polygon shapefile of it using the readOGR
function:
neotropical_shape <- rgdal::readOGR("data/shapefiles/neotropical.shp")
OGR data source with driver: ESRI Shapefile
Source: "data/shapefiles/neotropical.shp", layer: "neotropical"
with 1 features
It has 3 fields
plot(neotropical_shape)
Now that we have our study region, we need to create a grid, intersect our shapefile and create a new grid of that region. For this, we will use the modified GridFilter
function (originally from Hidasi-Neto, J.). It allows us to input a polygon shapefile, specify a resolution and the proportion of overlap for each cell to be considered.
We will apply the function to the polygon neotropical_shape
you have loaded.
GridFilter <- function(shape, resol = 1, prop = 0) {
grid <- raster(extent(shape))
res(grid) <- resol
proj4string(grid) <- proj4string(shape)
gridpolygon <- rasterToPolygons(grid)
drylandproj <- spTransform(shape, CRS("+proj=laea"))
gridpolproj <- spTransform(gridpolygon, CRS("+proj=laea"))
gridpolproj$layer <- c(1:length(gridpolproj$layer))
areagrid <- gArea(gridpolproj, byid = T)
gridpolproj <- gBuffer(gridpolproj, byid = TRUE, width = 0)
dry.grid <- intersect(drylandproj, gridpolproj)
areadrygrid <- gArea(dry.grid, byid = T)
info <- cbind(dry.grid$layer, areagrid[dry.grid$layer], areadrygrid)
dry.grid$layer <- info[, 3]/info[, 2]
dry.grid <- spTransform(dry.grid, CRS(proj4string(shape)))
dry.grid.filtered <- dry.grid[dry.grid$layer >= prop, ]
}
# Create a spatial polygon grid for the Neotropical region, with 5 degrees x
# 5 degrees
neotropical_grid <- GridFilter(neotropical_shape, resol = 5, prop = 0.5)
## Create an ID field for it, which will be useful in the future
neotropical_grid$ID <- 1:(length(neotropical_grid))
# Export your resulting polygon grid
writeOGR(neotropical_grid, dsn = paste(getwd(), "/data/shapefiles", sep = ""),
layer = "neotropical_grid_5", driver = "ESRI Shapefile", overwrite_layer = T)
This is our resulting polygon grid, where each cell refers to a site (and a row) in our data.
neotropical_grid <- readOGR("data/shapefiles/neotropical_grid_5.shp")
OGR data source with driver: ESRI Shapefile
Source: "data/shapefiles/neotropical_grid_5.shp", layer: "neotropical_grid_5"
with 63 features
It has 5 fields
plot(neotropical_grid)
Importing environmental variables
In species distribution modeling, predictor variables are typically extracted from raster (grid) type files. Each ‘raster’ should represent a given variable of interest, which can be climatic, soil, terrain, vegetation, land use, and other types variables.
The getData
function from the dismo
package is able to retrieve useful climate, elevation and administrative data.
bioClimVars <- getData("worldclim", var = "bio", res = 10)
bioClimVars
You can also load your predictor variables from any directory within your computer using the raster
(single file) and stack
(loads multiple files) functions:
rasterFiles <- list.files("data/rasters/bio_10m_bil", pattern = "bil$", full.names = TRUE)
rasterFiles
[1] "data/rasters/bio_10m_bil/bio1.bil"
[2] "data/rasters/bio_10m_bil/bio10.bil"
[3] "data/rasters/bio_10m_bil/bio11.bil"
[4] "data/rasters/bio_10m_bil/bio12.bil"
[5] "data/rasters/bio_10m_bil/bio13.bil"
[6] "data/rasters/bio_10m_bil/bio14.bil"
[7] "data/rasters/bio_10m_bil/bio15.bil"
[8] "data/rasters/bio_10m_bil/bio16.bil"
[9] "data/rasters/bio_10m_bil/bio17.bil"
[10] "data/rasters/bio_10m_bil/bio18.bil"
[11] "data/rasters/bio_10m_bil/bio19.bil"
[12] "data/rasters/bio_10m_bil/bio2.bil"
[13] "data/rasters/bio_10m_bil/bio3.bil"
[14] "data/rasters/bio_10m_bil/bio4.bil"
[15] "data/rasters/bio_10m_bil/bio5.bil"
[16] "data/rasters/bio_10m_bil/bio6.bil"
[17] "data/rasters/bio_10m_bil/bio7.bil"
[18] "data/rasters/bio_10m_bil/bio8.bil"
[19] "data/rasters/bio_10m_bil/bio9.bil"
# Stack the bioclimatic variables
bioClimVars <- stack(rasterFiles)
bioClimVars
class : RasterStack
dimensions : 900, 2160, 1944000, 19 (nrow, ncol, ncell, nlayers)
resolution : 0.1666667, 0.1666667 (x, y)
extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs
names : bio1, bio10, bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19, bio2, bio3, bio4, bio5, ...
min values : -269, -97, -488, 0, 0, 0, 0, 0, 0, 0, 0, 9, 8, 72, -59, ...
max values : 314, 380, 289, 9916, 2088, 652, 261, 5043, 2159, 4001, 3985, 211, 95, 22673, 489, ...
You can also plot your variables to take a look at them:
plot(bioClimVars/10) # WorldClim data is multiplied by 10. Let's take a look at the real information.
Variable extraction
Before extracting variables, it is really important that your rasters have the same projection (i.e. type of coordinate reference system) as your other spatial files. See ?CRS
for help in finding CRS definitions.
crs.geo <- CRS("+proj=longlat +ellps=WGS84 + datum=WGS84") # geographical, datum WGS84
neotropical_grid <- spTransform(neotropical_grid, crs.geo) # define projection system of grid data
proj4string(bioClimVars) <- crs.geo # and of our bioclimatic rasters
Now, we can extract the predictor values for each cell of our grid!
myExpl <- data.frame(bio_1 = numeric(length(neotropical_grid)), bio_2 = numeric(length(neotropical_grid)),
bio_3 = numeric(length(neotropical_grid)))
# bio_1_ext is a list of cells that contains all values for each cell, for
# each predictor
bio_1_ext <- raster::extract(bioClimVars$bio1, neotropical_grid)
bio_2_ext <- raster::extract(bioClimVars$bio2, neotropical_grid)
bio_3_ext <- raster::extract(bioClimVars$bio3, neotropical_grid)
myExpl$bio_1 <- unlist(lapply(bio_1_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
myExpl$bio_2 <- unlist(lapply(bio_2_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
myExpl$bio_3 <- unlist(lapply(bio_1_ext, function(x) if (!is.null(x)) mean(x,
na.rm = TRUE) else NA))
write.table(myExpl, "data/matrices/NT_EnvVar_5.txt")
# Let's take a look at our table:
head(myExpl)
bio_1 bio_2 bio_3
1 215.9267 138.49084 215.9267
2 236.4301 114.60676 236.4301
3 241.8685 99.48434 241.8685
4 239.1822 91.49031 239.1822
5 235.9922 99.52013 235.9922
6 269.0044 101.91778 269.0044
Importing species data
Using expert drawn maps
One way of acquiring species distribution information is from expert-drawn maps, such as the ones from the IUCN Red List for Threatened Species database. These maps Tare usually done used multiple methods: only on occurrence data (e.g. a polygon around known occurrences), whereas others incorporate multiple degrees of knowledge, such as habitat requirements or elevational limits of the species, in essence using an informal distribution modelling approach.
Richness maps derived from expert-drawn range maps may overestimate local richness (‘error of commission’), in relation to point-to-grid richness maps. This because they are generally drawn to include all areas where a species is known to occur without excluding areas in between where the species may not exist. They tend to map the ‘extent of occurrence’ of species that includes the, perhaps much smaller, âarea of occupancyâ (Loislle et al., 2003; Habib et al., 2004; Hurlbert & White, 2005; Graham & Hijmans, 2006).
We advise caution when using them.
# Load your species map file
speciesGeoDist <- readShapePoly("data/shapefiles/NT_TERRESTRIAL_MAMMALS_subset.shp",
proj4string = crs.geo)
plot(speciesGeoDist)
Create a presence-absence matrix of species’ geographic ranges within your polygon grid shapefile. For this, we will use the function lets.presab.grid
from the letsR
package.
crs.geo <- CRS("+proj=longlat +ellps=WGS84 + datum=WGS84") # geographical, datum WGS84
proj4string(neotropical_grid) <- crs.geo # define projection system of grid data
proj4string(speciesGeoDist) <- crs.geo # and of our bioclimatic rasters
abspres.matrix <- lets.presab.grid(shapes = speciesGeoDist, grid = neotropical_grid,
sample.unit = "ID", remove.sp = TRUE, presence = NULL, origin = NULL, seasonal = NULL)
# Export your presence-absence matrix
write.table(abspres.matrix$PAM, "data/matrices/NT_PAM_5.txt")
You can now visualize a map of your species richness within your polygon grid:
richnessCalc <- rowSums(abspres.matrix$PAM) + 1
colPalette <- colorRampPalette(c("#fff5f0", "#fb6a4a", "#67000d"))
colors <- c("white", colPalette(max(richnessCalc)))
plot(abspres.matrix$grid, border = "gray40", col = colors[richnessCalc])
map(add = TRUE)
We also need the coordinates of our centroids to proceed with the data analysis:
# Extract coordinates from the cell centroids
myRespCoord <- as.data.frame(coordinates(abspres.matrix$grid))
colnames(myRespCoord) <- c("Longitude_X", "Latitude_Y")
# Export it for the future
write.table(myRespCoord, "data/matrices/NT_coords_5.txt")
Using occurrence records
We can also use occurrence records available online, or collected in the field. Here, we will use information from the Global Biodiversity Information Facility and import it using the function gbif
:
D_youngi.GBIF <- gbif("Diaemus", "youngi")
# get data frame with spatial coordinates (points)
Dyoungi.Occ <- subset(D_youngi.GBIF, select = c("country", "lat", "lon"))
head(Dyoungi.Occ) # a simple data frame with coordinates
country lat lon
1 Nicaragua 11.11199 -85.76016
2 Trinidad and Tobago 10.54603 -61.48618
3 Suriname 1.99449 -56.09211
4 Suriname 1.99000 -56.09000
5 Suriname 1.99400 -56.09200
6 Suriname NA NA
Issues with occurrence points
Occurrence data from museum and herbarium collections are very valuable for mapping biodiversity patterns, and for species distribution modelling. But these datasets contain a lot of errors and are susceptible to many issues relating data quality. Many of these errors are extensively described in “Biogeo: an R package for assessing and improving data quality of occurrence record datasets”, from Robertson et al., 2016
Let’s take a look at a few of them:
Error | Probable.Reason |
---|---|
Points plotted at zero degrees latitude and longitude. | No coordinates were available in the original dataset but values of zero assigned to the coordinates |
Points in sea for terrestrial species or on land for aquatic species, obvious geographical outliers. | Transposed latitude and longitude coordinates; incorrect sign on the decimal degrees of the latitude or longitude coordinate; degrees and minutes were transposed before the coordinate was converted to decimal degrees; imprecise locality description used to assign coordinates; the specimen was incorrectly identified by the collector or the incorrect name was applied to the species when the data were digitized. |
Point in sea but close to coast for terrestrial species, or on land but close to coast for marine species. | Low precision coordinates e.g. only degrees were available or the data were originally collected on a coarse-scale grid. Imprecise locality description used to assign coordinates. |
Point plotted along the prime meridian or equator. | Missing coordinate for latitude or for longitude that was incorrectly assigned a value of zero. |
Country name given in the record does not correspond with country where point is plotted. | Likely to be the same errors as for b) above. |
Elevation given in the record does not correspond with elevation obtained from a digital elevation model where point is plotted. | Likely to be the same errors as for b) above, or the spatial resolution of the digital elevation model is too coarse. |
One way of working around these issues is by using the package biogeo
, which has many functions for quality assessment and coordinate conversion.
Data cleaning
Let’s take a look at our species matrix. What is going when we look at the sum of the columns (i.e. number of presences)?
# Load our species data
DataSpecies <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)
colSums(DataSpecies)
Panthera.onca Pecari.tajacu
50 55
Peromyscus.leucopus Phyllostomus.latifolius
3 10
Philander.opossum Phyllomys.nigrispinus
39 3
Peromyscus.mexicanus Phyllomys.dasythrix
3 2
Oxymycterus.amazonicus Oxymycterus.hispidus
8 5
Peropteryx.macrotis Pattonomys.carrikeri
47 1
Oxymycterus.josei Peropteryx.pallidoptera
1 5
Oxymycterus.hiska Peromyscus.aztecus
3 3
Peromyscus.beatae Philander.deltae
3 1
Phylloderma.stenops Peromyscus.furvus
43 1
Oxymycterus.rufus Peromyscus.grandis
7 2
Phaenomys.ferrugineus Peropteryx.kappleri
1 44
Phyllomys.blainvillii Phyllomys.brasiliensis
3 2
Peromyscus.levipes Philander.mondolfii
1 5
Philander.frenatus Phyllomys.mantiqueirensis
7 1
Phyllostomus.elongatus Oxymycterus.nasutus
41 5
Peromyscus.hylocetes Oxymycterus.hucucha
1 2
Oxymycterus.roberti Oxymycterus.dasytrichus
7 3
Oxymycterus.angularis Phyllomys.medius
3 3
Peromyscus.megalops Pattonomys.occasius
2 3
Phyllostomus.hastatus Phyllomys.kerri
44 1
Oxymycterus.quaestor Peropteryx.leucoptera
9 28
Phyllomys.thomasi Peromyscus.yucatanicus
1 2
Ozotoceros.bezoarticus Peromyscus.gratus
21 1
Philander.andersoni Pearsonomys.annectens
12 2
Oxymycterus.delator Philander.mcilhennyi
5 4
Peromyscus.maniculatus Peromyscus.gymnotis
1 1
Perognathus.flavus Oxymycterus.paramensis
1 7
Oxymycterus.inca Phyllostomus.discolor
5 46
Peromyscus.difficilis Peromyscus.melanotis
1 1
Phyllomys.lamarum Pattonomys.semivillosus
3 1
Phyllomys.lundi Peromyscus.stirtoni
1 2
Phyllomys.pattoni Phyllomys.sulinus
4 3
Peromyscus.melanophrys Oxymycterus.caparoae
2 2
We can work on that by removing species that occur in less than four sites.
# Remove all columns that have less than 4 presences
to.remove <- colnames(DataSpecies)[colSums(DataSpecies) <= 4]
`%ni%` <- Negate(`%in%`)
DataSpecies <- subset(DataSpecies, select = names(DataSpecies) %ni% to.remove)
Sometimes, we also have problems with the separators between species names. We can use the gsub()
function to solve this type of issue.
# Replace '.' per '.'
names(DataSpecies) <- gsub(x = names(DataSpecies), pattern = "\\.", replacement = ".")
write.table(DataSpecies, "data/matrices/NT_PAM_5.txt")
Species distribution modelling
Initialising and formatting direct input data
First, input data needs to be rearranged for usage in biomod2 using BIOMOD_FormatingData()
. Its most important arguments are:
resp.var
contains species data (for a single species) in binary format (ones for presences, zeros for true absences and NA for indeterminated) that will be used to build the species distribution models. It may be a vector or a spatial points object.expl.var
contains a matrix, data.frame, SpatialPointsDataFrame or RasterStack containing your explanatory variables.resp.xy
two columns matrix containing the X and Y coordinates ofresp.var
(only consider if resp.var is a vector)resp.name
contains the response variable name (species).
# Recall your variables again:
## Load prepared species data
myResp <- read.table("data/matrices/NT_PAM_5.txt", header = TRUE)
## Load environmental variables
myExpl <- read.table("data/matrices/NT_EnvVar_5.txt", header = TRUE)
# Define the species of interest
myRespName <- names(myResp)[1]
# Load coordinates
myRespCoord <- read.table("data/matrices/NT_coords_5.txt", header = TRUE)
myBiomodData <- BIOMOD_FormatingData(resp.var = as.data.frame(myResp[,myRespName]),
expl.var = myExpl,
resp.xy = as.data.frame(myRespCoord),
resp.name = myRespName,
#PA.nb.rep = 2,
#PA.nb.absences = 200,
#PA.strategy = 'random',
na.rm = TRUE)
-=-=-=-=-=-=-=-=-=-=-= Panthera.onca Data Formating -=-=-=-=-=-=-=-=-=-=-=
> No pseudo absences selection !
! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Now, you may check if your data is correctly formatted by printing and plotting the myBiomodData
object.
myBiomodData
-=-=-=-=-=-=-=-=-=-=-=-=-= 'BIOMOD.formated.data' -=-=-=-=-=-=-=-=-=-=-=-=-=
sp.name = Panthera.onca
50 presences, 13 true absences and 0 undifined points in dataset
3 explanatory variables
bio_1 bio_2 bio_3
Min. : 54.06 Min. : 80.49 Min. : 54.06
1st Qu.:182.15 1st Qu.: 99.89 1st Qu.:182.15
Median :236.43 Median :117.75 Median :236.43
Mean :211.69 Mean :115.79 Mean :211.69
3rd Qu.:256.98 3rd Qu.:128.26 3rd Qu.:256.98
Max. :269.00 Max. :162.52 Max. :269.00
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
plot(myBiomodData)
Building models
Defining algorithm parameters using default options
We are ready for running our set of models on our species. This step is the core of the modeling procedure.
First, we define the of different options for each modeling technique using the function BIOMOD_ModelingOptions()
. If we do not change its parameters, it will take the default values (run Print_Default_ModelingOptions()
to see them) defined within biomod2
.
For example, you could change the parameters of the GLM function to compute quadratic a GLM and select best model with ‘BIC’ criterium:
myBiomodOptions <- BIOMOD_ModelingOptions(GLM = list(type = "quadratic", interaction.level = 0,
myFormula = NULL, test = "BIC", family = "binomial", control = glm.control(epsilon = 1e-08,
maxit = 1000, trace = FALSE)))
You could also set your own GLM formula.
# you can prefer to establish your own GLM formula
myBiomodOptions <- BIOMOD_ModelingOptions(
GLM = list(myFormula = formula(Panthera.onca ~ bio3 + log(bio1) + poly(bio2,2) + bio4 + bio1:bio4))
But, for the sake of simplicity, we will keep all options default.
myBiomodOption <- BIOMOD_ModelingOptions()
# The created object is then given to BIOMOD_Modeling in the next step.
Computing models
Now, we proceed to running the set of models on our species. We may choose between 12 difeerent algorithms (‘GLM’, ‘GBM’, ‘GAM’, ‘CTA’, ‘ANN’, ‘SRE’, ‘FDA’, ‘MARS’, ‘RF’, ‘MAXENT.Phillips’ and ‘MAXENT.Tsuruoka’).
As we do not have evaluation data, we will make 3-fold cross-validation (number controlled by NbRunEval
argument) of our models by randomly splitting our data set into 2 subsets DataSplit
.
Take a look below:
myBiomodModelOut <- BIOMOD_Modeling(
# BIOMOD_FormatingData()-class object
data = myBiomodData,
# The set of models to be calibrated on the data
models = c('GBM', 'GLM', 'MAXENT.Tsuruoka', 'RF', 'MAXENT.Phillips'),
# BIOMOD.models.options object
models.options = myBiomodOption,
# Number of evaluation runs
NbRunEval=3,
# % of data used to calibrate the models, the remaining part will be used for testing
DataSplit=70,
# If this argument is kept to NULL, each observation (presence or absence) has the same weight (independent of the number of presences and absences).
Prevalence=0.5,
# Number of permutations to estimate variable importance
VarImport=5,
# Types of evaluations methods to be used. The complete list is within ?BIOMOD_Modeling()
models.eval.meth = c('TSS','ROC'),
# Keep all results and outputs on hard drive or not. Keep it always TRUE
SaveObj = TRUE,
# If true, all model predictions will be scaled with a binomial GLM
rescal.all.models = FALSE,
# If true, models calibrated and evaluated with the whole dataset are done.
# Building models with all information available may be usefull in some particular
# cases (i.e. rare species with few presences points).
do.full.models = FALSE,
# character, the ID (=name) of modeling procedure. A random number by default
modeling.id = paste(myRespName,"CurrentClim",sep="")) # We call it 'CurrentClim' because the bioclimatic variables we are working with are from the present climate.
Loading required library...
Checking Models arguments...
! java software seems not be corectly installed
> MAXENT.Phillips modelling was switched off!
Creating suitable Workdir...
> Automatic weights creation to rise a 0.5 prevalence
-=-=-=-=-=-=-=-=-=-=-= Panthera.onca Modeling Summary -=-=-=-=-=-=-=-=-=-=-=
3 environmental variables ( bio_1 bio_2 bio_3 )
Number of evaluation repetitions : 3
Models selected : GBM GLM MAXENT.Tsuruoka RF
Total number of model runs : 12
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-=-=-=- Run : Panthera.onca_AllData
-=-=-=--=-=-=- Panthera.onca_AllData_RUN1
Model=Generalised Boosting Regression
2500 maximum different trees and 3 Fold Cross-Validation
Evaluating Model stuff...
Evaluating Predictor Contributions...
Model=GLM ( quadratic with no interaction )
Stepwise procedure using AIC criteria
selected formula : Panthera.onca ~ I(bio_1^2) + I(bio_2^2) + bio_1
<environment: 0x0000000020f37ca0>
Evaluating Model stuff...
Evaluating Predictor Contributions...
Model evaluation
Extracting evaluation scores
We are able now to capture our model evaluations using the get_evaluations
function. In this tutorial, we have focused on two evaluations methods: Receiving Operator Curves and True Skill Statistic.
# get all models evaluation
myBiomodModelEval <- get_evaluations(myBiomodModelOut) #contains model evaluation scores
# print the dimnames of this object
dimnames(myBiomodModelEval)
[[1]]
[1] "TSS" "ROC"
[[2]]
[1] "Testing.data" "Cutoff" "Sensitivity" "Specificity"
[[3]]
[1] "GBM" "GLM" "MAXENT.Tsuruoka" "RF"
[[4]]
[1] "RUN1" "RUN2" "RUN3"
[[5]]
Panthera.onca_AllData
"AllData"
hist(myBiomodModelEval)
write.table(myBiomodModelEval, file = "data/results/EvalAll_Model_Eval")
Plotting evaluation scores
We can also plot evaluation scores for sensitivity and specifity for all models.
# print the Sensitivity of all models --> TSS and ROC scores for SENSITIVITY
mySensitivity_All_Models <- myBiomodModelEval[, "Sensitivity", , , ]
mySensitivity_All_Models
mySensAll <- as.data.frame(t(as.data.frame(mySensitivity_All_Models)))
MySensHist.TSS <- ggplot(mySensAll, aes(x = TSS)) + geom_histogram(aes(y = ..density..),
binwidth = density(mySensAll$TSS)$bw) + geom_density(fill = "red", alpha = 0.2)
MySensHist.ROC <- ggplot(mySensAll, aes(x = ROC)) + geom_histogram(aes(y = ..density..),
binwidth = density(mySensAll$ROC)$bw) + geom_density(fill = "red", alpha = 0.2)
MySensHist.TSS
MySensHist.ROC
# save results
write.table(mySensitivity_All_Models, file = "data/results/Sensitivity_All_Models")
, , RUN1
GBM GLM MAXENT.Tsuruoka RF
TSS 100 100 100 100
ROC 100 100 100 100
, , RUN2
GBM GLM MAXENT.Tsuruoka RF
TSS 75 100 100 75
ROC 75 100 100 75
, , RUN3
GBM GLM MAXENT.Tsuruoka RF
TSS 100 100 100 75
ROC 100 100 100 75
# Proportion of PRESENCES correctly predicted (true positive rate)
sensitivity <- read.table("data/results/Sensitivity_All_Models")
summary(sensitivity)
GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1
Min. :86.67 Min. :80 Min. :100 Min. :73.33
1st Qu.:86.67 1st Qu.:80 1st Qu.:100 1st Qu.:73.33
Median :86.67 Median :80 Median :100 Median :73.33
Mean :86.67 Mean :80 Mean :100 Mean :73.33
3rd Qu.:86.67 3rd Qu.:80 3rd Qu.:100 3rd Qu.:73.33
Max. :86.67 Max. :80 Max. :100 Max. :73.33
GBM.RUN2 GLM.RUN2 MAXENT.Tsuruoka.RUN2 RF.RUN2
Min. :86.67 Min. :86.67 Min. :86.67 Min. :80
1st Qu.:86.67 1st Qu.:86.67 1st Qu.:86.67 1st Qu.:80
Median :86.67 Median :86.67 Median :86.67 Median :80
Mean :86.67 Mean :86.67 Mean :86.67 Mean :80
3rd Qu.:86.67 3rd Qu.:86.67 3rd Qu.:86.67 3rd Qu.:80
Max. :86.67 Max. :86.67 Max. :86.67 Max. :80
GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3 RF.RUN3
Min. :80 Min. :80 Min. :80 Min. :80
1st Qu.:80 1st Qu.:80 1st Qu.:80 1st Qu.:80
Median :80 Median :80 Median :80 Median :80
Mean :80 Mean :80 Mean :80 Mean :80
3rd Qu.:80 3rd Qu.:80 3rd Qu.:80 3rd Qu.:80
Max. :80 Max. :80 Max. :80 Max. :80
str(sensitivity)
'data.frame': 2 obs. of 12 variables:
$ GBM.RUN1 : num 86.7 86.7
$ GLM.RUN1 : int 80 80
$ MAXENT.Tsuruoka.RUN1: int 100 100
$ RF.RUN1 : num 73.3 73.3
$ GBM.RUN2 : num 86.7 86.7
$ GLM.RUN2 : num 86.7 86.7
$ MAXENT.Tsuruoka.RUN2: num 86.7 86.7
$ RF.RUN2 : int 80 80
$ GBM.RUN3 : int 80 80
$ GLM.RUN3 : int 80 80
$ MAXENT.Tsuruoka.RUN3: int 80 80
$ RF.RUN3 : int 80 80
sensitivity #table of sensitivity by TSS or ROC per run per model
GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS 86.667 80 100 73.333 86.667 86.667
ROC 86.667 80 100 73.333 86.667 86.667
MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS 86.667 80 80 80 80
ROC 86.667 80 80 80 80
RF.RUN3
TSS 80
ROC 80
sensTSS.plot <- ggplot(sensData, aes(x = model1, y = TSS, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using TSS scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "TSS (%)")
sensTSS.plot <- ggplotly(sensTSS.plot)
sensROC.plot <- ggplot(sensData, aes(x = model2, y = ROC, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Sensitivity using ROC scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "ROC (%)")
sensROC.plot <- ggplotly(sensROC.plot)
sensTSS.plot
sensROC.plot
Discussion: How do the models compare by their sensitivity scores?
##### plot SPECIFICITY #####
# Proportion of ABSENCES correctly predicted (true negative rate)
specificity <- read.table("data/results/Specificity_All_Models")
summary(specificity)
GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1
Min. :100 Min. :100 Min. :100 Min. :100
1st Qu.:100 1st Qu.:100 1st Qu.:100 1st Qu.:100
Median :100 Median :100 Median :100 Median :100
Mean :100 Mean :100 Mean :100 Mean :100
3rd Qu.:100 3rd Qu.:100 3rd Qu.:100 3rd Qu.:100
Max. :100 Max. :100 Max. :100 Max. :100
GBM.RUN2 GLM.RUN2 MAXENT.Tsuruoka.RUN2 RF.RUN2
Min. :75 Min. :100 Min. :100 Min. :75
1st Qu.:75 1st Qu.:100 1st Qu.:100 1st Qu.:75
Median :75 Median :100 Median :100 Median :75
Mean :75 Mean :100 Mean :100 Mean :75
3rd Qu.:75 3rd Qu.:100 3rd Qu.:100 3rd Qu.:75
Max. :75 Max. :100 Max. :100 Max. :75
GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3 RF.RUN3
Min. :100 Min. :100 Min. :100 Min. :75
1st Qu.:100 1st Qu.:100 1st Qu.:100 1st Qu.:75
Median :100 Median :100 Median :100 Median :75
Mean :100 Mean :100 Mean :100 Mean :75
3rd Qu.:100 3rd Qu.:100 3rd Qu.:100 3rd Qu.:75
Max. :100 Max. :100 Max. :100 Max. :75
str(specificity)
'data.frame': 2 obs. of 12 variables:
$ GBM.RUN1 : int 100 100
$ GLM.RUN1 : int 100 100
$ MAXENT.Tsuruoka.RUN1: int 100 100
$ RF.RUN1 : int 100 100
$ GBM.RUN2 : int 75 75
$ GLM.RUN2 : int 100 100
$ MAXENT.Tsuruoka.RUN2: int 100 100
$ RF.RUN2 : int 75 75
$ GBM.RUN3 : int 100 100
$ GLM.RUN3 : int 100 100
$ MAXENT.Tsuruoka.RUN3: int 100 100
$ RF.RUN3 : int 75 75
specificity #table of specificity by TSS or ROC per run per model
GBM.RUN1 GLM.RUN1 MAXENT.Tsuruoka.RUN1 RF.RUN1 GBM.RUN2 GLM.RUN2
TSS 100 100 100 100 75 100
ROC 100 100 100 100 75 100
MAXENT.Tsuruoka.RUN2 RF.RUN2 GBM.RUN3 GLM.RUN3 MAXENT.Tsuruoka.RUN3
TSS 100 75 100 100 100
ROC 100 75 100 100 100
RF.RUN3
TSS 75
ROC 75
specTSS <- specificity[1, ]
specTran <- t(specTSS)
modelNames <- rownames(specTran)
model1 <- substr(modelNames, 1, 3)
specPlot1 <- data.frame(specTran, model1)
specROC <- specificity[2, ]
specTran <- t(specROC)
modelNames <- rownames(specTran)
model2 <- substr(modelNames, 1, 3)
specPlot2 <- data.frame(specTran, model2)
specData <- cbind(specPlot1, specPlot2)
Let’s take a look at our plots for specificity:
specTSS.plot <- ggplot(specData, aes(x = model1, y = TSS, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using TSS scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "TSS (%)")
specROC.plot <- ggplot(specData, aes(x = model2, y = ROC, fill = model1)) +
geom_boxplot() + guides(fill = FALSE) + ggtitle("Specificity using ROC scores") +
theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1),
panel.background = element_rect(fill = NA)) + labs(x = "Model Type",
y = "ROC (%)")
# Make them interactive!
specTSS.plot <- ggplotly(specTSS.plot)
specROC.plot <- ggplotly(specROC.plot)
# Plot them
specTSS.plot
specROC.plot
We can also print the variable importance for all models
# print variable importances
get_variables_importance(myBiomodModelOut)
, , RUN1, AllData
GBM GLM MAXENT.Tsuruoka RF
bio_1 0.922 0.919 0.238 0.297
bio_2 0.072 0.472 0.054 0.090
bio_3 0.000 0.000 0.279 0.337
, , RUN2, AllData
GBM GLM MAXENT.Tsuruoka RF
bio_1 0.929 0.961 0.238 0.307
bio_2 0.059 0.072 0.065 0.035
bio_3 0.000 0.000 0.283 0.234
, , RUN3, AllData
GBM GLM MAXENT.Tsuruoka RF
bio_1 0.908 0.895 0.244 0.233
bio_2 0.029 0.000 0.070 0.028
bio_3 0.000 0.000 0.295 0.274
This is another common way of saving these variables in disk for later use.
### save models evaluation scores and variables importance on hard drive
capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_eval.txt", sep = "")))
capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_var_imp.txt", sep = "")))
Finally, we can also see a few response plots
# 4. Plot response curves
# 4.1 Load the models for which we want to extract the predicted response
# curves
myGLMs <- BIOMOD_LoadModels(myBiomodModelOut, models = c("GLM"))
# 4.2 plot 2D response plots
myRespPlot2D <- response.plot2(models = myGLMs, Data = get_formal_data(myBiomodModelOut,
"expl.var"), show.variables = get_formal_data(myBiomodModelOut, "expl.var.names"),
do.bivariate = FALSE, fixed.var.metric = "median", col = c("blue", "red"),
legend = TRUE, data_species = get_formal_data(myBiomodModelOut, "resp.var"))
# 4.2 plot 3D response plots here only for a lone model
myRespPlot3D <- response.plot2(models = myGLMs[1], Data = get_formal_data(myBiomodModelOut,
"expl.var"), show.variables = get_formal_data(myBiomodModelOut, "expl.var.names"),
do.bivariate = TRUE, fixed.var.metric = "median", data_species = get_formal_data(myBiomodModelOut,
"resp.var"), display_title = FALSE)
### all the values used to produce this plot are stored into the returned
### object you can redo plots by yourself and customised them
Projecting your outputs
### Make projections on current variable
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl,
xy.new.env = myRespCoord, proj.name = "current", selected.models = "all",
binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")
-=-=-=-=-=-=-=-=-=-=-=-=-= Do Models Projections -=-=-=-=-=-=-=-=-=-=-=-=-=
> Building clamping mask
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
*** in setMethod('BinaryTransformation', signature(data='numeric')
*** in setMethod('BinaryTransformation', signature(data='data.frame')
> Projecting Panthera.onca_AllData_RUN1_GBM ...
> Projecting Panthera.onca_AllData_RUN1_GLM ...
myCurrentProj <- get_predictions(myBiomodProj)
myCurrentProj
, , RUN1, AllData
GBM GLM MAXENT.Tsuruoka RF
[1,] 921 850 853 930
[2,] 861 869 935 998
[3,] 753 809 956 674
[4,] 650 670 961 894
[5,] 774 715 951 936
[6,] 947 987 974 1000
[7,] 931 907 965 1000
[8,] 649 212 926 912
[9,] 899 932 973 992
[10,] 947 970 969 1000
[11,] 946 967 969 1000
[12,] 899 962 974 992
[13,] 897 908 974 852
[14,] 786 559 941 936
[15,] 899 953 973 992
[16,] 899 932 973 992
[17,] 899 968 979 852
[18,] 899 974 978 992
[ reached getOption("max.print") -- omitted 45 row(s) and 2 matrix slice(s) ]
# Plot your predictions like this
plot(myBiomodProj)
Ensemble of forecasting
One powerful feature from biomod2
comes with BIOMOD_EnsembleModeling
, which combines individual models to build an ensemble of forecast. In the following example, we exclude all models having a TSS score lower than 0.5.
### Building ensemble-models
myBiomodEM <- BIOMOD_EnsembleModeling(
# a "BIOMOD.models.out" returned by BIOMOD_Modeling
modeling.output = myBiomodModelOut,
# a character vector (either 'all' or a sub-selection of model names) that defines the models kept for building the ensemble model
chosen.models = 'all',
# The value chosen for this parameter will control the number of ensemble models built.
# 'all' means a total consensus model will be built
em.by='all',
# Evaluation metric used to build ensemble models
eval.metric = 'all',
# If not NULL, the minimum scores below which models will be excluded of the ensemble-models building
eval.metric.quality.threshold = c(0.7, 0.5),
# Estimates the ... across predictions
prob.mean = T,
prob.cv = T,
prob.ci = T,
prob.ci.alpha = 0.05,
prob.median = T,
committee.averaging = T,
prob.mean.weight = T,
prob.mean.weight.decay = 'proportional')
-=-=-=-=-=-=-=-=-=-=-=-=-= Build Ensemble Models -=-=-=-=-=-=-=-=-=-=-=-=-=
! all models available will be included in ensemble.modeling
> Evaluation & Weighting methods summary :
TSS over 0.7
ROC over 0.5
> mergedAlgo_mergedRun_mergedData ensemble modeling
! Models projections for whole zonation required...
> Projecting Panthera.onca_AllData_RUN1_GBM ...
### Make ensemble-models projections on current variable
myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj, binary.meth = "TSS",
total.consensus = TRUE, EM.output = myBiomodEM)
# Plot your results
plot(myBiomodEF)
Multi-species modelling
Now, that we know how to do single species modelling, what changes if we want to model the distribution of n species?
biomod2
has no function that allows multi-species to be modelled. But, we can use the help of other R functions to deal with this! For example, we can create a for
loop that will execute the modelling parts for each species within our species data frame. See below, (but do not execute it yet!)
Another (faster, and perhaps, more elegant) way for doing multi-species SDM using biomod2, is to use a apply family function. Like this one:
# Define your species names
sp.names <- names(myResp)
MyBiomodSF <- function(sp.n) {
myRespName = sp.n
cat("\n", myRespName, "modelling...")
### definition of data i.e keep only the column of our species
myResp <- as.numeric(myResp[, myRespName])
myRespCoord = myRespCoord
### Initialisation
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl,
resp.xy = myRespCoord, resp.name = myRespName)
### Options definition
myBiomodOption <- BIOMOD_ModelingOptions()
### Modelling
myBiomodModelOut <- BIOMOD_Modeling(myBiomodData, models = c("RF", "MAXENT.Tsuruoka"),
models.options = myBiomodOption, NbRunEval = 3, DataSplit = 80, Prevalence = 0.5,
VarImport = 3, models.eval.meth = c("TSS", "ROC"), SaveObj = TRUE, rescal.all.models = FALSE,
do.full.models = FALSE, modeling.id = paste(myRespName, "CurrentClim",
sep = ""))
### save models evaluation scores and variables importance on hard drive
capture.output(get_evaluations(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_eval.txt", sep = "")))
capture.output(get_variables_importance(myBiomodModelOut), file = file.path(myRespName,
paste(myRespName, "_formal_models_var_imp.txt", sep = "")))
### Building ensemble-models
myBiomodEM <- BIOMOD_EnsembleModeling(modeling.output = myBiomodModelOut,
chosen.models = "all", em.by = "all", eval.metric = c("TSS"), eval.metric.quality.threshold = c(0.5),
prob.mean = T, prob.cv = T, prob.ci = T, prob.ci.alpha = 0.05, prob.median = T,
committee.averaging = T, prob.mean.weight = T, prob.mean.weight.decay = "proportional")
### Make projections on current variable
myBiomodProj <- BIOMOD_Projection(modeling.output = myBiomodModelOut, new.env = myExpl,
xy.new.env = myRespCoord, proj.name = "current", selected.models = "all",
binary.meth = "TSS", compress = TRUE, clamping.mask = F, output.format = ".RData")
### Make ensemble-models projections on current variable
myBiomodEF <- BIOMOD_EnsembleForecasting(projection.output = myBiomodProj,
binary.meth = "TSS", total.consensus = TRUE, EM.output = myBiomodEM)
}
And then you use lapply()
and apply our MyBiomodSF
function to all sp.names
:
sp.names <- names(myResp)
myLapply_SFModelsOut <- lapply(sp.names, MyBiomodSF)
Produce a richness map
### Make ensemble-models projections on current variable load the first speces
### binary maps which will define the mask
alphaMap <- reclassify(subset(myExpl, 1), c(-Inf, Inf, 0))
alphaMap <- get(load(paste(getwd(), "/", myRespName[1], "/proj_current/", "proj_current_",
myRespName[1], "_ensemble_TSSbin.RData", sep = "")))[, 1]
DIY: Create your own SDM!
Hopefully, now you can create your own SDM!
Thank you for coming!