What is this?

This is a short course/tutorial I developed to help researchers cover some of the families of phylogenetic comparative analyses of trait evolution and correlation, diversification rates, as well as community structure. My objective is to provide a short background in statistical and comparative phylogenetic methods, to further allow researchers to explore their research questions on their own.

A big part of the inspiration for this workshop came from the book Phylogenies in Ecology, by Marc W. Cadotte and Jonathan Davies, as well as from courses and workshops on ecophylogenetics taught by Dr. Will Pearse, Dr. Jonathan Davies and Dr. Steven Kembel.

I am progressively adding more material to this document. You can access all topics addressed here in the outline on the right side of this page.

If you would like to provide feedback on this material, do not hesitate to get in touch!

I will highly appreciate your comments!

Kind regards,

Pedro Henrique P. Braga

pedrohbraga.github.io

ph.pereirabraga [at] gmail [dot] com.

Preparation for this tutorial

I kindly ask you to run the following code to download the datasets, and install and load the libraries we will use through this course:

# Check for needed packages, and install the missing ones
required.libraries <- c("ape", "picante", "pez", "phytools",
                        "vegan", "adephylo", "phylobase",
                        "geiger", "mvMORPH", "OUwie", 
                        "hisse", "BAMMtools",
                        "phylosignal", "Biostrings",
                        "devtools",
                        "ggplot2",
                        "fulltext")

needed.libraries <- required.libraries[!(required.libraries %in% installed.packages()[,"Package"])]

if(length(needed.libraries)) install.packages(needed.libraries)

# Load all required libraries at once
lapply(required.libraries, require, character.only = TRUE)

### Install missing libraries

source("https://bioconductor.org/biocLite.R")
biocLite("ggtree")

###

set.seed(1)

# Download files from Marc Cadotte's and Jonathan Davies' book:

temp <- tempfile()
download.file("http://www.utsc.utoronto.ca/~mcadotte/ewExternalFiles/resources.zip", temp)
dir.create("data/Jasper")
JasperData <- c("jasper_tree.phy", "jasper_data.csv")
lapply(JasperData, 
       FUN = function(x) 
         unzip(zipfile = temp, 
               files = paste("resources/data/Jasper/", x, sep = ""), 
               exdir = "data/Jasper"))

# Download the tree files from this paper

download.file("https://onlinelibrary.wiley.com/action/
downloadSupplement?doi=10.1111%2Fj.1461-0248.2009.01307.x&attachmentId=179085359", 
              destfile = "data/ele_1307_sm_sa1.tre")

# If the download of the above file does not work, 
# download the first supplementary material from the following paper
# and place it within your 'data' directory:

# https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1461-0248.2009.01307.x

Why are phylogenies important?

Since many decades we recognize that many ecolgical patterns are processes are not independent of the evolution of the lineages involved in generating these patterns.

Here, we are taking it seriously what Theodosius Dobzhansky said in 1973: “nothing in biology makes sense except in the light of evolution”!

A brief introduction to phylogenies

Phylogenetic trees (i.e., evolutionary tree or cladogram) are branching diagrams illustrating the evolutionary relationships among taxa.

Phylogenetic trees can be constructed based on species’ genes and/or on traits. Phylogenetic reconstruction through analyses of molecular sequences usually consists of three distinct procedures: (i) sequence alignment; (ii) character coding; and (iii) tree building. There is a range of available options for each of these steps, with no simple objective method for choosing between them.

The root of a phylogenetic tree indicates that an ancestral lineage gave rise to all organisms on the tree. A branch point indicates where two lineages diverged. A lineage that evolved early and remains unbranched is a basal taxon. When two lineages stem from the same branch point, they are sister taxa. A branch with more than two lineages is a polytomy. Extracted from: https://courses.lumenlearning.com/wm-biology1/chapter/reading-structure-of-phylogenetic-trees/

In this course, we will not cover the steps necessary to reconstruct phylogenies. We will thus assume that a phylogenetic relationship is already available for the taxa of interest.

`ape` and the `phylo` object in R

The core of phylogenetics analyses in R is the library ape - an acronym that stands for Analysis of Phylogenetics and Evolution. You can run the following commands to install, load and call help for ape:

# install.packages("ape")

library(ape)

# help(package = ape)

From ape, we depict two important functions to read and load trees into R: read.nexus and read.tree:

read.nexus reads NEXUS formatted trees, which hasis organized into major units known as blocks and usually have the following basic structure:

#NEXUS
[!This is the data file used for my dissertation]
BEGIN TAXA;
    TaxLabels Scarabaeus Drosophila Aranaeus;
END;

BEGIN TREES;
    Translate beetle Scarabaeus, fly Drosophila, spider Aranaeus;
    Tree tree1 = ((1,2),3);
    Tree tree2 = ((beetle,fly),spider);
    Tree tree3 = ((Scarabaeus,Drosophila),Aranaeus);
END;

read.tree reads Newick formatted trees, which commonly have the following format:

(Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147, (P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460)

When phylogenies are created, loaded and handled in ape, their objects are represented as a list of class phylo. Let’s try reading the following text string of a tree (is it in Newick or in NEXUS format?) and see what R tells us when we call it:

tree.primates <- read.tree(text="((((Homo:0.21, Pongo:0.21):0.28, Macaca:0.49):0.13, 
                           Ateles:0.62):0.38, Galago:1.00);")

tree.primates


Phylogenetic tree with 5 tips and 4 internal nodes.

Tip labels:
[1] "Homo"     "Pongo"    "Macaca"   "\nAteles" "Galago"  

Rooted; includes branch lengths.

A class phylo object distributes the information of phylogenetic trees into six main components:

str(tree.primates)

List of 4
 $ edge       : int [1:8, 1:2] 6 7 8 9 9 8 7 6 7 8 ...
 $ edge.length: num [1:8] 0.38 0.13 0.28 0.21 0.21 0.49 0.62 1
 $ Nnode      : int 4
 $ tip.label  : chr [1:5] "Homo" "Pongo" "Macaca" "\nAteles" ...
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"

Since phylo is a list, we can access its elements by using the $ symbol. We can access the matrix edge by running phylo_object$edge, which hasthe beginning and ending node number for all the nodes and tips of a given tree. In summary, this allows us to keep track of the internal and external nodes of the tree:

tree.primates$edge

     [,1] [,2]
[1,]    6    7
[2,]    7    8
[3,]    8    9
[4,]    9    1
[5,]    9    2
[6,]    8    3
[7,]    7    4
[8,]    6    5

The vector tip.label element contains all the labels for all tips in the tree (i.e. the names of your OTUs):

tree.primates$tip.label

[1] "Homo"     "Pongo"    "Macaca"   "\nAteles" "Galago"

There is a little error in the tip labels, right? We can easily rewrite the labels with this:

tree.primates$tip.label <- c("Homo", "Pongo", "Macaca", "Ateles", "Galago")

Another important element is the vector edge.length, which indicates the length of each edge with relation to the root or base of the tree:

tree.primates$edge.length

[1] 0.38 0.13 0.28 0.21 0.21 0.49 0.62 1.00

Finally, Nnode contains the number of internal nodes in the tree:

tree.primates$Nnode

[1] 4

One can see all the important information a phylo object carries from a tree by plotting it with all its elements:

plot(tree.primates, 
     edge.width = 2, 
     label.offset = 0.05, 
     type = "cladogram")

nodelabels()
tiplabels()
add.scale.bar()

There are other classes that store trees in R, such as phylo4 and phylo4d objects. We will talk about them later in the course.

Sometimes, you may be interested in simulating trees. There are various ways of creating trees through simulations in R. Let’s take a look at phytools’s pbtree:

sampleTree <- pbtree(n = 20, nsim = 1)

plot(sampleTree)

Handling and preparing phylogenies

Now that we know the main characteristics of a phylogenetic tree and of a phylo object, let’s dive into more details on the structure of phylogenetic trees.

Dicotomous and polytomous trees

While some phylogenetic analyses are able to handle polytomies (i.e. more than two OTUs originating from the same node in a tree; or an unresolved clade), one may be interesting in solving the polytomies to produce a binary tree.

t1 <- read.tree(text = "((A,B,C),D);") 
# See how A, B and C are within the same round brackets, 
# *i.e.* departing from the same node.

plot(t1, 
     type = "cladogram")

One may check if a tree is binary by using is.binary.tree function and, in case it is FALSE, randomly resolve polytomies using the multi2d function:

is.binary.tree (t1)

[1] FALSE

t2 <- multi2di(t1)

plot(t2, 
     type = "cladogram")

Ultrametric, non-ultrametric and uninformative trees

Ultrametric trees (or approximates of them) can be used to infer both braching and temporal patterns of evolutionary history for a given clade. For a tree to be ultrametric, its matrix has to follow the following strong assumptions:

Each tip has to be uniquely labelled;
Each internal node of the tree diverges towards at least two “children”, i.e. is dicotomous;
Along any path from the root to a tip (or leaf), the numbers labelling internal nodes strictly decrease, i.e. time must strictly increase along the path from the root;
The distance from the root to all tips of the tree is constant for all clades.

In many ultrametric trees, one can assume that all extant species are sampled at the present time, and branch lengths represent time calibrated in millions of years.

This information is obtained when we infer plausible history from data that reflects time since divergence (i.e. reconstruct evolutionary history based on the molecular clock theory).

When a tree violates one of the above assumptions (especially, number 4), a tree is then considered to be non-ultrametric. For instance, a non-ultrametric tree will have branches with different distances towards the tips. While this information reflects different rates of evolution across the phylogeny, many analyses will require an ultrametric tree (i.e. where models assume constant variance and equal means at the tips).

In these cases, one may be interested in “smoothing out” these differences across the evolutionary history to transform a non-ultrametric tree into a ultrametric tree.

One method is to use the mean path length (MPL) that assumes random mutation and a fixed molecular clock to calculate the age of a node as the mean of all the distances from this node to all tips diverging from it. To run this, we use the ape function chronoMPL. MPL sometimes returns negative branch lengths, meaning that it should be used with caution.

Another function we can use is chronos, also from the ape package. chronos uses the penalized maximum likelihood method to estimate divergences times.

Let’s try these methods by first creating a random tree, then tansforming its edge lengths using both chronoMPL and chronos functions, and plotting them side-by-side:

tree.NonUlt <- rtree(7)
tree.Ult.MPL <- chronoMPL(tree.NonUlt)
tree.Ult.S <- chronos(tree.NonUlt)


Setting initial dates...
Fitting in progress... get a first set of estimates
         Penalised log-lik = -9.270834 
Optimising rates... dates... -9.270834 

Done.

par(mfrow=c(1,3))
plot(tree.NonUlt, edge.width = 2,
     cex = 2,
     main = "A) Non-ultrametric",
     cex.main = 2)

plot(tree.Ult.MPL, edge.width = 2,
     cex = 2,
     main = "B) Ultrametric chronoMPL",
     cex.main = 2)

plot(tree.Ult.S, edge.width = 2,
     cex = 2,
     main = "C) Ultrametric chronos",
     cex.main = 2)

You may verify if any tree is ultrametric using is.ultrametric:

is.ultrametric(tree.NonUlt)

[1] FALSE

is.ultrametric(tree.Ult.MPL)

[1] TRUE

is.ultrametric(tree.Ult.S)

[1] TRUE

Sometimes, trees may have no branch lengths available, but only the divergence relationships. We often refer to these trees as being non-informative trees (despite the fact that they can still give us some information about branching events):

tree.NonInf <- rtree(7)
tree.NonInf$edge.length <- NULL

plot(tree.NonInf, edge.width = 2,
     cex = 2,
     main = "D) Uninformative tree",
     cex.main = 2)

Edge length and rate transformations

Very oten, we may also be interested in altering the edge lengths of a given tree for a number of reasons. We may be interested, for example, in chaning the edge distance from molecular distances to time. Alternatively, we may want to change edge lengths to meet specific evolutionary models - let’s say, we may seek to assess whether a given ecological pattern might be explained by recent evolutionary changes [i.e. resembling a Pagel’s $\delta$ time-dependent model, (Pagel 1999)], by branching events (i.e., diversification rates; by changing Pagel’s $\kappa$) or by the phylogenetic structure of a tree (by rescaling a tree using Pagel’s $\lambda$).

We can use the function rescale to perform changes in the branches of a phylogenetic tree.

Pagel’s $\delta$ time-dependent model rescales a tree by raising all node depths to an estimated power greater than 1 ($\delta$). A value greater than 1 means that recent evolution has been relatively fast; and if delta is less than 1, recent evolution has been comparatively slow.

We can try the $\delta$ transformation in a random ultrametric tree:

library(geiger)

tree.Ult1 <- rcoal(25)
tree.Ult1.Delta.01 <- rescale(tree.Ult1, 
                              model = "delta", 0.1) # Setting power of 0.1
tree.Ult1.Delta.1 <- rescale(tree.Ult1, 
                             model = "delta", 1) # with power 1; compare this tree with the original tree
tree.Ult1.Delta.10 <- rescale(tree.Ult1, 
                              model = "delta", 10) # with power 10

# Representing all three trees

par(mfrow = c(1,3), cex.main = 2)

plot(tree.Ult1.Delta.01, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("A) ", delta," = 0.1")))

plot(tree.Ult1.Delta.1, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("B) ", delta," = 1")))

plot(tree.Ult1.Delta.10, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("C) ", delta," = 10")))

Note how $\delta$ = 1 returns the original tree, $\delta$ < 1 extends the branch lengths of the edges, and $\delta$ > 1 shortens the edge lengths of the tree.

Pagel’s $\kappa$ punctuational (speciational) model of trait evolution focus on the number of speciation events between species to assess whether different rates of evolution are associated with branching events. This transformation raises all branch lengths to an estimated power.

tree.Ult1.kappa.01 <- rescale(tree.Ult1, 
                              model = "kappa", 0.1) # Kappa of 0.1
tree.Ult1.kappa.1 <- rescale(tree.Ult1, 
                             model = "kappa", 1) # Kappa of 1
tree.Ult1.kappa.2 <- rescale(tree.Ult1, 
                             model = "kappa", 2) # Kappa of 2

# Represent these trees

par(mfrow = c(1,3), 
    cex.main = 2)

plot(tree.Ult1.kappa.01, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("D) ", kappa," = 0.1")))

plot(tree.Ult1.kappa.1, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("E) ", kappa," = 1")))

plot(tree.Ult1.kappa.2, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("F) ", kappa," = 2")))

Finally, by rescaling a tree using Pagel’s $\lambda$, one may change the structure of a phylogenetic tree to reduce the overall contribution of internal edges relative to the terminal edges of a given tree.

A Pagel’s $\lambda$ of closer to 0 removes such structure from the tree, making it resemble a “star” phylogeny; while the closest $\lambda$ is to 1, the more the rescaled tree resembles the original. Let’s try it:

tree.Ult1.lambda.01 <- rescale(tree.Ult1, model = "lambda", 0.1)
tree.Ult1.lambda.05 <- rescale(tree.Ult1, model = "lambda", 0.5)
tree.Ult1.lambda.1 <- rescale(tree.Ult1, model = "lambda", 1)

# Represent the above trees

par(mfrow = c(1,3), 
    cex.main = 2)

plot(tree.Ult1.lambda.01, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("G) ", lambda," = 0.1")))

plot(tree.Ult1.lambda.05, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("H) ", lambda," = 0.5")))

plot(tree.Ult1.lambda.1, 
     show.tip.label = FALSE,
     edge.width = 2,
     cex = 2,
     main = expression(paste("I) ", lambda," = 1")))

The above examples are our opening door to start understanding three key features of evolutionary change: the tempo (δ), mode (κ) and the phylogenetic signal (λ) of the evolution (of a trait) (Pagel, 1994, 1999).

You can compare all our transformations with the figure below:

Representing trees and other class objects

There are multiple ways for representing trees in R. You can access the help page of any R function we are using by running the function name preceeded by question mark (?), e.g. ?plot.phylo.

treeVertebrates <- read.tree(text = "(((((((Cow, Pig), Whale), 
                             (Bat,(Lemur, Human))), 
                             (Robin, Iguana)), Coelacanth), 
                             Gold fish), Shark);")

par(mfrow = c(2,2))

# Using the plot function, which is based on plot.phylo
plot(treeVertebrates, no.margin = TRUE, 
     edge.width = 2)

# Plotting with roundPhylogram
roundPhylogram(treeVertebrates)

# Removing the root of the tree
plot(unroot(treeVertebrates), 
     type="unrooted", 
     no.margin = TRUE, lab4ut = "axial", 
     edge.width=2)

# Posting a type fan tree
plotTree(treeVertebrates, 
         type="fan", 
         fsize=0.7, 
         lwd=1,
    ftype="i")

One package that allows a fast and easily changeable visualization of phylogenetic trees is the ggtree package, which is based on the popular ggplot2 package. To better work with ggtree, you should have a basic familiarity with the Grammar of Graphics and with ggplot2. I recommend the QCBS R Workshop material on ggplot2, which helped me acquire a general knowledge on this package.

You can access the documentation for ggtree here.

We can use ggtree, for example, to attribute colors to different taxonomic groups (in this case, genera of bats).

library(ggtree)

data(chiroptera, package="ape")

# Separate the genus names from species names
groupInfo <- split(chiroptera$tip.label, 
                   gsub("_\\w+", "", chiroptera$tip.label))

head(groupInfo)

$Acerodon
[1] "Acerodon_celebensis" "Acerodon_humilis"    "Acerodon_jubatus"   
[4] "Acerodon_leucotis"   "Acerodon_mackloti"  

$Aethalops
[1] "Aethalops_alecto"

$Alionycteris
[1] "Alionycteris_paucidentata"

$Ametrida
[1] "Ametrida_centurio"

$Amorphochilus
[1] "Amorphochilus_schnablii"

$Anoura
[1] "Anoura_caudifer"  "Anoura_cultrata"  "Anoura_geoffroyi"
[4] "Anoura_latidens"

# Group species accordingly to their genera 
chiroptera <- groupOTU(chiroptera, 
                       groupInfo)

# Plot tree using ggtree
ggtree(chiroptera, 
       aes(color = group), 
       layout='circular') + 
  geom_tiplab(size = 1, 
              aes(angle = angle))

The `picante` package

One of the core packages for hypothesis testing in ecophylogenetics is picante (Kembel et a. 2010). picante mainly works with three types of objects: the phylogeny data as a phylo class object; the community presence-absence (binary) or abundance matrix; and a species-trait matrix.

The `pez` package

pez is a package that contains many novel and wrappers functions for phylogenetic analyses in community data (Pearse et al. 2015). It has useful cuntions that will help us keep the different datasets we will be working on matching.

Other classes of objects

phylo4d and phylo4 are recent classes developed based on the phylo class that seeks to standardize and unify the representation of comparative data and phylogenetic trees in R.

Exercises

We are short in time! I will add a few exercises here later.

Community Phylogenetic Patterns

Ecologists share the goal of understanding and describing the processes that create the uneven distribution patterns of species, which create the variation in abundance, identity and diversity we know today.

Among many factors that determine the presence of taxa in communities are the abiotic and biotic conditions of such communities (e.g., Diamond 1975; Westoby and Wright 2006; but see Hubbell 2001).

Since taxa that recently diverged tend to be ecologically similar (Darwin 1859; Lord et al. 1995; Wiens and Graham 2005), a direct link may exist between the evolutionary relatedness of organisms in a community, the characters they possess, and the ecological processes that determine their distribution and abundance.

In fact, many communities exhibit nonrandom patterns of evolutionary relatedness among co-occurring species (reviewed in Webb et al. 2002): a phenomenon we refer to as phylogenetic community structure.

Preparing data for community phylogenetic analyses

For this analysis, we will use the data on the Jasper Ridge plant communities available in the book “Phylogenies in Ecology”, written by Marc Cadotte and Jonathan Davies (2016).

JasperPlants.tree <- read.tree("data/Jasper/resources/data/Jasper/jasper_tree.phy")

JasperPlants.comm <- read.csv("data/Jasper/resources/data/Jasper/jasper_data.csv", 
                   row.names = 1)

Before proceeding to our analyses, we have to make sure that both the community and phylogenetic data match. For this, the names of the tips should match the species names in the dataset. This also means that species present in one dataset should not be absent in the other.

(You thought you were done preparing your dataset, eh?)

For this, we can use the function drop.tip from the ape package:

JasperPlants.cleanTree <- drop.tip(phy = JasperPlants.tree, 
                              tip = setdiff(JasperPlants.tree$tip.label,
                                      colnames(JasperPlants.comm)))

You can also match your datasets using the function match.phylo.comm from picante, or the function comparative.comm from pez.

# Using picante
JasperPlants.picCleanTree <- match.phylo.comm(phy = JasperPlants.tree, 
                                              comm = JasperPlants.comm)$phy

JasperPlants.picCleanComm <- match.phylo.comm(phy = JasperPlants.tree, 
                                              comm = JasperPlants.comm)$comm


# Do the same using pez
# JasperPlants.pezCleanComm <- comparative.comm(phy = JasperPlants.tree, 
#                                               comm = as.matrix(JasperPlants.comm))

plot(JasperPlants.picCleanTree,
     cex = 0.4)

JasperPlants.picCleanComm[1:4, 1:4]

     Geranium_dissectum Erodium_botrys Anagallis_arvensis
J11                  20             10                  5
J110                  0             15                  0
J12                  20             20                  0
J13                   0             40                  0
     Epilobium_brachycarpum
J11                       0
J110                      0
J12                       1
J13                       0

Phylogenetic $\alpha$ diversity (PD)

One simple way to measure evolutionary diversity within a community is through Daniel Faith’s PD metric (Faith, 1992; http://goo.gl/wM08Oy).

PD sums the branch lenghts of all co-occurring species in a given site, from the tips to the root of the phylogenetic tree.

Higher PD values are given to communities that has more evolutionary divergent taxa and older history, while lower PD values represent assemblages that have taxa with more recent evolutionar history.

Faith’s PD can be implemented using the pd function in the picante package.

In addition to Faith’s PD, the pd function also returns species richness (SR).

SR is the same as observed richness ($S_{obs}$) or $\alpha$ diversity.

Jasper.PD <- pd(JasperPlants.picCleanComm, JasperPlants.picCleanTree,
                include.root = FALSE)

head(Jasper.PD) # See the first six rows of the resulting matrix

           PD SR
J11  831.4887 13
J110 501.8703  9
J12  769.9249 14
J13  712.0532 13
J14  640.5896 11
J15  640.2884 13

And how PD estimates compares with SR?

cor.test(Jasper.PD$PD, Jasper.PD$SR)


    Pearson's product-moment correlation

data:  Jasper.PD$PD and Jasper.PD$SR
t = 9.0965, df = 28, p-value = 7.457e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7321097 0.9338472
sample estimates:
      cor 
0.8643903

plot(Jasper.PD$PD, 
     Jasper.PD$SR, 
     xlab = "Phylogenetic Diversity", ylab = "Species Richness", 
     pch = 16)

Randomizations and null models

We are often interested in assessing whether or not observed patterns differ from null expectation. picante has many functions that allow us to test hypotheses under different types of null models:

Null Model	Description

taxa.labels	Shuffles taxa labels across tips of phylogeny (across all taxa included in phylogeny)

richness	Randomizes community data matrix abundances within samples (maintains sample species richness)

frequency	Randomizes community data matrix abundances within species (maintains species occurence frequency)

sample.pool	Randomizes community data matrix by drawing species from pool of species occurring in at least one community (sample pool) with equal probability

phylogeny.pool	Randomize community data matrix by drawing species from pool of species occurring in at least one community (sample pool) with equal probability

independentswap	Randomizes community data matrix with the independent swap algorithm (Gotelli 2000) maintaining species occurrence frequency and sample species richness

trialswap	Randomizes community data matrix with the trial-swap algorithm (Miklos & Podani 2004) maintaining species occurrence frequency and sample species richness

We will start using null models to calculate the standardized effect size (ses) of PD for each community. For this, we can use the ses.pd function in picante:

# This is a computationally intensive process and take some time to run
Jasper.ses.pd <- ses.pd(samp = JasperPlants.picCleanComm, 
                        tree = JasperPlants.picCleanTree, 
                        null.model = "taxa.labels", 
                        runs = 1000)

head(Jasper.ses.pd)

     ntaxa   pd.obs pd.rand.mean pd.rand.sd pd.obs.rank   pd.obs.z
J11     13 831.4887     793.8102   64.00612         713  0.5886711
J110     9 501.8703     624.4736   58.22656          23 -2.1056247
J12     14 769.9249     832.2727   63.65673         165 -0.9794380
J13     13 712.0532     791.8934   64.23287         114 -1.2429803
J14     11 640.5896     712.3946   60.74966         123 -1.1819819
J15     13 640.2884     793.5430   61.60145           8 -2.4878408
        pd.obs.p runs
J11  0.712287712 1000
J110 0.022977023 1000
J12  0.164835165 1000
J13  0.113886114 1000
J14  0.122877123 1000
J15  0.007992008 1000

Phylogenetic community structure

Another way of looking into the phylogenetic structure of communities is to quantify at what degree closely related taxa co-occur.

The basis for patterns of phylogenetic community structure can lie in a simple framework (table 1; Webb et al. 2002): when species traits responsible for their physiological tolerances are conserved, an environmental filtering that limits the range of viable ecological strategies at a given site is expected to select co-occurring species more related than expected by chance, i.e. generate a pattern known as phylogenetic clustering.

On the other hand, competitive exclusion can limit the ecological similarity of co-occurring species, generating a pattern of phylogenetic overdispersion or phylogenetic evenness.

Finally, when traits are diverging faster across the evolutionary time, the effects of habitat filtering should be weaker, producing evenly dispersed patterns of relatedness; while competition or limiting similarity is expected to produce random or clustered patterns.

If communities are assembled independently with respect to traits (e.g., Hubbell 2001), then patterns of relatedness should be resemble random expectation.

Over the recent years, other processes were shown to drive phylogenetic patterns of community structure, such as, isolation, time since isolation, diversification, scale-dependent assembly and local and regional composition and richness.

Assembly.process	Under.niche.conservatism	Under.niche.divergence
Limiting similarity	Even dispersion	Random or clustered dispersion
Habitat filtering	Clustered dispersion	Even dispersion
Neutral assembly	Random dispersion	Random dispersion

Phylogenetic resemblance matrix

An important step prior to the estimation of the phylogenetic structure of communities is to create a phylogenetic resemblance matrix, or a phylogenetic variance-covariance matrix. A phylogenetic variance-covariance matrix is nothing more than the calculated distances between taxa in a tree.

In R, we commonly refer to this matrix as a matrix of cophenetic distances among taxa and we can use the function cophenetic.phylo from picante to calculate it:

# Create a Phylogenetic Distance Matrix {picante}

JasperPlants.cophenDist <- cophenetic.phylo(JasperPlants.picCleanTree)

Net relatedness index (NRI)

One common metric to evaluate the phylogenetic structure of commmunities is the net relatedness index (NRI), which is based on the mean phylogenetic distances (MPD) calculated from the pairwise cophenetic branch lengths distances of the co-occurring taxa in each community.

NRI is obtained by multiplying the standardized effect size mean phylogenetic distances (calculated via a null model of interest) by $-1$.

Negative NRI values indicate phylogenetic overdispersion, i.e. co-occurring taxa is less closely related than expected by a given null model; while positive NRI values represent phylogenetic clustering, i.e. co-occurring taxa are more closely related than expected by a null model.

In R, we will use the function ses.mpd from picante to calculate the NRI of our communities:

# Estimate Standardized Effect Size of the MPD on the cophenetic matrix 
JasperPlants.ses.mpd <- ses.mpd(JasperPlants.picCleanComm, 
                   JasperPlants.cophenDist, 
                   null.model = "taxa.labels", 
                   abundance.weighted = FALSE, 
                   runs = 100) 

head(JasperPlants.ses.mpd)

     ntaxa  mpd.obs mpd.rand.mean mpd.rand.sd mpd.obs.rank  mpd.obs.z
J11     13 199.4659      195.3307    8.919828           61  0.4635981
J110     9 146.6226      195.6875   12.463972            3 -3.9365339
J12     14 195.5006      196.5919    7.863150           40 -0.1387885
J13     13 192.1528      196.3371    8.969154           24 -0.4665146
J14     11 185.2070      196.8119   11.133347           14 -1.0423496
J15     13 185.1665      195.9533    8.014001           12 -1.3459983
      mpd.obs.p runs
J11  0.60396040  100
J110 0.02970297  100
J12  0.39603960  100
J13  0.23762376  100
J14  0.13861386  100
J15  0.11881188  100

# Calculate NRI
JasperPlants.NRI <- as.matrix(-1 * 
                                ((JasperPlants.ses.mpd[,2] - JasperPlants.ses.mpd[,3]) / JasperPlants.ses.mpd[,4]))
rownames(JasperPlants.NRI) <- row.names(JasperPlants.ses.mpd)
colnames(JasperPlants.NRI) <- "NRI"


head(JasperPlants.NRI)

            NRI
J11  -0.4635981
J110  3.9365339
J12   0.1387885
J13   0.4665146
J14   1.0423496
J15   1.3459983

Nearest taxon index (NTI)

Another way to assess the phylogenetic structure of communities is through the calculation of the nearest taxon index (NTI). NTI differs from NRI by being based on the mean nearest neighbour phylogenetic distance (MNTD), instead of the overall mean pairwise distances (MPD).

MNTD is obtained by calculating the mean phylogenetic distance of each taxon of a given community to its nearest neighbour in a tree. By doing this, one can observe that NTI is more sensitive to the clustering or overdispersion patterns happening closer to the tips of a tree, while NRI represents an overall pattern of structuring across the whole phylogenetic tree.

As in NRI, higher values of NRI indicate phylogenetic clustering and lower values of NRI indicate phylogenetic overdispersion.

# Estimate Standardized Effect Size of the MNTD on the cophenetic matrix 
JasperPlants.ses.mntd <- ses.mntd(JasperPlants.picCleanComm, 
                   JasperPlants.cophenDist, 
                   null.model = "taxa.labels", 
                   abundance.weighted = FALSE, 
                   runs = 100) 

head(JasperPlants.ses.mntd)

     ntaxa mntd.obs mntd.rand.mean mntd.rand.sd mntd.obs.rank mntd.obs.z
J11     13 76.65059       80.65936     11.55970            41 -0.3467888
J110     9 71.32136       91.26041     16.29823            11 -1.2233870
J12     14 67.04067       77.98757     10.25801            17 -1.0671567
J13     13 65.27537       82.00706     10.58893             9 -1.5801110
J14     11 74.54219       85.16600     14.93720            25 -0.7112315
J15     13 47.69848       80.04591     10.58640             1 -3.0555649
     mntd.obs.p runs
J11  0.40594059  100
J110 0.10891089  100
J12  0.16831683  100
J13  0.08910891  100
J14  0.24752475  100
J15  0.00990099  100

# Calculate NTI
JasperPlants.NTI <- as.matrix(-1 * 
                                ((JasperPlants.ses.mntd[,2] - JasperPlants.ses.mntd[,3]) / JasperPlants.ses.mntd[,4]))
rownames(JasperPlants.NTI) <- row.names(JasperPlants.ses.mntd)
colnames(JasperPlants.NTI) <- "NTI"


head(JasperPlants.NTI)

           NTI
J11  0.3467888
J110 1.2233870
J12  1.0671567
J13  1.5801110
J14  0.7112315
J15  3.0555649

Phylogenetic beta diversity (PBD)

One may not be interested only in the patterns of $\alpha$ diversity, but in understanding how communities are evolutionary distinct from each other, i.e. knowing the pairwise phylogenetic $\beta$ diversity.

Many metrics are available to assess the phylogenetic $\beta$ diversity of communities. Here, we will do this by calculating the PhyloSor index, which is the phylogenetic analog of the Sørensen index (Bryant et al., 2008; Swenson, 2011; Feng et.al., 2012). PhyloSor can be calculated using the function phylosor from picante:

Jasper.pbd <- phylosor(samp = JasperPlants.picCleanComm, 
                        tree = JasperPlants.picCleanTree)

as.matrix(Jasper.pbd)[1:7,1:7]

           J11      J110       J12       J13       J14       J15       J16
J11  0.0000000 0.6381010 0.7884823 0.8098831 0.8232718 0.7058630 0.8401553
J110 0.6381010 0.0000000 0.7032579 0.7367845 0.7567888 0.7165036 0.6524526
J12  0.7884823 0.7032579 0.0000000 0.8166769 0.8852419 0.8188865 0.8371251
J13  0.8098831 0.7367845 0.8166769 0.0000000 0.8727346 0.7124985 0.7688849
J14  0.8232718 0.7567888 0.8852419 0.8727346 0.0000000 0.7543848 0.8751975
J15  0.7058630 0.7165036 0.8188865 0.7124985 0.7543848 0.0000000 0.6949025
J16  0.8401553 0.6524526 0.8371251 0.7688849 0.8751975 0.6949025 0.0000000

We can go one step further and test whether the observed PBD values differ from null expectation under different null models. First, we will use the function phylosor.rnd to obtain PhyloSor values of phylogenetic beta-diversity obtained by randomization:

Jasper.pbd.rndTaxa <- phylosor.rnd(samp = JasperPlants.picCleanComm, 
                        tree = JasperPlants.picCleanTree,
                        cstSor = TRUE,
                        null.model = "taxa.labels", # Pay attention to the null models!
                        runs = 100)

Apparently, the a standardized effect size function is not available within picante. Thus, we will create our own null model to test our hypothesis:

# I first ran ses.pd to build this in a more-or-less similar format

ses.phylosor <- function(obs, rand){
  rand <- t(as.data.frame(lapply(rand, as.vector)))
  
  phySor.obs <- as.numeric(obs)
  
  phySor.mean <- apply(rand, MARGIN = 2, 
                      FUN = mean, 
                      na.rm = TRUE)
  
  phySor.sd <- apply(rand, 
                    MARGIN = 2, 
                    FUN = sd, 
                    na.rm = TRUE)
  
  phySor.ses <- (phySor.obs - phySor.mean)/phySor.sd
  
  phySor.obs.rank <- apply(X = rbind(phySor.obs, rand), 
                          MARGIN = 2, 
                          FUN = rank)[1, ]
  
  phySor.obs.rank <- ifelse(is.na(phySor.mean), NA, phySor.obs.rank)
  
  data.frame(phySor.obs, 
             phySor.mean, 
             phySor.sd, 
             phySor.obs.rank, 
             phySor.ses, 
             phySor.obs.p = phySor.obs.rank/(dim(rand)[1] + 1))
}

  
JasperPlants.ses.pbd  <- ses.phylosor(Jasper.pbd,
                                      Jasper.pbd.rndTaxa)

# Project the table using knitr::kable
round(JasperPlants.ses.pbd[45:55,], 3) %>%
  kable("html") %>%
  kable_styling()

	phySor.obs	phySor.mean	phySor.sd	phySor.obs.rank	phySor.ses	phySor.obs.p
45	0.527	0.589	0.060	17	-1.043	0.168
46	0.647	0.601	0.055	83	0.823	0.822
47	0.589	0.584	0.063	49	0.083	0.485
48	0.628	0.624	0.058	51	0.078	0.505
49	0.652	0.736	0.040	3	-2.136	0.030
50	0.687	0.712	0.050	31	-0.500	0.307
51	0.770	0.757	0.048	58	0.268	0.574
52	0.671	0.724	0.044	13	-1.191	0.129
53	0.665	0.692	0.045	31	-0.600	0.307
54	0.575	0.647	0.048	7	-1.476	0.069
55	0.649	0.671	0.051	32	-0.431	0.317

Decomposing phylogenetic beta diversity

Our questions about phylogenetic can become even more complex!

One may also be interested in decomposing the turnover and nestedness components of beta diversity (Baselga, 2010; Leprieur et al., 2012). Baselga (2010) proposed an additive partitioning framework that consists in decomposing the pair-wise Sørensen’s dissimilarity index into two additive components accounting for (i) ‘true’ turnover of species and (ii) richness differences between nested communities. In 2012, Leprieur et al. (2012) extended Baselga’s (2010) framework to decompose PhyloSor into two components accounting for ‘true’ phylogenetic turnover and phylogenetic diversity gradients, respectively.

Still not covered in this tutorial!

Trait Evolution

One central concept in ecophylogenetics is the non‐independence among species traits because of their phylogenetic relatedness (Felsenstein 1985; Revell, Harmon & Collar 2008).

For this part of the workshop, we will use our newly acquired knowledge in the Edge length and rate transformations section about Pagel’s $\lambda$, $\kappa$ and $\delta$. We will also see about alternative evolutionary models and learn how to test whether phylogenetic signal is present in species traits, as well as, fit traits to different models of evolution.

Dataset preparation

We are now introducing a third-type of data in our phylogenetic analyses: traits. Here we will use a different dataset.

For this part of the course, we will fetch our dataset from papers using a very useful R package: fulltext. fulltext is maintained by Scott Chamberlain, Will Pearse and Katrin Leinweber, and is able to get files directly from the supplementary material of published papers, with only the DOI references!

# The idea of using this dataset was borrowed from Will Pearse's minicourse on ecophylogenetics.
# Nevertheless, we will use it for further analyses here!

library('fulltext')

MCDB.comm <- read.csv(ft_get_si("E092-201", "MCDB_communities.csv", "esa_data_archives"))
MCDB.species <- read.csv(ft_get_si("E092-201", "MCDB_species.csv", "esa_data_archives"))
Mammalian.tree <- read.nexus("data/ele_1307_sm_sa1.tre")[[1]] # fulltext is not fetching files from Wiley
PanTHERIA.traits <- read.delim(ft_get_si("E090-184", "PanTHERIA_1-0_WR05_Aug2008.txt", "esa_archives"))

What have we just done?

In MCDB.comm, we have loaded 7977 records of species composition and abundance of mammalian communities (excluding bats) for 1000 sites throughout the world, which was compiled from published literature.

MCDB.species contains 700 records that includes the name of species and their taxonomic information.

This data is part of the ecological archive of the paper by Thibault et al. (2011)

(Katherine M. Thibault, Sarah R. Supp, Mikaelle Giffin, Ethan P. White, and S. K. Morgan Ernest. 2011. Species composition and abundance of mammalian communities. Ecology 92:2316.)

Mammalian.tree loaded the first tree described in the published paper by Fritz et al. 2009

(Fritz, S. A., Bininda-Emonds, O. R. P. and Purvis, A. (2009), Geographical variation in predictors of mammalian extinction risk: big is bad, but only in the tropics. Ecology Letters 2(6):538–549. doi: 10.1111/J.1461-0248.2009.01307)

And, finally, PanTHERIA.traits contain a species-level data set of life history, ecology, and geography of extant and recently extinct mammals. The metadata for this dataset can be consulted in its website.

Now that we know about our dataset, we can explore each one of the loaded objects:

head(MCDB.comm)

  Site_ID Initial_year Species_ID Presence_only Abundance Mass
1    1008         2002       CHPE             0        52 NULL
2    1008         2002       CHSX             0         4 NULL
3    1008         2002       DIDE             0        12 NULL
4    1008         2002       DIME             0        20 NULL
5    1008         2002       PELO             0         5 NULL
6    1008         2002       PEPA             0         9 NULL

head(MCDB.species)

  Species_ID      Family     Genus       Species Species_level
1       ABBE Abrocomidae  Abrocoma     bennettii             1
2       ABLO  Cricetidae Abrothrix    longipilis             1
3       ABOL  Cricetidae Abrothrix     olivaceus             1
4       ACSP     Muridae    Acomys spinosissimus             1
5       ACWI     Muridae    Acomys       wilsoni             1
6       ACPY Acrobatidae Acrobates      pygmaeus             1

str(Mammalian.tree)

List of 4
 $ edge       : int [1:7522, 1:2] 5021 5022 5023 5023 5024 5024 5022 5021 5025 5026 ...
 $ edge.length: num [1:7522] 102.6 49.9 13.7 0.2 13.5 ...
 $ Nnode      : int 2503
 $ tip.label  : chr [1:5020] "Tachyglossus_aculeatus" "Zaglossus_bruijni" "Zaglossus_bartoni" "Ornithorhynchus_anatinus" ...
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"

str(PanTHERIA.traits)

'data.frame':   5416 obs. of  55 variables:
 $ MSW05_Order                  : Factor w/ 29 levels "Afrosoricida",..: 2 3 3 3 3 2 2 2 23 23 ...
 $ MSW05_Family                 : Factor w/ 153 levels "Abrocomidae",..: 17 18 18 18 18 12 12 12 117 117 ...
 $ MSW05_Genus                  : Factor w/ 1230 levels "Abditomys","Abeomelomys",..: 151 152 152 152 152 117 117 117 139 139 ...
 $ MSW05_Species                : Factor w/ 3802 levels "abae","abbreviatus",..: 979 34 290 1804 1946 1191 1362 1662 828 948 ...
 $ MSW05_Binomial               : Factor w/ 5416 levels "Abditomys latidens",..: 563 564 565 566 567 409 410 411 476 477 ...
 $ X1.1_ActivityCycle           : num  3 1 2 2 2 2 -999 2 3 -999 ...
 $ X5.1_AdultBodyMass_g         : num  492714 10392 9659 11989 31757 ...
 $ X8.1_AdultForearmLen_mm      : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X13.1_AdultHeadBodyLen_mm    : num  -999 745 828 872 1055 ...
 $ X2.1_AgeatEyeOpening_d       : num  -999 -999 7.5 11.9 14 ...
 $ X3.1_AgeatFirstBirth_d       : num  1652 -999 -999 365 548 ...
 $ X18.1_BasalMetRate_mLO2hr    : num  40293 -999 -999 3699 11254 ...
 $ X5.2_BasalMetRateMass_g      : num  407000 -999 -999 10450 33100 ...
 $ X6.1_DietBreadth             : num  3 6 6 1 1 3 2 2 -999 -999 ...
 $ X7.1_DispersalAge_d          : num  -999 330 -999 255 180 ...
 $ X9.1_GestationLen_d          : num  386.5 65 61.2 61.7 63.5 ...
 $ X12.1_HabitatBreadth         : num  1 1 1 1 1 -999 -999 1 1 -999 ...
 $ X22.1_HomeRange_km2          : num  196.32 1.01 2.95 18.88 159.86 ...
 $ X22.2_HomeRange_Indiv_km2    : num  -999 1.01 3.13 19.91 43.13 ...
 $ X14.1_InterbirthInterval_d   : num  614 -999 365 365 365 ...
 $ X15.1_LitterSize             : num  0.98 4.5 3.74 5.72 4.98 1.22 1 1.22 1.01 -999 ...
 $ X16.1_LittersPerYear         : num  1 -999 -999 -999 2 1 1 1 -999 -999 ...
 $ X17.1_MaxLongevity_m         : num  480 137 192 262 354 ...
 $ X5.3_NeonateBodyMass_g       : num  36751 -999 212 200 412 ...
 $ X13.2_NeonateHeadBodyLen_mm  : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X21.1_PopulationDensity_n.km2: num  0.98 0.74 0.22 0.25 0.01 0.54 -999 0.75 4.89 -999 ...
 $ X10.1_PopulationGrpSize      : num  11 -999 -999 -999 -999 -999 -999 21 -999 -999 ...
 $ X23.1_SexualMaturityAge_d    : num  1948 250 371 373 679 ...
 $ X10.2_SocialGrpSize          : num  10 -999 -999 -999 -999 40 110 40 2.05 -999 ...
 $ X24.1_TeatNumber             : int  -999 8 8 8 9 -999 -999 -999 -999 -999 ...
 $ X12.2_Terrestriality         : num  1 1 1 1 1 -999 -999 1 2 -999 ...
 $ X6.2_TrophicLevel            : int  1 2 2 3 3 1 1 1 -999 -999 ...
 $ X25.1_WeaningAge_d           : num  389.4 52.9 61.3 43.7 44.8 ...
 $ X5.4_WeaningBodyMass_g       : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X13.3_WeaningHeadBodyLen_mm  : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ References                   : Factor w/ 2589 levels "-999","1;13;134;484;543;820;883;890;1154;1155;1156;1157;1158;1297;1849;2018;2151;2620;2655;3079",..: 1470 1564 1678 1152 1146 1478 1529 1513 393 1 ...
 $ X5.5_AdultBodyMass_g_EXT     : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X16.2_LittersPerYear_EXT     : num  -999 -999 1.1 1.1 -999 -999 -999 -999 1.05 -999 ...
 $ X5.6_NeonateBodyMass_g_EXT   : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X5.7_WeaningBodyMass_g_EXT   : num  -999 -999 -999 -999 -999 -999 -999 -999 -999 -999 ...
 $ X26.1_GR_Area_km2            : num  -999 10581413 25739527 17099094 50803440 ...
 $ X26.2_GR_MaxLat_dd           : num  -999 16.7 47 71.4 83.3 ...
 $ X26.3_GR_MinLat_dd           : num  -999 -28.73 -4.71 8.02 11.48 ...
 $ X26.4_GR_MidRangeLat_dd      : num  -999 -6 21.1 39.7 47.4 ...
 $ X26.5_GR_MaxLong_dd          : num  -999 43.5 108.5 -67.1 179.7 ...
 $ X26.6_GR_MinLong_dd          : num  -999 -17.5 -17.1 -168.1 -171.8 ...
 $ X26.7_GR_MidRangeLong_dd     : num  -999 13 45.7 -117.6 3.9 ...
 $ X27.1_HuPopDen_Min_n.km2     : int  -999 0 0 0 0 1 0 1 0 -999 ...
 $ X27.2_HuPopDen_Mean_n.km2    : num  -999 35.2 79.3 27.3 37.9 ...
 $ X27.3_HuPopDen_5p_n.km2      : num  -999 1 0 0 0 8 0 4 0 -999 ...
 $ X27.4_HuPopDen_Change        : num  -999 0.14 0.1 0.06 0.04 0.09 0.05 0.11 0.05 -999 ...
 $ X28.1_Precip_Mean_mm         : num  -999 90.8 44.6 53 34.8 ...
 $ X28.2_Temp_Mean_01degC       : num  -999 236.51 217.23 58.18 4.82 ...
 $ X30.1_AET_Mean_mm            : num  -999 923 438 503 313 ...
 $ X30.2_PET_Mean_mm            : num  -999 1534 1359 728 561 ...

Now, let’s treat our dataset:

(Yes, we spend a lot of time making our dataset ready for analysis!)

# Starting with the community file:
# Let's make a species-site matrix

MCDB.comm$Abundance <- as.numeric(as.character(MCDB.comm$Abundance))

head(MCDB.comm$Abundance)

[1] 52  4 12 20  5  9

# Add a new column with real species names, extracted and matched from the MCDB.species dataset

MCDB.comm$Species <- with(MCDB.species, 
                          paste(Genus, Species)[match(MCDB.comm$Species_ID, Species_ID)])

# Have you seen "with" before? Click on the link below 
# and see how this could have been done without "with". 
# Tell us the difference!

MCDB.comm$Species <- gsub("  ", " ", MCDB.comm$Species)

head(MCDB.comm)

  Site_ID Initial_year Species_ID Presence_only Abundance Mass
1    1008         2002       CHPE             0        52 NULL
2    1008         2002       CHSX             0         4 NULL
3    1008         2002       DIDE             0        12 NULL
4    1008         2002       DIME             0        20 NULL
5    1008         2002       PELO             0         5 NULL
6    1008         2002       PEPA             0         9 NULL
                   Species
1 Chaetodipus penicillatus
2         Chaetodipus spp.
3        Dipodomys deserti
4       Dipodomys merriami
5 Perognathus longimembris
6       Perognathus parvus

Looking good? It still does not look like a species by site matrix, right? Let’s transform the sample dataset into a matrix using picante’s sample2matrix function:

MCDB.comm <- with(MCDB.comm,
                  as.matrix(sample2matrix(
                    data.frame(Site_ID, # This will be our row.names
                               as.numeric(as.character(Abundance)),
                               Species))))

# How is this looking now?
MCDB.comm[1:6, 1:6]

     Abrocoma bennettii Abrothrix longipilis Abrothrix olivaceus
1001                  0                    0                   0
1002                  0                    0                   0
1003                  0                    0                   0
1004                  0                    0                   0
1005                  0                    0                   0
1006                  0                    0                   0
     Acomys spinosissimus Acomys wilsoni Acrobates pygmaeus
1001                    0              0                  0
1002                    0              0                  0
1003                    0              0                  0
1004                    0              0                  0
1005                    0              0                  0
1006                    0              0                  0

Now, let’s go to the phylogenetic tree:

# We have noticed that the labels of the phylogenetic tree 
# has a different separator than the community dataset

# Let's replace the "_" in the phylogenetic tree by " "

Mammalian.tree$tip.label <- gsub("_", " ", Mammalian.tree$tip.label)

# Solve potential polytomies

Mammalian.tree <- multi2di(Mammalian.tree)

Mammalian.tree


Phylogenetic tree with 5020 tips and 5019 internal nodes.

Tip labels:
    Tachyglossus aculeatus, Zaglossus bruijni, Zaglossus bartoni, Ornithorhynchus anatinus, Anomalurus beecrofti, Anomalurus derbianus, ...

Rooted; includes branch lengths.

And, now, the trait dataset. Let’s select only a few characters of interest:

PanTHERIA.traits <- with(PanTHERIA.traits,
                         data.frame(BodyMass = log10(X5.1_AdultBodyMass_g),
                                    HeadLength = (X13.1_AdultHeadBodyLen_mm),
                                    LatitudinalRange = (X26.4_GR_MidRangeLat_dd),
                                    mPrecipRange = (X28.1_Precip_Mean_mm),
                                    mTempRange = (X28.2_Temp_Mean_01degC),
                                    HomeRange = (X22.1_HomeRange_km2),
                                    Terrestriality = (X12.2_Terrestriality),
                                    row.names = MSW05_Binomial))

Warning in data.frame(BodyMass = log10(X5.1_AdultBodyMass_g), HeadLength =
(X13.1_AdultHeadBodyLen_mm), : NaNs produced

str(PanTHERIA.traits)

'data.frame':   5416 obs. of  7 variables:
 $ BodyMass        : num  5.69 4.02 3.98 4.08 4.5 ...
 $ HeadLength      : num  -999 745 828 872 1055 ...
 $ LatitudinalRange: num  -999 -6 21.1 39.7 47.4 ...
 $ mPrecipRange    : num  -999 90.8 44.6 53 34.8 ...
 $ mTempRange      : num  -999 236.51 217.23 58.18 4.82 ...
 $ HomeRange       : num  196.32 1.01 2.95 18.88 159.86 ...
 $ Terrestriality  : num  1 1 1 1 1 -999 -999 1 2 -999 ...

# Remove "-999"" and "NaN" values

PanTHERIA.traits[PanTHERIA.traits == -999] <- NA
PanTHERIA.traits <- na.omit(PanTHERIA.traits)

str(PanTHERIA.traits)

'data.frame':   433 obs. of  7 variables:
 $ BodyMass        : num  4.02 3.98 4.08 4.5 2.39 ...
 $ HeadLength      : num  745 828 872 1055 224 ...
 $ LatitudinalRange: num  -6 21.14 39.7 47.38 -6.19 ...
 $ mPrecipRange    : num  90.8 44.6 53 34.8 144 ...
 $ mTempRange      : num  236.51 217.23 58.18 4.82 239.14 ...
 $ HomeRange       : num  1.01 2.95 1.89e+01 1.60e+02 9.96e-03 ...
 $ Terrestriality  : num  1 1 1 1 2 1 1 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int  1 6 7 8 9 10 11 12 13 14 ...
  ..- attr(*, "names")= chr  "Camelus dromedarius" "Bos frontalis" "Bos grunniens" "Bos javanicus" ...

Finally, let’s assure that all datasets are matching in species number and names. This time, instead of using the match.phylo.data and match.phylo.comm functions from picante, we will use the comparative.comm from pez:

Mammal.CompData <- comparative.comm(Mammalian.tree, 
                    MCDB.comm, 
                    PanTHERIA.traits)

Warning in comparative.comm(Mammalian.tree, MCDB.comm, PanTHERIA.traits):
Mismatch between phylogeny and other data, dropping 4876 tips

Warning in comparative.comm(Mammalian.tree, MCDB.comm, PanTHERIA.traits):
Mismatch between community matrix and other data, dropping 556 columns

Warning in comparative.comm(Mammalian.tree, MCDB.comm, PanTHERIA.traits):
Mismatch between traits and other data, dropping 289 columns

# Also, for the moment, there is no reason to have sites 
# that have no species recorded in them

Mammal.CompData <- Mammal.CompData[rowSums(comm(Mammal.CompData)) > 0, ]
Mammal.CompData <- Mammal.CompData[,colSums(comm(Mammal.CompData)) > 0]

str(Mammal.CompData)

List of 6
 $ phy    :List of 4
  ..$ edge       : int [1:194, 1:2] 99 100 101 102 103 104 105 106 107 108 ...
  ..$ edge.length: num [1:194] 48.4 2.5 4.3 2.9 6.4 0.2 4.4 7.7 58.1 12.2 ...
  ..$ Nnode      : int 97
  ..$ tip.label  : chr [1:98] "Napaeozapus insignis" "Zapus hudsonius" "Apodemus sylvaticus" "Otomys irroratus" ...
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"
 $ comm   : num [1:694, 1:98] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:694] "1002" "1005" "1008" "1009" ...
  .. ..$ : chr [1:98] "Napaeozapus insignis" "Zapus hudsonius" "Apodemus sylvaticus" "Otomys irroratus" ...
 $ data   :'data.frame':    98 obs. of  7 variables:
  ..$ BodyMass        : num [1:98] 1.35 1.27 1.34 2.06 1.57 ...
  ..$ HeadLength      : num [1:98] 89.7 85.9 87.5 163.5 101.6 ...
  ..$ LatitudinalRange: num [1:98] 44.4 48.34 47.53 -25.97 1.98 ...
  ..$ mPrecipRange    : num [1:98] 70.3 58.2 57.2 50 151.1 ...
  ..$ mTempRange      : num [1:98] 40.2 25.1 90.8 163.9 250.2 ...
  ..$ HomeRange       : num [1:98] 0.01 0.00185 0.00711 0.0013 0.00163 ...
  ..$ Terrestriality  : num [1:98] 1 1 2 1 1 2 1 2 1 2 ...
 $ env    : NULL
 $ dropped:List of 5
  ..$ comm.sp.lost   : chr(0) 
  ..$ comm.sites.lost: chr(0) 
  ..$ phy.sp.lost    : chr(0) 
  ..$ traits.sp.lost : chr(0) 
  ..$ env.sites.lost : chr(0) 
 $ names  :function (x)  
 - attr(*, "class")= chr [1:2] "comparative.comm" "comparative.data"

Phylogenetic signal

One way to assess the existence and degree of phylogenetic non‐independence in species traits is through the calculation of the phylogenetic signal of a given trait. Phylogenetic signal can be defined as the tendency for related species to resemble each other more than they resemble species drawn at a null or alternative evolutionary model (Blomberg & Garland 2002).

Phylogenetic signal has been commonly used to investigate questions in a wide range of research areas: How strongly are certain traits correlated with each other (Felsenstein 1985)? Which processes drive community assembly (Webb et al. 2002)? Are niches conserved along phylogenies (Losos 2008) and is vulnerability to climate change clustered in the phylogeny (Thuiller et al. 2011)?

Let’s learn about a few ways of measuring phylogenetic signal:

Moran’s I and Moran’s Correlogram

One simple way of measuring phylogenetic signal is through Moran’s I (Gittleman and Kot, 1990; Moran, 1950), which estimates the existing autocorrelation between species and their traits as a function of their phylogenetic relatedness.

Positive Moran’s $I$ values indicate positive phylogenetic correlation between species; negative values indicate that closely related species are more different than distantly related species; and an $I$ of $0$ indicates absence of autocorrelation.

For this, we will use geiger’s Morain.I function. See an example below:

# We will first calculate a proximity matrix, where we can 
# change the weight of distantly related species by changing 
# the parameter alpha

alphaI <- 1
wI <- 1/cophenetic(Mammal.CompData$phy)^alphaI
diag(wI) <- 0

wI[1:5, 1:5]

                     Napaeozapus insignis Zapus hudsonius
Napaeozapus insignis          0.000000000     0.040983607
Zapus hudsonius               0.040983607     0.000000000
Apodemus sylvaticus           0.007112376     0.007112376
Otomys irroratus              0.007112376     0.007112376
Praomys tullbergi             0.007112376     0.007112376
                     Apodemus sylvaticus Otomys irroratus
Napaeozapus insignis         0.007112376      0.007112376
Zapus hudsonius              0.007112376      0.007112376
Apodemus sylvaticus          0.000000000      0.017006803
Otomys irroratus             0.017006803      0.000000000
Praomys tullbergi            0.017006803      0.017006803
                     Praomys tullbergi
Napaeozapus insignis       0.007112376
Zapus hudsonius            0.007112376
Apodemus sylvaticus        0.017006803
Otomys irroratus           0.017006803
Praomys tullbergi          0.000000000

# We can then calculate Moran's I

Mammal.BodyMass.I <- Moran.I(x = Mammal.CompData$data$BodyMass,
                             weight = wI,
                             scaled = TRUE)

Mammal.BodyMass.I

$observed
[1] 0.09687207

$expected
[1] -0.01030928

$sd
[1] 0.01367237

$p.value
[1] 4.440892e-15

What is the meaning of this?

We have just calculated the global Moran’s I, which allowed us to assess the existence of overall phylogenetic autocorrelation for a given trait. However, we also know that the degree of autocorrelation can change across different phylogenetic distances.

We can build a Moran’s correlogram to assess how phylogenetic autocorrelation changes across different phylogenetic distance classes.

For this, we will use the functions phyloCorrelogram and plot.phylocorrelogram from adephylo:

# Calculate the correlogram under a phylo4d object
Mammal.BodyMass.MoranCor <- phyloCorrelogram(phylo4d(x = Mammal.CompData$phy, 
                                 tip.data = Mammal.CompData$data$BodyMass), 
                           n.points = 10, 
                           ci.bs = 10)

# Represent the correlogram
plot.phylocorrelogram(Mammal.BodyMass.MoranCor)

How do you interprete this correlogram?

Blomberg’s K

A common way to measure phylogenetic signal is through Blomberg’s K statistic (Blomberg et al. 2003), which compares the observed signal in a trait to its expected signal under multiple estimations under a Brownian motion model of trait evolution on a phylogeny.

K works as a mean square ratio (MSE), where the numerator is the error assuming that the trait evolves independently from phylogenetic structure, and the denominator is corrected by the phylogenetic covariances.

K values equal to 1 correspond to a Brownian motion process. K values greater than 1 indicate strong phylogenetic signal and that traits are being conserved through the evolutionary time. Alternatively, K values closer to zero correspond to a random or convergent pattern of evolution.

(Mammal.BodyMass.K <- phylosig(Mammal.CompData$phy, 
               Mammal.CompData$data$BodyMass, 
               method = "K", test = TRUE))

[1] "x has no names; assuming x is in the same order as tree$tip.label"

$K
[1] 0.9142054

$P
[1] 0.001

How about other traits?

(Mammal.HeadLength.K <- phylosig(Mammal.CompData$phy, 
               Mammal.CompData$data$HeadLength, 
               method = "K", test = TRUE))

[1] "x has no names; assuming x is in the same order as tree$tip.label"

$K
[1] 0.901623

$P
[1] 0.001

(Mammal.mTempRange.K <- phylosig(Mammal.CompData$phy, 
               Mammal.CompData$data$mTempRange, 
               method = "K", test = TRUE))

[1] "x has no names; assuming x is in the same order as tree$tip.label"

$K
[1] 0.3389474

$P
[1] 0.001

Look at how can we interprete K in the text above the code, and tell if there is phylogenetic signal for each one of the above traits.

Evolutionary models

Brownian motion model

For continuous traits, we commonly assume a Brownian motion (BM) model of evolutionary change. This model was proposed by Felsenstein (1985) and represents traits divergence independently across the evolutionary time, due to genetic drift (and sometimes other processes, such as natural selection), in resemblance to a random walk (or to a drunkard’s walk).

The BM model assumes that the correlation among trait values is proportional to the extent of shared ancestry for pairs of species.

It is a stochastic model in which changes from one time to the next are random draws from a normal distribution with mean $0$ and variance $σ^2 × Δt$. This means that variance under BM motion increases linearly through time with instantaneous rate $σ^2$.

BM is a model that can be easily simulated:

# From Liam Revell's tutorial

t <- 0:100  # time
sig2 <- 0.01

## first, simulate a set of random deviates
x <- rnorm(n = length(t) - 1, 
           sd = sqrt(sig2))

## now compute their cumulative sum
x <- c(0, cumsum(x))
plot(t, x, 
     type = "l", 
     ylim = c(-2, 2))

Now, with multiple simulations and changing $\sigma^2$:

nsim <- 100
X <- matrix(rnorm(n = nsim * (length(t) - 1), 
                  sd = sqrt(sig2)), 
            nsim, 
            length(t) - 1)
X <- cbind(rep(0, nsim), 
           t(apply(X, 1, cumsum)))

par(mfrow = c(1,3))

plot(t, X[1, ], 
     xlab = "time", 
     ylab = "phenotype", 
     ylim = c(-2, 2), 
     type = "l")

apply(X[2:nsim, ], 1, 
      function(x, t) 
        lines(t, x), t = t)

NULL

##

X <- matrix(rnorm(n = nsim * (length(t) - 1), 
                  sd = sqrt(sig2/10)), 
            nsim, 
            length(t) - 1)

X <- cbind(rep(0, nsim), 
           t(apply(X, 1, cumsum)))

plot(t, X[1, ], 
     xlab = "time", ylab = "phenotype", 
     ylim = c(-2, 2), 
     type = "l")

apply(X[2:nsim, ], 1, 
      function(x, t) lines(t, x), t = t)

NULL

##

nsim = 100

X <- matrix(rnorm(mean = 0.02, 
                  n = nsim * (length(t) - 1), sd = sqrt(sig2/4)), 
    nsim, length(t) - 1)
X <- cbind(rep(0, nsim), 
           t(apply(X, 1, cumsum)))

plot(t, X[1, ], 
     xlab = "time", ylab = "phenotype", 
     ylim = c(-1, 3), type = "l")

apply(X[2:nsim, ], 1, 
      function(x, t) lines(t, x), t = t)

NULL

If we want to think on the base of a tree:

library(phytools)

t <- 100  # total time
n <- 30  # total taxa
b <- (log(n) - log(2))/t
tree <- pbtree(b = b, n = n, t = t, type = "discrete")

simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

  18 trees rejected before finding a tree

## simulate Brownian evolution on a tree with fastBM
x <- fastBM(tree, sig2 = sig2, internal = TRUE)
## visualize Brownian evolution on a tree

phenogram(tree, x, spread.labels = TRUE, spread.cost = c(1, 0))

Ornstein-Uhlenbeck model

The Ornstein-Uhlenbeck (OU) evolutionary model is based on stabilizing selection (Hansen, 1997; but see Butler and King 2004), where a given species trait is determined by its common ancestral state ($a$) and the strengh of the selection ($\alpha$) that attracts traits towards its optimal value ($\theta$).

#Ornstein-Uhlenbeck (OU)
nsim=50
time.steps = 50

ou<-matrix(ncol=nsim,nrow=time.steps)
ou[1,]<-0

alpha=-0.05

for(i in 2:time.steps){
    ou[i,] = ou[i-1,] + alpha*(ou[i-1,]) + rnorm(nsim, mean=0, sd=1) 
}

plot(1:time.steps, ou[,1],
     xlim=c(0,time.steps),
     ylim=c(-max(abs(ou)),
            max(abs(ou))),
     type="n",
     xlab="Time", ylab="Trait value",
     bty="n")

matlines(1:time.steps,
         ou,
         lty=1,
         col="#FF303080", 
         lwd=2)

We can again use rescale on a tree to see how it stretches to an OU model:

tree <- rcoal(10)

par1 <- par(mfrow = c(1, 2))

plot(tree)

tree_scaled <- rescale(tree, model = "OU")
plot(tree_scaled(2))

par(par1)

Early-burst model

The early-burst evolutionary model, also known as ACDC model (accelerating-decelerating; Blomberg et al. 2003) simulates scenarios where evolutionary rates increase and decrease exponentially through time, under the model $r[t] = r[0] * exp(a * t)$, where $r[0]$ is the initial rate, $a$ is the rate change parameter, and $t$ is time.

Finally, we can rescale a tree to see an EB model of evolution:

We can see all three simulations together with this:

Fitting evolutionary models to trait data

Knowing if there is phylogenetic signal in traits is nice, but it is just knowing, right? Perhaps, a more informative way of assessing the phylogenetic non-independence in species traits is to properly test how traits fit to different models of trait evolution.

Here, we will use the functions fitContinuous from geiger to compare the relative fit of our data to different evolutionary models:

EvoModels <- c("BM", 
               "OU", 
               "EB")

fitEvoModels <- lapply(EvoModels,
           FUN = function(x) fitContinuous(multi2di(Mammal.CompData$phy), 
                                           Mammal.CompData$data[, "BodyMass", drop = FALSE], 
                                           model = x, 
                                           ncores = 2))

Warning in fitContinuous(multi2di(Mammal.CompData$phy), Mammal.CompData$data[, : Parameter estimates appear at bounds:
    a

Let’s take sometime to explore the results:

# checking the Brownian motion fit
fitEvoModels[[1]]

GEIGER-fitted comparative model of continuous data
 fitted 'BM' model parameters:
    sigsq = 0.007886
    z0 = 2.479351

 model summary:
    log-likelihood = -71.785030
    AIC = 147.570059
    AICc = 147.696375
    free parameters = 2

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    frequency of best fit = 1.00

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates

head(fitEvoModels[[1]]$res)

            sigsq       lnL convergence
Brent 0.007885838 -71.78503           0
Brent 0.007885838 -71.78503           0
Brent 0.007885838 -71.78503           0
Brent 0.007885838 -71.78503           0
Brent 0.007885838 -71.78503           0
Brent 0.007885838 -71.78503           0

# checking the Brownian motion fit
fitEvoModels[[2]]

GEIGER-fitted comparative model of continuous data
 fitted 'OU' model parameters:
    alpha = 0.004448
    sigsq = 0.009201
    z0 = 2.422298

 model summary:
    log-likelihood = -71.112667
    AIC = 148.225334
    AICc = 148.480653
    free parameters = 3

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    frequency of best fit = 0.51

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates

head(fitEvoModels[[2]]$res)

                 alpha         sigsq            lnL convergence
subplex   4.448143e-03  9.200411e-03  -7.111267e+01           0
L-BFGS-B  1.561300e+00  1.741961e+00  -1.104545e+02           0
subplex   4.448143e-03  9.200411e-03  -7.111267e+01           0
L-BFGS-B  4.448381e-03  9.200600e-03  -7.111267e+01           0
subplex  7.637649e-217  7.885838e-03  -7.178503e+01           0
L-BFGS-B  1.934199e-17 1.261429e-204 -1.000000e+200           0

# checking the Brownian motion fit
fitEvoModels[[3]]

GEIGER-fitted comparative model of continuous data
 fitted 'EB' model parameters:
    a = -0.000001
    sigsq = 0.007887
    z0 = 2.479357

 model summary:
    log-likelihood = -71.785181
    AIC = 149.570362
    AICc = 149.825681
    free parameters = 3

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    frequency of best fit = 0.07

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates

head(fitEvoModels[[3]]$res)

                     a         sigsq            lnL convergence
subplex  -1.000087e-06  7.886638e-03  -7.178518e+01           0
subplex  -1.001614e-06  7.886636e-03  -7.178518e+01           0
subplex  -1.001509e-06  7.887384e-03  -7.178518e+01           0
L-BFGS-B -7.826598e-02  1.409349e+22  -2.284690e+03           0
L-BFGS-B -5.717581e-02 1.815726e-180 -1.000000e+200           0
subplex  -1.000000e-06  7.886869e-03  -7.178518e+01           0

AICs <- sapply(fitEvoModels, 
               FUN = function(x) x$opt$aicc)

names(AICs) <- EvoModels

# Which one is the best model?
(AICtable <- aicw(AICs))

-- Akaike weights -- 
  Rank AIC  diff    wi  AICw
1    1 148 0.000 1.000 0.495
2    2 148 0.784 0.676 0.334
3    3 150 2.129 0.345 0.171

`Phylogenetic Comparative Methods` in R

Pedro Henrique Pereira Braga

2018-05-11

What is this?

Preparation for this tutorial

Why are phylogenies important?

A brief introduction to phylogenies

`ape` and the `phylo` object in R

Handling and preparing phylogenies

Dicotomous and polytomous trees

Ultrametric, non-ultrametric and uninformative trees

Edge length and rate transformations

Representing trees and other class objects

The `picante` package

The `pez` package

Other classes of objects

Exercises

Community Phylogenetic Patterns

Preparing data for community phylogenetic analyses

Phylogenetic \(\alpha\) diversity (PD)

Randomizations and null models

Phylogenetic community structure

Phylogenetic resemblance matrix

Net relatedness index (NRI)

Nearest taxon index (NTI)

Phylogenetic beta diversity (PBD)

Decomposing phylogenetic beta diversity

Trait Evolution

Dataset preparation

Phylogenetic signal

Moran’s I and Moran’s Correlogram

Blomberg’s K

Evolutionary models

Brownian motion model

Ornstein-Uhlenbeck model

Early-burst model

Fitting evolutionary models to trait data

Diversification rates

Phylogenetic Comparative Methods in R

Pedro Henrique Pereira Braga

2018-05-11

What is this?

Preparation for this tutorial

Why are phylogenies important?

A brief introduction to phylogenies

ape and the phylo object in R

Handling and preparing phylogenies

Dicotomous and polytomous trees

Ultrametric, non-ultrametric and uninformative trees

Edge length and rate transformations

Representing trees and other class objects

The picante package

The pez package

Other classes of objects

Exercises

Community Phylogenetic Patterns

Preparing data for community phylogenetic analyses

Phylogenetic \(\alpha\) diversity (PD)

Randomizations and null models

Phylogenetic community structure

Phylogenetic resemblance matrix

Net relatedness index (NRI)

Nearest taxon index (NTI)

Phylogenetic beta diversity (PBD)

Decomposing phylogenetic beta diversity

Trait Evolution

Dataset preparation

Phylogenetic signal

Moran’s I and Moran’s Correlogram

Blomberg’s K

Evolutionary models

Brownian motion model

Ornstein-Uhlenbeck model

Early-burst model

Fitting evolutionary models to trait data

Diversification rates

`Phylogenetic Comparative Methods` in R

`ape` and the `phylo` object in R

The `picante` package

The `pez` package