Community Phylogenetics in R
Overview
Community Phylogenetics: linking community ecology and evolutionary biology to your programming skills
par | by Pedro Henrique P. Braga (Concordia University) & Katherine Hébert (Université de Sherbrooke)
How do species assemble in communities? What are the involved mechanisms? These questions have fascinated ecologists for decades. As the community assembly of organisms depends on historical and ecological phenomena, it is necessary to link evolutionary and ecological hypotheses testing to address these questions. By combining phylogenetic information and species geographical co-existence, ‘community phylogenetics’ emerged as field seeking to reveal patterns and processes driving the organization and structure of ecological communities. In this workshop, we will dive into the debates and controversies surrounding the community structure and learn some of the various approaches used by community phylogenetics to detect evolutionary patterns on community assembly.
Comment les espèces s’assemblent-elles dans les communautés ? Quels sont les mécanismes impliqués ? Ces questions ont fasciné des écologistes pendant des décennies! L’assemblage d’organismes dans les communautés écologiques dépendent de phénomènes historiques et écologiques. Il est donc nécessaire de relier et de tester des hypothèses évolutives et écologiques pour répondre à ces questions. En combinant les informations phylogénétiques avec la coexistence géographique des espèces, le champ de « phylogénie des communautés » est apparu afin de révéler les patrons et les processus menant à l’organisation et à la structure des communautés écologiques. Dans cet atelier, nous allons plonger dans les débats et les controverses entourant la structure des communautés et apprendre certaines des méthodes utilisées par la phylogénétique de la communauté pour détecter les patrons évolutives lors de l’assemblage des communautés des espèces.
Preamble
We developed this workshop and self-guided tutorial to help researchers acquire a short background on phylogenetic patterns of community structure.
It was originally instructed during the 3rd QCBS R Symposium, which happened between October 10th and 12th, 2019, at the Gault Nature Reserve (McGill University), in Mont-Saint-Hilaire, QC (Canada).
Part of the material was inspired in the content of the book Phylogenies in Ecology, by Marc W. Cadotte and Jonathan Davies (2016).
If you would like to provide feedback on this material, do not hesitate to get in touch!
We will highly appreciate your comments!
Kind regards,
Pedro Henrique P. Braga and Katherine Hébert
pedrohbraga.github.io | github.com/katherinehebert
ph.pereirabraga [at] gmail [dot] com | katherine.ahebert [at] gmail [dot] com
Learning Outcomes
By the end of this workshop, participants should be able to:
- Identify and prepare phylogenetic trees and community data structure to perform comparative analyses;
- Understand the most common community phylogenetic structure metrics;
- Perform analyses to extract patterns from the phylogenetic structure on ecological communities;
- Identify, understand and, when adequate, resolve common issues when performing performing community structure;
- Critically think about the assumptions related to linking ecological and evolutionary processes to community structure patterns.
Outline
The main steps, described bellow, are the following:
- Preparation for the tutorial
- Phylogenetic patterns as proxies of community assembly mechanisms
- A brief introduction to phylogenies and community data
- Phylogenies: terminology and basic manipulation
- Community data manipulation
- Within-assemblage phylogenetic structure
- Between-assemblage phylogenetic structure
- Other applications
Preparation for this tutorial
We kindly ask you to run the following code to download the datasets, and install and load the libraries we will use through this course:
# Check for needed packages, and install the missing ones
<- c("ape", "picante",
required.libraries "pez", "phytools",
"vegan", "adephylo",
"phylobase", "geiger",
"mvMORPH", "OUwie",
"hisse", "BAMMtools",
"phylosignal", "Biostrings",
"devtools","ggplot2",
"kableExtra", "betapart", "gridExtra",
"reshape2")
<- required.libraries[!(required.libraries %in% installed.packages()[,"Package"])]
needed.libraries if(length(needed.libraries)) install.packages(needed.libraries)
# Load all required libraries at once
lapply(required.libraries, require, character.only = TRUE)
### Install ggtree from BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("ggtree")
BiocManager
set.seed(1)
# Download files from Marc Cadotte's and Jonathan Davies' book
# "Phylogenies in Ecology: A guide to concepts and methods":
dir.create("data/Jasper")
download.file("https://raw.githubusercontent.com/pedrohbraga/PhyloCompMethods-in-R-workshop/master/data/Jasper/resources/data/Jasper/jasper_data.csv",
"data/Jasper/jasper_data.csv")
download.file("https://raw.githubusercontent.com/pedrohbraga/PhyloCompMethods-in-R-workshop/master/data/Jasper/resources/data/Jasper/jasper_tree.phy",
"data/Jasper/jasper_tree.phy")
# Download the tree files from this paper
# download.file("https://onlinelibrary.wiley.com/action/
# downloadSupplement?doi=10.1111%2Fj.1461-0248.2009.01307.x&attachmentId=179085359",
# destfile = "data/ele_1307_sm_sa1.tre")
# If the download of the above file does not work,
# download the first supplementary material from the following paper
# and place it within your 'data' directory:
# https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1461-0248.2009.01307.x
Alternatively, you can download the .zip
file from the Cadotte Urban Biodiversity & Ecosystem Services (CUBES) Laboratory and extract the files data/Jasper/jasper_tree.phy
and data/Jasper/jasper_data.csv
to your QCBS R Symposium workshop data
directory.
Then, set your working directory to the same directory of the folder CommunityPhylogenetics
using the function setwd(choose.dir())
or your prefered method.
A brief introduction to phylogenies
Phylogenetic trees (i.e., evolutionary tree or cladogram) are branching diagrams illustrating the evolutionary relationships among taxa.
Phylogenetic trees can be constructed based on species’ genes and/or on traits. Phylogenetic reconstruction through analyses of molecular sequences usually consists of three distinct procedures: (i) sequence alignment; (ii) character coding; and (iii) tree building. There is a range of available options for each of these steps, with no simple objective method for choosing between them.
In this course, we will not cover the steps necessary to reconstruct phylogenies. We will assume that a phylogenetic relationship is already available for the taxa of interest.
ape
and the phylo
object in R
The core of phylogenetics analyses in R is the library ape
- an acronym that stands for Analysis of Phylogenetics and Evolution. You can run the following commands to install, load and call help for ape
:
# install.packages("ape")
library(ape)
# help(package = ape)
From ape
, we depict two important functions to read and load trees into R
: read.nexus()
and read.tree()
:
read.nexus()
reads NEXUS formatted trees, which is organized into major units known as blocks and usually have the following basic structure:
#NEXUS
[!This is the data file used for my dissertation]
BEGIN TAXA;
TaxLabels Scarabaeus Drosophila Aranaeus;
END;
BEGIN TREES;
Translate beetle Scarabaeus, fly Drosophila, spider Aranaeus;
Tree tree1 = ((1,2),3);
Tree tree2 = ((beetle,fly),spider);
Tree tree3 = ((Scarabaeus,Drosophila),Aranaeus);
END;
read.tree
reads Newick formatted trees, which commonly have the following format:
(Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147, (P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460)
When phylogenies are created, loaded and handled in ape
, their objects are represented as a list of class phylo
.
Let’s try reading the following text string of a tree (is it in Newick or in NEXUS format?) and see what R
tells us when we call it:
<- read.tree(text="((((Homo:0.21, Pongo:0.21):0.28, Macaca:0.49):0.13,
tree.primates Ateles:0.62):0.38, Galago:1.00);")
tree.primates
Phylogenetic tree with 5 tips and 4 internal nodes.
Tip labels:
Homo, Pongo, Macaca,
Ateles, Galago
Rooted; includes branch lengths.
A class phylo
object distributes the information of phylogenetic trees into six main components:
str(tree.primates)
List of 4
$ edge : int [1:8, 1:2] 6 7 8 9 9 8 7 6 7 8 ...
$ edge.length: num [1:8] 0.38 0.13 0.28 0.21 0.21 0.49 0.62 1
$ Nnode : int 4
$ tip.label : chr [1:5] "Homo" "Pongo" "Macaca" "\nAteles" ...
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
Since phylo
is a list, we can access its elements by using the $
symbol. We can access the matrix edge
by running phylo_object$edge
, which has the beginning and ending node number for all the nodes and tips of a given tree. In summary, this allows us to keep track of the internal and external nodes of the tree:
$edge tree.primates
[,1] [,2]
[1,] 6 7
[2,] 7 8
[3,] 8 9
[4,] 9 1
[5,] 9 2
[6,] 8 3
[7,] 7 4
[8,] 6 5
The vector tip.label
element contains all the labels for all tips in the tree (i.e. the names of your OTUs):
$tip.label tree.primates
[1] "Homo" "Pongo" "Macaca" "\nAteles" "Galago"
There is a little error in the tip labels, right? We can easily rewrite the labels with this:
$tip.label <- c("Homo", "Pongo", "Macaca", "Ateles", "Galago") tree.primates
Another important element is the vector edge.length
, which indicates the length of each edge with relation to the root or base of the tree:
$edge.length tree.primates
[1] 0.38 0.13 0.28 0.21 0.21 0.49 0.62 1.00
Finally, Nnode
contains the number of internal nodes in the tree:
$Nnode tree.primates
[1] 4
One can see all the important information a phylo
object carries from a tree by plotting it with all its elements:
plot(tree.primates,
edge.width = 2,
label.offset = 0.05,
type = "cladogram")
nodelabels()
tiplabels()
add.scale.bar()
Handling and preparing phylogenies
Now that we know the main characteristics of a phylogenetic tree and of a phylo
object, let’s dive into more details on the structure of phylogenetic trees.
Dicotomous and polytomous trees
While some phylogenetic analyses are able to handle polytomies (i.e. more than two OTUs originating from the same node in a tree; or an unresolved clade), one may be interesting in solving the polytomies to produce a binary tree.
<- read.tree(text = "((A, B, C), D);")
t1 # See how A, B and C are within the same round brackets,
# *i.e.* departing from the same node.
plot(t1,
type = "cladogram")
One may check if a tree is binary by using is.binary
function and, in case it is FALSE
, randomly resolve polytomies using the multi2d
function:
is.binary(t1)
[1] FALSE
<- multi2di(t1)
t2 plot(t2,
type = "cladogram")
is.binary(t2)
[1] TRUE
Ultrametric, non-ultrametric and uninformative trees
Ultrametric trees (or approximates of them) can be used to infer both braching and temporal patterns of evolutionary history for a given clade. For a tree to be ultrametric, its matrix has to follow the following strong assumptions:
- Each tip has to be uniquely labelled;
- Each internal node of the tree diverges towards at least two “children”, i.e. is dicotomous;
- Along any path from the root to a tip (or leaf), the numbers labelling internal nodes strictly decrease, i.e. time must strictly increase along the path from the root;
- The distance from the root to all tips of the tree is constant for all clades.
In many ultrametric trees, one can assume that all extant species are sampled at the present time, and branch lengths represent time calibrated in millions of years.
This information is obtained when we infer plausible history from data that reflects time since divergence (i.e. reconstruct evolutionary history based on the molecular clock theory). When a tree violates one of the above assumptions (especially, number 4), a tree is then considered to be non-ultrametric. For instance, a non-ultrametric tree will have branches with different distances towards the tips. While this information reflects different rates of evolution across the phylogeny, many analyses will require an ultrametric tree (i.e. where models assume constant variance and equal means at the tips).
In these cases, one may be interested in “smoothing out” these differences across the evolutionary history to transform a non-ultrametric tree into a ultrametric tree.
One method is to use the mean path length (MPL) that assumes random mutation and a fixed molecular clock to calculate the age of a node as the mean of all the distances from this node to all tips diverging from it. To run this, we use the ape
function chronoMPL
. MPL sometimes returns negative branch lengths, meaning that it should be used with caution.
Another function we can use is chronos
, also from the ape
package. chronos
uses the penalized maximum likelihood method to estimate divergences times.
Let’s try these methods by first creating a random tree, then tansforming its edge lengths using both chronoMPL
and chronos
functions, and plotting them side-by-side:
<- rtree(7)
tree.NonUlt <- chronoMPL(tree.NonUlt)
tree.Ult.MPL <- chronos(tree.NonUlt) tree.Ult.S
Setting initial dates...
Fitting in progress... get a first set of estimates
(Penalised) log-lik = -7.852074
Optimising rates... dates... -7.852074
log-Lik = -7.74304
PHIIC = 49.53
par(mfrow=c(1,3))
plot(tree.NonUlt, edge.width = 2,
cex = 2,
main = "A) Non-ultrametric",
cex.main = 2)
plot(tree.Ult.MPL, edge.width = 2,
cex = 2,
main = "B) Ultrametric chronoMPL",
cex.main = 2)
plot(tree.Ult.S, edge.width = 2,
cex = 2,
main = "C) Ultrametric chronos",
cex.main = 2)
You may verify if any tree is ultrametric using is.ultrametric
:
is.ultrametric(tree.NonUlt)
[1] FALSE
is.ultrametric(tree.Ult.MPL)
[1] TRUE
is.ultrametric(tree.Ult.S)
[1] TRUE
Sometimes, trees may have no branch lengths available, but only the divergence relationships. We often refer to these trees as being non-informative trees (despite the fact that they can still give us some information about branching events):
<- rtree(7)
tree.NonInf $edge.length <- NULL
tree.NonInfplot(tree.NonInf, edge.width = 2,
cex = 2,
main = "D) Uninformative tree",
cex.main = 2)
The picante
package
One of the core packages for hypothesis testing in ecophylogenetics is picante
(Kembel et al. 2010). picante
mainly works with three types of objects: the phylogeny data as a phylo
class object; the community presence-absence (binary) or abundance matrix; and a species-trait matrix.
# install.packages("picante")
library(picante)
# help(package = picante)
Preparing data for community phylogenetic analyses
For this analysis, we will use the data on the Jasper Ridge plant communities available in the book “Phylogenies in Ecology”, written by Marc Cadotte and Jonathan Davies (2016).
<- read.tree("data/Jasper/jasper_tree.phy")
JasperPlants.tree <- read.csv("data/Jasper/jasper_data.csv",
JasperPlants.comm row.names = 1)
Before proceeding to our analyses, we have to make sure that both the community and phylogenetic data match. For this, the names of the tips should match the species names in the dataset. This also means that species present in one dataset should not be absent in the other.
(You thought you were done preparing your dataset, eh?)
For this, we can use the function drop.tip
from the ape
package:
<- drop.tip(phy = JasperPlants.tree,
JasperPlants.cleanTree tip = setdiff(JasperPlants.tree$tip.label,
colnames(JasperPlants.comm)))
You can also match your datasets using the function match.phylo.comm
from picante
, or the function comparative.comm
from pez
.
# Using picante
<- match.phylo.comm(phy = JasperPlants.tree,
JasperPlants.picCleanTree comm = JasperPlants.comm)$phy
<- match.phylo.comm(phy = JasperPlants.tree,
JasperPlants.picCleanComm comm = JasperPlants.comm)$comm
# Using pez
plot(JasperPlants.picCleanTree,
cex = 0.4)
1:4, 1:4] JasperPlants.picCleanComm[
Geranium_dissectum Erodium_botrys Anagallis_arvensis
J11 20 10 5
J110 0 15 0
J12 20 20 0
J13 0 40 0
Epilobium_brachycarpum
J11 0
J110 0
J12 1
J13 0
Phylogenetic patterns as proxies of community assembly mechanisms
Ecologists share the goal of understanding and describing the processes that create the uneven distribution patterns of species, which create the variation in abundance, identity and diversity we know today.
Among many factors that determine the presence of taxa in communities are the abiotic and biotic conditions of such communities (e.g., Diamond 1975; Westoby and Wright 2006; but see Hubbell 2001).
When studying biodiversity, we are often interested in two components: alpha diversity (α) that measures the diversity intrinsic to each community or site and beta diversity (β) that reflects the differences among communities or sites.
Taxon diversity (TD) is the most quantified measure of diversity, but it gives an incomplete information because the evolutionary history underlying spatial patterns is ignored. Since many decades we recognize that many ecological patterns are processes are not independent of the evolution of the lineages involved in generating these patterns.
Since taxa that recently diverged tend to be ecologically similar (Darwin 1859; Lord et al. 1995; Wiens and Graham 2005), a direct link may exist between the evolutionary relatedness of organisms in a community, the characters they possess, and the ecological processes that determine their distribution and abundance.
In fact, many communities exhibit nonrandom patterns of evolutionary relatedness among co-occurring species (reviewed in Webb et al. 2002): a phenomenon we refer to as phylogenetic community structure.
Within-assemblage phylogenetic structure
Phylogenetic \(\alpha\) diversity (PD)
One simple way to measure evolutionary diversity within a community is through Daniel Faith’s PD metric (Faith, 1992; http://goo.gl/wM08Oy).
Faith’s PD (1992) sums the branch lengths of all co-occurring species in a given site, from the tips to the root of the phylogenetic tree.
Higher PD values are given to communities that has more evolutionary divergent taxa and older history, while lower PD values represent assemblages that have taxa with more recent evolutionary history.
Faith’s PD can be implemented using the pd
function in the picante
package.
In addition to Faith’s PD, the pd
function also returns species richness (SR).
SR is the same as observed species richness (\(S_{obs}\)) or taxonomic \(\alpha\) diversity.
<- pd(samp = JasperPlants.picCleanComm,
Jasper.PD tree = JasperPlants.picCleanTree,
include.root = FALSE)
head(Jasper.PD) # See the first six rows of the resulting matrix
PD SR
J11 831.4887 13
J110 501.8703 9
J12 769.9249 14
J13 712.0532 13
J14 640.5896 11
J15 640.2884 13
And how do PD estimates compare with SR?
cor.test(Jasper.PD$PD, Jasper.PD$SR)
Pearson's product-moment correlation
data: Jasper.PD$PD and Jasper.PD$SR
t = 9.0965, df = 28, p-value = 7.457e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7321097 0.9338472
sample estimates:
cor
0.8643903
plot(Jasper.PD$PD,
$SR,
Jasper.PDxlab = "Phylogenetic Diversity", ylab = "Species Richness",
pch = 16)
With this, we can see that interpreting PD across sites that have different richness can be complicated. To overcome this problem, we can compare PD values to those expected by cahnce. To do this, we can use null models that will randomize some of the aspects of the assemblages or of the evolutionary relationship between species.
Randomizations and null models
We are often interested in assessing whether or not observed patterns differ from null expectation. picante
has many functions that allow us to test hypotheses under different types of null models:
Null Model | Description |
---|---|
taxa.labels | Shuffles taxa labels across tips of phylogeny (across all taxa included in phylogeny) |
richness | Randomizes community data matrix abundances within samples (maintains sample species richness) |
frequency | Randomizes community data matrix abundances within species (maintains species occurence frequency) |
sample.pool | Randomizes community data matrix by drawing species from pool of species occurring in at least one community (sample pool) with equal probability |
phylogeny.pool | Randomize community data matrix by drawing species from pool of species occurring in at least one community (sample pool) with equal probability |
independentswap | Randomizes community data matrix with the independent swap algorithm (Gotelli 2000) maintaining species occurrence frequency and sample species richness |
trialswap | Randomizes community data matrix with the trial-swap algorithm (Miklos & Podani 2004) maintaining species occurrence frequency and sample species richness |
The steps are straight-forward and are applicable to multiple frameworks!
We can take a very simple null model that will randomly shuffle the tip labels on the phylogeny (corresponding to the taxa.labels in picante
). After shuffling the tips of our phylogeny, we will recalculate our metrics (in this case PD). We will save this information and repeat the previous steps an \(n\) number of times, until we have our distribution of null values.
We can then use these values to obtain a standardized metric by calculating an effect size against the null values using a \(z\)-value. This \(z\) value will be a dimensionless value that will indicate the number of standard deviations an observation is from the mean of the distribution. We can calculate it by:
\(z = \frac{x - \mu}{\sigma}\)
where \(x\) is our observation (for example, the PD for a given community), \(\mu\) is the mean and \(sigma\) is the standard deviation of the null or random distribution.
If the value that we obtain from \(z\) is positive, it means that the number of standard deviations are above the mean; and if it is negative, the number of standard deviations are below the mean.
Thus, the standardized value for PD is:
\[z_{PD} = \frac{PD_{obs} - \mu_{PD_{null}}}{\sigma_{PD_{null}}}\]
We can calculate the standardized effect size (ses) of PD for each community using the ses.pd()
function in picante
:
# This is a computationally intensive process and take some time to run
<- ses.pd(samp = JasperPlants.picCleanComm,
Jasper.ses.pd tree = JasperPlants.picCleanTree,
null.model = "taxa.labels",
runs = 1000)
head(Jasper.ses.pd)
ntaxa pd.obs pd.rand.mean pd.rand.sd pd.obs.rank pd.obs.z pd.obs.p
J11 13 831.4887 793.6866 62.90851 733 0.6009067 0.73226773
J110 9 501.8703 622.8810 61.81899 28 -1.9575002 0.02797203
J12 14 769.9249 829.1513 62.50560 174 -0.9475372 0.17382617
J13 13 712.0532 791.1854 65.65384 122 -1.2052945 0.12187812
J14 11 640.5896 709.7741 62.17668 142 -1.1127091 0.14185814
J15 13 640.2884 793.9467 64.83166 16 -2.3701116 0.01598402
runs
J11 1000
J110 1000
J12 1000
J13 1000
J14 1000
J15 1000
Reflection time!
At the community level, phylogenetic clustering which is often regarded as evidence of environmental filtering, while phylogenetic overdispersion is often associated to limiting similarity. Nevertheless, these hypothetical links assume many ecological and evolutionary processes,may be weakened when there is a disparity between observed patterns and the assumed on-going processes. As a consequence, no one should interprete non-random phylogenetic patterns without care and detailed investigation.
Nearest taxon index (NTI)
Another way to assess the phylogenetic structure of communities is through the calculation of the nearest taxon index (NTI). NTI differs from NRI by being based on the mean nearest neighbour phylogenetic distance (MNTD), instead of the overall mean pairwise distances (MPD).
MNTD is obtained by calculating the mean phylogenetic distance of each taxon of a given community to its nearest neighbour in a tree. By doing this, one can observe that NTI is more sensitive to the clustering or overdispersion patterns happening closer to the tips of a tree, while NRI represents an overall pattern of structuring across the whole phylogenetic tree.
As in NRI, higher values of NRI indicate phylogenetic clustering and lower values of NRI indicate phylogenetic overdispersion.
# Estimate Standardized Effect Size of the MNTD on the cophenetic matrix
<- ses.mntd(JasperPlants.picCleanComm,
JasperPlants.ses.mntd
JasperPlants.cophenDist, null.model = "taxa.labels",
abundance.weighted = FALSE,
runs = 100)
head(JasperPlants.ses.mntd)
ntaxa mntd.obs mntd.rand.mean mntd.rand.sd mntd.obs.rank mntd.obs.z
J11 13 76.65059 79.27968 12.27747 47 -0.2141392
J110 9 71.32136 88.93312 17.20764 17 -1.0234849
J12 14 67.04067 76.60485 11.94680 24 -0.8005648
J13 13 65.27537 79.56577 12.58896 15 -1.1351535
J14 11 74.54219 83.60746 14.28146 24 -0.6347575
J15 13 47.69848 79.28997 11.10752 1 -2.8441529
mntd.obs.p runs
J11 0.46534653 100
J110 0.16831683 100
J12 0.23762376 100
J13 0.14851485 100
J14 0.23762376 100
J15 0.00990099 100
# Calculate NTI
<- as.matrix(-1 *
JasperPlants.NTI 2] - JasperPlants.ses.mntd[,3]) /
((JasperPlants.ses.mntd[,4]))
JasperPlants.ses.mntd[,rownames(JasperPlants.NTI) <- row.names(JasperPlants.ses.mntd)
colnames(JasperPlants.NTI) <- "NTI"
head(JasperPlants.NTI)
NTI
J11 0.2141392
J110 1.0234849
J12 0.8005648
J13 1.1351535
J14 0.6347575
J15 2.8441529
We can then compare between the wet and dry Jasper Ridge plots!
Let us do it with MPD!
$habitat <- c(rep("wet", 15),
JasperPlants.ses.mpdrep("dry", 15))
# Test whether differences between wet and dry habitats are significant
t.test(JasperPlants.ses.mpd$mpd.obs.z ~ JasperPlants.ses.mpd$habitat)
Welch Two Sample t-test
data: JasperPlants.ses.mpd$mpd.obs.z by JasperPlants.ses.mpd$habitat
t = 1.4657, df = 27.149, p-value = 0.1542
alternative hypothesis: true difference in means between group dry and group wet is not equal to 0
95 percent confidence interval:
-0.2152359 1.2926470
sample estimates:
mean in group dry mean in group wet
-0.1810248 -0.7197303
# Compare ses.mpd.z between habitats
plot.new()
boxplot(mpd.obs.z ~ habitat,
data = JasperPlants.ses.mpd,
xlab = "Habitat",
ylab = "SES(MPD)")
abline(h = 0, col = "gray")
And now, with MNTD:
$habitat <- c(rep("wet", 15),
JasperPlants.ses.mntdrep("dry", 15))
# Compare ses.mpd.z between habitats
plot.new()
# Test whether differences between wet and dry habitats are significant
t.test(JasperPlants.ses.mntd$mntd.obs.z ~ JasperPlants.ses.mntd$habitat)
Welch Two Sample t-test
data: JasperPlants.ses.mntd$mntd.obs.z by JasperPlants.ses.mntd$habitat
t = -0.35446, df = 20.711, p-value = 0.7266
alternative hypothesis: true difference in means between group dry and group wet is not equal to 0
95 percent confidence interval:
-0.8973162 0.6361631
sample estimates:
mean in group dry mean in group wet
-1.0850761 -0.9544996
boxplot(mntd.obs.z ~ habitat,
data = JasperPlants.ses.mntd,
xlab = "Habitat",
ylab = "SES(MNTD)")
abline(h = 0, col = "gray")
Between-assemblage phylogenetic structure
Phylogenetic beta diversity (PBD)
One may not be interested only in the patterns of \(\alpha\) diversity, but in understanding how communities are evolutionary distinct from each other, i.e. knowing the pairwise phylogenetic \(\beta\) diversity.
You might already know \(\beta\)-diversity as a measure of the dissimilarity between communities in terms of their composition. As you might guess, phylogenetic beta diversity is a measure of the dissimilarity between communities in terms of the evolutionary lineages they contain.
Evaluating phylogenetic \(\beta\)-diversity can provide insight about where communities differ in terms of their evolutionary histories, and even about which clades are driving these differences.
PhyloSor
Many metrics are available to assess the phylogenetic \(\beta\) diversity of communities. Here, we will start with the PhyloSor index, which is the phylogenetic analog of the Sørensen index (Bryant et al., 2008; Swenson, 2011; Feng et.al., 2012). It is based on the total length of branches shared and unshared between paired communities.
PhyloSor can be calculated using the function phylosor
from picante
:
Let’s first import the data:
# Import site by species matrix
<- read.csv("data/Jasper/jasper_data.csv", row.names = 1)
comm
# Import phylogeny
<- read.tree("data/Jasper/jasper_tree.phy") phy
Then, we can calculate and visualize PhyloSor:
# Calculate observed PhyloSor
<- phylosor(samp = comm,
jasper.sor tree = phy)
# Let's take a look!
# (note: darker colour = more dissimilar)
heatmap(as.matrix(jasper.sor),
Rowv = NA,
Colv = NA) # without default clustering
Standardized effect size (SES_PhyloSor)
We can go one step further and test whether the observed PBD values differ from a null expectation, using null models.
To do this, we will follow this flow chart, which outlines the process of calculating standardized effect sizes of phylogenetic \(\beta\)-diversity measures (in this case, PhyloSor).
First, we will use the function phylosor.rnd
to obtain PhyloSor values of phylogenetic \(\beta\)-diversity obtained by randomizing taxa labels on the phylogeny:
# generate 100 null expectations of Phylosor
<- phylosor.rnd(samp = comm,
jasper.sor.rndTaxa tree = phy,
cstSor = TRUE,
null.model = "taxa.labels", # Pay attention to the null models!
runs = 100)
We then need to test whether our observed phylogenetic \(\beta\)-diversity differs from this random expectation. There is no standardized effect size function available within picante
, so we will create our own null model to test our hypothesis:
# create function to calculate the standardized effect size of a phylogenetic beta-diversity metric. In this case, we are using it for PhyloSor.
# (this is modelled after ses.pd by Pedro!)
<- function(obs, rand){
ses.PBD
# first, we make sure our observed PhyloSor values are numeric
<- as.numeric(obs)
pbd.obs
# then, we take the mean of the 100 null expectations we generated
<- t(as.data.frame(lapply(rand, as.vector)))
rand <- apply(rand, MARGIN = 2,
pbd.mean FUN = mean,
na.rm = TRUE)
# as well as their standard deviation
<- apply(rand,
pbd.sd MARGIN = 2,
FUN = sd,
na.rm = TRUE)
# now, we can calculate the standardized effect size (SES)!
<- (pbd.obs - pbd.mean)/pbd.sd
pbd.ses
# rank observed PhyloSor (we use this to calculate p-values for SES)
<- apply(X = rbind(pbd.obs, rand),
pbd.obs.rank MARGIN = 2,
FUN = rank)[1, ]
<- ifelse(is.na(pbd.mean), NA, pbd.obs.rank)
pbd.obs.rank
# return results in a neat dataframe
data.frame(pbd.obs,
pbd.mean,
pbd.sd,
pbd.obs.rank,
pbd.ses, pbd.obs.p = pbd.obs.rank/(dim(rand)[1] + 1))
}
Using this function, we can determine whether communities are:
- more dissimilar than expected by chance (i.e. obs > mean(null) = positive SES_PhyloSor)
- more similar than expected by chance (i.e. obs < mean(null) = negative SES_PhyloSor)
- randomly dissimilar, i.e. as dissimilar you would expect by random chance (obs = mean(null) = 0 PhyloSor).
In other words, this measure can tell you whether community structure is random or not - so, you can find out whether some evolutionary (and/or ecological) processes are shaping the \(\beta\)-diversity you’ve observed.
# calculate the standardized effect size of Phylosor (SES_PhyloSor) using our function
<- ses.PBD(obs = jasper.sor,
jasper.ses.sor rand = jasper.sor.rndTaxa)
pbd.obs | pbd.mean | pbd.sd | pbd.obs.rank | pbd.ses | pbd.obs.p | |
---|---|---|---|---|---|---|
45 | 0.527 | 0.573 | 0.052 | 22 | -0.887 | 0.218 |
46 | 0.647 | 0.594 | 0.053 | 87 | 1.001 | 0.861 |
47 | 0.589 | 0.575 | 0.055 | 63 | 0.260 | 0.624 |
48 | 0.628 | 0.618 | 0.055 | 56 | 0.187 | 0.554 |
49 | 0.652 | 0.732 | 0.034 | 1 | -2.369 | 0.010 |
50 | 0.687 | 0.701 | 0.050 | 38 | -0.273 | 0.376 |
51 | 0.770 | 0.756 | 0.043 | 65 | 0.316 | 0.644 |
52 | 0.671 | 0.716 | 0.041 | 13 | -1.089 | 0.129 |
53 | 0.665 | 0.684 | 0.047 | 35 | -0.406 | 0.347 |
54 | 0.575 | 0.638 | 0.052 | 12 | -1.214 | 0.119 |
55 | 0.649 | 0.664 | 0.046 | 34 | -0.331 | 0.337 |
Let’s also visualize SES_PhyloSor as a heatmap!
# note: converting a vector to a distance matrix can be tricky, so let's
# use our PhyloSor matrix as a template, since it already has row/column
# names in the right places:
<- jasper.sor
jasper.ses.sor.m # assign SES_PhyloSor value to each cell
for(i in 1:length(jasper.sor)) {jasper.ses.sor.m[i] <- jasper.ses.sor$pbd.ses[i]}
# plot heatmap using ggplot this time!
library(ggplot2)
library(reshape2)
%>% as.matrix() %>% melt() %>% # manipulate data into a format that works for ggplot
jasper.ses.sor.m ggplot() +
geom_tile(aes(x = Var1, y = Var2, fill = value)) + # create the heatmap
scale_fill_gradient2(low = "blue", mid = "white", high = "red") + # create a diverging color palette
xlab("sites") + ylab("sites") + labs(fill = "SES_PhyloSor") + # edit axis and legend titles
theme(axis.text.x = element_text(angle = 90)) # rotates x axis labels
Looking at this heatmap, we can see that some communities are a lot more dissimilar than expected by chance (high positive SES_PhyloSor), such as J25-J32. What processes could explain this higher-than-expected \(\beta\)-diversity?
Some communities are also more similar than expected by chance (very negative SES_PhyloSor), like J26-J27. What processes could explain this lower-than-expected \(\beta\)-diversity?
UniFrac
UniFrac is another presence-absence phylogenetic \(\beta\)-diversity metric, which measures the unique fraction of the phylogeny that is unshared between a pair of communities. Like PhyloSor, this metric is mostly sensitive to dissimilarity at the tips of the phylogeny, and does not easily pick up on basal clades.
To compute the UniFrac metric, we can use the function unifrac
in the package picante
.
<- unifrac(comm = comm, tree = phy) jasper.uni
Let’s compare UniFrac to PhyloSor!
library(reshape2)
<- jasper.sor %>% as.matrix %>% melt() %>%
sor.heatmap ggplot() + geom_tile(aes(x = Var1, y = Var2, fill = value)) +
ggtitle("PhyloSor") + xlab("sites") + ylab("sites")
<- jasper.uni %>% as.matrix %>% melt() %>%
uni.heatmap ggplot() + geom_tile(aes(x = Var1, y = Var2, fill = value)) +
ggtitle("UniFrac") + xlab("sites") + ylab("sites")
::grid.arrange(sor.heatmap, uni.heatmap, ncol = 2) gridExtra
Exercise: SES_UniFrac
There is no function to calculate the standardized effect size of UniFrac in one shot. However, it is possible to write the code for this yourself! For this exercise, we challenge you to calculate SES_UniFrac and plot a heatmap showing the results.
The first step would be to randomize the community matrix (50 randomizations - it’s a bit long to do!). You can do this with the function randomizeMatrix
from the picante
package. We suggest you choose a null model that randomly shuffles species across sites while maintaining the species richness of each site (hmm.. which one could that be? ?randomize_matrix
).
# create a list to store of randomized community matrics
<- list()
jasper.rnd for(i in 1:50){ # randomize the community 50 times
<- randomizeMatrix(comm, null.model = "richness")
jasper.rnd[[i]] }
Now that we have our random community matrices, let’s calculate UniFrac for each of these!
<- lapply(jasper.rnd, unifrac, tree = phy) jasper.uni.rnd
We can then use the ses.PBD
function we created earlier to calculate a standardized effect size for UniFrac.
<- ses.PBD(obs = jasper.uni, rand = jasper.uni.rnd) jasper.ses.uni
pbd.obs | pbd.mean | pbd.sd | pbd.obs.rank | pbd.ses | pbd.obs.p | |
---|---|---|---|---|---|---|
45 | 0.642 | 0.567 | 0.092 | 37 | 0.819 | 0.725 |
46 | 0.522 | 0.561 | 0.080 | 16 | -0.492 | 0.314 |
47 | 0.583 | 0.578 | 0.078 | 27 | 0.055 | 0.529 |
48 | 0.542 | 0.565 | 0.078 | 17 | -0.296 | 0.333 |
49 | 0.517 | 0.569 | 0.086 | 10 | -0.600 | 0.196 |
50 | 0.476 | 0.573 | 0.071 | 5 | -1.373 | 0.098 |
51 | 0.375 | 0.559 | 0.084 | 2 | -2.188 | 0.039 |
52 | 0.495 | 0.585 | 0.064 | 7 | -1.422 | 0.137 |
53 | 0.502 | 0.564 | 0.069 | 10 | -0.909 | 0.196 |
54 | 0.596 | 0.571 | 0.076 | 26 | 0.332 | 0.510 |
55 | 0.520 | 0.559 | 0.077 | 13 | -0.506 | 0.255 |
Now, let’s visualize SES_UniFrac as a heatmap!
# note: converting a vector to a distance matrix can be tricky, so let's
# use our UniFrac matrix as a template, since it already has row/column
# names in the right places:
<- jasper.uni
jasper.ses.uni.m # assign UniFrac value to each cell
for(i in 1:length(jasper.uni)) {jasper.ses.uni.m[i] <- jasper.ses.uni$pbd.ses[i]}
# plot heatmap using ggplot
%>% as.matrix() %>% melt() %>% # manipulate data into a format that works for ggplot
jasper.ses.uni.m ggplot() +
geom_tile(aes(x = Var1, y = Var2, fill = value)) + # create the heatmap
scale_fill_gradient2(low = "blue", mid = "white", high = "red") + # create a diverging color palette
xlab("sites") + ylab("sites") + labs(fill = "SES_UniFrac") + # edit axis and legend titles
theme(axis.text.x = element_text(angle = 90)) # rotates x axis labels
There’s a lot of negative SES_UniFrac values here! What does that tell us about the phylogenetic dissimilarity between sites?
Decomposing phylogenetic beta diversity
Turnover & nestedness
The beta diversity we’ve been measuring can actually be broken down into two components (Baselga, 2010; Leprieur et al., 2012):
Nestedness: species loss/gain between sites attributable to differences in richness.
Turnover: the replacement of species between sites.
To illustrate this, let’s imagine three islands with different sets of species on them - like these!
If the \(\beta\)-diversity between islands is completely due to nestedness, you would find that the communities on species-poor islands are always subsets of the more species-rich islands’ communities (sites A1-A3 below).
If the \(\beta\)-diversity between islands is completely due to turnover, communities on each island would be a completely different set of species (sites B1-B3 below).
If the \(\beta\)-diversity between islands is due to both nestedness and turnover, each island community might contain a subset of species in common with species-rich islands, as well as a set of species that is completely replaced between sites (sites C1-C3 below).
So, what is the point of breaking \(\beta\)-diversity down into these 2 components?
Overall, doing this allows you to better understand how communities differ in their composition, which provides insight about the assembly processes that are shaping them!
Turnover allows us to test hypotheses about why some species are replaced between sites. Is it because they are “filtered” by local abiotic and/or biotic conditions? Is it because they cannot disperse between sites? Are there historical barriers?
Nestedness allows us to test hypotheses about why certain species are “lost” or “gained” between sites. Is it because selective extinctions have occurred in some sites? Is it because a few species are very good dispersers and can therefore be anywhere?
Decomposing PhyloSor
Now that we’ve established what turnover and nestedness are, let’s move on to actually breaking down the \(\beta\)-diversity we’ve observed into these two components. Baselga (2010) proposed an additive partitioning framework that decomposes the pair-wise Sørensen’s dissimilarity index into two additive components representing (i) ‘true’ turnover of species and (ii) richness differences between nested communities. In 2012, Leprieur et al. (2012) extended Baselga’s (2010) framework to decompose PhyloSor into two components accounting for ‘true’ phylogenetic turnover and phylogenetic diversity gradients, respectively.
Baselga (2012) published the package betapart
which allows us to do this decomposition in R, with the function phylo.beta.pair()
. (Note: the betapart
package also works for taxonomic and functional \(\beta\)-diversity!)
First, we need to convert our site x species matrix to presences/absences.
<- comm
preabs > 0] <- 1 preabs[preabs
Now, we can calculate PhyloSor and its decomposition into turnover and nestedness:
library(betapart)
<- phylo.beta.pair(x = preabs,
phylosor.betapart tree = phy,
index.family = "sorensen")
We now have 3 matrices:
-phylo.beta.sim
: turnover distance matrix
-phylo.beta.sne
: nestedness distance matrix
-phylo.beta.sor
: PhyloSor distance matrix
Let’s take a look at heatmaps of the first 2 matrices:
# Turnover matrix:
<- phylosor.betapart$phylo.beta.sim %>%
phylosor.turn as.matrix() %>% melt() %>%
ggplot() + geom_tile(aes(x = Var1, y = Var2, fill = value)) + # create the heatmap
xlab("sites") + ylab("sites") + labs(fill = "Turnover") + # edit axis and legend titles
scale_fill_viridis_c(option = "plasma", limits = range(0,0.5)) +
theme(axis.text.x = element_text(angle = 90)) # rotates x axis labels
# Nestedness matrix:
<- phylosor.betapart$phylo.beta.sne %>%
phylosor.nest as.matrix() %>% melt() %>%
ggplot() + geom_tile(aes(x = Var1, y = Var2, fill = value)) + # create the heatmap
xlab("sites") + ylab("sites") + labs(fill = "Nestedness") + # edit axis and legend titles
scale_fill_viridis_c(option = "plasma", limits = range(0,0.5)) +
theme(axis.text.x = element_text(angle = 90)) # rotates x axis labels
# plot both heatmaps next to each other
::grid.arrange(phylosor.turn, phylosor.nest, ncol = 2) gridExtra
We can also use a violin plot to look at the distribution of turnover and nestedness values, as well as their median:
# create dataframe
<- data.frame(turnover = as.vector(phylosor.betapart$phylo.beta.sim),
df nestedness =as.vector(phylosor.betapart$phylo.beta.sne))
# create violin plot
%>% melt() %>% ggplot() +
df geom_violin(aes(x = variable, y = value, fill = variable),
draw_quantiles = 0.5) + # show median
theme_classic()
In both plots, we can see that turnover is contributing more to the \(\beta\)-diversity between sites, compared to nestedness. This means species are usually being replaced between sites - and you could test some hypotheses to explain why that is!
Going further: Applications of these methods
This tutorial has mainly covered how to measure phylogenetic \(\alpha\) and \(\beta\) diversity. This is often only the first step of many studies. Here are some examples of things that people have done, and some ideas for things you could do with your own work!
1. Identifying hotspots of phylogenetic diversity for conservation.
Making decisions and laws to govern the conservation of biodiversity is often difficult, because there are so many different aspects of diversity than can be prioritized. Much of the time, the goal is to protect areas containing certain key species, or to protect the most species-rich communities, to ensure efficient use of conservation resources. Community phylogenetics can add further insight into this decision-making if the goal is to account for evolutionary history.
For example, some conservation projects might prioritize communities that harbour a high diversity of evolutionary lineages (i.e. phylogenetic \(\alpha\)-diversity), because these areas represent many long and irreplaceable evolutionary histories. Phylogenetic \(\beta\)-diversity would provide a metric to detect communities with more “unique” lineages - or more “evolutionary rare” communities - within the context of the region under assessment. Both of these would simply require spatial coordinates for the sites in our site x species matrix.
There are already several examples of this - here’s one that uses phylogenetic \(\beta\)-diversity to determine the spatial mismatch between protected areas and phylogenetic “hotspots”:
2. Testing theories of how communities are assembled.
For a long time, our understanding of community assembly - or how species come to colonize, persist, and coexist with other species in a community - was mostly limited to how ecological processes (competitive exclusion, environmental filtering, dispersal limitation, etc.) shape species composition. As community phylogenetics have evolved, community ecologists have been able to test hypotheses about how evolutionary processes shape phylogenetic community composition and structure as well.
For example, you can assess the relationship between phylogenetic \(\beta\)-diversity and the environmental distances between sites to determine the importance of environmental filtering over evolutionary time scales (which would be driven by mechanisms like adaptation and speciation). You can also assess this relationship between \(\beta\)-diversity and spatial distances between sites, which would give you insight into how dispersal limitation has shaped communities via its influence on adaptation, speciation, and extinctions too.
3. Delineating ecoregions or biomes.
Community phylogenetics have been increasingly used to reevaluate and delineate biome and ecoregion boundaries, so they can better reflect the evolutionary history that has shaped the distinctiveness of regional biotas. This is called phylogenetic regionalization.
Phylogenetic \(\beta\)-diversity measures like PhyloSor or turnover allow you to assess the dissimilarities between biotas that can determine biome or regional boundaries based on evolutionary distinctiveness.
For example, Holt et al. (2013) used phylogenetic \(\beta\)-diversity to reevaluate Wallace’s zoogeographic regions worldwide for birds, mammals, and amphibians:
Daru et al. (2016) also used phylogenetic turnover to delineate regions of evolutionary distinctiveness of woody plants in southern Africa:
Here’s another example of phylogenetic regionalization, this time delineating marine plant regions (Daru et al. 2017):
Thank you!
Thank you for attending and participating in this workshop about community phylogenetics!
We hope you now have a working grasp of how to prepare and use phylogenetic trees and community data. You now have the conceptual and analytical tools to evaluate metrics of phylogenetic community structure and composition, both at the alpha and beta levels!
Now, go forth and conquer (and happy Thanksgiving)!