Discrete Trait Evolution
With a phylogenetic network structure inferred, we can now estimate how quickly traits have evolved over time using a likelihood model. These traits should be discrete characteristics of a species such as feather color, diet type, or DNA in aligned genetic sequences.
Discrete trait data
As with continuous trait evolution, we assume a fixed network, correctly rooted, with branch lengths proportional to calendar time. We start with a network, then add data about the tips of this network.
The simplest way is to use a vector of species names with a data frame of traits:
julia> # read in network net = readTopology("(O:4,(A:3,((B:0.4)#H1:1.6::0.92,((#H1:0::0.08,C:0.4):0.6,(D:.2,E:.2):0.8):1):1):1);"); # read in trait data
julia> species = ["C","A","D","B","O","E"];
julia> dat = DataFrame(trait=["hi","lo","lo","hi","lo","lo"])
6×1 DataFrame Row │ trait │ String ─────┼──────── 1 │ hi 2 │ lo 3 │ lo 4 │ hi 5 │ lo 6 │ lo
If your species names and trait data are in the same data frame, read in your data frame then subset the data like this:
dat = DataFrame(species=["C","A","D","B","O","E"], trait=["hi","lo","lo","hi","lo","lo"]);
species = dat.species # or: dat[!, :species]
select!(dat, Not(:species)) # select all columns except for :species; modifies dat in place
Let's plot the network and map the data onto it:
using RCall, PhyloPlots
R"par"(mar=[0,0,0,0]); # to reduce margins
res = plot(net, tipoffset=0.3); # the results "res" provides point coordinates, to use for data annotation
o = [findfirst(isequal(tax), species) for tax in tipLabels(net)] # 5,2,4,1,3,6: order to match taxa from "species" to tip labels
isequal(species[o], tipLabels(net)) # true :)
traitcolor = map(x -> (x=="lo" ? "grey" : "red"), dat.trait[o])
leaves = res[13][!,:lea]
R"points"(x=res[13][leaves,:x] .+0.1, y=res[13][leaves,:y], pch=16, col=traitcolor, cex=1.5); # adds grey & red points
R"legend"(x=1, y=7, legend=["hi","lo"], pch=16, col=["red","grey"],
title="my trait", bty="n",var"title.adj"=0);
# next: add to gene flow edge the proportion γ of genes affected
hi = findfirst([!e.isMajor for e in net.edge]) # 6 : "h"ybrid "i"ndex: index of gene flow edge (minor hybrid) in net: horizontal segment
R"text"(res[14][hi,:x]-0.3, res[14][hi,:y]-0.1, res[14][hi,:gam], col="deepskyblue", cex=0.75); # add the γ value
Substitution models
After reading in your data, choose a model to describe how evolutionary changes (or substitutions, in the case of DNA) happened over time. Available Markov substitution models are described below.
For general trait types, use one of these three models:
:BTSM
Binary Trait Substitution Model (2 states, rates unconstrained):ERSM
Equal Rates Substitution Model (k
states, all transitions possible with equal rates):TBTSM
Two Binary Trait Substitution Model (though not fully implemented yet)
Inference
To infer evolutionary rates, run fitdiscrete
on the network and data. It will calculate the maximum likelihood score of one or more discrete trait characters at the tips on a fixed network.
- Along each edge, evolutionary changes are modeled with a continous time Markov model.
- At a hybrid node, the trait is assumed to be inherited from one or the other of its parents (immediately before the reticulation event), with probabilities equal to the inheritance γ of each parent edge, which is given by the network.
- At the root of the network, a uniform distribution among the possible states is assumed a priori.
- The model ignores incomplete lineage sorting (e.g. hemiplasy).
parameter estimation & model fit
The example below if for a binary trait, first using a model assuming equal rates (from lo
to hi
and from hi
to lo
); then using a model allowing for distinct rates. The option optimizeQ=false
causes transition rates to stay at their starting values, without being optimized.
julia> s1 = fitdiscrete(net, :ERSM, species, dat; optimizeQ=false)
PhyloNetworks.StatisticalSubstitutionModel: Equal Rates Substitution Model with k=2, all rates equal to α=0.07042. rate matrix Q: hi lo hi * 0.0704 lo 0.0704 * on a network with 1 reticulations data: 6 species 1 trait log-likelihood: -5.38599
julia> s2 = fitdiscrete(net, :BTSM, species, dat; optimizeQ=false)
PhyloNetworks.StatisticalSubstitutionModel: Binary Trait Substitution Model: rate hi→lo α=0.07042 rate lo→hi β=0.07042 on a network with 1 reticulations data: 6 species 1 trait log-likelihood: -5.38599
The default rates, which act as starting value if rates were to be optimized, are chosen equal to the inverse of the total edge lengths in the network (or 1/ntax if all branch lengths are missing).
By default optimizeQ = true
, such that fitdiscrete
estimates the parameters of the rate matrix Q.
julia> s3 = fitdiscrete(net, :ERSM, species, dat)
PhyloNetworks.StatisticalSubstitutionModel: Equal Rates Substitution Model with k=2, all rates equal to α=0.70216. rate matrix Q: hi lo hi * 0.7022 lo 0.7022 * on a network with 1 reticulations data: 6 species 1 trait log-likelihood: -3.76738
julia> s4 = fitdiscrete(net, :BTSM, species, dat)
PhyloNetworks.StatisticalSubstitutionModel: Binary Trait Substitution Model: rate hi→lo α=0.96783 rate lo→hi β=0.55896 on a network with 1 reticulations data: 6 species 1 trait log-likelihood: -3.58702
To compare the two models, we can use the Akaike criterion.
julia> using StatsBase
julia> aic(s3)
9.534758444197102
julia> aic(s4)
11.174046169879334
Here, the equal-rate model is slightly favored (lower AIC), so we will use s3
below.
ancestral state prediction
This is traditionally called "ancestral state reconstruction", but we do not actually reconstruct anything. We make predictions for (past of present-day) values, hopefully with some measure to quantify our uncertainty.
julia> # show(ancestralStateReconstruction(s3), allrows=true) ancestralStateReconstruction(s3)
13×4 DataFrame Row │ nodenumber nodelabel hi lo │ Int64 String Float64 Float64 ─────┼──────────────────────────────────────────── 1 │ 1 O 0.0 1.0 2 │ 2 A 0.0 1.0 3 │ 3 B 1.0 0.0 4 │ 4 C 1.0 0.0 5 │ 5 D 0.0 1.0 6 │ 6 E 0.0 1.0 7 │ 7 7 0.497566 0.502434 8 │ 8 8 0.497042 0.502958 9 │ 9 9 0.517993 0.482007 10 │ 10 10 0.476645 0.523355 11 │ 11 11 0.0226845 0.977316 12 │ 12 12 0.755186 0.244814 13 │ 13 H1 0.796886 0.203114
Rows 1-6 correspond to the tips, with known values. We see much prediction uncertainty at most of the internal nodes. To see where these internal nodes (7-13/H1) are, we need to look at the network stored within the fitted model. This network might differ somewhat from the input network in case taxa with missing data where pruned, and with edges possibly renumbered.
plot(s3.net, shownodenumber=true, shownodelabel=true, tipoffset=0.2);
Looking back at the posterior probabilities of states "hi" and "lo" at each node from the ancestral 'prediction' table above, we see that there is more uncertainty near the root, and less uncertainty near the tips. The most recent common ancestor of D and E (node 11), in particular, is predicted to be "lo" with fairly high certainty.
impact of gene flow on the trait
An interesting question is whether there is evidence that B obtained it's "hi" state via gene flow. The prior probability for this is γ: the supposedly known proportion of genes inherited via gene flow, which is part of the network (along with branch lengths). Here, this prior probability of trait inheritance via gene flow is:
julia> net.edge[6].gamma # the minor hybrid edge was edge 6, from above
1.0
We can compare this to the posterior probability, and get a Bayes factor to compare the two hypotheses: gene flow vs. vertical inheritance.
julia> exp(s3.priorltw[1]) # prior: for vertical inheritance. "ltw" = log tree weight
0.92
julia> exp(s3.priorltw[2]) # prior: for gene flow inheritance, same as γ above
0.07999999999999999
julia> postltw = PhyloNetworks.posterior_logtreeweight(s3)
2-element Vector{Float64}: -0.10479264829852397 -2.3077104629296414
julia> exp(postltw[2]) # posterior: for gene flow inheritance
0.09948877423615365
julia> function geneflowBF(fit) postltw = PhyloNetworks.posterior_logtreeweight(fit) exp(postltw[2] - postltw[1] + fit.priorltw[1] - fit.priorltw[2]) end
geneflowBF (generic function with 1 method)
julia> geneflowBF(s3)
1.2705237547097568
We get a Bayes factor greater than 1, so there is more evidence that the "hi" value of B was inherited via gene flow, than via vertical inheritance. But the Bayes factor is just barely above 1, so the evidence is very equivocal. This may not be surprising given that gene flow occurred between fairly closely related species, and that the data set is very small.
Trait simulation
randomTrait
can simulate traits along a known network. For example, we can define a binary trait model with states "carnivory" (state 1) and "non-carnivory" (state 2), then ask for a trait to be simulated along our network. We can ask for 3 independent simulations, giving us 3 traits then, arranged in 3 rows.
julia> m1 = BinaryTraitSubstitutionModel(1.0,2.0, ["carnivory", "non-carnivory"])
Binary Trait Substitution Model: rate carnivory→non-carnivory α=1.0 rate non-carnivory→carnivory β=2.0
julia> using Random; Random.seed!(1234); # for reproducibility of this example
julia> traitmatrix, nodecolumn = randomTrait(m1, net; ntraits=3);
julia> traitmatrix
3×13 Matrix{Int64}: 1 2 2 2 1 1 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 2 1
In this trait matrix, each column corresponds to a node, each row is a trait, and each entry gives the state of that trait for that node, as an index. To get the state labels:
julia> m1.label[traitmatrix]
3×13 Matrix{String}: "carnivory" "non-carnivory" … "non-carnivory" "carnivory" "non-carnivory" "carnivory" "carnivory" "non-carnivory" "carnivory" "carnivory" "non-carnivory" "carnivory"
The nodecolumn
vector says which node corresponds to which column in the trait matrix, and we can compare to the node numbers in the network. For example, the first column corresponds to node -2
, which is the root. (The root is always in the first column: that's where the simulation starts.) Also, as an example, the column for taxon "A" is column 12:
julia> nodecolumn
13-element Vector{String}: "-2" "-3" "-4" "-6" "-8" "E" "D" "-7" "C" "H1" "B" "A" "O"
julia> net.node[net.root]
PhyloNetworks.Node: number:-2 attached to 2 edges, numbered: 1 13
julia> findfirst(isequal("A"), nodecolumn)
12
julia> nodecolumn[12]
"A"
julia> traitmatrix[:,12]
3-element Vector{Int64}: 2 1 2
julia> m1.label[traitmatrix[:,12]]
3-element Vector{String}: "non-carnivory" "carnivory" "non-carnivory"