Public Documentation
Documentation for PhyloNetworks
's public (exported) interface.
See Internal Documentation for documentation on internal functions.
Index
PhyloNetworks.BinaryTraitSubstitutionModel
PhyloNetworks.DataCF
PhyloNetworks.EqualRatesSubstitutionModel
PhyloNetworks.HKY85
PhyloNetworks.HybridNetwork
PhyloNetworks.JC69
PhyloNetworks.ParamsBM
PhyloNetworks.ParamsMultiBM
PhyloNetworks.PhyloNetworkLinearModel
PhyloNetworks.Quartet
PhyloNetworks.RateVariationAcrossSites
PhyloNetworks.RateVariationAcrossSites
PhyloNetworks.ReconstructedStates
PhyloNetworks.ShiftNet
PhyloNetworks.TraitSimulation
PhyloNetworks.TraitSubstitutionModel
PhyloNetworks.TwoBinaryTraitSubstitutionModel
Base.getindex
PhyloNetworks.Q
PhyloNetworks.ancestralStateReconstruction
PhyloNetworks.biconnectedComponents
PhyloNetworks.blobDecomposition
PhyloNetworks.bootsnaq
PhyloNetworks.calibrateFromPairwiseDistances!
PhyloNetworks.checkroot!
PhyloNetworks.cladewiseorder!
PhyloNetworks.countquartetsintrees
PhyloNetworks.deleteHybridThreshold!
PhyloNetworks.deleteaboveLSA!
PhyloNetworks.deleteleaf!
PhyloNetworks.descendenceMatrix
PhyloNetworks.directEdges!
PhyloNetworks.displayedNetworkAt!
PhyloNetworks.displayedTrees
PhyloNetworks.empiricalDNAfrequencies
PhyloNetworks.expectations
PhyloNetworks.expectationsPlot
PhyloNetworks.fitdiscrete
PhyloNetworks.fittedQuartetCF
PhyloNetworks.getNodeAges
PhyloNetworks.getShiftEdgeNumber
PhyloNetworks.getShiftValue
PhyloNetworks.getchild
PhyloNetworks.getlabels
PhyloNetworks.getparent
PhyloNetworks.getpartneredge
PhyloNetworks.getroot
PhyloNetworks.hardwiredCluster
PhyloNetworks.hardwiredClusterDistance
PhyloNetworks.hardwiredClusters
PhyloNetworks.hassinglechild
PhyloNetworks.hybridBootstrapSupport
PhyloNetworks.hybridDetection
PhyloNetworks.hybridatnode!
PhyloNetworks.hybridlambdaformat
PhyloNetworks.isparentof
PhyloNetworks.isrootof
PhyloNetworks.lambda_estim
PhyloNetworks.majorTree
PhyloNetworks.mapAllelesCFtable
PhyloNetworks.maxParsimonyNet
PhyloNetworks.minorTreeAt
PhyloNetworks.mu_phylo
PhyloNetworks.nj
PhyloNetworks.nni!
PhyloNetworks.nparams
PhyloNetworks.nstates
PhyloNetworks.pairwiseTaxonDistanceMatrix
PhyloNetworks.parsimonyGF
PhyloNetworks.parsimonySoftwired
PhyloNetworks.phylolm
PhyloNetworks.predint
PhyloNetworks.predintPlot
PhyloNetworks.preorder!
PhyloNetworks.printEdges
PhyloNetworks.printNodes
PhyloNetworks.randomTrait
PhyloNetworks.randomTrait!
PhyloNetworks.readBootstrapTrees
PhyloNetworks.readInputTrees
PhyloNetworks.readMultiTopology
PhyloNetworks.readSnaqNetwork
PhyloNetworks.readTableCF
PhyloNetworks.readTableCF!
PhyloNetworks.readTopology
PhyloNetworks.readTopologyLevel1
PhyloNetworks.readTrees2CF
PhyloNetworks.readfastatodna
PhyloNetworks.readnexus_treeblock
PhyloNetworks.regressorHybrid
PhyloNetworks.regressorShift
PhyloNetworks.removedegree2nodes!
PhyloNetworks.rootatnode!
PhyloNetworks.rootonedge!
PhyloNetworks.rotate!
PhyloNetworks.setGamma!
PhyloNetworks.setLength!
PhyloNetworks.sharedPathMatrix
PhyloNetworks.shiftHybrid
PhyloNetworks.shrink2cycles!
PhyloNetworks.shrink3cycles!
PhyloNetworks.sigma2_phylo
PhyloNetworks.sigma2_within
PhyloNetworks.simulate
PhyloNetworks.snaq!
PhyloNetworks.sorttaxa!
PhyloNetworks.stationary
PhyloNetworks.summarizeDataCF
PhyloNetworks.summarizeHFdf
PhyloNetworks.tipLabels
PhyloNetworks.topologyMaxQPseudolik!
PhyloNetworks.topologyQPseudolik!
PhyloNetworks.treeEdgesBootstrap
PhyloNetworks.treeedgecomponents
PhyloNetworks.undirectedOtherNetworks
PhyloNetworks.vcv
PhyloNetworks.writeMultiTopology
PhyloNetworks.writeSubTree!
PhyloNetworks.writeTableCF
PhyloNetworks.writeTopology
types
PhyloNetworks.BinaryTraitSubstitutionModel
— TypeBinaryTraitSubstitutionModel(α, β [, label])
Model for binary traits, that is, with 2 states. Default labels are "0" and "1". α is the rate of transition from "0" to "1", and β from "1" to "0".
PhyloNetworks.DataCF
— TypeDataCF
type that contains the following attributes:
- quartet (vector of Quartets)
- numQuartets
- tree (vector of trees: empty if a table of CF was input instead of list of trees)
- numTrees (-1 if a table CF was input instead of list of trees)
- repSpecies (taxon names that were repeated in table of CF or input gene trees: used inside snaq for multiple alleles case)
The list of Quartet may be accessed with the attribute .quartet. If the input was a list of trees, the HybridNetwork's can be accessed with the attribute .tree. For example, if the DataCF object is named d, d.quartet[1] will show the first quartet and d.tree[1] will print the first input tree.
PhyloNetworks.EqualRatesSubstitutionModel
— TypeEqualRatesSubstitutionModel(numberStates, α, labels)
TraitSubstitutionModel
for traits with any number of states and equal substitution rates α between all states. Default labels are "1","2",...
example
julia> m1 = EqualRatesSubstitutionModel(2, [1.0], ["low","high"])
Equal Rates Substitution Model with k=2,
all rates equal to α=1.0.
rate matrix Q:
low high
low * 1.0000
high 1.0000 *
PhyloNetworks.HKY85
— TypeHKY85(rate, pi, relative)
A nucleic acid substitution model based on Hasegawa et al. 1985 substitution model. rate
should be a vector of 1 or 2 rates, and pi
a vector of 4 probabilities summing to 1.
If relative
is false, the 2 rates represent the transition rate and the transversion rate, α and β. If relative
is true (default), only the first rate is used and represents the transition/transversion ratio: κ=α/β. The rate transition matrix Q is normalized to have 1 change / unit of time on average, i.e. the absolute version of Q is divided by 2(piT*piC + piA*piG)α + 2(piY*piR)β
.
nparams
returns 1 or 2. In other words: the stationary distribution is not counted in the number of parameters (and fitdiscrete
does not optimize the pi values at the moment).
examples
julia> m1 = HKY85([.5], [0.20, 0.30, 0.30, 0.20])
HKY85 Substitution Model base frequencies: [0.2, 0.3, 0.3, 0.2]
relative rate version with transition/tranversion ratio kappa = 0.5,
scaled so that there is one substitution per unit time
rate matrix Q:
A C G T
A * 0.4839 0.2419 0.3226
C 0.3226 * 0.4839 0.1613
G 0.1613 0.4839 * 0.3226
T 0.3226 0.2419 0.4839 *
julia> nstates(m1)
4
julia> m2 = HKY85([0.5, 0.5], [0.20, 0.30, 0.30, 0.20], false)
HKY85 Substitution Model base frequencies: [0.2, 0.3, 0.3, 0.2]
absolute rate version with transition/transversion ratio kappa = a/b = 1.0
with rates a = 0.5 and b = 0.5
rate matrix Q:
A C G T
A * 0.1500 0.1500 0.1000
C 0.1000 * 0.1500 0.1000
G 0.1000 0.1500 * 0.1000
T 0.1000 0.1500 0.1500 *
PhyloNetworks.HybridNetwork
— TypeHybridNetwork
Subtype of abstract Network
type. Explicit network or tree with the following attributes:
- numTaxa (taxa are tips, i.e. nodes attached to a single edge)
- numNodes (total number of nodes: tips and internal nodes)
- numEdges
- numHybrids (number of hybrid nodes)
- edge (array of Edges)
- node (array of Nodes)
- root (index of root in vector 'node'. May be artificial, for printing and traversal purposes only.)
- hybrid (array of Nodes: those are are hybrid nodes)
- leaf (array of Nodes: those that are leaves)
- loglik (score after fitting network to data, i.e. negative log pseudolik for SNaQ)
- isRooted (true or false)
PhyloNetworks.JC69
— TypeJC69(rate, relative)
Jukes Cantor (1969) nucleic acid substitution model, which has a single rate parameter. rate
corresponds to the absolute diagonal elements, that is, the rate of change (to any of the other 2 states). Individual rates are rate
/3. If relative
is true (default), the transition matrix Q
is normalized to an average of 1 transition per unit of time: in which case rate
is set to 1.0.
examples
julia> m1 = JC69([0.25], false)
Jukes and Cantor 69 Substitution Model,
absolute rate version
off-diagonal rates equal to 0.25/3.
rate matrix Q:
A C G T
A * 0.0833 0.0833 0.0833
C 0.0833 * 0.0833 0.0833
G 0.0833 0.0833 * 0.0833
T 0.0833 0.0833 0.0833 *
julia> nstates(m1)
4
julia> nparams(m1)
1
julia> m2 = JC69([0.5])
Jukes and Cantor 69 Substitution Model,
relative rate version
off-diagonal rates equal to 1/3
rate matrix Q:
A C G T
A * 0.3333 0.3333 0.3333
C 0.3333 * 0.3333 0.3333
G 0.3333 0.3333 * 0.3333
T 0.3333 0.3333 0.3333 *
julia> nparams(m2)
0
PhyloNetworks.ParamsBM
— TypeParamsBM <: ParamsProcess
Type for a BM process on a network. Fields are mu
(expectation), sigma2
(variance), randomRoot
(whether the root is random, default to false
), and varRoot
(if the root is random, the variance of the root, default to NaN
).
PhyloNetworks.ParamsMultiBM
— TypeParamsMultiBM <: ParamsProcess
Type for a multivariate Brownian diffusion (MBD) process on a network. Fields are mu
(expectation), sigma
(covariance matrix), randomRoot
(whether the root is random, default to false
), varRoot
(if the root is random, the covariance matrix of the root, default to [NaN]
), shift
(a ShiftNet type, default to missing
), and L
(the lower triangular of the cholesky decomposition of sigma
, computed automatically)
Constructors
julia> ParamsMultiBM([1.0, -0.5], [2.0 0.3; 0.3 1.0]) # no shifts
ParamsMultiBM:
Parameters of a MBD with fixed root:
mu: [1.0, -0.5]
Sigma: [2.0 0.3; 0.3 1.0]
julia> net = readTopology("((A:1,B:1):1,C:2);");
julia> shifts = ShiftNet(net.node[2], [-1.0, 2.0], net);
julia> ParamsMultiBM([1.0, -0.5], [2.0 0.3; 0.3 1.0], shifts) # with shifts
ParamsMultiBM:
Parameters of a MBD with fixed root:
mu: [1.0, -0.5]
Sigma: [2.0 0.3; 0.3 1.0]
There are 2 shifts on the network:
───────────────────────────────────────────
Edge Number Shift Value 1 Shift Value 2
───────────────────────────────────────────
2.0 -1.0 2.0
───────────────────────────────────────────
PhyloNetworks.PhyloNetworkLinearModel
— TypePhyloNetworkLinearModel <: GLM.LinPredModel
Phylogenetic linear model representation.
Fields
lm
, V
, Vy
, RL
, Y
, X
, logdetVy
, reml
, ind
, nonmissing
, evomodel
, model_within
and formula
. The following syntax pattern can be used to get more information on a specific field: e.g. to find out about the lm
field, do ?PhyloNetworkLinearModel.lm
.
Methods applied to fitted models
The following StatsBase functions can be applied: coef
, nobs
, vcov
, stderror
, confint
, coeftable
, dof_residual
, dof
, deviance
, residuals
, response
, predict
, loglikelihood
, nulldeviance
, nullloglikelihood
, r2
, adjr2
, aic
, aicc
, bic
, ftest
, lrtest
etc.
The estimated variance-rate and estimated mean of the species-level trait model (see ContinuousTraitEM
) can be retrieved using sigma2_phylo
and mu_phylo
respectively.
If relevant, the estimated individual-level/within-species variance can be retrieved using sigma2_within
.
The optimized λ parameter for Pagel's λ model (see PagelLambda
) can be retrieved using lambda_estim
.
An ancestral state reconstruction can be performed using ancestralStateReconstruction
.
Within-species variation
The true species/population means for the response trait/variable (or the residuals: conditional on the predictors) are jointly modeled as 𝒩(·, σ²ₛV) where V depends on the trait model (see ContinuousTraitEM
) and on the species network. σ²ₛ is the between-species variance-rate.
Within-species variation is modeled by assuming that the individual-level responses are iid 𝒩(0, σ²ₑ) about the true species means, so that the species-level sample means (conditional on the predictors) are jointly modeled as 𝒩(·, σ²ₛV + σ²ₑD⁻¹), where σ²ₑ is the within-species variance and D⁻¹ is a diagonal matrix whose entries are the inverse sample-sizes (see WithinSpeciesCTM
).
Although the above two models can be expressed in terms of a joint distribution for the species-level sample means (or residuals conditional on the predictors), more data are required to fit a model accounting for within-species variation, that is, a model recognizing that the sample means are estimates of the true population means. To fit a model without within-species variation, data on the species means are sufficient. To fit a model with within-species variation, we need to have the species means and the standard deviations of the response variable for each species.
phylolm
can fit a model with within-species variation either from species-level statistics ("mean response" and "standard deviation in response") or from individual-level data (in which case the species-level statistics are computed internally). See phylolm
for more details on these two input choices.
In the object, obj.Y
and obj.X
are the observed species means. predict
, residuals
and response
return the values at the species level.
PhyloNetworks.Quartet
— TypeQuartet
type that saves the information on a given 4-taxon subset. It contains the following attributes:
- number: integer
- taxon: vector of taxon names, like t1 t2 t3 t4
- obsCF: vector of observed CF, in order 12|34, 13|24, 14|23
- logPseudoLik
- ngenes: number of gene trees used to compute the observed CF; -1.0 if unknown
- qnet:
QuartetNetwork
, which saves the expCF after snaq estimation to emphasize that the expCF depend on a specific network, not the data
see also: QuartetT
for quartet with data of user-defined type T
, using a mapping between quartet indices and quartet taxa.
PhyloNetworks.RateVariationAcrossSites
— TypeRateVariationAcrossSites(rvsymbol::Symbol, ncategories=4::Int)
Default model for rate variation across site, specified by a symbol:
:noRV
for no rate variation:G
or:Gamma
for gamma-distributed rates:I
or:Inv
for two categories: invariable and variable:GI
or:GI
for both.
PhyloNetworks.RateVariationAcrossSites
— MethodRateVariationAcrossSites(; pinv=0.0, alpha=Inf, ncat=4)
Model for variable substitution rates across sites (or across traits) using the discrete Gamma model (+G, Yang 1994, Journal of Molecular Evolution) or the invariable-sites model (+I, Hasegawa, Kishino & Yano 1985 J Mol Evol). Both types of rate variation can be combined (+G+I, Gu, Fu & Li 1995, Mol Biol Evol) but this is discouraged (Jia, Lo & Ho 2014 PLOS One). Using rate variation increases the number of parameters by one (+G or +I) or by two (+G+I).
Because the mean of the desired distribution or rates is 1, we use a Gamma distribution with shape α and scale θ=1/α (rate β=α) if no invariable sites, or scale θ=1/(α(1-pinv)), that is rate β=α(1-pinv) with a proportion pinv of invariable sites. The shape parameter is referred to as alpha here. The Gamma distribution is discretized into ncat
categories. In each category, the category's rate multiplier is a normalized quantile of the gamma distribution.
julia> rv = RateVariationAcrossSites()
Rate variation across sites: discretized Gamma
categories for Gamma discretization: 1
rates: [1.0]
julia> nparams(rv)
0
julia> typeof(rv)
PhyloNetworks.RVASGamma{1}
julia> rv = RateVariationAcrossSites(alpha=1.0, ncat=4)
Rate variation across sites: discretized Gamma
alpha: 1.0
categories for Gamma discretization: 4
rates: [0.146, 0.513, 1.071, 2.27]
julia> typeof(rv)
PhyloNetworks.RVASGamma{4}
julia> PhyloNetworks.setalpha!(rv, 2.0)
Rate variation across sites: discretized Gamma
alpha: 2.0
categories for Gamma discretization: 4
rates: [0.319, 0.683, 1.109, 1.889]
julia> nparams(rv)
1
julia> rv = RateVariationAcrossSites(pinv=0.3)
Rate variation across sites: +I (invariable sites)
pinv: 0.3
rates: [0.0, 1.429]
julia> nparams(rv)
1
julia> typeof(rv)
PhyloNetworks.RVASInv
julia> PhyloNetworks.setpinv!(rv, 0.05)
Rate variation across sites: +I (invariable sites)
pinv: 0.05
rates: [0.0, 1.053]
julia> rv = RateVariationAcrossSites(pinv=0.3, alpha=2.0, ncat=4)
Rate variation across sites: discretized Gamma+I
pinv: 0.3
alpha: 2.0
categories for Gamma discretization: 4
rates: [0.0, 0.456, 0.976, 1.584, 2.698]
probabilities: [0.3, 0.175, 0.175, 0.175, 0.175]
julia> nparams(rv)
2
julia> typeof(rv)
PhyloNetworks.RVASGammaInv{5}
julia> PhyloNetworks.setalpha!(rv, 3.0)
Rate variation across sites: discretized Gamma+I
pinv: 0.3
alpha: 3.0
categories for Gamma discretization: 4
rates: [0.0, 0.6, 1.077, 1.584, 2.454]
probabilities: [0.3, 0.175, 0.175, 0.175, 0.175]
julia> PhyloNetworks.setpinv!(rv, 0.05)
Rate variation across sites: discretized Gamma+I
pinv: 0.05
alpha: 3.0
categories for Gamma discretization: 4
rates: [0.0, 0.442, 0.793, 1.167, 1.808]
probabilities: [0.05, 0.238, 0.238, 0.238, 0.238]
julia> PhyloNetworks.setpinvalpha!(rv, 0.1, 5.0)
Rate variation across sites: discretized Gamma+I
pinv: 0.1
alpha: 5.0
categories for Gamma discretization: 4
rates: [0.0, 0.593, 0.91, 1.221, 1.721]
probabilities: [0.1, 0.225, 0.225, 0.225, 0.225]
PhyloNetworks.ReconstructedStates
— TypeReconstructedStates
Type containing the inferred information about the law of the ancestral states given the observed tips values. The missing tips are considered as ancestral states.
The following functions can be applied to it: expectations
(vector of expectations at all nodes), stderror
(the standard error), predint
(the prediction interval).
The ReconstructedStates
object has fields: traits_nodes
, variances_nodes
, NodeNumbers
, traits_tips
, tipNumbers
, model
. Type in "?ReconstructedStates.field" to get help on a specific field.
PhyloNetworks.ShiftNet
— TypeShiftNet
Shifts associated to a HybridNetwork
sorted in topological order. Its shift
field is a vector of shift values, one for each node, corresponding to the shift on the parent edge of the node (which makes sense for tree nodes only: they have a single parent edge).
Two ShiftNet
objects on the same network can be concatened with *
.
ShiftNet(node::Vector{Node}, value::AbstractVector, net::HybridNetwork; checkPreorder=true::Bool)
Constructor from a vector of nodes and associated values. The shifts are located on the edges above the nodes provided. Warning, shifts on hybrid edges are not allowed.
ShiftNet(edge::Vector{Edge}, value::AbstractVector, net::HybridNetwork; checkPreorder=true::Bool)
Constructor from a vector of edges and associated values. Warning, shifts on hybrid edges are not allowed.
Extractors: getShiftEdgeNumber
, getShiftValue
PhyloNetworks.TraitSimulation
— TypeTraitSimulation
Result of a trait simulation on an HybridNetwork
with function simulate
.
The following functions and extractors can be applied to it: tipLabels
, obj[:Tips]
, obj[:InternalNodes]
(see documentation for function getindex(::TraitSimulation, ::Symbol)
).
The TraitSimulation
object has fields: M
, params
, model
.
PhyloNetworks.TraitSubstitutionModel
— TypeTraitSubstitutionModel
For subtypes, see BinaryTraitSubstitutionModel
, EqualRatesSubstitutionModel
, TwoBinaryTraitSubstitutionModel
PhyloNetworks.TwoBinaryTraitSubstitutionModel
— TypeTwoBinaryTraitSubstitutionModel(rate [, label])
TraitSubstitutionModel
for two binary traits, possibly correlated. Default labels are "x0", "x1" for trait 1, and "y0", "y1" for trait 2. If provided, label
should be a vector of size 4, listing labels for trait 1 first then labels for trait 2. rate
should be a vector of substitution rates of size 8. rate[1],...,rate[4] describe rates of changes in trait 1. rate[5],...,rate[8] describe rates of changes in trait 2. In the transition matrix, trait combinations are listed in the following order: x0-y0, x0-y1, x1-y0, x1-y1.
example
model = TwoBinaryTraitSubstitutionModel([2.0,1.2,1.1,2.2,1.0,3.1,2.0,1.1],
["carnivory", "noncarnivory", "wet", "dry"]);
model
using PhyloPlots
plot(model) # to visualize states and rates
basic utilities
PhyloNetworks.tipLabels
— FunctiontipLabels(x)
Return a vector of taxon names at the leaves, for objects of various types: HybridNetwork
, Vector of HybridNetwork
s (in which case the union is taken then sorted), Vector of Quartet
s, DataCF
, TraitSimulation
, MatrixTopologicalOrder
.
For a network, the taxon names are coerced to strings.
PhyloNetworks.printEdges
— FunctionprintEdges(net)
printEdges(io::IO, net)
Print information on the edges of a HybridNetwork
or QuartetNetwork
object net
: edge number, numbers of nodes attached to it, edge length, whether it's a hybrid edge, its γ inheritance value, whether it's a major edge, if it could contain the root (this field is not always updated, though) and attributes pertaining to level-1 networks used in SNaQ: in which cycle it is contained (-1 if no cycle), and if the edge length is identifiable (based on quartet concordance factors).
PhyloNetworks.printNodes
— FunctionprintNodes(net)
printNodes(io, net)
Print information on the nodes of a HybridNetwork
net: node number, whether it's a leaf, whether it's a hybrid node, whether it's connected to one or more hybrid edges, it's name (label), the cycle in which it is belong (-1 if no cycle; makes sense for level-1 networks), and the list of edges attached to it, by their numbers.
PhyloNetworks.getroot
— Functiongetroot(net)
Node used to root net
. If net
is to be considered as semi-directed or unrooted, this root node is used to write the networks' Newick parenthetical description or for network traversals.
See also: isrootof
PhyloNetworks.isrootof
— Functionisrootof(node, net)
true
if node
is the root of net
(or used as such for network traversals in case the network is considered as semi-directed); false
otherwise.
isleaf(node)
isexternal(edge)
true
if node
is a leaf or edge
is adjacent to a leaf, false
otherwise.
PhyloNetworks.isparentof
— Functionisparentof(node, edge)
ischildof(node, edge)
true
if node
is the tail / head, or parent / child, of edge
; false
otherwise. Assumes that the edge's direction is correct, meaning it's field isChild1
is reliable (in sync with the rooting).
PhyloNetworks.hassinglechild
— Functionhassinglechild(node)
true
if node
has a single child, based on the edges' isChild1
field; false
otherwise.
PhyloNetworks.getchild
— Functiongetchild(edge)
getchild(node)
getchildren(node)
Get child(ren) node(s).
getchild
: single child node ofedge
, or ofnode
after checking thatnode
has a single child.getchildren
: vector of all children nodes ofnode
.
getchildedge(node)
Single child edge of node
. Checks that it's a single child.
Warning: these functions rely on correct edge direction, via their isChild1
field.
See also: getparent
, getpartneredge
, isparentof
, hassinglechild
.
PhyloNetworks.getparent
— Functiongetparent(edge)
getparent(node)
getparentminor(node)
getparents(node)
Get parental node(s).
getparent
: major (or only) parent node ofedge
ornode
getparentminor
: minor parent node ofnode
getparents
: vector of all parent nodes ofnode
.
getparentedge(node)
getparentedgeminor(node)
Get one parental edge of a node
.
getparentedge
: major parent edge. For a tree node, it's its only parent edge.getparentedgeminor
: minor parent edge, ifnode
is hybrid (with an error ifnode
has no minor parent).
If node
has multiple major (resp. minor) parent edges, the first one would be returned without any warning or error.
Warning: these functions use the field isChild1
of edges.
See also: getchild
, getpartneredge
.
PhyloNetworks.getpartneredge
— Functiongetpartneredge(edge::Edge)
getpartneredge(edge::Edge, node::Node)
Edge that is the hybrid partner of edge
, meaning that is has the same child node
as edge
. This child node
is given as an argument in the second method. Assumptions, not checked:
- no in-coming polytomy: a node has 0, 1 or 2 parents, no more
- when
node
is given, it is assumed to be the child ofedge
(the first method calls the second).
PhyloNetworks.getNodeAges
— FunctiongetNodeAges(net)
vector of node ages in pre-order, as in nodes_changed
.
Warnings: net
is assumed to
- have been preordered before (to calculate
nodes_changed
) - be time-consistent (all paths to the root to a given hybrid have the same length)
- be ultrametric (all leaves have the same age: 0)
PhyloNetworks.pairwiseTaxonDistanceMatrix
— FunctionpairwiseTaxonDistanceMatrix(net; keepInternal=false,
checkPreorder=true, nodeAges=[])
pairwiseTaxonDistanceMatrix!(M, net, nodeAges)
Return the matrix M
of pairwise distances between nodes in the network:
- between all nodes (internal and leaves) if
keepInternal=true
, in which case the nodes are listed inM
in the order in which they appear innet.nodes_changed
- between taxa only otherwise, in which case the nodes are listed in
M
in the order in which they appear intipLabels(net)
(i.e. same order as innet.leaf
)
The second form modifies M
in place, assuming all nodes.
The distance between the root and a given hybrid node (to take an example) is the weighted average of path lengths from the root to that node, where each path is weighted by the product of γs of all edges on that path. This distance measures the average genetic distance across the genome, if branch lengths are in substitutions/site.
optional arguments:
checkPreorder
: if true,net.nodes_changed
is updated to get a topological ordering of nodes.nodeAges
: if not provided, i.e. empty vector, the network is not modified. If provided and non-empty,nodeAges
should list node ages in the pre-order in which nodes are listed innodes_changed
(including leaves), and edge lengths innet
are modified accordingly.
Providing node ages hence makes the network time consistent: such that all paths from the root to a given hybrid node have the same length. If node ages are not provided, the network need not be time consistent.
utilities to manipulate networks
PhyloNetworks.sorttaxa!
— Functionsorttaxa!(DataFrame, columns)
Reorder the 4 taxa and reorders the observed concordance factors accordingly, on each row of the data frame. If columns
is ommitted, taxon names are assumed to be in columns 1-4 and CFs are assumed to be in columns 5-6 with quartets in this order: 12_34
, 13_24
, 14_23
. Does not reorder credibility interval values, if present.
sorttaxa!(DataCF)
sorttaxa!(Quartet, permutation_tax, permutation_cf)
Reorder the 4 taxa in each element of the DataCF quartet
. For a given Quartet, reorder the 4 taxa in its fields taxon
and qnet.quartetTaxon
(if non-empty) and reorder the 3 concordance values accordingly, in obsCF
and qnet.expCF
.
permutation_tax
and permutation_cf
should be vectors of short integers (Int8) of length 4 and 3 respectively, whose memory allocation gets reused. Their length is not checked.
qnet.names
is unchanged: the order of taxon names here relates to the order of nodes in the network (???)
PhyloNetworks.directEdges!
— FunctiondirectEdges!(net::HybridNetwork; checkMajor=true::Bool)
Updates the edges' attribute isChild1
, according to the root placement. Also updates edges' attribute containRoot
, for other possible root placements compatible with the direction of existing hybrid edges. Relies on hybrid nodes having exactly 1 major hybrid parent edge, but checks for that if checkMajor=true.
Warnings:
- Assumes that isChild1 is correct on hybrid edges
(to avoid changing the identity of which nodes are hybrids and which are not).
- Does not check for cycles (to maintain a network's DAG status)
Returns the network. Throws a 'RootMismatch' Exception if the root was found to conflict with the direction of any hybrid edge.
PhyloNetworks.checkroot!
— Functioncheckroot!(net)
checkroot!(net::HybridNetwork, membership::Dict{Node, Int})
Set the root of net
to an appropriate node and update the edges containRoot
field appropriately, using the membership
output by treeedgecomponents
. A node is appropriate to serve as root if it belongs in the root tree-edge component, that is, the root of the tree-edge component graph.
- If the current root is appropriate, it is left as is. The direction of edges (via
isChild1
) is also left as is, assuming it was in synch with the existing root. - Otherwise, the root is set to the first appropriate node in
net.node
, that is not a leaf. Then edges are directed away from this root.
A RootMismatch
error is thrown if net
is not a valid semidirected phylogenetic network (i.e. it is not possible to root the network in a way compatible with the given hybrid edges).
Output: the membership
ID of the root component. The full set of nodes in the root component can be obtained as shown below. Warning: only use the output component ID after calling the second version checkroot!(net, membership)
.
julia> net = readTopology("(#H1:::0.1,#H2:::0.2,(((b)#H1)#H2,a));");
julia> membership = treeedgecomponents(net);
julia> rootcompID = checkroot!(net, membership);
julia> rootcomp = keys(filter(p -> p.second == rootcompID, membership));
julia> sort([n.number for n in rootcomp]) # number of nodes in the root component
3-element Vector{Int64}:
-3
-2
4
PhyloNetworks.preorder!
— Functionpreorder!(net::HybridNetwork)
Updates attribute net.nodes_changed in which the nodes are pre-ordered (also called topological sorting), such that each node is visited after its parent(s). The edges' direction needs to be correct before calling preorder!, using directEdges!
PhyloNetworks.cladewiseorder!
— Functioncladewiseorder!(net::HybridNetwork)
Updates attribute net.cladewiseorder_nodeIndex. Used for plotting the network. In the major tree, all nodes in a given clade are consecutive. On a tree, this function also provides a pre-ordering of the nodes. The edges' direction needs to be correct before calling cladewiseorder!
, using directEdges!
PhyloNetworks.rootatnode!
— Functionrootatnode!(HybridNetwork, nodeNumber::Integer; index=false::Bool)
rootatnode!(HybridNetwork, Node)
rootatnode!(HybridNetwork, nodeName::AbstractString)
Root the network/tree object at the node with name 'nodeName' or number 'nodeNumber' (by default) or with index 'nodeNumber' if index=true. Attributes isChild1 and containRoot are updated along the way. Use plot(net, shownodenumber=true, showedgelength=false)
to visualize and identify a node of interest. (see package PhyloPlots)
Return the network.
Warnings:
If the node is a leaf, the root will be placed along the edge adjacent to the leaf. This might add a new node.
If the desired root placement is incompatible with one or more hybrids, then
- the original network is restored with its old root and edges' direction.
- a RootMismatch error is thrown.
See also: rootonedge!
.
PhyloNetworks.rootonedge!
— Functionrootonedge!(HybridNetwork, edgeNumber::Integer; index=false::Bool)
rootonedge!(HybridNetwork, Edge)
Root the network/tree along an edge with number edgeNumber
(by default) or with index edgeNumber
if index=true
. Attributes isChild1
and containRoot
are updated along the way.
This adds a new node and a new edge to the network. Use plot(net, showedgenumber=true, showedgelength=false)
to visualize and identify an edge of interest. (see package PhyloPlots)
See also: rootatnode!
.
PhyloNetworks.hybridatnode!
— Functionhybridatnode!(net::HybridNetwork, nodeNumber::Integer)
Change the direction and status of edges in network net
, to move the hybrid node in a cycle to the node with number nodeNumber
. This node must be in one (and only one) cycle, otherwise an error will be thrown. Check and update the nodes' field inCycle
.
Output: net
after hybrid modification.
Assumption: net
must be of level 1, that is, each blob has a single cycle with a single reticulation.
example
net = readTopology("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
using PhyloPlots
plot(net, shownodenumber=true); # to locate nodes and their numbers. D of hybrid origin
hybridatnode!(net, -4)
plot(net, shownodenumber=true); # hybrid direction reversed: now 2B of hybrid origin
hybridatnode!(net::HybridNetwork, hybrid::Node, newNode::Node)
Move the reticulation from hybrid
to newNode
, which must in the same cycle. net
is assumed to be of level 1, but no checks are made and fields are supposed up-to-date.
Called by hybridatnode!(net, nodenumber)
, which is itself called by undirectedOtherNetworks
.
PhyloNetworks.setLength!
— FunctionsetLength!(edge, newlength)`
Set the length of edge
, and set edge.y
and edge.z
accordingly. Warning: specific to SNaQ. Use setlengths!
or setBranchLength!
for more general tools.
- The new length is censored to 10: if the new length is above 10, the edge's length will be set to 10. Lengths are interpreted in coalescent units, and 10 is close to infinity: near perfect gene tree concordance. 10 is used as an upper limit to coalescent units that can be reliably estimated.
- The new length is allowed to be negative, but must be greater than -log(1.5), to ensure that the major quartet concordance factor (1 - 2/3 exp(-length)) is >= 0.
PhyloNetworks.setGamma!
— FunctionsetGamma!(Edge, new γ)
setGamma!(Edge, new γ, change_other=true::Bool)
Set inheritance probability γ for an edge, which must be a hybrid edge. The new γ needs to be in [0,1]. The γ of the "partner" hybrid edge is changed accordingly, to 1-γ. The field isMajor
is also changed accordingly. If the new γ is approximately 0.5, Edge
is set to the major parent, its partner is set to the minor parent.
If net
is a HybridNetwork object, printEdges(net)
will show the list of edges and their γ's. The γ of the third hybrid edge (say) can be changed to 0.2 with setGamma!(net.edge[3],0.2)
. This will automatically set γ of the partner hybrid edge to 0.8.
The last argument is true by default. If false: the partner edge is not updated. This is useful if the new γ is 0.5, and the partner's γ is already 0.5, in which case the isMajor
attributes can remain unchanged.
PhyloNetworks.deleteleaf!
— Functiondeleteleaf!(HybridNetwork, leafName::AbstractString; ...)
deleteleaf!(HybridNetwork, Node; ...)
deleteleaf!(HybridNetwork, Integer; index=false, ...)
Delete a node from the network, possibly from its name, number, or index in the network's array of nodes. The first two versions require that the node is a leaf. The third version does not require that the node is a leaf: If it has degree 3 or more, nothing happens. If it has degree 1 or 2, then it is deleted.
keyword arguments
simplify
: if true and if deleting the node results in 2 hybrid edges forming a cycle of k=2 nodes, then these hybrid edges are merged and simplified as a single tree edge.
unroot
: if true, a root of degree 1 or 2 is deleted. If false, the root is deleted if it is of degree 1 (no root edge is left), but is kept if it is of degree 2. Deleting all leaves in an outgroup clade or grade will leave the ingroup rooted (that is, the new root will be of degree 2).
nofuse
: if true, keep nodes (and edges) provided that they have at least one descendant leaf, even if they are of degree 2. This will keep two-cycles (forcing simplify
to false). Nodes without any descendant leaves are deleted. If nofuse
is false, edges adjacent to degree-2 nodes are fused.
multgammas
: if true, the fused edge has γ equal to the product of the hybrid edges that have been fused together, which may result in tree edges with γ<1, or with reticulations in which the two parent γ don't add up to 1.
keeporiginalroot
: if true, keep the root even if it is of degree one (forcing unroot
to be false).
Warning: does not update edges' containRoot
nor attributes related to level-1 networks such as inCycle, partition, gammaz, etc. Does not require branch lengths, and designed to work on networks of all levels.
PhyloNetworks.deleteaboveLSA!
— FunctiondeleteaboveLSA!(net, preorder=true::Bool)
Delete edges and nodes above (ancestral to) the least stable ancestor (LSA) of the leaves in net
. See leaststableancestor
for the definition of the LSA. Output: modified network net
.
PhyloNetworks.removedegree2nodes!
— Functionremovedegree2nodes!(net::HybridNetwork, keeproot=false::Bool)
Delete all nodes of degree two in net
, fusing the two adjacent edges together each time, and return the network. If the network has a degree-2 root and keeproot
is false, then the root is eliminated as well, leaving the network unrooted. The only exception to this rule is if the root is incident to 2 (outgoing) hybrid edges. Removing the root should leave a loop-edge (equal end point), which we don't want to do, to preserve the paths in the original network. In this case, the root is maintained even if keeproot
is false. If keeproot
is true, then the root is kept even if it's of degree 2.
See fuseedgesat!
.
julia> net = readTopology("(((((S1,(S2)#H1),(#H1,S3)))#H2),(#H2,S4));");
julia> PhyloNetworks.breakedge!(net.edge[3], net); # create a degree-2 node along hybrid edge
julia> PhyloNetworks.breakedge!(net.edge[3], net); # another one: 2 in a row
julia> PhyloNetworks.breakedge!(net.edge[10], net); # another one, elsewhere
julia> writeTopology(net) # extra pairs of parentheses
"((#H2,S4),(((((S1,(((S2)#H1))),(#H1,S3)))#H2)));"
julia> removedegree2nodes!(net);
julia> writeTopology(net) # even the root is gone
"(#H2,S4,(((S1,(S2)#H1),(#H1,S3)))#H2);"
julia> net = readTopology("((((C:0.9)I1:0.1)I3:0.1,((A:1.0)I2:0.4)I3:0.6):1.4,(((B:0.2)H1:0.6)I2:0.5)I3:2.1);");
julia> removedegree2nodes!(net, true);
julia> writeTopology(net, round=true) # the root was kept
"((C:1.1,A:2.0):1.4,B:3.4);"
PhyloNetworks.shrink2cycles!
— Functionshrink2cycles!(net::HybridNetwork, unroot=false::Bool)
If net
contains a 2-cycle, collapse the cycle into one edge of length tA + γt1+(1-γ)t2 + tB (see below), and return true. Return false otherwise. A 2-cycle is a set of 2 parallel hybrid edges, from the same parent node to the same hybrid child node.
A A
| tA |
parent |
| \ |
t2,1-γ | | t1,γ | tA + γ*t1 + (1-γ)*t2 + tB
| / |
hybrid |
| tB |
B B
If any of the lengths or gammas associated with a 2-cycle are missing, the combined length is missing. If γ is missing, branch lengths are calculated using γ=0.5.
If unroot
is false and the root is up for deletion, it will be kept only if it is has degree 2 or more. If unroot
is true and the root is up for deletion, it will be kept only if it has degree 3 or more. A root node with degree 1 will be deleted in both cases.
PhyloNetworks.shrink3cycles!
— Functionshrink3cycles!(net::HybridNetwork, unroot=false::Bool)
Remove all 2- and 3-cycles from a network.
Return true if net
contains a 2-cycle or a 3-cycle; false otherwise. A 3-cycle (2-cycle) is a set of 3 (2) nodes that are all connected. One of them must be a hybrid node, since net
is a DAG.
If unroot
is false and the root is up for deletion, it will be kept only if it is has degree 2 or more. If unroot
is true and the root is up for deletion, it will be kept only if it has degree 3 or more. A root node with degree 1 will be deleted in both cases.
See shrink3cycleat!
for details on branch lengths and inheritance values.
PhyloNetworks.deleteHybridThreshold!
— FunctiondeleteHybridThreshold!(net::HybridNetwork, threshold::Float64,
nofuse=false, unroot=false, multgammas=false,
keeporiginalroot=false)
Deletes from a network all hybrid edges with heritability below a threshold gamma. Returns the network.
- if threshold<0.5: delete minor hybrid edges with γ < threshold (or with a missing γ, for any threshold > -1.0)
- if threshold=0.5: delete all minor hybrid edges (i.e normally with γ < 0.5, if γ non-missing)
nofuse
: if true, do not fuse edges and keep original nodes.unroot
: if false, the root will not be deleted if it becomes of degree 2.multgammas
: if true, the modified edges have γ values equal to the proportion of genes that the extracted subnetwork represents. For an edgee
in the modified network, the inheritance γ fore
is the product of γs of all edges in the original network that have been merged intoe
.
-keeporiginalroot
: if true, the root will be retained even if of degree 1.
Warnings:
- by default,
nofuse
is false, partner hybrid edges are fused with their child edge and have their γ changed to 1.0. Ifnofuse
is true: the γ's of partner hybrid edges are unchanged. - assumes correct
isMajor
fields, and correctisChild1
fields to updatecontainRoot
.
PhyloNetworks.rotate!
— Functionrotate!(net::HybridNetwork, nodeNumber::Integer; orderedEdgeNum::Array{Int,1})
Rotates the order of the node's children edges. Useful for plotting, to remove crossing edges. If node
is a tree node with no polytomy, the 2 children edges are switched and the optional argument orderedEdgeNum
is ignored.
Use plot(net, shownodenumber=true, showedgenumber=false)
to map node and edge numbers on the network, as shown in the examples below. (see package PhyloPlots)
Warning: assumes that edges are correctly directed (isChild1 updated). This is done by plot(net)
. Otherwise run directEdges!(net)
.
Example
julia> net = readTopology("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
julia> using PhyloPlots
julia> plot(net, shownodenumber=true)
julia> rotate!(net, -4)
julia> plot(net)
julia> net=readTopology("(4,((1,(2)#H7:::0.864):2.069,(6,5):3.423):0.265,(3,#H7:::0.136):10.0);");
julia> plot(net, shownodenumber=true, showedgenumber=true)
julia> rotate!(net, -1, orderedEdgeNum=[1,12,9])
julia> plot(net, shownodenumber=true, showedgenumber=true)
julia> rotate!(net, -3)
julia> plot(net)
Note that LinearAlgebra
also exports a function named rotate!
in Julia v1.5. If both packages need to be used in Julia v1.5 or higher, usage of rotate!
needs to be qualified, such as with PhyloNetworks.rotate!
.
PhyloNetworks.biconnectedComponents
— FunctionbiconnectedComponents(network, ignoreTrivial=false)
Calculate biconnected components (aka "blobs") using Tarjan's algorithm.
Output: array of arrays of edges.
- the length of the array is the number of blobs
- each element is an array of all the edges inside a given blob.
These blobs are returned in post-order, but within a blob, edges are not necessarily sorted in topological order. If ignoreTrivial
is true, trivial components (of a single edge) are not returned. The network is assumed to be connected.
Warnings: for nodes, fields k
and inCycle
are modified during the algorithm. They are used to store the node's "index" (time of visitation), "lowpoint", and the node's "parent", as defined by the order in which nodes are visited. For edges, field fromBadDiamondI
is modified, to store whether the edge has been already been visited or not.
References:
- p. 153 of Tarjan (1972). Depth first search and linear graph algorithms, SIAM Journal on Computing, 1(2):146-160
- on geeksforgeeks, there is an error (as of 2018-01-30):
elif v != parent[u] and low[u] > disc[v]:
(python version) should be replaced byelif v != parent[u] and disc[u] > disc[v]:
- nice explanation at this url
biconnectedComponents(node, index, S, blobs, ignoreTrivial)
Helper recursive function starting at a node (not a network). index
is an array containing a single integer, thus mutable: order in which nodes are visited.
PhyloNetworks.blobDecomposition
— FunctionblobDecomposition!(network)
blobDecomposition(network)
Find blobs using biconnectedComponents
; find their roots using blobInfo
; create a forest in the form of a disconnected network (for efficiency), by deconnecting the root of each non-trivial blob from its parent. The root of each blob corresponds to a new leaf (in another tree of the forest): the number of the blob's root is given to the newly created leaf.
The first (bang) version modifies the network and returns the array of blob roots. The second version copies the network then returns a tuple: the forest and the array of blob roots.
Warnings:
- the forest is represented by a single HybridNetwork object, on which most functions don't work (like
writeTopology
, plotting etc.) because the network is disconnected (to make the forest). Revert back to low-level functions, e.g.printEdges
andprintNodes
. - see
biconnectedComponents
for node attributes modified during the algorithm.
PhyloNetworks.treeedgecomponents
— Functiontreeedgecomponents(net::HybridNetwork)
Return the tree-edge components of the semidirected network as a membership
dictionary Node => Int
. Nodes with the same membership integer value are in the same tree-edge component. The tree-edge components of a network are the connected components of the network when all hybrid edges are removed.
A RootMismatch
error is thrown if there exists a cycle in any of the tree-edge components, or if a tree-edge component has more than one "entry" hybrid node.
Warnings:
- since
Node
s are mutable, the network should not be modified until usage of the outputmembership
dictionary is over. - the component IDs are not predicable, but will be consecutive integers from 1 to the number of components.
PhyloNetworks.nni!
— Functionnni!(net::HybridNetwork, e::Edge, nohybridladder::Bool=true, no3cycle::Bool=true,
constraints=TopologyConstraint[]::Vector{TopologyConstraint})
Attempt to perform a nearest neighbor interchange (NNI) around edge e
, randomly chosen among all possible NNIs (e.g 3, sometimes more depending on e
) satisfying the constraints, and such that the new network is a DAG. The number of possible NNI moves around an edge depends on whether the edge's parent/child nodes are tree or hybrid nodes. This is calculated by nnimax
.
The option no3cycle
forbids moves that would create a 3-cycle in the network. When no3cycle
= false, 2-cycle and 3-cycles may be generated.
Note that the defaults values are for positional (not keyword) arguments, so two or more arguments can be used, but in a specific order: nni!(net, e) or nni!(net, e, nohybridladder), nni!(net, e, nohybridladder, no3cycle), nni!(net, e, nohybridladder, no3cycle, contraints).
Assumptions:
- The starting network does not have 3-cycles, if
no3cycle=true
. No check for the presence of 2- and 3-cycles in the input network. - The edges' field
isChild1
is correct in the input network. (This field will be correct in the output network.)
Output: information indicating how to undo the move or nothing
if all NNIs failed.
examples
julia> str_network = "(((S8,S9),(((((S1,S2,S3),S4),(S5)#H1),(#H1,(S6,S7))))#H2),(#H2,S10));";
julia> net = readTopology(str_network);
julia> using Random; Random.seed!(3);
julia> undoinfo = nni!(net, net.edge[3], true, true); # true's to avoid hybrid ladders and 3-cycles
julia> writeTopology(net)
"(((S8,(((((S1,S2,S3),S4),(S5)#H1),(#H1,(S6,S7))))#H2),S9),(#H2,S10));"
julia> nni!(undoinfo...);
julia> writeTopology(net) == str_network # net back to original topology: the NNI was "undone"
true
nni!(net::HybridNetwork, uv::Edge, nummove::UInt8,
nohybridladder::Bool, no3cycle::Bool)
Modify net
with a nearest neighbor interchange (NNI) around edge uv
. Return the information necessary to undo the NNI, or nothing
if the move was not successful (such as if the resulting graph was not acyclic (not a DAG) or if the focus edge is adjacent to a polytomy). If the move fails, the network is not modified. nummove
specifies which of the available NNIs is performed.
rooted-NNI options according to Gambette et al. (2017), fig. 8:
- BB: 2 moves, both to BB, if directed edges. 8 moves if undirected.
- RR: 2 moves, both to RR.
- BR: 3 moves, 1 RB & 2 BRs, if directed. 6 moves if e is undirected.
- RB: 4 moves, all 4 BRs.
The extra options are due to assuming a semi-directed network, whereas Gambette et al (2017) describe options for rooted networks. On a semi-directed network, there might be a choice of how to direct the edges that may contain the root, e.g. choice of e=uv versus vu, and choice of labelling adjacent nodes as α/β (BB), or as α/γ (BR).
nohybridladder
= true prevents moves that would create a hybrid ladder in the network, that is, 2 consecutive hybrid nodes (one parent of the other). no3cycle
= true prevents NNI moves that would make a 3-cycle, and assumes that the input network does not have any 2- or 3-cycles. If no3cycle
is false, 3-cycles can be generated, but NNIs generating 2-cycles are prevented.
The edge field isChild1
is assumed to be correct according to the net.root
.
nni!(αu, u, uv::Edge, v, vδ)
nni!(αu,u,uv,v,vδ, flip::Bool, inner::Bool, indices)
Transform a network locally around the focus edge uv
with the following NNI, that detaches u-β and grafts it onto vδ:
α - u -- v ------ δ
| |
β γ
α ------ v -- u - δ
| |
γ β
flip
boolean indicates if the uv edge was flipped inner
boolean indicates if edges αu and uv both point toward node u, i.e. α->u<-v<-δ. If this is true, we flip the hybrid status of αu and vδ.
indices
give indices for nodes and edges uinαu, αuinu, vδinv, and vinvδ. These are interpreted as:
u_in_αu: the index for u in the edge αu
αu_in_u: the index for αu in node u
vδ_in_v: the index for vδ in node v
v_in_vδ: the index for v in edge vδ
Warnings:
- No check of assumed adjacencies
- Not implemented for cases that are not necessary thanks to symmetry, such as cases covered by
nni!(vδ, v, uv, u, αu)
ornni!(βu, u, v, vγ)
. More specifically, these cases are not implemented (and not checked):- u not hybrid & v hybrid
- u hybrid, v not hybrid, α -> u <- v -> δ
- Because of this,
nni(αu,u,uv,v,vδ, ...)
should not be used directly; use insteadnni!(net, uv, move_number)
. - nni!(undoinfo...) restores the topology, but edges below hybrid nodes will have length 0.0 even if they didn't before.
Node numbers and edge numbers are not modified. Edge uv
keeps its direction unchanged unless the directions were α -> u -> v -> δ
or α <- u <- v <- δ
, in which case the direction of uv
is flipped.
The second version's input has the same signature as the output, but will undo the NNI more easily. This means that if output = nni!(input)
, then nni!(output...)
is valid and undoes the first operation.
Right now, branch lengths are not modified except when below a hybrid node. Future versions might implement options to modify branch lengths.
data and topology read/write
PhyloNetworks.readfastatodna
— Functionreadfastatodna(filename::String, countPatterns=false::Bool)
Read a fasta file to a dataframe containing a column for each site. If countPatterns
is true, calculate weights and remove identical site patterns to reduce matrix dimension.
Return a tuple containing:
- data frame of BioSequence DNA sequences, with taxon names in column 1 followed by a column for each site pattern, in columns 2-npatterns;
- array of weights, one weight for each of the site columns. The length of the weight vector is equal to npatterns.
Warning: assumes a semi-sequential format, not interleaved, where each taxon name appears only once. For this one time, the corresponding sequence may be broken across several lines though.
PhyloNetworks.readTopology
— FunctionreadTopology(file name)
readTopology(parenthetical description)
readTopology(IO)
Read tree or network topology from parenthetical format (extended Newick). If the root node has a single child: ignore (i.e. delete from the topology) the root node and its child edge.
Input: text file or parenthetical format directly. The file name may not start with a left parenthesis, otherwise the file name itself would be interpreted as the parenthetical description. Nexus-style comments ([&...]
) are ignored, and may be placed after (or instead) of a node name, and before/after an edge length.
A root edge, not enclosed within a pair a parentheses, is ignored. If the root node has a single edge, this one edge is removed.
PhyloNetworks.readTopologyLevel1
— FunctionreadTopologyLevel1(filename)
readTopologyLevel1(parenthetical format)
same as readTopology, reads a tree or network from parenthetical format, but this function enforces the necessary conditions for any starting topology in SNaQ: non-intersecting cycles, no polytomies, unrooted. It sets any missing branch length to 1.0.
If the network has a bad diamond II (in which edge lengths are γ's are not identifiable) and if the edge below this diamond has a length t
different from 0, then this length is set back to 0 and the major parent hybrid edge is lengthened by t
.
PhyloNetworks.readInputTrees
— FunctionreadInputTrees(file)
Read a text file with a list of trees/networks in parenthetical format (one tree per line) and transform them like readTopologyLevel1
does: to be unrooted, with resolved polytomies, missing branch lengths set to 1.0, etc. See readMultiTopology
to read multiple trees or networks with no modification.
Output: array of HybridNetwork objects.
Each line starting with "(" will be considered as describing one topology. The file can have extra lines that are ignored.
PhyloNetworks.readMultiTopology
— FunctionreadMultiTopology(filename::AbstractString, fast=true)
readMultiTopology(newicktrees_list::Vector{<:AbstractString})
Read a list of networks in parenthetical format, either from a file (one network per line) if the input is a string giving the path to the file, or from a vector of strings with each string corresponding to a newick-formatted topology. By default (fast=true
), Functors.fmap
is used for repeatedly reading the newick trees into of HybridNetwork-type objects. The option fast=false
corresponds to the behavior up until v0.14.3: with a file name as input, it prints a message (without failing) when a phylogeny cannot be parsed, and allows for empty lines. Each network is read with readTopology
.
Return an array of HybridNetwork objects.
Examples
julia> multitreepath = joinpath(dirname(Base.find_package("PhyloNetworks")), "..", "examples", "multitrees.newick");
julia> multitree = readMultiTopology(multitreepath) # vector of 25 HybridNetworks
julia> multitree = readMultiTopology(multitreepath, false) # same but slower & safer
julia> treestrings = readlines(multitreepath) # vector of 25 strings
julia> multitree = readMultiTopology(treestrings)
julia> readMultiTopology(treestrings, false) # same, but slower
PhyloNetworks.readnexus_treeblock
— Functionreadnexus_treeblock(filename, treereader=readTopology, args...;
reticulate=true, stringmodifier=[r"#(\d+)" => s"#H\1"])
Read the first "trees" block of a nexus-formatted file, using the translate table if present, and return a vector of HybridNetwork
s. Information inside [&...]
are interpreted as comments and are discarded by the default tree reader. Optional arguments args
are passed to the tree reader.
For the nexus format, see Maddison, Swofford & Maddison (1997).
Unless reticulate
is false, the following is done to read networks with reticulations.
Prior to reading each phylogeny, each instance of #number
is replaced by #Hnumber
to fit the standard extended Newick format at hybrid nodes. This behavior can be changed with option stringmodifier
, which should be a vector of pairs accepted by replace
.
Inheritance γ values are assumed to be given within "comment" blocks at minor hybrid edges (cut as tips to form the extended Newick) like this for example, as output by bacter (Vaughan et al. 2017):
#11[&conv=0, relSize=0.08, ...
or like this, as output by SpeciesNetwork (Zhang et al. 2018):
#H11[&gamma=0.08]
In this example, the corresponding edge to hybrid H11 has γ=0.08.
PhyloNetworks.readSnaqNetwork
— FunctionreadSnaqNetwork(output file)
Read the estimated network from a .out
file generated by snaq!
. The network score is read also, and stored in the network's field .loglik
.
Warning: despite the name "loglik", this score is only proportional to the network's pseudo-deviance: the lower, the better. Do NOT use this score to calculate an AIC or BIC (etc.) value.
PhyloNetworks.readTrees2CF
— FunctionreadTrees2CF(treefile)
readTrees2CF(vector of trees)
Read trees in parenthetical format from a file, or take a vector of trees already read, and calculate the proportion of these trees having a given quartet (concordance factor: CF), for all quartets or for a sample of quartets. Optional arguments include:
- quartetfile: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.
- whichQ="rand": to choose a random sample of 4-taxon subsets
- numQ: size of random sample (ignored if whichQ is not set to "rand")
- writeTab=false: does not write the observedCF to a table (default true)
- CFfile: name of file to save the observedCF (default tableCF.txt)
- writeQ=true: save intermediate files with the list of all 4-taxon subsets and chosen random sample (default false).
- writeSummary: write descriptive stats of input data (default: true)
- nexus: if true, it assumes the gene trees are written in nexus file (default: false)
See also: countquartetsintrees
, which uses a much faster algorithm; readTableCF
to read a table of quartet CFs directly.
PhyloNetworks.countquartetsintrees
— Functioncountquartetsintrees(trees [, taxonmap]; which=:all, weight_byallele=true)
Calculate the quartet concordance factors (CF) observed in the trees
vector. If present, taxonmap
should be a dictionary that maps each allele name to it's species name. To save to a file, first convert to a data frame using writeTableCF
. When which=:all
, quartet CFs are calculated for all 4-taxon sets. (Other options are not implemented yet.)
The algorithm runs in O(mn⁴) where m is the number of trees and n is the number of tips in the trees.
CFs are calculated at the species level only, that is, considering 4-taxon sets made of 4 distinct species, even if the gene trees may have multiple alleles from the same species. For 4 distinct species a,b,c,d
, all alleles from each species (a
etc.) will be considered to calculate the quartet CF.
By default, each gene has a weight of 1. So if there are n_a
alleles from a
, n_b
alleles from b
etc. in a given gene, then each set of 4 alleles has a weight of 1/(n_a n_b b_c n_c)
in the calculation of the CF for a,b,c,d
. With option weight_byallele=true
, then each set of 4 alleles is given a weight of 1 instead. This inflates the total number of sets used to calculate the quartet CFs (to something larger than the number of genes). This may also affect the CF values if the number of alleles varies across genes: genes with more alleles will be given more weight.
examples
julia> tree1 = readTopology("(E,(A,B),(C,D),O);"); tree2 = readTopology("(((A,B),(C,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2]);
Reading in trees, looking at 15 quartets in each...
0+--+100%
**
julia> t # taxon order: t[i] = name of taxon number i
6-element Vector{String}:
"A"
"B"
"C"
"D"
"E"
"O"
julia> length(q) # 15 four-taxon sets on 6 taxa
15
julia> q[1] # both trees agree on AB|CD: resolution 1
4-taxon set number 1; taxon numbers: 1,2,3,4
data: [1.0, 0.0, 0.0, 2.0]
julia> q[8] # tree 2 is missing O (taxon 6), tree 1 wants resolution 3: AO|CD
4-taxon set number 8; taxon numbers: 1,3,4,6
data: [0.0, 0.0, 1.0, 1.0]
julia> q[11] # tree 1 has ACEO unresolved, and tree 2 is missing O: no data for this quartet
4-taxon set number 11; taxon numbers: 1,3,5,6
data: [0.0, 0.0, 0.0, 0.0]
julia> tree1 = readTopology("(E,(a1,B),(a2,D),O);"); tree2 = readTopology("(((a1,a2),(B,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2], Dict("a1"=>"A", "a2"=>"A"); showprogressbar=false);
julia> t
5-element Vector{String}:
"A"
"B"
"D"
"E"
"O"
julia> q[1] # tree 1 has discordance: a1B|DE and a2D|BE. tree 2 has AE|BD for both alleles of A
4-taxon set number 1; taxon numbers: 1,2,3,4
data: [0.25, 0.25, 0.5, 2.0]
julia> q[3] # tree 2 is missing O (taxon 5), and a2 is unresolved in tree 1. There's only a1B|EO
4-taxon set number 3; taxon numbers: 1,2,4,5
data: [1.0, 0.0, 0.0, 0.5]
julia> df = writeTableCF(q,t); # to get a DataFrame that can be saved to a file later
julia> show(df, allcols=true)
5×9 DataFrame
Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes
│ Int64 String String String String Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────
1 │ 1 A B D E 0.25 0.25 0.5 2.0
2 │ 2 A B D O 0.5 0.5 0.0 1.0
3 │ 3 A B E O 1.0 0.0 0.0 0.5
4 │ 4 A D E O 1.0 0.0 0.0 0.5
5 │ 5 B D E O 0.0 0.0 0.0 0.0
julia> # using CSV; CSV.write(df, "filename.csv");
julia> tree2 = readTopology("((A,(B,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2], Dict("a1"=>"A", "a2"=>"A"); weight_byallele=true);
Reading in trees, looking at 5 quartets in each...
0+--+100%
**
julia> show(writeTableCF(q,t), allcols=true)
5×9 DataFrame
Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes
│ Int64 String String String String Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ 1 A B D E 0.333333 0.333333 0.333333 3.0
2 │ 2 A B D O 0.5 0.5 0.0 2.0
3 │ 3 A B E O 1.0 0.0 0.0 1.0
4 │ 4 A D E O 1.0 0.0 0.0 1.0
5 │ 5 B D E O 0.0 0.0 0.0 0.0
PhyloNetworks.readTableCF
— FunctionreadTableCF(file)
readTableCF(data frame)
readTableCF!(data frame)
Read a file or DataFrame object containing a table of concordance factors (CF), with one row per 4-taxon set. The first 4 columns are assumed to give the labels of the 4 taxa in each set (tx1, tx2, tx3, tx4). Columns containing the CFs are assumed to be named CF12_34
, CF13_24
and CF14_23
; or CF12.34
, CF13.24
and CF14.23
; or else are assumed to be columns 5,6,7. If present, a column named 'ngenes' will be used to get the number of loci used to estimate the CFs for each 4-taxon set.
Output: DataCF
object
Optional arguments:
- summaryfile: if specified, a summary file will be created with that name.
- delim (for the first form only): to specify how columns are delimited, with single quotes: delim=';'. Default is a
csv
file, i.e.delim=','
. mergerows
: false by default. When true, will attempt to merge multiple rows corresponding to the same four-taxon set (by averaging their quartet CFs) even if none of the species is repeated within any row (that is, in any set of 4 taxa)
The last version modifies the input data frame, if species are represented by multiple alleles for instance (see readTableCF!
(data frame, columns)).
PhyloNetworks.readTableCF!
— FunctionreadTableCF!(data frame, columns; mergerows=false)
Read in quartet CFs from data frame, assuming information is in columns numbered columns
, of length 7 or 8: 4 taxon labels then 3 CFs then ngenes possibly.
If some species appears more than once in the same 4-taxon set (e.g. t1,t1,t2,t3), then the data frame is modified to remove rows (4-taxon sets) that are uninformative about between-species relationships. This situation may occur if multiple individuals are sampled from the same species. A 4-taxon set is uninformative (and its row is removed) if one taxon is repeated 3 or 4 times (like t1,t1,t1,t1 or t1,t2,t2,t2). The list of species appearing twice in some 4-taxon sets is stored in the output DataCF object. For these species, the length of their external edge is identifiable (in coalescent units). If multiple rows correspond to the same 4-taxon set, these rows are merged and their CF values (and number of genes) are averaged. If none of the species is repeated within any 4-taxon set, then this averaging is attempted only if mergerows
is true.
readTableCF!(DataCF, data frame, columns)
Modify the .quartet.obsCF
values in the DataCF
object with those read from the data frame in columns numbered columns
. columns
should have 3 columns numbers for the 3 CFs in this order: 12_34
, 13_24
and 14_23
.
Assumptions:
- same 4-taxon sets in
DataCF
and in the data frame, and in the same order, but this assumption is not checked (for speed, e.g. during bootstrapping). - one single row per 4-taxon set (multiple individuals representatives of the same 4-taxon set should have been already merged); basically: the DataCF should have been created from the data frame by
readTableCF!(df, colums)
PhyloNetworks.writeTableCF
— FunctionwriteTableCF(vector of Quartet objects)
writeTableCF(DataCF)
Build a DataFrame containing observed quartet concordance factors, with columns named:
t1
,t2
,t3
,t4
for the four taxon names in each quartetCF12_34
,CF13_24
,CF14_23
for the 3 quartets of a given four-taxon setngenes
if this information is available for some quartets
writeTableCF(quartetlist::Vector{QuartetT} [, taxonnames]; colnames)
Convert a vector of QuartetT
objects to a data frame, with 1 row for each four-taxon set in the list. Each four-taxon set contains quartet data of some type T
, which determines the number of columns in the data frame. This data type T
should be a vector of length 3 or 4, or a 3×n matrix.
In the output data frame, the columns are, in this order:
qind
: contains the quartet'snumber
t1, t2, t3, t4
: contain the quartet'staxonnumber
s if notaxonnames
are given, or the taxon names otherwise. The name of taxon numberi
is taken to betaxonnames[i]
.- 3 columns for each column in the quartet's
data
. The first 3 columns are namedCF12_34, CF13_24, CF14_23
. The next columns are namedV2_12_34, V2_13_24, V2_14_23
and contain the data in the second column of the quartet's data matrix. And so on. For the data frame to have non-default column names, provide the desired 3, 4, or 3×n names as a vector via the optional argumentcolnames
.
PhyloNetworks.readBootstrapTrees
— FunctionreadBootstrapTrees(listfile; relative2listfile=true)
Read the list of file names in listfile
, then read all the trees in each of these files. Output: vector of vectors of trees (networks with h>0 allowed).
listfile
should be the name of a file containing the path/name to multiple bootstrap files, one on each line (no header). Each named bootstrap file should contain multiple trees, one per line (such as bootstrap trees from a single gene).
The path/name to each bootstrap file should be relative to listfile
. Otherwise, use option relative2listfile=false
, in which case the file names are interpreted as usual: relative to the user's current directory if not given as absolute paths.
PhyloNetworks.writeSubTree!
— FunctionwriteSubTree!(IO, network, dendroscope::Bool, namelabel::Bool,
round_branch_lengths::Bool, digits::Integer,
internallabel::Bool)
Write to IO the extended newick format (parenthetical description) of a network. If written for dendroscope, inheritance γ's are not written. If namelabel
is true, taxa are labelled by their names; otherwise taxa are labelled by their number IDs. If unspecified, branch lengths and γ's are rounded to 3 digits. Use internallabel=false
to suppress the labels of internal nodes.
writeSubTree!(IO, node, edge, dendroscope::Bool, namelabel::Bool,
round_branch_lengths::Bool, digits::Integer, internallabel::Bool)
Write the extended newick format of the sub-network rooted at node
and assuming that edge
is a parent of node
.
If the parent edge
is nothing
, the edge attribute isChild1
is used and assumed to be correct to write the subtree rooted at node
. This is useful to write a subtree starting at a non-root node. Example:
net = readTopology("(((A,(B)#H1:::0.9),(C,#H1:::0.1)),D);")
directEdges!(net)
s = IOBuffer()
writeSubTree!(s, net.node[7], nothing, false, true)
String(take!(s))
Used by writeTopology
.
PhyloNetworks.writeTopology
— FunctionwriteTopology(net)
writeTopology(net, filename)
writeTopology(net, IO)
Write the parenthetical extended Newick format of a network, as a string, to a file or to an IO buffer / stream. Optional arguments (default values):
- di (false): write in format for Dendroscope
- round (false): rounds branch lengths and heritabilities γ
- digits (3): digits after the decimal place for rounding
- append (false): if true, appends to the file
- internallabel (true): if true, writes internal node labels
If the current root placement is not admissible, other placements are tried. The network is updated with this new root placement, if successful.
Uses lower-level function writeSubTree!
.
PhyloNetworks.writeMultiTopology
— FunctionwriteMultiTopology(nets, file_name; append=false)
writeMultiTopology(nets, IO)
Write an array of networks in parenthetical extended Newick format, one network per line. Use the option append=true to append to the file. Otherwise, the default is to create a new file or overwrite it, if it already existed. Each network is written with writeTopology
.
Examples
julia> net = [readTopology("(D,((A,(B)#H7:::0.864):2.069,(F,E):3.423):0.265,(C,#H7:::0.1361111):10);"),
readTopology("(A,(B,C));"),readTopology("(E,F);"),readTopology("(G,H,F);")];
julia> writeMultiTopology(net, "fournets.net") # to (over)write to file "fournets.net"
julia> writeMultiTopology(net, "fournets.net", append=true) # to append to this file
julia> writeMultiTopology(net, stdout) # to write to the screen (standard out)
(D,((A,(B)#H7:::0.864):2.069,(F,E):3.423):0.265,(C,#H7:::0.1361111):10.0);
(A,(B,C));
(E,F);
(G,H,F);
PhyloNetworks.hybridlambdaformat
— Functionhybridlambdaformat(net::HybridNetwork; prefix="I")
Output net
as a string in the format that the Hybrid-Lambda simulator expects, namely:
- all internal nodes are named, including the root, with names that are unique and start with a letter.
- hybrid nodes are written as
H6#γ1:length1
andH6#γ1:length2
instead of#H6:length1::γ1
and#H6:length2::γ2
(note the samme γ value expected by Hybrid-Lambda)
This is a modified version of the extended Newick format.
Optional keyword argument prefix
: must start with a letter, other than "H". Internal nodes are given names like "I1", "I2", etc. Existing internal non-hybrid node names are replaced, which is crucial if some of them don't start with a letter (e.g. in case node names are bootstrap values). See nameinternalnodes!
to add node names.
examples
julia> net = readTopology("((a:1,(b:1)#H1:1::0.8):5,(#H1:0::0.2,c:1):1);");
julia> hybridlambdaformat(net) # net is unchanged here
"((a:1.0,(b:1.0)H1#0.8:1.0)I1:5.0,(H1#0.8:0.0,c:1.0)I2:1.0)I3;"
julia> # using PhyloPlots; plot(net, shownodenumber=true) # shows that node -2 is the root
julia> rotate!(net, -2)
julia> writeTopology(net) # now the minor edge with γ=0.2 appears first
"((#H1:0.0::0.2,c:1.0):1.0,(a:1.0,(b:1.0)#H1:1.0::0.8):5.0);"
julia> hybridlambdaformat(net)
"((H1#0.2:0.0,c:1.0)I2:1.0,(a:1.0,(b:1.0)H1#0.2:1.0)I1:5.0)I3;"
julia> net = readTopology("((((B)#H1:::.6)#H2,((D,C,#H2:::0.8),(#H1,A))));"); # 2 reticulations, no branch lengths
julia> writeTopology(net, round=true)
"(#H2:::0.2,((D,C,((B)#H1:::0.6)#H2:::0.8),(#H1:::0.4,A)));"
julia> hybridlambdaformat(net; prefix="int")
"(H2#0.2,((D,C,((B)H1#0.6)H2#0.2)int1,(H1#0.6,A)int2)int3)int4;"
PhyloNetworks.mapAllelesCFtable
— FunctionmapAllelesCFtable(mapping file, CF file; filename, columns, delim)
Create a new DataFrame containing the same concordance factors as in the input CF file, but with modified taxon names. Each allele name in the input CF table is replaced by the species name that the allele maps onto, based on the mapping file. The mapping file should have column names: allele and species.
Optional arguments:
- file name to write/save resulting CF table. If not specified, then the output data frame is not saved to a file.
- column numbers for the taxon names. 1-4 by default.
- any keyword arguments that
CSV.File
would accept. For example, delim=',' by default: columns are delimited by commas. Unless specified otherwise by the user,pool
=false (to read taxon names as Strings, not levels of a categorical factor, for combining the 4 columns with taxon names more easily). The same CSV arguments are used to read both input file (mapping file and quartet file)
See also mapAllelesCFtable!
to input DataFrames instead of file names.
If a filename
is specified, such as "quartetCF_speciesNames.csv" in the example below, this file is best read later with the option pool=false
. example:
mapAllelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
filename = "quartetCF_speciesNames.csv")
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
dataCF_specieslevel = readTableCF!(df_sp, mergerows=true); # DataCF object
network inference
PhyloNetworks.snaq!
— Functionsnaq!(T::HybridNetwork, d::DataCF)
Estimate the network (or tree) to fit observed quartet concordance factors (CFs) stored in a DataCF object, using maximum pseudo-likelihood. A level-1 network is assumed. The search starts from topology T
, which can be a tree or a network with no more than hmax
hybrid nodes. The function name ends with ! because it modifies the CF data d
by updating its attributes expCF
: CFs expected under the network model. It does not modify T
. The quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Output:
- estimated network in file
.out
(also in.log
): best network overall and list of networks from each individual run. - the best network and modifications of it, in file
.networks
. All networks in this file have the same undirected topology as the best network, but have different hybrid/gene flow directions. These other networks are reported with their pseudo-likelihood scores, because non-identifiability issues can cause them to have very similar scores, and because SNaQ was shown to estimate the undirected topology accurately but not the direction of hybridization in cases of near non-identifiability. - if any error occurred, file
.err
provides information (seed) to reproduce the error.
There are many optional arguments, including
hmax
(default 1): maximum number of hybridizations allowedverbose
(default false): if true, print information about the numerical optimizationruns
(default 10): number of independent starting points for the searchoutgroup
(default none): outgroup taxon to root the estimated topology at the very endfilename
(default "snaq"): root name for the output files (.out
,.err
). If empty (""), files are not created, progress log goes to the screen only (standard out).seed
(default 0 to get it from the clock): seed to replicate a given searchprobST
(default 0.3): probability to start fromT
at each given run. With problability 1-probST, the search is started from an NNI modification ofT
along a tree edge with no hybrid neighbor, with a possible modification of one reticulation ifT
has one.updateBL
(default true): If true and ifT
is a tree, the branch lengths inT
are first optimized roughly withupdateBL!
by using the average CF of all quartets defining each branch and back-calculating the coalescent units.
The following optional arguments control when to stop the optimization of branch lengths and γ's on each individual candidate network. Defaults are in parentheses:
ftolRel
(1e-6) andftolAbs
(1e-6): relative and absolute differences of the network score between the current and proposed parameters,xtolRel
(1e-2) andxtolAbs
(1e-3): relative and absolute differences between the current and proposed parameters.
Greater values will result in a less thorough but faster search. These parameters are used when evaluating candidate networks only. The following optional arguments control when to stop proposing new network topologies:
Nfail
(75): maximum number of times that new topologies are proposed and rejected (in a row).liktolAbs
(1e-6): the proposed network is accepted if its score is better than the current score by at least liktolAbs.
Lower values of Nfail
and greater values of liktolAbs
and ftolAbs
would result in a less thorough but faster search.
At the end, branch lengths and γ's are optimized on the last "best" network with different and very thorough tolerance parameters: 1e-12 for ftolRel, 1e-10 for ftolAbs, xtolRel, xtolAbs.
See also: topologyMaxQPseudolik!
to optimize parameters on a fixed topology, and topologyQPseudolik!
to get the deviance (pseudo log-likelihood up to a constant) of a fixed topology with fixed parameters.
Reference: Claudia Solís-Lemus and Cécile Ané (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics 12(3):e1005896
PhyloNetworks.topologyMaxQPseudolik!
— FunctiontopologyMaxQPseudolik!(net::HybridNetwork, d::DataCF)
Estimate the branch lengths and inheritance probabilities (γ's) for a given network topology. The network is not modified, only the object d
is, with updated expected concordance factors.
Ouput: new network, with optimized parameters (branch lengths and gammas). The maximized quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0. This is also an attribute of the network, which can be accessed with net.loglik
.
Optional arguments (default value):
- verbose (false): if true, information on the numerical optimization is printed to screen
- ftolRel (1e-5), ftolAbs (1e-6), xtolRel (1e-3), xtolAbs (1e-4): absolute and relative tolerance values for the pseudo-deviance function and the parameters
PhyloNetworks.topologyQPseudolik!
— FunctiontopologyQPseudolik!(net::HybridNetwork, d::DataCF)
Calculate the quartet pseudo-deviance of a given network/tree for DataCF d
. This is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Be careful if the net object does not have all internal branch lengths specified because then the pseudolikelihood will be meaningless.
The loglik attribute of the network is undated, and d
is updated with the expected concordance factors under the input network.
PhyloNetworks.summarizeDataCF
— FunctionsummarizeDataCF(d::DataCF)
function to summarize the information contained in a DataCF object. It has the following optional arguments:
- filename: if provided, the summary will be saved in the filename, not to screen
- pc (number between (0,1)): threshold of percentage of missing genes to identify 4-taxon subsets with fewer genes than the threshold
PhyloNetworks.fittedQuartetCF
— FunctionfittedQuartetCF(d::DataCF, format::Symbol)
return a data frame with the observed and expected quartet concordance factors after estimation of a network with snaq(T,d). The format can be :wide (default) or :long.
- if wide, the output has one row per 4-taxon set, and each row has 10 columns: 4 columns for the taxon names, 3 columns for the observed CFs and 3 columns for the expected CF.
- if long, the output has one row per quartet, i.e. 3 rows per 4-taxon sets, and 7 columns: 4 columns for the taxon names, one column to give the quartet resolution, one column for the observed CF and the last column for the expected CF.
see also: topologyQPseudolik!
and topologyMaxQPseudolik!
to update the fitted CF expected under a specific network, inside the DataCF object d
.
PhyloNetworks.bootsnaq
— Functionbootsnaq(T::HybridNetwork, df::DataFrame)
bootsnaq(T::HybridNetwork, vector of tree lists)
Bootstrap analysis for SNaQ. Bootstrap data can be quartet concordance factors (CF), drawn from sampling uniformly in their credibility intervals, as given in the data frame df
. Alternatively, bootstrap data can be gene trees sampled from a vector of tree lists: one list of bootstrap trees per locus (see readBootstrapTrees
to generate this, from a file containing a list of bootstrap files: one per locus).
From each bootstrap replicate, a network is estimated with snaq!, with a search starting from topology T
. Optional arguments include the following, with default values in parentheses:
hmax
(1): max number of reticulations in the estimated networksnrep
(10): number of bootstrap replicates.runs
(10): number of independent optimization runs for each replicatefilename
("bootsnaq"): root name for output files. No output files if "".seed
(0 to get a random seed from the clock): seed for random number generatorotherNet
(empty): another starting topology so that each replicate will start prcnet% runs on otherNet and (1-prcnet)% runs onT
prcnet
(0): percentage of runs starting onotherNet
; error if different than 0.0, and otherNet not specified.ftolRel
,ftolAbs
,xtolRel
,xtolAbs
,liktolAbs
,Nfail
,probST
,verbose
,outgroup
: seesnaq!
, same defaults.
If T
is a tree, its branch lengths are first optimized roughly with updateBL!
(by using the average CF of all quartets defining each branch and calculating the coalescent units corresponding to this quartet CF). If T
has one or more reticulations, its branch lengths are taken as is to start the search. The branch lengths of otherNet
are always taken as is to start the search.
PhyloNetworks.calibrateFromPairwiseDistances!
— FunctioncalibrateFromPairwiseDistances!(net, distances::Matrix{Float64},
taxon_names::Vector{<:AbstractString})
Calibrate the network to match (as best as possible) input pairwise distances between taxa, such as observed from sequence data. taxon_names
should provide the list of taxa, in the same order in which they they are considered in the distances
matrix. The optimization criterion is the sum of squares between the observed distances, and the distances from the network (weighted average of tree distances, weighted by γ's). The network's edge lengths are modified.
Warning: for many networks, mutiple calibrations can fit the pairwise distance data equally well (lack of identifiability). This function will output one of these equally good calibrations.
optional arguments (default):
- checkPreorder (true)
- forceMinorLength0 (false) to force minor hybrid edges to have a length of 0
- NLoptMethod (
:LD_MMA
) for the optimization algorithm. Other options include:LN_COBYLA
(derivative-free); see NLopt package. - tolerance values to control when the optimization is stopped: ftolRel (1e-12), ftolAbs (1e-10) on the criterion, and xtolRel (1e-10), xtolAbs (1e-10) on branch lengths / divergence times.
- verbose (false)
PhyloNetworks.undirectedOtherNetworks
— FunctionundirectedOtherNetworks(net::HybridNetwork)
Return a vector of HybridNetwork objects, obtained by switching the placement of each hybrid node to other nodes inside its cycle. This amounts to changing the direction of a gene flow event (recursively to move around the whole cycle of each reticulation).
Optional argument: outgroup
, as a String. If an outgroup is specified, then networks conflicting with the placement of the root are avoided.
Assumptions: net
is assumed to be of level 1, that is, each blob has a single cycle with a single reticulation. All level-1 fields of net
are assumed up-to-date.
Example
julia> net = readTopology("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
julia> vnet = undirectedOtherNetworks(net)
PhyloNetworks.nj
— Functionnj(D::DataFrame; force_nonnegative_edges::Bool=false)
Construct a tree from a distance matrix by neighbor joining, where D
is a DataFrame
of the distance matrix, with taxon names taken from the header of the data frame. The rows are assumed to correspond to tips in the tree in the same order as they do in columns. With force_nonnegative_edges
being true
, any negative edge length is changed to 0.0 (with a message).
For the algorithm, see Satou & Nei 1987.
See nj!
for using a matrix as input.
network Comparisons
PhyloNetworks.majorTree
— FunctionmajorTree(net::HybridNetwork; nofuse=false::Bool, unroot=false::Bool,
keeporiginalroot=false::Bool)
Extract the major tree displayed in a network, keeping the major edge and dropping the minor edge at each hybrid node.
nofuse
: if true, edges and degree-2 nodes are retained during edge removal. Otherwise, at each reticulation the child edge (below the hybrid node) is retained: the major hybrid edge is fused with it.
unroot
: is true, the root will be deleted if it becomes of degree 2.
keeporiginalroot
: the network's root is kept even if it becomes of degree 1.
Warnings:
- if
nofuse
is true: the hybrid edges that are retained (without fusing) have their γ values unchanged, but theirisMajor
is changed to true - assume correct
isMajor
attributes.
PhyloNetworks.minorTreeAt
— FunctionminorTreeAt(net::HybridNetwork, hybindex::Integer, nofuse=false, unroot=false::Bool)
Extract the tree displayed in the network, following the major hybrid edge at each hybrid node, except at the ith hybrid node (i=hybindex
), where the minor hybrid edge is kept instead of the major hybrid edge. If nofuse
is true, edges are not fused (degree-2 nodes are kept). If unroot
is true, the root will be deleted if it becomes of degree 2.
Warning: assume correct isMajor
fields.
PhyloNetworks.displayedTrees
— FunctiondisplayedTrees(net::HybridNetwork, gamma::Float64; nofuse=false::Bool,
unroot=false::Bool, multgammas=false::Bool,
keeporiginalroot=false::Bool)
Extracts all trees displayed in a network, following hybrid edges with heritability >= γ threshold (or >0.5 if threshold=0.5) and ignoring any hybrid edge with heritability lower than γ. Returns an array of trees, as HybridNetwork objects.
nofuse
: if true, do not fuse edges (keep degree-2 nodes) during hybrid edge removal. unroot
: if false, the root will not be deleted if it becomes of degree 2 unless keeporiginalroot is true. multgammas
: if true, the edges in the displayed trees have γ values equal to the proportion of genes that the edge represents, even though all these edges are tree edges. The product of all the γ values across all edges is the proportion of genes that the tree represents. More specifically, edge e
in a given displayed tree has γ equal to the product of γs of all edges in the original network that have been merged into e
. keeporiginalroot
: if true, keep root even if of degree 1.
Warnings:
- if
nofuse
is true: the retained partner hybrid edges have their γ values unchanged, but theirisMajor
is changed to true - assume correct
isMajor
attributes.
PhyloNetworks.displayedNetworkAt!
— FunctiondisplayedNetworkAt!(net::HybridNetwork, node::Node, nofuse=false,
unroot=false, multgammas=false)
Delete all the minor hybrid edges, except at input node. The network is left with a single hybridization, and otherwise displays the same major tree as before. If nofuse
is true, edges are not fused (degree-2 nodes are kept).
Warning: assume correct isMajor
fields.
PhyloNetworks.hardwiredClusters
— FunctionhardwiredClusters(net::HybridNetwork, taxon_labels)
Returns a matrix describing all the hardwired clusters in a network, with taxa listed in same order as in taxon_labels
to describe their membership in each cluster. Allows for missing taxa, with entries all 0.
Warnings:
- clusters are rooted, so the root must be correct.
- each hybrid node is assumed to have exactly 2 parents (no more).
Each row corresponds to one internal edge, that is, external edges are excluded. If the root is a leaf node, the external edge to that leaf is included (first row). Both parent hybrid edges to a given hybrid node only contribute a single row (they share the same hardwired cluster).
- first column: edge number
- next columns: 0/1. 1=descendant of edge, 0=not a descendant, or missing taxon.
- last column: 10/11 values. 10=tree edge, 11=hybrid edge
See also hardwiredClusterDistance
and hardwiredCluster
.
PhyloNetworks.hardwiredCluster
— FunctionhardwiredCluster(edge::Edge, taxa::Union{AbstractVector{String},AbstractVector{Int}})
hardwiredCluster!(v::Vector{Bool}, edge::Edge, taxa)
hardwiredCluster!(v::Vector{Bool}, edge::Edge, taxa, visited::Vector{Int})
Calculate the hardwired cluster of edge
, coded as a vector of booleans: true for taxa that are descendent of the edge, false for other taxa (including missing taxa).
The edge should belong in a rooted network for which isChild1
is up-to-date. Run directEdges!
beforehand. This is very important, otherwise one might enter an infinite loop, and the function does not test for this.
visited: vector of node numbers, of all visited nodes.
Examples:
julia> net5 = "(A,((B,#H1),(((C,(E)#H2),(#H2,F)),(D)#H1)));" |> readTopology |> directEdges! ;
julia> taxa = net5 |> tipLabels # ABC EF D
6-element Vector{String}:
"A"
"B"
"C"
"E"
"F"
"D"
julia> hardwiredCluster(net5.edge[12], taxa) # descendants of 12th edge = CEF
6-element Vector{Bool}:
0
0
1
1
1
0
See also hardwiredClusterDistance
and hardwiredClusters
PhyloNetworks.hardwiredClusterDistance
— FunctionhardwiredClusterDistance(net1::HybridNetwork, net2::HybridNetwork, rooted::Bool)
Hardwired cluster distance between the topologies of net1
and net2
, that is, the number of hardwired clusters found in one network and not in the other (with multiplicity, see below).
If the 2 networks are trees, this is the Robinson-Foulds distance. If rooted=false, then both networks are considered as semi-directed.
Networks are assumed bicombining (each hybrid has exactly 2 parents, no more).
Dissimilarity vs distance
This is not a distance per se on the full space of phylogenetic networks: there are pairs of distinct networks for which this dissimilarity is 0. But it is a distance on some classes of networks, such as the class of tree-child networks that are "normal" (without shortcuts), or the class of tree-child networks that can be assigned node ages such that hybrid edges have length 0 and tree edges have non-negative lengths. See Cardona, Rossello & Valiente (2008), Cardona, Llabres, Rossello & Valiente (2008), and Huson, Rupp, Scornavacca (2010).
Example
julia> net1 = readTopology("(t6,(t5,((t4,(t3,((t2,t1))#H1)),#H1)));");
julia> taxa = sort(tipLabels(net1)); # t1 through t6, sorted alphabetically
julia> # using PhyloPlots; plot(net1, showedgenumber=true);
julia> # in matrix below: column 1: edge number. last column: tree (10) vs hybrid (11) edge
# middle columns: for 'taxa': t1,...t6. 1=descendant, 0=not descendant
hardwiredClusters(net1, taxa)
6×8 Matrix{Int64}:
13 1 1 1 1 1 0 10
12 1 1 1 1 0 0 10
10 1 1 1 1 0 0 10
9 1 1 1 0 0 0 10
8 1 1 0 0 0 0 11
7 1 1 0 0 0 0 10
julia> net2 = readTopology("(t6,(t5,((t4,(t3)#H1),(#H1,(t1,t2)))));");
julia> hardwiredClusters(net2, taxa)
6×8 Matrix{Int64}:
13 1 1 1 1 1 0 10
12 1 1 1 1 0 0 10
6 0 0 1 1 0 0 10
5 0 0 1 0 0 0 11
11 1 1 1 0 0 0 10
10 1 1 0 0 0 0 10
julia> hardwiredClusterDistance(net1, net2, true) # true: as rooted networks
4
What is a hardwired cluster?
Each edge in a network is associated with its hardwired cluster, that is, the set of all its descendant taxa (leaves). The set of hardwired cluster of a network is the set of its edges' hardwired clusters. The dissimilarity d_hard
defined in Huson, Rupp, Scornavacca (2010) is the number of hardwired clusters that are in one network but not in the other.
This implementation is a slightly more discriminative version of d_hard
, where each cluster is counted with multiplicity and annotated with its edge's hybrid status, as follows:
- External edges are not counted (they are tree edges to a leaf, shared by all phylogenetic networks).
- A cluster is counted for each edge for which it's the hardwired cluster.
- At a given hybrid node, both hybrid partner edges have the same cluster, so this cluster is only counted once for both partners.
- A given cluster is matched between the two networks only if it's the cluster from a tree edge in both networks, or from a hybrid edge in both networks.
In the example above, net1
has a shortcut (hybrid edge 11) resulting in 2 tree edges (12 and 10) with the same cluster {t1,t2,t3,t4}. So cluster {t1,t2,t3,t4} has multiplicity 2 in net1
. net2
also has this cluster, but only associated with 1 tree edge, so this cluster contributes (2-1)=1 towards the hardwired cluster distance between the two networks. The distance of 4 corresponds to these 4 clusters:
- {t1,t2,t3,t4}: twice in net1, once in net2
- {t3,t4}: absent in net1, once in net2
- {t1,t2}: twice in net1 (from a hybrid edge & a tree edge), once in net2
- {t3}: absent in net1 (because external edges are not counted), once in net2 (from a hybrid edge).
Degree-2 nodes cause multiple edges to have the same cluster, so counting clusters with multiplicity distinguishes a network with extra degree-2 nodes from the "same" network after these nodes have been suppressed (e.g. with PhyloNetworks.fuseedgesat!
or PhyloNetworks.shrinkedge!
).
Networks as semi-directed
If rooted
is false and one of the phylogenies is not a tree (1+ reticulations), then all degree-2 nodes are removed before comparing the hardwired clusters, and the minimum distance is returned over all possible ways to root the networks at internal nodes.
See also: hardwiredClusters
, hardwiredCluster
PhyloNetworks.treeEdgesBootstrap
— FunctiontreeEdgesBootstrap(boot_net::Vector{HybridNetwork}, ref_net::HybridNetwork)
Read a list of bootstrap networks (boot_net
) and a reference network (ref_net
), and calculate the bootstrap support of the tree edges in the reference network. All minor hybrid edges (γ<0.5) are removed to extract the major tree from each network. All remaining edges are tree edges, each associated with a bipartition.
output:
- a data frame with one row per tree edge and two columns: edge number, bootstrap support (as a percentage)
- the major tree from the reference network, where minor hybrid edges (with γ<0.5) have been removed.
PhyloNetworks.hybridDetection
— FunctionhybridDetection(net::Vector{HybridNetwork}, net1::HybridNetwork, outgroup::AbstractString)
function can only compare hybrid nodes in networks that have the same underlying major tree also, need to root all networks in the same place, and the root has to be compatible with the direction of the hybrid edges
it computes the rooted hardwired distance between networks, the root matters. input: vector of bootstrap networks (net), estimated network (net1), outgroup
returns
- a matrix with one row per bootstrap network, and 2*number of hybrids in net1, column i corresponds to whether hybrid i (
net1.hybrid[i]
) is found in the bootstrap network, column 2i+1 corresponds to the estimated gamma on the bootstrap network (0.0 if hybrid not found). To know the order of hybrids, printnet1.hybrid
orh.name for h in net1.hybrid
- list of discrepant trees (trees not matching the main tree in net1)
PhyloNetworks.summarizeHFdf
— FunctionsummarizeHFdf(HFmat::Matrix)
Summarize data frame output from hybridDetection
. Output: dataframe with one row per hybrid, and 5 columns:
- hybrid index (order from estimated network, see
hybridDetection
, - number of bootstrap trees that match the underlying tree of estimated network
- number of bootstrap networks that have the hybrid
- mean estimated gamma in the bootstrap networks that have the hybrid
- sd estimated gamma in the bootstrap networks that have the hybrid also
last row has index -1, and the third column has the number of networks that have all hybrids (hybrid index, mean gamma, sd gamma are meaningless in this last row)
PhyloNetworks.hybridBootstrapSupport
— FunctionhybridBootstrapSupport(boot_net::Vector{HybridNetwork}, ref_net::HybridNetwork; rooted=false)
Match hybrid nodes in a reference network with those in an array of networks, like bootstrap networks. All networks must be fully resolved, and on the same taxon set. If rooted=true
, all networks are assumed to have been properly rooted beforehand. Otherwise, the origin of each hybrid edge is considered as an unrooted bipartition (default).
Two hybrid edges in two networks are said to match if they share the same "hybrid" clade (or recipient) and the same "donor clade", which is a sister to the hybrid clade in the network. Since a hybrid clade has 2 parent edges, it is sister to two clades simultaneously: one is its major sister (following the major hybrid edge with γ>0.5) and one is its minor sister (following the major hybrid edge with γ<0.5).
To calculate these hybrid and sister clades at a given hybrid node, all other hybrid edges are first removed from the network. Then, the hybrid clade is the hardwired cluster (descendants) of either hybrid edge and major/minor clade is the hardwired cluster of the sibling edge of the major/minor hybrid parent. If rooted=false
, sister clades are considered as bipartitions.
Output:
- a "node" data frame (see below)
- an "edge" data frame (see below)
- a "clade" data frame to describe the make up of all clades found as hybrids or sisters, starting with a column
taxa
that lists all taxa. All other columns correspond to a given clade and contain true/false values.true
means that a given taxon belongs in a given clade. For a clade namedH1
, for instance, and if the data frame was namedcla
, the list of taxa in this clade can be obtained withcla[:taxa][cla[:H1]]
. - an array of gamma values, with one row for each bootstrap network and two columns (major/minor) for each hybrid edge in the reference network. If this hybrid edge was found in the bootstrap network (i.e. same hybrid and sister clades, after removal of all other hybrid nodes), its bootstrap gamma value is recorded here. Otherwise, the gamma entry is 0.0.
- a vector with the number of each hybrid edge in the reference network, in the same order as for the columns in the array of gamma values above.
The "node" data frame has one row per clade and 9 columns giving:
:clade
: the clade's name, like the taxon name (if a hybrid is a single taxon) or the hybrid tag (like 'H1') in the reference network:node
: the node number in the reference network. missing if the clade is not in this network.:hybridnode
: typically the same node number as above, except for hybrid clades in the reference network. For those, the hybrid node number is listed here.:edge
: number of the parent edge, parent to the node in column 2, if found in the ref network. missing otherwise.:BS_hybrid
: percentage of bootstrap networks in which the clade is found to be a hybrid clade.:BS_sister
: percentage of bootstrap networks in which the clade is found to be sister to some hybrid clade (sum of the next 2 columns):BS_major_sister
: percentage of bootstrap networks in which the clade is found to be the major sister to some hybrid clade:BS_minor_sister
: same as previous, but minor:BS_hybrid_samesisters
: percentage of bootstrap networks in which the clade is found to be a hybrid and with the same set of sister clades as in the reference network. Applies to hybrid clades found in the reference network only, missing for all other clades.
The "edge" data frame has one row for each pair of clades, and 8 columns:
:edge
: hybrid edge number, if the edge appears in the reference network. missing otherwise.:hybrid_clade
: name of the clade found to be a hybrid, descendent of 'edge':hybrid
: node number of that clade, if it appears in the reference network. missing otherwise.:sister_clade
: name of the clade that is sister to 'edge', i.e. be sister to a hybrid:sister
: node number of that clade, if in the ref network.:BS_hybrid_edge
: percentage of bootstrap networks in which 'edge' is found to be a hybrid edge, i.e. when the clade in the 'hybrid' column is found to be a hybrid and the clade in the 'sister' column is one of its sisters.:BS_major
: percentage of bootstrap networks in which 'edge' is found to be a major hybrid edge, i.e. when 'hybrid' is found to be a hybrid clade and 'sister' is found to be its major sister.:BS_minor
: same as previous, but minor
continuous trait evolution
PhyloNetworks.phylolm
— Functionphylolm(X::Matrix, Y::Vector, net::HybridNetwork, model::ContinuousTraitEM=BM(); kwargs...)
Return a PhyloNetworkLinearModel
object. This method is called by phylolm(formula, data, network; kwargs...)
.
phylolm(f::StatsModels.FormulaTerm, fr::AbstractDataFrame, net::HybridNetwork; kwargs...)
Fit a phylogenetic linear regression model to data. Return an object of type PhyloNetworkLinearModel
. It contains a linear model from the GLM package, in object.lm
, of type GLM.LinearModel.
Arguments
f
: formula to use for the regression, see StatsModelsfr
: DataFrame containing the response values, predictor values, species/tip labels for each observation/row.net
: phylogenetic network to use. Should have labelled tips.
Keyword arguments
model="BM"
: model for trait evolution (as a string) "lambda" (Pagel's lambda), "scalingHybrid" are other possible values (seeContinuousTraitEM
)tipnames=:tipNames
: column name for species/tip-labels, represented as a symbol. For example, if the column containing the species/tip labels infr
is named "Species", then dotipnames=:Species
.no_names=false
: Iftrue
, force the function to ignore the tips names. The data is then assumed to be in the same order as the tips of the network. Default is false, setting it to true is dangerous, and strongly discouraged.reml=true
: iftrue
, use REML criterion ("restricted maximum likelihood") for estimating variance components, else use ML criterion.
The following tolerance parameters control the optimization of lambda if model="lambda"
or model="scalingHybrid"
, and control the optimization of the variance components if model="BM"
and withinspecies_var=true
.
fTolRel=1e-10
: relative tolerance on the likelihood valuefTolAbs=1e-10
: absolute tolerance on the likelihood valuexTolRel=1e-10
: relative tolerance on the parameter valuexTolAbs=1e-10
: absolute tolerance on the parameter valuestartingValue=0.5
: Ifmodel
="lambda" or "scalingHybrid", this provides the starting value for the optimization in lambda.fixedValue=missing
: Ifmodel
="lambda" or "scalingHybrid", andfixedValue
is a number, then lambda is set to this number and is not optimized.withinspecies_var=false
: Iftrue
, fits a within-species variation model. Currently only implemented formodel
="BM".y_mean_std::Bool=false
: Iftrue
, andwithinspecies_var=true
, then accounts for within-species variation, using species-level statistics provided infr
.
Methods applied to fitted models
To access the response values, do response(object)
. To access the model matrix, do modelmatrix(object)
. To access the model formula, do formula(object)
.
Within-species variation
For a high-level description, see PhyloNetworkLinearModel
. To fit a model with within-species variation in the response variable, either of the following must be provided in the data frame fr
:
(1) Individual-level data: There should be columns for response, predictors, and species/tip-labels. Every row should correspond to an individual observation. At least one species must be represented by two or more individuals.
(2) Species-level statistics: There should be columns for mean response, predictors, species/tip-labels, species sample-sizes (number of individuals for each species), and species standard deviations (standard deviations of the response values by species). Every row should correspond to a species: each species should be represented by a unique row. The column names for species sample-sizes and species standard deviations are expected to be "[response column name]_n" and "[response column name]_sd". For example, if the response column name is "y", then the column names should be "y_n" and "y_sd" for the sample-sizes and standard deviations.
Regardless of whether the data provided follows (1) or (2), withinspecies_var
should be set to true. If the data provided follows (2), then y_mean_std
should be set to false.
Within-species variation in predictors
The model assumes no within-species variation in predictors, because it aims to capture the evolutionary (historical, phylogenetic) relationship between the predictors and the response, not the within-species (present-day, or phenotypic) relationship.
If a within-species variation model is fitted on individual-level data, and if there are individuals within the same species with different values for the same predictor, these values are all replaced by the mean predictor value for all the individuals in that species. For example, suppose there are 3 individuals in a given species, and that their predictor values are (x₁=3, x₂=6), (x₁=4, x₂=8) and (x₁=2, x₂=1). Then the predictor values for these 3 individuals are each replaced by (x₁=(3+4+2)/3, x₂=(6+8+1)/3) before model fitting. If a fourth individual had data (x₁=10, x₂=missing), then that individual would be ignored for any model using x₂, and would not contribute any information to its species data for these models.
Missing data
Rows with missing data for either the response or the predictors are omitted from the model-fitting. There should minimally be columns for response, predictors, species/tip-labels. As detailed above, additional columns may be required for fitting within-species variation. Missing data in the columns for species names, species standard deviation / sample sizes (if used) will throw an error.
See also
simulate
, ancestralStateReconstruction
, vcv
Examples: Without within-species variation
julia> phy = readTopology(joinpath(dirname(pathof(PhyloNetworks)), "..", "examples", "caudata_tree.txt"));
julia> using DataFrames, CSV # to read data file, next
julia> dat = CSV.File(joinpath(dirname(pathof(PhyloNetworks)), "..", "examples", "caudata_trait.txt")) |> DataFrame;
julia> using StatsModels # for stat model formulas
julia> fitBM = phylolm(@formula(trait ~ 1), dat, phy; reml=false);
julia> fitBM # Shows a summary
PhyloNetworkLinearModel
Formula: trait ~ 1
Model: Brownian motion
Parameter Estimates, using ML:
phylogenetic variance rate: 0.00294521
Coefficients:
─────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────
(Intercept) 4.679 0.330627 14.15 <1e-31 4.02696 5.33104
─────────────────────────────────────────────────────────────────────
Log Likelihood: -78.9611507833
AIC: 161.9223015666
julia> round(sigma2_phylo(fitBM), digits=6) # rounding for jldoctest convenience
0.002945
julia> round(mu_phylo(fitBM), digits=4)
4.679
julia> using StatsBase # for aic() stderror() loglikelihood() etc.
julia> round(loglikelihood(fitBM), digits=10)
-78.9611507833
julia> round(aic(fitBM), digits=10)
161.9223015666
julia> round(aicc(fitBM), digits=10)
161.9841572367
julia> round(bic(fitBM), digits=10)
168.4887090241
julia> round.(coef(fitBM), digits=4)
1-element Vector{Float64}:
4.679
julia> confint(fitBM)
1×2 Matrix{Float64}:
4.02696 5.33104
julia> abs(round(r2(fitBM), digits=10)) # absolute value for jldoctest convenience
0.0
julia> abs(round(adjr2(fitBM), digits=10))
0.0
julia> round.(vcov(fitBM), digits=6)
1×1 Matrix{Float64}:
0.109314
julia> round.(residuals(fitBM), digits=6)
197-element Vector{Float64}:
-0.237648
-0.357937
-0.159387
-0.691868
-0.323977
-0.270452
-0.673486
-0.584654
-0.279882
-0.302175
⋮
-0.777026
-0.385121
-0.443444
-0.327303
-0.525953
-0.673486
-0.603158
-0.211712
-0.439833
julia> round.(response(fitBM), digits=5)
197-element Vector{Float64}:
4.44135
4.32106
4.51961
3.98713
4.35502
4.40855
4.00551
4.09434
4.39912
4.37682
⋮
3.90197
4.29388
4.23555
4.3517
4.15305
4.00551
4.07584
4.46729
4.23917
julia> round.(predict(fitBM), digits=5)
197-element Vector{Float64}:
4.679
4.679
4.679
4.679
4.679
4.679
4.679
4.679
4.679
4.679
⋮
4.679
4.679
4.679
4.679
4.679
4.679
4.679
4.679
4.679
Examples: With within-species variation (two different input formats shown)
julia> using DataFrames, StatsModels # for statistical model formulas
julia> net = readTopology("((((D:0.4,C:0.4):4.8,((A:0.8,B:0.8):2.2)#H1:2.2::0.7):4.0,(#H1:0::0.3,E:3.0):6.2):2.0,O:11.2);");
julia> df = DataFrame( # individual-level observations
species = repeat(["D","C","A","B","E","O"],inner=3),
trait1 = [4.08298,4.08298,4.08298,3.10782,3.10782,3.10782,2.17078,2.17078,2.17078,1.87333,1.87333,
1.87333,2.8445,2.8445,2.8445,5.88204,5.88204,5.88204],
trait2 = [-7.34186,-7.34186,-7.34186,-7.45085,-7.45085,-7.45085,-3.32538,-3.32538,-3.32538,-4.26472,
-4.26472,-4.26472,-5.96857,-5.96857,-5.96857,-1.99388,-1.99388,-1.99388],
trait3 = [18.8101,18.934,18.9438,17.0687,17.0639,17.0732,14.4818,14.1112,14.2817,13.0842,12.9562,
12.9019,15.4373,15.4075,15.4317,24.2249,24.1449,24.1302]);
julia> m1 = phylolm(@formula(trait3 ~ trait1), df, net;
tipnames=:species, withinspecies_var=true)
PhyloNetworkLinearModel
Formula: trait3 ~ 1 + trait1
Model: Brownian motion
Parameter Estimates, using REML:
phylogenetic variance rate: 0.156188
within-species variance: 0.0086343
Coefficients:
──────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept) 9.65347 1.3066 7.39 0.0018 6.02577 13.2812
trait1 2.30358 0.276163 8.34 0.0011 1.53683 3.07033
──────────────────────────────────────────────────────────────────────
Log Likelihood: 1.9446255188
AIC: 4.1107489623
julia> df_r = DataFrame( # species-level statistics (sample means, standard deviations)
species = ["D","C","A","B","E","O"],
trait1 = [4.08298,3.10782,2.17078,1.87333,2.8445,5.88204],
trait2 = [-7.34186,-7.45085,-3.32538,-4.26472,-5.96857,-1.99388],
trait3 = [18.896,17.0686,14.2916,12.9808,15.4255,24.1667],
trait3_sd = [0.074524,0.00465081,0.185497,0.0936,0.0158379,0.0509643],
trait3_n = [3, 3, 3, 3, 3, 3]);
julia> m2 = phylolm(@formula(trait3 ~ trait1), df_r, net;
tipnames=:species, withinspecies_var=true, y_mean_std=true)
PhyloNetworkLinearModel
Formula: trait3 ~ 1 + trait1
Model: Brownian motion
Parameter Estimates, using REML:
phylogenetic variance rate: 0.15618
within-species variance: 0.0086343
Coefficients:
──────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────
(Intercept) 9.65342 1.30657 7.39 0.0018 6.02582 13.281
trait1 2.30359 0.276156 8.34 0.0011 1.53686 3.07032
──────────────────────────────────────────────────────────────────────
Log Likelihood: 1.9447243714
AIC: 4.1105512573
PhyloNetworks.sigma2_phylo
— Functionsigma2_phylo(m::PhyloNetworkLinearModel)
Estimated between-species variance-rate for a fitted object.
PhyloNetworks.sigma2_within
— Functionsigma2_within(m::PhyloNetworkLinearModel)
Estimated within-species variance for a fitted object.
PhyloNetworks.mu_phylo
— Functionmu_phylo(m::PhyloNetworkLinearModel)
Estimated root value for a fitted object.
PhyloNetworks.lambda_estim
— Functionlambda_estim(m::PhyloNetworkLinearModel)
Estimated lambda parameter for a fitted object.
PhyloNetworks.ancestralStateReconstruction
— FunctionancestralStateReconstruction(net::HybridNetwork, Y::Vector, params::ParamsBM)
Compute the conditional expectations and variances of the ancestral (un-observed) traits values at the internal nodes of the phylogenetic network (net
), given the values of the traits at the tips of the network (Y
) and some known parameters of the process used for trait evolution (params
, only BM with fixed root works for now).
This function assumes that the parameters of the process are known. For a more general function, see ancestralStateReconstruction(obj::PhyloNetworkLinearModel[, X_n::Matrix])
.
ancestralStateReconstruction(obj::PhyloNetworkLinearModel[, X_n::Matrix])
Function to find the ancestral traits reconstruction on a network, given an object fitted by function phylolm
. By default, the function assumes that the regressor is just an intercept. If the value of the regressor for all the ancestral states is known, it can be entered in X_n, a matrix with as many columns as the number of predictors used, and as many lines as the number of unknown nodes or tips.
Returns an object of type ReconstructedStates
.
Examples
julia> using DataFrames, CSV # to read data file
julia> phy = readTopology(joinpath(dirname(pathof(PhyloNetworks)), "..", "examples", "carnivores_tree.txt"));
julia> dat = CSV.File(joinpath(dirname(pathof(PhyloNetworks)), "..", "examples", "carnivores_trait.txt")) |> DataFrame;
julia> using StatsModels # for statistical model formulas
julia> fitBM = phylolm(@formula(trait ~ 1), dat, phy);
julia> ancStates = ancestralStateReconstruction(fitBM) # Should produce a warning, as variance is unknown.
┌ Warning: These prediction intervals show uncertainty in ancestral values,
│ assuming that the estimated variance rate of evolution is correct.
│ Additional uncertainty in the estimation of this variance rate is
│ ignored, so prediction intervals should be larger.
└ @ PhyloNetworks ~/build/juliaphylo/PhyloNetworks.jl/src/traits.jl:3359
ReconstructedStates:
───────────────────────────────────────────────
Node index Pred. Min. Max. (95%)
───────────────────────────────────────────────
-5.0 1.32139 -0.33824 2.98102
-8.0 1.03258 -0.589695 2.65485
-7.0 1.41575 -0.140705 2.97221
-6.0 1.39417 -0.107433 2.89577
-4.0 1.39961 -0.102501 2.90171
-3.0 1.51341 -0.220523 3.24733
-13.0 5.3192 3.92279 6.71561
-12.0 4.51176 2.89222 6.13131
-16.0 1.50947 -0.0186118 3.03755
-15.0 1.67425 0.196069 3.15242
-14.0 1.80309 0.309992 3.29618
-11.0 2.7351 1.17608 4.29412
-10.0 2.73217 1.12361 4.34073
-9.0 2.41132 0.603932 4.21871
-2.0 2.04138 -0.0340955 4.11686
14.0 1.64289 1.64289 1.64289
8.0 1.67724 1.67724 1.67724
5.0 0.331568 0.331568 0.331568
2.0 2.27395 2.27395 2.27395
4.0 0.275237 0.275237 0.275237
6.0 3.39094 3.39094 3.39094
13.0 0.355799 0.355799 0.355799
15.0 0.542565 0.542565 0.542565
7.0 0.773436 0.773436 0.773436
10.0 6.94985 6.94985 6.94985
11.0 4.78323 4.78323 4.78323
12.0 5.33016 5.33016 5.33016
1.0 -0.122604 -0.122604 -0.122604
16.0 0.73989 0.73989 0.73989
9.0 4.84236 4.84236 4.84236
3.0 1.0695 1.0695 1.0695
───────────────────────────────────────────────
julia> expectations(ancStates)
31×2 DataFrame
Row │ nodeNumber condExpectation
│ Int64 Float64
─────┼─────────────────────────────
1 │ -5 1.32139
2 │ -8 1.03258
3 │ -7 1.41575
4 │ -6 1.39417
5 │ -4 1.39961
6 │ -3 1.51341
7 │ -13 5.3192
8 │ -12 4.51176
⋮ │ ⋮ ⋮
25 │ 10 6.94985
26 │ 11 4.78323
27 │ 12 5.33016
28 │ 1 -0.122604
29 │ 16 0.73989
30 │ 9 4.84236
31 │ 3 1.0695
16 rows omitted
julia> predint(ancStates)
31×2 Matrix{Float64}:
-0.33824 2.98102
-0.589695 2.65485
-0.140705 2.97221
-0.107433 2.89577
-0.102501 2.90171
-0.220523 3.24733
3.92279 6.71561
2.89222 6.13131
-0.0186118 3.03755
0.196069 3.15242
⋮
0.542565 0.542565
0.773436 0.773436
6.94985 6.94985
4.78323 4.78323
5.33016 5.33016
-0.122604 -0.122604
0.73989 0.73989
4.84236 4.84236
1.0695 1.0695
julia> expectationsPlot(ancStates) # format the ancestral states
31×2 DataFrame
Row │ nodeNumber PredInt
│ Int64 Abstract…
─────┼───────────────────────
1 │ -5 1.32
2 │ -8 1.03
3 │ -7 1.42
4 │ -6 1.39
5 │ -4 1.4
6 │ -3 1.51
7 │ -13 5.32
8 │ -12 4.51
⋮ │ ⋮ ⋮
25 │ 10 6.95
26 │ 11 4.78
27 │ 12 5.33
28 │ 1 -0.12
29 │ 16 0.74
30 │ 9 4.84
31 │ 3 1.07
16 rows omitted
julia> using PhyloPlots # next: plot ancestral states on the tree
julia> plot(phy, nodelabel = expectationsPlot(ancStates));
julia> predintPlot(ancStates) # prediction intervals, in data frame, useful to plot
31×2 DataFrame
Row │ nodeNumber PredInt
│ Int64 Abstract…
─────┼───────────────────────────
1 │ -5 [-0.34, 2.98]
2 │ -8 [-0.59, 2.65]
3 │ -7 [-0.14, 2.97]
4 │ -6 [-0.11, 2.9]
5 │ -4 [-0.1, 2.9]
6 │ -3 [-0.22, 3.25]
7 │ -13 [3.92, 6.72]
8 │ -12 [2.89, 6.13]
⋮ │ ⋮ ⋮
25 │ 10 6.95
26 │ 11 4.78
27 │ 12 5.33
28 │ 1 -0.12
29 │ 16 0.74
30 │ 9 4.84
31 │ 3 1.07
16 rows omitted
julia> plot(phy, nodelabel = predintPlot(ancStates));
julia> allowmissing!(dat, :trait);
julia> dat[[2, 5], :trait] .= missing; # missing values allowed to fit model
julia> fitBM = phylolm(@formula(trait ~ 1), dat, phy);
julia> ancStates = ancestralStateReconstruction(fitBM);
┌ Warning: These prediction intervals show uncertainty in ancestral values,
│ assuming that the estimated variance rate of evolution is correct.
│ Additional uncertainty in the estimation of this variance rate is
│ ignored, so prediction intervals should be larger.
└ @ PhyloNetworks ~/build/juliaphylo/PhyloNetworks.jl/src/traits.jl:3166
julia> first(expectations(ancStates), 3) # looking at first 3 nodes only
3×2 DataFrame
Row │ nodeNumber condExpectation
│ Int64 Float64
─────┼─────────────────────────────
1 │ -5 1.42724
2 │ -8 1.35185
3 │ -7 1.61993
julia> predint(ancStates)[1:3,:] # just first 3 nodes again
3×2 Matrix{Float64}:
-0.373749 3.22824
-0.698432 3.40214
-0.17179 3.41165
julia> first(expectationsPlot(ancStates),3) # format node <-> ancestral state
3×2 DataFrame
Row │ nodeNumber PredInt
│ Int64 Abstract…
─────┼───────────────────────
1 │ -5 1.43
2 │ -8 1.35
3 │ -7 1.62
julia> plot(phy, nodelabel = expectationsPlot(ancStates));
julia> first(predintPlot(ancStates),3) # prediction intervals, useful to plot
3×2 DataFrame
Row │ nodeNumber PredInt
│ Int64 Abstract…
─────┼───────────────────────────
1 │ -5 [-0.37, 3.23]
2 │ -8 [-0.7, 3.4]
3 │ -7 [-0.17, 3.41]
julia> plot(phy, nodelabel = predintPlot(ancStates));
ancestralStateReconstruction(fr::AbstractDataFrame, net::HybridNetwork; kwargs...)
Estimate the ancestral traits on a network, given some data at the tips. Uses function phylolm
to perform a phylogenetic regression of the data against an intercept (amounts to fitting an evolutionary model on the network).
See documentation on phylolm
and ancestralStateReconstruction(obj::PhyloNetworkLinearModel[, X_n::Matrix])
for further details.
Returns an object of type ReconstructedStates
.
ancestralStateReconstruction(obj::SSM, trait::Integer = 1)
Estimate the marginal probability of ancestral states for discrete character number trait
(first trait by default). The parameters of the StatisticalSubstitutionModel
object obj
must first be fitted using fitdiscrete
, and ancestral state reconstruction is conditional on the estimated parameters. If these parameters were estimated using all traits, they are used as is, to do ancestral state reconstruction of the particular trait
of interest.
output: data frame with a first column for the node numbers, a second column for the node labels, and a column for each possible state: the entries in these columns give the marginal probability that a given node has a given state.
warnings:
- node numbers and node labels refer to those in
obj.net
, which might have a different internal representation of nodes than the original network used to buildobj
. obj
is modified: its likelihood fields (forward, directional & backward) are updated to make sure that they correspond to the current parameter values inobj.model
, and to thetrait
of interest.
limitations: the following are not checked.
- Assumes that every node in the large network is also present (with descendant leaves) in each displayed tree. This is not true if the network is not tree-child...
- Assumes that the root is also in each displayed tree, which may not be the case if the root had a hybrid child edge.
See also posterior_logtreeweight
and discrete_backwardlikelihood_trait!
to update obj.backwardlik
.
examples
julia> net = readTopology("(((A:2.0,(B:1.0)#H1:0.1::0.9):1.5,(C:0.6,#H1:1.0::0.1):1.0):0.5,D:2.0);");
julia> m1 = BinaryTraitSubstitutionModel([0.1, 0.1], ["lo", "hi"]);
julia> using DataFrames
julia> dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]);
julia> fit1 = fitdiscrete(net, m1, dat);
julia> asr = ancestralStateReconstruction(fit1)
9×4 DataFrame
Row │ nodenumber nodelabel lo hi
│ Int64 String Float64 Float64
─────┼───────────────────────────────────────────
1 │ 1 A 1.0 0.0
2 │ 2 B 1.0 0.0
3 │ 3 C 0.0 1.0
4 │ 4 D 0.0 1.0
5 │ 5 5 0.286021 0.713979
6 │ 6 6 0.319456 0.680544
7 │ 7 7 0.16855 0.83145
8 │ 8 8 0.767359 0.232641
9 │ 9 H1 0.782776 0.217224
julia> using PhyloPlots
julia> plot(fit1.net, nodelabel = asr[!,[:nodenumber, :lo]], tipoffset=0.2); # pp for "lo" state
PhyloNetworks.expectations
— Functionexpectations(obj::ReconstructedStates)
Estimated reconstructed states at the nodes and tips.
PhyloNetworks.predint
— Functionpredint(obj::ReconstructedStates; level=0.95::Real)
Prediction intervals with level level
for internal nodes and missing tips.
PhyloNetworks.expectationsPlot
— FunctionexpectationsPlot(obj::ReconstructedStates)
Compute and format the expected reconstructed states for the plotting function. The resulting dataframe can be readily used as a nodelabel
argument to plot
from package PhyloPlots
. Keyword argument markMissing
is a string that is appended to predicted tip values, so that they can be distinguished from the actual datapoints. Default to "*". Set to "" to remove any visual cue.
PhyloNetworks.predintPlot
— FunctionpredintPlot(obj::ReconstructedStates; level=0.95::Real, withExp=false::Bool)
Compute and format the prediction intervals for the plotting function. The resulting dataframe can be readily used as a nodelabel
argument to plot
from package PhyloPlots
. Keyworks argument level
control the confidence level of the prediction interval. If withExp
is set to true, then the best predicted value is also shown along with the interval.
PhyloNetworks.simulate
— Functionsimulate(net::HybridNetwork, params::ParamsProcess, checkPreorder=true::Bool)
Simulate traits on net
using the parameters params
. For now, only parameters of type ParamsBM
(univariate Brownian Motion) and ParamsMultiBM
(multivariate Brownian motion) are accepted.
The simulation using a recursion from the root to the tips of the network, therefore, a pre-ordering of nodes is needed. If checkPreorder=true
(default), preorder!
is called on the network beforehand. Otherwise, it is assumed that the preordering has already been calculated.
Returns an object of type TraitSimulation
, which has a matrix with the trait expecations and simulated trait values at all the nodes.
See examples below for accessing expectations and simulated trait values.
Examples
Univariate
julia> phy = readTopology("(A:2.5,((U:1,#H1:0.5::0.4):1,(C:1,(D:0.5)#H1:0.5::0.6):1):0.5);");
julia> par = ParamsBM(1, 0.1) # BM with expectation 1 and variance 0.1.
ParamsBM:
Parameters of a BM with fixed root:
mu: 1
Sigma2: 0.1
julia> using Random; Random.seed!(17920921); # for reproducibility
julia> sim = simulate(phy, par) # Simulate on the tree.
TraitSimulation:
Trait simulation results on a network with 4 tips, using a BM model, with parameters:
mu: 1
Sigma2: 0.1
julia> traits = sim[:Tips] # Extract simulated values at the tips.
4-element Vector{Float64}:
0.9664650558470932
0.4104321932336118
0.2796524923704289
0.7306692819731366
julia> sim.M.tipNames # name of tips, in the same order as values above
4-element Vector{String}:
"A"
"U"
"C"
"D"
julia> traits = sim[:InternalNodes] # Extract simulated values at internal nodes. Order: as in sim.M.internalNodeNumbers
5-element Vector{Float64}:
0.5200361297500204
0.8088890626285765
0.9187604100796469
0.711921371091375
1.0
julia> traits = sim[:All] # simulated values at all nodes, ordered as in sim.M.nodeNumbersTopOrder
9-element Vector{Float64}:
1.0
0.711921371091375
0.9187604100796469
0.2796524923704289
0.5200361297500204
0.8088890626285765
0.7306692819731366
0.4104321932336118
0.9664650558470932
julia> traits = sim[:Tips, :Exp] # Extract expected values at the tips (also works for sim[:All, :Exp] and sim[:InternalNodes, :Exp]).
4-element Vector{Float64}:
1.0
1.0
1.0
1.0
Multivariate
julia> phy = readTopology("(A:2.5,((B:1,#H1:0.5::0.4):1,(C:1,(V:0.5)#H1:0.5::0.6):1):0.5);");
julia> par = ParamsMultiBM([1.0, 2.0], [1.0 0.5; 0.5 1.0]) # BM with expectation [1.0, 2.0] and variance [1.0 0.5; 0.5 1.0].
ParamsMultiBM:
Parameters of a MBD with fixed root:
mu: [1.0, 2.0]
Sigma: [1.0 0.5; 0.5 1.0]
julia> using Random; Random.seed!(17920921); # for reproducibility
julia> sim = simulate(phy, par) # simulate on the phylogeny
TraitSimulation:
Trait simulation results on a network with 4 tips, using a MBD model, with parameters:
mu: [1.0, 2.0]
Sigma: [1.0 0.5; 0.5 1.0]
julia> traits = sim[:Tips] # Extract simulated values at the tips (each column contains the simulated traits for one node).
2×4 Matrix{Float64}:
2.99232 -0.548734 -1.79191 -0.773613
4.09575 0.712958 0.71848 2.00343
julia> traits = sim[:InternalNodes] # simulated values at internal nodes. order: same as in sim.M.internalNodeNumbers
2×5 Matrix{Float64}:
-0.260794 -1.61135 -1.93202 0.0890154 1.0
1.46998 1.28614 0.409032 1.94505 2.0
julia> traits = sim[:All]; # 2×9 Matrix: values at all nodes, ordered as in sim.M.nodeNumbersTopOrder
julia> sim[:Tips, :Exp] # Extract expected values (also works for sim[:All, :Exp] and sim[:InternalNodes, :Exp])
2×4 Matrix{Float64}:
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
PhyloNetworks.shiftHybrid
— FunctionshiftHybrid(value::Vector{T} where T<:Real, net::HybridNetwork; checkPreorder=true::Bool)
Construct an object ShiftNet
with shifts on all the edges below hybrid nodes, with values provided. The vector of values must have the same length as the number of hybrids in the network.
PhyloNetworks.getShiftEdgeNumber
— FunctiongetShiftEdgeNumber(shift::ShiftNet)
Get the edge numbers where the shifts are located, for an object ShiftNet
. If a shift is placed at the root node with no parent edge, the edge number of a shift is set to -1 (as if missing).
PhyloNetworks.getShiftValue
— FunctiongetShiftValue(shift::ShiftNet)
Get the values of the shifts, for an object ShiftNet
.
PhyloNetworks.descendenceMatrix
— FunctiondescendenceMatrix(net::HybridNetwork; checkPreorder=true::Bool)
Descendence matrix between all the nodes of a network: object D
of type MatrixTopologicalOrder
in which D[i,j]
is the proportion of genetic material in node i
that can be traced back to node j
. If D[i,j]>0
then j
is a descendent of i
(and j
is an ancestor of i
). The network is assumed to be pre-ordered if checkPreorder
is false. If checkPreorder
is true (default), preorder!
is run on the network beforehand.
PhyloNetworks.regressorShift
— FunctionregressorShift(node::Vector{Node}, net::HybridNetwork; checkPreorder=true)
regressorShift(edge::Vector{Edge}, net::HybridNetwork; checkPreorder=true)
Compute the regressor vectors associated with shifts on edges that are above nodes node
, or on edges edge
, on a network net
. It uses function descendenceMatrix
, so net
might be modified to sort it in a pre-order. Return a DataFrame
with as many rows as there are tips in net, and a column for each shift, each labelled according to the pattern shift{numberof_edge}. It has an aditional column labelled tipNames
to allow easy fitting afterward (see example).
Examples
julia> net = readTopology("(A:2.5,((B:1,#H1:0.5::0.4):1,(C:1,(D:0.5)#H1:0.5::0.6):1):0.5);");
julia> preorder!(net)
julia> using PhyloPlots
julia> plot(net, shownodenumber=true); # to locate nodes
julia> nodes_shifts = indexin([1,-5], [n.number for n in net.node]) # Put a shift on edges ending at nodes 1 and -5
2-element Vector{Union{Nothing, Int64}}:
1
7
julia> params = ParamsBM(10, 0.1, ShiftNet(net.node[nodes_shifts], [3.0, -3.0], net))
ParamsBM:
Parameters of a BM with fixed root:
mu: 10
Sigma2: 0.1
There are 2 shifts on the network:
──────────────────────────
Edge Number Shift Value
──────────────────────────
8.0 -3.0
1.0 3.0
──────────────────────────
julia> using Random; Random.seed!(2468); # sets the seed for reproducibility
julia> sim = simulate(net, params); # simulate a dataset with shifts
julia> using DataFrames # to handle data frames
julia> dat = DataFrame(trait = sim[:Tips], tipNames = sim.M.tipNames);
julia> dat = DataFrame(trait = [13.391976856737717, 9.55741491696386, 7.17703734817448, 7.889062527849697],
tipNames = ["A","B","C","D"]) # hard-coded, to be independent of random number generator
4×2 DataFrame
Row │ trait tipNames
│ Float64 String
─────┼────────────────────
1 │ 13.392 A
2 │ 9.55741 B
3 │ 7.17704 C
4 │ 7.88906 D
julia> dfr_shift = regressorShift(net.node[nodes_shifts], net) # the regressors matching the shifts.
4×3 DataFrame
Row │ shift_1 shift_8 tipNames
│ Float64 Float64 String
─────┼────────────────────────────
1 │ 1.0 0.0 A
2 │ 0.0 0.0 B
3 │ 0.0 1.0 C
4 │ 0.0 0.6 D
julia> dfr = innerjoin(dat, dfr_shift, on=:tipNames); # join data and regressors in a single dataframe
julia> using StatsModels # for statistical model formulas
julia> fitBM = phylolm(@formula(trait ~ shift_1 + shift_8), dfr, net; reml=false) # actual fit
PhyloNetworkLinearModel
Formula: trait ~ 1 + shift_1 + shift_8
Model: Brownian motion
Parameter Estimates, using ML:
phylogenetic variance rate: 0.0112618
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 9.48238 0.327089 28.99 0.0220 5.32632 13.6384
shift_1 3.9096 0.46862 8.34 0.0759 -2.04479 9.86399
shift_8 -2.4179 0.422825 -5.72 0.1102 -7.7904 2.95461
────────────────────────────────────────────────────────────────────────
Log Likelihood: 1.8937302027
AIC: 4.2125395947
See also
PhyloNetworks.regressorHybrid
— FunctionregressorHybrid(net::HybridNetwork; checkPreorder=true::Bool)
Compute the regressor vectors associated with shifts on edges that imediatly below all hybrid nodes of net
. It uses function descendenceMatrix
through a call to regressorShift
, so net
might be modified to sort it in a pre-order. Return a DataFrame
with as many rows as there are tips in net, and a column for each hybrid, each labelled according to the pattern shift{numberof_edge}. It has an aditional column labelled tipNames
to allow easy fitting afterward (see example).
This function can be used to test for heterosis.
Examples
julia> using DataFrames # Needed to handle data frames.
julia> net = readTopology("(A:2.5,((B:1,#H1:0.5::0.4):1,(C:1,(D:0.5)#H1:0.5::0.6):1):0.5);");
julia> preorder!(net)
julia> using PhyloPlots
julia> plot(net, shownodenumber=true); # to locate nodes: node 5 is child of hybrid node
julia> nodes_hybrids = indexin([5], [n.number for n in net.node]) # Put a shift on edges below hybrids
1-element Vector{Union{Nothing, Int64}}:
5
julia> params = ParamsBM(10, 0.1, ShiftNet(net.node[nodes_hybrids], [3.0], net))
ParamsBM:
Parameters of a BM with fixed root:
mu: 10
Sigma2: 0.1
There are 1 shifts on the network:
──────────────────────────
Edge Number Shift Value
──────────────────────────
6.0 3.0
──────────────────────────
julia> using Random; Random.seed!(2468); # sets the seed for reproducibility
julia> sim = simulate(net, params); # simulate a dataset with shifts
julia> dat = DataFrame(trait = sim[:Tips], tipNames = sim.M.tipNames);
julia> dat = DataFrame(trait = [10.391976856737717, 9.55741491696386, 10.17703734817448, 12.689062527849698],
tipNames = ["A","B","C","D"]) # hard-code values for more reproducibility
4×2 DataFrame
Row │ trait tipNames
│ Float64 String
─────┼────────────────────
1 │ 10.392 A
2 │ 9.55741 B
3 │ 10.177 C
4 │ 12.6891 D
julia> dfr_hybrid = regressorHybrid(net) # the regressors matching the hybrids.
4×3 DataFrame
Row │ shift_6 tipNames sum
│ Float64 String Float64
─────┼────────────────────────────
1 │ 0.0 A 0.0
2 │ 0.0 B 0.0
3 │ 0.0 C 0.0
4 │ 1.0 D 1.0
julia> dfr = innerjoin(dat, dfr_hybrid, on=:tipNames); # join data and regressors in a single dataframe
julia> using StatsModels
julia> fitBM = phylolm(@formula(trait ~ shift_6), dfr, net; reml=false) # actual fit
PhyloNetworkLinearModel
Formula: trait ~ 1 + shift_6
Model: Brownian motion
Parameter Estimates, using ML:
phylogenetic variance rate: 0.041206
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 10.064 0.277959 36.21 0.0008 8.86805 11.26
shift_6 2.72526 0.315456 8.64 0.0131 1.36796 4.08256
────────────────────────────────────────────────────────────────────────
Log Likelihood: -0.7006021946
AIC: 7.4012043891
See also
PhyloNetworks.sharedPathMatrix
— FunctionsharedPathMatrix(net::HybridNetwork; checkPreorder=true::Bool)
This function computes the shared path matrix between all the nodes of a network. It assumes that the network is in the pre-order. If checkPreorder is true (default), then it runs function preorder!
on the network beforehand.
Returns an object of type MatrixTopologicalOrder
.
PhyloNetworks.vcv
— Functionvcv(net::HybridNetwork; model="BM"::AbstractString,
corr=false::Bool,
checkPreorder=true::Bool)
This function computes the variance covariance matrix between the tips of the network, assuming a Brownian model of trait evolution (with unit variance). If optional argument corr
is set to true
, then the correlation matrix is returned instead.
The function returns a DataFrame
object, with columns named by the tips of the network.
The calculation of the covariance matrix requires a pre-ordering of nodes to be fast. If checkPreorder
is true (default), then preorder!
is run on the network beforehand. Otherwise, the network is assumed to be already in pre-order.
This function internally calls sharedPathMatrix
, which computes the variance matrix between all the nodes of the network.
Examples
julia> tree_str = "(((t2:0.14,t4:0.33):0.59,t3:0.96):0.14,(t5:0.70,t1:0.18):0.90);";
julia> tree = readTopology(tree_str);
julia> C = vcv(tree)
5×5 DataFrame
Row │ t2 t4 t3 t5 t1
│ Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 0.87 0.73 0.14 0.0 0.0
2 │ 0.73 1.06 0.14 0.0 0.0
3 │ 0.14 0.14 1.1 0.0 0.0
4 │ 0.0 0.0 0.0 1.6 0.9
5 │ 0.0 0.0 0.0 0.9 1.08
The following block needs ape
to be installed (not run):
julia> using RCall # Comparison with ape vcv function
julia> R"ape::vcv(ape::read.tree(text = $tree_str))"
RCall.RObject{RCall.RealSxp}
t2 t4 t3 t5 t1
t2 0.87 0.73 0.14 0.0 0.00
t4 0.73 1.06 0.14 0.0 0.00
t3 0.14 0.14 1.10 0.0 0.00
t5 0.00 0.00 0.00 1.6 0.90
t1 0.00 0.00 0.00 0.9 1.08
The covariance can also be calculated on a network (for the model, see Bastide et al. 2018)
julia> net = readTopology("((t1:1.0,#H1:0.1::0.30):0.5,((t2:0.9)#H1:0.2::0.70,t3:1.1):0.4);");
julia> C = vcv(net)
3×3 DataFrame
Row │ t1 t2 t3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.5 0.15 0.0
2 │ 0.15 1.248 0.28
3 │ 0.0 0.28 1.5
Base.getindex
— Methodgetindex(obj, d)
Getting submatrices of an object of type TraitSimulation
.
Arguments
obj::TraitSimulation
: the matrix from which to extract.d::Symbol
: a symbol precising which sub-matrix to extract. Can be::Tips
columns and/or rows corresponding to the tips:InternalNodes
columns and/or rows corresponding to the internal nodes
discrete trait evolution
PhyloNetworks.parsimonySoftwired
— FunctionparsimonySoftwired(net, tipdata)
parsimonySoftwired(net, species, sequences)
Calculate the most parsimonious (MP) score of a network given a discrete character at the tips. The softwired parsimony concept is used: where the number of state transitions is minimized over all trees displayed in the network.
Data can given in one of the following:
tipdata
: data frame for a single trait, in which case the taxon names are to appear in column 1 or in a column named "taxon" or "species", and trait values are to appear in column 2 or in a column named "trait".tipdata
: dictionary taxon => state, for a single trait.species
: array of strings, andsequences
: array of sequences, in the order corresponding to the order of species names.
algorithm
The dynamic programming algorithm by Fischer et al. (2015) is used. The function loops over all the displayed subtrees within a blob (biconnected component), so its complexity is of the order of n * m * c^2 * 2^level
where n
is the number of tips, m
the number of traits, c
the number of states, and level
is the level of the network: the maximum number of hybridizations within a blob.
See parsimonyGF
for a different algorithm, slower but extendable to other parsimony criteria.
references
- Fischer, M., van Iersel, L., Kelk, S., Scornavacca, C. (2015). On computing the Maximum Parsimony score of a phylogenetic network. SIAM J. Discrete Math., 29(1):559-585.
PhyloNetworks.parsimonyGF
— FunctionparsimonyGF(net, tip_dictionary, criterion=:softwired)
parsimonyGF(net, species, sequenceData, criterion=:softwired)
Calculate the most parsimonious score of a network given discrete characters at the tips using a general framework (Van Iersel et al. 2018) allowing for various parsimony criteria: softwired (default), hardwired, parental etc. Only softwired is implemented at the moment.
Data can given in one of the following:
tipdata
: data frame for a single trait, in which case the taxon names are to appear in column 1 or in a column named "taxon" or "species", and trait values are to appear in column 2 or in a column named "trait".tipdata
: dictionary taxon => state, for a single trait.species
: array of strings, andsequences
: array of sequences, in the order corresponding to the order of species names.
algorithm
The complexity of the algorithm is exponential in the level of the network, that is, the maximum number of hybridizations in a single blob, or biconnected component (Fischer et al. 2015). The function loops over all the state assignments of the minor parent of each hybrid node within a blob, so its complexity is of the order of n * m * c^2 * c^level
where n
is the number of tips, m
the number of traits and c
the number of states.
See parsimonySoftwired
for a faster algorithm, but solving the softwired criterion only.
references
Leo Van Iersel, Mark Jones, Celine Scornavacca (2017). Improved Maximum Parsimony Models for Phylogenetic Networks, Systematic Biology, (https://doi.org/10.1093/sysbio/syx094).
Fischer, M., van Iersel, L., Kelk, S., Scornavacca, C. (2015). On computing the Maximum Parsimony score of a phylogenetic network. SIAM J. Discrete Math., 29(1):559-585.
Use the recursive helper function parsimonyBottomUpGF!
. Use the fields isChild1
, isExtBadTriangle
to know which nodes are at the root of a blob, and fromBadDiamondI
to know which edges are cut (below the minor parent of each hybrid).
PhyloNetworks.Q
— FunctionQ(model)
Substitution rate matrix for a given substitution model: Q[i,j] is the rate of transitioning from state i to state j.
For a BinaryTraitSubstitutionModel, the rate matrix Q is of the form:
-α α
β -β
PhyloNetworks.randomTrait
— FunctionrandomTrait(model, t, start)
randomTrait!(end, model, t, start)
Simulate traits along one edge of length t. start
must be a vector of integers, each representing the starting value of one trait. The bang version (ending with !) uses the vector end
to store the simulated values.
Examples
julia> m1 = BinaryTraitSubstitutionModel(1.0, 2.0)
Binary Trait Substitution Model:
rate 0→1 α=1.0
rate 1→0 β=2.0
julia> using Random; Random.seed!(13);
julia> randomTrait(m1, 0.2, [1,2,1,2,2])
5-element Vector{Int64}:
1
2
1
1
2
randomTrait(model, net; ntraits=1, keepInternal=true, checkPreorder=true)
Simulate evolution of discrete traits on a rooted evolutionary network based on the supplied evolutionary model. Trait sampling is uniform at the root.
optional arguments:
ntraits
: number of traits to be simulated (default: 1 trait).keepInternal
: if true, export character states at all nodes, including internal nodes. if false, export character states at tips only.
output:
- matrix of character states with one row per trait, one column per node; these states are indices in
model.label
, not the trait labels themselves. - vector of node labels (for tips) or node numbers (for internal nodes) in the same order as columns in the character state matrix
examples
julia> m1 = BinaryTraitSubstitutionModel(1.0, 2.0, ["low","high"]);
julia> net = readTopology("(((A:4.0,(B:1.0)#H1:1.1::0.9):0.5,(C:0.6,#H1:1.0::0.1):1.0):3.0,D:5.0);");
julia> using Random; Random.seed!(95);
julia> trait, lab = randomTrait(m1, net)
([1 2 … 1 1], ["-2", "D", "-3", "-6", "C", "-4", "H1", "B", "A"])
julia> trait
1×9 Matrix{Int64}:
1 2 1 1 2 2 1 1 1
julia> lab
9-element Vector{String}:
"-2"
"D"
"-3"
"-6"
"C"
"-4"
"H1"
"B"
"A"
PhyloNetworks.randomTrait!
— FunctionrandomTrait(model, t, start)
randomTrait!(end, model, t, start)
Simulate traits along one edge of length t. start
must be a vector of integers, each representing the starting value of one trait. The bang version (ending with !) uses the vector end
to store the simulated values.
Examples
julia> m1 = BinaryTraitSubstitutionModel(1.0, 2.0)
Binary Trait Substitution Model:
rate 0→1 α=1.0
rate 1→0 β=2.0
julia> using Random; Random.seed!(13);
julia> randomTrait(m1, 0.2, [1,2,1,2,2])
5-element Vector{Int64}:
1
2
1
1
2
PhyloNetworks.fitdiscrete
— Functionfitdiscrete(net, model, tipdata)
fitdiscrete(net, model, RateVariationAcrossSites, tipdata)
fitdiscrete(net, model, species, traits)
fitdiscrete(net, model, RateVariationAcrossSites, species, traits)
fitdiscrete(net, model, dnadata, dnapatternweights)
fitdiscrete(net, model, RateVariationAcrossSites, dnadata, dnapatternweights)
fitdiscrete(net, modSymbol, species, traits)
fitdiscrete(net, modSymbol, dnadata, dnapatternweights)
Calculate the maximum likelihood (ML) score of a network or tree given one or more discrete characters at the tips. Along each edge, transitions are modelled with a continous time Markov model
, whose parameters are estimated (by maximizing the likelihood). At each hybrid node, the trait is assumed to be inherited from either of the two immediate parents according to the parents' average genetic contributions (inheritance γ). The model ignores incomplete lineage sorting. The algorithm extracts all trees displayed in the network.
Data can given in one of the following:
tipdata
: dictionary taxon => state label, for a single trait.tipdata
: data frame for a single trait, in which case the taxon names are to appear in column 1 or in a column named "taxon" or "species", and trait labels are to appear in column 2 or in a column named "trait". Here, trait labels should be as they appear ingetlabels(model)
.species
: vector of strings, andtraits
: DataFrame of traits, with rows in the order corresponding to the order of species names. Again, trait labels should be as they appear ingetlabels(model)
. All traits are assumed to follow the same model, with same parameters.dnadata
: the first part of the output of readfastatodna, a dataframe of BioSequence DNA sequences, with taxon in column 1 and a column for each site.dnapatternweights
: the second part of the output of readfastatodna, an array of weights, one weights for each of the site columns. The length of the weight is equal to nsites. If using dnapatternweights, must provide dnadata.- RateVariationAcrossSites: model for rate variation (optional)
Optional arguments (default):
optimizeQ
(true): should model rate parameters be fixed, or should they be optimized?optimizeRVAS
(true): should the model optimize the parameters for the variability of rates across sites (α and/or p_invariable)?NLoptMethod
(:LN_COBYLA
, derivative-free) for the optimization algorithm. For other options, see the NLopt.- tolerance values to control when the optimization is stopped:
ftolRel
(1e-12),ftolAbs
(1e-10) on the likelihood, andxtolRel
(1e-10),xtolAbs
(1e-10) on the model parameters. - bounds for the alpha parameter of the Gamma distribution of rates across sites:
alphamin=0.05
,alphamax=50
. verbose
(false): if true, more information is output.
examples:
julia> net = readTopology("(((A:2.0,(B:1.0)#H1:0.1::0.9):1.5,(C:0.6,#H1:1.0::0.1):1.0):0.5,D:2.0);");
julia> m1 = BinaryTraitSubstitutionModel([0.1, 0.1], ["lo", "hi"]);
julia> using DataFrames
julia> dat = DataFrame(species=["C","A","B","D"], trait=["hi","lo","lo","hi"]);
julia> fit1 = fitdiscrete(net, m1, dat)
PhyloNetworks.StatisticalSubstitutionModel:
Binary Trait Substitution Model:
rate lo→hi α=0.27222
rate hi→lo β=0.34981
on a network with 1 reticulations
data:
4 species
1 trait
log-likelihood: -2.7277
julia> tips = Dict("A" => "lo", "B" => "lo", "C" => "hi", "D" => "hi");
julia> fit2 = fitdiscrete(net, m1, tips; xtolRel=1e-16, xtolAbs=1e-16, ftolRel=1e-16)
PhyloNetworks.StatisticalSubstitutionModel:
Binary Trait Substitution Model:
rate lo→hi α=0.27222
rate hi→lo β=0.34981
on a network with 1 reticulations
data:
4 species
1 trait
log-likelihood: -2.7277
Note that a copy of the network is stored in the fitted object, but the internal representation of the network may be different in fit1.net
and in the original network net
:
julia> net = readTopology("(sp1:3.0,(sp2:2.0,(sp3:1.0,sp4:1.0):1.0):1.0);");
julia> using BioSymbols
julia> tips = Dict("sp1" => BioSymbols.DNA_A, "sp2" => BioSymbols.DNA_A, "sp3" => BioSymbols.DNA_G, "sp4" => BioSymbols.DNA_G);
julia> mJC69 = JC69([0.25], false);
julia> fitJC69 = fitdiscrete(net, mJC69, tips)
PhyloNetworks.StatisticalSubstitutionModel:
Jukes and Cantor 69 Substitution Model,
absolute rate version
off-diagonal rates equal to 0.29233/3.
rate matrix Q:
A C G T
A * 0.0974 0.0974 0.0974
C 0.0974 * 0.0974 0.0974
G 0.0974 0.0974 * 0.0974
T 0.0974 0.0974 0.0974 *
on a network with 0 reticulations
data:
4 species
1 trait
log-likelihood: -4.99274
julia> rv = RateVariationAcrossSites(alpha=1.0, ncat=4)
Rate variation across sites: discretized Gamma
alpha: 1.0
categories for Gamma discretization: 4
rates: [0.146, 0.513, 1.071, 2.27]
julia> fitdiscrete(net, mJC69, rv, tips; optimizeQ=false, optimizeRVAS=false)
PhyloNetworks.StatisticalSubstitutionModel:
Jukes and Cantor 69 Substitution Model,
absolute rate version
off-diagonal rates equal to 0.25/3.
rate matrix Q:
A C G T
A * 0.0833 0.0833 0.0833
C 0.0833 * 0.0833 0.0833
G 0.0833 0.0833 * 0.0833
T 0.0833 0.0833 0.0833 *
Rate variation across sites: discretized Gamma
alpha: 1.0
categories for Gamma discretization: 4
rates: [0.146, 0.513, 1.071, 2.27]
on a network with 0 reticulations
data:
4 species
1 trait
log-likelihood: -5.2568
fixit: add option to allow users to specify root prior, using either equal frequencies or stationary frequencies for trait models.
PhyloNetworks.maxParsimonyNet
— FunctionmaxParsimonyNet(T::HybridNetwork, df::DataFrame)
Search for the most parsimonious network (or tree). A level-1 network is assumed. df
should be a data frame containing the species names in column 1, or in a column named species
or taxon
. Trait data are assumed to be in all other columns. The search starts from topology T
, which can be a tree or a network with no more than hmax
hybrid nodes (see optional arguments below for hmax
).
Output:
- estimated network in file
.out
(also in.log
): best network overall and list of networks from each individual run. - if any error occurred, file
.err
provides information (seed) to reproduce the error.
Optional arguments include
- hmax: maximum number of hybridizations allowed (default 1)
- runs: number of starting points for the search (default 10); each starting point is
T
with probabilityprobST
=0.3 or a modification ofT
otherwise (using a NNI move, or a hybrid edge direction change) - Nfail: number of failures (proposed networks with equal or worse score) before the search is aborted. 75 by default: this is quite small, which is okay for a first trial. Larger values are recommended.
- outgroup: outgroup taxon. It can be a taxon name (String) or Node number (Integer). If none provided, or if the outgroup conflicts the starting topology, the function returns an error
- filename: root name for the output files. Default is "mp". If empty (""), files are not created, progress log goes to the screen only (standard out).
- seed: seed to replicate a given search
- criterion: parsimony score could be hardwired, softwired (default) or parental. Currently, only softwired is implemented
References
Leo Van Iersel, Mark Jones, Celine Scornavacca (2017). Improved Maximum Parsimony Models for Phylogenetic Networks, Systematic Biology, (https://doi.org/10.1093/sysbio/syx094).
Fischer, M., van Iersel, L., Kelk, S., Scornavacca, C. (2015). On computing the Maximum Parsimony score of a phylogenetic network. SIAM J. Discrete Math., 29(1):559-585.
For a roadmap of the functions inside maxParsimonyNet, see maxParsimonyNetRun1!
.
PhyloNetworks.getlabels
— Functiongetlabels(model)
State labels of a substitution model.
for a given NucleicAcidSubstitutionModel
, labels are symbols from BioSymbols. For now, only ACGTs are allowed. (When fitting data, any ambiguity code in the data would be treated as missing value).
examples
julia> getlabels(JC69([0.03], false))
4-element Vector{BioSymbols.DNA}:
DNA_A
DNA_C
DNA_G
DNA_T
julia> getlabels(HKY85([.5], repeat([0.25], 4)))
4-element Vector{BioSymbols.DNA}:
DNA_A
DNA_C
DNA_G
DNA_T
PhyloNetworks.nstates
— Functionnstates(model)
Number of character states for a given evolution model.
For example, this is 4 for a NucleicAcidSubstitutionModel
.
julia> nstates(JC69([0.03], false))
4
julia> nstates(HKY85([.5], [0.25, 0.25, 0.25, 0.25]))
4
for a BinaryTraitSubstitutionModel
, this is 2:
julia> m1 = BinaryTraitSubstitutionModel([1.0,2.0], ["low","high"])
Binary Trait Substitution Model:
rate low→high α=1.0
rate high→low β=2.0
julia> nstates(m1)
2
PhyloNetworks.nparams
— Functionnparams(model)
Number of parameters for a given trait evolution model (length of field model.rate
).
for JC69
model: 0 if relative, 1 if absolute
for HKY85
model: 1 if relative, 2 if absolute
PhyloNetworks.stationary
— Functionstationary(substitutionmodel)
Stationary distribution of a Markov model
PhyloNetworks.empiricalDNAfrequencies
— FunctionempiricalDNAfrequencies(DNAdata::AbstractDataFrame, DNAweights,
correction=true, useambiguous=true)
Estimate base frequencies in DNA data DNAdata
, ordered ACGT.
DNAdata
: data frame. All columns are used. If the first column gives species names, find a way to ignore it before calculating empirical frequencies, e.g.empiricalDNAfrequencies(view(DNAdata, :, 2:size(DNAdata, 2)))
. Data type must beBioSymbols.DNA
orChar
orString
. WARNING: this is checked on the first column only.DNAweights
: vector of weights, to weigh each column inDNAdata
.correction
: iftrue
, add 1 to each count and 4 to the denominator for a more stable estimator, similar to Bayes prior of 1/4 and the Agresti-Coull interval in binomial estimation.useambiguous
: iftrue
, ambiguous bases are used (except gaps and Ns). For example,Y
adds 0.5 weight toC
and 0.5 weight toT
.