fitting DNA substitution parameters on a network
The methods below model each DNA site as a trait, assuming that sites are unlinked, that is, they evolve independently of each other. In other words, this is a "concatenation" approach where sites from the same locus do not share information about their evolutionary path. This is appropriate if recombination is assumed to have occurred within genes.
reading in an alignment
Like for trait evolution, fitdiscrete
can be used. It can be given data in a variety of ways. For DNA, this is one way, using PhyloNetworks.readfastatodna
:
# read in network
dna_net = readnewick("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);");
# read in alignment in FASTA format
fastafile = joinpath(dirname(pathof(PhyloTraits)), "..","examples","Ae_bicornis_Tr406_Contig10132.aln");
dna_dat, dna_weights = readfastatodna(fastafile, true);
julia> dna_dat
22×210 DataFrame Row │ taxon x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ⋯ │ String DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ Ae_uniaristata_Tr357 - - - - - - - - - - ⋯ 2 │ Ae_mutica_Tr237 - - - - - - - - - - 3 │ Ae_tauschii_Tr352 - - - - - - - - - - 4 │ Ae_searsii_Tr165 T C A G G A A C G C 5 │ Ae_sharonensis_Tr265 - - - - - - - - - - ⋯ 6 │ Ae_bicornis_Tr407 T C A G G A A C G C 7 │ Ae_longissima_Tr241 T C A G G A A C G C 8 │ Ae_bicornis_Tr406 T C A G G A A C G C ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ 16 │ Ae_comosa_Tr271 - - - - - - - - - C ⋯ 17 │ Ae_searsii_Tr164 T C A G G A A C G C 18 │ Ae_speltoides_Tr251 - - - - - - - - - - 19 │ Ae_bicornis_Tr408 - - - - - - - - - - 20 │ Ae_tauschii_Tr351 T C A G G G A C A C ⋯ 21 │ Ae_longissima_Tr355 T C A G A A A C G C 22 │ Ae_caudata_Tr139 - - - - - - - - - - 199 columns and 7 rows omitted
julia> dna_weights
209-element Vector{Float64}: 23.0 18.0 13.0 16.0 1.0 1.0 1.0 1.0 1.0 9.0 ⋮ 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 3.0
Here, dna_dat
is a single data frame containing both species names and trait data (site patterns). The alignment was summarized by listing each observed site pattern only once in dna_dat
. dna_weights
is a vector of weights, containing the number of times that each site pattern was observed.
substitution models for DNA
DNA-specific substitution models have 4 states: the 4 nucleotides from BioSymbols (listed here). Each model has a relative and an absolute version.
:JC69
Jukes & Cantor 1969 model: one single rate for all transitions. The relative version has values -1 along the diagonal of the rate matrix (1 expected transition / unit of time). The absolute version has an extra parameter to scale the rate matrix.:HKY85
Hasegawa, Kishino & Yano 1985: treats transitions differently from transversions. The relative is scaled to predict an average of 1 transition / unit of time.
We may allow for rate variation across sites using the :RV
option.
likelihood of a fixed network
In the examples below, none of the rate parameters are optimized, so we get to see the default starting values.
julia> d1 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :Gamma; optimizeQ=false, optimizeRVAS=false)
┌ Warning: the network contains taxa with no data: those will be pruned └ @ PhyloTraits ~/work/PhyloTraits.jl/PhyloTraits.jl/src/fit_discrete.jl:984 PhyloTraits.StatisticalSubstitutionModel: Jukes and Cantor 69 Substitution Model, relative rate version off-diagonal rates equal to 1/3 rate matrix Q: A C G T A * 0.3333 0.3333 0.3333 C 0.3333 * 0.3333 0.3333 G 0.3333 0.3333 * 0.3333 T 0.3333 0.3333 0.3333 * Rate variation across sites: discretized Gamma alpha: 1.0 categories for Gamma discretization: 4 rates: [0.146, 0.513, 1.071, 2.27] on a network with 3 reticulations data: 22 species 1362 sites 209 distinct patterns log-likelihood: -14442.27642
julia> d2 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :Gamma; optimizeQ=false, optimizeRVAS=false, suppresswarnings=true)
PhyloTraits.StatisticalSubstitutionModel: HKY85 Substitution Model base frequencies: [0.29100263035741913, 0.2682964567538295, 0.23638403218319665, 0.2043168807055547] relative rate version with transition/tranversion ratio kappa = 1.0, scaled so that there is one substitution per unit time rate matrix Q: A C G T A * 0.3598 0.3170 0.2740 C 0.3902 * 0.3170 0.2740 G 0.3902 0.3598 * 0.2740 T 0.3902 0.3598 0.3170 * Rate variation across sites: discretized Gamma alpha: 1.0 categories for Gamma discretization: 4 rates: [0.146, 0.513, 1.071, 2.27] on a network with 3 reticulations data: 22 species 1362 sites 209 distinct patterns log-likelihood: -14420.83584
When allowing for rate variation across sites with gamma-distributed rates, the default shape parameter α is 1.
In the more interesting examples below, we optimize the evolutionary rates and the way rates vary across sites (which is the default).
julia> d3 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :Gamma; ftolAbs=0.1, xtolAbs=0.01, suppresswarnings=true)
PhyloTraits.StatisticalSubstitutionModel: Jukes and Cantor 69 Substitution Model, relative rate version off-diagonal rates equal to 1/3 rate matrix Q: A C G T A * 0.3333 0.3333 0.3333 C 0.3333 * 0.3333 0.3333 G 0.3333 0.3333 * 0.3333 T 0.3333 0.3333 0.3333 * Rate variation across sites: discretized Gamma alpha: 0.05828 categories for Gamma discretization: 4 rates: [0.0, 0.0, 0.012, 3.988] on a network with 3 reticulations data: 22 species 1362 sites 209 distinct patterns log-likelihood: -3717.89988
Lenient tolerance parameters ftolAbs
etc. have been chosen here to make this example faster. Note that the fitted object contains a separate version of the input network, where any taxon without data has been pruned, and where branch numbers may have been modified.
julia> d3.net
PhyloNetworks.HybridNetwork, Rooted Network 53 edges 51 nodes: 22 tips, 3 hybrid nodes, 26 internal tree nodes. tip labels: Ae_caudata_Tr275, Ae_caudata_Tr139, Ae_longissima_Tr241, Ae_longissima_Tr242, ... (H_vulgare_HVens23:1.0,((((Ae_speltoides_Tr251:2.0):1.0,#H4:1.0::0.4):1.0,((((((Ae_caudata_Tr139:1.0,Ae_caudata_Tr275:2.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,Ae_sharonensis_Tr265:2.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,(Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):2.0):1.0)#H2:1.0::0.6):1.0,#H1:2.0::0.4):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):2.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,Ae_tauschii_Tr125:2.0):1.0,(#H2:1.0::0.4,(Ae_mutica_Tr237:4.0)#H4:1.0::0.6):1.0):1.0):2.0):1.0):4.0);