fitting DNA substitution parameters on a network

The methods below model each DNA site as a trait, assuming that sites are unlinked, that is, they evolve independently of each other. In other words, this is a "concatenation" approach where sites from the same locus do not share information about their evolutionary path. This is appropriate if recombination is assumed to have occurred within genes.

reading in an alignment

Like for trait evolution, fitdiscrete can be used. It can be given data in a variety of ways. For DNA, this is one way, using PhyloNetworks.readfastatodna:

# read in network
dna_net = readnewick("((((((((((((((Ae_caudata_Tr275:1.0,Ae_caudata_Tr276:1.0):1.0,Ae_caudata_Tr139:1.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,(Ae_sharonensis_Tr265:1.0,Ae_sharonensis_Tr264:1.0):1.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,((Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):1.0,Ae_searsii_Tr161:1.0):1.0):1.0)#H2:1.0::0.6):1.0,(((Ae_umbellulata_Tr266:1.0,Ae_umbellulata_Tr257:1.0):1.0,Ae_umbellulata_Tr268:1.0):1.0,#H1:1.0::0.4):1.0):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,(((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):1.0,Ae_uniaristata_Tr404:1.0):1.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,(Ae_tauschii_Tr180:1.0,Ae_tauschii_Tr125:1.0):1.0):1.0,(#H2:1.0::0.4,((((Ae_mutica_Tr237:1.0,Ae_mutica_Tr329:1.0):1.0,Ae_mutica_Tr244:1.0):1.0,Ae_mutica_Tr332:1.0):1.0)#H4:1.0::0.6):1.0):1.0):1.0,(((T_boeoticum_TS8:1.0,(T_boeoticum_TS10:1.0,T_boeoticum_TS3:1.0):1.0):1.0,T_boeoticum_TS4:1.0):1.0,((T_urartu_Tr315:1.0,T_urartu_Tr232:1.0):1.0,(T_urartu_Tr317:1.0,T_urartu_Tr309:1.0):1.0):1.0):1.0):1.0,(((((Ae_speltoides_Tr320:1.0,Ae_speltoides_Tr323:1.0):1.0,Ae_speltoides_Tr223:1.0):1.0,Ae_speltoides_Tr251:1.0):1.0):1.0,#H4:1.0::0.4):1.0):1.0):1.0,Ta_caputMedusae_TB2:1.0):1.0,S_vavilovii_Tr279:1.0):1.0,Er_bonaepartis_TB1:1.0):1.0,H_vulgare_HVens23:1.0);");
# read in alignment in FASTA format
fastafile = joinpath(dirname(pathof(PhyloTraits)), "..","examples","Ae_bicornis_Tr406_Contig10132.aln");
dna_dat, dna_weights = readfastatodna(fastafile, true);
julia> dna_dat22×210 DataFrame
 Row  taxon                 x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  ⋯
     │ String                DNA  DNA  DNA  DNA  DNA  DNA  DNA  DNA  DNA  DNA  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Ae_uniaristata_Tr357  -    -    -    -    -    -    -    -    -    -    ⋯
   2 │ Ae_mutica_Tr237       -    -    -    -    -    -    -    -    -    -
   3 │ Ae_tauschii_Tr352     -    -    -    -    -    -    -    -    -    -
   4 │ Ae_searsii_Tr165      T    C    A    G    G    A    A    C    G    C
   5 │ Ae_sharonensis_Tr265  -    -    -    -    -    -    -    -    -    -    ⋯
   6 │ Ae_bicornis_Tr407     T    C    A    G    G    A    A    C    G    C
   7 │ Ae_longissima_Tr241   T    C    A    G    G    A    A    C    G    C
   8 │ Ae_bicornis_Tr406     T    C    A    G    G    A    A    C    G    C
  ⋮  │          ⋮             ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋮   ⋱
  16 │ Ae_comosa_Tr271       -    -    -    -    -    -    -    -    -    C    ⋯
  17 │ Ae_searsii_Tr164      T    C    A    G    G    A    A    C    G    C
  18 │ Ae_speltoides_Tr251   -    -    -    -    -    -    -    -    -    -
  19 │ Ae_bicornis_Tr408     -    -    -    -    -    -    -    -    -    -
  20 │ Ae_tauschii_Tr351     T    C    A    G    G    G    A    C    A    C    ⋯
  21 │ Ae_longissima_Tr355   T    C    A    G    A    A    A    C    G    C
  22 │ Ae_caudata_Tr139      -    -    -    -    -    -    -    -    -    -
                                                  199 columns and 7 rows omitted
julia> dna_weights209-element Vector{Float64}: 23.0 18.0 13.0 16.0 1.0 1.0 1.0 1.0 1.0 9.0 ⋮ 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 3.0

Here, dna_dat is a single data frame containing both species names and trait data (site patterns). The alignment was summarized by listing each observed site pattern only once in dna_dat. dna_weights is a vector of weights, containing the number of times that each site pattern was observed.

substitution models for DNA

DNA-specific substitution models have 4 states: the 4 nucleotides from BioSymbols (listed here). Each model has a relative and an absolute version.

  • :JC69 Jukes & Cantor 1969 model: one single rate for all transitions. The relative version has values -1 along the diagonal of the rate matrix (1 expected transition / unit of time). The absolute version has an extra parameter to scale the rate matrix.
  • :HKY85 Hasegawa, Kishino & Yano 1985: treats transitions differently from transversions. The relative is scaled to predict an average of 1 transition / unit of time.

We may allow for rate variation across sites using the :RV option.

likelihood of a fixed network

In the examples below, none of the rate parameters are optimized, so we get to see the default starting values.

julia> d1 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :Gamma; optimizeQ=false, optimizeRVAS=false)┌ Warning: the network contains taxa with no data: those will be pruned
@ PhyloTraits ~/work/PhyloTraits.jl/PhyloTraits.jl/src/fit_discrete.jl:984
PhyloTraits.StatisticalSubstitutionModel:
Jukes and Cantor 69 Substitution Model,
  relative rate version
  off-diagonal rates equal to 1/3
  rate matrix Q:
                 A       C       G       T
         A       *  0.3333  0.3333  0.3333
         C  0.3333       *  0.3333  0.3333
         G  0.3333  0.3333       *  0.3333
         T  0.3333  0.3333  0.3333       *
Rate variation across sites: discretized Gamma
  alpha: 1.0
  categories for Gamma discretization: 4
  rates: [0.146, 0.513, 1.071, 2.27]
on a network with 3 reticulations
data:
  22 species
  1362 sites
  209 distinct patterns
log-likelihood: -14442.27642
julia> d2 = fitdiscrete(dna_net, :HKY85, dna_dat, dna_weights, :Gamma; optimizeQ=false, optimizeRVAS=false, suppresswarnings=true)PhyloTraits.StatisticalSubstitutionModel: HKY85 Substitution Model base frequencies: [0.29100263035741913, 0.2682964567538295, 0.23638403218319665, 0.2043168807055547] relative rate version with transition/tranversion ratio kappa = 1.0, scaled so that there is one substitution per unit time rate matrix Q: A C G T A * 0.3598 0.3170 0.2740 C 0.3902 * 0.3170 0.2740 G 0.3902 0.3598 * 0.2740 T 0.3902 0.3598 0.3170 * Rate variation across sites: discretized Gamma alpha: 1.0 categories for Gamma discretization: 4 rates: [0.146, 0.513, 1.071, 2.27] on a network with 3 reticulations data: 22 species 1362 sites 209 distinct patterns log-likelihood: -14420.83584

When allowing for rate variation across sites with gamma-distributed rates, the default shape parameter α is 1.

In the more interesting examples below, we optimize the evolutionary rates and the way rates vary across sites (which is the default).

julia> d3 = fitdiscrete(dna_net, :JC69, dna_dat, dna_weights, :Gamma; ftolAbs=0.1, xtolAbs=0.01, suppresswarnings=true)PhyloTraits.StatisticalSubstitutionModel:
Jukes and Cantor 69 Substitution Model,
  relative rate version
  off-diagonal rates equal to 1/3
  rate matrix Q:
                 A       C       G       T
         A       *  0.3333  0.3333  0.3333
         C  0.3333       *  0.3333  0.3333
         G  0.3333  0.3333       *  0.3333
         T  0.3333  0.3333  0.3333       *
Rate variation across sites: discretized Gamma
  alpha: 0.05828
  categories for Gamma discretization: 4
  rates: [0.0, 0.0, 0.012, 3.988]
on a network with 3 reticulations
data:
  22 species
  1362 sites
  209 distinct patterns
log-likelihood: -3717.89988

Lenient tolerance parameters ftolAbs etc. have been chosen here to make this example faster. Note that the fitted object contains a separate version of the input network, where any taxon without data has been pruned, and where branch numbers may have been modified.

julia> d3.netPhyloNetworks.HybridNetwork, Rooted Network
53 edges
51 nodes: 22 tips, 3 hybrid nodes, 26 internal tree nodes.
tip labels: Ae_caudata_Tr275, Ae_caudata_Tr139, Ae_longissima_Tr241, Ae_longissima_Tr242, ...
(H_vulgare_HVens23:1.0,((((Ae_speltoides_Tr251:2.0):1.0,#H4:1.0::0.4):1.0,((((((Ae_caudata_Tr139:1.0,Ae_caudata_Tr275:2.0):1.0)#H1:1.0::0.6,((((((Ae_longissima_Tr241:1.0,Ae_longissima_Tr242:1.0):1.0,Ae_longissima_Tr355:1.0):1.0,Ae_sharonensis_Tr265:2.0):1.0,((Ae_bicornis_Tr408:1.0,Ae_bicornis_Tr407:1.0):1.0,Ae_bicornis_Tr406:1.0):1.0):1.0,(Ae_searsii_Tr164:1.0,Ae_searsii_Tr165:1.0):2.0):1.0)#H2:1.0::0.6):1.0,#H1:2.0::0.4):1.0,((Ae_comosa_Tr271:1.0,Ae_comosa_Tr272:1.0):1.0,((Ae_uniaristata_Tr403:1.0,Ae_uniaristata_Tr357:1.0):1.0,Ae_uniaristata_Tr402:1.0):2.0):1.0):1.0,(((Ae_tauschii_Tr352:1.0,Ae_tauschii_Tr351:1.0):1.0,Ae_tauschii_Tr125:2.0):1.0,(#H2:1.0::0.4,(Ae_mutica_Tr237:4.0)#H4:1.0::0.6):1.0):1.0):2.0):1.0):4.0);