Network estimation

SNaQ (v1.0) implements the statistical inference method in Solís-Lemus & Ané 2016. The procedure involves a numerical optimization of branch lengths and inheritance probabilities and a heuristic search in the space of phylogenetic networks.

We suggest that you create a special directory for running these examples, where input files can be downloaded and where output files will be created (with estimated networks for instance). Enter this directory and run Julia from there.

This documentation pertains to SNaQ v1.0 as originally described in Solís-Lemus & Ané (2016)

Inputs for SNaQ

SNaQ uses has two main inputs for estimating phylogenetic networks: concordance factors (CFs) and a starting tree (or network).

Concordance factors

Concordance factors denote frequency of each quartet topology present among the gene trees, which can be estimated using MrBayes or RAxML for example.

PhyloUtilities has a step-by-step tutorial to go from multiple sequence alignments to a table of quartet gene frequencies (concordance factors: CFs), through BUCKy (to integrate out gene tree uncertainty) or through RAxML (see also the snaq tutorial). This pipeline uses sequence alignments in MrBayes to estimate gene trees, then BUCKy to produce the table of estimated CFs and their credibility intervals. Both steps in this pipeline are parallelized.

input quartet CFs: BUCKy versus gene trees

When quartet CFs are estimated using BUCKy on each subset of 4 taxa, then error in gene trees is accounted for: BUCKy aims to estimate true gene tree discordance, beyond discordance that is due to uncertainty in estimated gene trees. This is achieved in a Bayesian framework, using a full posterior sample of trees for each individual gene. So this method of obtaining quartet CFs

is not what is sometimes referred to as a "summary" method
is expected to be robust to gene tree estimation error.

When quartet CFs are calculated based on a single tree per gene, then gene tree error is not accounted for. Estimation error in gene trees can cause gene tree discordance. Downstream analysis with SNaQ would need to invoke deep coalescence or introgression to account for this apparent discordance. So this method

may be called a "summary" method
may be sensitive to gene tree estimation error.

CFs from gene trees

Suppose you have a file with a list of gene trees in parenthetical format called raxmltrees.tre. You can access the example file of input trees here (or here for easier download).

Do not copy-paste into a "smart" text-editor. Instead, save the file directly into your working directory using "save link as" or "download linked file as". This file contains 30 gene trees, each in parenthetical format on 6 taxa like this (with rounded branch lengths):

(E:0.038,((A:0.014,B:0.010):0.010,(C:0.008,D:0.002):0.010):0.025,O:0.078);

If raxmltrees.tre is in your working directory, you can read in all gene trees and directly summarize them by a list of quartet CFs (proportion of input trees with a given quartet):

julia> using PhyloNetworks,SNaQ
julia> raxmltrees = joinpath(dirname(pathof(SNaQ)), "..","examples","raxmltrees.tre");
julia> raxmlCF = readtrees2CF(raxmltrees) # read in the file and produce a "DataCF" objectwill use all quartets on 6 taxa
calculating obsCF from 30 gene trees and for 15 quartets
Reading in quartets...
0+---------------+100%  
  ***************  
table of obsCF printed to file tableCF.txt
descriptive stat of input data printed to file summaryTreesQuartets.txt
Object DataCF
number of quartets: 15
number of trees: 30

In this table (tableCF.txt), each 4-taxon set is listed in one row. The 3 "CF" columns gives the proportion of genes that has each of the 3 possible trees on these 4 taxa.

When there are many more taxa, the number of quartets might be very large and we might want to use a subset to speed things up. Here, if we wanted to use a random sample of 10 quartets instead of all quartets, we could do:

raxmlCF = readtrees2CF(raxmltrees, whichQ="rand", numQ=10, CFfile="tableCF10.txt")

Be careful to use a numQ value smaller than the total number of possible 4-taxon subsets, which is n choose 4 on n taxa (e.g. 15 on 6 taxa). To get a predictable random sample, you may set the seed with using Random; Random.seed!(12321) (for instance) prior to sampling the quartets as above. Additionally, providing a file name for the optional argument CFile saves the quartet CFs to file for later use.

CFs from large datasets

When we want to get all quartet CFs, the readtrees2CF is much slower than the PhyloNetworks function PhyloNetworks.countquartetsintrees to read in trees and calculate the quartet CFs observed in the trees:

julia> trees = readmultinewick(raxmltrees);
julia> q,t = countquartetsintrees(trees);Reading in trees, looking at 15 quartets in each...
0+------------------------------+100%
  ******************************
julia> nt = tablequartetCF(q,t); # named tuple
julia> using DataFrames
julia> df = DataFrame(nt, copycols=false); # convert to a data frame, without copying the column data
julia> show(df, allcols=true) # data frames are displayed much more nicely than named tuples15×9 DataFrame
 Row │ qind   t1      t2      t3      t4      CF12_34   CF13_24    CF14_23    ngenes  
     │ Int64  String  String  String  String  Float64   Float64    Float64    Float64 
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │     1  A       B       C       D       1.0       0.0        0.0           30.0
   2 │     2  A       B       C       E       0.833333  0.0333333  0.133333      30.0
   3 │     3  A       B       D       E       0.833333  0.0333333  0.133333      30.0
   4 │     4  A       C       D       E       0.0       0.0        1.0           30.0
   5 │     5  B       C       D       E       0.0       0.0        1.0           30.0
   6 │     6  A       B       C       O       0.866667  0.0        0.133333      30.0
   7 │     7  A       B       D       O       0.866667  0.0        0.133333      30.0
   8 │     8  A       C       D       O       0.0       0.0        1.0           30.0
   9 │     9  B       C       D       O       0.0       0.0        1.0           30.0
  10 │    10  A       B       E       O       0.833333  0.0666667  0.1           30.0
  11 │    11  A       C       E       O       0.533333  0.333333   0.133333      30.0
  12 │    12  B       C       E       O       0.666667  0.266667   0.0666667     30.0
  13 │    13  A       D       E       O       0.533333  0.333333   0.133333      30.0
  14 │    14  B       D       E       O       0.666667  0.266667   0.0666667     30.0
  15 │    15  C       D       E       O       1.0       0.0        0.0           30.0

Reading CFs directly from file

If we already have a table of quartet concordance factors (CFs) saved as a table in this format

Taxon1	Taxon2	Taxon3	Taxon4	CF12_34	CF13_24	CF14_23
D	A	E	O	0.565	0.0903	0.3447
...						...

then we could read it in one step using the readtableCF function.

Concordance factors (CF), i.e. gene tree frequencies, for each 4-taxon subset can be obtained from BUCKy to account for gene tree uncertainty.

An example file comes with the package, available here or here.

julia> buckyCFfile = joinpath(dirname(pathof(SNaQ)), "..","examples","buckyCF.csv");
julia> buckyCF = readtableCF(buckyCFfile)30.0 gene trees per 4-taxon set
Object DataCF
number of quartets: 15

The same thing could be done in 2 steps: first to read the file and convert it to a DataFrame object from the DataFrames package and then to convert this DataFrame into a DataCF object.

julia> using CSV, DataFrames
julia> dat = CSV.read(buckyCFfile, DataFrame);
julia> first(dat, 6) # to see the first 6 rows6×14 DataFrame
 Row │ taxon1   taxon2   taxon3   taxon4   CF12.34     CF12.34_lo  CF12.34_hi  ⋯
     │ String1  String1  String1  String1  Float64     Float64     Float64     ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ D        A        E        O        0.565033           0.5    0.633333  ⋯
   2 │ D        A        E        B        0.0005             0.0    0.0
   3 │ C        A        E        B        0.0005             0.0    0.0
   4 │ C        A        E        O        0.565033           0.5    0.633333
   5 │ D        A        O        B        3.33333e-5         0.0    0.0       ⋯
   6 │ C        D        E        O        0.999867           1.0    1.0
                                                               7 columns omitted
julia> buckyCF = readtableCF(dat)30.0 gene trees per 4-taxon set
Object DataCF
number of quartets: 15
julia> tablequartetCF(buckyCF)(t1 = ["D", "D", "C", "C", "D", "C", "C", "C", "C", "C", "D", "C", "C", "A", "C"], t2 = ["A", "A", "A", "A", "A", "D", "E", "D", "D", "D", "E", "D", "D", "E", "A"], t3 = ["E", "E", "E", "E", "O", "E", "O", "A", "O", "E", "O", "A", "A", "O", "O"], t4 = ["O", "B", "B", "O", "B", "O", "B", "E", "B", "B", "B", "O", "B", "B", "B"], CF12_34 = [0.565033333333333, 0.0005, 0.0005, 0.565033333333333, 3.33333333333333e-5, 0.999866666666667, 0.0401666666666667, 1.0, 0.999866666666667, 1.0, 0.0401666666666667, 0.999866666666667, 1.0, 0.0731666666666667, 3.33333333333333e-5], CF13_24 = [0.0903, 0.8599, 0.8599, 0.0903, 0.8885, 6.66666666666667e-5, 0.263066666666667, 0.0, 6.66666666666667e-5, 0.0, 0.263066666666667, 6.66666666666667e-5, 0.0, 0.0424666666666667, 0.8885], CF14_23 = [0.344666666666667, 0.139566666666667, 0.139566666666667, 0.344666666666667, 0.111466666666667, 6.66666666666667e-5, 0.696766666666667, 0.0, 6.66666666666667e-5, 0.0, 0.696766666666667, 6.66666666666667e-5, 0.0, 0.884366666666667, 0.111466666666667], ngenes = Union{Missing, Float64}[30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0])

In the input file, columns need to be in the right order: with the first 4 columns giving the names of the taxa in each 4-taxon set. The CF values are assumed to be in columns named "CF12_34", etc., or else in columns 5,6,7. If available, a column named "ngenes" will be taken to have the the number of genes for each 4-taxon subset.

Starting tree

The other input for SNaQ is a starting tree (or network) to be used as a starting point in optimization.

If we have a tree for the data set at hand, it can be used as a starting point for the optimization. From our gene trees, we estimated a species tree with ASTRAL. This tree comes with the package in file astral.tre here. This file has 102 trees: 100 bootstrap species trees, followed by their greedy consensus, followed by the best tree on the original data. It's this last tree that we are most interested in. We can read it with

astralfile = joinpath(dirname(pathof(SNaQ)), "..","examples","astral.tre");
astraltree = readmultinewick(astralfile)[102] # 102th tree: last tree here

HybridNetwork, Rooted Network
10 edges
11 nodes: 6 tips, 0 hybrid nodes, 5 internal tree nodes.
tip labels: E, O, C, D, ...
(E,(O,((C,D)100.0:3.029,(B,A)100.0:1.324)100.0:0.464));

Instead of a starting tree (astraltree in this case), we can start the optimization in SNaQ with a network, but this network needs to be "level-1". Note that all trees and all networks with 1 hybridization are of level 1. To make sure that a network with 2 or more hybridizations is of level 1, we can read it in with readnewicklevel1 (which also unroots the tree, resolves polytomies, replaces missing branch lengths by 1 for starting values etc.):

T=readnewicklevel1("startNetwork.txt")

Here startNetwork.txt is a hypothetical file: replace this by the name of a file that contains your network of interest.

Estimating a network

After we have Inputs for SNaQ, we can estimate the network using the input data raxmlCF and starting from tree (or network) astraltree. We first set hmax=0 to impose the constraint of at most 0 hybrid node, that is, we estimate a tree.

net0 = snaq!(astraltree,raxmlCF, hmax=0, filename="net0", seed=1234)

Part of the screen output shows this:

MaxNet is (C,D,((B,A):1.395762055180493,(O,E):0.48453400554506426):10.0);
with -loglik 53.53150526187732

This parenthetical (extended Newick) description is not very human-friendly, so we use PhyloPlots to plot the tree (more about plotting networks can be found in the PhyloNetworks guide Network Visualization and in PhyloPlots documentation).

using PhyloPlots
plot(net0);

net0_1

We can use this tree as a starting point to search for the best network allowing for at most hmax=1 hybridizations (which is the default if we do not specify a hmax value).

net1 = snaq!(net0, raxmlCF, hmax=1, filename="net1", seed=2345)

part of screen output:

best network and networks with different hybrid/gene flow directions printed to .networks file
MaxNet is (C,D,((O,(E,#H7:::0.19558838614943078):0.31352437658618976):0.6640664399202987,(B,(A)#H7:::0.8044116138505693):10.0):10.0);
with -loglik 28.31506721890958

We can visualize the estimated network and its inheritance values γ, which measure the proportion of genes inherited via each parent at a reticulation event (e.g. proportion of genes inherited via gene flow).

plot(net1, showgamma=true);

net1_1

This network has A as a hybrid, 80.4% sister to B, and 19.6% sister to E (which is otherwise sister to O). C & D are sister to each other.

Note that SNaQ infers an unrooted semi-directed network; the lack of rooting is depicted visually with a trifurcationon on the leftmost side of the plot. The direction of hybrid edges can be inferred, but the direction of tree edges cannot be inferred. To obtain a representative visualization, it is best to root the network first, using one or more outgroup. PhyloNetworks has a guide on Re-rooting trees and networks for this. If your outgroup conflicts with the direction of reticulations in the estimated network, see the section Candidate networks compatible with a known outgroup.

We can also check the output files created by snaq!:

less("net1.err") # would provide info about errors, if any
less("net1.out") # main output file with the estimated network from each run
less("net1.networks") # extra info

when viewing these result files with less within Julia, use arrows to scroll down and type q to quit viewing the files.

The .networks file contains a list of networks that are slight modifications of the best (estimated) network net1. The modifications change the direction of one reticulation at a time, by moving the placement of one hybrid node to another node inside the same cycle. For each modified network, the pseudolikelihood score was calculated (the loglik or -Ploglik values give a pseudo deviance actually).
The .out file contains the best network among all runs, and the best network per run, includes also the pseudolikelihood score and the computation time.
The .log file contains a description of each run, convergence criterion, and seed information.
The .err file has seed information on runs that failed, empty when nothing failed.

The function name snaq! ends with ! because it modifies the argument raxmlCF by including the expected CF. Type ? then snaq! to get help on that function.

The main output file, here net1.out (or snaq.out by default) has the estimated network in parenthetical format, but we can also print it directly to the screen:

julia> net1HybridNetwork, Rooted Network
12 edges
12 nodes: 6 tips, 1 hybrid nodes, 5 internal tree nodes.
tip labels: C, D, O, E, ...
(C,D,((O,(E,#H7:::0.196):0.314):0.664,((A)#H7:::0.804,B):10.0):10.0);
julia> writenewick(net1)  # writes to screen, full precision for branch lengths and γ"(C,D,((O,(E,#H7:::0.19558838614943078):0.31352437658618976):0.6640664399202987,((A)#H7:::0.8044116138505693,B):10.0):10.0);"
julia> writenewick(net1, round=true, digits=2)"(C,D,((O,(E,#H7:::0.2):0.31):0.66,((A)#H7:::0.8,B):10.0):10.0);"
julia> writenewick(net1, di=true) # γ omitted: for dendroscope"(C,D,((O,(E,#H7):0.31352437658618976):0.6640664399202987,((A)#H7,B):10.0):10.0);"
julia> writenewick(net1, "bestnet_h1.tre") # writes to file: creates or overwrites file

From a set of candidate networks, one might simply need to score of each network to pick the best. Here, the score is the negative log pseudo-deviance, and the lower the better. See the section to get the score of Candidate networks.

Choosing the number of hybridizations (`hmax`)

We change the hmax argument to change the search space to let the network have up to 2 or 3 hybrid nodes:

net2 = snaq!(net1,raxmlCF, hmax=2, filename="net2", seed=3456)
net3 = snaq!(net0,raxmlCF, hmax=3, filename="net3", seed=4567)

and plot them (the optimized networks are identical and they both have a single reticulation):

using RCall                  # to be able to tweak our plot within R
R"layout(matrix(1:2, 1, 2))" # to get 2 plots into a single figure: 1 row, 2 columns
R"par"(mar=[0,0,1,0])        # for smaller margins
plot(net2, showgamma=true);
R"mtext"("hmax=2")           # add text annotation: title here
plot(net3, showgamma=true);
R"mtext"("hmax=3")

net23

with this screen output for net2 (only 1 hybrid node found):

MaxNet is (C,D,((B,(A)#H7:::0.804411606649347):10.0,(O,(#H7:::0.19558839335065303,E):0.3135243143217013):0.664066456871298):10.0);
with -loglik 28.31506721890957

and this output for net3 (again, only 1 hybrid found):

MaxNet is (D,C,((O,(E,#H7:::0.19558839257941849):0.3135243301652981):0.6640664138384673,(B,(A)#H7:::0.8044116074205815):10.0):10.0);
with -loglik 28.315067218909626

Each network has a loglik attribute, which is its log pseudo deviance: a multiple of the negative log-likelihood up to a constant (the constant is such that the score is 0 if the network fits the data perfectly). The lower the better. We can plot these scores across hybrid values:

scores = [loglik(net0), loglik(net1), loglik(net2), loglik(net3)]
hmax = collect(0:3)
R"plot"(hmax, scores, type="b", ylab="network score", xlab="hmax", col="blue");

scores_heuristic

Here the slope heuristic suggests a single hybrid node: the score does not get much better beyond h=1.

We made the plot via R above. A more Julian way would use a Julia plotting package such as Plots, Makie or packages with similar principles to ggplot2 in R: AlgebraOfGraphics (based on Makie) or Gadfly. Also see this a cool (but now somewhat old) blog about using ggplot within julia.

Note that since SNaQ assumes level-1 networks (i.e., no intersecting cycles), it might not be possible to add more hybridizations to networks with few taxa:

level1

Suggestions and best practices

For big datasets, do only one run (i.e., setting runs=1) first to get an idea of computing time:
```
net1 = snaq!(net0,raxmlCF, hmax=3, filename="net3", runs=1)
```
Increase the number of hybridizations sequentially: hmax=0,1,2,..., and use the best network at h-1 as starting point to estimate the best network at h.
Whenever possible, do many independent runs (default runs=10) to avoid getting stuck on local maxima. We do not need to do all runs sequentially. We can parallize by doing different runs on different cores (or computers). If we are on a machine or on a cluster that has many different cores, we can ask Julia to use these multiple cores, and snaq! will send different runs to different cores, like we did earlier.
For long jobs, run as a script in the terminal: julia runSNaQ.jl, arguments to the script are passed to Julia as a vector called ARGS. See the example script runSNaQ.jl in the examples folder of SNaQ, or the file located here. More on this topic in here: Parallel computations

This documentation pertains to SNaQ v1.0 as originally described in Solís-Lemus & Ané (2016)