Public Documentation
Documentation for SNaQ
's public (exported) interface.
Functions & Types
SNaQ.bootsnaq
— Methodbootsnaq(T::HybridNetwork, df::DataFrame)
bootsnaq(T::HybridNetwork, vector of tree lists)
Bootstrap analysis for SNaQ. Bootstrap data can be quartet concordance factors (CF), drawn from sampling uniformly in their credibility intervals, as given in the data frame df
. Alternatively, bootstrap data can be gene trees sampled from a vector of tree lists: one list of bootstrap trees per locus (see readmultinewick_files
in PhyloNetworks
to generate this, from a file containing a list of bootstrap files: one per locus).
From each bootstrap replicate, a network is estimated with snaq!
, with a search starting from topology T
. Optional arguments include the following, with default values in parentheses:
hmax
(1): max number of reticulations in the estimated networksnrep
(10): number of bootstrap replicates.runs
(10): number of independent optimization runs for each replicatefilename
("bootsnaq"): root name for output files. No output files if "".seed
(0 to get a random seed from the clock): seed for random number generatorotherNet
(empty): another starting topology so that each replicate will start prcnet% runs on otherNet and (1-prcnet)% runs onT
prcnet
(0): percentage of runs starting onotherNet
; error if different than 0.0, and otherNet not specified.ftolRel
,ftolAbs
,xtolRel
,xtolAbs
,liktolAbs
,Nfail
,probST
,verbose
,outgroup
: seesnaq!
, same defaults.
If T
is a tree, its branch lengths are first optimized roughly with updateBL!
(by using the average CF of all quartets defining each branch and calculating the coalescent units corresponding to this quartet CF). If T
has one or more reticulations, its branch lengths are taken as is to start the search. The branch lengths of otherNet
are always taken as is to start the search.
SNaQ.fittedquartetCF
— FunctionfittedquartetCF(d::DataCF, format::Symbol)
Data frame with the observed and expected quartet concordance factors after estimation of a network with snaq!
, or fitting of quartet CF data on a fixed network. The format can be :wide
(default) or :long
.
- if
wide
, the output has one row per 4-taxon set, and each row has 10 columns: 4 columns for the taxon names, 3 columns for the observed CFs and 3 columns for the expected CF. - if
long
, the output has one row per quartet, i.e. 3 rows per 4-taxon sets, and 7 columns: 4 columns for the taxon names, one column to give the quartet resolution, one column for the observed CF and the last column for the expected CF.
see also: topologyQpseudolik!
and topologymaxQpseudolik!
to update the fitted quartet CF expected under a specific network, inside the DataCF object d
.
SNaQ.mapallelesCFtable
— MethodmapallelesCFtable(mapping file, CF file; filename, columns, delim)
Create a new DataFrame containing the same concordance factors as in the input CF file, but with modified taxon names. Each allele name in the input CF table is replaced by the species name that the allele maps onto, based on the mapping file. The mapping file should have column names: allele and species.
Optional arguments:
filename
to write/save resulting CF table. If not specified, then the output data frame is not saved to a file.columns
column numbers for the taxon names. 1-4 by default.- any keyword arguments that
CSV.File
would accept. For example,delim=','
by default: columns are delimited by commas. Unless specified otherwise by the user,pool=false
(to read taxon names as Strings, not levels of a categorical factor, for combining the 4 columns with taxon names more easily). The same CSV arguments are used to read both input file (mapping file and quartet file)
See also mapallelesCFtable!
to input DataFrames instead of file names.
If a filename
is specified, such as "quartetCF_speciesNames.csv" in the example below, this file is best read later with the option pool=false
. example:
mapallelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
filename = "quartetCF_speciesNames.csv")
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
dataCF_specieslevel = readtableCF!(df_sp, mergerows=true); # DataCF object
SNaQ.readmultinewicklevel1
— Methodreadmultinewicklevel1(file)
Read a text file with a list of trees/networks in extended newick format (one tree per line) and transform them like readnewicklevel1
. Namely, in each tree/network
- the root is suppressed (becomes of degree 3 if it was of degree 2)
- any polytomy is resolved arbitrarily
- any missing branch length is set to 1
- any branch length above 10 is set to 10 (this assumes branch lengths in coalescent units)
- any missing γ's are set to (0.1, 0.9)
and more (see readnewicklevel1
).
See PhyloNetworks.readmultinewick
to read multiple trees or networks with no modification.
Output: array of HybridNetwork objects.
Each line starting with "(" will be considered as describing one topology. The file can have extra lines that are ignored.
SNaQ.readnewicklevel1
— Methodreadnewicklevel1(filename)
readnewicklevel1(parenthetical format)
Similarly to PhyloNetworks.readnewick
: read a tree or network in parenthetical format, but this function enforces the necessary conditions for any starting topology in SNaQ: non-intersecting cycles, no polytomies, unrooted. It sets any missing branch length to 1.0, and reduces any branch length above 10 to 10 (as it assumes branch lengths are in coalescent units).
If the network has a bad diamond II (in which edge lengths are γ's are not identifiable) and if the edge below this diamond has a length t
different from 0, then this length is set back to 0 and the major parent hybrid edge is lengthened by t
.
SNaQ.readsnaqnetwork
— Methodreadsnaqnetwork(output file)
Read the estimated network from a .out
file generated by snaq!
. The network score is read also, and stored in the network's field .loglik
.
Warning: despite the name "loglik", this score is only proportional to the network's pseudo-deviance: the lower, the better. Do NOT use this score to calculate an AIC or BIC (etc.) value.
SNaQ.readtableCF!
— MethodreadtableCF!(data frame, columns; mergerows=false)
Read in quartet CFs from data frame, assuming information is in columns numbered columns
, of length 7 or 8: 4 taxon labels then 3 CFs then ngenes possibly.
If some species appears more than once in the same 4-taxon set (e.g. t1,t1,t2,t3), then the data frame is modified to remove rows (4-taxon sets) that are uninformative about between-species relationships. This situation may occur if multiple individuals are sampled from the same species. A 4-taxon set is uninformative (and its row is removed) if one taxon is repeated 3 or 4 times (like t1,t1,t1,t1 or t1,t2,t2,t2). The list of species appearing twice in some 4-taxon sets is stored in the output DataCF object. For these species, the length of their external edge is identifiable (in coalescent units). If multiple rows correspond to the same 4-taxon set, these rows are merged and their CF values (and number of genes) are averaged. If none of the species is repeated within any 4-taxon set, then this averaging is attempted only if mergerows
is true.
readtableCF!(DataCF, data frame, columns)
Modify the .quartet.obsCF
values in the DataCF
object with those read from the data frame in columns numbered columns
. columns
should have 3 columns numbers for the 3 CFs in this order: 12_34
, 13_24
and 14_23
.
Assumptions:
- same 4-taxon sets in
DataCF
and in the data frame, and in the same order, but this assumption is not checked (for speed, e.g. during bootstrapping). - one single row per 4-taxon set (multiple individuals representatives of the same 4-taxon set should have been already merged); basically: the DataCF should have been created from the data frame by
readtableCF!(df, colums)
SNaQ.readtableCF
— MethodreadtableCF(file)
readtableCF(data frame)
readtableCF!(data frame)
Read a file or DataFrame object containing a table of concordance factors (CF), with one row per 4-taxon set. The first 4 columns are assumed to give the labels of the 4 taxa in each set (tx1, tx2, tx3, tx4). Columns containing the CFs are assumed to be named CF12_34
, CF13_24
and CF14_23
; or CF12.34
, CF13.24
and CF14.23
; or else are assumed to be columns 5,6,7. If present, a column named 'ngenes' will be used to get the number of loci used to estimate the CFs for each 4-taxon set.
Output: DataCF
object
Optional arguments:
summaryfile
: if specified, a summary file will be created with that name.delim
(for the first form only): to specify how columns are delimited, with single quotes: delim=';'. Default is acsv
file, i.e.delim=','
.mergerows
: false by default. When true, will attempt to merge multiple rows corresponding to the same four-taxon set (by averaging their quartet CFs) even if none of the species is repeated within any row (that is, in any set of 4 taxa)
The last version modifies the input data frame, if species are represented by multiple alleles for instance (see readtableCF!
(data frame, columns)).
SNaQ.readtrees2CF
— Methodreadtrees2CF(treefile)
readtrees2CF(vector of trees)
Read trees in parenthetical format from a file, or take a vector of trees already read, and calculate the proportion of these trees having a given quartet (concordance factor: CF), for all quartets or for a sample of quartets. Optional arguments include:
quartetfile
: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.whichQ="rand"
: to choose a random sample of 4-taxon subsetsnumQ
: size of random sample (ignored if whichQ is not set to "rand")writeTab=false
: does not write the observedCF to a table (default true)CFfile
: name of file to save the observedCF (default tableCF.txt)writeQ=true
: save intermediate files with the list of all 4-taxon subsets and chosen random sample (default false).writeSummary
: write descriptive stats of input data (default: true)nexus
: if true, it assumes the gene trees are written in nexus file (default: false)
See also: PhyloNetworks.countquartetsintrees
, which uses a much faster algorithm; readtableCF
to read a table of quartet CFs directly.
SNaQ.snaq!
— Methodsnaq!(T::HybridNetwork, d::DataCF)
Estimate the network (or tree) to fit observed quartet concordance factors (CFs) stored in a DataCF object, using maximum pseudo-likelihood. A level-1 network is assumed. The search starts from topology T
, which can be a tree or a network with no more than hmax
hybrid nodes. The function name ends with ! because it modifies the CF data d
by updating its attributes expCF
: CFs expected under the network model. It does not modify T
. The quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Output:
- estimated network in file
.out
(also in.log
): best network overall and list of networks from each individual run. - the best network and modifications of it, in file
.networks
. All networks in this file have the same undirected topology as the best network, but have different hybrid/gene flow directions. These other networks are reported with their pseudo-likelihood scores, because non-identifiability issues can cause them to have very similar scores, and because SNaQ was shown to estimate the undirected topology accurately but not the direction of hybridization in cases of near non-identifiability. - if any error occurred, file
.err
provides information (seed) to reproduce the error.
There are many optional arguments, including
hmax
(default 1): maximum number of hybridizations allowedverbose
(default false): if true, print information about the numerical optimizationruns
(default 10): number of independent starting points for the searchoutgroup
(default none): outgroup taxon to root the estimated topology at the very endfilename
(default "snaq"): root name for the output files (.out
,.err
). If empty (""), files are not created, progress log goes to the screen only (standard out).seed
(default 0 to get it from the clock): seed to replicate a given searchprobST
(default 0.3): probability to start fromT
at each given run. With problability 1-probST, the search is started from an NNI modification ofT
along a tree edge with no hybrid neighbor, with a possible modification of one reticulation ifT
has one.updateBL
(default true): If true and ifT
is a tree, the branch lengths inT
are first optimized roughly withupdateBL!
by using the average CF of all quartets defining each branch and back-calculating the coalescent units.
The following optional arguments control when to stop the optimization of branch lengths and γ's on each individual candidate network. Defaults are in parentheses:
ftolRel
(1e-6) andftolAbs
(1e-6): relative and absolute differences of the network score between the current and proposed parameters,xtolRel
(1e-2) andxtolAbs
(1e-3): relative and absolute differences between the current and proposed parameters.
Greater values will result in a less thorough but faster search. These parameters are used when evaluating candidate networks only. The following optional arguments control when to stop proposing new network topologies:
Nfail
(75): maximum number of times that new topologies are proposed and rejected (in a row).liktolAbs
(1e-6): the proposed network is accepted if its score is better than the current score by at least liktolAbs.
Lower values of Nfail
and greater values of liktolAbs
and ftolAbs
would result in a less thorough but faster search.
At the end, branch lengths and γ's are optimized on the last "best" network with different and very thorough tolerance parameters: 1e-12 for ftolRel
, 1e-10 for ftolAbs
, xtolRel
, xtolAbs
.
See also: topologymaxQpseudolik!
to optimize parameters on a fixed topology, and topologyQpseudolik!
to get the deviance (pseudo log-likelihood up to a constant) of a fixed topology with fixed parameters.
Reference: Claudia Solís-Lemus and Cécile Ané (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics 12(3):e1005896
SNaQ.summarizedataCF
— MethodsummarizedataCF(d::DataCF)
function to summarize the information contained in a DataCF object. It has the following optional arguments:
filename
: if provided, the summary will be saved in the filename, not to screenpc
(number between (0,1)): threshold of percentage of missing genes to identify 4-taxon subsets with fewer genes than the threshold
SNaQ.topologyQpseudolik!
— MethodtopologyQpseudolik!(net::HybridNetwork, d::DataCF)
Calculate the quartet pseudo-deviance of a given network/tree for DataCF d
. This is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Note that the network's parameters (edge lengths and γs) are not optimized. So be careful if net
does not have all internal branch lengths specified, because then the pseudolikelihood will be meaningless. See topologymaxQpseudolik!
if you want branch lengths and numerical parameters optimized on the given network.
The loglik value of the network is updated, and d
is updated with the expected concordance factors under the input network.
SNaQ.topologymaxQpseudolik!
— MethodtopologymaxQpseudolik!(
net::HybridNetwork,
d::DataCF;
verbose = true,
ftolRel = 1e-6,
ftolAbs = 1e-6,
xtolRel = 1e-2,
xtolAbs = 1e-3
)
Estimate the branch lengths and inheritance probabilities (γ's) for a given network topology. The network is not modified, only the object d
is, with updated expected concordance factors.
Ouput: new network, with optimized parameters (branch lengths and gammas). The maximized quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0. This is also an attribute of the network, which can be accessed with loglik(net)
.
Optional arguments:
verbose
: iftrue
, information on the numerical optimization is printed to screenftolRel
,ftolAbs
,xtolRel
,xtolAbs
: absolute and relative tolerance values for the pseudo-deviance function (f
) and the parameters (x
).
The default tolerance values are quite lenient, for faster running time. They are more lenient than those used in snaq!
, so we can expect that snaq!
returns a better (lower) score for the same network topology. It is highly recommended to use more stringent value than the default, for example 1e-12 for ftolRel
, and 1e-10 for ftolAbs
, xtolRel
, xtolAbs
to match those in snaq!
.
To further optimize branch lengths and γs, another strategy is the run the topologymaxQpseudolik!
multiple times, because each time the edge parameters in the network are improved, and re-starting a search from a good place leads to finding even better edge parameter values. If snaq!
finds a better score for the given network topology, then using this strategy (effectively used by snaq!
when optimizing edge parameters at each trial) and using stringent tolerances should eliminate the difference.
SNaQ.DataCF
— TypeDataCF
type that contains the following attributes:
quartet
(vector of Quartets)numQuartets
tree
(vector of trees: empty if a table of CF was input instead of list of trees)numTrees
(-1 if a table CF was input instead of list of trees)repSpecies
(taxon names that were repeated in table of CF or input gene trees: used inside snaq for multiple alleles case)
The list of Quartet
may be accessed with the attribute .quartet
. If the input was a list of trees, the HybridNetwork
's can be accessed with the attribute .tree
. For example, if the DataCF
object is named d
, d.quartet[1]
will show the first quartet and d.tree[1]
will print the first input tree.
SNaQ.Quartet
— TypeQuartet
type that saves the information on a given 4-taxon subset. It contains the following attributes:
number
: integertaxon
: vector of taxon names, like t1 t2 t3 t4obsCF
: vector of observed CF, in order 12|34, 13|24, 14|23logPseudoLik
ngenes
: number of gene trees used to compute the observed CF; -1.0 if unknownqnet
:QuartetNetwork
, which saves the expCF after snaq estimation to emphasize that the expCF depend on a specific network, not the data
see also: PhyloNetworks.QuartetT
for quartet with data of user-defined type T
, using a mapping between quartet indices and quartet taxa.