Public Documentation

Documentation for SNaQ's public (exported) interface.

Functions & Types

SNaQ.bootsnaqMethod
bootsnaq(T::HybridNetwork, df::DataFrame)
bootsnaq(T::HybridNetwork, vector of tree lists)

Bootstrap analysis for SNaQ. Bootstrap data can be quartet concordance factors (CF), drawn from sampling uniformly in their credibility intervals, as given in the data frame df. Alternatively, bootstrap data can be gene trees sampled from a vector of tree lists: one list of bootstrap trees per locus (see readmultinewick_files in PhyloNetworks to generate this, from a file containing a list of bootstrap files: one per locus).

From each bootstrap replicate, a network is estimated with snaq!, with a search starting from topology T. Optional arguments include the following, with default values in parentheses:

  • hmax (1): max number of reticulations in the estimated networks
  • nrep (10): number of bootstrap replicates.
  • runs (10): number of independent optimization runs for each replicate
  • filename ("bootsnaq"): root name for output files. No output files if "".
  • seed (0 to get a random seed from the clock): seed for random number generator
  • otherNet (empty): another starting topology so that each replicate will start prcnet% runs on otherNet and (1-prcnet)% runs on T
  • prcnet (0): percentage of runs starting on otherNet; error if different than 0.0, and otherNet not specified.
  • ftolRel, ftolAbs, xtolRel, xtolAbs, liktolAbs, Nfail, probST, verbose, outgroup: see snaq!, same defaults.

If T is a tree, its branch lengths are first optimized roughly with updateBL! (by using the average CF of all quartets defining each branch and calculating the coalescent units corresponding to this quartet CF). If T has one or more reticulations, its branch lengths are taken as is to start the search. The branch lengths of otherNet are always taken as is to start the search.

source
SNaQ.fittedquartetCFFunction
fittedquartetCF(d::DataCF, format::Symbol)

Data frame with the observed and expected quartet concordance factors after estimation of a network with snaq!, or fitting of quartet CF data on a fixed network. The format can be :wide (default) or :long.

  • if wide, the output has one row per 4-taxon set, and each row has 10 columns: 4 columns for the taxon names, 3 columns for the observed CFs and 3 columns for the expected CF.
  • if long, the output has one row per quartet, i.e. 3 rows per 4-taxon sets, and 7 columns: 4 columns for the taxon names, one column to give the quartet resolution, one column for the observed CF and the last column for the expected CF.

see also: topologyQpseudolik! and topologymaxQpseudolik! to update the fitted quartet CF expected under a specific network, inside the DataCF object d.

source
SNaQ.mapallelesCFtableMethod
mapallelesCFtable(mapping file, CF file; filename, columns, delim)

Create a new DataFrame containing the same concordance factors as in the input CF file, but with modified taxon names. Each allele name in the input CF table is replaced by the species name that the allele maps onto, based on the mapping file. The mapping file should have column names: allele and species.

Optional arguments:

  • filename to write/save resulting CF table. If not specified, then the output data frame is not saved to a file.
  • columns column numbers for the taxon names. 1-4 by default.
  • any keyword arguments that CSV.File would accept. For example, delim=',' by default: columns are delimited by commas. Unless specified otherwise by the user, pool=false (to read taxon names as Strings, not levels of a categorical factor, for combining the 4 columns with taxon names more easily). The same CSV arguments are used to read both input file (mapping file and quartet file)

See also mapallelesCFtable! to input DataFrames instead of file names.

If a filename is specified, such as "quartetCF_speciesNames.csv" in the example below, this file is best read later with the option pool=false. example:

mapallelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
                  filename = "quartetCF_speciesNames.csv")
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
dataCF_specieslevel = readtableCF!(df_sp, mergerows=true); # DataCF object
source
SNaQ.readmultinewicklevel1Method
readmultinewicklevel1(file)

Read a text file with a list of trees/networks in extended newick format (one tree per line) and transform them like readnewicklevel1. Namely, in each tree/network

  • the root is suppressed (becomes of degree 3 if it was of degree 2)
  • any polytomy is resolved arbitrarily
  • any missing branch length is set to 1
  • any branch length above 10 is set to 10 (this assumes branch lengths in coalescent units)
  • any missing γ's are set to (0.1, 0.9)

and more (see readnewicklevel1).

See PhyloNetworks.readmultinewick to read multiple trees or networks with no modification.

Output: array of HybridNetwork objects.

Each line starting with "(" will be considered as describing one topology. The file can have extra lines that are ignored.

source
SNaQ.readnewicklevel1Method
readnewicklevel1(filename)
readnewicklevel1(parenthetical format)

Similarly to PhyloNetworks.readnewick: read a tree or network in parenthetical format, but this function enforces the necessary conditions for any starting topology in SNaQ: non-intersecting cycles, no polytomies, unrooted. It sets any missing branch length to 1.0, and reduces any branch length above 10 to 10 (as it assumes branch lengths are in coalescent units).

If the network has a bad diamond II (in which edge lengths are γ's are not identifiable) and if the edge below this diamond has a length t different from 0, then this length is set back to 0 and the major parent hybrid edge is lengthened by t.

source
SNaQ.readsnaqnetworkMethod
readsnaqnetwork(output file)

Read the estimated network from a .out file generated by snaq!. The network score is read also, and stored in the network's field .loglik.

Warning: despite the name "loglik", this score is only proportional to the network's pseudo-deviance: the lower, the better. Do NOT use this score to calculate an AIC or BIC (etc.) value.

source
SNaQ.readtableCF!Method
readtableCF!(data frame, columns; mergerows=false)

Read in quartet CFs from data frame, assuming information is in columns numbered columns, of length 7 or 8: 4 taxon labels then 3 CFs then ngenes possibly.

If some species appears more than once in the same 4-taxon set (e.g. t1,t1,t2,t3), then the data frame is modified to remove rows (4-taxon sets) that are uninformative about between-species relationships. This situation may occur if multiple individuals are sampled from the same species. A 4-taxon set is uninformative (and its row is removed) if one taxon is repeated 3 or 4 times (like t1,t1,t1,t1 or t1,t2,t2,t2). The list of species appearing twice in some 4-taxon sets is stored in the output DataCF object. For these species, the length of their external edge is identifiable (in coalescent units). If multiple rows correspond to the same 4-taxon set, these rows are merged and their CF values (and number of genes) are averaged. If none of the species is repeated within any 4-taxon set, then this averaging is attempted only if mergerows is true.

readtableCF!(DataCF, data frame, columns)

Modify the .quartet.obsCF values in the DataCF object with those read from the data frame in columns numbered columns. columns should have 3 columns numbers for the 3 CFs in this order: 12_34, 13_24 and 14_23.

Assumptions:

  • same 4-taxon sets in DataCF and in the data frame, and in the same order, but this assumption is not checked (for speed, e.g. during bootstrapping).
  • one single row per 4-taxon set (multiple individuals representatives of the same 4-taxon set should have been already merged); basically: the DataCF should have been created from the data frame by readtableCF!(df, colums)
source
SNaQ.readtableCFMethod
readtableCF(file)
readtableCF(data frame)
readtableCF!(data frame)

Read a file or DataFrame object containing a table of concordance factors (CF), with one row per 4-taxon set. The first 4 columns are assumed to give the labels of the 4 taxa in each set (tx1, tx2, tx3, tx4). Columns containing the CFs are assumed to be named CF12_34, CF13_24 and CF14_23; or CF12.34, CF13.24 and CF14.23; or else are assumed to be columns 5,6,7. If present, a column named 'ngenes' will be used to get the number of loci used to estimate the CFs for each 4-taxon set.

Output: DataCF object

Optional arguments:

  • summaryfile: if specified, a summary file will be created with that name.
  • delim (for the first form only): to specify how columns are delimited, with single quotes: delim=';'. Default is a csv file, i.e. delim=','.
  • mergerows: false by default. When true, will attempt to merge multiple rows corresponding to the same four-taxon set (by averaging their quartet CFs) even if none of the species is repeated within any row (that is, in any set of 4 taxa)

The last version modifies the input data frame, if species are represented by multiple alleles for instance (see readtableCF!(data frame, columns)).

source
SNaQ.readtrees2CFMethod
readtrees2CF(treefile)
readtrees2CF(vector of trees)

Read trees in parenthetical format from a file, or take a vector of trees already read, and calculate the proportion of these trees having a given quartet (concordance factor: CF), for all quartets or for a sample of quartets. Optional arguments include:

  • quartetfile: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.
  • whichQ="rand": to choose a random sample of 4-taxon subsets
  • numQ: size of random sample (ignored if whichQ is not set to "rand")
  • writeTab=false: does not write the observedCF to a table (default true)
  • CFfile: name of file to save the observedCF (default tableCF.txt)
  • writeQ=true: save intermediate files with the list of all 4-taxon subsets and chosen random sample (default false).
  • writeSummary: write descriptive stats of input data (default: true)
  • nexus: if true, it assumes the gene trees are written in nexus file (default: false)

See also: PhyloNetworks.countquartetsintrees, which uses a much faster algorithm; readtableCF to read a table of quartet CFs directly.

source
SNaQ.snaq!Method
snaq!(T::HybridNetwork, d::DataCF)

Estimate the network (or tree) to fit observed quartet concordance factors (CFs) stored in a DataCF object, using maximum pseudo-likelihood. A level-1 network is assumed. The search starts from topology T, which can be a tree or a network with no more than hmax hybrid nodes. The function name ends with ! because it modifies the CF data d by updating its attributes expCF: CFs expected under the network model. It does not modify T. The quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.

Output:

  • estimated network in file .out (also in .log): best network overall and list of networks from each individual run.
  • the best network and modifications of it, in file .networks. All networks in this file have the same undirected topology as the best network, but have different hybrid/gene flow directions. These other networks are reported with their pseudo-likelihood scores, because non-identifiability issues can cause them to have very similar scores, and because SNaQ was shown to estimate the undirected topology accurately but not the direction of hybridization in cases of near non-identifiability.
  • if any error occurred, file .err provides information (seed) to reproduce the error.

There are many optional arguments, including

  • hmax (default 1): maximum number of hybridizations allowed
  • verbose (default false): if true, print information about the numerical optimization
  • runs (default 10): number of independent starting points for the search
  • outgroup (default none): outgroup taxon to root the estimated topology at the very end
  • filename (default "snaq"): root name for the output files (.out, .err). If empty (""), files are not created, progress log goes to the screen only (standard out).
  • seed (default 0 to get it from the clock): seed to replicate a given search
  • probST (default 0.3): probability to start from T at each given run. With problability 1-probST, the search is started from an NNI modification of T along a tree edge with no hybrid neighbor, with a possible modification of one reticulation if T has one.
  • updateBL (default true): If true and if T is a tree, the branch lengths in T are first optimized roughly with updateBL! by using the average CF of all quartets defining each branch and back-calculating the coalescent units.

The following optional arguments control when to stop the optimization of branch lengths and γ's on each individual candidate network. Defaults are in parentheses:

  • ftolRel (1e-6) and ftolAbs (1e-6): relative and absolute differences of the network score between the current and proposed parameters,
  • xtolRel (1e-2) and xtolAbs (1e-3): relative and absolute differences between the current and proposed parameters.

Greater values will result in a less thorough but faster search. These parameters are used when evaluating candidate networks only. The following optional arguments control when to stop proposing new network topologies:

  • Nfail (75): maximum number of times that new topologies are proposed and rejected (in a row).
  • liktolAbs (1e-6): the proposed network is accepted if its score is better than the current score by at least liktolAbs.

Lower values of Nfail and greater values of liktolAbs and ftolAbs would result in a less thorough but faster search.

At the end, branch lengths and γ's are optimized on the last "best" network with different and very thorough tolerance parameters: 1e-12 for ftolRel, 1e-10 for ftolAbs, xtolRel, xtolAbs.

See also: topologymaxQpseudolik! to optimize parameters on a fixed topology, and topologyQpseudolik! to get the deviance (pseudo log-likelihood up to a constant) of a fixed topology with fixed parameters.

Reference: Claudia Solís-Lemus and Cécile Ané (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics 12(3):e1005896

source
SNaQ.summarizedataCFMethod
summarizedataCF(d::DataCF)

function to summarize the information contained in a DataCF object. It has the following optional arguments:

  • filename: if provided, the summary will be saved in the filename, not to screen
  • pc (number between (0,1)): threshold of percentage of missing genes to identify 4-taxon subsets with fewer genes than the threshold
source
SNaQ.topologyQpseudolik!Method
topologyQpseudolik!(net::HybridNetwork, d::DataCF)

Calculate the quartet pseudo-deviance of a given network/tree for DataCF d. This is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.

Note that the network's parameters (edge lengths and γs) are not optimized. So be careful if net does not have all internal branch lengths specified, because then the pseudolikelihood will be meaningless. See topologymaxQpseudolik! if you want branch lengths and numerical parameters optimized on the given network.

The loglik value of the network is updated, and d is updated with the expected concordance factors under the input network.

source
SNaQ.topologymaxQpseudolik!Method
topologymaxQpseudolik!(
    net::HybridNetwork,
    d::DataCF;
    verbose = true,
    ftolRel = 1e-6,
    ftolAbs = 1e-6,
    xtolRel = 1e-2,
    xtolAbs = 1e-3
)

Estimate the branch lengths and inheritance probabilities (γ's) for a given network topology. The network is not modified, only the object d is, with updated expected concordance factors.

Ouput: new network, with optimized parameters (branch lengths and gammas). The maximized quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0. This is also an attribute of the network, which can be accessed with loglik(net).

Optional arguments:

  • verbose: if true, information on the numerical optimization is printed to screen
  • ftolRel, ftolAbs, xtolRel, xtolAbs: absolute and relative tolerance values for the pseudo-deviance function (f) and the parameters (x).

The default tolerance values are quite lenient, for faster running time. They are more lenient than those used in snaq!, so we can expect that snaq! returns a better (lower) score for the same network topology. It is highly recommended to use more stringent value than the default, for example 1e-12 for ftolRel, and 1e-10 for ftolAbs, xtolRel, xtolAbs to match those in snaq!.

To further optimize branch lengths and γs, another strategy is the run the topologymaxQpseudolik! multiple times, because each time the edge parameters in the network are improved, and re-starting a search from a good place leads to finding even better edge parameter values. If snaq! finds a better score for the given network topology, then using this strategy (effectively used by snaq! when optimizing edge parameters at each trial) and using stringent tolerances should eliminate the difference.

source
SNaQ.DataCFType
DataCF

type that contains the following attributes:

  • quartet (vector of Quartets)
  • numQuartets
  • tree (vector of trees: empty if a table of CF was input instead of list of trees)
  • numTrees (-1 if a table CF was input instead of list of trees)
  • repSpecies (taxon names that were repeated in table of CF or input gene trees: used inside snaq for multiple alleles case)

The list of Quartet may be accessed with the attribute .quartet. If the input was a list of trees, the HybridNetwork's can be accessed with the attribute .tree. For example, if the DataCF object is named d, d.quartet[1] will show the first quartet and d.tree[1] will print the first input tree.

source
SNaQ.QuartetType
Quartet

type that saves the information on a given 4-taxon subset. It contains the following attributes:

  • number: integer
  • taxon: vector of taxon names, like t1 t2 t3 t4
  • obsCF: vector of observed CF, in order 12|34, 13|24, 14|23
  • logPseudoLik
  • ngenes: number of gene trees used to compute the observed CF; -1.0 if unknown
  • qnet: QuartetNetwork, which saves the expCF after snaq estimation to emphasize that the expCF depend on a specific network, not the data

see also: PhyloNetworks.QuartetT for quartet with data of user-defined type T, using a mapping between quartet indices and quartet taxa.

source

Index