SNaQ
Documentation for SNaQ.
SNaQ.fAbs
SNaQ.DataCF
SNaQ.Quartet
SNaQ.QuartetNetwork
SNaQ.QuartetT
SNaQ.afterOptBL!
SNaQ.afterOptBLAll!
SNaQ.afterOptBLRepeat!
SNaQ.bootsnaq
SNaQ.calculateObsCFAll!
SNaQ.checkMapDF
SNaQ.countquartetsintrees
SNaQ.deleteLeaf!
SNaQ.fittedQuartetCF
SNaQ.gammaZero!
SNaQ.getNeighborsTarget
SNaQ.hybridatnode
SNaQ.hybridatnode!
SNaQ.hybridatnode!
SNaQ.mapAllelesCFtable
SNaQ.mapAllelesCFtable!
SNaQ.moveHybrid!
SNaQ.moveTargetUpdate!
SNaQ.nchoose1234
SNaQ.optBL!
SNaQ.optTopLevel!
SNaQ.optTopRun1!
SNaQ.optTopRuns!
SNaQ.proposedTop!
SNaQ.quartetdata_columnnames
SNaQ.quartetrank
SNaQ.readInputData
SNaQ.readInputTrees
SNaQ.readSnaqNetwork
SNaQ.readTableCF
SNaQ.readTableCF!
SNaQ.readTopologyLevel1
SNaQ.readTrees2CF
SNaQ.sameTaxa
SNaQ.sampleCFfromCI
SNaQ.setGammaBLfromGammaz!
SNaQ.setLength!
SNaQ.setNonIdBL!
SNaQ.snaq!
SNaQ.sort_stringasinteger!
SNaQ.sorttaxa!
SNaQ.summarizeDataCF
SNaQ.taxadiff
SNaQ.topologyMaxQPseudolik!
SNaQ.topologyQPseudolik!
SNaQ.traverseContainRoot!
SNaQ.undirectedOtherNetworks
SNaQ.undoGammaz!
SNaQ.updateBL!
SNaQ.updateContainRoot!
SNaQ.writeTableCF
SNaQ.writeTableCF
SNaQ.writeTopologyLevel1
SNaQ.fAbs
— Constantdefault values for tolerance parameters, used in the optimization of branch lengths (fAbs, fRel, xAbs, xRel) and in the acceptance of topologies (likAbs, numFails).
if changes are made here, make the same in the docstring for snaq! below
version | fAbs | fRel | xAbs | xRel | numFails | likAbs | multiplier |
---|---|---|---|---|---|---|---|
v0.5.1 | 1e-6 | 1e-6 | 1e-3 | 1e-2 | 75 | 1e-6 | |
v0.3.0 | 1e-6 | 1e-5 | 1e-4 | 1e-3 | 100 | 0.01 | |
v0.0.1 | 1e-6 | 1e-5 | 1e-4 | 1e-3 | 100 | 10000 | |
older | 1e-10 | 1e-12 | 1e-10 | 1e-10 |
v0.5.1: based on Nan Ji's work. same xAbs and xRel as in phylonet (as of 2015). earlier: multiplier was used; later: likAbs = multiplier*fAbs) "older": values from GLM.jl, Prof Bates
default values used on a single topology, to optimize branch lengths and gammas, at the very end of snaq!, and by topologyMaxQPseudolik! since v0.5.1.
version | fAbsBL | fRelBL | xAbsBL | xRelBL |
---|---|---|---|---|
v0.0.1 | 1e-10 | 1e-12 | 1e-10 | 1e-10 |
SNaQ.DataCF
— TypeDataCF
type that contains the following attributes:
- quartet (vector of Quartets)
- numQuartets
- tree (vector of trees: empty if a table of CF was input instead of list of trees)
- numTrees (-1 if a table CF was input instead of list of trees)
- repSpecies (taxon names that were repeated in table of CF or input gene trees: used inside snaq for multiple alleles case)
The list of Quartet may be accessed with the attribute .quartet. If the input was a list of trees, the HybridNetwork's can be accessed with the attribute .tree. For example, if the DataCF object is named d, d.quartet[1] will show the first quartet and d.tree[1] will print the first input tree.
SNaQ.Quartet
— TypeQuartet
type that saves the information on a given 4-taxon subset. It contains the following attributes:
- number: integer
- taxon: vector of taxon names, like t1 t2 t3 t4
- obsCF: vector of observed CF, in order 12|34, 13|24, 14|23
- logPseudoLik
- ngenes: number of gene trees used to compute the observed CF; -1.0 if unknown
- qnet:
QuartetNetwork
, which saves the expCF after snaq estimation to emphasize that the expCF depend on a specific network, not the data
see also: QuartetT
for quartet with data of user-defined type T
, using a mapping between quartet indices and quartet taxa.
SNaQ.QuartetNetwork
— TypeQuartetNetwork(net::HybridNetwork)
Subtype of Network
abstract type. A QuartetNetwork
object is an internal type used to calculate the expected CFs of quartets on a given network. Attributes of the QuartetNetwork
objects need not be updated at a given time (see below).
The procedure to calculate expected CFs for a given network is as follows:
- A
QuartetNetwork
object is created for eachQuartet
usingextractQuartet!(net,d)
fornet::HybridNetwork
andd::DataCF
- The vector
d.quartet
has all theQuartet
objects, each with aQuartetNetwork
object (q.qnet
). Attibutes inQuartetNetwork
are not updated at this point - Attributes in
QuartetNetwork
are partially updated when calculating the expected CF (calculateExpCFAll!
). To calculate the expected CF for this quartet, we need to update the attributes:which
,typeHyb
,t1
,split
,formula
,expCF
. To do this, we need to modify theQuartetNetwork
object (i.e. merge edges,...). But we do not want to modify it directly because it is connected to the originalnet
via a map of the edges and nodes, so we use a deep copy:qnet=deepcopy(q.qnet)
and thencalculateExpCFAll!(qnet)
. Attributes that are updated on the originalQuartetNetwork
objectq.qnet
are:q.qnet.hasEdge
: array of booleans of length equal tonet.edge
that shows which identifiable edges and gammas ofnet
(net.ht
) are inqnet
(and still identifiable). Note that the first elements of the vector correspond to the gammas.q.qnet.index
: length should match the number of trues inqnet.hasEdge
. It has the indexes inqnet.edge
from the edges inqnet.hasEdge
. Note that the first elements of the vector correspond to the gammas.q.qnet.edge
: list of edges inQuartetNetwork
. Note that external edges innet
are collapsed when they appear inQuartetNetwork
, so only internal edges map directly to edges innet
q.qnet.expCF
: expected CF for thisQuartet
Why not modify the original QuartetNetwork
? We wanted to keep the original QuartetNetwork
stored in DataCF
with all the identifiable edges, to be able to determine if this object had been changed or not after a certain optimization.
The process is:
- Deep copy of full network to create
q.qnet
forQuartet q
. ThisQuartetNetwork
object has only 4 leaves now, but does not have merged edges (the identifiable ones) so that we can correspond to the edges in net. ThisQuartetNetwork
does not have other attributes updated. - For the current set of branch lengths and gammas, we can update the attributes in
q.qnet
to compute the expected CF. The functions that do this will "destroy" theQuartetNetwork
object by merging edges, removing nodes, etc... So, we do this process inqnet=deepcopy(q.qnet)
, and at the end, only updateq.qnet.expCF
. - After we optimize branch lengths in the full network, we want to update the branch lengths in
q.qnet
. The edges need to be there (which is why we do not want to modify thisQuartetNetwork
object by merging edges), and we do not do a deep-copy of the full network again. We only change the values of branch lengths and gammas inq.qnet
, and we can re-calculate the expCF by creating a deep copyqnet=deepcopy(q.qnet)
and run the other functions (which merge edges, etc) to get theexpCF
.
Future work: there are definitely more efficient ways to do this (without the deep copies). In addition, currently edges that are no longer identifiable in QuartetNetwork
do not appear in hasEdge
nor index
. Need to study this.
julia> net0 = readTopology("(s17:13.76,(((s3:10.98,(s4:8.99,s5:8.99)I1:1.99)I2:0.47,(((s6:2.31,s7:2.31)I3:4.02,(s8:4.97,#H24:0.0::0.279)I4:1.36)I5:3.64,((s9:8.29,((s10:2.37,s11:2.37)I6:3.02,(s12:2.67,s13:2.67)I7:2.72)I8:2.89)I9:0.21,((s14:2.83,(s15:1.06,s16:1.06)I10:1.78)I11:2.14)#H24:3.52::0.72)I12:1.47)I13:1.48)I14:1.26,(((s18:5.46,s19:5.46)I15:0.59,(s20:4.72,(s21:2.40,s22:2.40)I16:2.32)I17:1.32)I18:2.68,(s23:8.56,(s1:4.64,s2:4.64)I19:3.92)I20:0.16)I21:3.98)I22:1.05);");
julia> net = readTopologyLevel1(writeTopology(net0)) ## need level1 attributes for functions below
HybridNetwork, Un-rooted Network
46 edges
46 nodes: 23 tips, 1 hybrid nodes, 22 internal tree nodes.
tip labels: s17, s3, s4, s5, ...
(s4:8.99,s5:8.99,(s3:10.0,((((s6:2.31,s7:2.31)I3:4.02,(s8:4.97,#H24:0.0::0.279)I4:1.36)I5:3.64,((s9:8.29,((s10:2.37,s11:2.37)I6:3.02,(s12:2.67,s13:2.67)I7:2.72)I8:2.89)I9:0.21,((s14:2.83,(s15:1.06,s16:1.06)I10:1.78)I11:2.14)#H24:3.52::0.721)I12:1.47)I13:1.48,((((s18:5.46,s19:5.46)I15:0.59,(s20:4.72,(s21:2.4,s22:2.4)I16:2.32)I17:1.32)I18:2.68,(s23:8.56,(s1:4.64,s2:4.64)I19:3.92)I20:0.16)I21:3.98,s17:10.0)I22:1.26)I14:0.47)I2:1.99)I1;
julia> q1 = Quartet(1,["s1", "s16", "s18", "s23"],[0.296,0.306,0.398])
number: 1
taxon names: ["s1", "s16", "s18", "s23"]
observed CF: [0.296, 0.306, 0.398]
pseudo-deviance under last used network: 0.0 (meaningless before estimation)
expected CF under last used network: Float64[] (meaningless before estimation)
julia> qnet = SNaQ.extractQuartet!(net,q1)
taxa: ["s1", "s16", "s18", "s23"]
number of hybrid nodes: 1
julia> sum([e.istIdentifiable for e in net.edge]) ## 23 identifiable edges in net
23
julia> idedges = [ee.number for ee in net.edge[[e.istIdentifiable for e in net.edge]]];
julia> print(idedges)
[5, 6, 9, 11, 12, 13, 17, 20, 21, 22, 26, 27, 28, 29, 30, 31, 34, 38, 39, 40, 44, 45, 46]
julia> length(qnet.hasEdge) ## 24 = 1 gamma + 23 identifiable edges
24
julia> sum(qnet.hasEdge) ## 8 = 1 gamma + 7 identifiable edges in qnet
8
julia> print(idedges[qnet.hasEdge[2:end]]) ## 7 id. edges: [12, 13, 29, 30, 31, 45, 46]
[12, 13, 29, 30, 31, 45, 46]
julia> qnet.edge[qnet.index[1]].number ## 11 = minor hybrid edge
11
SNaQ.QuartetT
— TypeQuartetT{T}
Generic type for 4-taxon sets. Fields:
number
: rank of the 4-taxon settaxonnumber
: static vector of 4 integers, assumed to be distinct and sorteddata
: object of typeT
For easier look-up, a unique mapping is used between the rank (number
) of a 4-taxon set and its 4 taxa (see quartetrank
and nchoose1234
):
rank-1 = (t1-1) choose 1 + (t2-1) choose 2 + (t3-1) choose 3 + (t4-1) choose 4
examples
julia> nCk = SNaQ.nchoose1234(5)
6×4 Matrix{Int64}:
0 0 0 0
1 0 0 0
2 1 0 0
3 3 1 0
4 6 4 1
5 10 10 5
julia> SNaQ.QuartetT(1,3,4,6, [.92,.04,.04, 100], nCk)
4-taxon set number 8; taxon numbers: 1,3,4,6
data: [0.92, 0.04, 0.04, 100.0]
SNaQ.afterOptBL!
— MethodafterOptBL
road map
Function that will check if there are h==0,1;t==0,hz==0,1
cases in a network after calling optBL!
.
Arguments:
closeN=true
will move origin/target, if false, add/delete N times before giving up (we have only testedcloseN=true
)origin=true
will move origin, false will move target. We added this to avoid going back and forth between the same networksmovesgamma
vector of counts of number of times each move is proposed to fix a gamma zero problem:(add,mvorigin,mvtarget,chdir,delete,nni)
Procedure:
First we split the
ht
vector innh,nt,nhz
(gammas, lengths, gammaz)If we find a
h==0,1
, we loop throughnh
to find a hybrid edge with h==0 or 1 and want to try to fix this by doing:gammaZero!(currT,d,edge,closeN,origin,N,movesgamma)
which returns true if there was a successful change, and we stop the loop
If we find a
t==0
, we loop through allnt
to find such edge, and do NNI move on this edge; return true if change successful and we stop the loopIf we find a
hz==0,1
, we loop throughnhz
to find such hybrid edge and callgammaZero
againIf we did a successful change, we run
optBL
again, and recheck if there are no more problems.Returns successchange, flagh, flagt,flaghz (flag=true means no problems)
If it is the multiple alleles case, it will not try to fix
h==0,1;hz==0,1
because it can reach a case that violates the multiple alleles condition. If we add a check here, things become horribly slow and inefficient, so we just delete a hybridization that hash==0,1;hz==0,1
** Important: ** afterOptBL
is doing only one change, but we need to repeat multiple times to be sure that we fix all the gamma zero problems, which is why we call afterOptBLRepeat
SNaQ.afterOptBLAll!
— MethodafterOptBLAll
road map
After optBL
, we want to call afterOptBLAll
(or afterOptBLAllMultipleAlleles
) to check if there are h==0,1
; t==0
; hz==0,1
. This function will try to fix the gamma zero problem, but if it cannot, it will call moveDownLevel
, to delete the hybridization from the network.
Procedure:
While startover=true
and tries<N
- While
badliks < N2
(number of bad pseudolikelihoods are less thanN2
)- Run
success = afterOptBLRepeat
- If
success = true
(it changed something):- If worse pseudolik, then go back to original topology
currT
, setstartover=true
andbadliks++
- If better pseudolik, then check flags. If all good, then
startover=false
; otherwisestartover = true
- If worse pseudolik, then go back to original topology
- If
success = false
(nothing changed), then setbadliks=N2+1
(to end the while oncurrT
)- If all flags are ok, then
startover = false
- If bad h or hz, then call
moveDownLevel
(delete one hybridization), and setstartover = true
(maybe deleting that hybridization did not fix other gamma zero problems) - If bad t, then set
startover = false
- If all flags are ok, then
- Run
- If left second while by back to original
currT
, and still bad h/hz, then move down one level, andstartover=true
; otherwisestartover=false
If first while ends by tries>N
, then it checks one last time the flags, if bad h/hz will move down one level, and exit
SNaQ.afterOptBLRepeat!
— MethodafterOptBLRepeat
road map
afterOptBL
is doing only one change, but we need to repeat multiple times to be sure that we fix all the gamma zero problems, which is why we call afterOptBLRepeat
. This function will repeat afterOptBL
every time a successful change happened; this is done only if closeN=false
, because we would delete/add hybridizations and need to stop after tried N times. If closeN=true
(default), then afterOptBLRepeat
only does one afterOptBL
, because in this case, only the neighbor edges need to be tested, and this would have been done already in gammaZero
.
SNaQ.bootsnaq
— Methodbootsnaq(T::HybridNetwork, df::DataFrame)
bootsnaq(T::HybridNetwork, vector of tree lists)
Bootstrap analysis for SNaQ. Bootstrap data can be quartet concordance factors (CF), drawn from sampling uniformly in their credibility intervals, as given in the data frame df
. Alternatively, bootstrap data can be gene trees sampled from a vector of tree lists: one list of bootstrap trees per locus (see readBootstrapTrees
to generate this, from a file containing a list of bootstrap files: one per locus).
From each bootstrap replicate, a network is estimated with snaq!, with a search starting from topology T
. Optional arguments include the following, with default values in parentheses:
hmax
(1): max number of reticulations in the estimated networksnrep
(10): number of bootstrap replicates.runs
(10): number of independent optimization runs for each replicatefilename
("bootsnaq"): root name for output files. No output files if "".seed
(0 to get a random seed from the clock): seed for random number generatorotherNet
(empty): another starting topology so that each replicate will start prcnet% runs on otherNet and (1-prcnet)% runs onT
prcnet
(0): percentage of runs starting onotherNet
; error if different than 0.0, and otherNet not specified.ftolRel
,ftolAbs
,xtolRel
,xtolAbs
,liktolAbs
,Nfail
,probST
,verbose
,outgroup
: seesnaq!
, same defaults.
If T
is a tree, its branch lengths are first optimized roughly with updateBL!
(by using the average CF of all quartets defining each branch and calculating the coalescent units corresponding to this quartet CF). If T
has one or more reticulations, its branch lengths are taken as is to start the search. The branch lengths of otherNet
are always taken as is to start the search.
SNaQ.calculateObsCFAll!
— MethodcalculateObsCFAll!(DataCF, taxa::Union{Vector{<:AbstractString}, Vector{Int}})
Calculate observed concordance factors: update the .quartet[i].obsCF
values of the DataCF
object based on its .tree vector.
calculateObsCFAll!(vector of quartets, vector of trees, taxa)
Calculate observed concordance factors: update the .obsCF
values of the quartets, based on the trees, and returns a new DataCF
object with these updated quartets and trees.
calculateObsCFAll_noDataCF!(vector of quartets, vector of trees, taxa)
update the .obsCF
values of the quartets based on the trees, but returns nothing.
Warning: all these functions need input trees (without any reticulations: h=0).
See also: countquartetsintrees
, which uses a faster algorithm, processing each input tree only once. calculateObsCFAll_noDataCF!
processes each input tree # quartet
times.
SNaQ.checkMapDF
— MethodcheckMapDF(mapping_allele2species::DataFrame)
Check that the data frame has one column named "allele" or "individual", and one column named "species". Output: indices of these column.
SNaQ.countquartetsintrees
— Functioncountquartetsintrees(trees [, taxonmap]; which=:all, weight_byallele=true)
Calculate the quartet concordance factors (CF) observed in the trees
vector. If present, taxonmap
should be a dictionary that maps each allele name to it's species name. To save to a file, first convert to a data frame using writeTableCF
. When which=:all
, quartet CFs are calculated for all 4-taxon sets. (Other options are not implemented yet.)
The algorithm runs in O(mn⁴) where m is the number of trees and n is the number of tips in the trees.
CFs are calculated at the species level only, that is, considering 4-taxon sets made of 4 distinct species, even if the gene trees may have multiple alleles from the same species. For 4 distinct species a,b,c,d
, all alleles from each species (a
etc.) will be considered to calculate the quartet CF.
By default, each gene has a weight of 1. So if there are n_a
alleles from a
, n_b
alleles from b
etc. in a given gene, then each set of 4 alleles has a weight of 1/(n_a n_b b_c n_c)
in the calculation of the CF for a,b,c,d
. With option weight_byallele=true
, then each set of 4 alleles is given a weight of 1 instead. This inflates the total number of sets used to calculate the quartet CFs (to something larger than the number of genes). This may also affect the CF values if the number of alleles varies across genes: genes with more alleles will be given more weight.
examples
julia> tree1 = readTopology("(E,(A,B),(C,D),O);"); tree2 = readTopology("(((A,B),(C,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2]);
Reading in trees, looking at 15 quartets in each...
0+--+100%
**
julia> t # taxon order: t[i] = name of taxon number i
6-element Vector{String}:
"A"
"B"
"C"
"D"
"E"
"O"
julia> length(q) # 15 four-taxon sets on 6 taxa
15
julia> q[1] # both trees agree on AB|CD: resolution 1
4-taxon set number 1; taxon numbers: 1,2,3,4
data: [1.0, 0.0, 0.0, 2.0]
julia> q[8] # tree 2 is missing O (taxon 6), tree 1 wants resolution 3: AO|CD
4-taxon set number 8; taxon numbers: 1,3,4,6
data: [0.0, 0.0, 1.0, 1.0]
julia> q[11] # tree 1 has ACEO unresolved, and tree 2 is missing O: no data for this quartet
4-taxon set number 11; taxon numbers: 1,3,5,6
data: [0.0, 0.0, 0.0, 0.0]
julia> tree1 = readTopology("(E,(a1,B),(a2,D),O);"); tree2 = readTopology("(((a1,a2),(B,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2], Dict("a1"=>"A", "a2"=>"A"); showprogressbar=false);
julia> t
5-element Vector{String}:
"A"
"B"
"D"
"E"
"O"
julia> q[1] # tree 1 has discordance: a1B|DE and a2D|BE. tree 2 has AE|BD for both alleles of A
4-taxon set number 1; taxon numbers: 1,2,3,4
data: [0.25, 0.25, 0.5, 2.0]
julia> q[3] # tree 2 is missing O (taxon 5), and a2 is unresolved in tree 1. There's only a1B|EO
4-taxon set number 3; taxon numbers: 1,2,4,5
data: [1.0, 0.0, 0.0, 0.5]
julia> df = writeTableCF(q,t); # to get a DataFrame that can be saved to a file later
julia> show(df, allcols=true)
5×9 DataFrame
Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes
│ Int64 String String String String Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────
1 │ 1 A B D E 0.25 0.25 0.5 2.0
2 │ 2 A B D O 0.5 0.5 0.0 1.0
3 │ 3 A B E O 1.0 0.0 0.0 0.5
4 │ 4 A D E O 1.0 0.0 0.0 0.5
5 │ 5 B D E O 0.0 0.0 0.0 0.0
julia> # using CSV; CSV.write(df, "filename.csv");
julia> tree2 = readTopology("((A,(B,D)),E);");
julia> q,t = countquartetsintrees([tree1, tree2], Dict("a1"=>"A", "a2"=>"A"); weight_byallele=true);
Reading in trees, looking at 5 quartets in each...
0+--+100%
**
julia> show(writeTableCF(q,t), allcols=true)
5×9 DataFrame
Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes
│ Int64 String String String String Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ 1 A B D E 0.333333 0.333333 0.333333 3.0
2 │ 2 A B D O 0.5 0.5 0.0 2.0
3 │ 3 A B E O 1.0 0.0 0.0 1.0
4 │ 4 A D E O 1.0 0.0 0.0 1.0
5 │ 5 B D E O 0.0 0.0 0.0 0.0
SNaQ.deleteLeaf!
— MethoddeleteLeaf!(net::HybridNetwork, leaf::AbstractString)
deleteLeaf!(net::Network, leaf::Node)
Deletes the leaf taxon from the network. The leaf argument is the name of the taxon to delete.
Warnings:
- requires a level-1 network with up-to-date attributes for snaq! (e.g. non-missing branch lengths, gammaz, etc.)
- does not care where the root is and does not update it to a sensible location if the root is affected by the leaf removal.
- does not merge edges, i.e. does not remove all nodes of degree 2. Within snaq!, this is used to extract quartets and to keep track of which edge lengths in the original network map to the quartet network.
SNaQ.fittedQuartetCF
— FunctionfittedQuartetCF(d::DataCF, format::Symbol)
Data frame with the observed and expected quartet concordance factors after estimation of a network with snaq!
, or fitting of quartet CF data on a fixed network. The format can be :wide (default) or :long.
- if wide, the output has one row per 4-taxon set, and each row has 10 columns: 4 columns for the taxon names, 3 columns for the observed CFs and 3 columns for the expected CF.
- if long, the output has one row per quartet, i.e. 3 rows per 4-taxon sets, and 7 columns: 4 columns for the taxon names, one column to give the quartet resolution, one column for the observed CF and the last column for the expected CF.
see also: topologyQPseudolik!
and topologyMaxQPseudolik!
to update the fitted quartet CF expected under a specific network, inside the DataCF object d
.
SNaQ.gammaZero!
— MethodgammaZero
road map
Function that tries to fix a gamma zero problem (h==0,1; t==0; hz==0,1
)
- First tries to do
changeDirection
- If not successful from start, we call
moveHybrid
- If successful move (change direction), we call
optBL
and check if we fixed the problem - If problem fixed and we do not have worse pseudolik, we return
success=true
- If still problem or worse pseudolik, we call
moveHybrid
** Important: ** Any function (afterOptBL
) calling gammaZero
is assuming that it only made a change, so if the returned value is true, then a change was made, and the other function needs to run optBL
and check that all parameters are 'valid'. If the returned value is false, then no change was possible and we need to remove a hybridization if the problem is h==0,1; hz==0,1. If the problem is t==0, we ignore this problem.
SNaQ.getNeighborsTarget
— MethodgetNeighborsTarget(hybrid_node, majoredge)
Vector of edges that are incident to either:
- the node incident to
majoredge
other thanhybrid_node
, or - the tree child of
hybrid_node
.
This vector of edges is used as the list of suitable neighbors of "othermin" to move the target of a hybrid edge, in moveTargetUpdateRepeat!
.
SNaQ.hybridatnode!
— Methodhybridatnode!(net::HybridNetwork, nodeNumber::Integer)
Change the direction and status of edges in network net
, to move the hybrid node in a cycle to the node with number nodeNumber
. This node must be in one (and only one) cycle, otherwise an error will be thrown. Check and update the nodes' field inCycle
.
Output: net
after hybrid modification.
Assumption: net
must be of level 1, that is, each blob has a single cycle with a single reticulation.
example
net = readTopology("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
using PhyloPlots
plot(net, shownodenumber=true); # to locate nodes and their numbers. D of hybrid origin
hybridatnode!(net, -4)
plot(net, shownodenumber=true); # hybrid direction reversed: now 2B of hybrid origin
SNaQ.hybridatnode!
— Methodhybridatnode!(net::HybridNetwork, hybrid::Node, newNode::Node)
Move the reticulation from hybrid
to newNode
, which must in the same cycle. net
is assumed to be of level 1, but no checks are made and fields are supposed up-to-date.
Called by hybridatnode!(net, nodenumber)
, which is itself called by undirectedOtherNetworks
.
SNaQ.hybridatnode
— Methodhybridatnode(net::HybridNetwork, nodeNumber::Integer)
Move the hybrid node in a cycle to make node number nodeNumber
a hybrid node Compared to [hybridatnode!
], this method checks that net
is of level 1 (required) and does not modify it.
SNaQ.mapAllelesCFtable!
— MethodmapAllelesCFtable!(quartet CF DataFrame, mapping DataFrame, columns, write?, filename)
Modify (and return) the quartet concordance factor (CF) DataFrame: replace each allele name by the species name that the allele maps onto based on the mapping data frame. This mapping data frame should have columns named "allele" and "species" (see rename!
to change column names if need be).
If write?
is true
, the modified data frame is written to a file named "filename".
Warning: mapAllelesCFtable
takes the quartet data file as its second argument, while mapAllelesCFtable!
takes the quartet data (which it modifies) as its first argument.
SNaQ.mapAllelesCFtable
— MethodmapAllelesCFtable(mapping file, CF file; filename, columns, delim)
Create a new DataFrame containing the same concordance factors as in the input CF file, but with modified taxon names. Each allele name in the input CF table is replaced by the species name that the allele maps onto, based on the mapping file. The mapping file should have column names: allele and species.
Optional arguments:
- file name to write/save resulting CF table. If not specified, then the output data frame is not saved to a file.
- column numbers for the taxon names. 1-4 by default.
- any keyword arguments that
CSV.File
would accept. For example, delim=',' by default: columns are delimited by commas. Unless specified otherwise by the user,pool
=false (to read taxon names as Strings, not levels of a categorical factor, for combining the 4 columns with taxon names more easily). The same CSV arguments are used to read both input file (mapping file and quartet file)
See also mapAllelesCFtable!
to input DataFrames instead of file names.
If a filename
is specified, such as "quartetCF_speciesNames.csv" in the example below, this file is best read later with the option pool=false
. example:
mapAllelesCFtable("allele-species-map.csv", "allele-quartet-CF.csv";
filename = "quartetCF_speciesNames.csv")
df_sp = CSV.read("quartetCF_speciesNames.csv", DataFrame); # DataFrame object
dataCF_specieslevel = readTableCF!(df_sp, mergerows=true); # DataCF object
SNaQ.moveHybrid!
— MethodmoveHybrid
road map
Function that tries to fix a gamma zero problem (h==0,1; t==0; hz==0,1
) after changing direction of hybrid edge failed. This function is called in gammaZero
.
Arguments:
closeN=true
will try move origin/target on all neighbors (first choose minor/major edge at random, then make list of all neighbor edges and tries to put the hybrid node in all the neighbors until successful move); if false, will delete and add hybrid until successful move up to N times (this is never tested)
Returns true if change was successful (not testing optBL
again), and false if we could not move anything
SNaQ.moveTargetUpdate!
— MethodmoveTargetUpdate!(net, hybrid_node, majoredge, newedge)
Modify a level-1 network net
by moving majoredge
, which should be a hybrid edge parent of hybrid_node
. Within SNaQ, majoredge
is chosen by chooseMinorMajor
.
- calls
moveTarget!(net,hybrid_node, majoredge, treeedge_belowhybrid, newedge)
, which does the move but does not update any attributes - updates all level-1 attributes needed for SNaQ: gammaz, containRoot
- un-does the move and updates if the move is invalid, through another call to
moveTarget!
but with the "undo" option.
newedge
should be a tree edge (enforced by chooseEdgeOriginTarget!
) adjacent to the parent node of majoredge
or to the tree child of hybrid_node
(enforced by getNeighborsTarget
)
Output: tuple of 3 booleans (success, flag_triangle, flag_root)
.
success
is false if the move failed (lead to an invalid network for SNaQ)flag_triangle
is false ifnet.hasVeryBadTriangle
flag_root
is false if the set of edges to place the root is empty
If success
is false, then the flags are not meant to be used downstream.
SNaQ.nchoose1234
— Methodnchoose1234(nmax)
nmax+1 x 4
matrix containing the binomial coefficient "n choose k" in row n+1
and column k
. In other words, M[i,k]
gives "i-1 choose k". It is useful to store these values and look them up to rank (a large number of) 4-taxon sets: see quartetrank
.
SNaQ.optBL!
— MethodoptBL
road map
Function that optimizes the numerical parameters (branch lengths and inheritance probabilities) for a given network. This function is called multiple times inside optTopLevel!
.
- Input: network
net
, datad
- Numerical tolerances:
ftolAbs, ftolRel, xtolAbs, xtolRel
- Function based on
MixedModels
fit
function - The function assumes
net
has all the right attributes, and cannot check this inside because it would be inefficient
Procedure:
ht = parameters!(net)
extracts the vector of parameters to estimate(h,t,gammaz)
, and sets asnet.ht
; identifies a bad diamond I, setsnet.numht
(vector of hybrid node numbers for h, edge numbers for t, hybrid node numbers for gammaz), andnet.index
to keep track of the vector of parameters to estimateextractQuartet!(net,d)
does the following for all quartets ind.quartet
:- Extract quartet by deleting all leaves not in q -> create
QuartetNetwork
object saved inq.qnet
- This network is ugly and does not have edges collapsed. This is done to keep a one-to-one correspondence between the edges in
q.qnet
and the edges innet
(if we remove nodes with only two edges, we will lose this correspondence) - Calculate expected CF with
calculateExpCFAll
for a copy ofq.qnet
. We do this copy because we want to keepq.qnet
as it is (without collapsed edges into one). The function will then save theexpCF
inq.qnet.expCF
- Extract quartet by deleting all leaves not in q -> create
calculateExpCFAll!(qnet)
will- identify the type of quartet as type 1 (equivalent to a tree) or type 2 (minor CF different). Here the code will first clean up any hybrid node by removing nodes with only two edges before identifying the
qnet
(because identification depends on neighbor nodes to hybrid node); later, setqnet.which
(1 or 2),node.prev
(neighbor node to hybrid node), updatesnode.k
(number of nodes in hybridization cycle, this can change after deleting the nodes with only two edges),node.typeHyb
(1,2,3,4,5 depending on the number of nodes in the hybridization cycle and the origin/target of the minor hybrid edge; this attribute is never used). - eliminate hybridization: this will remove type 1 hybridizations first. If
qnet.which=1
, then theqnet
is similar to a tree quartet, so it will calculate the internal length of the tree quartet:qnet.t1
. - update split for
qnet.which=1
, to determine which taxa are together. For example, for the quartet 12|34, the split is [1,1,2,2] or [2,2,1,1], that is, taxon 1 and 2 are on the same side of the split. This will updateqnet.split
- update formula for
qnet.which=1
to know the order of minorCF and majorCF in the vectorqnet.expCF
. That is, if the quartet is 1342 (order inqnet.quartet.taxon
), then the expected CF should match the observed CF in 13|42, 14|32, 12|34 and theqnet
is 12|34 (given byqnet.split
),qnet.formula
will be [2,2,1] minor, minor, major calculateExpCF!(qnet)
forqnet.which=1
, it will do1-2/3exp(-qnet.t1)
ifqnet.formula[i]==1
, and1/3exp(qnet.t1)
ifqnet.formula[i]==2
. Forqnet.which=2
, we need to make sure that there is only one hybrid node, and compute the major, minor1,minor2 expected CF in the order 12|34, 13|24, 14|23 of the taxa inqnet.quartet.taxon
- identify the type of quartet as type 1 (equivalent to a tree) or type 2 (minor CF different). Here the code will first clean up any hybrid node by removing nodes with only two edges before identifying the
Then we create a NLopt
object with algorithm BOBYQA and k parameters (length of ht). We define upper and lower bounds and define the objective function that should only depend on x=(h,t,gz)
and g (gradient, which we do not have, but still need to put as argument).
The objective function obj(x,g)
calls
calculateExpCFAll!(d,x,net)
needs to be run afterextractQuartet(net,d)
that will updateq.qnet
for all quartet. Assumes thatqnet.indexht
is updated already: we only need to do this at the beginning ofoptBL!
because the topology is fixed at this point)- First it will update the edge lengths according to x
- If the
q.qnet.changed=true
(that is, any ofqnet
branches changed value), we need to callcalculateExpCFAll!(qnet)
on a copy ofq.qnet
(again because we want to leaveq.qnet
with the edge correspondence tonet
)
update!(net,x)
simply saves the new x innet.ht
Finally, we call NLopt.optimize
, and we update the net.loglik
and net.ht
at the end. After optBL
, we want to call afterOptBLAll
(or afterOptBLAllMultipleAlleles
) to check if there are h==0,1
; t==0
; hz==0,1
.
SNaQ.optTopLevel!
— MethodoptTopLevel
road map
Function that does most of the heavy-lifting of snaq
. It optimizes the pseudolikelihood for a given starting topology, and returns the best network. Assumes that the starting topology is level-1 network, and has all the attributes correctly updated.
Input parameters:
- Starting topology
currT
, input dataDataCF
d
, maximum number of hybridizationshmax
- Numerical optimization parameters:
liktolAbs, Nfail, ftolRel, ftolAbs, xtolRel, xtolAbs
- Print parameters:
verbose, logfile, writelog
- Parameters to tune the search in space of networks:
closeN=true
only propose move origin/target to neighbor edges (coded, but not tested withcloseN=false
),Nmov0
vector with maximum number of trials allowed per type of move(add, mvorigin, mvtarget, chdir, delete, nni)
, by default computed inside with coupon’s collector formulas
The optimization procedure keeps track of
movescount
: count of proposed moves,movesgamma
: count of proposed moves to fix a gamma zero situation (see below for definition of this situation),movesfail
: count of failed moves by violation of level-1 network (inCycle
attribute) or worse pseudolikelihood than current,failures
: number of failed proposals that had a worse pseudolikelihood
Optimization procedure:
While the difference between current loglik and proposed loglik is greater than liktolAbs
, or failures<Nfail
, or stillmoves=true
:
Nmov
is updated based onnewT
. The type of move proposed will depend onnewT
(which is the same ascurrT
at this point). For example, ifcurrT
is a tree, we cannot propose move origin/target.move = whichMove
selects randomly a type of move, depending onNmov,movesfail,hmax,newT
with weights 1/5 by default for all, and 0 for delete. These weights are adjusted depending onnewT.numHybrids
andhmax
. IfnewT.numHybrids
is far fromhmax
, we give higher probability to adding a new hybrid (we want to reach thehmax
sooner, maybe not the best strategy, easy to change). Later, we adjust the weights bymovesfail
(first, give weight of 0 ifmovesfail[i]>Nmov[i]
, that is, if we reached the maximum possible number of moves allowed for a certain type) and then increase the probability of the other moves. So, unless one move hasw=0
, nothing changes. This could be improved by using the outlier quartets to guide the proposal of moves.whichMove
will choose a move randomly from the weights, it will returnnone
if no more moves allowed, in which case, the optimization endsflag=proposedTop!(move, newT)
will modifynewT
based onmove
. The functionproposedTop
will returnflag=true
if the move was successful (the move succeeded byinCycle
,containRoot
, available edge to make the move (more details inproposedTop
)). Ifflag=false
, thennewT
is cleaned, except for the case of multiple alleles. The functionproposedTop
keeps count ofmovescount
(successful move),movesfail
(unsuccessful move),Options:
random=true
: moves major/minor hybrid edge with prob h,1-h, respectivelyN=10
: number of trials for NNI edge.if(flag) Optimize branch lengths with
optBL
If
newT.loglik
is better thancurrT.loglik
byliktolAbs
, jump tonewT
(accepted=true
) and fixgamma=0, t=0
problems (more info onafterOptBL
)If(accepted)
failures=0
,movesfail=zeros
,movescount
for successful move +1
end while
After choosing the best network newT
, we do one last more thorough optimization of branch lengths with optBL
, we change non identifiable branch lengths to -1 (only in debug mode) and return newT
SNaQ.optTopRun1!
— MethodoptTopRun1!(net, liktolAbs, Nfail, d::DataCF, hmax, etc.)
The function will run 1 run by modifying the starting topology and calling optTopLevel
. See optTopRuns!
for a roadmap.
probST
(default in snaq is 0.3) is the probability of starting one run at the same input tree. So, with probability 1-probST
, we will change the topology by a NNI move on a tree edge without neighbor hybrid. If the starting topology is a network, then with probability 1-probST
it will also modify one randomly chosen hybrid edge: with prob 0.5, the function will move origin, with prob 0.5 will do move target.
If there are multiple alleles (d.repSpecies
not empty), then the function has to check that the starting topology does not violate the multiple alleles condition.
After modifying the starting topology with NNI and/or move origin/target, optTopLevel
is called.
SNaQ.optTopRuns!
— MethodRoad map for various functions behind snaq!
snaq!
optTopRuns!
optTopRun1!
optTopLevel!
optBL!
All return their optimized network.
snaq!
callsoptTopRuns!
once, after a deep copy of the starting network. If the data contain multiple alleles from a given species,snaq!
first expands the leaf for that species into 2 separate leaves, and merges them back into a single leaf after callingoptTopRuns!
.optTopRuns!
callsoptTopRun1!
several (nrun
) times. assumes level-1 network with >0 branch lengths. assumes same tips in network as in data: i.e. 2 separate tips per species that has multiple alleles. each call tooptTopRun1!
gets the same starting network.optTopRun1!
callsoptTopLevel!
once, after deep copying + changing the starting network slightly.optTopLevel!
callsoptBL!
various times and proposes new network with various moves.
SNaQ.proposedTop!
— MethodproposedTop!(move,newT,random,count,N,movescount,movesfail,multall)
road map
Function to change the current network newT
by a given move
, and checks that the move was successful (correct attributes). If not successful, newT
is changed back to its original state, except for the case of multiple alleles.
Note that the update of attributes by each move is not done in all the network, but only in the local edges that were changed by the move. This is efficient (and makes a move easy to undo), but makes the code of each move function very clunky.
Arguments:
- move chosen from
whichMove
as described inoptTopLevel
newT
is the topology that will be modified inside with the moverandom=true
: chooses minor hybrid edge with prob 1-h, and major edge with prob h, if false, always chooses minor hybrid edgecount
: simply which likelihood step we are in in the optimization atoptTopLevel
movescount
andmovesfail
: vector of counts of number of moves proposedmultall=true
if multiple alleles case: we need to check if the move did not violate the multiple alleles condition (sister alleles together and no gene flow into the alleles). This is inefficient because we are proposing moves that we can reject later, instead of being smart about the moves we propose: for example, move origin/target could rule out some neighbors that move gene flow into the alleles, the same for add hybridization; nni move can check if it is trying to separate the alleles)
Moves:
addHybridizationUpdate(newT,N)
:
will choose a partition first (to avoid choosing edges that will create a non level-1 network) will choose two edges from this partition randomly, will not allow two edges in a cherry (non-identifiable), or sister edges that are not identifiable (the blacklist was a way to keep track of "bad edges" were we should not waste time trying to put hybridizations, it has never been used nor tested). Also choose gamma from U(0,0.5). The "Update" in the function name means that it creates the new hybrid, and also updates all the attributes of newT
node = chooseHybrid(newT)
choose a hybrid randomly for the next moves:moveOriginUpdateRepeat!(newT,node,random)
will choose randomly the minor/major hybrid edge to move (if random=true
); will get the list of all neighbor edges where to move the origin, will move the origin and update all the attributes and check if the move was successful (not conflicting attributes); if not, will undo the move, and try with a different neighbor until it runs out of neighbors. Return true if the move was successful.
moveTargetUpdateRepeat!(newT,node,random)
same as move origin but moving the target
changeDirectionUpdate!(newT,node,random)
chooses minor/major hybrid edge at random (if `random=true), and changes the direction, and updates all the attributes. Checks if the move was successful (returns true), or undoes the change and returns false.
deleteHybridizationUpdate!(newT,node)
removes the hybrid node, updates the attributes, no need to check any attributes, always successful move
- NNIRepeat!(newT,N)
choose an edge for nni that does not have a neighbor hybrid. It will try to find such an edge N times, and if it fails, it will return false (unsuccessful move). N=10 by default. If N=1, it rarely finds such an edge if the network is small or complex. The function cannot choose an external edge. it will update locally the attributes.
** Important: ** All the moves undo what they did if the move was not successful, so at the end you either have a newT
with a new move and with all good attributes, or the same newT
that started. This is important to avoid having to do deepcopy of the network before doing the move. Also, after each move, when we update the attributes, we do not update the attributes of the whole network, we only update the attributes of the edges that were affected by the move. This saves time, but makes the code quite clunky. Only the case of multiple alleles the moves does not undo what it did, because it finds out that it failed after the function is over, so just need to treat this case special.
SNaQ.quartetdata_columnnames
— Methodquartetdata_columnnames(T) where T <: StaticArray
Vector of column names to hold the quartet data of type T
in a data frame. If T is a length-3 vector type, they are "CF1234","CF1324","CF1423". If T is a length-4 vector type, the 4th name is "ngenes". If T is a 3×n matrix type, the output vector contains 3×n names, 3 for each of "CF", "V2", "V3", ... "Vn".
Used by writeTableCF
to build a data frame from a vector of QuartetT
objects.
SNaQ.quartetrank
— Methodquartetrank(t1,t2,t3,t4, nCk::Matrix)
quartetrank([t1,t2,t3,t4], nCk)
Return the rank of a four-taxon set with taxon numbers t1,t2,t3,t4
, assuming that ti
s are positive integers such that t1<t2, t2<t3 and t3<t4 (assumptions not checked!). nCk
should be a matrix of "n choose k" binomial coefficients: see nchoose1234
.
examples
julia> nCk = SNaQ.nchoose1234(5)
6×4 Matrix{Int64}:
0 0 0 0
1 0 0 0
2 1 0 0
3 3 1 0
4 6 4 1
5 10 10 5
julia> SNaQ.quartetrank([1,2,3,4], nCk)
1
julia> SNaQ.quartetrank([3,4,5,6], nCk)
15
SNaQ.readInputData
— MethodreadInputData(trees, quartetfile, whichQuartets, numQuartets, writetable, tablename, writeQfile, writesummary)
readInputData(trees, whichQuartets, numQuartets, taxonlist, writetable, tablename, writeQfile, writesummary)
Read gene trees and calculate the observed quartet concordance factors (CF), that is, the proportion of genes (and the number of genes) that display each quartet for a given list of four-taxon sets.
Input:
trees
: name of a file containing a list of input gene trees, or vector of trees (HybridNetwork
objects)
Optional arguments (defaults):
quartetfile
: name of a file containing a list of quartets, or more precisely, a list of four-taxon setswhichQuartets
(:all
): which quartets to sample.:all
for all of them,:rand
for a random sample.numQuartets
: number of quartets in the sample. default: total number of quartets ifwhichQuartets=:all
and 10% of total ifwhichQuartets=:rand
taxonlist
(all in the input gene trees): Iftaxonlist
is used,whichQuartets
will consist of all sets of 4 taxa in thetaxonlist
.writetable
(true): write the table of observed CF?tablename
("tableCF.txt"): ifwritetable
is true, the table of observed CFs is write to filetablename
writeQfile
(false): write intermediate file with sampled quartets?writesummary
(true): write a summary file? if so, the summary will go in file "summaryTreesQuartets.txt".
Uses calculateObsCFAll!
, which implements a slow algorithm.
See also: countquartetsintrees
, which uses a much faster algorithm; readTrees2CF
, which is basically a re-naming of readInputData
, and readTableCF
to read a table of quartet CFs directly.
SNaQ.readInputTrees
— MethodreadInputTrees(file)
Read a text file with a list of trees/networks in parenthetical format (one tree per line) and transform them like readTopologyLevel1
does: to be unrooted, with resolved polytomies, missing branch lengths set to 1.0, etc. See PhyloNetworks.readMultiTopology
to read multiple trees or networks with no modification.
Output: array of HybridNetwork objects.
Each line starting with "(" will be considered as describing one topology. The file can have extra lines that are ignored.
SNaQ.readSnaqNetwork
— MethodreadSnaqNetwork(output file)
Read the estimated network from a .out
file generated by snaq!
. The network score is read also, and stored in the network's field .loglik
.
Warning: despite the name "loglik", this score is only proportional to the network's pseudo-deviance: the lower, the better. Do NOT use this score to calculate an AIC or BIC (etc.) value.
SNaQ.readTableCF!
— MethodreadTableCF!(data frame, columns; mergerows=false)
Read in quartet CFs from data frame, assuming information is in columns numbered columns
, of length 7 or 8: 4 taxon labels then 3 CFs then ngenes possibly.
If some species appears more than once in the same 4-taxon set (e.g. t1,t1,t2,t3), then the data frame is modified to remove rows (4-taxon sets) that are uninformative about between-species relationships. This situation may occur if multiple individuals are sampled from the same species. A 4-taxon set is uninformative (and its row is removed) if one taxon is repeated 3 or 4 times (like t1,t1,t1,t1 or t1,t2,t2,t2). The list of species appearing twice in some 4-taxon sets is stored in the output DataCF object. For these species, the length of their external edge is identifiable (in coalescent units). If multiple rows correspond to the same 4-taxon set, these rows are merged and their CF values (and number of genes) are averaged. If none of the species is repeated within any 4-taxon set, then this averaging is attempted only if mergerows
is true.
readTableCF!(DataCF, data frame, columns)
Modify the .quartet.obsCF
values in the DataCF
object with those read from the data frame in columns numbered columns
. columns
should have 3 columns numbers for the 3 CFs in this order: 12_34
, 13_24
and 14_23
.
Assumptions:
- same 4-taxon sets in
DataCF
and in the data frame, and in the same order, but this assumption is not checked (for speed, e.g. during bootstrapping). - one single row per 4-taxon set (multiple individuals representatives of the same 4-taxon set should have been already merged); basically: the DataCF should have been created from the data frame by
readTableCF!(df, colums)
SNaQ.readTableCF
— MethodreadTableCF(file)
readTableCF(data frame)
readTableCF!(data frame)
Read a file or DataFrame object containing a table of concordance factors (CF), with one row per 4-taxon set. The first 4 columns are assumed to give the labels of the 4 taxa in each set (tx1, tx2, tx3, tx4). Columns containing the CFs are assumed to be named CF12_34
, CF13_24
and CF14_23
; or CF12.34
, CF13.24
and CF14.23
; or else are assumed to be columns 5,6,7. If present, a column named 'ngenes' will be used to get the number of loci used to estimate the CFs for each 4-taxon set.
Output: DataCF
object
Optional arguments:
- summaryfile: if specified, a summary file will be created with that name.
- delim (for the first form only): to specify how columns are delimited, with single quotes: delim=';'. Default is a
csv
file, i.e.delim=','
. mergerows
: false by default. When true, will attempt to merge multiple rows corresponding to the same four-taxon set (by averaging their quartet CFs) even if none of the species is repeated within any row (that is, in any set of 4 taxa)
The last version modifies the input data frame, if species are represented by multiple alleles for instance (see readTableCF!
(data frame, columns)).
SNaQ.readTopologyLevel1
— MethodreadTopologyLevel1(filename)
readTopologyLevel1(parenthetical format)
same as readTopology, reads a tree or network from parenthetical format, but this function enforces the necessary conditions for any starting topology in SNaQ: non-intersecting cycles, no polytomies, unrooted. It sets any missing branch length to 1.0.
If the network has a bad diamond II (in which edge lengths are γ's are not identifiable) and if the edge below this diamond has a length t
different from 0, then this length is set back to 0 and the major parent hybrid edge is lengthened by t
.
SNaQ.readTrees2CF
— MethodreadTrees2CF(treefile)
readTrees2CF(vector of trees)
Read trees in parenthetical format from a file, or take a vector of trees already read, and calculate the proportion of these trees having a given quartet (concordance factor: CF), for all quartets or for a sample of quartets. Optional arguments include:
- quartetfile: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.
- whichQ="rand": to choose a random sample of 4-taxon subsets
- numQ: size of random sample (ignored if whichQ is not set to "rand")
- writeTab=false: does not write the observedCF to a table (default true)
- CFfile: name of file to save the observedCF (default tableCF.txt)
- writeQ=true: save intermediate files with the list of all 4-taxon subsets and chosen random sample (default false).
- writeSummary: write descriptive stats of input data (default: true)
- nexus: if true, it assumes the gene trees are written in nexus file (default: false)
See also: countquartetsintrees
, which uses a much faster algorithm; readTableCF
to read a table of quartet CFs directly.
SNaQ.sameTaxa
— MethodsameTaxa(Quartet, HybridNetwork)
Return true
if all taxa in the quartet are represented in the network, false
if one or more taxa in the quartet does not appear in the network.
warning: the name can cause confusion. A more appropriate name might be "in", or "taxain", or "taxonsubset", or etc.
SNaQ.sampleCFfromCI
— FunctionsampleCFfromCI(data frame, seed=0)
sampleCFfromCI!(data frame, seed=0)
Read a data frame containing CFs and their credibility intervals, and sample new obsCF uniformly within the CIs. These CFs are then rescaled to sum up to 1 for each 4-taxon sets. Return a data frame with taxon labels in first 4 columns, sampled obs CFs in columns 5-7 and credibility intervals in columns 8-13.
- The non-modifying function creates a new data frame (with re-ordered columns) and returns it. If
seed=-1
, the new df is a deep copy of the input df, with no call to the random number generator. Otherwise,seed
is passed to the modifying function. - The modifying function overwrites the input data frame with the sampled CFs and returns it. If
seed=0
, the random generator is seeded from the clock. Otherwise the random generator is seeded usingseed
.
Warning: the modifying version does not check the data frame: assumes correct columns.
optional argument: delim=','
by default: how columns are delimited.
SNaQ.setGammaBLfromGammaz!
— MethodsetGammaBLfromGammaz!(node, network)
Update the γ values of the two sister hybrid edges in a bad diamond I, given the gammaz
values of their parent nodes, and update the branch lengths t1 and t2 of their parent edges (those across from the hybrid nodes), in such a way that t1=t2 and that these branch lengths and γ values are consistent with the gammaz
values in the network.
Similar to the first section of undoGammaz!
, but does not update anything else than γ and t's. Unlike undoGammaz!
, no error if non-hybrid node
or not at bad diamond I.
SNaQ.setLength!
— MethodsetLength!(edge, newlength)`
Set the length of edge
, and set edge.y
and edge.z
accordingly. Warning: specific to SNaQ.jl
. Consider PhyloNetworks.setlengths!
for a more generic tool.
- The new length is censored to 10: if the new length is above 10, the edge's length will be set to 10. Lengths are interpreted in coalescent units, and 10 is close to infinity: near perfect gene tree concordance. 10 is used as an upper limit to coalescent units that can be reliably estimated.
- The new length is allowed to be negative, but must be greater than -log(1.5), to ensure that the major quartet concordance factor (1 - 2/3 exp(-length)) is >= 0.
SNaQ.setNonIdBL!
— MethodsetNonIdBL!(net)
Set non-identifiable edge branch lengths to -1.0 (i.e. missing) for a level-1 network net
, except for edges in
- a good triangle: the edge below the hybrid is constrained to 0.
- a bad diamond II: the edge below the hybrid is constrained to 0
- a bad diamond I: the edges across from the hybrid node have non identifiable lengths but are kept, because the two γ*(1-exp(-t)) values are identifiable.
will break if inCycle
attributes are not initialized (at -1) or giving a correct node number.
see PhyloNetworks.Node
for the meaning of boolean attributes isBadTriangle
(which corresponds to a "good" triangle above), isBadDiamondI
and isBadDiamondII
.
SNaQ.snaq!
— Methodsnaq!(T::HybridNetwork, d::DataCF)
Estimate the network (or tree) to fit observed quartet concordance factors (CFs) stored in a DataCF object, using maximum pseudo-likelihood. A level-1 network is assumed. The search starts from topology T
, which can be a tree or a network with no more than hmax
hybrid nodes. The function name ends with ! because it modifies the CF data d
by updating its attributes expCF
: CFs expected under the network model. It does not modify T
. The quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Output:
- estimated network in file
.out
(also in.log
): best network overall and list of networks from each individual run. - the best network and modifications of it, in file
.networks
. All networks in this file have the same undirected topology as the best network, but have different hybrid/gene flow directions. These other networks are reported with their pseudo-likelihood scores, because non-identifiability issues can cause them to have very similar scores, and because SNaQ was shown to estimate the undirected topology accurately but not the direction of hybridization in cases of near non-identifiability. - if any error occurred, file
.err
provides information (seed) to reproduce the error.
There are many optional arguments, including
hmax
(default 1): maximum number of hybridizations allowedverbose
(default false): if true, print information about the numerical optimizationruns
(default 10): number of independent starting points for the searchoutgroup
(default none): outgroup taxon to root the estimated topology at the very endfilename
(default "snaq"): root name for the output files (.out
,.err
). If empty (""), files are not created, progress log goes to the screen only (standard out).seed
(default 0 to get it from the clock): seed to replicate a given searchprobST
(default 0.3): probability to start fromT
at each given run. With problability 1-probST, the search is started from an NNI modification ofT
along a tree edge with no hybrid neighbor, with a possible modification of one reticulation ifT
has one.updateBL
(default true): If true and ifT
is a tree, the branch lengths inT
are first optimized roughly withupdateBL!
by using the average CF of all quartets defining each branch and back-calculating the coalescent units.
The following optional arguments control when to stop the optimization of branch lengths and γ's on each individual candidate network. Defaults are in parentheses:
ftolRel
(1e-6) andftolAbs
(1e-6): relative and absolute differences of the network score between the current and proposed parameters,xtolRel
(1e-2) andxtolAbs
(1e-3): relative and absolute differences between the current and proposed parameters.
Greater values will result in a less thorough but faster search. These parameters are used when evaluating candidate networks only. The following optional arguments control when to stop proposing new network topologies:
Nfail
(75): maximum number of times that new topologies are proposed and rejected (in a row).liktolAbs
(1e-6): the proposed network is accepted if its score is better than the current score by at least liktolAbs.
Lower values of Nfail
and greater values of liktolAbs
and ftolAbs
would result in a less thorough but faster search.
At the end, branch lengths and γ's are optimized on the last "best" network with different and very thorough tolerance parameters: 1e-12 for ftolRel, 1e-10 for ftolAbs, xtolRel, xtolAbs.
See also: topologyMaxQPseudolik!
to optimize parameters on a fixed topology, and topologyQPseudolik!
to get the deviance (pseudo log-likelihood up to a constant) of a fixed topology with fixed parameters.
Reference: Claudia Solís-Lemus and Cécile Ané (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics 12(3):e1005896
SNaQ.sort_stringasinteger!
— Methodsort_stringasinteger!(taxa)
Sort a vector of strings taxa
, numerically if elements can be parsed as an integer, alphabetically otherwise.
SNaQ.sorttaxa!
— Methodsorttaxa!(DataFrame, columns)
Reorder the 4 taxa and reorders the observed concordance factors accordingly, on each row of the data frame. If columns
is ommitted, taxon names are assumed to be in columns 1-4 and CFs are assumed to be in columns 5-6 with quartets in this order: 12_34
, 13_24
, 14_23
. Does not reorder credibility interval values, if present.
sorttaxa!(DataCF)
sorttaxa!(Quartet, permutation_tax, permutation_cf)
Reorder the 4 taxa in each element of the DataCF quartet
. For a given Quartet, reorder the 4 taxa in its fields taxon
and qnet.quartetTaxon
(if non-empty) and reorder the 3 concordance values accordingly, in obsCF
and qnet.expCF
.
permutation_tax
and permutation_cf
should be vectors of short integers (Int8) of length 4 and 3 respectively, whose memory allocation gets reused. Their length is not checked.
qnet.names
is unchanged: the order of taxon names here relates to the order of nodes in the network (???)
SNaQ.summarizeDataCF
— MethodsummarizeDataCF(d::DataCF)
function to summarize the information contained in a DataCF object. It has the following optional arguments:
- filename: if provided, the summary will be saved in the filename, not to screen
- pc (number between (0,1)): threshold of percentage of missing genes to identify 4-taxon subsets with fewer genes than the threshold
SNaQ.taxadiff
— Methodtaxadiff(Vector{Quartet}, network; multiplealleles=true)
taxadiff(DataCF, network; multiplealleles=true)
Return 2 vectors:
- taxa in at least 1 of the quartets but not in the network, and
- taxa in the network but in none of the quartets.
When multiplealleles
is true, the taxon names that end with "__2" are ignored in the quartets: they are not expected to appear in the networks that users give as input, or get as output.
SNaQ.topologyMaxQPseudolik!
— MethodtopologyMaxQPseudolik!(net::HybridNetwork, d::DataCF)
Estimate the branch lengths and inheritance probabilities (γ's) for a given network topology. The network is not modified, only the object d
is, with updated expected concordance factors.
Ouput: new network, with optimized parameters (branch lengths and gammas). The maximized quartet pseudo-deviance is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0. This is also an attribute of the network, which can be accessed with net.loglik
.
Optional arguments (default value):
- verbose (false): if true, information on the numerical optimization is printed to screen
- ftolRel (1e-5), ftolAbs (1e-6), xtolRel (1e-3), xtolAbs (1e-4): absolute and relative tolerance values for the pseudo-deviance function and the parameters
SNaQ.topologyQPseudolik!
— MethodtopologyQPseudolik!(net::HybridNetwork, d::DataCF)
Calculate the quartet pseudo-deviance of a given network/tree for DataCF d
. This is the negative log pseudo-likelihood, up to an additive constant, such that a perfect fit corresponds to a deviance of 0.0.
Be careful if the net object does not have all internal branch lengths specified because then the pseudolikelihood will be meaningless.
The loglik attribute of the network is undated, and d
is updated with the expected concordance factors under the input network.
SNaQ.traverseContainRoot!
— MethodupdateContainRoot!(HybridNetwork, Node)
traverseContainRoot!(Node, Edge, edges_changed::Array{Edge,1}, rightDir::Vector{Bool})
The input node
to updateContainRoot!
must be a hybrid node (can come from searchHybridNode). updateContainRoot!
starts at the input node and calls traverseContainRoot!
, which traverses the network recursively. By default, containRoot attributes of edges are true. Changes containRoot
to false for all the visited edges: those below the input node, but not beyond any other hybrid node.
updateContainRoot!
Returns a flag
and an array of edges whose containRoot has been changed from true to false. flag
is false if the set of edges to place the root is empty
In traverseContainRoot!
, rightDir
turns false if hybridizations have incompatible directions (vector of length 1, to be modified).
Warning:
- does not update
containRoot
of minor hybrid edges. - assumes correct
isMajor
attributes: to stop the recursion at minor hybrid edges. - assumes correct hybrid attributes of both nodes & edges: to check if various hybridizations have compatible directions. For each hybrid node that is encountered, checks if it was reached via a hybrid edge (ok) or tree edge (not ok).
rightDir
: vector of length 1 boolean, to be mutable and modified by the function
SNaQ.undirectedOtherNetworks
— MethodundirectedOtherNetworks(net::HybridNetwork)
Return a vector of HybridNetwork objects, obtained by switching the placement of each hybrid node to other nodes inside its cycle. This amounts to changing the direction of a gene flow event (recursively to move around the whole cycle of each reticulation).
Optional argument: outgroup
, as a String. If an outgroup is specified, then networks conflicting with the placement of the root are avoided.
Assumptions: net
is assumed to be of level 1, that is, each blob has a single cycle with a single reticulation. All level-1 fields of net
are assumed up-to-date.
Example
julia> net = readTopology("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
julia> vnet = undirectedOtherNetworks(net)
SNaQ.undoGammaz!
— MethodundoGammaz!(node, network)
Undo updateGammaz!
for the 2 cases: bad diamond I,II. node
should be a hybrid node. Set length to edges that were not identifiable and change edges' gammaz
attribute to -1.0. Recalculate branch lengths in terms of gammaz
. warning: needs to know incycle
attributes
SNaQ.updateBL!
— MethodupdateBL!(net::HybridNetwork, d::DataCF)
Update internal branch lengths of net
based on the average quartet concordance factor (CF) across all quartets that exactly correspond to a given branch: new branch length = -log(3/2(1-mean(CF observed in d)))
. net
is assumed to be a tree, such that the above equation holds.
SNaQ.updateContainRoot!
— FunctionupdateContainRoot!(HybridNetwork, Node)
traverseContainRoot!(Node, Edge, edges_changed::Array{Edge,1}, rightDir::Vector{Bool})
The input node
to updateContainRoot!
must be a hybrid node (can come from searchHybridNode). updateContainRoot!
starts at the input node and calls traverseContainRoot!
, which traverses the network recursively. By default, containRoot attributes of edges are true. Changes containRoot
to false for all the visited edges: those below the input node, but not beyond any other hybrid node.
updateContainRoot!
Returns a flag
and an array of edges whose containRoot has been changed from true to false. flag
is false if the set of edges to place the root is empty
In traverseContainRoot!
, rightDir
turns false if hybridizations have incompatible directions (vector of length 1, to be modified).
Warning:
- does not update
containRoot
of minor hybrid edges. - assumes correct
isMajor
attributes: to stop the recursion at minor hybrid edges. - assumes correct hybrid attributes of both nodes & edges: to check if various hybridizations have compatible directions. For each hybrid node that is encountered, checks if it was reached via a hybrid edge (ok) or tree edge (not ok).
rightDir
: vector of length 1 boolean, to be mutable and modified by the function
SNaQ.writeTableCF
— MethodwriteTableCF(vector of Quartet objects)
writeTableCF(DataCF)
Build a DataFrame containing observed quartet concordance factors, with columns named:
t1
,t2
,t3
,t4
for the four taxon names in each quartetCF12_34
,CF13_24
,CF14_23
for the 3 quartets of a given four-taxon setngenes
if this information is available for some quartets
SNaQ.writeTableCF
— MethodwriteTableCF(quartetlist::Vector{QuartetT} [, taxonnames]; colnames)
Convert a vector of QuartetT
objects to a data frame, with 1 row for each four-taxon set in the list. Each four-taxon set contains quartet data of some type T
, which determines the number of columns in the data frame. This data type T
should be a vector of length 3 or 4, or a 3×n matrix.
In the output data frame, the columns are, in this order:
qind
: contains the quartet'snumber
t1, t2, t3, t4
: contain the quartet'staxonnumber
s if notaxonnames
are given, or the taxon names otherwise. The name of taxon numberi
is taken to betaxonnames[i]
.- 3 columns for each column in the quartet's
data
. The first 3 columns are namedCF12_34, CF13_24, CF14_23
. The next columns are namedV2_12_34, V2_13_24, V2_14_23
and contain the data in the second column of the quartet's data matrix. And so on. For the data frame to have non-default column names, provide the desired 3, 4, or 3×n names as a vector via the optional argumentcolnames
.
SNaQ.writeTopologyLevel1
— MethodwriteTopologyLevel1(net::HybridNetwork)
Write the extended Newick parenthetical format of a level-1 network object with many optional arguments (see below). Makes a deep copy of net: does not modify net
.
- di=true: write in format for Dendroscope (default false)
- namelabel=true: If
namelabel
is true, taxa are labelled by their names;
otherwise taxa are labelled by their numbers (unique identifiers).
- outgroup (string): name of outgroup to root the tree/network. if "none" is given, the root is placed wherever possible.
- printID=true, only print branch lengths for identifiable egdes according to the snaq estimation procedure (default false) (true inside of
snaq!
.) - round: rounds branch lengths and heritabilities γ (default: true)
- digits: digits after the decimal place for rounding (defult: 3)
- string: if true (default), returns a string, otherwise returns an IOBuffer object.
- multall: (default false). set to true when there are multiple alleles per population.
The topology may be written using a root different than net.root, if net.root is incompatible with one of more hybrid node. Missing hybrid names are written as "#Hi" where "i" is the hybrid node number if possible.