Internal Documentation

Documentation for SNaQ's internal functions. These functions are not exported and their access (API) should not be considered stable. But they can still be used, like this for example: SNaQ.foo() for a function named foo().

Functions & Types

SNaQ.QuartetNetworkType
QuartetNetwork(net::HybridNetwork)

Subtype of Network abstract type. A QuartetNetwork object is an internal type used to calculate the expected CFs of quartets on a given network. Attributes of the QuartetNetwork objects need not be updated at a given time (see below).

The procedure to calculate expected CFs for a given network is as follows:

  1. A QuartetNetwork object is created for each Quartet using extractQuartet!(net,d) for net::HybridNetwork and d::DataCF
  2. The vector d.quartet has all the Quartet objects, each with a QuartetNetwork object (q.qnet). Attibutes in QuartetNetwork are not updated at this point
  3. Attributes in QuartetNetwork are partially updated when calculating the expected CF (calculateExpCFAll!). To calculate the expected CF for this quartet, we need to update the attributes: which, typeHyb, t1, split, formula, expCF. To do this, we need to modify the QuartetNetwork object (i.e. merge edges,...). But we do not want to modify it directly because it is connected to the original net via a map of the edges and nodes, so we use a deep copy: qnet=deepcopy(q.qnet) and then calculateExpCFAll!(qnet). Attributes that are updated on the original QuartetNetwork object q.qnet are:
    • q.qnet.hasEdge: array of booleans of length equal to net.edge that shows which identifiable edges and gammas of net (ht(net)) are in qnet (and still identifiable). Note that the first elements of the vector correspond to the gammas.
    • q.qnet.index: length should match the number of trues in qnet.hasEdge. It has the indexes in qnet.edge from the edges in qnet.hasEdge. Note that the first elements of the vector correspond to the gammas.
    • q.qnet.edge: list of edges in QuartetNetwork. Note that external edges in net are collapsed when they appear in QuartetNetwork, so only internal edges map directly to edges in net
    • q.qnet.expCF: expected CF for this Quartet

Why not modify the original QuartetNetwork? We wanted to keep the original QuartetNetwork stored in DataCF with all the identifiable edges, to be able to determine if this object had been changed or not after a certain optimization.

The process is:

  1. Deep copy of full network to create q.qnet for Quartet q. This QuartetNetwork object has only 4 leaves now, but does not have merged edges (the identifiable ones) so that we can correspond to the edges in net. This QuartetNetwork does not have other attributes updated.
  2. For the current set of branch lengths and gammas, we can update the attributes in q.qnet to compute the expected CF. The functions that do this will "destroy" the QuartetNetwork object by merging edges, removing nodes, etc... So, we do this process in qnet=deepcopy(q.qnet), and at the end, only update q.qnet.expCF.
  3. After we optimize branch lengths in the full network, we want to update the branch lengths in q.qnet. The edges need to be there (which is why we do not want to modify this QuartetNetwork object by merging edges), and we do not do a deep-copy of the full network again. We only change the values of branch lengths and gammas in q.qnet, and we can re-calculate the expCF by creating a deep copy qnet=deepcopy(q.qnet) and run the other functions (which merge edges, etc) to get the expCF.

Future work: there are definitely more efficient ways to do this (without the deep copies). In addition, currently edges that are no longer identifiable in QuartetNetwork do not appear in hasEdge nor index. Need to study this.

julia> net0 = readnewick("(s17:13.76,(((s3:10.98,(s4:8.99,s5:8.99)I1:1.99)I2:0.47,(((s6:2.31,s7:2.31)I3:4.02,(s8:4.97,#H24:0.0::0.279)I4:1.36)I5:3.64,((s9:8.29,((s10:2.37,s11:2.37)I6:3.02,(s12:2.67,s13:2.67)I7:2.72)I8:2.89)I9:0.21,((s14:2.83,(s15:1.06,s16:1.06)I10:1.78)I11:2.14)#H24:3.52::0.72)I12:1.47)I13:1.48)I14:1.26,(((s18:5.46,s19:5.46)I15:0.59,(s20:4.72,(s21:2.40,s22:2.40)I16:2.32)I17:1.32)I18:2.68,(s23:8.56,(s1:4.64,s2:4.64)I19:3.92)I20:0.16)I21:3.98)I22:1.05);");

julia> net = readnewicklevel1(writenewick(net0)) ## need level1 attributes for functions below
HybridNetwork, Semidirected Network
46 edges
46 nodes: 23 tips, 1 hybrid nodes, 22 internal tree nodes.
tip labels: s17, s3, s4, s5, ...
(s4:8.99,s5:8.99,(s3:10.0,((((s6:2.31,s7:2.31)I3:4.02,(s8:4.97,#H24:0.0::0.279)I4:1.36)I5:3.64,((s9:8.29,((s10:2.37,s11:2.37)I6:3.02,(s12:2.67,s13:2.67)I7:2.72)I8:2.89)I9:0.21,((s14:2.83,(s15:1.06,s16:1.06)I10:1.78)I11:2.14)#H24:3.52::0.721)I12:1.47)I13:1.48,((((s18:5.46,s19:5.46)I15:0.59,(s20:4.72,(s21:2.4,s22:2.4)I16:2.32)I17:1.32)I18:2.68,(s23:8.56,(s1:4.64,s2:4.64)I19:3.92)I20:0.16)I21:3.98,s17:10.0)I22:1.26)I14:0.47)I2:1.99)I1;

julia> q1 = Quartet(1,["s1", "s16", "s18", "s23"],[0.296,0.306,0.398])
number: 1
taxon names: ["s1", "s16", "s18", "s23"]
observed CF: [0.296, 0.306, 0.398]
pseudo-deviance under last used network: 0.0 (meaningless before estimation)
expected CF under last used network: Float64[] (meaningless before estimation)

julia> qnet = SNaQ.extractQuartet!(net,q1)
taxa: ["s1", "s16", "s18", "s23"]
number of hybrid nodes: 1

julia> sum([SNaQ.istIdentifiable(e) for e in net.edge]) ## 23 identifiable edges in net
23

julia> idedges = [ee.number for ee in net.edge[[SNaQ.istIdentifiable(e) for e in net.edge]]];

julia> print(idedges)
[5, 6, 9, 11, 12, 13, 17, 20, 21, 22, 26, 27, 28, 29, 30, 31, 34, 38, 39, 40, 44, 45, 46]

julia> length(qnet.hasEdge) ## 24 = 1 gamma + 23 identifiable edges
24

julia> sum(qnet.hasEdge) ## 8 = 1 gamma + 7 identifiable edges in qnet
8

julia> print(idedges[qnet.hasEdge[2:end]]) ## 7 id. edges: [12, 13, 29, 30, 31, 45, 46]
[12, 13, 29, 30, 31, 45, 46]

julia> qnet.edge[qnet.index[1]].number ## 11 = minor hybrid edge
11
source
PhyloNetworks.printedgesMethod
printedges([io::IO,] quartetnet)

Print information on the edges of a QuartetNetwork object quartetnet:

  • edge number
  • numbers of nodes attached to it
  • edge length
  • whether it's a hybrid edge
  • whether it's a major edge
  • its γ inheritance value
  • if it could contain the root (this field is not always updated, though)
  • in which cycle it is contained (-1 if no cycle)
  • if its length is identifiable from quartet concordance factors.
source
PhyloNetworks.tablequartetCFMethod
tablequartetCF(vector of Quartet objects)
tablequartetCF(DataCF)
tablequartetCF(gene trees)
tablequartetCF(gene tree file)

Build a NamedTuple containing observed quartet concordance factors, with the fields named:

  • t1, t2, t3, t4 for the four taxon names in each quartet
  • CF12_34, CF13_24, CF14_23 for the 3 quartets of a given four-taxon set
  • ngenes if this information is available for some quartets

Some downstream functions may require the observed quartet concordance factors to be in a DataFrame, this can be easily converted by wrapping the output NamedTuple in the DataFrame() function

source
SNaQ.afterOptBL!Method
afterOptBL!

Check if there are h==0,1;t==0,hz==0,1 cases in a network after calling optBL!.

Output: (successchange,flagh,flagt,flaghz) where successchange is false if could not add new hybrid; true otherwise. Flags flag* is false if there is problem with gamma, t=0 or gammaz.

Arguments:

  • closeN: move origin/target iftrue; iffalseadd/deleteNtimes before giving up (we have only testedcloseN=true`)
  • origin=true will move origin, false will move target. We added this to avoid going back and forth between the same networks
  • movesgamma: vector of integers. Counts the number of times each move is proposed to fix a gamma zero problem. Proposal types and order: (add, mvorigin, mvtarget, chdir, delete, nni).

Procedure:

  • First we split the ht vector in nh,nt,nhz (gammas, lengths, gammaz)

  • If we find a h==0,1, we loop through nh to find a hybrid edge with h==0 or 1 and want to try to fix this by doing:

    • gammaZero!(currT,d,edge,closeN,origin,N,movesgamma) which returns true if there was a successful change, and we stop the loop
  • If we find a t==0, we loop through all nt to find such edge, and do NNI move on this edge; return true if change successful and we stop the loop

  • If we find a hz==0,1, we loop through nhz to find such hybrid edge and call gammaZero! again

  • If we did a successful change, we run optBL again, and recheck if there are no more problems.

  • Returns successchange, flagh, flagt,flaghz (flag=true means no problems)

  • If it is the multiple alleles case, it will not try to fix h==0,1;hz==0,1 because it can reach a case that violates the multiple alleles condition. If we add a check here, things become horribly slow and inefficient, so we just delete a hybridization that has h==0,1;hz==0,1

** Important**: afterOptBL is doing only one change, but we need to repeat multiple times to be sure that we fix all the gamma zero problems, which is why we call afterOptBLRepeat.

source
SNaQ.afterOptBLAll!Method
afterOptBLAll!(currT, args...)

Try to fix any gamma zero problem (h==0,1; t==0; hz==0,1) by calling afterOptBLRepeat!. If problems cannot be fixed, it will call moveDownLevel to delete the hybridization from the network. Like afterOptBLAllMultipleAlleles, this function is called after optBL.

Output: new approved network currT (no gammas=0.0)

Procedure:

While startover=true and tries<N

  • While badliks < N2 (number of bad pseudolikelihoods are less than N2)
    • Run success = afterOptBLRepeat
    • If success = true (it changed something):
      • If worse pseudolik, then go back to original topology currT, set startover=true and badliks++
      • If better pseudolik, then check flags. If all good, then startover=false; otherwise startover = true
    • If success = false (nothing changed), then set badliks=N2+1 (to end the while on currT)
      • If all flags are ok, then startover = false
      • If bad h or hz, then call moveDownLevel (delete one hybridization), and set startover = true (maybe deleting that hybridization did not fix other gamma zero problems)
      • If bad t, then set startover = false
  • If left second while by back to original currT, and still bad h/hz, then move down one level, and startover=true; otherwise startover=false

If first while ends by tries>N, then it checks one last time the flags, if bad h/hz will move down one level, and exit

source
SNaQ.afterOptBLRepeat!Method
afterOptBLRepeat!

Repeat afterOptBL!, which only does one change. Repeating multiple times is to be sure that we fix all the gamma zero problems, after every time a successful change happened. This is done only if closeN=false, because we would delete/add hybridizations and need to stop after tried N times. If closeN=true (default), then afterOptBLRepeat! only does one afterOptBL!, because in this case, only the neighbor edges need to be tested, and this would have been done already in gammaZero!.

source
SNaQ.calculateObsCFAll!Method
calculateObsCFAll!(DataCF, taxa::Union{Vector{<:AbstractString}, Vector{Int}})

Calculate observed concordance factors: update the .quartet[i].obsCF values of the DataCF object based on its .tree vector.

calculateObsCFAll!(vector of quartets, vector of trees, taxa)

Calculate observed concordance factors: update the .obsCF values of the quartets, based on the trees, and returns a new DataCF object with these updated quartets and trees.

calculateObsCFAll_noDataCF!(vector of quartets, vector of trees, taxa)

update the .obsCF values of the quartets based on the trees, but returns nothing.

Warning: all these functions need input trees (without any reticulations: h=0).

See also: PhyloNetworks.countquartetsintrees, which uses a faster algorithm, processing each input tree only once. calculateObsCFAll_noDataCF! processes each input tree # quartet times.

source
SNaQ.checkMapDFMethod
checkMapDF(mapping_allele2species::DataFrame)

Check that the data frame has one column named "allele" or "individual", and one column named "species". Output: indices of these column.

source
SNaQ.deleteLeaf!Method
deleteLeaf!(net::HybridNetwork, leaf::AbstractString)
deleteLeaf!(net::Network, leaf::Node)

Delete the leaf taxon from the network. The leaf argument is the name of the taxon to delete.

Warnings:

  • requires a level-1 network with up-to-date attributes for snaq! (e.g. non-missing branch lengths, gammaz, etc.)
  • does not care where the root is and does not update it to a sensible location if the root is affected by the leaf removal.
  • does not merge edges, i.e. does not remove all nodes of degree 2. Within snaq!, this is used to extract quartets and to keep track of which edge lengths in the original network map to the quartet network.
source
SNaQ.gammaZero!Method
gammaZero!

Try to fix a gamma zero problem (h==0,1; t==0; hz==0,1)

  1. First tries to do changeDirection: change the direction of hybrid edge
  2. If changing the direction was successful, call optBL and check that the problem was fixed
  3. If problem fixed and the pseudolik is not worse, return success=true
  4. If changing the direction failed (step 1) or if the problem persists (step 2 failed) or if the pseudolik got worse (step 3 failed), then call moveHybrid!.

** Important**: Any function (afterOptBL) calling gammaZero! is assuming that it only made a change, so if the returned value is true, then a change was made, and the other function needs to run optBL and check that all parameters are 'valid'. If the returned value is false, then no change was possible and we need to remove a hybridization if the problem is h==0,1; hz==0,1. If the problem is t==0, we ignore this problem.

source
SNaQ.getNeighborsTargetMethod
getNeighborsTarget(hybrid_node, majoredge)

Vector of edges that are incident to either:

  • the node incident to majoredge other than hybrid_node, or
  • the tree child of hybrid_node.

This vector of edges is used as the list of suitable neighbors of "othermin" to move the target of a hybrid edge, in moveTargetUpdateRepeat!.

source
SNaQ.hybridatnode!Method
hybridatnode!(net::HybridNetwork, nodeNumber::Integer)

Change the direction and status of edges in network net, to move the hybrid node in a cycle to the node with number nodeNumber. This node must be in one (and only one) cycle, otherwise an error will be thrown. Check and update the nodes' field inCycle.

Output: net after hybrid modification.

Assumption: net must be of level 1, that is, each blob has a single cycle with a single reticulation.

example

net = readnewick("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
using PhyloPlots
plot(net, shownodenumber=true); # to locate nodes and their numbers. D of hybrid origin
hybridatnode!(net, -4)
plot(net, shownodenumber=true); # hybrid direction reversed: now 2B of hybrid origin
source
SNaQ.hybridatnode!Method
hybridatnode!(net::HybridNetwork, hybrid::Node, newNode::Node)

Move the reticulation from hybrid to newNode, which must in the same cycle. net is assumed to be of level 1, but no checks are made and fields are supposed up-to-date.

Called by hybridatnode!(net, nodenumber), which is itself called by undirectedOtherNetworks.

source
SNaQ.mapallelesCFtable!Method
mapallelesCFtable!(quartet CF DataFrame, mapping DataFrame, columns, write?, filename)

Modify (and return) the quartet concordance factor (CF) DataFrame: replace each allele name by the species name that the allele maps onto based on the mapping data frame. This mapping data frame should have columns named "allele" and "species" (see rename! to change column names if need be).

If write? is true, the modified data frame is written to a file named "filename".

Warning: mapallelesCFtable takes the quartet data file as its second argument, while mapallelesCFtable! takes the quartet data (which it modifies) as its first argument.

source
SNaQ.moveHybrid!Method
moveHybrid!(net, edge, closeN, origin, N, movesgamma)

Delete a hybrid, and then add a new hybrid. Attempt to fix a gamma zero problem (h==0,1; t==0; hz==0,1) after changing the direction of hybrid edge failed. This function is called in gammaZero!.

Arguments:

  • closeN=true will try move origin/target on all neighbors (first choose minor/major edge at random, then make list of all neighbor edges and tries to put the hybrid node in all the neighbors until successful move). closeN=false will delete and add hybrid until successful move up to N times (this is never tested).
  • origin: move the origin if true, moves the target if false. Option used to avoid coming back to propose the same network over and over.
  • movesgama: vector of integers. Counts the number of times each move is proposed to fix a gamma zero situation. Proposal types and order: (add, mvorigin, mvtarget, chdir, delete, nni)

Return true if change was successful (not testing optBL again), and false if we could not move anything.

source
SNaQ.moveTargetUpdate!Method
moveTargetUpdate!(net, hybrid_node, majoredge, newedge)

Modify a level-1 network net by moving majoredge, which should be a hybrid edge parent of hybrid_node. Within SNaQ, majoredge is chosen by chooseMinorMajor.

  • calls moveTarget!(net,hybrid_node, majoredge, treeedge_belowhybrid, newedge), which does the move but does not update any attributes
  • updates all level-1 attributes needed for SNaQ: gammaz, containroot
  • un-does the move and updates if the move is invalid, through another call to moveTarget! but with the "undo" option.

newedge should be a tree edge (enforced by chooseEdgeOriginTarget!) adjacent to the parent node of majoredge or to the tree child of hybrid_node (enforced by getNeighborsTarget)

Output: tuple of 3 booleans (success, flag_triangle, flag_root).

  • success is false if the move failed (lead to an invalid network for SNaQ)
  • flag_triangle is false if hasVeryBadTriangle(net)
  • flag_root is false if the set of edges to place the root is empty

If success is false, then the flags are not meant to be used downstream.

source
SNaQ.optBL!Method
optBL!(
    net::HybridNetwork,
    d::DataCF,
    verbose::Bool,
    ftolRel::Float64,
    ftolAbs::Float64,
    xtolRel::Float64,
    xtolAbs::Float64
)

Optimize the edge parameters (lengths and inheritance probabilities γ) of a given level-1 network, using the BOBYQA algorithm from the NLopt package. The optimum found is used to modify net with new edge lengths, hybrid edge γs, and minimum loglik(net).

Warning: net is assumed to have up-to-date and correct level-1 attributes. This is not checked for efficiency, because this function is called repeatedly inside optTopLevel! and snaq!.

Procedure:

  • ht = parameters!(net) extracts the vector of parameters to estimate (h,t,gammaz), and sets as ht(net); identifies a bad diamond I, sets numht(net) (vector of hybrid node numbers for h, edge numbers for t, hybrid node numbers for gammaz), and index(net) to keep track of the vector of parameters to estimate
  • extractQuartet!(net,d) does the following for all quartets in d.quartet:
    • Extract quartet by deleting all leaves not in q -> create QuartetNetwork object saved in q.qnet
    • This network is ugly and does not have edges collapsed. This is done to keep a one-to-one correspondence between the edges in q.qnet and the edges in net (if we remove nodes with only two edges, we will lose this correspondence)
    • Calculate expected CF with calculateExpCFAll for a copy of q.qnet. We do this copy because we want to keep q.qnet as it is (without collapsed edges into one). The function will then save the expCF in q.qnet.expCF
  • calculateExpCFAll!(qnet) will
    • identify the type of quartet as type 1 (equivalent to a tree) or type 2 (minor CF different). Here the code will first clean up any hybrid node by removing nodes with only two edges before identifying the qnet (because identification depends on neighbor nodes to hybrid node); later, set qnet.which (1 or 2), node.prev (neighbor node to hybrid node), updates k(node) (number of nodes in hybridization cycle, this can change after deleting the nodes with only two edges), typeHyb(node) (1,2,3,4,5 depending on the number of nodes in the hybridization cycle and the origin/target of the minor hybrid edge; this attribute is never used).
    • eliminate hybridization: this will remove type 1 hybridizations first. If qnet.which=1, then the qnet is similar to a tree quartet, so it will calculate the internal length of the tree quartet: qnet.t1.
    • update split for qnet.which=1, to determine which taxa are together. For example, for the quartet 12|34, the split is [1,1,2,2] or [2,2,1,1], that is, taxon 1 and 2 are on the same side of the split. This will update qnet.split
    • update formula for qnet.which=1 to know the order of minorCF and majorCF in the vector qnet.expCF. That is, if the quartet is 1342 (order in qnet.quartet.taxon), then the expected CF should match the observed CF in 13|42, 14|32, 12|34 and the qnet is 12|34 (given by qnet.split), qnet.formula will be [2,2,1] minor, minor, major
    • calculateExpCF!(qnet) for qnet.which=1, it will do 1-2/3exp(-qnet.t1) if qnet.formula[i]==1, and 1/3exp(qnet.t1) if qnet.formula[i]==2. For qnet.which=2, we need to make sure that there is only one hybrid node, and compute the major, minor1,minor2 expected CF in the order 12|34, 13|24, 14|23 of the taxa in qnet.quartet.taxon

Then we create a NLopt object with algorithm BOBYQA and k parameters (length of ht). We define upper and lower bounds and define the objective function that should only depend on x=(h,t,gz) and g (gradient, which we do not have, but still need to put as argument).

The objective function obj(x,g) calls

  • calculateExpCFAll!(d,x,net) needs to be run after extractQuartet(net,d) that will update q.qnet for all quartet. Assumes that qnet.indexht is updated already: we only need to do this at the beginning of optBL! because the topology is fixed at this point)
    • First it will update the edge lengths according to x
    • If the q.qnet.changed=true (that is, any of qnet branches changed value), we need to call calculateExpCFAll!(qnet) on a copy of q.qnet (again because we want to leave q.qnet with the edge correspondence to net)
  • update!(net,x) simply saves the new x in ht(net)

Finally, after calling NLopt.optimize, loglik(net) and ht(net) are updated with the optimum score and parameter values that were found. After optBL, we want to call afterOptBLAll (or afterOptBLAllMultipleAlleles) to check if there are h==0,1; t==0; hz==0,1.

source
SNaQ.optTopLevel!Method
optTopLevel!(currT, args...)

Does most of the heavy-lifting of snaq!, to search the space of networks with at most hmax hybrid nodes. It optimizes the pseudolikelihood starting from network currT, and returns the best network. Unlike snaq!, it assumes that the starting topology currT is of level-1, and has all the attributes correctly updated. currT is modified.

Input parameters:

  • Starting topology currT, input data DataCF d, maximum number of hybridizations hmax
  • Numerical optimization parameters:
    • liktolAbs: stop the search if the change in loglik is smaller (in absolute value)
    • Nfail: number of failure networks with lower loglik before aborting
    • ftolRel, ftolAbs, xtolRel, xtolAbs: to stop the optimization of edge parameters on each fixed topology
  • Print parameters: verbose, logfile, writelog
  • Parameters to tune the search in space of networks: closeN=true only propose move origin/target to neighbor edges (coded, but not tested with closeN=false), Nmov0 vector with maximum number of trials allowed per type of move (add, mvorigin, mvtarget, chdir, delete, nni), by default computed inside with coupon’s collector formulas

The optimization procedure keeps track of

  • movescount: count of proposed moves,
  • movesgamma: count of proposed moves to fix a gamma zero situation (see below for definition of this situation),
  • movesfail: count of failed moves by violation of level-1 network (inCycle attribute) or worse pseudolikelihood than current,
  • failures: number of failed proposals that had a worse pseudolikelihood

Optimization procedure:

While the difference between current loglik and proposed loglik is greater than liktolAbs, or failures<Nfail, or stillmoves=true:

  • Nmov is a vector with the maximum number of trial per move type: (add,mvorigin,mvtarget,chdir,delete,nni). Nmov is updated based on newT. The type of move proposed will depend on newT (which is the same as currT at this point). For example, if currT is a tree, we cannot propose move origin/target.

  • move = whichMove selects randomly a type of move, depending on Nmov,movesfail,hmax,newT with weights 1/5 by default for all, and 0 for delete. These weights are adjusted depending on newT.numhybrids and hmax. If newT.numhybrids is far from hmax, we give higher probability to adding a new hybrid (we want to reach the hmax sooner, maybe not the best strategy, easy to change). Later, we adjust the weights by movesfail (first, give weight of 0 if movesfail[i]>Nmov[i], that is, if we reached the maximum possible number of moves allowed for a certain type) and then increase the probability of the other moves. So, unless one move has w=0, nothing changes. This could be improved by using the outlier quartets to guide the proposal of moves.

  • whichMove will choose a move randomly from the weights, it will return none if no more moves allowed, in which case, the optimization ends

  • flag=proposedTop!(move, newT) will modify newT based on move. The function proposedTop will return flag=true if the move was successful (the move succeeded by inCycle, containroot, available edge to make the move (more details in proposedTop)). If flag=false, then newT is cleaned, except for the case of multiple alleles. The function proposedTop keeps count of movescount (successful move), movesfail (unsuccessful move),

    Options:

    random=true: moves major/minor hybrid edge with prob h,1-h, respectively

    N=10: number of trials for NNI edge.

  • if(flag) Optimize branch lengths with optBL

    If loglik(newT) is better than loglik(currT) by liktolAbs, jump to newT (accepted=true) and fix gamma=0, t=0 problems (more info on afterOptBL)

    If(accepted) failures=0, movesfail=zeros, movescount for successful move +1

end while

After choosing the best network newT, we do one last more thorough optimization of branch lengths with optBL, we change non identifiable branch lengths to -1 (only in debug mode) and return newT

source
SNaQ.optTopRun1!Method
optTopRun1!(net, liktolAbs, Nfail, d::DataCF, hmax, etc.)

The function will run 1 run by modifying the starting topology and calling optTopLevel. See optTopRuns! for a roadmap.

probST (default in snaq is 0.3) is the probability of starting one run at the same input tree. So, with probability 1-probST, we will change the topology by a NNI move on a tree edge without neighbor hybrid. If the starting topology is a network, then with probability 1-probST it will also modify one randomly chosen hybrid edge: with prob 0.5, the function will move origin, with prob 0.5 will do move target.

If there are multiple alleles (d.repSpecies not empty), then the function has to check that the starting topology does not violate the multiple alleles condition.

After modifying the starting topology with NNI and/or move origin/target, optTopLevel is called.

source
SNaQ.optTopRuns!Method

Road map for various functions behind snaq!

snaq!
optTopRuns!
optTopRun1!
optTopLevel!
optBL!

All return their optimized network.

  • snaq! calls optTopRuns! once, after a deep copy of the starting network. If the data contain multiple alleles from a given species, snaq! first expands the leaf for that species into 2 separate leaves, and merges them back into a single leaf after calling optTopRuns!.
  • optTopRuns! calls optTopRun1! several (nrun) times. assumes level-1 network with >0 branch lengths. assumes same tips in network as in data: i.e. 2 separate tips per species that has multiple alleles. each call to optTopRun1! gets the same starting network.
  • optTopRun1! calls optTopLevel! once, after deep copying + changing the starting network slightly.
  • optTopLevel! calls optBL! various times and proposes new network with various moves.
source
SNaQ.proposedTop!Method
proposedTop!(move,newT,random,count,N,movescount,movesfail,multall)

Change the current network newT by a given move, and check that the move was successful (correct attributes). If not successful, newT is changed back to its original state, except for the case of multiple alleles.

Note that the update of attributes by each move is not done in all the network, but only in the local edges that were changed by the move. This is efficient (and makes a move easy to undo), but makes the code of each move function very clunky.

Arguments:

  • move chosen from whichMove as described in optTopLevel!
  • newT is the topology that will be modified inside with the move
  • random=true: choose the minor hybrid edge with probability 1-h, and major edge with probability h. If false: always choose the minor hybrid edge
  • count: simply which likelihood step we are in, in the optimization at optTopLevel!
  • N: number of attempts for NNI moves
  • movescount and movesfail: vector of counts of number of moves proposed. move types and order: (add,mvorigin,mvtarget,chdir,delete,nni).
  • multall=true if multiple alleles case: we need to check if the move did not violate the multiple alleles condition (sister alleles together and no gene flow into the alleles). This is inefficient because we are proposing moves that we can reject later, instead of being smart about the moves we propose: for example, move origin/target could rule out some neighbors that move gene flow into the alleles, the same for add hybridization; nni move can check if it is trying to separate the alleles)

Moves:

  • addHybridizationUpdate(newT,N): will choose a partition first (to avoid choosing edges that will create a non level-1 network) will choose two edges from this partition randomly, will not allow two edges in a cherry (non-identifiable), or sister edges that are not identifiable (the blacklist was a way to keep track of "bad edges" were we should not waste time trying to put hybridizations, it has never been used nor tested). Also choose gamma from U(0,0.5). The "Update" in the function name means that it creates the new hybrid, and also updates all the attributes of newT

  • node = chooseHybrid(newT) choose a hybrid randomly for the next moves:

  • moveOriginUpdateRepeat!(newT,node,random) will choose randomly the minor/major hybrid edge to move (if random=true); will get the list of all neighbor edges where to move the origin, will move the origin and update all the attributes and check if the move was successful (not conflicting attributes); if not, will undo the move, and try with a different neighbor until it runs out of neighbors. Return true if the move was successful.

  • moveTargetUpdateRepeat!(newT,node,random) same as move origin but moving the target

  • changeDirectionUpdate!(newT,node,random) chooses minor/major hybrid edge at random (if `random=true), and changes the direction, and updates all the attributes. Checks if the move was successful (returns true), or undoes the change and returns false.

  • deleteHybridizationUpdate!(newT,node) removes the hybrid node, updates the attributes, no need to check any attributes, always successful move

  • NNIRepeat!(newT,N) choose an edge for nni that does not have a neighbor hybrid. It will try to find such an edge N times, and if it fails, it will return false (unsuccessful move). N=10 by default. If N=1, it rarely finds such an edge if the network is small or complex. The function cannot choose an external edge. it will update locally the attributes.

** Important**: All the moves undo what they did if the move was not successful, so at the end you either have a newT with a new move and with all good attributes, or the same newT that started. This is important to avoid having to do deepcopy of the network before doing the move. Also, after each move, when we update the attributes, we do not update the attributes of the whole network, we only update the attributes of the edges that were affected by the move. This saves time, but makes the code quite clunky. Only the case of multiple alleles the moves does not undo what it did, because it finds out that it failed after the function is over, so just need to treat this case special.

source
SNaQ.readInputDataMethod
readInputData(trees, quartetfile, whichQuartets, numQuartets, writetable, tablename, writeQfile, writesummary)
readInputData(trees, whichQuartets, numQuartets, taxonlist,   writetable, tablename, writeQfile, writesummary)

Read gene trees and calculate the observed quartet concordance factors (CF), that is, the proportion of genes (and the number of genes) that display each quartet for a given list of four-taxon sets.

Input:

  • trees: name of a file containing a list of input gene trees, or vector of trees (HybridNetwork objects)

Optional arguments (defaults):

  • quartetfile: name of a file containing a list of quartets, or more precisely, a list of four-taxon sets
  • whichQuartets (:all): which quartets to sample. :all for all of them, :rand for a random sample.
  • numQuartets: number of quartets in the sample. default: total number of quartets if whichQuartets=:all and 10% of total if whichQuartets=:rand
  • taxonlist (all in the input gene trees): If taxonlist is used, whichQuartets will consist of all sets of 4 taxa in the taxonlist.
  • writetable (true): write the table of observed CF?
  • tablename ("tableCF.txt"): if writetable is true, the table of observed CFs is write to file tablename
  • writeQfile (false): write intermediate file with sampled quartets?
  • writesummary (true): write a summary file? if so, the summary will go in file "summaryTreesQuartets.txt".

Uses calculateObsCFAll!, which implements a slow algorithm.

See also: PhyloNetworks.countquartetsintrees, which uses a much faster algorithm; readtrees2CF, which is an exported and user-friendly re-naming of readInputData, and readtableCF to read a table of quartet CFs directly.

source
SNaQ.sameTaxaMethod
sameTaxa(Quartet, HybridNetwork)

Return true if all taxa in the quartet are represented in the network, false if one or more taxa in the quartet does not appear in the network.

warning: the name can cause confusion. A more appropriate name might be "in", or "taxain", or "taxonsubset", or etc.

source
SNaQ.sampleCFfromCIFunction
sampleCFfromCI(data frame, seed=0)
sampleCFfromCI!(data frame, seed=0)

Read a data frame containing CFs and their credibility intervals, and sample new obsCF uniformly within the CIs. These CFs are then rescaled to sum up to 1 for each 4-taxon sets. Return a data frame with taxon labels in first 4 columns, sampled obs CFs in columns 5-7 and credibility intervals in columns 8-13.

  • The non-modifying function creates a new data frame (with re-ordered columns) and returns it. If seed=-1, the new df is a deep copy of the input df, with no call to the random number generator. Otherwise, seed is passed to the modifying function.
  • The modifying function overwrites the input data frame with the sampled CFs and returns it. If seed=0, the random generator is seeded from the clock. Otherwise the random generator is seeded using seed.

Warning: the modifying version does not check the data frame: assumes correct columns.

optional argument: delim=',' by default: how columns are delimited.

source
SNaQ.setGammaBLfromGammaz!Method
setGammaBLfromGammaz!(node, network)

Update the γ values of the two sister hybrid edges in a bad diamond I, given the gammaz values of their parent nodes, and update the branch lengths t1 and t2 of their parent edges (those across from the hybrid nodes), in such a way that t1=t2 and that these branch lengths and γ values are consistent with the gammaz values in the network.

Similar to the first section of undoGammaz!, but does not update anything else than γ and t's. Unlike undoGammaz!, no error if non-hybrid node or not at bad diamond I.

source
SNaQ.setLength!Method
setLength!(edge, newlength)`

Set the length of edge, and set edge.y and edge.z accordingly. Warning: specific to SNaQ.jl. Consider PhyloNetworks.setlengths! from PhyloNetworks for a more generic tool.

  • The new length is censored to 10: if the new length is above 10, the edge's length will be set to 10. Lengths are interpreted in coalescent units, and 10 is close to infinity: near perfect gene tree concordance. 10 is used as an upper limit to coalescent units that can be reliably estimated.
  • The new length is allowed to be negative, but must be greater than -log(1.5), to ensure that the major quartet concordance factor (1 - 2/3 exp(-length)) is >= 0.
source
SNaQ.setNonIdBL!Method
setNonIdBL!(net)

Set non-identifiable edge branch lengths to -1.0 (i.e. missing) for a level-1 network net, except for edges in

  • a good triangle: the edge below the hybrid is constrained to 0.
  • a bad diamond II: the edge below the hybrid is constrained to 0
  • a bad diamond I: the edges across from the hybrid node have non identifiable lengths but are kept, because the two γ*(1-exp(-t)) values are identifiable.

will break if inCycle attributes are not initialized (at -1) or giving a correct node number.

see PhyloNetworks.Node for the meaning of boolean attributes isBadTriangle (which corresponds to a "good" triangle above), isBadDiamondI and isBadDiamondII.

source
SNaQ.sorttaxa!Method
sorttaxa!(DataFrame, columns)

Reorder the 4 taxa and reorders the observed concordance factors accordingly, on each row of the data frame. If columns is ommitted, taxon names are assumed to be in columns 1-4 and CFs are assumed to be in columns 5-6 with quartets in this order: 12_34, 13_24, 14_23. Does not reorder credibility interval values, if present.

sorttaxa!(DataCF)
sorttaxa!(Quartet, permutation_tax, permutation_cf)

Reorder the 4 taxa in each element of the DataCF quartet. For a given Quartet, reorder the 4 taxa in its fields taxon and qnet.quartetTaxon (if non-empty) and reorder the 3 concordance values accordingly, in obsCF and qnet.expCF.

permutation_tax and permutation_cf should be vectors of short integers (Int8) of length 4 and 3 respectively, whose memory allocation gets reused. Their length is not checked.

qnet.names is unchanged: the order of taxon names here relates to the order of nodes in the network (???)

source
SNaQ.taxadiffMethod
taxadiff(Vector{Quartet}, network; multiplealleles=true)
taxadiff(DataCF, network; multiplealleles=true)

Return 2 vectors:

  • taxa in at least 1 of the quartets but not in the network, and
  • taxa in the network but in none of the quartets.

When multiplealleles is true, the taxon names that end with "__2" are ignored in the quartets: they are not expected to appear in the networks that users give as input, or get as output.

source
SNaQ.traverseContainRoot!Method
updateContainRoot!(HybridNetwork, Node)
traverseContainRoot!(Node, Edge, edges_changed::Array{Edge,1}, rightDir::Vector{Bool})

The input node to updateContainRoot! must be a hybrid node (can come from PhyloNetworks.searchHybridNode). updateContainRoot! starts at the input node and calls traverseContainRoot!, which traverses the network recursively. By default, containroot attributes of edges are true. Changes containroot to false for all the visited edges: those below the input node, but not beyond any other hybrid node.

updateContainRoot! Returns a flag and an array of edges whose containroot has been changed from true to false. flag is false if the set of edges to place the root is empty

In traverseContainRoot!, rightDir turns false if hybridizations have incompatible directions (vector of length 1, to be modified).

Warning:

  • does not update containroot of minor hybrid edges.
  • assumes correct ismajor attributes: to stop the recursion at minor hybrid edges.
  • assumes correct hybrid attributes of both nodes & edges: to check if various hybridizations have compatible directions. For each hybrid node that is encountered, checks if it was reached via a hybrid edge (ok) or tree edge (not ok).

rightDir: vector of length 1 boolean, to be mutable and modified by the function

source
SNaQ.undirectedOtherNetworksMethod
undirectedOtherNetworks(net::HybridNetwork)

Return a vector of HybridNetwork objects, obtained by switching the placement of each hybrid node to other nodes inside its cycle. This amounts to changing the direction of a gene flow event (recursively to move around the whole cycle of each reticulation).

Optional argument: outgroup, as a String. If an outgroup is specified, then networks conflicting with the placement of the root are avoided.

Assumptions: net is assumed to be of level 1, that is, each blob has a single cycle with a single reticulation. All level-1 fields of net are assumed up-to-date.

Example

julia> net = readnewick("(A:1.0,((B:1.1,#H1:0.2::0.2):1.2,(((C:0.52,(E:0.5)#H2:0.02::0.7):0.6,(#H2:0.01::0.3,F:0.7):0.8):0.9,(D:0.8)#H1:0.3::0.8):1.3):0.7):0.1;");
julia> vnet = undirectedOtherNetworks(net)
source
SNaQ.undoGammaz!Method
undoGammaz!(node, network)

Undo updateGammaz! for the 2 cases: bad diamond I,II. node should be a hybrid node. Set length to edges that were not identifiable and change edges' gammaz attribute to -1.0. Recalculate branch lengths in terms of gammaz. warning: needs to know incycle attributes

source
SNaQ.updateBL!Method
updateBL!(net::HybridNetwork, d::DataCF)

Update internal branch lengths of net based on the average quartet concordance factor (CF) across all quartets that exactly correspond to a given branch: new branch length = -log(3/2(1-mean(CF observed in d))). net is assumed to be a tree, such that the above equation holds.

source
SNaQ.updateContainRoot!Function
updateContainRoot!(HybridNetwork, Node)
traverseContainRoot!(Node, Edge, edges_changed::Array{Edge,1}, rightDir::Vector{Bool})

The input node to updateContainRoot! must be a hybrid node (can come from PhyloNetworks.searchHybridNode). updateContainRoot! starts at the input node and calls traverseContainRoot!, which traverses the network recursively. By default, containroot attributes of edges are true. Changes containroot to false for all the visited edges: those below the input node, but not beyond any other hybrid node.

updateContainRoot! Returns a flag and an array of edges whose containroot has been changed from true to false. flag is false if the set of edges to place the root is empty

In traverseContainRoot!, rightDir turns false if hybridizations have incompatible directions (vector of length 1, to be modified).

Warning:

  • does not update containroot of minor hybrid edges.
  • assumes correct ismajor attributes: to stop the recursion at minor hybrid edges.
  • assumes correct hybrid attributes of both nodes & edges: to check if various hybridizations have compatible directions. For each hybrid node that is encountered, checks if it was reached via a hybrid edge (ok) or tree edge (not ok).

rightDir: vector of length 1 boolean, to be mutable and modified by the function

source
SNaQ.writenewick_level1Method
writenewick_level1(net::HybridNetwork)

Write the extended Newick parenthetical format of a level-1 network object with many optional arguments (see below). Makes a deep copy of net: does not modify net.

  • di=true: write in format for Dendroscope (default false)
  • namelabel=true: If namelabel is true, taxa are labelled by their names;

otherwise taxa are labelled by their numbers (unique identifiers).

  • outgroup (string): name of outgroup to root the tree/network. if "none" is given, the root is placed wherever possible.
  • printID=true, only print branch lengths for identifiable egdes according to the snaq estimation procedure (default false) (true inside of snaq!.)
  • round: rounds branch lengths and heritabilities γ (default: true)
  • digits: digits after the decimal place for rounding (defult: 3)
  • string: if true (default), returns a string, otherwise returns an IOBuffer object.
  • multall: (default false). set to true when there are multiple alleles per population.

The topology may be written using a root different than net.rooti, if net.rooti is incompatible with one of more hybrid node. Missing hybrid names are written as "#Hi" where "i" is the hybrid node number if possible.

source
SNaQ.fRelConstant

Default values for tolerance parameters used in the optimization of branch lengths and γ's (fAbs, fRel, xAbs, xRel) and acceptance of topologies (likAbs, numFails).

Below, PN refers to PhyloNetworks.jl, which contained snaq! up until PN v0.16. Starting with PN v0.17, snaq! is part of this package SNaQ.jl.

pkg versionfRelfAbsxRelxAbsnumFailslikAbsmultiplier
SNaQ v0.11e-61e-61e-21e-3751e-6
PN v0.5.11e-61e-61e-21e-3751e-6
PN v0.3.01e-51e-61e-31e-41000.01
PN v0.0.11e-51e-61e-31e-410010000
PN older1e-121e-101e-101e-10

v0.5.1: based on Nan Ji's work. same xAbs and xRel as in phylonet (as of 2015). earlier: a multiplier was used; later: likAbs corresponds to multiplier*fAbs. "older": values from GLM.jl, Prof Bates

Default values used on a single topology to optimize branch lengths and gammas, at the very end of snaq!.

pkg versionfRelBLfAbsBLxRelBLxAbsBL
SNaQ v0.11e-121e-101e-101e-10
PN v0.0.11e-121e-101e-101e-10
source

Index