public documentation

Documentation for QuartetNetworkGoodnessFit's public (exported) functions.

index

QuartetNetworkGoodnessFit.network_expectedCF
QuartetNetworkGoodnessFit.quarnetGoFtest!
QuartetNetworkGoodnessFit.ticr!

functions

QuartetNetworkGoodnessFit.network_expectedCF — Method

network_expectedCF(net::HybridNetwork; showprogressbar=true,
        inheritancecorrelation=0)

Calculate the quartet concordance factors (qCF) expected from the multispecies coalescent along network net, using the recursive algorithm by Ané et al. (2024).

Output: (q,t) where t is a list of taxa, and q is a list of 4-taxon set objects of type PhyloNetworks.QuartetT{datatype}. In each element of q, taxonnumber gives the indices in taxa of the 4 taxa of interest; and data contains the 3 concordance factors, for the 3 unrooted topologies in the following order: t1,t2|t3,t4, t1,t3|t2,t4 and t1,t4|t2,t3. This output is similar to that of PhyloNetworks.countquartetsintrees when 1 individual = 1 taxon, with 4-taxon sets listed in the same order (same output t, then same order of 4-taxon sets in q).

Assumption: the network should have edge lengths in coalescent units.

By default, lineages at a hybrid node come from a parent (chosen according to inheritance probabilities γ) independently across lineages. With option inheritancecorrelation > 0, lineages have positive dependence, e.g. to model locus-specific inheritance probabilities, randomly drawn from a Beta distribution with mean γ across all loci (see Fogg, Allman & Ané 2023). If inheritancecorrelation is set to 1, then all lineages at a given locus inherit from the same (randomly sampled) parent. More generally, the lineages' parents are distributed according to a Dirichlet process with base distribution determined by the γ values, and with concentration parameter α = (1-r)/r, that is, r = 1/(1+α), where r is the input inheritance correlation.

example

Below we use a network with two "3₂" cycles, causing some quartets to be anomalous (see Ané et al. 2024), in the sense that the unrooted tree topology displayed in the network, for example A1,A2|B1,C has a lower frequency among gene trees than another topology not even displayed in the network. This is due to incomplete lineage sorting interacting with gene flow.

julia> using PhyloNetworks, QuartetNetworkGoodnessFit

julia> net = readnewick("(D:1,((C:1,#H25:0):0.1,((((B1:10,B2:1):1.5,#H1:0):10.8,
                ((A1:1,A2:1):0.001)#H1:0::0.5):0.5)#H25:0::0.501):1);");

julia> # using PhyloPlots; plot(net, showedgelength=true);

julia> q,t = network_expectedCF(net); # anomalous: A1, A2, {B1 or B2}, {C or D}
Calculation quartet CFs for 15 quartets...
0+---------------+100%
  ***************

julia> show(q[1].taxonnumber)
[1, 2, 3, 4]
julia> show(q[1].data)
[0.8885456713760765, 0.05572716431196175, 0.05572716431196175]

julia> for qi in q
         println(join(t[qi.taxonnumber],",") * ": " *
            string(round.(qi.data, sigdigits=3)))
       end
A1,A2,B1,B2: [0.889, 0.0557, 0.0557]
A1,A2,B1,C: [0.168, 0.416, 0.416]
A1,A2,B2,C: [0.168, 0.416, 0.416]
A1,B1,B2,C: [0.0372, 0.0372, 0.926]
A2,B1,B2,C: [0.0372, 0.0372, 0.926]
A1,A2,B1,D: [0.168, 0.416, 0.416]
A1,A2,B2,D: [0.168, 0.416, 0.416]
A1,B1,B2,D: [0.0372, 0.0372, 0.926]
A2,B1,B2,D: [0.0372, 0.0372, 0.926]
A1,A2,C,D: [0.69, 0.155, 0.155]
A1,B1,C,D: [0.793, 0.103, 0.103]
A2,B1,C,D: [0.793, 0.103, 0.103]
A1,B2,C,D: [0.793, 0.103, 0.103]
A2,B2,C,D: [0.793, 0.103, 0.103]
B1,B2,C,D: [1.0, 9.42e-7, 9.42e-7]

source

QuartetNetworkGoodnessFit.quarnetGoFtest! — Method

quarnetGoFtest!(net::HybridNetwork, df::DataFrame, optbl::Bool;
    quartetstat=:LRT, correction=:simulation, seed=1234, nsim=1000,
    verbose=false, keepfiles=false)
quarnetGoFtest!(net::HybridNetwork, dcf::DataCF,   optbl::Bool; kwargs...)

Goodness-of-fit test for the adequacy of the multispecies network coalescent, to see if a given population or species network explains the quartet concordance factor data adequately. The network needs to be of level 1 at most (trees fullfil this condition), and have branch lengths in coalescent units. The test assumes a multinomial distribution for the observed quartet concordance factors (CF), such that information on the number of genes for each four-taxon set (ngenes field) must be present.

For each four-taxon set, an outlier p-value is obtained by comparing a test statistic (-2log likelihood ratio by default) to a chi-square distribution with 2 degrees of freedom (see below for other options).

The four-taxon sets are then categorized as outliers or not, according to their outlier p-values (outlier if p<0.05). Finally, a one-sided goodness-of-fit test is performed on the frequency of outlier 4-taxon sets to see if there are more outliers than expected. The z-value for this test corresponds to the null hypothesis that 5% outlier p-values are < 0.05 (versus more than 5%):

\[z = \frac{\mathrm{proportion.outliers} - 0.05}{\sqrt{0.05 \times 0.95/\mathrm{number.4taxon.sets}}}.\]

This z-value corresponds to a test that assumes independent outlier p-values across 4-taxon sets: it makes no correction for dependence.

To correct for dependence with correction=:simulation, the distribution of z-values is obtained by simulating gene trees under the coalescent along the network (after branch length optimization if optbl=true) using PhyloCoalSimulations. The z-score is calculated on each simulated data set. Under independence, these z-scores have mean 0 and variance 1. Under dependence, these z-scores still have mean 0, but an inflated variance. This variance σ² is estimated from the simulations, and the corrected p-value is obtained by comparing the original z value to N(0,σ²). When correction=:none, σ is taken to be 1 (independence): not recommended!

The first version takes a DataFrame where each row corresponds to a given four-taxon set. The data frame is modified by having an additional another column containing the p-values corresponding to each four-taxon set.
The second version takes a DataCF object and modifies it by updating the expected concordance factors stored in that object.

Note that net is not modified.

arguments

optbl: when false, branch lengths in net are taken as is, and need to be in coalescent units. When optbl=true, branch lengths in net are optimized, to optimize the pseudo log likelihood score as in SNaQ (see here). In both cases, any missing branch length is assigned a value with ultrametrize!, which attempts to make the major tree ultrametric (but never modifies an existing edge length). Missing branch lengths may arise if they are not identifiable, such as lengths of external branches if there is a single allele per taxon. The network is returned as part of the output.

keyword arguments

quartetstat: the test statistic used to obtain an outlier p-value for each four-taxon set, which is then compared to a chi-squared distribution with 2 degrees of freedom to get a p-value.
- :LRT is the default, for the likelihood ratio: $2n_\mathrm{genes} \sum_{j=1}^3 {\hat p}_j (\log{\hat p}_j - \log p_j)$ where $p_j$ is the quartet CF expected from the network, and ${\hat p}_j$ is the quartet CF observed in the data.
- :Qlog for the Qlog statistics (Lorenzen, 1995): $2n_\mathrm{genes} \sum_{j=1}^3 \frac{({\hat p}_j - p_j)^2}{p_j (\log{\hat p}_j - \log p_j)}$ and
- :pearson for Pearon's chi-squared statistic, which behaves poorly when one or more expected counts are low (e.g. less than 5): $n_\mathrm{genes} \sum_{j=1}^3 \frac{({\hat p}_j - p_j)^2 }{p_j}$
correction=:simulation to correct for dependence across 4-taxon sets. Use :none to turn off simulations and the correction for dependence.
seed=1234: master seed to control the seeds for gene tree simulations.
nsim=1000: number of simulated data sets. Each data set is simulated to have the median number of genes that each 4-taxon sets has data for.
verbose=false: turn to true to see progress of simulations and to diagnose potential issues.
keepfiles=false: if true, simulated gene trees are written to files, one file for each of the 1000 simulations. If created (with keepfiles=true), these files are stored in a newly created folder whose name starts with jl_ and is placed in the current directory.

output

p-value of the overall goodness-of-fit test (corrected for dependence if requested)
uncorrected z value test statistic
estimated σ for the test statistic used for the correction (1.0 if no correction)
a vector of outlier p-values, one for each four-taxon set
network (first and second versions): net with net.fscore updated to contain the quartet CF score (pseudo log-likelihood up to a constant) if optbl is false; copy of net with optimized branch lengths and quartet CF score stored in net.fscore if optbl is true
in case correction = :simulation, vector of simulated z values (nothing if correction = :none). These z-values could be used to calculate an empirical p-value (instead of the p-value in #1), as the proportion of simulated z-values that are ⩾ the observed z-value (in #2).

references

Ruoyi Cai & Cécile Ané (2021). Assessing the fit of the multi-species network coalescent to multi-locus data. Bioinformatics, 37(5):634-641. doi: 10.1093/bioinformatics/btaa863
Lorenzen (1995). A new family of goodness-of-fit statistics for discrete multivariate data. Statistics & Probability Letters, 25(4):301-307. doi: 10.1016/0167-7152(94)00234-8

source

QuartetNetworkGoodnessFit.ticr! — Method

ticr!(net::HybridNetwork, df::DataFrame, optbl::Bool; quartetstat, test)
ticr!(net::HybridNetwork, dcf::DataCF,   optbl::Bool; quartetstat, test)
ticr(dcf::DataCF, quartetstat::Symbol, test::Symbol)

The tree / network needs to have branch lengths in coalescent units, and must be of level 1. It must be fully resolved, such that the "major" quartet is clearly defined, even though branch lengths can be 0.

The Dirichlet distribution is used to fit the distribution of the observed quartet concordance factors (CF), with a concentration parameter estimated from the data.

The four-taxon sets are then binned into categories according to their outlier p-values: 0-0.01, 0.01-0.05, 0.05-0.10, and 0.10-1. Finally, a one-sided goodness-of-fit test is performed on these binned frequencies to determine if they depart from an expected 5% p-values being below 0.05 (versus more than 5%), using the default test = :onesided. Alternatively, the option test = :goodness carries out the overall goodness-of-fit chi-square test proposed by Stenz et al. (2015), to test for any kind of departure from the expected frequencies (0.01, 0.04, 0.05, 0.90) across all 4 bins.

The first version takes a DataFrame where each row corresponds to a given four-taxon set. The data frame is modified by having an additional another column containing the outlier p-values corresponding to each four-taxon set.
The second version takes a DataCF object and modifies it by updating the expected concordance factors stored in that object.
The last version (which all others call) assumes that the expected concordance factors in the DataCF object are correctly calculated from the test network.

arguments

optbl: when false, net.fscore is updated with the network's quartet CF score (pseudo log-likelihood up to a constant) but branch lengths are taken as is. When true, a copy of net is used to conduce the test, with updated branch lengths (in coalescent units) and updated score. This network is returned.
quartetstat = :maxCF: test statistic used to obtain an outlier p-value for each four-taxon set. By default, it is the absolute difference between the observed CF and expected CF of the major resolution (which has the largest CF) if quartetstat=:maxCF, as used in Stenz et al. (2015). The other option is quartetstat=:minpval, in which case the absolute difference between the observed CF and expected CF is calculated for all 3 resolutions, leading to 3 (non-independent) p-values. The outlier p-value for a given four-taxon set is taken as the minimum of these 3 p-values. This option can detect all kinds of departure from the network model for a given four-taxon set, but is not recommended because it is very liberal.
test = :onesided: the overall goodness-of-fit test performed on binned frequencies of quartet outlier p-values; see above.

output

p-value of the overall goodness-of-fit test
test statistic (z value)
a dictionary of the count of p-values in each of the four category
a vector of two test parameters for the Dirichlet distribution:
- value of the concentration parameter α
- value of the pseudo likelihood (optimized at α)
a vector of outlier p-values, one for each four-taxon set
network (first and second versions): net with field net.fscore updated to contain the quartet CF score (pseudo log-likelihood up to a constant) if optbl is false; copy of net with optimized branch lengths and quartet CF score stored in net.fscore if optbl is true

references

NWM Stenz, B Larget, DA Baum and C Ané (2015). Exploring tree-like and non-tree-like patterns using genome sequences: An example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology, 64(5):809-823. doi: 10.1093/sysbio/syv039