public documentation
Documentation for QuartetNetworkGoodnessFit
's public (exported) functions.
index
QuartetNetworkGoodnessFit.network_expectedCF
QuartetNetworkGoodnessFit.quarnetGoFtest!
QuartetNetworkGoodnessFit.ticr!
functions
QuartetNetworkGoodnessFit.network_expectedCF
— Methodnetwork_expectedCF(net::HybridNetwork; showprogressbar=true,
inheritancecorrelation=0)
Calculate the quartet concordance factors (qCF) expected from the multispecies coalescent along network net
. Output: (q,t)
where t
is a list of taxa, and q
is a list of 4-taxon set objects of type PhyloNetworks.QuartetT{datatype}
. In each element of q
, taxonnumber
gives the indices in taxa
of the 4 taxa of interest; and data
contains the 3 concordance factors, for the 3 unrooted topologies in the following order: t1,t2|t3,t4
, t1,t3|t2,t4
and t1,t4|t2,t3
. This output is similar to that of PhyloNetworks.countquartetsintrees
when 1 individual = 1 taxon, with 4-taxon sets listed in the same order (same output t
, then same order of 4-taxon sets in q
).
Assumption: the network should have edge lengths in coalescent units.
By default, lineages at a hybrid node come from a parent (chosen according to inheritance probabilities γ) independently across lineages. With option inheritancecorrelation > 0
, lineages have positive dependence, e.g. to model locus-specific inheritance probabilities, randomly drawn from a Beta distribution with mean γ across all loci. If inheritancecorrelation
is set to 1, then all lineages at a given locus inherit from the same (randomly sampled) parent. More generally, the lineages' parents are distributed according to a Dirichlet process with base distribution determined by the γ values, and with concentration parameter α = (1-r)/r, that is, r = 1/(1+α), where r
is the input inheritance correlation.
examples
julia> using PhyloNetworks, QuartetNetworkGoodnessFit
julia> # network with 3_2 cycles, causing some anomalous quartets
net = readTopology("(D:1,((C:1,#H25:0):0.1,((((B1:10,B2:1):1.5,#H1:0):10.8,((A1:1,A2:1):0.001)#H1:0::0.5):0.5)#H25:0::0.501):1);");
julia> # using PhyloPlots; plot(net, showedgelength=true);
julia> q,t = network_expectedCF(net); # anomalous: A1, A2, {B1 or B2}, {C or D}
Calculation quartet CFs for 15 quartets...
0+---------------+100%
***************
julia> show(q[1].taxonnumber)
[1, 2, 3, 4]
julia> show(q[1].data)
[0.8885456713760765, 0.05572716431196175, 0.05572716431196175]
julia> for qi in q
println(join(t[qi.taxonnumber],",") * ": " * string(round.(qi.data, sigdigits=3)))
end
A1,A2,B1,B2: [0.889, 0.0557, 0.0557]
A1,A2,B1,C: [0.168, 0.416, 0.416]
A1,A2,B2,C: [0.168, 0.416, 0.416]
A1,B1,B2,C: [0.0372, 0.0372, 0.926]
A2,B1,B2,C: [0.0372, 0.0372, 0.926]
A1,A2,B1,D: [0.168, 0.416, 0.416]
A1,A2,B2,D: [0.168, 0.416, 0.416]
A1,B1,B2,D: [0.0372, 0.0372, 0.926]
A2,B1,B2,D: [0.0372, 0.0372, 0.926]
A1,A2,C,D: [0.69, 0.155, 0.155]
A1,B1,C,D: [0.793, 0.103, 0.103]
A2,B1,C,D: [0.793, 0.103, 0.103]
A1,B2,C,D: [0.793, 0.103, 0.103]
A2,B2,C,D: [0.793, 0.103, 0.103]
B1,B2,C,D: [1.0, 9.42e-7, 9.42e-7]
QuartetNetworkGoodnessFit.quarnetGoFtest!
— MethodquarnetGoFtest!(net::HybridNetwork, df::DataFrame, optbl::Bool; quartetstat=:LRT, correction=:simulation, seed=1234, nsim=1000, verbose=false, keepfiles=false)
quarnetGoFtest!(net::HybridNetwork, dcf::DataCF, optbl::Bool; kwargs...)
Goodness-of-fit test for the adequacy of the multispecies network coalescent, to see if a given population or species network explains the quartet concordance factor data adequately. The network needs to be of level 1 at most (trees fullfil this condition), and have branch lengths in coalescent units. The test assumes a multinomial distribution for the observed quartet concordance factors (CF), such that information on the number of genes for each four-taxon set (ngenes
field) must be present.
For each four-taxon set, an outlier p-value is obtained by comparing a test statistic (-2log likelihood ratio by default) to a chi-square distribution with 2 degrees of freedom (see below for other options).
The four-taxon sets are then categorized as outliers or not, according to their outlier p-values (outlier if p<0.05). Finally, a one-sided goodness-of-fit test is performed on the frequency of outlier 4-taxon sets to see if there are more outliers than expected. The z-value for this test corresponds to the null hypothesis that 5% outlier p-values are < 0.05 (versus more than 5%):
\[z = \frac{\mathrm{proportion.outliers} - 0.05}{\sqrt{0.05 \times 0.95/\mathrm{number.4taxon.sets}}}.\]
This z-value corresponds to a test that assumes independent outlier p-values across 4-taxon sets: it makes no correction for dependence.
To correct for dependence with correction=:simulation
, the distribution of z-values is obtained by simulating gene trees under the coalescent along the network (after branch length optimization if optbl=true
) using PhyloCoalSimulations. The z-score is calculated on each simulated data set. Under independence, these z-scores have mean 0 and variance 1. Under dependence, these z-scores still have mean 0, but an inflated variance. This variance σ² is estimated from the simulations, and the corrected p-value is obtained by comparing the original z value to N(0,σ²). When correction=:none
, σ is taken to be 1 (independence): not recommended!
- The first version takes a
DataFrame
where each row corresponds to a given four-taxon set. The data frame is modified by having an additional another column containing the p-values corresponding to each four-taxon set. - The second version takes a
DataCF
object and modifies it by updating the expected concordance factors stored in that object.
Note that net
is not modified.
arguments
optbl
: whenfalse
, branch lengths innet
are taken as is, and need to be in coalescent units. Whenoptbl=true
, branch lengths innet
are optimized, to optimize the pseudo log likelihood score as in SNaQ (see here). In both cases, any missing branch length is assigned a value withultrametrize!
, which attempts to make the major tree ultrametric (but never modifies an existing edge length). Missing branch lengths may arise if they are not identifiable, such as lengths of external branches if there is a single allele per taxon. The network is returned as part of the output.
keyword arguments
quartetstat
: the test statistic used to obtain an outlier p-value for each four-taxon set, which is then compared to a chi-squared distribution with 2 degrees of freedom to get a p-value.:LRT
is the default, for the likelihood ratio: $2n_\mathrm{genes} \sum_{j=1}^3 {\hat p}_j (\log{\hat p}_j - \log p_j)$ where $p_j$ is the quartet CF expected from the network, and ${\hat p}_j$ is the quartet CF observed in the data.:Qlog
for the Qlog statistics (Lorenzen, 1995): $2n_\mathrm{genes} \sum_{j=1}^3 \frac{({\hat p}_j - p_j)^2}{p_j (\log{\hat p}_j - \log p_j)}$ and:pearson
for Pearon's chi-squared statistic, which behaves poorly when one or more expected counts are low (e.g. less than 5): $n_\mathrm{genes} \sum_{j=1}^3 \frac{({\hat p}_j - p_j)^2 }{p_j}$
correction=:simulation
to correct for dependence across 4-taxon. Use:none
to turn off simulations and the correction for dependence.seed=1234
: master seed to control the seeds for gene tree simulations.nsim=1000
: number of simulated data sets. Each data set is simulated to have the median number of genes that each 4-taxon sets has data for.verbose=false
: turn totrue
to see progress of simulations and to diagnose potential issues.keepfiles=false
: if true, simulated gene trees are written to files, one file for each of the 1000 simulations. If created (withkeepfiles=true
), these files are stored in a newly created folder whose name starts withjl_
and is placed in the current directory.
output
- p-value of the overall goodness-of-fit test (corrected for dependence if requested)
- uncorrected z value test statistic
- estimated σ for the test statistic used for the correction (1.0 if no correction)
- a vector of outlier p-values, one for each four-taxon set
- network (first and second versions):
net
with loglik field updated ifoptbl
is false; copy ofnet
with optimized branch lengths and loglik ifoptbl
is true - in case
correction = :simulation
, vector of simulated z values (nothing
ifcorrection = :none
). These z-values could be used to calculate an empirical p-value (instead of the p-value in #1), as the proportion of simulated z-values that are ⩾ the observed z-value (in #2).
references
Ruoyi Cai & Cécile Ané (2021). Assessing the fit of the multi-species network coalescent to multi-locus data. Bioinformatics, 37(5):634-641. doi: 10.1093/bioinformatics/btaa863
Lorenzen (1995). A new family of goodness-of-fit statistics for discrete multivariate data. Statistics & Probability Letters, 25(4):301-307. doi: 10.1016/0167-7152(94)00234-8
QuartetNetworkGoodnessFit.ticr!
— Methodticr!(net::HybridNetwork, df::DataFrame, optbl::Bool; quartetstat, test)
ticr!(net::HybridNetwork, dcf::DataCF, optbl::Bool; quartetstat, test)
ticr(dcf::DataCF, quartetstat::Symbol, test::Symbol)
Goodness-of-fit test for the adequacy of the multispecies network coalescent, to see if a given population or species network explains the quartet concordance factor data adequately. See Stenz et al 2015 and addendum for the method on trees, from which the acronym TICR was derived: Tree Incongruence Checking with R.
The tree / network needs to have branch lengths in coalescent units, and must be of level 1. It must be fully resolved, such that the "major" quartet is clearly defined, even though branch lengths can be 0.
The Dirichlet distribution is used to fit the distribution of the observed quartet concordance factors (CF), with a concentration parameter estimated from the data.
The four-taxon sets are then binned into categories according to their outlier p-values: 0-0.01
, 0.01-0.05
, 0.05-0.10
, and 0.10-1
. Finally, a one-sided goodness-of-fit test is performed on these binned frequencies to determine if they depart from an expected 5% p-values being below 0.05 (versus more than 5%), using the default test = :onesided
. Alternatively, the option test = :goodness
carries out the overall goodness-of-fit chi-square test proposed by Stenz et al. (2015), to test for any kind of departure from the expected frequencies (0.01, 0.04, 0.05, 0.90) across all 4 bins.
- The first version takes a
DataFrame
where each row corresponds to a given four-taxon set. The data frame is modified by having an additional another column containing the outlier p-values corresponding to each four-taxon set. - The second version takes a
DataCF
object and modifies it by updating the expected concordance factors stored in that object. - The last version (which all others call) assumes that the expected concordance factors in the
DataCF
object are correctly calculated from the test network.
arguments
optbl
: whenfalse
, the loglik field ofnet
is updated but branch lengths are taken as is. Whentrue
, a copy ofnet
is used to conduce the test, with updated branch lengths (in coalescent units) and updated loglik. This network is returned.quartetstat = :maxCF
: test statistic used to obtain an outlier p-value for each four-taxon set. By default, it is the absolute difference between the observed CF and expected CF of the major resolution (which has the largest CF) ifquartetstat=:maxCF
, as used in Stenz et al. (2015). The other option isquartetstat=:minpval
, in which case the absolute difference between the observed CF and expected CF is calculated for all 3 resolutions, leading to 3 (non-independent) p-values. The outlier p-value for a given four-taxon set is taken as the minimum of these 3 p-values. This option can detect all kinds of departure from the network model for a given four-taxon set, but is not recommended because it is very liberal.test = :onesided
: the overall goodness-of-fit test performed on binned frequencies of quartet outlier p-values; see above.
output
- p-value of the overall goodness-of-fit test
- test statistic (z value)
- a dictionary of the count of p-values in each of the four category
- a vector of two test parameters for the Dirichlet distribution:
- value of the concentration parameter α
- value of the pseudo likelihood (optimized at α)
- a vector of outlier p-values, one for each four-taxon set
- network (first and second versions):
net
with loglik field updated ifoptbl
is false; copy ofnet
with optimized branch lengths and loglik ifoptbl
is true
references
NWM Stenz, B Larget, DA Baum and C Ané (2015). Exploring tree-like and non-tree-like patterns using genome sequences: An example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology, 64(5):809-823. doi: 10.1093/sysbio/syv039
see also this addendum