Quartet test for goodness of fit

One can formally assess whether an estimated network fits the concordance factor data with the QuartetNetworkGoodnessFit package in Julia. This package simulates concordance factors from the estimated network under the network multispecies coalescent, computes how often outlier quartet topologies are observed, and compares this to the observed concordance factors to perform a test for goodness-of-fit. See the QuartetNetworkGoodnessFit documentation for an example of this test.

Visualizing observed and expected CFs

A good way to visualize the "goodness-of-fit" of a given estimated network to the data is to plot the observed CF versus the expected CF. If the network is a good fit, then the dots in the plot will be close to the diagonal (x=y line).

The following function will create a dataframe with the observed and expected CFs, which are all saved in the DataCF object after running snaq:

julia> topologymaxQpseudolik!(truenet, raxmlCF);
julia> df_wide = fittedquartetCF(raxmlCF) # same as fittedquartetCF(raxmlCF, :wide)15×10 DataFrame
 Row │ tx1     tx2     tx3     tx4     obsCF12   obsCF13    obsCF14    expCF12 ⋯
     │ String  String  String  String  Float64   Float64    Float64    Float64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ A       B       C       D       1.0       0.0        0.0        0.99997 ⋯
   2 │ A       B       C       E       0.833333  0.0333333  0.133333   0.83420
   3 │ A       B       D       E       0.833333  0.0333333  0.133333   0.83420
   4 │ A       C       D       E       0.0       0.0        1.0        2.2166e
   5 │ B       C       D       E       0.0       0.0        1.0        2.2166e ⋯
   6 │ A       B       C       O       0.866667  0.0        0.133333   0.86274
   7 │ A       B       D       O       0.866667  0.0        0.133333   0.86274
   8 │ A       C       D       O       0.0       0.0        1.0        2.41479
   9 │ B       C       D       O       0.0       0.0        1.0        2.41479 ⋯
  10 │ A       B       E       O       0.833333  0.0666667  0.1        0.83880
  11 │ A       C       E       O       0.533333  0.333333   0.133333   0.60037
  12 │ B       C       E       O       0.666667  0.266667   0.0666667  0.60037
  13 │ A       D       E       O       0.533333  0.333333   0.133333   0.60037 ⋯
  14 │ B       D       E       O       0.666667  0.266667   0.0666667  0.60037
  15 │ C       D       E       O       1.0       0.0        0.0        0.99997
                                                               3 columns omitted
julia> df_long = fittedquartetCF(raxmlCF, :long)45×7 DataFrame
 Row │ tx1     tx2     tx3     tx4     quartet  obsCF      expCF      
     │ String  String  String  String  String   Float64    Float64    
─────┼────────────────────────────────────────────────────────────────
   1 │ A       B       C       D       12_34    1.0        0.99997
   2 │ A       B       C       D       13_24    0.0        1.51333e-5
   3 │ A       B       C       D       14_23    0.0        1.51333e-5
   4 │ A       B       C       E       12_34    0.833333   0.834209
   5 │ A       B       C       E       13_24    0.0333333  0.0828954
   6 │ A       B       C       E       14_23    0.133333   0.0828954
   7 │ A       B       D       E       12_34    0.833333   0.834209
   8 │ A       B       D       E       13_24    0.0333333  0.0828954
  ⋮  │   ⋮       ⋮       ⋮       ⋮        ⋮         ⋮          ⋮
  39 │ A       D       E       O       14_23    0.133333   0.100037
  40 │ B       D       E       O       12_34    0.666667   0.60037
  41 │ B       D       E       O       13_24    0.266667   0.299592
  42 │ B       D       E       O       14_23    0.0666667  0.100037
  43 │ C       D       E       O       12_34    1.0        0.99997
  44 │ C       D       E       O       13_24    0.0        1.51333e-5
  45 │ C       D       E       O       14_23    0.0        1.51333e-5
                                                       30 rows omitted

It is important to have run snaq!, topologyQpseudolik! or topologymaxQpseudolik! before making these tables, or the result would be meaningless. These functions update the fitted concordance factors (those expected under the network) inside the DataCF object raxmlCF.

Here is one way to plot them, via R again, and using the R package ggplot2.

using RCall

To install ggplot2 if not installed already, do: R"install.packages('ggplot2', dep=TRUE)"

@rlibrary ggplot2
ggplot(df_long, aes(x=:obsCF,y=:expCF)) + theme_classic() +
    geom_segment(x=0,y=0,xend=1,yend=1, color="#008080", size=0.3) + # diagonal line
    geom_point(alpha=0.5, color="#008080", position=position_jitter(width=0.005, height=0.005)) +
    ylab("quartet CF expected from network") + xlab("quartet CF observed in gene trees") + coord_equal(ratio=1);
# if needed, save with:
ggsave("expCFs_obsvsfitted.svg", scale=1, width=6, height=5);

obsvsfitted

Many points are overlapping, so they were "jittered" a little to see them all better. There are always many points overlapping on the bottom-left corner: concordance factors of 0.0 for quartet resolutions not observed, and not expected. To export the table of quartet CFs and explore the fit of the network with other tools:

using CSV
CSV.write("fittedCF.csv", df_long)

We could highlight quartets that include taxon A, say, if we suspect that it is an unrecognized hybrid. Many points are overlapping, like before, so they are again "jittered" a bit.

using DataFrames
df_long[!,:has_A] .= "no"; # add a column to our data, to indicate which 4-taxon sets have A or not
for r in eachrow(df_long)
    if "A" ∈ [r[:tx1], r[:tx2], r[:tx3], r[:tx4]]
       r[:has_A]="yes"
    end
end

julia> first(df_long, 7) # first 7 rows7×8 DataFrame
 Row │ tx1     tx2     tx3     tx4     quartet  obsCF      expCF       has_A  
     │ String  String  String  String  String   Float64    Float64     String 
─────┼────────────────────────────────────────────────────────────────────────
   1 │ A       B       C       D       12_34    1.0        0.99997     yes
   2 │ A       B       C       D       13_24    0.0        1.51333e-5  yes
   3 │ A       B       C       D       14_23    0.0        1.51333e-5  yes
   4 │ A       B       C       E       12_34    0.833333   0.834209    yes
   5 │ A       B       C       E       13_24    0.0333333  0.0828954   yes
   6 │ A       B       C       E       14_23    0.133333   0.0828954   yes
   7 │ A       B       D       E       12_34    0.833333   0.834209    yes

ggplot(df_long, aes(x=:obsCF, y=:expCF, color=:has_A)) + theme_classic() +
    geom_segment(x=0,y=0,xend=1,yend=1, color="black", size=0.3) + # diagonal line
    geom_point(alpha=0.5, position=position_jitter(width=0.005, height=0.005)) +
    ylab("quartet CF expected from network") + xlab("quartet CF observed in gene trees") + coord_equal(ratio=1);
# can be saved:
ggsave("expCFs_obsvsfitted_A.svg", width=6, height=5);

obsvsfitted A present or not