Quartet test for goodness of fit
One can formally assess whether an estimated network fits the concordance factor data with the QuartetNetworkGoodnessFit
package in Julia. This package simulates concordance factors from the estimated network under the network multispecies coalescent, computes how often outlier quartet topologies are observed, and compares this to the observed concordance factors to perform a test for goodness-of-fit. See the QuartetNetworkGoodnessFit
documentation for an example of this test.
Visualizing observed and expected CFs
A good way to visualize the "goodness-of-fit" of a given estimated network to the data is to plot the observed CF versus the expected CF. If the network is a good fit, then the dots in the plot will be close to the diagonal (x=y line).
The following function will create a dataframe with the observed and expected CFs, which are all saved in the DataCF object after running snaq:
julia> topologymaxQpseudolik!(truenet, raxmlCF);
julia> df_wide = fittedquartetCF(raxmlCF) # same as fittedquartetCF(raxmlCF, :wide)
15×10 DataFrame Row │ tx1 tx2 tx3 tx4 obsCF12 obsCF13 obsCF14 expCF12 ⋯ │ String String String String Float64 Float64 Float64 Float64 ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ A B C D 1.0 0.0 0.0 0.99997 ⋯ 2 │ A B C E 0.833333 0.0333333 0.133333 0.83420 3 │ A B D E 0.833333 0.0333333 0.133333 0.83420 4 │ A C D E 0.0 0.0 1.0 2.2166e 5 │ B C D E 0.0 0.0 1.0 2.2166e ⋯ 6 │ A B C O 0.866667 0.0 0.133333 0.86274 7 │ A B D O 0.866667 0.0 0.133333 0.86274 8 │ A C D O 0.0 0.0 1.0 2.41479 9 │ B C D O 0.0 0.0 1.0 2.41479 ⋯ 10 │ A B E O 0.833333 0.0666667 0.1 0.83880 11 │ A C E O 0.533333 0.333333 0.133333 0.60037 12 │ B C E O 0.666667 0.266667 0.0666667 0.60037 13 │ A D E O 0.533333 0.333333 0.133333 0.60037 ⋯ 14 │ B D E O 0.666667 0.266667 0.0666667 0.60037 15 │ C D E O 1.0 0.0 0.0 0.99997 3 columns omitted
julia> df_long = fittedquartetCF(raxmlCF, :long)
45×7 DataFrame Row │ tx1 tx2 tx3 tx4 quartet obsCF expCF │ String String String String String Float64 Float64 ─────┼──────────────────────────────────────────────────────────────── 1 │ A B C D 12_34 1.0 0.99997 2 │ A B C D 13_24 0.0 1.51333e-5 3 │ A B C D 14_23 0.0 1.51333e-5 4 │ A B C E 12_34 0.833333 0.834209 5 │ A B C E 13_24 0.0333333 0.0828954 6 │ A B C E 14_23 0.133333 0.0828954 7 │ A B D E 12_34 0.833333 0.834209 8 │ A B D E 13_24 0.0333333 0.0828954 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 39 │ A D E O 14_23 0.133333 0.100037 40 │ B D E O 12_34 0.666667 0.60037 41 │ B D E O 13_24 0.266667 0.299592 42 │ B D E O 14_23 0.0666667 0.100037 43 │ C D E O 12_34 1.0 0.99997 44 │ C D E O 13_24 0.0 1.51333e-5 45 │ C D E O 14_23 0.0 1.51333e-5 30 rows omitted
It is important to have run snaq!
, topologyQpseudolik!
or topologymaxQpseudolik!
before making these tables, or the result would be meaningless. These functions update the fitted concordance factors (those expected under the network) inside the DataCF object raxmlCF
.
Here is one way to plot them, via R again, and using the R package ggplot2
.
using RCall
To install ggplot2 if not installed already, do: R"install.packages('ggplot2', dep=TRUE)"
@rlibrary ggplot2
ggplot(df_long, aes(x=:obsCF,y=:expCF)) + theme_classic() +
geom_segment(x=0,y=0,xend=1,yend=1, color="#008080", size=0.3) + # diagonal line
geom_point(alpha=0.5, color="#008080", position=position_jitter(width=0.005, height=0.005)) +
ylab("quartet CF expected from network") + xlab("quartet CF observed in gene trees") + coord_equal(ratio=1);
# if needed, save with:
ggsave("expCFs_obsvsfitted.svg", scale=1, width=6, height=5);
Many points are overlapping, so they were "jittered" a little to see them all better. There are always many points overlapping on the bottom-left corner: concordance factors of 0.0 for quartet resolutions not observed, and not expected. To export the table of quartet CFs and explore the fit of the network with other tools:
using CSV
CSV.write("fittedCF.csv", df_long)
We could highlight quartets that include taxon A, say, if we suspect that it is an unrecognized hybrid. Many points are overlapping, like before, so they are again "jittered" a bit.
using DataFrames
df_long[!,:has_A] .= "no"; # add a column to our data, to indicate which 4-taxon sets have A or not
for r in eachrow(df_long)
if "A" ∈ [r[:tx1], r[:tx2], r[:tx3], r[:tx4]]
r[:has_A]="yes"
end
end
julia> first(df_long, 7) # first 7 rows
7×8 DataFrame Row │ tx1 tx2 tx3 tx4 quartet obsCF expCF has_A │ String String String String String Float64 Float64 String ─────┼──────────────────────────────────────────────────────────────────────── 1 │ A B C D 12_34 1.0 0.99997 yes 2 │ A B C D 13_24 0.0 1.51333e-5 yes 3 │ A B C D 14_23 0.0 1.51333e-5 yes 4 │ A B C E 12_34 0.833333 0.834209 yes 5 │ A B C E 13_24 0.0333333 0.0828954 yes 6 │ A B C E 14_23 0.133333 0.0828954 yes 7 │ A B D E 12_34 0.833333 0.834209 yes
ggplot(df_long, aes(x=:obsCF, y=:expCF, color=:has_A)) + theme_classic() +
geom_segment(x=0,y=0,xend=1,yend=1, color="black", size=0.3) + # diagonal line
geom_point(alpha=0.5, position=position_jitter(width=0.005, height=0.005)) +
ylab("quartet CF expected from network") + xlab("quartet CF observed in gene trees") + coord_equal(ratio=1);
# can be saved:
ggsave("expCFs_obsvsfitted_A.svg", width=6, height=5);