Average distances

Code

# code from prior sections: load packages used here
using CSV, DataFrames, PhyloNetworks, StatsBase

For trait evolution, we need a phylogeny with branch lengths. These lengths will be used to parametrize how much trait evolution occurred along each branch (up to a factor that will be estimated). So they need to represent evolutionary time that can reasonably be assumed to represent the amount of trait evolution. Typically, this assumption is reasonable for branch lengths proportional to calendar time, or to substitutions/site (amount of molecular evolution).

In this section and next, we show how to “calibrate” a given network topology, to have branch lengths in substitutions per site.

We want these lengths to represent substitutions/site observed in data, so this section show how to estimate genetic distances from gene trees. These genetic distances are used in the next section to calibrate a network.

Using gene trees from multiple loci may be best when loci are sufficiently informative. With short loci or with loci with few informative sites, some branch lengths may often be underestimated at 0 due to a lack of variable sites, in which case it may be best to estimate genetic distances from the multiple alignment directly.

If loci are sufficiently informative, estimating a gene tree from each locus may help reduce noise in the estimated distances (e.g. account for multiple substitutions along the same lineage). It can also decrease the influence of fast-evolving genes, and rate variation across genes more generally.

We first load helper functions for computing pairwise distances, to

get a matrix of pairwise distances from each tree
normalize these distance matrices, so they all have a similar distance between the ingroup and outgroup taxa
average distance matrices across genes.

These functions were originally developed by Karimi et al. (2020). To load them, do this below within your julia session:

include(download(
  "https://raw.githubusercontent.com/JuliaPhylo/PhyloUtilities/refs/heads/main/scripts/averagetaxondistances.jl"
))

(or copy & paste them)

This tutorial comes with a small sample of gene trees, some of them modified so as to illustrate the case when taxa are missing from some genes. These gene tree files are in folder data/genetrees_sample, and are majority-rule consensus trees in nexus format created by MrBayes. This is only to give example code to read gene tree files and calculate distances. In practice, the majority-rule creates edges of length 0 (polytomies) when there is less than 50% credibility for a resolution in the gene tree, so this is underestimating the true edge length, and a full consensus tree may be a better option.

Below, we show how to read these files using readnexus_treeblock because of the nexus format. This function is meant to read a sample of multiple trees (or multiple networks), so it returns a list. We only use the first and only tree in this list. As we can see, loci L10 and L20 are missing a few taxa.

genetreefolder = joinpath("data","genetrees_sample")
# list the content of this folder, and filter to keep files ending in ".tre":
genetreefiles = filter(x -> endswith(x,".tre"), readdir(genetreefolder))
length(genetreefiles) # 19 files: L22 missing
function read1gene(filename)
  genetreefile = joinpath(genetreefolder, filename)
  return readnexus_treeblock(genetreefile)[1] # first and only tree in the list
end
genetrees = read1gene.(genetreefiles)
for (filename,tree) in zip(genetreefiles, genetrees)
  println("$filename: $(tree.numtaxa) taxa")
end

T405_L10.nex.con.tre: 14 taxa
T405_L11.nex.con.tre: 17 taxa
T405_L12.nex.con.tre: 17 taxa
T405_L13.nex.con.tre: 17 taxa
T405_L14.nex.con.tre: 17 taxa
T405_L15.nex.con.tre: 17 taxa
T405_L16.nex.con.tre: 17 taxa
T405_L17.nex.con.tre: 17 taxa
T405_L18.nex.con.tre: 17 taxa
T405_L19.nex.con.tre: 17 taxa
T405_L20.nex.con.tre: 16 taxa
T405_L21.nex.con.tre: 17 taxa
T405_L23.nex.con.tre: 17 taxa
T405_L24.nex.con.tre: 17 taxa
T405_L25.nex.con.tre: 17 taxa
T405_L26.nex.con.tre: 17 taxa
T405_L27.nex.con.tre: 17 taxa
T405_L28.nex.con.tre: 17 taxa
T405_L29.nex.con.tre: 17 taxa

To convert each tree to a matrix of pairwise distances, we first need to decide the order in which the taxa should be listed.

taxa_long = sort!(union(tiplabels.(genetrees)...))
D, ngenes, geneind = getpairwisedistances(genetrees, taxa_long);
geneind # indices of genes with some missing taxa

2-element Vector{Int64}:
  1
 11

ngenes # number of genes with data for each pair: 19 for most, some 17's and 18's

17×17 Matrix{Int64}:
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17  17
 18  18  18  18  18  18  18  17  18  18  18  18  18  18  18  18  18
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 18  18  18  18  18  18  18  17  18  18  18  18  18  18  18  18  18
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19
 19  19  19  19  19  19  19  17  18  19  19  19  18  19  19  19  19

D[11] # distance matrix from 11th gene in the list. if missing taxa, then 0s

17×17 Matrix{Float64}:
 0.0         0.031831   0.0261231  …  0.0407398  0.0113065   0.0381869
 0.031831    0.0        0.0191849     0.0272717  0.0307095   0.035106
 0.0261231   0.0191849  0.0           0.0280938  0.0250017   0.0293982
 0.0407076   0.0376267  0.0319188     0.0465355  0.0395861   0.00606439
 0.0324402   0.0293594  0.0236515     0.0382682  0.0313188   0.0283365
 0.0402938   0.0372129  0.031505   …  0.0461218  0.0391723   0.029748
 0.0107768   0.0301798  0.024472      0.0390887  0.00965534  0.0365358
 0.0         0.0        0.0           0.0        0.0         0.0
 0.00702223  0.0264253  0.0207174     0.0353341  0.00590078  0.0327812
 0.00695353  0.0263566  0.0206487     0.0352654  0.00583209  0.0327125
 0.00697372  0.0263767  0.0206689  …  0.0352856  0.00585227  0.0327327
 0.0279844   0.0335855  0.0278776     0.0424944  0.0268629   0.0399415
 0.00695846  0.0263615  0.0206536     0.0352704  0.00583701  0.0327174
 0.0365629   0.0230947  0.0239168     0.0198704  0.0354414   0.039838
 0.0407398   0.0272717  0.0280938     0.0        0.0396184   0.0440149
 0.0113065   0.0307095  0.0250017  …  0.0396184  0.0         0.0370655
 0.0381869   0.035106   0.0293982     0.0440149  0.0370655   0.0

Now we want to rescale each distance matrix so as to limit the impact of rate variation across genes. Otherwise, fast-evolving genes would overwhelm the signal, which may be particularly dangerous if these fast-evolving genes have saturated edge lengths or are affected by long-branch attraction.

For normalizing the median distance between all pairs of ingroup-outgroup taxa, we define these sets of taxa:

outgroup = filter(x -> occursin("micranthum",x), taxa_long)
ingroup = setdiff(taxa_long, outgroup)
med_in2out = normalizedistances_outgroup2ingroup!(D,
  taxa=taxa_long, ingroup=ingroup, outgroup=outgroup);
med_in2out

19-element Vector{Float64}:
 0.028521906
 0.019770464
 0.017636436499999998
 0.0224220723
 0.028571541
 0.038314872
 0.050856829650000004
 0.039494405
 0.0235365128
 0.0321006079
 0.038268231
 0.018588236999999997
 0.0256485685
 0.013689591000000001
 0.00205620645
 0.032686909
 0.0434885215
 0.0146293095
 0.009773589499999999

We can see that all genes had somewhat similar rates, because similar median ingroup-to-outgroup distances, except for gene 15, that had a particularly smaller rate than others. If there is reason to think that slow-evolving genes may need to be removed, a filtering step could be added at this point. For the sake of providing example code, we do this here.

slowgene_indices = findall(d -> d < 0.003, med_in2out) # gene 15 here

1-element Vector{Int64}:
 15

Gene 15 was not missing any taxa: geneind did not list that gene. So we don’t have to re-run the calculation of the vector of D matrix and of the matrix storing the number of genes contributing data for each pair of taxa. Otherwise, we would need to re-run the calculation of ngenes.

deleteat!(D, slowgene_indices); # deletes D[k] for a slow gene k
num_slow = length(slowgene_indices) # 1: number of slow genes that are being exluded
ngenes .-= num_slow # gene 15 was not missing any taxa
length(D) # 18: 1 fewer than before

Now we can see how the distance matrix changed for the first gene, which we looked at earlier:

D[11] # from 11th gene, after normalization. 0s on rows & columns for missing taxa

17×17 Matrix{Float64}:
 0.0         0.0218915  0.017966   …  0.0280185  0.00777595  0.0262628
 0.0218915   0.0        0.0131943     0.0187559  0.0211202   0.0241439
 0.017966    0.0131943  0.0           0.0193213  0.0171947   0.0202184
 0.0279963   0.0258774  0.0219519     0.0320045  0.027225    0.00417073
 0.0223105   0.0201917  0.0162661     0.0263187  0.0215393   0.0194882
 0.0277117   0.0255929  0.0216673  …  0.0317199  0.0269405   0.0204589
 0.00741165  0.0207559  0.0168304     0.0268829  0.00664039  0.0251272
 0.0         0.0        0.0           0.0        0.0         0.0
 0.00482948  0.0181738  0.0142482     0.0243008  0.00405822  0.022545
 0.00478224  0.0181265  0.014201      0.0242535  0.00401097  0.0224978
 0.00479612  0.0181404  0.0142149  …  0.0242674  0.00402485  0.0225117
 0.019246    0.0230982  0.0191726     0.0292252  0.0184748   0.0274694
 0.00478562  0.0181299  0.0142044     0.0242569  0.00401436  0.0225012
 0.0251458   0.0158832  0.0164486     0.0136657  0.0243746   0.0273982
 0.0280185   0.0187559  0.0193213     0.0        0.0272472   0.0302709
 0.00777595  0.0211202  0.0171947  …  0.0272472  0.0         0.0254915
 0.0262628   0.0241439  0.0202184     0.0302709  0.0254915   0.0

Finally, let’s average distances across genes

avD = averagepairwisedistances(D, ngenes)

17×17 Matrix{Float64}:
 0.0        0.025131   0.0166456  …  0.0333717  0.0204769  0.0221704
 0.025131   0.0        0.0201658     0.0288717  0.0161611  0.0170719
 0.0166456  0.0201658  0.0           0.0314511  0.0187754  0.0191991
 0.0167357  0.0204018  0.012726      0.0325606  0.0202637  0.0158341
 0.0173756  0.0195113  0.0129158     0.0316148  0.0196154  0.0206315
 0.0246931  0.0204742  0.0210903  …  0.0281747  0.0168907  0.01409
 0.0180386  0.017438   0.0202813     0.0293285  0.0154165  0.0160886
 0.0218215  0.0203966  0.0241864     0.0308539  0.017866   0.0173078
 0.0198536  0.0154942  0.0183717     0.0255542  0.0122323  0.0132618
 0.0213477  0.0148121  0.0182099     0.0251534  0.0129911  0.0109218
 0.0130298  0.0221036  0.0126574  …  0.0314761  0.0164174  0.0202386
 0.0181169  0.0254095  0.0138355     0.0301603  0.0199716  0.0212204
 0.0175151  0.0144697  0.0180279     0.0253002  0.010981   0.0114824
 0.0221586  0.0148132  0.0163897     0.0241507  0.0123825  0.0128827
 0.0333717  0.0288717  0.0314511     0.0        0.0249235  0.0244055
 0.0204769  0.0161611  0.0187754  …  0.0249235  0.0        0.0146991
 0.0221704  0.0170719  0.0191991     0.0244055  0.0146991  0.0

and save this average distance matrix to a file:

csv_filename = joinpath(genetreefolder, "averagedist_sample.csv")
avD_df = DataFrame([avD[:,j] for j in 1:size(avD,2)], # column data
                   [Symbol(t) for t in taxa_long])    # column names
CSV.write(csv_filename, avD_df)

"data/genetrees_sample/averagedist_sample.csv"