Multiple alleles per species

This documentation pertains to SNaQ v1.0 as originally described in Solís-Lemus & Ané (2016)

Between-species 4-taxon sets

The default setting for SNaQ considers that each allele in a gene tree corresponds to a taxon (a tip) in the network. If instead each allele/individual can be mapped confidently to a species, and if only the species-level network needs to be estimated, then the following functions can be used:

julia> using CSV, DataFrames
julia> mappingfile = joinpath(dirname(pathof(SNaQ)), "..","examples",
           "mappingIndividuals.csv");
julia> tm = CSV.read(mappingfile, DataFrame) # taxon map as a data frame3×2 DataFrame
 Row │ species  individual 
     │ String3  String3    
─────┼─────────────────────
   1 │ S1       S1A
   2 │ S1       S1B
   3 │ S1       S1C
julia> taxonmap = Dict(r[:individual] => r[:species] for r in eachrow(tm)) # as dictionaryDict{InlineStrings.String3, InlineStrings.String3} with 3 entries:
  "S1A" => "S1"
  "S1B" => "S1"
  "S1C" => "S1"

The mapping file can be a text (or csv) file with two columns (at least): one for the individuals, named allele or individual, and one column containing the species names, named species. Each row should map an allele name to a species name. Next, read in the gene trees and calculate the quartet CFs at the species level:

julia> genetreefile = joinpath(dirname(pathof(SNaQ)), "..","examples",
           "genetrees_alleletips.tre");
julia> genetrees = readmultinewick(genetreefile);
julia> sort(tiplabels(genetrees[1])) # multiple tips in species S17-element Vector{String}:
 "S1A"
 "S1B"
 "S1C"
 "S2"
 "S3"
 "S4"
 "S5"
julia> df_sp = tablequartetCF(countquartetsintrees(genetrees, taxonmap;
           showprogressbar=false)...);
julia> keys(df_sp)  # columns names(:qind, :t1, :t2, :t3, :t4, :CF12_34, :CF13_24, :CF14_23, :ngenes)
julia> df_sp[:qind] # quartet index5-element Vector{Int64}:
 1
 2
 3
 4
 5
julia> df_sp[:t1]   # name of first taxon in each quartet5-element Vector{AbstractString}:
 "S1"
 "S1"
 "S1"
 "S1"
 "S2"
julia> df_sp[:CF12_34] # concordance factor for split taxa 12 vs 345-element Vector{Float64}:
 0.08333333333333329
 0.08333333333333329
 0.875
 1.0
 0.875

Now df_sp is a table (technically, a named tuple) containing the quartet concordance factors at the species level only, that is, considering sets made of 4 distinct species, even if the gene trees may have multiple alleles from the same species. For 4 distinct species A,B,C,D, all alleles from each species (A etc.) will be used to calculate the quartet CF. If a given gene tree has n_a alleles from a, n_b alleles from b etc., then each set of 4 alleles is given a weight of 1/(n_a n_b n_c n_d) to calculated of the CF for A,B,C,D (such that the total weight from this particular gene trees is 1). It is safe to save this data frame, then use it for snaq! like this:

julia> CSV.write("tableCF_species.csv", df_sp);   # to save the table to a file
julia> d_sp = readtableCF("tableCF_species.csv"); # "DataCF" object for use in snaq!16.0 gene trees per 4-taxon set
julia> summarizedataCF(d_sp)data consists of 5 4-taxon subsets
Taxa: ["S1", "S2", "S3", "S4", "S5"]
Number of Taxa: 5
Maximum number of 4-taxon subsets: 5. Thus, 100.0 percent of 4-taxon subsets sampled

Within-species 4-taxon sets

Four-taxon sets involving 2 individuals per species can provide more information about the underlying network, including external branch length in coalescent units. However, snaq! runs more slowly when using this extra information.

To get quartet CFs from sets of 4 individuals in which 2 individuals are from the same species, the following functions should be used, where the mapping file can be a text (or csv) file with two columns named allele (or individual) and species, mapping each allele name to a species name.

julia> q,t = countquartetsintrees(genetrees);Reading in trees, looking at 35 quartets in each...
0+----------------+100%
  ****************
julia> df_ind = DataFrame(tablequartetCF(q,t)); # no mapping: CFs across individuals
julia> first(df_ind, 5) # to see the first 5 rows5×9 DataFrame
 Row │ qind   t1      t2      t3      t4      CF12_34  CF13_24  CF14_23  ngene ⋯
     │ Int64  String  String  String  String  Float64  Float64  Float64  Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1  S1A     S1B     S1C     S2       0.375    0.25     0.375      16 ⋯
   2 │     2  S1A     S1B     S1C     S3       0.375    0.25     0.375      16
   3 │     3  S1A     S1B     S2      S3       0.8125   0.0625   0.125      16
   4 │     4  S1A     S1C     S2      S3       0.75     0.125    0.125      16
   5 │     5  S1B     S1C     S2      S3       0.8125   0.125    0.0625     16 ⋯
                                                                1 column omitted

Now df_ind is the table of concordance factors at the level of individuals. In other words, it lists CFs using one row for each set of 4 alleles/individuals.

Warning: This procedure requires that all alleles from the same individual are given the same name (the individual's 'name') across all genes for which that individual was sequenced.

Next, we use mapallelesCFtable to get these data as quartet concordance factors at the species level in df_sp: with the allele names replaced by the appropriate species names.

julia> CSV.write("tableCF_individuals.csv", df_ind);  # to save to a file
julia> df_sp = mapallelesCFtable(mappingfile, "tableCF_individuals.csv";
           columns=2:5); # taxon names are in columns 2 through 5, not default 1-4┌ Warning: not all alleles were mapped
│ alleles not mapped to a species name: S2 S3 S4 S5
└ @ SNaQ ~/work/SNaQ.jl/SNaQ.jl/src/multipleAlleles.jl:234
julia> nrow(df_sp)       # 35 quartets of individuals35
julia> first(df_sp, 6)   # first 6 rows of data frame6×9 DataFrame
 Row │ qind   t1      t2      t3      t4      CF12_34  CF13_24  CF14_23  ngene ⋯
     │ Int64  String  String  String  String  Float64  Float64  Float64  Float ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1  S1      S1      S1      S2       0.375    0.25     0.375      16 ⋯
   2 │     2  S1      S1      S1      S3       0.375    0.25     0.375      16
   3 │     3  S1      S1      S2      S3       0.8125   0.0625   0.125      16
   4 │     4  S1      S1      S2      S3       0.75     0.125    0.125      16
   5 │     5  S1      S1      S2      S3       0.8125   0.125    0.0625     16 ⋯
   6 │     6  S1      S1      S1      S4       0.375    0.25     0.375      16
                                                                1 column omitted

The warning above, after creating df_sp, is because our mapping file does not list species S2 through S5. We did not need to list them because we have a single individual in each of these species. So we can safely ignore the warning. We will just need to make sure that our starting tree, when we run SNaQ, has the same (unmapped) names, here S2-S5.

The command below modifies df_sp to delete rows that are uninformative about between-species relationships, such as rows containing 3 or 4 individuals from the same species (e.g. rows 1, 2 and 6: they contain S1 three times); and creates d_sp, an object of type DataCF at the species level, that we can use later as input for networks estimation with snaq!.

julia> d_sp = readtableCF!(df_sp, mergerows=true); # DataCF objectfound 4 4-taxon sets uninformative about between-species relationships, out of 35.
These 4-taxon sets will be deleted from the data frame.
31 informative 4-taxon sets will be used.
11 unique 4-taxon sets were found.
CF values of repeated 4-taxon sets will be averaged (ngenes too).
16.0 gene trees per 4-taxon set
julia> nrow(df_sp) # 31 quartets of individuals informative between species31

Note

For a four-taxon set A,B,C,D, all the individuals from A, B, C and D are considered, say (a1,b1,c1,d1), (a2,b1,c1,d1), (a1,b2,c1,d1), (a2,b2,c1,d1) and so on. The CFs of these 4-taxon sets are averaged together to obtain the CFs at the species level. This procedures gives more weight to genes that have many alleles (because they contribute to more sets of 4 individuals) and less weight to genes that have few alleles.

Before we run SNaQ, it is safe to save the concordance factor of species quartets, which can be calculated by averaging the CFs of quartets of individuals from the associated species:

julia> df_sp_ave = DataFrame(tablequartetCF(d_sp))  # CFs averaged across individuals11×8 DataFrame
 Row │ t1      t2      t3      t4      CF12_34    CF13_24   CF14_23   ngenes   ⋯
     │ String  String  String  String  Float64    Float64   Float64   Float64? ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ S1      S1__2   S2      S3      0.791667   0.104167  0.104167      16.0 ⋯
   2 │ S1      S1__2   S2      S4      0.875      0.0625    0.0625        16.0
   3 │ S1      S1__2   S3      S4      0.791667   0.104167  0.104167      16.0
   4 │ S1      S2      S3      S4      0.0833333  0.854167  0.0625        16.0
   5 │ S1      S1__2   S2      S5      0.875      0.0625    0.0625        16.0 ⋯
   6 │ S1      S1__2   S3      S5      0.791667   0.104167  0.104167      16.0
   7 │ S1      S2      S3      S5      0.0833333  0.854167  0.0625        16.0
   8 │ S1      S1__2   S4      S5      1.0        0.0       0.0           16.0
   9 │ S1      S2      S4      S5      0.875      0.0625    0.0625        16.0 ⋯
  10 │ S1      S3      S4      S5      1.0        0.0       0.0           16.0
  11 │ S2      S3      S4      S5      0.875      0.0625    0.0625        16.0
julia> CSV.write("CFtable_species.csv", df_sp_ave); # save to file

Some quartets have the same species repeated twice, representing cases when 2 of the 4 individuals came from the same species. These quartets, with repeated species, are informative about the population size of extant populations, i.e. about the lengths of external branches in coalescent units.

The main difference between this section compared to the previous section on Between-species 4-taxon sets is that quartets with 2 individuals from the same species are included here, such as a1,a2,b1,c1. Also, the weighting of quartets is different. Here, genes with more alleles are given more weight.

now we can run snaq!:

net = snaq!(T_sp, d_sp);

where T_sp should be a starting topology with one tip per species, labelled with the same species names as the names used in the mapping file.

If snaq! takes too long that way, we can try a less ambitious estimation that does not estimate the external branch lengths, that is, without using quartets that have 2 individuals from the same species. To do so, we can use the quartet concordance factors at the species level, but filter out the quartets with one (or more) species repeated, such as these below:

julia> first(df_sp_ave, 3) # some quartets have the same species twice3×8 DataFrame
 Row │ t1      t2      t3      t4      CF12_34   CF13_24   CF14_23   ngenes   
     │ String  String  String  String  Float64   Float64   Float64   Float64? 
─────┼────────────────────────────────────────────────────────────────────────
   1 │ S1      S1__2   S2      S3      0.791667  0.104167  0.104167      16.0
   2 │ S1      S1__2   S2      S4      0.875     0.0625    0.0625        16.0
   3 │ S1      S1__2   S3      S4      0.791667  0.104167  0.104167      16.0

Filtering them out can be done as in the first section (Between-species 4-taxon sets) to give equal weight to all genes, or as shown below to give more weight to genes that have more alleles. We first define a helper function to identify which rows we want to get rid of.

julia> first(df_sp_ave, 3) # some quartets have the same species twice3×8 DataFrame
 Row │ t1      t2      t3      t4      CF12_34   CF13_24   CF14_23   ngenes   
     │ String  String  String  String  Float64   Float64   Float64   Float64? 
─────┼────────────────────────────────────────────────────────────────────────
   1 │ S1      S1__2   S2      S3      0.791667  0.104167  0.104167      16.0
   2 │ S1      S1__2   S2      S4      0.875     0.0625    0.0625        16.0
   3 │ S1      S1__2   S3      S4      0.791667  0.104167  0.104167      16.0
julia> """
           hasrep
       
       Return true if a row (4-taxon set) has a "repeated" species, that is, a species
       whose name ends with "__2". Otherwise, return false.
       
       Warning: this function assumes that taxon names are in columns
       "t1", "t2", "t3", "t4". For data frames with different column names,
       e.g. "taxon1", "taxon2" etc., simply edit the code below by replacing
       `:t1` by `:taxon1` (or the appropriate column name in your data).
       """
       function hasrep(row)
           occursin(r"__2$", row[:t1]) || occursin(r"__2$", row[:t2]) ||
           occursin(r"__2$", row[:t3]) || occursin(r"__2$", row[:t4])
       end; # this function is used on the next line
julia> df_sp_reduced = filter(!hasrep, df_sp_ave) # removes rows with repeated species5×8 DataFrame
 Row │ t1      t2      t3      t4      CF12_34    CF13_24   CF14_23  ngenes   
     │ String  String  String  String  Float64    Float64   Float64  Float64? 
─────┼────────────────────────────────────────────────────────────────────────
   1 │ S1      S2      S3      S4      0.0833333  0.854167   0.0625      16.0
   2 │ S1      S2      S3      S5      0.0833333  0.854167   0.0625      16.0
   3 │ S1      S2      S4      S5      0.875      0.0625     0.0625      16.0
   4 │ S1      S3      S4      S5      1.0        0.0        0.0         16.0
   5 │ S2      S3      S4      S5      0.875      0.0625     0.0625      16.0
julia> CSV.write("CFtable_species_norep.csv", df_sp_reduced); # to save to file
julia> d_sp_reduced = readtableCF(df_sp_reduced) # DataCF object, for input to snaq!16.0 gene trees per 4-taxon set
Object DataCF
number of quartets: 5

and now we can run snaq! on the reduced set of quartets without repeats, which should be faster:

net = snaq!(T_sp, d_sp_reduced);