Multiple alleles per species
This documentation pertains to SNaQ v1.0 as originally described in Solís-Lemus & Ané (2016)
Between-species 4-taxon sets
The default setting for SNaQ considers that each allele in a gene tree corresponds to a taxon (a tip) in the network. If instead each allele/individual can be mapped confidently to a species, and if only the species-level network needs to be estimated, then the following functions can be used:
julia> using CSV, DataFrames
julia> mappingfile = joinpath(dirname(pathof(SNaQ)), "..","examples", "mappingIndividuals.csv");
julia> tm = CSV.read(mappingfile, DataFrame) # taxon map as a data frame
3×2 DataFrame Row │ species individual │ String3 String3 ─────┼───────────────────── 1 │ S1 S1A 2 │ S1 S1B 3 │ S1 S1C
julia> taxonmap = Dict(r[:individual] => r[:species] for r in eachrow(tm)) # as dictionary
Dict{InlineStrings.String3, InlineStrings.String3} with 3 entries: "S1A" => "S1" "S1B" => "S1" "S1C" => "S1"
The mapping file can be a text (or csv
) file with two columns (at least): one for the individuals, named allele
or individual
, and one column containing the species names, named species
. Each row should map an allele name to a species name. Next, read in the gene trees and calculate the quartet CFs at the species level:
julia> genetreefile = joinpath(dirname(pathof(SNaQ)), "..","examples", "genetrees_alleletips.tre");
julia> genetrees = readmultinewick(genetreefile);
julia> sort(tiplabels(genetrees[1])) # multiple tips in species S1
7-element Vector{String}: "S1A" "S1B" "S1C" "S2" "S3" "S4" "S5"
julia> df_sp = tablequartetCF(countquartetsintrees(genetrees, taxonmap; showprogressbar=false)...);
julia> keys(df_sp) # columns names
(:qind, :t1, :t2, :t3, :t4, :CF12_34, :CF13_24, :CF14_23, :ngenes)
julia> df_sp[:qind] # quartet index
5-element Vector{Int64}: 1 2 3 4 5
julia> df_sp[:t1] # name of first taxon in each quartet
5-element Vector{AbstractString}: "S1" "S1" "S1" "S1" "S2"
julia> df_sp[:CF12_34] # concordance factor for split taxa 12 vs 34
5-element Vector{Float64}: 0.08333333333333329 0.08333333333333329 0.875 1.0 0.875
Now df_sp
is a table (technically, a named tuple) containing the quartet concordance factors at the species level only, that is, considering sets made of 4 distinct species, even if the gene trees may have multiple alleles from the same species. For 4 distinct species A,B,C,D
, all alleles from each species (A
etc.) will be used to calculate the quartet CF. If a given gene tree has n_a
alleles from a
, n_b
alleles from b
etc., then each set of 4 alleles is given a weight of 1/(n_a n_b n_c n_d)
to calculated of the CF for A,B,C,D
(such that the total weight from this particular gene trees is 1). It is safe to save this data frame, then use it for snaq!
like this:
julia> CSV.write("tableCF_species.csv", df_sp); # to save the table to a file
julia> d_sp = readtableCF("tableCF_species.csv"); # "DataCF" object for use in snaq!
16.0 gene trees per 4-taxon set
julia> summarizedataCF(d_sp)
data consists of 5 4-taxon subsets Taxa: ["S1", "S2", "S3", "S4", "S5"] Number of Taxa: 5 Maximum number of 4-taxon subsets: 5. Thus, 100.0 percent of 4-taxon subsets sampled
Within-species 4-taxon sets
Four-taxon sets involving 2 individuals per species can provide more information about the underlying network, including external branch length in coalescent units. However, snaq!
runs more slowly when using this extra information.
To get quartet CFs from sets of 4 individuals in which 2 individuals are from the same species, the following functions should be used, where the mapping file can be a text (or csv
) file with two columns named allele
(or individual
) and species
, mapping each allele name to a species name.
julia> q,t = countquartetsintrees(genetrees);
Reading in trees, looking at 35 quartets in each... 0+----------------+100% ****************
julia> df_ind = DataFrame(tablequartetCF(q,t)); # no mapping: CFs across individuals
julia> first(df_ind, 5) # to see the first 5 rows
5×9 DataFrame Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngene ⋯ │ Int64 String String String String Float64 Float64 Float64 Float ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ 1 S1A S1B S1C S2 0.375 0.25 0.375 16 ⋯ 2 │ 2 S1A S1B S1C S3 0.375 0.25 0.375 16 3 │ 3 S1A S1B S2 S3 0.8125 0.0625 0.125 16 4 │ 4 S1A S1C S2 S3 0.75 0.125 0.125 16 5 │ 5 S1B S1C S2 S3 0.8125 0.125 0.0625 16 ⋯ 1 column omitted
Now df_ind
is the table of concordance factors at the level of individuals. In other words, it lists CFs using one row for each set of 4 alleles/individuals.
Warning: This procedure requires that all alleles from the same individual are given the same name (the individual's 'name') across all genes for which that individual was sequenced.
Next, we use mapallelesCFtable
to get these data as quartet concordance factors at the species level in df_sp
: with the allele names replaced by the appropriate species names.
julia> CSV.write("tableCF_individuals.csv", df_ind); # to save to a file
julia> df_sp = mapallelesCFtable(mappingfile, "tableCF_individuals.csv"; columns=2:5); # taxon names are in columns 2 through 5, not default 1-4
┌ Warning: not all alleles were mapped │ alleles not mapped to a species name: S2 S3 S4 S5 └ @ SNaQ ~/work/SNaQ.jl/SNaQ.jl/src/multipleAlleles.jl:234
julia> nrow(df_sp) # 35 quartets of individuals
35
julia> first(df_sp, 6) # first 6 rows of data frame
6×9 DataFrame Row │ qind t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngene ⋯ │ Int64 String String String String Float64 Float64 Float64 Float ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ 1 S1 S1 S1 S2 0.375 0.25 0.375 16 ⋯ 2 │ 2 S1 S1 S1 S3 0.375 0.25 0.375 16 3 │ 3 S1 S1 S2 S3 0.8125 0.0625 0.125 16 4 │ 4 S1 S1 S2 S3 0.75 0.125 0.125 16 5 │ 5 S1 S1 S2 S3 0.8125 0.125 0.0625 16 ⋯ 6 │ 6 S1 S1 S1 S4 0.375 0.25 0.375 16 1 column omitted
The warning above, after creating df_sp
, is because our mapping file does not list species S2 through S5. We did not need to list them because we have a single individual in each of these species. So we can safely ignore the warning. We will just need to make sure that our starting tree, when we run SNaQ, has the same (unmapped) names, here S2-S5.
The command below modifies df_sp
to delete rows that are uninformative about between-species relationships, such as rows containing 3 or 4 individuals from the same species (e.g. rows 1, 2 and 6: they contain S1 three times); and creates d_sp
, an object of type DataCF
at the species level, that we can use later as input for networks estimation with snaq!
.
julia> d_sp = readtableCF!(df_sp, mergerows=true); # DataCF object
found 4 4-taxon sets uninformative about between-species relationships, out of 35. These 4-taxon sets will be deleted from the data frame. 31 informative 4-taxon sets will be used. 11 unique 4-taxon sets were found. CF values of repeated 4-taxon sets will be averaged (ngenes too). 16.0 gene trees per 4-taxon set
julia> nrow(df_sp) # 31 quartets of individuals informative between species
31
For a four-taxon set A,B,C,D
, all the individuals from A
, B
, C
and D
are considered, say (a1,b1,c1,d1)
, (a2,b1,c1,d1)
, (a1,b2,c1,d1)
, (a2,b2,c1,d1)
and so on. The CFs of these 4-taxon sets are averaged together to obtain the CFs at the species level. This procedures gives more weight to genes that have many alleles (because they contribute to more sets of 4 individuals) and less weight to genes that have few alleles.
Before we run SNaQ, it is safe to save the concordance factor of species quartets, which can be calculated by averaging the CFs of quartets of individuals from the associated species:
julia> df_sp_ave = DataFrame(tablequartetCF(d_sp)) # CFs averaged across individuals
11×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes ⋯ │ String String String String Float64 Float64 Float64 Float64? ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ S1 S1__2 S2 S3 0.791667 0.104167 0.104167 16.0 ⋯ 2 │ S1 S1__2 S2 S4 0.875 0.0625 0.0625 16.0 3 │ S1 S1__2 S3 S4 0.791667 0.104167 0.104167 16.0 4 │ S1 S2 S3 S4 0.0833333 0.854167 0.0625 16.0 5 │ S1 S1__2 S2 S5 0.875 0.0625 0.0625 16.0 ⋯ 6 │ S1 S1__2 S3 S5 0.791667 0.104167 0.104167 16.0 7 │ S1 S2 S3 S5 0.0833333 0.854167 0.0625 16.0 8 │ S1 S1__2 S4 S5 1.0 0.0 0.0 16.0 9 │ S1 S2 S4 S5 0.875 0.0625 0.0625 16.0 ⋯ 10 │ S1 S3 S4 S5 1.0 0.0 0.0 16.0 11 │ S2 S3 S4 S5 0.875 0.0625 0.0625 16.0
julia> CSV.write("CFtable_species.csv", df_sp_ave); # save to file
Some quartets have the same species repeated twice, representing cases when 2 of the 4 individuals came from the same species. These quartets, with repeated species, are informative about the population size of extant populations, i.e. about the lengths of external branches in coalescent units.
The main difference between this section compared to the previous section on Between-species 4-taxon sets is that quartets with 2 individuals from the same species are included here, such as a1,a2,b1,c1
. Also, the weighting of quartets is different. Here, genes with more alleles are given more weight.
now we can run snaq!
:
net = snaq!(T_sp, d_sp);
where T_sp
should be a starting topology with one tip per species, labelled with the same species names as the names used in the mapping file.
If snaq!
takes too long that way, we can try a less ambitious estimation that does not estimate the external branch lengths, that is, without using quartets that have 2 individuals from the same species. To do so, we can use the quartet concordance factors at the species level, but filter out the quartets with one (or more) species repeated, such as these below:
julia> first(df_sp_ave, 3) # some quartets have the same species twice
3×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes │ String String String String Float64 Float64 Float64 Float64? ─────┼──────────────────────────────────────────────────────────────────────── 1 │ S1 S1__2 S2 S3 0.791667 0.104167 0.104167 16.0 2 │ S1 S1__2 S2 S4 0.875 0.0625 0.0625 16.0 3 │ S1 S1__2 S3 S4 0.791667 0.104167 0.104167 16.0
Filtering them out can be done as in the first section (Between-species 4-taxon sets) to give equal weight to all genes, or as shown below to give more weight to genes that have more alleles. We first define a helper function to identify which rows we want to get rid of.
julia> first(df_sp_ave, 3) # some quartets have the same species twice
3×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes │ String String String String Float64 Float64 Float64 Float64? ─────┼──────────────────────────────────────────────────────────────────────── 1 │ S1 S1__2 S2 S3 0.791667 0.104167 0.104167 16.0 2 │ S1 S1__2 S2 S4 0.875 0.0625 0.0625 16.0 3 │ S1 S1__2 S3 S4 0.791667 0.104167 0.104167 16.0
julia> """ hasrep Return true if a row (4-taxon set) has a "repeated" species, that is, a species whose name ends with "__2". Otherwise, return false. Warning: this function assumes that taxon names are in columns "t1", "t2", "t3", "t4". For data frames with different column names, e.g. "taxon1", "taxon2" etc., simply edit the code below by replacing `:t1` by `:taxon1` (or the appropriate column name in your data). """ function hasrep(row) occursin(r"__2$", row[:t1]) || occursin(r"__2$", row[:t2]) || occursin(r"__2$", row[:t3]) || occursin(r"__2$", row[:t4]) end; # this function is used on the next line
julia> df_sp_reduced = filter(!hasrep, df_sp_ave) # removes rows with repeated species
5×8 DataFrame Row │ t1 t2 t3 t4 CF12_34 CF13_24 CF14_23 ngenes │ String String String String Float64 Float64 Float64 Float64? ─────┼──────────────────────────────────────────────────────────────────────── 1 │ S1 S2 S3 S4 0.0833333 0.854167 0.0625 16.0 2 │ S1 S2 S3 S5 0.0833333 0.854167 0.0625 16.0 3 │ S1 S2 S4 S5 0.875 0.0625 0.0625 16.0 4 │ S1 S3 S4 S5 1.0 0.0 0.0 16.0 5 │ S2 S3 S4 S5 0.875 0.0625 0.0625 16.0
julia> CSV.write("CFtable_species_norep.csv", df_sp_reduced); # to save to file
julia> d_sp_reduced = readtableCF(df_sp_reduced) # DataCF object, for input to snaq!
16.0 gene trees per 4-taxon set Object DataCF number of quartets: 5
and now we can run snaq!
on the reduced set of quartets without repeats, which should be faster:
net = snaq!(T_sp, d_sp_reduced);