TICR pipeline
PhyloNetworks' wiki has a step-by-step tutorial to go from multiple sequence alignments to a table of quartet gene frequencies (concordance factors, CFs), through BUCKy (to integrate out gene tree uncertainty) or through RAxML. To get the raxml.pl perl script that runs RAxML on each gene, download the content of that wiki with
git clone https://github.com/juliaphylo/PhyloNetworks.jl.wiki.git
then go to the script/ folder, which contains the full information and code. Below is more detail on how to insert data into the pipeline at various stages, using one of two pipelines, depending on your machine configuration:
"no-scheduler" pipeline: original set of Perl scripts from the TICR pipeline, well suited for a machine or a cluster of machines without a job scheduler. The scripts automatically parallelize the work across the available cores.
"slurm" pipeline: well suited for a cluster where users submit jobs via a job scheduler like SLURM or SGE. The job scheduler parallelizes the work across available cores. The scripts in this second pipeline were created to take full advantage of job-scheduler capabilities; they were developed for a cluster running SLURM, so adjustments to the submit scripts will be needed to adapt them to your own SLURM configuration or to your job scheduler's syntax.
To run MrBayes: we already have alignments
no-scheduler pipeline
We don't need to run mdl.pl if we already have aligned gene sequences from separate loci. To run MrBayes on each locus, we can simply create a tarball of the Nexus files we wish to use (fasta won't work at this stage). The command below assumes that we want to use all the files ending with ".nex" in the current directory, one file per locus:
tar czf my-genes.tar.gz *.nex
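To double-check which alignments actually went into the tarball before handing it to mb.pl, we can list its contents. The sketch below is self-contained and uses placeholder file names (locus1.nex, locus2.nex) in a scratch directory; on real data, just run `tar tzf my-genes.tar.gz`:

```shell
# Self-contained sketch on placeholder files; on real data, only the
# last command ("tar tzf") is needed to list what was archived.
tmp=$(mktemp -d) && cd "$tmp"
touch locus1.nex locus2.nex        # placeholder alignment files
tar czf my-genes.tar.gz *.nex
tar tzf my-genes.tar.gz            # lists the .nex files inside the tarball
```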
Once the tarball has been successfully generated, we can then specify this file as input for mb.pl assuming we have a valid MrBayes block located in the file "bayes.txt":
mb.pl my-genes.tar.gz -m bayes.txt -o my-genes-mb
If we get an error message like mb.pl: Command not found, it might be because mb.pl has no execute permission, or because the current directory is not in our "path". An easy fix is to run this command instead:
perl mb.pl my-genes.tar.gz -m bayes.txt -o my-genes-mb
The resulting output tarball would now be located in my-genes-mb/my-genes.mb.tar, and can be used with bucky.pl like this:
bucky.pl my-genes-mb/my-genes.mb.tar -o mygenes-bucky
The output, with the table of concordance factors for all sets of 4 taxa, will be in a file named my-genes.CFs.csv inside a directory named mygenes-bucky. That's the file containing the quartet concordance factors to give to SNaQ as input. There is no need to do any of the steps below: they are already done by bucky.pl.
slurm pipeline
SLURM will parallelize the MrBayes runs across genes.
Navigate to some "working" directory where you place:
- a folder containing all nexus files, which we will call "nexusfolder" below
- a text file named mb-block.txt with the MrBayes block to be used for all the genes (containing the options for MrBayes: model of sequence evolution, number of generations etc.). If we want a different MrBayes block for different genes, step 2 should be skipped, and we should instead find some other way to put the specific MrBayes block at the end of each nexus file.
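For reference, mb-block.txt might look like the block below. This is only an illustrative sketch: the substitution model, number of generations, sampling frequency and number of runs/chains are placeholder settings to adapt to your data, not recommendations.

```text
[illustrative MrBayes block -- adjust all settings to your own analysis]
begin mrbayes;
  set autoclose=yes nowarn=yes;
  lset nst=6 rates=gamma;                              [GTR+Gamma model]
  mcmc ngen=1000000 samplefreq=200 nruns=3 nchains=2;
end;
```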
In the "working" directory above, run the julia script paste-mb-block.jl with "nexusfolder" as argument, to tell the script where to find all the nexus files:
julia path/to/paste-mb-block.jl nexusfolder
This script will read all the nexus files in the directory nexusfolder, create a new directory nexusfolder-block, and write new nexus files (containing the MrBayes block found in file mb-block.txt) as 1.nex, 2.nex, ... in the new directory. A translate.txt file will also be created to map the original gene file names to the new (numbered) file names. If we named our MrBayes block file differently, we can edit the script and modify it to replace mb-block.txt by our actual file name for the MrBayes block.
Modify the submit script mb-slurm-submit.sh, which will parallelize all the individual-gene MrBayes runs with SLURM:
- change --array to the correct number of genes
- change --mail-user to the user's email (if this is an option for your job scheduler)
- replace the /workspace/software/bin in PATH="/workspace/software/bin:$PATH" by the path where the mb executable is located, or put the whole path in the command: /s/mrbayes-3.2.6-1/bin/mb
In slurm, we can then submit the MrBayes array job with:
sbatch mb-slurm-submit.sh
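For orientation, the lines to edit in mb-slurm-submit.sh typically look like the sketch below. All values (array size, email, paths) are placeholders, and the actual script in the wiki may differ; this is not meant to be run as-is:

```shell
#!/bin/bash
#SBATCH --array=1-100                      # one task per gene: use your gene count
#SBATCH --mail-user=user@example.edu       # your email, if your scheduler supports it
#SBATCH --mail-type=END,FAIL
export PATH="/path/to/mrbayes/bin:$PATH"   # directory containing the mb executable
# each array task analyzes one numbered gene file:
mb "nexusfolder-block/${SLURM_ARRAY_TASK_ID}.nex"
```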
With this slurm pipeline, the steps below are needed: keep reading.
To run mbsum on the output of MrBayes for each gene
If we have the output of MrBayes and want to run BUCKy, we must first run mbsum on the output from MrBayes, separately for each gene.
For a gene with output tree files named gene1.run1.t, gene1.run2.t and gene1.run3.t, and a desired burnin of 1000 trees per tree file, we do this:
mbsum -n 1000 -o gene1.in gene1.run1.t gene1.run2.t gene1.run3.t
This mbsum command will need to be executed for each gene. Then we can continue to the next section to run bucky.
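To avoid typing one command per gene, a small shell loop can generate them. The sketch below is a dry run on placeholder files in a scratch directory: it only prints each mbsum command (remove the echo to actually execute mbsum):

```shell
tmp=$(mktemp -d) && cd "$tmp"
# placeholder MrBayes tree files, named gene<i>.run<j>.t as in the example above:
touch gene1.run1.t gene1.run2.t gene1.run3.t
touch gene2.run1.t gene2.run2.t gene2.run3.t
for t in *.run1.t; do
  g=${t%.run1.t}                              # gene name without the .run1.t suffix
  echo mbsum -n 1000 -o "$g.in" "$g".run*.t   # dry run: drop "echo" to really run mbsum
done
```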
Alternatively, we can use the julia script mbsum-t-files.jl, and give it as argument the directory that has the output tree files from MrBayes, to run mbsum for all the genes. mbsum is fast, so there is no attempt to parallelize the various mbsum commands.
julia mbsum-t-files.jl mbfolder outputfolder burnin # or
julia --color=yes -- mbsum-t-files.jl mbfolder outputfolder burnin # for colorized messages to the screen
where burnin is replaced by the number of trees to ignore in each tree file for burnin. This burnin argument is optional (default: 2501). The outputfolder will contain the output of mbsum.
To run bucky on all 4-taxon sets: we already have the mbsum output
no-scheduler pipeline
If we already ran mbsum on the output from MrBayes, for each individual gene, we can simply create a tarball containing all the mbsum output files. So if we had mbsum output in files named gene1.in, gene2.in, ..., gene100.in, we would want to run something similar to the following command to create the tarball:
tar czf my-genes-mbsum.tar.gz gene*.in
We can now use this tarball along with the -s option in bucky.pl like this:
bucky.pl my-genes-mbsum.tar.gz -s -o mygenes-bucky
Again, if we get an error like bucky.pl: Command not found, we could run instead
perl bucky.pl my-genes-mbsum.tar.gz -s -o mygenes-bucky
The output, with the table of concordance factors for all sets of 4 taxa, will be in a file named my-genes.CFs.csv inside directory mygenes-bucky. That's the file containing the quartet concordance factors to give to SNaQ as input.
slurm pipeline
We want to run bucky on every 4-taxon set. SLURM will parallelize these jobs with the submit script bucky-slurm-submit.sh, which calls the perl script bucky-slurm.pl.
The perl script bucky-slurm.pl runs bucky on a single 4-taxon set. It takes the following arguments, which must be modified in the submit script bucky-slurm-submit.sh:
- name of the folder containing the mbsum output files (one per locus) from the previous step. This folder is named mbsum in the submit script: adapt if needed.
- output name: -o or --out-dir, name of the directory to store output files in. This option is not used in the default submit script.
- bucky arguments: -a or --alpha for the prior alpha value, and -n or --ngen for the number of generations. These options are not used either, in the script: the defaults are used then (α=1, 1 million generations).
- integer for the given quartet, via option -q. The quartet ID is specified by SLURM with its own array ID: $SLURM_ARRAY_TASK_ID.
In the submit script that gives instructions to the job scheduler:
- adapt the name of the $SLURM_ARRAY_TASK_ID variable, which captures the task number in the array of tasks, to your scheduler's syntax
- change --array to the correct number of 4-taxon sets. For example, if there are 15 taxa in the dataset, there are 1365 4-taxon sets. To get this number, if you are unsure, use choose(15,4) in R or binomial(15,4) in Julia, but replace 15 by your actual number of individuals.
- change --mail-user to the user's email (if this is an option for your job scheduler)
- replace the /workspace/software/bin in PATH="/workspace/software/bin:$PATH" by the path where the bucky executable is located. Also, replace /workspace/claudia/software/TICR/scripts/ by the full path where the bucky-slurm.pl script is located.
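The --array size is the number of 4-taxon sets, i.e. the binomial coefficient "n choose 4". Besides R or Julia, plain shell arithmetic gives the same number; n=15 below is a placeholder for your actual taxon count:

```shell
n=15                                        # replace by your number of taxa
echo $(( n * (n-1) * (n-2) * (n-3) / 24 ))  # n choose 4: prints 1365 for n=15
```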
In slurm, we would submit the BUCKy array job with:
sbatch bucky-slurm-submit.sh
At the end, the array job will produce:
- a .concordance file for every 4-taxon set
- a .cf file with the parsed output for that same 4-taxon set, in the format needed for the final CF table.
The .cf files can be concatenated to produce the file containing the quartet concordance factors across all 4-taxon sets, to give to SNaQ as input:
cat *.cf > CFtable.csv
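If the shell rejects *.cf because the argument list is too long, one standard workaround is a find-based concatenation, which never expands the file list on the command line. The sketch below demonstrates it on placeholder .cf files in a scratch directory:

```shell
tmp=$(mktemp -d) && cd "$tmp"
printf 'rowA\n' > q1.cf        # placeholder per-quartet .cf files
printf 'rowB\n' > q2.cf
# concatenate all .cf files without putting their names on one command line:
find . -maxdepth 1 -name '*.cf' -exec cat {} + > CFtable.csv
wc -l CFtable.csv              # one line per .cf file
```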
Alternatively, if the list of .cf files is not easily captured by *.cf (because the list is too long for a shell command), the following julia script can do the concatenation. Just copy-paste the commands below within a Julia session, started from the directory that contains the .cf files:
files = String[] # empty vector of strings: will contain the .cf file names later
for f in filter(x -> endswith(x, ".cf"), readdir())
push!(files,f)
end
println("found $(length(files)) cf files") # to check how many .cf output files were found
open("CFtable.csv","w") do f_out
# write the header:
write(f_out, "taxon1,taxon2,taxon3,taxon4,CF12_34,CF12_34_lo,CF12_34_hi,CF13_24,CF13_24_lo,CF13_24_hi,CF14_23,CF14_23_lo,CF14_23_hi,ngenes\n")
for file in files
@show file # to see the .cf file name: comment this out if that's too much screen output
open(file) do f_in
line = read(f_in, String)
write(f_out, string(line,"\n"))
end # closes "file" safely
end
end # closes "CFtable.csv" safely
When this is done, we will have a file CFtable.csv containing the quartet concordance factors, to give to SNaQ as input :smiley: