Example data: multi-locus sequences

There are two ways to download the input data files:

  • navigate to where you want the data, then clone the repository that contains both this tutorial and the files:

    git clone https://github.com/JuliaPhylo/PhyloUtilities.git

    Inside, you should find a directory data_results:

    cd PhyloUtilities
    ls
    cd data_results
    ls
    
  • or download and uncompress an archive of the repository, available as either a .zip file or a .tgz tarball.

These tutorial files are organized in a directory data_results/ that has:

  • 2 example data sets, with a separate directory for each. These are all simulated data, for which we know the true network.

    • 1 data set has 15 taxa and 3 reticulations (n15.gamma0.30.20.2_n300).
    • 1 data set has 6 taxa and 1 reticulation (baseline.gamma0.3_n30), to run fast during the tutorial.

    These examples will provide ways to look at the effect of the number of taxa, the number of genes, and the number of reticulations.

Inside PhyloUtilities, we also have a folder scripts/ containing all the scripts we will use in the tutorial.

Within each dataset folder, you will find several directories:

  • input/: contains input data. One file has the true gene trees (simulated from a network) and the other file is a tarball containing all the alignments (e.g. 1000 alignments if the data set has 1000 genes). All the other folders and files can be recreated using the scripts. The main results files are provided, though, to allow participants to pick up the tutorial at any step.
  • bucky-output/: contains the main output file from running MrBayes + BUCKy, which is a table listing quartet concordance factors (CFs). Also contains a file with the species tree estimated from quartet CFs using Quartet MaxCut.
  • raxml/: contains one file with all the best RAxML trees, one for each gene; and a directory bootstrap/ containing one file with 100 bootstrap trees for each gene.
  • astral/: contains one file that lists all the bootstrap tree files from RAxML, and another file with the results of ASTRAL. These results consist of 102 trees: first 100 bootstrap species trees, then their consensus, then the species tree estimated from the original data, annotated with bootstrap support.
  • snaq/: contains various files for the various estimated networks: best networks and bootstrap networks.

For both datasets, the main results of each step are provided. Participants can continue on the next step even if the previous step did not work on their laptop.
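
To confirm that a dataset folder has the layout described above before starting, a small shell function can help. This is only a sketch: the function name check_dataset_layout is hypothetical (it is not one of the tutorial scripts), and the folder names are simply those listed above.

```shell
# Hypothetical helper (not part of the tutorial scripts): report which of the
# expected folders exist inside one dataset directory.
check_dataset_layout() {
    dataset_dir=$1
    missing=0
    for sub in input bucky-output raxml astral snaq; do
        if [ -d "$dataset_dir/$sub" ]; then
            echo "found:   $sub/"
        else
            echo "missing: $sub/"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when none of the expected folders is absent.
    [ "$missing" -eq 0 ]
}
```

For example, from inside data_results/, one could run: check_dataset_layout baseline.gamma0.3_n30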

Sequence alignments inside the input folder

The sequence alignments, one for each gene, are in nexus format and bundled in a tarball. We first navigate to the data directory:

$ cd data_results/baseline.gamma0.3_n30/
$ ls input/
1_seqgen.in	1_seqgen.tar.gz

1_seqgen.tar.gz is a tarball that contains all 30 alignments (30 loci):

$ tar -ztf input/1_seqgen.tar.gz
1_seqgen10.nex
1_seqgen11.nex
1_seqgen12.nex
1_seqgen13.nex
...
1_seqgen6.nex
1_seqgen7.nex
1_seqgen8.nex
1_seqgen9.nex
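
Rather than reading the listing by eye, the alignments can be counted directly from the tarball without extracting it. The sketch below wraps this in a function; the name count_alignments is made up for illustration, and it assumes every alignment file name ends in .nex, as in the listing above.

```shell
# Hypothetical helper: count the nexus alignments listed in a gzipped tarball,
# without extracting anything.
count_alignments() {
    # tar -ztf lists the archive members; grep -c counts names ending in .nex
    tar -ztf "$1" | grep -c '\.nex$'
}
```

Running count_alignments input/1_seqgen.tar.gz from the dataset folder should report 30 if the tarball holds the 30 loci described above.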

Let’s look at the first alignment, 1_seqgen1.nex, which is stored inside input/1_seqgen.tar.gz. We can decompress the nexus files into a new folder that we will call nexus, then look at the first alignment:

cd input
mkdir nexus
tar -xzvf 1_seqgen.tar.gz -C nexus
ls nexus
cat nexus/1_seqgen1.nex
less -S nexus/1_seqgen1.nex

(type q to quit viewing the file)

The alignment looks like this. It has only 6 taxa and 500 bp, to keep analyses fast during the workshop, and yes, these data were simulated:

#NEXUS
[
Generated by seq-gen Version 1.3.2x
Simulations of 6 taxa, 500 nucleotides
for 30 tree(s) with 1 dataset(s) per tree
Branch lengths of trees multiplied by 0.018
Rate homogeneity of sites.
Model = HKY: Hasegawa, Kishino & Yano (1985)
transition/transversion ratio = 2 (K=4.21179)
with nucleotide frequencies specified as:
A=0.300414 C=0.191363 G=0.196748 T=0.311475
]
Begin DATA;	[Tree 1]
    Dimensions NTAX=6 NCHAR=500;
    Format MISSING=? GAP=- DATATYPE=DNA;
    Matrix
6 TTGAAACGGGTAATTTTACTTATCGATTATAAGCATCATACATGATATGGTTGTTTGTTGATGACTTCATAGCTATAAGAGGCATTATAGTATGCATGTTCCGTCAGACTCGCCCACTACAGAGCTATGTAAACAGTGGGGGCTGGTACAACTCCCTACCGATTGAATCTTATAATGGCGTATGATGTTAACGCGCTCTTGAATTGTCTTTTAAGCATAAGGGCTTTGGATAGATTAATCTTGCTTTAAATCACTCTAGCAGAAGCGTACGTTTTAATCAGACATTAACACGTTGTCGATCCATTTCAACACACACTGTTCAGTACCTTGGATCTATAAGATCCATGGGTATACCACATTTGTTGTTGCCGCTTGTGTACCCTGGTGAATGGCGTTAAGACTCCAGAGTAACCTGCTAGCTACACGCATCATGAACGGCTATGCCGATAGCTGACAAGTTCTTACGTCTAGGGTCTTAGCACCGCCATTCCCAGGTAAAG
5 TTGAAACGTGTAATTTTACTTATCGATTATAAGCATCATACATGATATGGTTGTTTGCTGATGACTTCTTGGCCATAAAAGGCATTGTAGTATGCATGTGCCGTCAGACCCGCCTATAACAGAACTATGTAAATAGTGGGGGCCAGTACAACTCCCTACCGATTGAATCTTATAATGGTGAATGATGTTAACGCGCTCTTGAATTGTCTTTTAAGCATAAGGGCTTTAGATAGACTAATCTAGCTTTAATTCACTCTAGTAGAAGCTTACGCTTTAATCAACCGTTAACACATTGTCGATCCATTTCAACACACTCTGTTCAATACCTTGGATCTATAAGATCCATGGGTTTACAACATTTGTTGTTGCTGCTCGTATACCCTGGCGGATGGCGTTAGATCTCCAGAGTAACCTGCTAGCTACACATATCGTGAATGGCTATGTCGATAACGGACAAGTTCCTACGTCTAGGATCTTAGTACCGGCATTCCCAAGTGAAG
1 TTGAAACGGGTAATCTTACTTATCGATTATAAGCATCATACATGATATGGTTGTTTGCTGATGATTTCTTAGCTATAAAAGGCATTATAGTGTGCATGTGCCGTCAGACCCGCCTATTATAGAACTATGTAAATAGTGGGGGCCAGTACAACTCCCTACCGATTGAGTCTTATAATGGTGAATGATGTTAACGCGCTATTGAATTGTCTTTTAAGCATGAGGGCTTTAGATAGACTAATCTAGCTTTAATTCACTCTAGTAGAAGCTTACGTTTTAATCAACCGTTAACACATTGTCGATCCATTTCAACACACACTGTTCAATACCTTGGATCTATAAAATCCATGGGTACACAACATGTGTTGTTTCTGCTTGTCTACCCTGGTGAATGGCGTTAGGTTTCCAGAGTAATTTGCTAGCTACACGTATCGTGGACGGCTATGTCGATAGCGGACAAGTTCTTACGTCTAGAATCGTAGTACCGCCATTCCCAGGTGAAG
2 TTGAAACGGGTAATCTTACTTATCGATTATAAGCATCATACCTGATATGGTTGTTTGCTGATGGTTTCTTAGCTATAAAAGGCATTATAGTGTGCATGTGCCGTCAGACCCGCCTATTATAGAACTATGTAAATAGTGGGGGCCAGTACAACTCCCTACCGATTGAGTCTTATAATGGTGAATGATGTTAACGCGCTATTGAATTGTCTTTTAAGCATGAGGGCTTTAGATAGACTAATCTAGCTTTAATTCACTCTAGTAGAAGCTTACGTTTTAATCAACCGTTAACACATTGTCGATCCATTTCCACACACACTGTTCAATACCTTGGATCTATAAAATCCATGGGTACACAACATGTGTTGTTTCTGCTTGTCTACCCTGGTGAGTGGCGTTAGGTTTCCAGAGTAATCTGCTAGCTACACGTATCGTGGACGGCTATGTCGATAGCGGACAAGTTCTTACGTCTAGAATCGTAGTACCGCCATTCCCAGGTGAAG
3 TTGAAACGGGTAATCATACTTATCGATTATAAGCATCATACATGATACGGTTGTTTGCTGATGATTTCTTAGCTATAAAAGGCATTATAGTGTGCATGTGCCGTCAGACCCGCCTATTATAGAACTATGTAAATAGTGGGGGCCAGTACAACTCCCTACCGATTGAATCTTATAATGGTGATTGATGTTAACGCTCTATTGAATTGTCTTTCAATCATAAGGGCTTTAGATAGACTAATCTAGCTTTAATTCACTCTAGTAGAAGCTTACGTTTTAATCAATCGTTAACACATTGTCGATCAATTTCAACACACACTGTTCAATACCTTGGATCTATAAAATCCTTGGGTACACAACATTTGTTGTTTTTGCTTGTATACCCTGGTGAATGGCGTTAGGTTTCCAGAGTAATCTGCTAGCTACACGTATCGTGAACGGCTATGTCGATAGCGGACAAGTTCTTACGTCTAGAATCGTAGTACCACCGTTCCCAGGTGAAG
4 TTCAAACGGGTAATCATACTTATCGATTATAAGCATCATACATGATATGGTTGTTTGCTGATGATTTCTTAGCTATCAAAGGCATTATAGTGTGCATGTGCCGTCAGACCCGCCTATTATAGAACTATGTAGATAATGGGGGCCAGTACAACTCCCTACCGATTGAATCTTATAATGGTGAATGATGTTAACGCTCTATTGAATTGTCTTTCAAGCATAAGGGCTTTAGATAGACTAATCTAGCTTTAATTCACTCTAGTAGAAGCTTACGTTTTAATCAATCGTTAACACATTGTCGATCAATTTCAACACACACTGTTCAATACCTTGGATCTATAAAATCCATGGGTACACAACATTTGTTGTTTCTGCTTGTATACCCTGGTGAATGGCGTTAGGTTTCCAGAGTAATCTGCCAGCTACACGTATCGTGAACGGCTATGTCGATAGCGGACAAGTTCTTACGTCTAGAATCGTAGTACCGCCATTCCCAGGTGAAG
    ;
END;
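
As a quick sanity check, the dimensions declared in the file can be compared with the rows actually present in the matrix. The function below is a minimal sketch under stated assumptions: the name check_nexus_dims is hypothetical, and it assumes the non-interleaved layout shown above (one "taxon sequence" row per line, with the closing ";" on its own line). It is not a general nexus parser.

```shell
# Hypothetical helper: compare the NTAX/NCHAR declared in a nexus file with the
# rows found between "Matrix" and the closing ";".
# Assumes the non-interleaved layout shown above.
check_nexus_dims() {
    awk '
        # Read the declared dimensions from the "Dimensions" line.
        toupper($1) == "DIMENSIONS" {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                val = kv[2]
                sub(/;$/, "", val)
                if (toupper(kv[1]) == "NTAX")  ntax  = val + 0
                if (toupper(kv[1]) == "NCHAR") nchar = val + 0
            }
        }
        toupper($1) == "MATRIX" { in_matrix = 1; next }
        in_matrix && $1 == ";"  { in_matrix = 0 }
        # Inside the matrix, each row is "<taxon name> <sequence>".
        in_matrix && NF == 2 {
            rows++
            if (length($2) != nchar) bad++
        }
        END {
            printf "declared: %d taxa x %d sites; found: %d rows, %d with wrong length\n", ntax, nchar, rows, bad
            # Non-zero exit status when the matrix disagrees with the declaration.
            exit (rows != ntax || bad > 0)
        }
    ' "$1"
}
```

From inside input/, all extracted alignments could be checked with a loop such as: for f in nexus/*.nex; do check_nexus_dims "$f"; done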

Now go back to the main folder for the baseline.gamma0.3_n30 data, because later analyses will start from there:

$ cd ..
$ pwd
/home/moleuser/phylo-networks/data_results/baseline.gamma0.3_n30

Copyright JuliaPhylo © 2025. Distributed by an MIT license.