SANS 2.2_04A
Symmetric Alignment-free phylogeNomic Splits — Space and time Efficient Re-Implementation including Filters
- Reference-free
- Alignment-free
- Assembled genomes or reads as input
- Phylogenetic splits or tree as output
- NEW: Coding sequences / amino acid sequences as input (see --code and --amino), implemented by Marco Sohn
Details on filter option
The sorted list of splits is greedily filtered, i.e., splits are iterated from strongest to weakest, and a split is kept if and only if the filter criterion is met (a small code sketch of this loop follows the list below).

- strict: a split is kept if it is compatible with all previously kept splits, i.e., the resulting set of splits is equivalent to a tree.
- weakly: a split is kept if it is weakly compatible with all previously kept splits (see the publication for the definition of "weak compatibility").
- n-tree: several sets of compatible splits (= trees) are maintained. A split is added to the first, second, ..., n-th set if possible (compatible).
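The following minimal Python sketch illustrates this greedy loop together with the pairwise compatibility test used by the strict filter. It is an illustration only, not the actual C++ implementation; representing each split by one of its two sides as a set of taxa is an assumption made for the example.

```python
# Sketch of the greedy filter: iterate splits from strongest to weakest and keep
# a split only if it passes the chosen criterion against all splits kept so far.
# Splits are represented here by one side, as a frozenset over the taxon set.

def compatible(a, b, taxa):
    """Splits {a, taxa-a} and {b, taxa-b} are compatible iff at least one of the
    four pairwise intersections of their sides is empty (strict criterion)."""
    a2, b2 = taxa - a, taxa - b
    return not (a & b) or not (a & b2) or not (a2 & b) or not (a2 & b2)

def greedy_filter(weighted_splits, taxa):
    """weighted_splits: list of (weight, split) pairs; returns the kept subset."""
    kept = []
    for w, s in sorted(weighted_splits, key=lambda ws: ws[0], reverse=True):
        if all(compatible(s, t, taxa) for _, t in kept):
            kept.append((w, s))
    return kept

# Toy usage
taxa = frozenset("ABCD")
splits = [(5.0, frozenset("AB")), (3.0, frozenset("AC")), (2.0, frozenset("A"))]
print(greedy_filter(splits, taxa))  # keeps AB|CD and A|BCD, drops AC|BD
```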
Examples
- Determine splits from assemblies or read files

```
SANS -i list.txt -o sans.splits -k 31
```

The 31-mers (-k 31) of the fasta or fastq files listed in list.txt (-i list.txt) are extracted. Splits are determined and written to sans.splits (-o sans.splits).

To extract a tree (-f strict) in NEWICK format (-N sans_greedytree.new), use

```
SANS -i list.txt -k 31 -f strict -N sans_greedytree.new
```

or filter from an existing set of splits (-s sans.splits):

```
SANS -s sans.splits -f strict -N sans_greedytree.new
```
- Drosophila example data

```
# go to example directory
cd <SANS directory>
cd example_data/drosophila

# download data: whole genome and coding sequences
./download_WG.sh
./download_CDS.sh

# run SANS greedy tree
../../SANS -i WG/list.txt -o sans_greedytree_WG.splits -f strict -N sans_greedytree_WG.new -v
../../SANS -i CDS/list.txt -o sans_greedytree_CDS.splits -f strict -N sans_greedytree_CDS.new -v -c

# compare to reference
../../scripts/newick2sans.py Reference.new > Reference.splits
../../scripts/comp.py sans_greedytree_WG.splits Reference.splits WG/list.txt > sans_greedytree_WG.comp
../../scripts/comp.py sans_greedytree_CDS.splits Reference.splits CDS/list.txt > sans_greedytree_CDS.comp
```
- Virus example data

```
# go to example directory
cd <SANS directory>
cd example_data/prasinoviruses

# download data
./download.sh

# run SANS
../../SANS -i fa/list.txt -o sans.splits -k 11 -t 130 -v

# compare to references
../../scripts/newick2sans.py Reference_Fig3.new > Reference_Fig3.splits
../../scripts/comp.py sans.splits Reference_Fig3.splits fa/list.txt > fig3.comp
../../scripts/newick2sans.py Reference_Fig4.new > Reference_Fig4.splits
../../scripts/comp.py sans.splits Reference_Fig4.splits fa/list.txt > fig4.comp
```
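The -i option in the examples above takes a file such as list.txt that lists the input sequence files. Assuming this file simply contains one fasta/fastq path per line (as in the example data), a small helper like the following hypothetical Python snippet can generate it; the directory name and file pattern are placeholders.

```python
# Hypothetical helper to create a list.txt for SANS -i, assuming the file simply
# lists one sequence file per line. Directory and glob pattern are placeholders.
from pathlib import Path

genome_dir = Path("genomes")                 # directory containing the genomes
files = sorted(genome_dir.glob("*.fa"))      # adjust the pattern to your data
Path("list.txt").write_text("\n".join(str(f) for f in files) + "\n")
```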
Performance evaluation on predicted open reading frames
SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.
If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.
If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing step, and may even improve, while the total running time decreases, especially because the genomes can easily be pre-processed in parallel (a sketch of this is given after the table below).
The following tools have been used with the parameters shown in the table.
| Tool | Parameters | Reference |
|---|---|---|
| SANS | -k (see below) -m geom2 -t 10n --filter strict [-a (if preprocessed)] | |
| Getorf (EMBOSS) | -find (see below) -t 11 | Gary Williams, 2000 |
| ORFfinder | -n true -g 11 | NCBI |
| Prodigal | -q | V2.6.3, Doug Hyatt, 2016 |
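To illustrate the parallel pre-processing mentioned above, the following hedged Python sketch runs Getorf on many genomes in parallel and writes a list file for SANS. The paths, the file pattern and the pool size are assumptions; the Getorf parameters are the ones from the table.

```python
# Hypothetical sketch: predict ORFs with Getorf (EMBOSS) for many genomes in
# parallel before running SANS on the results. Paths, glob pattern and pool size
# are placeholders; -find and -table 11 are the parameters listed above.
import subprocess
from multiprocessing import Pool
from pathlib import Path

GENOMES = sorted(Path("genomes").glob("*.fasta"))    # input genomes (assumption)
OUTDIR = Path("orfs"); OUTDIR.mkdir(exist_ok=True)   # output directory (assumption)

def predict_orfs(genome):
    out = OUTDIR / (genome.stem + ".orf.fasta")
    subprocess.run(["getorf", "-find", "0", "-table", "11",
                    "-sequence", str(genome), "-outseq", str(out)], check=True)
    return out

if __name__ == "__main__":
    with Pool(16) as pool:                           # parallelization factor 16
        orf_files = pool.map(predict_orfs, GENOMES)
    # write the list of pre-processed files for SANS (-i)
    Path("list.txt").write_text("\n".join(str(f) for f in orf_files) + "\n")
```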
For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.
| Measure | Definition |
|---|---|
| F1 score | harmonic mean of precision and recall |
| precision | (number of called splits that are also in the reference) / (total number of called splits) |
| recall | (number of reference splits that are also called) / (total number of reference splits) |
| weighted precision | (total weight of called splits that are also in the reference) / (total weight of all called splits) |
| weighted recall | (total weight of reference splits that are also called) / (total weight of all reference splits) |
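As a plain illustration of these measures (the comp.py script shipped with SANS is the implementation actually used for the comparisons), a Python sketch could look as follows. It assumes that splits are stored in a canonical representation as the keys of weight dictionaries.

```python
# Illustrative computation of (weighted) precision, recall and F1 between called
# and reference splits. Each argument maps a split (in some canonical, hashable
# representation, e.g., a frozenset of taxa) to its weight.

def precision_recall_f1(called, reference, weighted=False):
    common = called.keys() & reference.keys()
    if weighted:
        precision = sum(called[s] for s in common) / sum(called.values())
        recall = sum(reference[s] for s in common) / sum(reference.values())
    else:
        precision = len(common) / len(called)
        recall = len(common) / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage
called = {frozenset("AB"): 5.0, frozenset("AC"): 1.0}
reference = {frozenset("AB"): 4.0, frozenset("AD"): 2.0}
print(precision_recall_f1(called, reference))                  # unweighted
print(precision_recall_f1(called, reference, weighted=True))   # weighted
```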
Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.
Salmonella enterica Para C
220 genomes, k=31
| Preprocessing | Running time pre-processing | Running time SANS | Running time both (parallel pre-processing, factor 16) | F1 score (weighted) |
|---|---|---|---|---|
| none (whole genome) | – | 610s | – | 0.878 (0.999) |
| Getorf (-find 0) | 247s | 459s | 706s (474s) | 0.881 (0.999) |
| Getorf (-find 1) | 199s | 339s | 538s (351s) | 0.868 (0.999) |
| ORFfinder | 5195s | 166s | 5361s (491s) | 0.858 (0.997) |
| Prodigal | 6292s | 134s | 6326s (521s) | 0.853 (0.998) |
Salmonella enterica subspecies enterica
2964 genomes, k=21
| Preprocessing | Running time pre-processing | Running time SANS | Running time both (parallel pre-processing, factor 16) | F1 score (weighted) |
|---|---|---|---|---|
| none (whole genome) | – | 190min | – | 0.587 (0.792) |
| Getorf (-find 0) | 55min | 220min | 274min (223min) | 0.624 (0.807) |
| Getorf (-find 1) | 48min | 165min | 213min (168min) | 0.620 (0.799) |
| ORFfinder | 1621min | 78min | 1698min (179min) | 0.594 (0.766) |
| Prodigal | 1322min | 50min | 1372min (133min) | 0.587 (0.792) |
Clustering / dereplication of metagenome assembled genomes (MAGs)
For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster.
This procedure has been successfully applied for the dereplication of metagenome-assembled genomes (MAGs). Here, the input consists of MAGs, and the goal is to cluster them such that the MAGs in each cluster belong to the same strain.
The general procedure is to construct a tree with SANS and then chop it into clusters as described below.
Due to the usually very high number of input sequences, we recommend using the parameters --window (-w) and --top (-t) in order to save time and memory. (The experimental parameter --window is not mentioned in the usage because it can considerably lower the accuracy of reconstructed phylogenies. In this case, however, the reconstructed tree does not need to be an accurate phylogeny, and the parameter has only a moderate effect on the clustering.)
| Setting | Parameters |
|---|---|
| quick | --window 25 --top 50n |
| thoroughly | --window 10 --top 100n |
The tree is chopped into clusters as follows (a rough code sketch is given after the list):

- Re-root the tree at the node of maximum degree.
- In a post-order traversal:
  - ignore non-branching nodes (merge their edges),
  - get clusters from the sub-trees (recursively),
  - if an edge is longer than its parent edge:
    - remove the already found clusters from the current leaf set,
    - the remaining leaf set becomes a new cluster.
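The following rough Python sketch illustrates this chopping procedure; it is for illustration only, the scripts shipped with SANS are the reference implementation. The node representation and the handling of leaves that remain unclustered at the root are assumptions of the sketch.

```python
# Rough sketch of the tree chopping. A node is a dict with "length" (length of
# the edge to its parent), "children" (list of nodes) and, for leaves, "name".
# The tree is assumed to be already re-rooted at the node of maximum degree.

def chop(node, parent_length, clusters, assigned):
    """Post-order traversal; returns the set of leaves below `node`."""
    if not node["children"]:                        # leaf
        leaves = {node["name"]}
    elif len(node["children"]) == 1:                # non-branching node:
        child = dict(node["children"][0])           # merge its edge into the child's
        child["length"] += node["length"]
        return chop(child, parent_length, clusters, assigned)
    else:
        leaves = set()
        for child in node["children"]:              # get clusters from sub-trees
            leaves |= chop(child, node["length"], clusters, assigned)
    if node["length"] > parent_length:              # edge longer than parent edge:
        cluster = leaves - assigned                 # remove already clustered leaves,
        if cluster:                                 # the rest becomes a new cluster
            clusters.append(cluster)
            assigned |= cluster
    return leaves

def chop_tree(root):
    clusters, assigned = [], set()
    all_leaves = chop(root, float("inf"), clusters, assigned)
    rest = all_leaves - assigned                    # sketch assumption: leaves that
    if rest:                                        # remain unclustered at the root
        clusters.append(rest)                       # form one final cluster
    return clusters

# Toy usage: ((A:1,B:1):3,(C:1,D:1):3,E:2); rooted at the degree-3 node
leaf = lambda name, length: {"name": name, "length": length, "children": []}
root = {"length": 0.0, "children": [
    {"length": 3.0, "children": [leaf("A", 1.0), leaf("B", 1.0)]},
    {"length": 3.0, "children": [leaf("C", 1.0), leaf("D", 1.0)]},
    leaf("E", 2.0)]}
print(chop_tree(root))   # three clusters: {A, B}, {C, D}, {E}
```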
Dereplication efficiency and accuracy
This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.
| Filter | # MAGs | # Strains |
|---|---|---|
| MIMAG medium | 5,786 | 686 |
| MIMAG high | 2,510 | 349 |
| no filter | 11,602 | 791 |
For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:
purity := TP / (TP + FP)
completeness := TP / (TP + FN)
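For illustration, the mapping and the per-cluster measures could be computed as in the following Python sketch; this is not the evaluation code actually used, and weighting the averages by cluster size is an assumption of the sketch.

```python
# Illustrative sketch of the evaluation: map each called cluster to the
# gold-standard cluster with maximum intersection, compute purity = TP/(TP+FP)
# and completeness = TP/(TP+FN) per cluster, and average the per-cluster values,
# unweighted and (as an assumption) weighted by cluster size.

def evaluate(called_clusters, gold_clusters):
    """Both arguments are lists of sets of MAG identifiers."""
    purity, completeness, size = [], [], []
    for called in called_clusters:
        best = max(gold_clusters, key=lambda gold: len(called & gold))
        tp = len(called & best)        # MAGs agreeing with the mapped gold cluster
        fp = len(called - best)        # MAGs belonging to other gold clusters
        fn = len(best - called)        # MAGs of the mapped gold cluster that are missing
        purity.append(tp / (tp + fp))
        completeness.append(tp / (tp + fn))
        size.append(len(called))
    n, total = len(called_clusters), sum(size)
    return {
        "purity": sum(purity) / n,
        "completeness": sum(completeness) / n,
        "weighted purity": sum(p * s for p, s in zip(purity, size)) / total,
        "weighted completeness": sum(c * s for c, s in zip(completeness, size)) / total,
    }
```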
These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings.
| Input | Setting | Running time | Memory | average purity (weighted) | average completeness (weighted) |
|---|---|---|---|---|---|
| MIMAG medium | quick | 11h | 56G | 0.973 (0.956) | 0.884 (0.998) |
| MIMAG medium | thoroughly | 50h | 130G | 0.972 (0.952) | 0.881 (0.998) |
| MIMAG high | quick | 2h | 16G | 0.983 (0.978) | 0.890 (0.991) |
| MIMAG high | thoroughly | 6h | 36G | 0.983 (0.979) | 0.912 (0.993) |
| no filter | quick | 59h | 127G | 0.996 (0.983) | 0.173 (0.668) |
| no filter | thoroughly | 185h | 290G | 0.995 (0.979) | 0.190 (0.700) |
Location and version

SANS version 2.2_04A is installed in /local/cluster/sans.
Help message
Recompiling
If you need to recompile and set -DmaxK or -DmaxN, you can find the build.sh file in /local/cluster/sans. You should copy the directory into a location where you have write access, change the settings in the makefile, and then re-run bash ./build.sh. Of course, check out a node with qrsh first, as always when compiling software.
Examples and scripts
You can find the example data in /local/cluster/sans/example_data and the scripts in /local/cluster/sans/scripts.
Software: https://gitlab.ub.uni-bielefeld.de/gi/sans
Publication: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab444/6300510