Contents

SANS 2.2_04A

Installed
This software should be available with no extra configuration.

SANS-2.2_04A

Symmetric Alignment-free phylogeNomic Splits

— Space and time Efficient Re-Implementation including Filters

  • Reference-free

  • Alignment-free

  • Assembled genomes or reads as input

  • Phylogenetic splits or tree as output

  • NEW: Coding sequences / amino acid sequences as input (see –code and –amino), implemented by Marco Sohn

Details on filter option

The sorted list of splits is greedily filtered, i.e., splits are iterated from strongest to weakest and a split is kept if and only if the filter criterion is met.

  • strict: a split is kept if it is compatible to all previously filtered splits, i.e., the resulting set of splits is equivalent to a tree.

  • weakly: a split is kept if it is weakly compatible to all previously filtered splits (see publication for definition of “weak compatibility”).

  • n-tree: several sets of compatible splits (=trees) are maintained. A split is added to the first, second, … n-th set if possible (compatible).

Examples

  1. Determine splits from assemblies or read files

    1
    2
    3
    
    
    SANS -i list.txt -o sans.splits -k 31
    
    

    The 31-mers (-k 31) of those fasta or fastq files listed in list.txt (-i list.txt) are extracted. Splits are determined and written to sans.splits (-o sans.splits).

    To extract a tree (-f strict) in NEWICK format (-N sans_greedytree.new), use

    1
    2
    3
    
    
    SANS -i list.txt -k 31 -f strict -N sans_greedytree.new
    
    

    or filter from a set of splits (-s sans.splits)

    1
    2
    3
    
    
    SANS -s sans.splits -f strict -N sans_greedytree.new
    
    
  2. Drosophila example data

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    
    
    # go to example directory
    
    cd <SANS directory>
    
    cd example_data/drosophila
    
    # download data: whole genome and coding sequences
    
    ./download_WG.sh
    
    ./download_CDS.sh
    
    # run SANS greedy tree
    
    ../../SANS -i WG/list.txt -o sans_greedytree_WG.splits -f strict -N sans_greedytree_WG.new -v
    
    ../../SANS -i CDS/list.txt -o sans_greedytree_CDS.splits -f strict -N sans_greedytree_CDS.new -v -c
    
    
    
    # compare to reference
    
    ../../scripts/newick2sans.py Reference.new > Reference.splits
    
    ../../scripts/comp.py sans_greedytree_WG.splits Reference.splits WG/list.txt > sans_greedytree_WG.comp
    
    ../../scripts/comp.py sans_greedytree_CDS.splits Reference.splits CDS/list.txt > sans_greedytree_CDS.comp
    
    
  3. Virus example data

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    
    
    # go to example directory
    
    cd <SANS directory>
    
    cd example_data/prasinoviruses
    
    # download data
    
    ./download.sh
    
    # run SANS
    
    ../../SANS -i fa/list.txt -o sans.splits -k 11 -t 130 -v
    
    
    
    # compare to references
    
    ../../scripts/newick2sans.py Reference_Fig3.new > Reference_Fig3.splits
    
    ../../scripts/comp.py sans.splits Reference_Fig3.splits fa/list.txt > fig3.comp
    
    ../../scripts/newick2sans.py Reference_Fig4.new > Reference_Fig4.splits
    
    ../../scripts/comp.py sans.splits Reference_Fig4.splits fa/list.txt > fig4.comp
    
    

Performance evaluation on predicted open reading frames

SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.

If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.

If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing or even improves, while saving total running time, especially because genomes can easily be pre-processed in parallel.

The following tools have been used with the parameters shown in the table.

Tool Parameters Reference
SANS -k (see below) -m geom2 -t 10n -filter strict [-a (if preprocessed)]
Getorf (EMBOSS) -find (see below) -t 11 Gary Williams, 2000
ORFfinder -n true -g 11 NCBI
Prodigal -q V2.6.3,Doug Hyatt, 2016

For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.

Measure
F1-score harmonic mean of precision and recall
precision (number of called splits that are also in reference)
/ (total number of called splits)
recall (number of reference splits that are also called)
/ (total number of reference splits)
weighted precision (total weight of called splits that are also in reference)
/ (total weight of all called splits)
weighted recall (total weight of reference splits that are also called)
/ (total weight of all reference splits)

Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.

Salmonella enterica Para C

220 genomes, k=31

Preprocessing Running time
pre-processing
Running time
SANS
Running time both
(parall. pre-proc.,
factor 16)
F1-Score
(weighted)
none (whole genome) 610s 0.878
(0.999)
Getorf (-find 0) 247s 459s 706s
(474s)
0.881
(0.999)
Getorf (-find 1) 199s 339s 538s
(351s)
0.868
(0.999)
ORFfinder 5195s 166s 5361s
(491s)
0.858
(0.997)
Prodigal 6292s 134s 6326s
(521s)
0.853
(0.998)

Salmonella enterica subspecies enterica

2964 genomes, k=21

Preprocessing Running time
pre-processing
Running time
SANS
Running time both
(parall. pre-proc.,
factor 16)
F1-Score
(weighted)
none (whole genome) 190min 0.587
(0.792)
Getorf (-find 0) 55min 220min 274min
(223min)
0.624
(0.807)
Getorf (-find 1) 48min 165min 213min
(168min)
0.620
(0.799)
ORFfinder 1621min 78min 1698min
(179min)
0.594
(0.766)
Prodigal 1322min 50min 1372min
(133min)
0.587
(0.792)

Clustering / dereplication of metagenome assembled genomes (MAGs)

For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster.

This procedure has been successfully applied for dereplication of metagenome assembled genomes (MAGs). Here, the input are MAGs, and the goal is to cluster these such that the MAGs in each cluster belong to the same strain.

The general procedure is:

1
2
3
4
5
6
7
8
9

# reconstruct a tree

SANS --input <list_of_files> --newick <tree_to_cluster.new> --filter strict --kmer 15 (--verbose) --window W --top T (see below)

# determine clusters from tree

scripts/newick2clusters.py <tree_to_cluster.new> > <clusters.tsv>

Due to the usually very high number of input sequences, we recommend the usage of parameters --window (-w) and --top (-t) in order to save time and memory. (The experimental parameter --window is not mentioned in the usage, because it can lower the accuracy of reconstructed phylogenies considerably. But in this case, the reconstructed tree does not need to be an accurate phylogeny and the parameter has only reasonable effect on the clustering.)

Setting Parameters
quick –window 25 –top 50n
thoroughly –window 10 –top 100n

The tree is chopped into clusters as follows:

  • Re-root tree to maximum degree node

  • In post order traversal:

    • ignore non-branching node (merge edges)

    • get clusters from sub-trees (recursively)

    • if edge longer than parent edge:

      • remove found clusters from current leaf set

      • remaining leaf set =: new cluster

Dereplication efficiency and accuracy

This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.

Filter # MAGs # Strains
MIMAG medium 5,786 686
MIMAG high 2,510 349
no filter 11,602 791

For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:

purity := TP / (TP + FP)

completeness := TP / (TP + FN)

These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings.

Input Setting Running time Memory average purity
(weighted)
average completeness
(weighted)
MIMAG medium quick 11h 56G 0.973
(0.956)
0.884
(0.998)
thoroughly 50h 130G 0.972
(0.952)
0.881
(0.998)
MIMAG high quick 2h 16G 0.983
(0.978)
0.890
(0.991)
thoroughly 6h 36G 0.983
(0.979)
0.912
(0.993)
no filter quick 59h 127G 0.996
(0.983)
0.173
(0.668)
thoroughly 185h 290G 0.995
(0.979)
0.190
(0.700)

Location and version

1
2
$ which SANS
/local/cluster/bin/SANS

help message

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
$ SANS --help

SANS serif | version 2.2_04A
Usage: SANS [PARAMETERS]

  Input arguments:

    -i, --input   	Input file: file of files format
                  	Either: one genome per line (space-separated for multifile genomes)
                  	Or: kmtricks input format (see https://github.com/tlemane/kmtricks)

    -g, --graph   	Graph file: load a Bifrost graph, file name prefix
                  	(requires compiler flag -DuseBF, please edit makefile)

    -s, --splits  	Splits file: load an existing list of splits file
                  	(allows to filter -t/-f, other arguments are ignored)

    (either --input and/or --graph, or --splits must be provided)

  Output arguments:

    -o, --output  	Output TSV file: list of splits, sorted by weight desc.

    -N, --newick  	Output Newick file
                  	(only applicable in combination with -f strict or n-tree)

    (at least --output or --newick must be provided, or both)

  Optional arguments:

    -k, --kmer    	Length of k-mers (default: 31, or 10 for --amino and --code)

    -t, --top     	Number of splits in the output list (default: all).
                  	Use -t <integer>n to limit relative to number of input files, or
                  	use -t <integer> to limit by absolute value.

    -m, --mean    	Mean weight function to handle asymmetric splits
                  	options: arith: arithmetic mean
                  	         geom:  geometric mean
                  	         geom2: geometric mean with pseudo-counts (default)

    -f, --filter  	Output a greedy maximum weight subset
                  	options: strict: compatible to a tree
                  	         weakly: weakly compatible network
                  	         n-tree: compatible to a union of n trees
                  	                 (where n is an arbitrary number, e.g. 2-tree)

    -x, --iupac   	Extended IUPAC alphabet, resolve ambiguous bases or amino acids
                  	Specify a number to limit the k-mers per position between
                  	1 (no ambiguity) and 4^k respectively 22^k (allows NNN...N)
                  	Without --iupac respective k-mers are ignored

    -n, --norev   	Do not consider reverse complement k-mers

    -a, --amino   	Consider amino acids: --input provides amino acid sequences
                  	Implies --norev and a default k of 10

    -c, --code   	Translate DNA: --input provides coding sequences
                 	Implies --norev and a default k of 10
                 	optional: ID of the genetic code to be used
                 	Default: 1
                 	Use 11 for Bacterial, Archaeal, and Plant Plastid Code
                 	(See https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details.)

    -M, --maxN 	Compare number of input genomes to compile paramter DmaxN
               	Add path/to/makefile (default is makefile in current working directory).

    -v, --verbose 	Print information messages during execution

    -h, --help    	Display this help page and quit

  Contact: sans-service@cebitec.uni-bielefeld.de
  Evaluation: https://www.surveymonkey.de/r/denbi-service?sc=bigi&tool=sans

Recompiling

If you need to recompile and set -DmaxK or -DmaxN, you can find the build.sh file in /local/cluster/sans. You should copy the directory into a location where you have write access, change the settings in the makefile, and then re-run bash ./build.sh. Of course, check out a node with qrsh first as always when compiling software.

Examples and scripts

You can find the example data in /local/cluster/sans/example_data and the scripts in /local/cluster/sans/scripts.


software ref: https://gitlab.ub.uni-bielefeld.de/gi/sans
research ref: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab444/6300510