SANS 2.2_04A
Symmetric Alignment-free phylogeNomic Splits — Space and time Efficient Re-Implementation including Filters
- Reference-free
- Alignment-free
- Assembled genomes or reads as input
- Phylogenetic splits or tree as output
- NEW: Coding sequences / amino acid sequences as input (see --code and --amino), implemented by Marco Sohn
Details on filter option
The sorted list of splits is greedily filtered, i.e., splits are iterated from strongest to weakest, and a split is kept if and only if the filter criterion is met (a small code sketch of this loop follows the list below).

- strict: a split is kept if it is compatible with all previously kept splits, i.e., the resulting set of splits is equivalent to a tree.
- weakly: a split is kept if it is weakly compatible with all previously kept splits (see the publication for the definition of "weak compatibility").
- n-tree: several sets of compatible splits (= trees) are maintained. A split is added to the first, second, ..., n-th set if possible (compatible).
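The following minimal Python sketch illustrates this greedy loop together with the pairwise compatibility test used by the strict filter. It is an illustration only, not the actual C++ implementation; representing each split by one of its two sides as a set of taxa is an assumption made for the example.

```python
# Sketch of the greedy filter: iterate splits from strongest to weakest and keep
# a split only if it passes the chosen criterion against all splits kept so far.
# Splits are represented here by one side, as a frozenset over the taxon set.

def compatible(a, b, taxa):
    """Splits {a, taxa-a} and {b, taxa-b} are compatible iff at least one of the
    four pairwise intersections of their sides is empty (strict criterion)."""
    a2, b2 = taxa - a, taxa - b
    return not (a & b) or not (a & b2) or not (a2 & b) or not (a2 & b2)

def greedy_filter(weighted_splits, taxa):
    """weighted_splits: list of (weight, split) pairs; returns the kept subset."""
    kept = []
    for w, s in sorted(weighted_splits, key=lambda ws: ws[0], reverse=True):
        if all(compatible(s, t, taxa) for _, t in kept):
            kept.append((w, s))
    return kept

# Toy usage
taxa = frozenset("ABCD")
splits = [(5.0, frozenset("AB")), (3.0, frozenset("AC")), (2.0, frozenset("A"))]
print(greedy_filter(splits, taxa))  # keeps AB|CD and A|BCD, drops AC|BD
```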
Examples
- Determine splits from assemblies or read files

```
SANS -i list.txt -o sans.splits -k 31
```

The 31-mers (-k 31) of the fasta or fastq files listed in list.txt (-i list.txt) are extracted. Splits are determined and written to sans.splits (-o sans.splits).

To extract a tree (-f strict) in NEWICK format (-N sans_greedytree.new), use

```
SANS -i list.txt -k 31 -f strict -N sans_greedytree.new
```

or filter from an existing set of splits (-s sans.splits):

```
SANS -s sans.splits -f strict -N sans_greedytree.new
```
- Drosophila example data

```
# go to example directory
cd <SANS directory>
cd example_data/drosophila

# download data: whole genome and coding sequences
./download_WG.sh
./download_CDS.sh

# run SANS greedy tree
../../SANS -i WG/list.txt -o sans_greedytree_WG.splits -f strict -N sans_greedytree_WG.new -v
../../SANS -i CDS/list.txt -o sans_greedytree_CDS.splits -f strict -N sans_greedytree_CDS.new -v -c

# compare to reference
../../scripts/newick2sans.py Reference.new > Reference.splits
../../scripts/comp.py sans_greedytree_WG.splits Reference.splits WG/list.txt > sans_greedytree_WG.comp
../../scripts/comp.py sans_greedytree_CDS.splits Reference.splits CDS/list.txt > sans_greedytree_CDS.comp
```
- Virus example data

```
# go to example directory
cd <SANS directory>
cd example_data/prasinoviruses

# download data
./download.sh

# run SANS
../../SANS -i fa/list.txt -o sans.splits -k 11 -t 130 -v

# compare to references
../../scripts/newick2sans.py Reference_Fig3.new > Reference_Fig3.splits
../../scripts/comp.py sans.splits Reference_Fig3.splits fa/list.txt > fig3.comp
../../scripts/newick2sans.py Reference_Fig4.new > Reference_Fig4.splits
../../scripts/comp.py sans.splits Reference_Fig4.splits fa/list.txt > fig4.comp
```
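The -i option in the examples above takes a file such as list.txt that lists the input sequence files. Assuming this file simply contains one fasta/fastq path per line (as in the example data), a small helper like the following hypothetical Python snippet can generate it; the directory name and file pattern are placeholders.

```python
# Hypothetical helper to create a list.txt for SANS -i, assuming the file simply
# lists one sequence file per line. Directory and glob pattern are placeholders.
from pathlib import Path

genome_dir = Path("genomes")                 # directory containing the genomes
files = sorted(genome_dir.glob("*.fa"))      # adjust the pattern to your data
Path("list.txt").write_text("\n".join(str(f) for f in files) + "\n")
```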
Performance evaluation on predicted open reading frames
SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.
If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.
If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing step, and may even improve, while the total running time decreases, especially because the genomes can easily be pre-processed in parallel (a sketch of this is given after the table below).
The following tools have been used with the parameters shown in the table.
| Tool | Parameters | Reference |
|---|---|---|
| SANS | -k (see below) -m geom2 -t 10n --filter strict [-a (if preprocessed)] | |
| Getorf (EMBOSS) | -find (see below) -t 11 | Gary Williams, 2000 |
| ORFfinder | -n true -g 11 | NCBI |
| Prodigal | -q | V2.6.3, Doug Hyatt, 2016 |
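To illustrate the parallel pre-processing mentioned above, the following hedged Python sketch runs Getorf on many genomes in parallel and writes a list file for SANS. The paths, the file pattern and the pool size are assumptions; the Getorf parameters are the ones from the table.

```python
# Hypothetical sketch: predict ORFs with Getorf (EMBOSS) for many genomes in
# parallel before running SANS on the results. Paths, glob pattern and pool size
# are placeholders; -find and -table 11 are the parameters listed above.
import subprocess
from multiprocessing import Pool
from pathlib import Path

GENOMES = sorted(Path("genomes").glob("*.fasta"))    # input genomes (assumption)
OUTDIR = Path("orfs"); OUTDIR.mkdir(exist_ok=True)   # output directory (assumption)

def predict_orfs(genome):
    out = OUTDIR / (genome.stem + ".orf.fasta")
    subprocess.run(["getorf", "-find", "0", "-table", "11",
                    "-sequence", str(genome), "-outseq", str(out)], check=True)
    return out

if __name__ == "__main__":
    with Pool(16) as pool:                           # parallelization factor 16
        orf_files = pool.map(predict_orfs, GENOMES)
    # write the list of pre-processed files for SANS (-i)
    Path("list.txt").write_text("\n".join(str(f) for f in orf_files) + "\n")
```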
For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.
| Measure | Definition |
|---|---|
| F1 score | harmonic mean of precision and recall |
| precision | (number of called splits that are also in the reference) / (total number of called splits) |
| recall | (number of reference splits that are also called) / (total number of reference splits) |
| weighted precision | (total weight of called splits that are also in the reference) / (total weight of all called splits) |
| weighted recall | (total weight of reference splits that are also called) / (total weight of all reference splits) |
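As a plain illustration of these measures (the comp.py script shipped with SANS is the implementation actually used for the comparisons), a Python sketch could look as follows. It assumes that splits are stored in a canonical representation as the keys of weight dictionaries.

```python
# Illustrative computation of (weighted) precision, recall and F1 between called
# and reference splits. Each argument maps a split (in some canonical, hashable
# representation, e.g., a frozenset of taxa) to its weight.

def precision_recall_f1(called, reference, weighted=False):
    common = called.keys() & reference.keys()
    if weighted:
        precision = sum(called[s] for s in common) / sum(called.values())
        recall = sum(reference[s] for s in common) / sum(reference.values())
    else:
        precision = len(common) / len(called)
        recall = len(common) / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage
called = {frozenset("AB"): 5.0, frozenset("AC"): 1.0}
reference = {frozenset("AB"): 4.0, frozenset("AD"): 2.0}
print(precision_recall_f1(called, reference))                  # unweighted
print(precision_recall_f1(called, reference, weighted=True))   # weighted
```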
Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.
Salmonella enterica Para C
220 genomes, k=31
| Preprocessing | Running time pre-processing | Running time SANS | Running time both (parallel pre-processing, factor 16) | F1 score (weighted) |
|---|---|---|---|---|
| none (whole genome) | – | 610s | – | 0.878 (0.999) |
| Getorf (-find 0) | 247s | 459s | 706s (474s) | 0.881 (0.999) |
| Getorf (-find 1) | 199s | 339s | 538s (351s) | 0.868 (0.999) |
| ORFfinder | 5195s | 166s | 5361s (491s) | 0.858 (0.997) |
| Prodigal | 6292s | 134s | 6326s (521s) | 0.853 (0.998) |
Salmonella enterica subspecies enterica
2964 genomes, k=21
| Preprocessing | Running time pre-processing | Running time SANS | Running time both (parallel pre-processing, factor 16) | F1 score (weighted) |
|---|---|---|---|---|
| none (whole genome) | – | 190min | – | 0.587 (0.792) |
| Getorf (-find 0) | 55min | 220min | 274min (223min) | 0.624 (0.807) |
| Getorf (-find 1) | 48min | 165min | 213min (168min) | 0.620 (0.799) |
| ORFfinder | 1621min | 78min | 1698min (179min) | 0.594 (0.766) |
| Prodigal | 1322min | 50min | 1372min (133min) | 0.587 (0.792) |
Clustering / dereplication of metagenome assembled genomes (MAGs)
For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster.
This procedure has been successfully applied for the dereplication of metagenome-assembled genomes (MAGs). Here, the input consists of MAGs, and the goal is to cluster them such that the MAGs in each cluster belong to the same strain.
The general procedure is to construct a tree with SANS and then chop it into clusters as described below.
Due to the usually very high number of input sequences, we recommend using the parameters --window (-w) and --top (-t) in order to save time and memory. (The experimental parameter --window is not mentioned in the usage because it can considerably lower the accuracy of reconstructed phylogenies. In this case, however, the reconstructed tree does not need to be an accurate phylogeny, and the parameter has only a moderate effect on the clustering.)
| Setting | Parameters |
|---|---|
| quick | --window 25 --top 50n |
| thoroughly | --window 10 --top 100n |
The tree is chopped into clusters as follows (a rough code sketch is given after the list):

- Re-root the tree at the node of maximum degree.
- In a post-order traversal:
  - ignore non-branching nodes (merge their edges),
  - get clusters from the sub-trees (recursively),
  - if an edge is longer than its parent edge:
    - remove the already found clusters from the current leaf set,
    - the remaining leaf set becomes a new cluster.
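The following rough Python sketch illustrates this chopping procedure; it is for illustration only, the scripts shipped with SANS are the reference implementation. The node representation and the handling of leaves that remain unclustered at the root are assumptions of the sketch.

```python
# Rough sketch of the tree chopping. A node is a dict with "length" (length of
# the edge to its parent), "children" (list of nodes) and, for leaves, "name".
# The tree is assumed to be already re-rooted at the node of maximum degree.

def chop(node, parent_length, clusters, assigned):
    """Post-order traversal; returns the set of leaves below `node`."""
    if not node["children"]:                        # leaf
        leaves = {node["name"]}
    elif len(node["children"]) == 1:                # non-branching node:
        child = dict(node["children"][0])           # merge its edge into the child's
        child["length"] += node["length"]
        return chop(child, parent_length, clusters, assigned)
    else:
        leaves = set()
        for child in node["children"]:              # get clusters from sub-trees
            leaves |= chop(child, node["length"], clusters, assigned)
    if node["length"] > parent_length:              # edge longer than parent edge:
        cluster = leaves - assigned                 # remove already clustered leaves,
        if cluster:                                 # the rest becomes a new cluster
            clusters.append(cluster)
            assigned |= cluster
    return leaves

def chop_tree(root):
    clusters, assigned = [], set()
    all_leaves = chop(root, float("inf"), clusters, assigned)
    rest = all_leaves - assigned                    # sketch assumption: leaves that
    if rest:                                        # remain unclustered at the root
        clusters.append(rest)                       # form one final cluster
    return clusters

# Toy usage: ((A:1,B:1):3,(C:1,D:1):3,E:2); rooted at the degree-3 node
leaf = lambda name, length: {"name": name, "length": length, "children": []}
root = {"length": 0.0, "children": [
    {"length": 3.0, "children": [leaf("A", 1.0), leaf("B", 1.0)]},
    {"length": 3.0, "children": [leaf("C", 1.0), leaf("D", 1.0)]},
    leaf("E", 2.0)]}
print(chop_tree(root))   # three clusters: {A, B}, {C, D}, {E}
```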
Dereplication efficiency and accuracy
This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.
| Filter | # MAGs | # Strains |
|---|---|---|
| MIMAG medium | 5,786 | 686 |
| MIMAG high | 2,510 | 349 |
| no filter | 11,602 | 791 |
For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:
purity := TP / (TP + FP)
completeness := TP / (TP + FN)
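For illustration, the mapping and the per-cluster measures could be computed as in the following Python sketch; this is not the evaluation code actually used, and weighting the averages by cluster size is an assumption of the sketch.

```python
# Illustrative sketch of the evaluation: map each called cluster to the
# gold-standard cluster with maximum intersection, compute purity = TP/(TP+FP)
# and completeness = TP/(TP+FN) per cluster, and average the per-cluster values,
# unweighted and (as an assumption) weighted by cluster size.

def evaluate(called_clusters, gold_clusters):
    """Both arguments are lists of sets of MAG identifiers."""
    purity, completeness, size = [], [], []
    for called in called_clusters:
        best = max(gold_clusters, key=lambda gold: len(called & gold))
        tp = len(called & best)        # MAGs agreeing with the mapped gold cluster
        fp = len(called - best)        # MAGs belonging to other gold clusters
        fn = len(best - called)        # MAGs of the mapped gold cluster that are missing
        purity.append(tp / (tp + fp))
        completeness.append(tp / (tp + fn))
        size.append(len(called))
    n, total = len(called_clusters), sum(size)
    return {
        "purity": sum(purity) / n,
        "completeness": sum(completeness) / n,
        "weighted purity": sum(p * s for p, s in zip(purity, size)) / total,
        "weighted completeness": sum(c * s for c, s in zip(completeness, size)) / total,
    }
```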
These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings.
| Input | Setting | Running time | Memory | average purity (weighted) | average completeness (weighted) |
|---|---|---|---|---|---|
| MIMAG medium | quick | 11h | 56G | 0.973 (0.956) | 0.884 (0.998) |
| MIMAG medium | thoroughly | 50h | 130G | 0.972 (0.952) | 0.881 (0.998) |
| MIMAG high | quick | 2h | 16G | 0.983 (0.978) | 0.890 (0.991) |
| MIMAG high | thoroughly | 6h | 36G | 0.983 (0.979) | 0.912 (0.993) |
| no filter | quick | 59h | 127G | 0.996 (0.983) | 0.173 (0.668) |
| no filter | thoroughly | 185h | 290G | 0.995 (0.979) | 0.190 (0.700) |
Location and version

SANS version 2.2_04A is installed in /local/cluster/sans.
Help message
Recompiling
If you need to recompile and set -DmaxK or -DmaxN, you can find the build.sh file in /local/cluster/sans. You should copy the directory into a location where you have write access, change the settings in the makefile, and then re-run bash ./build.sh. Of course, check out a node with qrsh first, as always when compiling software.
Examples and scripts
You can find the example data in /local/cluster/sans/example_data and the scripts in /local/cluster/sans/scripts.
Software: https://gitlab.ub.uni-bielefeld.de/gi/sans
Publication: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab444/6300510