SKESA 2.5.0 and SAUTE 1.3.0

2021-09-07 2208 words 11 minutes

SKESA: strategic k-mer extension for scrupulous assemblies

SKESA is a de-novo sequence read assembler for microbial genomes. It uses conservative heuristics and is designed to create breaks at repeat regions in the genome. This leads to excellent sequence quality without significantly compromising contiguity. If desired, SKESA contigs could be connected into a GFA graph using GFA connector.

SAUTE: sequence assembly using target enrichment

SAUTE is a de Bruijn graph based target enriched de-novo assembler designed for assembling genomic and RNA-seq reads sequenced using Illumina. The result is reported as a GFA graph and two nucleotide fasta sequence files for assemblies in the graph.

Location and version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


$ which skesa
/local/cluster/bin/skesa
$ skesa --version
skesa --version

SKESA 2.5.0
$ which saute
/local/cluster/bin/saute
$ saute --version
saute --version

saute 1.3.0

help message:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60


Running
   skesa 
or 
   skesa -h
or 
   skesa --help
gives information about options and produces the following:

General options:
  -h [ --help ]                 Produce help message
  -v [ --version ]              Print version
  --cores arg (=0)              Number of cores to use (default all) [integer]
  --memory arg (=32)            Memory available (GB, only for sorted counter) 
                                [integer]
  --hash_count                  Use hash counter [flag]
  --estimated_kmers arg (=100)  Estimated number of distinct kmers for bloom 
                                filter (M, only for hash counter) [integer]
  --skip_bloom_filter           Don't do bloom filter; use --estimated_kmers as
                                the hash table size (only for hash counter) 
                                [flag]

Input/output options: at least one input providing reads for assembly must be specified:
  --reads arg                   Input fasta/fastq file(s) for reads (could be 
                                used multiple times for different runs, could 
                                be gzipped) [string]
  --use_paired_ends             Indicates that single (not comma separated) 
                                fasta/fastq files contain paired reads [flag]
  --sra_run arg                 Input sra run accession (could be used multiple
                                times for different runs) [string]
  --contigs_out arg             Output file for contigs (stdout if not 
                                specified) [string]

Assembly options:
  --kmer arg (=21)              Minimal kmer length for assembly [integer]
  --min_count arg               Minimal count for kmers retained for comparing 
                                alternate choices [integer]
  --max_kmer_count arg          Minimum acceptable average count for estimating
                                the maximal kmer length in reads [integer]
  --vector_percent arg (=0.05)  Percentage of reads containing 19-mer for the
                                19-mer to be considered a vector (1. disables) [float (0,1]]
  --insert_size arg             Expected insert size for paired reads (if not 
                                provided, it will be estimated) [integer]
  --steps arg (=11)             Number of assembly iterations from minimal to 
                                maximal kmer length in reads [integer]
  --fraction arg (=0.1)         Maximum noise to signal ratio acceptable for 
                                extension [float [0,1)]
  --max_snp_len arg (=150)      Maximal snp length [integer]
  --min_contig arg (=200)       Minimal contig length reported in output 
                                [integer]
  --allow_snps                  Allow additional step for snp discovery [flag]

Debugging options:
  --force_single_ends           Don't use paired-end information [flag]
  --seeds arg                   Input file with seeds [string]
  --all arg                     Output fasta for each iteration [string]
  --dbg_out arg                 Output kmer file [string]
  --hist arg                    File for histogram [string]
  --connected_reads arg         File for connected paired reads [string]

Note that --sra_run option is not available if SKESA is built using Makefile.nongs

SKESA usage examples:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


 In all the examples below, we are providing 4 cores and have 48 Gb of memory.

 Example of an assembly that directly accesses SRA for an unpaired read set SRR867211 is:

   $ skesa --sra_run SRR867211 --cores 4 --memory 48 > SRR867211.skesa.fa

 Example of an assembly that directly accesses SRA for a paired read set SRR1960353 is:

   $ skesa --sra_run SRR1960353 --cores 4 --memory 48 > SRR1960353.skesa.fa

 Example of an assembly that uses separate fastq files for each mate of SRR1703350 is:

   $ skesa --reads SRR1703350_1.fq,SRR1703350_2.fq --cores 4 --memory 48 > SRR1703350.skesa.fa

 Example of an assembly that uses interleaved mates for SRR1703350 as fastq input is:

   $ skesa --reads SRR1703350.fq --use_paired_ends --cores 4 --memory 48 > SRR1703350.skesa.fa

 Example of an assembly that uses reads from SRA for SRR1695624 and gzipped fasta for SRR1745628 is:

   $ skesa --sra_run SRR1695624 --reads SRR1745628.fa.gz --use_paired_ends --cores 4 --memory 48 > SAMN03218571.skesa.fa

 Example of the same assembly as above done with both runs accessed from SRA is:

   $ skesa --sra_run SRR1695624 --sra_run SRR1745628 --cores 4 --memory 48 > SAMN03218571.skesa.fa

SAUTE help:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107


General options:
-h [ --help ]                    Produce help message
-v [ --version ]                 Print version
--cores (=0)                     Number of cores to use (default all) [integer]
--estimated_kmers (=1000)        Estimated number of distinct kmers for bloom filter (millions) [integer]
                                 To avoid expensive rehashing of the de Bruijn graph, SAUTE uses a light-weight counting Bloom filter to find 
                                 the number of different kmers satisfying the --min_count criterion. Parameter --estimated_kmers provides the 
                                 number (in millions) of different kmers with any count present in the reads. Overestimating will result in a 
                                 small waste of memory. Underestimating will trigger iterative Bloom filter recalculation until a good estimate
                                 is found.

Input/output options : target file, at least one input for reads and output for gfa must be specified:
--gfa                            Output file for GFA graph [string]
--all_variants                   Output file for sequences with all variant combinations [string]
--max_variants (=1000)           Restricts the number of variants reported in --all_variants per graph [integer]
--selected_variants              Output file for selected sequences representing all graph elements [string]
--targets                        Input file with reference sequences [string]
--sra_run                        Input sra run accession (could be used multiple times for different runs) [string]
--reads                          Input fasta/fastq file(s) for reads (could be used multiple times for different runs, could be gzipped) [string]
--use_paired_ends                Indicates that single (not comma separated) fasta/fastq files contain paired reads [flag]

Assembly options:
--vector_percent arg (=0.05)  Percentage of reads containing 19-mer for the 19-mer to be considered a vector (1. disables) [float (0,1]]
                                 Before doing any assembly SAUTE removes vectors from the reads. It finds and clips read after all 19-mers found in a relatively high number of reads.

--min_count (=2)                 Minimal count for kmers retained in graph [integer]
                                 This option keeps noise kmers out of the de Bruijn graphs used for assembling. By default, a kmer has to be seen in 
                                 at least 2 reads to be included. It is allowed to use --min_count 1, and it could help in assembling low 
                                 coverage spots. The drawback of this setting is that all read errors satisfying --fraction criterion will be 
                                 recognized as valid SNPs. To alleviate this, the program will remove all sections supported by only 1 read 
                                 if they were not necessary for a connection. 
--fraction (=0.05)               Maximum noise to signal ratio acceptable for extension [float [0,1)]
                                 All fork branches with relative abundance less than this are ignored.
--kmer                           Primary kmer length for assembly (default automatic) [integer]
--secondary_kmer                 Shorter kmer length for low coverage spots (default automatic) [integer]
                                 For assembling SAUTE uses two de Bruijn graphs with different kmer lengths. It uses the longer kmer most of the 
                                 time and switches to the shorter kmer to assemble the low coverage spots. Both kmers must be odd numbers. In case 
                                 of protein references, both kmers must be divisible by 3. By default, the program will use the closest valid values
                                 not exceeding half and a fifth of the read length.
--secondary_kmer_threshold (=1)  Coverage threshold for using shorter kmer [integer]
                                 This parameter defines what is considered to be low coverage for the secondary kmer use. With default, these are spots with no 
                                 possible extension or extension supported by only 1 read (could happen only with --min-count 1). Specifying 
                                 a higher number may help with detecting some low coverage forks. With more work shifted toward the shorter 
                                 kmer, there is a higher chance that the program will enter a highly repetitive area and will create a very 
                                 complex graph.

--word                           Word size for seeds [integer <= 16]
                                 To start assembling, SAUTE scans the main de Bruijn graph and finds seed kmers with high similarity to the reference.
                                 These kmers are used as starting points for the assembly. To trigger a kmer comparison to the reference, one of its ends
                                 has to have an exact match of --word length. For a protein reference, the word must be divisible by 3. Defaults are 8 for 
                                 nucleotide references and 12 for protein references. Latter translates to 4 aa match.

--kmer_complexity (=2000)        Hard mask reference areas with high number of variant seed possibilities (0 disables masking) [integer]
                                 Repetitive regions could result in very complex output graph and excessive calculation time. All reference areas
                                 for which SAUTE found in each position more different seeds than this parameter are internally hard masked and
                                 will not be assembled. 
--max_fork_density (=0.1)        Maximal fork density averaged over --buf_length before abandoning assembling (0 disables) [float]
--buf_length (=200)              Buffer length for fork density [integer]
                                 If assembling enters an unmasked repetitive area, the number of forks and paths to follow may become very high. 
                                 SAUTE calculates the average fork density for the last --buf_length long portion of the assembly. If the fork 
                                 density is above the --max_fork_density threshold, assembling for the current seed is stopped and the program 
                                 moves to the next seed.

--target_coverage (=0.5)         Keep a path in output if it has alignment to the reference that is at least target_coverage*(reference length) long [float (0,1]]
--min_hit_len                    If a path is shorter than target_coverage*(reference length), use this length threshold for keeping paths (optional) [integer]

--extend_ends                    Unambiguously extend graph ends using de-novo assembly [flag]
--protect_reference_ends         Near the reference ends, don't check if reads support some minimal extension of the fork's branches [flag]
                                 Each fork is analysed using the algorithm described in the SKESA paper, and less supported branches are deleted. 
                                 There is one check which could be detrimental to assembling RNAseq -- each valid branch has to be extendable by ~100bp.
                                 If this is close to a transcript 5' or 3' end, extension is expected to fail. This option disables the extension
                                 check at the reference ends.
--keep_subgraphs                 Don't remove redundant subgraphs [flag]
                                 If the reference file contains similar references, SAUTE may create multiple graphs which are either identical to
                                 or are sub-graphs of some other graphs. By default, the program will remove redundant
                                 graphs and sub-graphs. This option disables sub-graph removal.
--use_ambiguous_na               Use ambiguous nucleotide codes for SNPs [flag]
                                 If the option is used, SAUTE will replace SNPs with ambiguous nucleotide codes. This will simplify the graph
                                 and will reduce the number of sequence variants. The counts for individual SNPs will be lost.

--gap_open                       Penalty for gap opening [integer] (default of 5 for nucleotide references and 11 for protein references)
--gap_extend                     Penalty for gap extension [integer] (default of 3 for nucleotide references and 1 for protein references)
--drop_off (=300)                Maximal decrease of score before abandoning assembly path [integer]
                                 SAUTE uses a Smith-Waterman type algorithm for aligning references to the de Bruijn graph. For nucleotide references,
                                 the usual match bonus and mismatch penalty are used. They could be changed from the input (see below Options specific 
                                 for nucleotide targets). For protein references BLOSUM62 substitution matrix is used. In both cases, penalties for gaps 
                                 are specified using --gap_open and --gap_extend parameters. The --drop_off parameter determines how far from the 
                                 reference the assembly may go before the extension is stopped, and the assembly is trimmed to the aligned part.
                                 Larger gap open penalty is used for opening frameshifts (see below).

Graph cleaning options:
--no_filter_by_reads             Don't use full length reads for variants filtering [flag]
--no_filter_by_pairs             Don't use mate pairs for variants filtering [flag]
                                 After assembling, SAUTE will use full length reads and read pairs to remove chimeric connections from the graph. 
                                 Above options will partially or completely disable these steps.
--max_path (=1000)               Maximal number of path extensions allowed for a single filtering check [integer]
                                 For each fork in graph, SAUTE expands all sequences starting at this fork using either the read length or the insert size.
                                 If the number of expanded sequences exceed this parameter value, the extension length is reduced.
--not_aligned_len (=10)          Not aligned read length for break count [integer]
--not_aligned_count (=3)         Number of not aligned reads to make a break [integer]
--aligned_count (=2)             Number of aligned reads to confirm a connection [integer]
                                 For each expanded sequence, SAUTE counts how many reads or pairs confirm this path and how many contradict it. Based on
                                 these counts, the path is either kept or discarded.

--remove_homopolymer_indels     Remove homopolymer indels [flag]
--homopolymer_len (=4)          Minimal length of homopolymer [integer]
--homopolymer_ratio (=0.33)     Coverage ratio threshold for removing homopolymer indels [float]

See the github repo link below for more information.

software ref: https://github.com/ncbi/SKESA
research ref: https://doi.org/10.1186/s13059-018-1540-z
research ref: https://doi.org/10.1186/s12859-021-04174-9