vsearch 2.22.1

2022-11-09 4098 words 20 minutes

Contents

Installed

This software should be available with no extra configuration.

vsearch-2.22.1

The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:

have open source code with an appropriate open source license
be free of charge, gratis
have a 64-bit design that handles very large databases and much more than 4GB of memory
be as accurate or more accurate than usearch
be as fast or faster than usearch

We have implemented a tool called VSEARCH which supports de novo and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads.

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

VSEARCH binaries are provided for GNU/Linux on three 64-bit processor architectures: x86-64, POWER8 (ppc64le) and ARMv8 (aarch64). Binaries are also provided for MacOS (version 10.9 Mavericks or later) on Intel (x86-64) and Apple Silicon (ARMv8), as well as Windows (64-bit, version 7 or higher, on x86_64). VSEARCH contains dedicated SIMD code for the three processor architectures (SSE2/SSSE3, AltiVec/VMX/VSX, Neon).

CPU \ OS	GNU/Linux	MacOS	Windows
x86_64	✔	✔	✔
ARMv8	✔	✔
POWER8	✔

VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.

Most of the nucleotide based commands and options in USEARCH version 7 are supported, as well as some in version 8. The same option names as in USEARCH version 7 has been used in order to make VSEARCH an almost drop-in replacement. VSEARCH does not support amino acid sequences or local alignments. These features may be added in the future.

Getting Help

If you can’t find an answer in the VSEARCH documentation, please visit the VSEARCH Web Forum to post a question or start a discussion.

Example

In the example below, VSEARCH will identify sequences in the file database.fsa that are at least 90% identical on the plus strand to the query sequences in the file queries.fsa and write the results to the file alnout.txt.

./vsearch --usearch_global queries.fsa --db database.fsa --id 0.9 --alnout alnout.txt

Location and version

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


$ which vsearch
/local/cluster/bin/vsearch
$ vsearch --version
vsearch v2.22.1_linux_x86_64, 995.4GB RAM, 64 cores
https://github.com/torognes/vsearch

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
VSEARCH: a versatile open source tool for metagenomics
PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

Compiled with support for gzip-compressed files, and the library is loaded.
zlib version 1.2.8, compile flags a9
Compiled with support for bzip2-compressed files, and the library is loaded.

help message

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454


$ vsearch --help
vsearch v2.22.1_linux_x86_64, 995.4GB RAM, 64 cores
https://github.com/torognes/vsearch

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
VSEARCH: a versatile open source tool for metagenomics
PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

Usage: vsearch [OPTIONS]

General options
  --bzip2_decompress          decompress input with bzip2 (required if pipe)
  --fasta_width INT           width of FASTA seq lines, 0 for no wrap (80)
  --gzip_decompress           decompress input with gzip (required if pipe)
  --help | -h                 display help information
  --log FILENAME              write messages, timing and memory info to file
  --maxseqlength INT          maximum sequence length (50000)
  --minseqlength INT          min seq length (clust/derep/search: 32, other:1)
  --no_progress               do not show progress indicator
  --notrunclabels             do not truncate labels at first space
  --quiet                     output just warnings and fatal errors to stderr
  --threads INT               number of threads to use, zero for all cores (0)
  --version | -v              display version information

Chimera detection
  --uchime_denovo FILENAME    detect chimeras de novo
  --uchime2_denovo FILENAME   detect chimeras de novo in denoised amplicons
  --uchime3_denovo FILENAME   detect chimeras de novo in denoised amplicons
  --uchime_ref FILENAME       detect chimeras using a reference database
 Data
  --db FILENAME               reference database for --uchime_ref
 Parameters
  --abskew REAL               minimum abundance ratio (2.0, 16.0 for uchime3)
  --dn REAL                   'no' vote pseudo-count (1.4)
  --mindiffs INT              minimum number of differences in segment (3) *
  --mindiv REAL               minimum divergence from closest parent (0.8) *
  --minh REAL                 minimum score (0.28) * ignored in uchime2/3
  --sizein                    propagate abundance annotation from input
  --self                      exclude identical labels for --uchime_ref
  --selfid                    exclude identical sequences for --uchime_ref
  --xn REAL                   'no' vote weight (8.0)
 Output
  --alignwidth INT            width of alignment in uchimealn output (80)
  --borderline FILENAME       output borderline chimeric sequences to file
  --chimeras FILENAME         output chimeric sequences to file
  --fasta_score               include chimera score in fasta output
  --nonchimeras FILENAME      output non-chimeric sequences to file
  --relabel STRING            relabel nonchimeras with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeout                   include abundance information when relabelling
  --uchimealns FILENAME       output chimera alignments to file
  --uchimeout FILENAME        output to chimera info to tab-separated file
  --uchimeout5                make output compatible with uchime version 5
  --xsize                     strip abundance information in output

Clustering
  --cluster_fast FILENAME     cluster sequences after sorting by length
  --cluster_size FILENAME     cluster sequences after sorting by abundance
  --cluster_smallmem FILENAME cluster already sorted sequences (see -usersort)
  --cluster_unoise FILENAME   denoise Illumina amplicon reads
 Parameters (most searching options also apply)
  --cons_truncate             do not ignore terminal gaps in MSA for consensus
  --id REAL                   reject if identity lower, accepted values: 0-1.0
  --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)
  --sizein                    propagate abundance annotation from input
  --strand plus|both          cluster using plus or both strands (plus)
  --usersort                  indicate sequences not pre-sorted by length
  --minsize INT               minimum abundance (unoise only) (8)
  --unoise_alpha REAL         alpha parameter (unoise only) (2.0)
 Output
  --biomout FILENAME          filename for OTU table output in biom 1.0 format
  --centroids FILENAME        output centroid sequences to FASTA file
  --clusterout_id             add cluster id info to consout and profile files
  --clusterout_sort           order msaout, consout, profile by decr abundance
  --clusters STRING           output each cluster to a separate FASTA file
  --consout FILENAME          output cluster consensus sequences to FASTA file
  --mothur_shared_out FN      filename for OTU table output in mothur format
  --msaout FILENAME           output multiple seq. alignments to FASTA file
  --otutabout FILENAME        filename for OTU table output in classic format
  --profile FILENAME          output sequence profile of each cluster to file
  --relabel STRING            relabel centroids with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeorder                 sort accepted centroids by abundance, AGC
  --sizeout                   write cluster abundances to centroid file
  --uc FILENAME               specify filename for UCLUST-like output
  --xsize                     strip abundance information in output

Convert SFF to FASTQ
  --sff_convert FILENAME      convert given SFF file to FASTQ format
 Parameters
  --sff_clip                  clip ends of sequences as indicated in file (no)
  --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)
  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  --fastq_qminout INT         minimum base quality value for FASTQ output (0)
 Output
  --fastqout FILENAME         output converted sequences to given FASTQ file

Dereplication and rereplication
  --derep_fulllength FILENAME dereplicate sequences in the given FASTA file
  --derep_id FILENAME         dereplicate using both identifiers and sequences
  --derep_prefix FILENAME     dereplicate sequences in file based on prefixes
  --derep_smallmem FILENAME   dereplicate sequences in file using less memory
  --fastx_uniques FILENAME    dereplicate sequences in the FASTA/FASTQ file
  --rereplicate FILENAME      rereplicate sequences in the given FASTA file
 Parameters
  --maxuniquesize INT         maximum abundance for output from dereplication
  --minuniquesize INT         minimum abundance for output from dereplication
  --sizein                    propagate abundance annotation from input
  --strand plus|both          dereplicate plus or both strands (plus)
 Output
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --fastq_qminout INT         minimum base quality value for FASTQ output (0)
  --fastaout FILENAME         output FASTA file (for fastx_uniques)
  --fastqout FILENAME         output FASTQ file (for fastx_uniques)
  --output FILENAME           output FASTA file (not for fastx_uniques)
  --relabel STRING            relabel with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeout                   write abundance annotation to output
  --tabbedout FILENAME        write cluster info to tsv file for fastx_uniques
  --topn INT                  output only n most abundant sequences after derep
  --uc FILENAME               filename for UCLUST-like dereplication output
  --xsize                     strip abundance information in derep output

FASTA to FASTQ conversion
  --fasta2fastq FILENAME      convert from FASTA to FASTQ, fake quality scores
 Parameters
  --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)
  --fastq_qmaxout INT         fake quality score for FASTQ output (41)
 Output
  --fastqout FILENAME         FASTQ output filename for converted sequences

FASTQ format conversion
  --fastq_convert FILENAME    convert between FASTQ file formats
 Parameters
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --fastq_qminout INT         minimum base quality value for FASTQ output (0)
 Output
  --fastqout FILENAME         FASTQ output filename for converted sequences

FASTQ format detection and quality analysis
  --fastq_chars FILENAME      analyse FASTQ file for version and quality range
 Parameters
  --fastq_tail INT            min length of tails to count for fastq_chars (4)

FASTQ quality statistics
  --fastq_stats FILENAME      report statistics on FASTQ file
  --fastq_eestats FILENAME    quality score and expected error statistics
  --fastq_eestats2 FILENAME   expected error and length cutoff statistics
 Parameters
  --ee_cutoffs REAL,...       fastq_eestats2 expected error cutoffs (0.5,1,2)
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --length_cutoffs INT,INT,INT fastq_eestats2 length (min,max,incr) (50,*,50)
 Output
  --log FILENAME              output file for fastq_stats statistics
  --output FILENAME           output file for fastq_eestats(2) statistics

Masking (new)
  --fastx_mask FILENAME       mask sequences in the given FASTA or FASTQ file
 Parameters
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --hardmask                  mask by replacing with N instead of lower case
  --max_unmasked_pct          max unmasked % of sequences to keep (100.0)
  --min_unmasked_pct          min unmasked % of sequences to keep (0.0)
  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)
 Output
  --fastaout FILENAME         output to specified FASTA file
  --fastqout FILENAME         output to specified FASTQ file

Masking (old)
  --maskfasta FILENAME        mask sequences in the given FASTA file
 Parameters
  --hardmask                  mask by replacing with N instead of lower case
  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)
 Output
  --output FILENAME           output to specified FASTA file

Orient sequences in forward or reverse direction
  --orient FILENAME           orient sequences in given FASTA/FASTQ file
 Data
  --db FILENAME               database of sequences in correct orientation
  --dbmask none|dust|soft     mask db seqs with dust, soft or no method (dust)
  --qmask none|dust|soft      mask query with dust, soft or no method (dust)
  --wordlength INT            length of words used for matching 3-15 (12)
 Output
  --fastaout FILENAME         FASTA output filename for oriented sequences
  --fastqout FILENAME         FASTQ output filenamr for oriented sequences
  --notmatched FILENAME       output filename for undetermined sequences
  --tabbedout FILENAME        output filename for result information

Paired-end reads joining
  --fastq_join FILENAME       join paired-end reads into one sequence with gap
 Data
  --reverse FILENAME          specify FASTQ file with reverse reads
  --join_padgap STRING        sequence string used for padding (NNNNNNNN)
  --join_padgapq STRING       quality string used for padding (IIIIIIII)
 Output
  --fastaout FILENAME         FASTA output filename for joined sequences
  --fastqout FILENAME         FASTQ output filename for joined sequences

Paired-end reads merging
  --fastq_mergepairs FILENAME merge paired-end reads into one sequence
 Data
  --reverse FILENAME          specify FASTQ file with reverse reads
 Parameters
  --fastq_allowmergestagger   allow merging of staggered reads
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_maxdiffpct REAL     maximum percentage diff. bases in overlap (100.0)
  --fastq_maxdiffs INT        maximum number of different bases in overlap (10)
  --fastq_maxee REAL          maximum expected error value for merged sequence
  --fastq_maxmergelen         maximum length of entire merged sequence
  --fastq_maxns INT           maximum number of N's
  --fastq_minlen INT          minimum input read length after truncation (1)
  --fastq_minmergelen         minimum length of entire merged sequence
  --fastq_minovlen            minimum length of overlap between reads (10)
  --fastq_nostagger           disallow merging of staggered reads (default)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --fastq_qminout INT         minimum base quality value for FASTQ output (0)
  --fastq_truncqual INT       base quality value for truncation
 Output
  --eetabbedout FILENAME      output error statistics to specified file
  --fastaout FILENAME         FASTA output filename for merged sequences
  --fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences
  --fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences
  --fastq_eeout               include expected errors (ee) in FASTQ output
  --fastqout FILENAME         FASTQ output filename for merged sequences
  --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences
  --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences
  --label_suffix STRING       suffix to append to label of merged sequences
  --xee                       remove expected errors (ee) info from output

Pairwise alignment
  --allpairs_global FILENAME  perform global alignment of all sequence pairs
 Output (most searching options also apply)
  --alnout FILENAME           filename for human-readable alignment output
  --acceptall                 output all pairwise alignments

Restriction site cutting
  --cut FILENAME              filename of FASTA formatted input sequences
 Parameters
  --cut_pattern STRING        pattern to match with ^ and _ at cut sites
 Output
  --fastaout FILENAME         FASTA filename for fragments on forward strand
  --fastaout_rev FILENAME     FASTA filename for fragments on reverse strand
  --fastaout_discarded FN     FASTA filename for non-matching sequences
  --fastaout_discarded_rev FN FASTA filename for non-matching, reverse compl.

Reverse complementation
  --fastx_revcomp FILENAME    reverse-complement seqs in FASTA or FASTQ file
 Parameters
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
 Output
  --fastaout FILENAME         FASTA output filename
  --fastqout FILENAME         FASTQ output filename
  --label_suffix STRING       label to append to identifier in the output

Searching
  --search_exact FILENAME     filename of queries for exact match search
  --usearch_global FILENAME   filename of queries for global alignment search
 Data
  --db FILENAME               name of UDB or FASTA database for search
 Parameters
  --dbmask none|dust|soft     mask db with dust, soft or no method (dust)
  --fulldp                    full dynamic programming alignment (always on)
  --gapext STRING             penalties for gap extension (2I/1E)
  --gapopen STRING            penalties for gap opening (20I/2E)
  --hardmask                  mask by replacing with N instead of lower case
  --id REAL                   reject if identity lower
  --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
  --idprefix INT              reject if first n nucleotides do not match
  --idsuffix INT              reject if last n nucleotides do not match
  --lca_cutoff REAL           fraction of matching hits required for LCA (1.0)
  --leftjust                  reject if terminal gaps at alignment left end
  --match INT                 score for match (2)
  --maxaccepts INT            number of hits to accept and show per strand (1)
  --maxdiffs INT              reject if more substitutions or indels
  --maxgaps INT               reject if more indels
  --maxhits INT               maximum number of hits to show (unlimited)
  --maxid REAL                reject if identity higher
  --maxqsize INT              reject if query abundance larger
  --maxqt REAL                reject if query/target length ratio higher
  --maxrejects INT            number of non-matching hits to consider (32)
  --maxsizeratio REAL         reject if query/target abundance ratio higher
  --maxsl REAL                reject if shorter/longer length ratio higher
  --maxsubs INT               reject if more substitutions
  --mid REAL                  reject if percent identity lower, ignoring gaps
  --mincols INT               reject if alignment length shorter
  --minqt REAL                reject if query/target length ratio lower
  --minsizeratio REAL         reject if query/target abundance ratio lower
  --minsl REAL                reject if shorter/longer length ratio lower
  --mintsize INT              reject if target abundance lower
  --minwordmatches INT        minimum number of word matches required (12)
  --mismatch INT              score for mismatch (-4)
  --pattern STRING            option is ignored
  --qmask none|dust|soft      mask query with dust, soft or no method (dust)
  --query_cov REAL            reject if fraction of query seq. aligned lower
  --rightjust                 reject if terminal gaps at alignment right end
  --sizein                    propagate abundance annotation from input
  --self                      reject if labels identical
  --selfid                    reject if sequences identical
  --slots INT                 option is ignored
  --strand plus|both          search plus or both strands (plus)
  --target_cov REAL           reject if fraction of target seq. aligned lower
  --weak_id REAL              include aligned hits with >= id; continue search
  --wordlength INT            length of words for database index 3-15 (8)
 Output
  --alnout FILENAME           filename for human-readable alignment output
  --biomout FILENAME          filename for OTU table output in biom 1.0 format
  --blast6out FILENAME        filename for blast-like tab-separated output
  --dbmatched FILENAME        FASTA file for matching database sequences
  --dbnotmatched FILENAME     FASTA file for non-matching database sequences
  --fastapairs FILENAME       FASTA file with pairs of query and target
  --lcaout FILENAME           output LCA of matching sequences to file
  --matched FILENAME          FASTA file for matching query sequences
  --mothur_shared_out FN      filename for OTU table output in mothur format
  --notmatched FILENAME       FASTA file for non-matching query sequences
  --otutabout FILENAME        filename for OTU table output in classic format
  --output_no_hits            output non-matching queries to output files
  --rowlen INT                width of alignment lines in alnout output (64)
  --samheader                 include a header in the SAM output file
  --samout FILENAME           filename for SAM format output
  --sizeout                   write abundance annotation to dbmatched file
  --top_hits_only             output only hits with identity equal to the best
  --uc FILENAME               filename for UCLUST-like output
  --uc_allhits                show all, not just top hit with uc output
  --userfields STRING         fields to output in userout file
  --userout FILENAME          filename for user-defined tab-separated output

Shuffling and sorting
  --shuffle FILENAME          shuffle order of sequences in FASTA file randomly
  --sortbylength FILENAME     sort sequences by length in given FASTA file
  --sortbysize FILENAME       abundance sort sequences in given FASTA file
 Parameters
  --maxsize INT               maximum abundance for sortbysize
  --minsize INT               minimum abundance for sortbysize
  --randseed INT              seed for PRNG, zero to use random data source (0)
  --sizein                    propagate abundance annotation from input
 Output
  --output FILENAME           output to specified FASTA file
  --relabel STRING            relabel sequences with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeout                   include abundance information when relabelling
  --topn INT                  output just first n sequences
  --xsize                     strip abundance information in output

Subsampling
  --fastx_subsample FILENAME  subsample sequences from given FASTA/FASTQ file
 Parameters
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --randseed INT              seed for PRNG, zero to use random data source (0)
  --sample_pct REAL           sampling percentage between 0.0 and 100.0
  --sample_size INT           sampling size
  --sizein                    consider abundance info from input, do not ignore
 Output
  --fastaout FILENAME         output subsampled sequences to FASTA file
  --fastaout_discarded FILE   output non-subsampled sequences to FASTA file
  --fastqout FILENAME         output subsampled sequences to FASTQ file
  --fastqout_discarded        output non-subsampled sequences to FASTQ file
  --relabel STRING            relabel sequences with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeout                   update abundance information in output
  --xsize                     strip abundance information in output

Taxonomic classification
  --sintax FILENAME           classify sequences in given FASTA/FASTQ file
 Parameters
  --db FILENAME               taxonomic reference db in given FASTA or UDB file
  --sintax_cutoff REAL        confidence value cutoff level (0.0)
 Output
  --tabbedout FILENAME        write results to given tab-delimited file

Trimming and filtering
  --fastx_filter FILENAME     trim and filter sequences in FASTA/FASTQ file
  --fastq_filter FILENAME     trim and filter sequences in FASTQ file
  --reverse FILENAME          FASTQ file with other end of paired-end reads
 Parameters
  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  --fastq_maxee REAL          discard if expected error value is higher
  --fastq_maxee_rate REAL     discard if expected error rate is higher
  --fastq_maxlen INT          discard if length of sequence is longer
  --fastq_maxns INT           discard if number of N's is higher
  --fastq_minlen INT          discard if length of sequence is shorter
  --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  --fastq_stripleft INT       delete given number of bases from the 5' end
  --fastq_stripright INT      delete given number of bases from the 3' end
  --fastq_truncee REAL        truncate to given maximum expected error
  --fastq_trunclen INT        truncate to given length (discard if shorter)
  --fastq_trunclen_keep INT   truncate to given length (keep if shorter)
  --fastq_truncqual INT       truncate to given minimum base quality
  --maxsize INT               discard if abundance of sequence is above
  --minsize INT               discard if abundance of sequence is below
 Output
  --eeout                     include expected errors in output
  --fastaout FN               FASTA filename for passed sequences
  --fastaout_discarded FN     FASTA filename for discarded sequences
  --fastaout_discarded_rev FN FASTA filename for discarded reverse sequences
  --fastaout_rev FN           FASTA filename for passed reverse sequences
  --fastqout FN               FASTQ filename for passed sequences
  --fastqout_discarded FN     FASTQ filename for discarded sequences
  --fastqout_discarded_rev FN FASTQ filename for discarded reverse sequences
  --fastqout_rev FN           FASTQ filename for passed reverse sequences
  --relabel STRING            relabel filtered sequences with given prefix
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel filtered sequences with md5 digest
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel filtered sequences with sha1 digest
  --sizeout                   include abundance information when relabelling
  --xee                       remove expected errors (ee) info from output
  --xsize                     strip abundance information in output

UDB files
  --makeudb_usearch FILENAME  make UDB file from given FASTA file
  --udb2fasta FILENAME        output FASTA file from given UDB file
  --udbinfo FILENAME          show information about UDB file
  --udbstats FILENAME         report statistics about indexed words in UDB file
 Parameters
  --dbmask none|dust|soft     mask db with dust, soft or no method (dust)
  --hardmask                  mask by replacing with N instead of lower case
  --wordlength INT            length of words for database index 3-15 (8)
 Output
  --output FILENAME           UDB or FASTA output file

software ref: https://github.com/torognes/vsearch
research ref: https://doi.org/10.7717/peerj.2584