FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
protein sequences. Common manipulations of FASTA/Q files include converting,
searching, filtering, deduplication, splitting, shuffling, and sampling.
Existing tools only implement some of these manipulations, and not particularly
efficiently, and some are only available for certain operating systems.
Furthermore, the complicated installation process of required packages and
running environments can render these programs less user friendly.
This project describes a cross-platform ultrafast comprehensive toolkit for
FASTA/Q processing. SeqKit provides executable binary files for all major
operating systems, including Windows, Linux, and Mac OS X, and can be directly
used without any dependencies or pre-configurations. SeqKit demonstrates
competitive performance in execution time and memory usage compared to similar
tools. The efficiency and usability of SeqKit enable researchers to rapidly
accomplish common FASTA/Q file manipulations.
Location and version:
$ which seqkit
/local/cluster/bin/seqkit
$ seqkit version
seqkit v0.16.1
Help message:
$ seqkit -h
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 0.16.1
Author: Wei Shen <shenwei356@gmail.com>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962
Usage:
seqkit [command]
Available Commands:
amplicon retrieve amplicon (or specific region around it) via primer(s)
bam monitoring and online histograms of BAM record features
common find common sequences of multiple files by id/name/sequence
concat concatenate sequences with same ID from multiple files
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
duplicate duplicate sequences N times
faidx create FASTA index file and extract subsequence
fish look for short sequences in larger sequences using local alignment
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (with length/GC content/GC skew)
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
help Help about any command
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
rmdup remove duplicated sequences by id/name/sequence
sample sample sequences by number or proportion
sana sanitize broken single line fastq files
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (revserse, complement, extract ID...)
shuffle shuffle sequences
sliding sliding sequences, circular genome supported
sort sort sequences by id/name/sequence/length
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
tab2fx convert tabular format to FASTA/Q format
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
version print version information and check for update
watch monitoring and online histograms of sequence features
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also set with environment variable SEQKIT_THREADS) (default 2)
Use "seqkit [command] --help" for more information about a command.
seqkit provides many standard sequence manipulation tools in one place.
While other tools provide similar features, only having to remember one
command makes this software convenient to use.
seqkit stats is useful for getting an overview of the type and length of
sequences in a file. seqkit grep is useful for extracting subsets of sequences
from your file. For example, seqkit grep -f <FILE_WITH_SUBSET_HEADERS> extracts
the sequences listed in a pattern file, one pattern per line. By default the
patterns are matched against sequence IDs; add -n to match against full
sequence names instead.
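A minimal sketch of the two commands described above, run on a small made-up input. The file names example.fasta, ids.txt, and subset.fasta and the three records are hypothetical and exist only for illustration. The seqkit calls are skipped if seqkit is not on your PATH.

```shell
# Create a tiny example FASTA file (hypothetical data, for illustration only).
cat > example.fasta <<'EOF'
>seq1 first example record
ACGTACGTAC
>seq2 second example record
GGGCCCAAAT
>seq3 third example record
TTTTAAAACC
EOF

# One pattern per line; by default seqkit grep matches these against record IDs.
printf 'seq1\nseq3\n' > ids.txt

if command -v seqkit >/dev/null 2>&1; then
    # Summarize the sequence type, record count, and length statistics.
    seqkit stats example.fasta
    # Keep only the records whose IDs are listed in ids.txt.
    seqkit grep -f ids.txt example.fasta -o subset.fasta
else
    echo "seqkit not found on PATH; skipping seqkit commands" >&2
fi
```

With seqkit available, subset.fasta should contain only the seq1 and seq3 records. Writing the result with -o rather than shell redirection also lets seqkit gzip the output when the file name ends in .gz, as noted in the help message.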
software ref: https://github.com/shenwei356/seqkit
software ref: https://bioinf.shenwei.me/seqkit/
research ref: https://doi.org/10.1371/journal.pone.0163962