[Centrifuge] is a novel microbial classification engine that enables rapid,
accurate and sensitive labeling of reads and quantification of species on
desktop computers. The system uses a novel indexing scheme based on the
Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized
specifically for the metagenomic classification problem. Centrifuge requires
a relatively small index (4.7 GB for all complete bacterial and viral genomes
plus the human genome) and classifies sequences at very high speed, allowing it
to process the millions of reads from a typical high-throughput DNA sequencing
run within a few minutes. Together these advances enable timely and accurate
analysis of large metagenomics data sets on conventional desktop computers.
Location and version:
$ which centrifuge
/local/cluster/centrifuge/bin/centrifuge
$ centrifuge --version
/local/cluster/centrifuge/bin/centrifuge-class version 1.0.4
64-bit
Built on chrom1.cgrb.oregonstate.local
Mon Aug 16 14:41:32 PDT 2021
Compiler: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
$ echo $CENTRIFUGE_DB
/nfs1/CGRB/databases/centrifuge/latest
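To see which prebuilt indexes are available there, list the directory (its contents are cluster-specific; as the usage message below notes, an index is a set of files sharing a prefix and a trailing .X.cf suffix, e.g. hpvc.1.cf, hpvc.2.cf, ...):

$ ls $CENTRIFUGE_DB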
Help message:
$ centrifuge --help
Centrifuge version 1.0.4 by the Centrifuge developer team (centrifuge.metagenomics@gmail.com)
Usage:
centrifuge [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sample-sheet <s> } [-S <filename>] [--report-file <report>]
<cf-idx> Index filename prefix (minus trailing .X.cf).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<s> A TSV file where each line represents a sample.
<filename> File for classification output (default: stdout)
<report> File for tabular report output (default: centrifuge_report.tsv)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
Classification:
--min-hitlen <int> minimum length of partial hits (default 22, must be greater than 15)
-k <int> report upto <int> distinct, primary assignments for each read or pair
--host-taxids <taxids> comma-separated list of taxonomic IDs that will be preferred in classification
--exclude-taxids <taxids> comma-separated list of taxonomic IDs that will be excluded in classification
Output:
--out-fmt <str> define output format, either 'tab' or 'sam' (tab)
--tab-fmt-cols <str> columns in tabular format, comma separated
default: readID,seqID,taxID,score,2ndBestScore,hitLength,queryLength,numMatches
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
Performance:
-p/--threads <int> number of alignment threads to launch (1)
--mm use memory-mapped I/O for index; many instances can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--version print version information and quit
-h/--help print this usage message
Use centrifuge -x $CENTRIFUGE_DB/hpvc ... to take advantage of the precompiled human + prokaryote + virus + SARS-CoV-2 (hpvc) database.
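For a single paired-end sample, a minimal invocation using only the options documented above looks like the following sketch (the read file names are hypothetical):

centrifuge -x $CENTRIFUGE_DB/hpvc \
    -1 sample1_R1.fq.gz -2 sample1_R2.fq.gz \
    --report-file sample1.report.tsv \
    -p 8 --mm > sample1.centrifuge.tsv

Per-read classifications go to stdout (redirected to sample1.centrifuge.tsv here), and the per-taxon abundance summary goes to the --report-file. The example code below simply generates one such command per sample pair.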
Example code:
#!/bin/bash
# run_centrifuge.sh
# Expects paired reads in $PWD named *_R1.fq.gz / *_R2.fq.gz;
# prints one centrifuge command per sample pair, to be piped to SGE_Array.
db=$CENTRIFUGE_DB/hpvc
suffix=fq.gz
for r1 in *_R1.${suffix}; do
    r2=${r1/_R1/_R2}
    base=${r1/_R1.${suffix}/}
    echo "centrifuge -x $db -1 $r1 -2 $r2 --report-file ${base}.report.tsv -p 8 --mm > ${base}.centrifuge.tsv"
done
Run through SGE_Array:
bash ./run_centrifuge.sh > cmds.txt
cat cmds.txt | SGE_Array -q <QUEUE NAME> -P 8 -b 8
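Once the array jobs finish, a quick sanity check on one sample's classification output counts assignments per taxID; this assumes the default tabular columns shown in the help above (readID,seqID,taxID,score,...), and the file name follows the ${base}.centrifuge.tsv naming from the script:

# skip the header line, then tally column 3 (taxID)
tail -n +2 sample1.centrifuge.tsv | cut -f3 | sort | uniq -c | sort -rn | head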
Then, to generate reports compatible with Pavian:
#!/bin/bash
# generate_kreports.sh
# Converts each Centrifuge tabular output into a Kraken-style report
# that Pavian can read.
for tsv in *centrifuge.tsv; do
    base=$(basename "$tsv")
    out=${base/.tsv/.kreport}
    centrifuge-kreport -x $CENTRIFUGE_DB/hpvc "$tsv" > "$out"
done
Run through SGE_Batch:
SGE_Batch -c 'bash ./generate_kreports.sh' -r sge.generate_kreport -q QUEUE
Prepare a directory and run Pavian:
mkdir kreports
cd kreports
ln -s ../*.kreport .
SGE_Batch -c 'start_pavian' -r sge.pavian -q QUEUE@node
It’s best to specify the QUEUE@node form so that you know which node Pavian is running on. If you don’t require a particular node on a queue, you can omit the @node part.

Then, check the standard output file in the sge.pavian folder for directions on how to start the SSH tunnel and the address to use in your local web browser.
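The tunnel command will be along these lines; the node, port, user, and host below are placeholders, so substitute the exact values printed in the sge.pavian output:

# on your local machine: forward a local port to Pavian on the compute node
ssh -L 5000:compute-node:5000 username@cluster.example.edu
# then open http://localhost:5000 in your local web browser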
software ref: https://github.com/DaehwanKimLab/centrifuge
software ref: http://www.ccb.jhu.edu/software/centrifuge
research ref: https://doi.org/10.1101/gr.210641.116
research ref: http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf