canu
Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing (such as the PacBio RS II/Sequel or Oxford Nanopore MinION).
Canu is a hierarchical assembly pipeline which runs in four steps:
- Detect overlaps in high-noise sequences using MHAP
- Generate corrected sequence consensus
- Trim corrected sequences
- Assemble trimmed corrected sequences
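These steps can also be run one stage at a time with the stage flags listed in the help message (-correct, -trim, -assemble). The following is a minimal dry-run sketch, not a site-specific recipe. It assumes Nanopore reads in a hypothetical file reads.fastq.gz, a hypothetical prefix "asm" and directory "run1", and the intermediate file names <prefix>.correctedReads.fasta.gz and <prefix>.trimmedReads.fasta.gz that Canu 2.x is expected to write. Overlap detection happens inside each stage rather than as a separate command. The function (a hypothetical helper, not part of Canu) only prints the commands so they can be reviewed before running:

```shell
#!/bin/sh
# Print (do not run) one canu command per stage.
# The prefix, directory, genome size and read file are placeholders.
canu_stage_commands() {
    prefix=$1; dir=$2; size=$3; reads=$4
    # Steps 1-2: overlap the raw reads and generate corrected reads.
    echo "canu -correct -p $prefix -d $dir genomeSize=$size -nanopore $reads"
    # Step 3: trim the corrected reads (input name is an assumed output of -correct).
    echo "canu -trim -p $prefix -d $dir genomeSize=$size -corrected -nanopore $dir/$prefix.correctedReads.fasta.gz"
    # Step 4: assemble the trimmed, corrected reads (input name is an assumed output of -trim).
    echo "canu -assemble -p $prefix -d $dir genomeSize=$size -trimmed -corrected -nanopore $dir/$prefix.trimmedReads.fasta.gz"
}

canu_stage_commands asm run1 4.7m reads.fastq.gz
```

Running canu without a stage flag performs all stages in one go; the staged form is mainly useful for re-running assembly with different parameters on the same corrected reads.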
Location and version:
$ which canu
/local/cluster/canu/bin/canu
$ canu --version
canu 2.2
Help message:
$ canu --help
usage: canu [-version] [-citation] \
[-haplotype | -correct | -trim | -assemble | -trim-assemble] \
[-s <assembly-specifications-file>] \
-p <assembly-prefix> \
-d <assembly-directory> \
genomeSize=<number>[g|m|k] \
[other-options] \
[-haplotype{NAME} illumina.fastq.gz] \
[-corrected] \
[-trimmed] \
[-pacbio |
-nanopore |
-pacbio-hifi] file1 file2 ...
example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz
To restrict canu to only a specific stage, use:
-haplotype - generate haplotype-specific reads
-correct - generate corrected reads
-trim - generate trimmed reads
-assemble - generate an assembly
-trim-assemble - generate trimmed reads and then assemble them
The assembly is computed in the -d <assembly-directory>, with output files named
using the -p <assembly-prefix>. This directory is created if needed. It is not
possible to run multiple assemblies in the same directory.
The genome size should be your best guess of the haploid genome size of what is being
assembled. It is used primarily to estimate coverage in reads, NOT as the desired
assembly size. Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'
Some common options:
useGrid=string
- Run under grid control (true), locally (false), or set up for grid control
but don't submit any jobs (remote)
rawErrorRate=fraction-error
- The allowed difference in an overlap between two raw uncorrected reads. For lower
quality reads, use a higher number. The defaults are 0.300 for PacBio reads and
0.500 for Nanopore reads.
correctedErrorRate=fraction-error
- The allowed difference in an overlap between two corrected reads. Assemblies of
low coverage or data with biological differences will benefit from a slight increase
in this. Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
gridOptions=string
- Pass string to the command used to submit jobs to the grid. Can be used to set
maximum run time limits. Should NOT be used to set memory limits; Canu will do
that for you.
minReadLength=number
- Ignore reads shorter than 'number' bases long. Default: 1000.
minOverlapLength=number
- Ignore read-to-read overlaps shorter than 'number' bases long. Default: 500.
A full list of options can be printed with '-options'. All options can be supplied in
an optional sepc file with the -s option.
For TrioCanu, haplotypes are specified with the -haplotype{NAME} option, with any
number of haplotype-specific Illumina read files after. The {NAME} of each haplotype
is free text (but only letters and numbers, please). For example:
-haplotypeNANNY nanny/*gz
-haplotypeBILLY billy1.fasta.gz billy2.fasta.gz
Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.
Reads are specified by the technology they were generated with, and any processing performed.
[processing]
-corrected
-trimmed
[technology]
-pacbio <files>
-nanopore <files>
-pacbio-hifi <files>
Complete documentation at http://canu.readthedocs.org/en/latest/
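The help text notes that genomeSize accepts a suffix and fractional values ('4.7m' equals '4700k' equals '4700000'). The sketch below reproduces that arithmetic with a hypothetical helper, genome_size_to_bases, which is not part of Canu. It assumes k, m and g stand for 10^3, 10^6 and 10^9 bases, as the help example implies:

```shell
#!/bin/sh
# Hypothetical helper (not a canu command): expand a genomeSize value
# such as 4.7m or 4700k into a plain base count.
genome_size_to_bases() {
    echo "$1" | awk '{
        n = $1 + 0                            # leading numeric part of the value
        s = tolower(substr($1, length($1)))   # last character, the optional suffix
        if (s == "k") n *= 1e3
        else if (s == "m") n *= 1e6
        else if (s == "g") n *= 1e9
        printf "%.0f\n", n
    }'
}

genome_size_to_bases 4.7m
genome_size_to_bases 4700k
genome_size_to_bases 4700000
```

All three calls print 4700000, so any of the three spellings can be passed to genomeSize= for a 4.7 Mbp haploid genome.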
software ref: http://canu.readthedocs.org/en/latest/
software ref: https://github.com/marbl/canu
research ref: https://github.com/marbl/canu#citation