$ braker.pl -help
DESCRIPTION
braker.pl Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
RNA-Seq and/or proteins
SYNOPSIS
braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}
INPUT FILE OPTIONS
--genome=genome.fa fasta file with DNA sequences
--bam=rnaseq.bam bam file with spliced alignments from
RNA-Seq
--prot_seq=prot.fa A protein sequence file in multi-fasta
format used to generate protein hints.
Unless otherwise specified, braker.pl will
run in "EP mode" which uses ProtHint to
generate protein hints and GeneMark-EP+ to
train AUGUSTUS.
--hints=hints.gff Alternatively to calling braker.pl with a
bam or protein fasta file, it is possible to
call it with a .gff file that contains
introns extracted from RNA-Seq and/or
protein hints (most frequently coming
from ProtHint). If you wish to use the
ProtHint hints, use its
"prothint_augustus.gff" output file.
This flag also allows the usage of hints
from additional extrinsic sources for gene
prediction with AUGUSTUS. To consider such
additional extrinsic information, you need
to use the flag --extrinsicCfgFiles to
specify parameters for all sources in the
hints file (including the source "E" for
intron hints from RNA-Seq)
--prot_aln=prot.aln Alignment file generated from aligning
protein sequences against the genome with
either Exonerate (--prg=exonerate), or
Spaln (--prg=spaln), or GenomeThreader
(--prg=gth). This option can be used as
an alternative to --prot_seq file or protein
hints in the --hints file.
To prepare alignment file, run Spaln2 with
the following command:
spaln -O0 ... > spalnfile
To prepare alignment file, run Exonerate
with the following command:
exonerate --model protein2genome \
--showtargetgff T ... > exfile
To prepare alignment file, run
GenomeThreader with the following command:
gth -genomic genome.fa -protein \
protein.fa -gff3out \
-skipalignmentout ... -o gthfile
A valid option --prg=... must be specified
in combination with --prot_aln. The generating
tool will not be guessed.
Currently, hints from protein alignment
files are only used in the prediction step
with AUGUSTUS.
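The rule that --prot_aln must always be paired with an explicit --prg value can be sketched as a small shell helper. This is only an illustration: the function name prot_aln_cmd, the file names, and the added --bam input are assumptions, and the helper prints the command instead of running braker.pl.

```shell
# Hypothetical helper: print (do not run) a braker.pl call that uses a
# protein alignment file. The accepted --prg values (gth, exonerate, spaln)
# are the ones listed in the help text; everything else is made up.
prot_aln_cmd() {
    aln_file=$1
    tool=$2
    case "$tool" in
        gth|exonerate|spaln) ;;
        *) echo "unsupported aligner: $tool" >&2; return 1 ;;
    esac
    # --bam is added here as an assumption so that training has RNA-Seq input.
    printf 'braker.pl --genome=genome.fa --bam=rnaseq.bam --prot_aln=%s --prg=%s\n' \
        "$aln_file" "$tool"
}

prot_aln_cmd gthfile gth
```

Rejecting unknown tool names up front mirrors the statement that the generating tool will not be guessed.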
FREQUENTLY USED OPTIONS
--species=sname Species name. Existing species will not be
overwritten. Uses Sp_1 etc., if no species
is assigned
--AUGUSTUS_ab_initio output ab initio predictions by AUGUSTUS
in addition to predictions with hints by
AUGUSTUS
--softmasking Softmasking option for soft masked genome
files. (Disabled by default.)
--esmode Run GeneMark-ES (genome sequence only) and
train AUGUSTUS on long genes predicted by
GeneMark-ES. Final predictions are ab initio
--epmode Run ProtHint to generate protein hints (if
not already specified with --hints option)
and use the hints in GeneMark-EP+ to create
a training set for AUGUSTUS.
--etpmode Use RNA-Seq and protein hints in GeneMark-ETP+
to create a training set for AUGUSTUS. The
protein hints are generated by ProtHint (see
--epmode).
--gff3 Output in GFF3 format (default is gtf
format)
--cores Specifies the maximum number of cores that
can be used during computation. Be aware:
optimize_augustus.pl will use at most 8
cores; augustus will use at most as many
cores as there are contigs in the --genome file.
--workingdir=/path/to/wd/ Set path to working directory. In the
working directory results and temporary
files are stored
--nice Execute all system calls within braker.pl
and its submodules with bash "nice"
(default nice value)
--alternatives-from-evidence=true Output alternative transcripts based on
explicit evidence from hints (default is
true).
--fungus GeneMark-EX option: run algorithm with
branch point model (most useful for fungal
genomes)
--crf Execute CRF training for AUGUSTUS;
resulting parameters are only kept for
final predictions if they show higher
accuracy than HMM parameters.
--keepCrf keep CRF parameters even if they are not
better than HMM parameters
--UTR=on create UTR training examples from RNA-Seq
coverage data; requires options
--bam=rnaseq.bam and --softmasking.
Alternatively, if UTR parameters already
exist, training step will be skipped and
those pre-existing parameters are used.
--addUTR=on Adds UTRs from RNA-Seq coverage data to
augustus.hints.gtf file. Does not perform
training of AUGUSTUS or gene prediction with
AUGUSTUS and UTR parameters.
--prg=gth|exonerate|spaln Specify an alternative method for generating
hints from similarity of protein sequence
data to genome data (alternative to the
default --epmode/--etpmode in which ProtHint
is used to generate the protein hints).
Available methods are: gth (GenomeThreader),
exonerate (Exonerate), or spaln (Spaln2).
Note that this option is suitable only for
proteins of closely related species (while
the --epmode is generally applicable).
This option is required in case --prot_aln
option is used.
--gth2traingenes Generate training gene structures for
AUGUSTUS from GenomeThreader alignments.
(These genes can either be used for
training AUGUSTUS alone with
--trainFromGth; or in addition to
GeneMark-ET training genes if also a
bam-file is supplied.)
--trainFromGth No GeneMark-Training, train AUGUSTUS from
GenomeThreader alignments
--makehub Create track data hub with make_hub.py
for visualizing BRAKER results with the
UCSC GenomeBrowser
--email E-mail address for creating track data hub
--version Print version number of braker.pl
--help Print this help message
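The note under --cores implies that requesting more cores than the genome has sequences brings no further speed-up for the AUGUSTUS step. A minimal sketch of that arithmetic follows, assuming that each FASTA header line (starting with ">") counts as one contig; the function names are hypothetical and braker.pl itself is not called.

```shell
# Count sequences in a FASTA file (lines starting with ">").
# "|| true" keeps the exit status clean when grep finds no header at all.
count_contigs() {
    grep -c '^>' "$1" || true
}

# Cores AUGUSTUS can actually use: min(requested cores, number of contigs).
effective_augustus_cores() {
    requested=$1
    n_contigs=$(count_contigs "$2")
    if [ "$requested" -lt "$n_contigs" ]; then
        echo "$requested"
    else
        echo "$n_contigs"
    fi
}

# Demo on a throwaway three-contig genome.
demo_fa=$(mktemp)
printf '>chr1\nACGT\n>chr2\nACGT\n>chr3\nACGT\n' > "$demo_fa"
effective_augustus_cores 16 "$demo_fa"   # prints 3
effective_augustus_cores 2 "$demo_fa"    # prints 2
rm -f "$demo_fa"
```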
CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)
--AUGUSTUS_CONFIG_PATH=/path/ Set path to config directory of AUGUSTUS
(if not specified as environment
variable). BRAKER1 will assume that the
directories ../bin and ../scripts of
AUGUSTUS are located relative to the
AUGUSTUS_CONFIG_PATH. If this is not the
case, please specify AUGUSTUS_BIN_PATH
(and AUGUSTUS_SCRIPTS_PATH if required).
The braker.pl commandline argument
--AUGUSTUS_CONFIG_PATH has higher priority
than the environment variable with the
same name.
--AUGUSTUS_BIN_PATH=/path/ Set path to the AUGUSTUS directory that
contains binaries, i.e. augustus and
etraining. This variable must only be set
if AUGUSTUS_CONFIG_PATH does not have
../bin and ../scripts of AUGUSTUS relative
to its location i.e. for global AUGUSTUS
installations. BRAKER1 will assume that
the directory ../scripts of AUGUSTUS is
located relative to the AUGUSTUS_BIN_PATH.
If this is not the case, please specify
--AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/ Set path to AUGUSTUS directory that
contains scripts, i.e. splitMfasta.pl.
This variable must only be set if
AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH
do not contain the ../scripts directory
of AUGUSTUS relative to their location,
i.e. for special cases of a global
AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/ Set path to bamtools (if not specified as
environment BAMTOOLS_PATH variable). Has
higher priority than the environment
variable.
--GENEMARK_PATH=/path/to/ Set path to GeneMark-ET (if not specified
as environment GENEMARK_PATH variable).
Has higher priority than environment
variable.
--SAMTOOLS_PATH=/path/to/ Optionally set path to samtools (if not
specified as environment SAMTOOLS_PATH
variable) to fix BAM files automatically,
if necessary. Has higher priority than
environment variable.
--PROTHINT_PATH=/path/to/ Set path to the directory with prothint.py.
(if not specified as PROTHINT_PATH
environment variable). Has higher priority
than environment variable.
--ALIGNMENT_TOOL_PATH=/path/to/tool Set path to alignment tool
(GenomeThreader, Spaln, or Exonerate)
if not specified as environment
ALIGNMENT_TOOL_PATH variable. Has higher
priority than environment variable.
--DIAMOND_PATH=/path/to/diamond Set path to diamond, this is an alternative
to NCBI BLAST; you only need to specify one
out of DIAMOND_PATH or BLAST_PATH, not both.
DIAMOND is a lot faster than BLAST and yields
highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall Set path to NCBI blastall and formatdb
executables if not specified as
environment variable. Has higher priority
than environment variable.
--PYTHON3_PATH=/path/to Set path to python3 executable (if not
specified as environment variable and if
executable is not in your $PATH).
--JAVA_PATH=/path/to Set path to java executable (if not
specified as environment variable and if
executable is not in your $PATH), only
required with flags --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to Set path to gushr.py executable (if not
specified as an environment variable and if
executable is not in your $PATH), only required
with the flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to Set path to make_hub.py (if option --makehub
is used).
--CDBTOOLS_PATH=/path/to cdbfasta/cdbyank are required for running
fix_in_frame_stop_codon_genes.py. Usage of
that script can be skipped with option
'--skip_fixing_broken_genes'.
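Several of the path options state that the command-line argument has higher priority than the environment variable of the same name. The precedence can be sketched as follows; resolve_tool_path and the directories are hypothetical, and this does not reproduce braker.pl's own lookup code, which may also search other locations.

```shell
# resolve_tool_path CLI_VALUE ENV_VALUE
# Prints the path that would take effect: a non-empty command-line value
# wins, otherwise the environment value is used; fails if neither is set.
resolve_tool_path() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    else
        echo "no path configured" >&2
        return 1
    fi
}

# Example with made-up directories.
GENEMARK_PATH=/opt/genemark_env
resolve_tool_path "/opt/genemark_cli" "$GENEMARK_PATH"   # prints /opt/genemark_cli
resolve_tool_path "" "$GENEMARK_PATH"                    # prints /opt/genemark_env
```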
EXPERT OPTIONS
--augustus_args="--some_arg=bla" One or several command line arguments to
be passed to AUGUSTUS, if several
arguments are given, separate them by
whitespace, i.e.
"--first_arg=sth --second_arg=sth".
--skipGeneMark-ES Skip GeneMark-ES and use provided
GeneMark-ES output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-ET Skip GeneMark-ET and use provided
GeneMark-ET output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-EP Skip GeneMark-EP and use provided
GeneMark-EP output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP Skip GeneMark-ETP and use provided
GeneMark-ETP output (e.g. provided with
--geneMarkGtf=genemark.gtf)
--geneMarkGtf=file.gtf If skipGeneMark-ET is used, braker will by
default look in the working directory in
folder GeneMarkET for an already existing
gtf file. Instead, you may provide such a
file from another location. If geneMarkGtf
option is set, skipGeneMark-ES/ET/EP/ETP is
automatically also set. Note that gene and
transcript ids in the final output may not
match the ids in the input genemark.gtf
because BRAKER internally re-assigns these
ids.
--rounds The number of optimization rounds used in
optimize_augustus.pl (default 5)
--skipAllTraining Skip GeneMark-EX (training and
prediction), skip AUGUSTUS training, only
runs AUGUSTUS with pre-trained and already
existing parameters (not recommended).
Hints from input are still generated.
This option automatically sets
--useexisting to true.
--useexisting Use the present config and parameter files
if they exist for 'species'; will overwrite
original parameters if BRAKER performs
an AUGUSTUS training.
--filterOutShort It may happen that a "good" training gene,
i.e. one that has intron support from
RNA-Seq in all introns predicted by
GeneMark-EX, is in fact too short. This flag
will discard such genes that have
supported introns and a neighboring
RNA-Seq supported intron upstream of the
start codon within the range of the
maximum CDS size of that gene and with a
multiplicity that is at least as high as
20% of the average intron multiplicity of
that gene.
--skipOptimize Skip optimize parameter step (not
recommended).
--skipIterativePrediction Skip iterative prediction in --epmode (does
not affect other modes, saves a bit of runtime)
--skipGetAnnoFromFasta Skip calling the python3 script
getAnnoFastaFromJoingenes.py from the
AUGUSTUS tool suite. This script requires
python3, biopython and re (regular
expressions) to be installed. It produces
coding sequence and protein FASTA files
from AUGUSTUS gene predictions and provides
information about genes with in-frame stop
codons. If you enable this flag, these files
will not be produced and python3 and
the required modules will not be necessary
for running braker.pl.
--skip_fixing_broken_genes If you do not have python3, you can choose
to skip the fixing of genes that contain
in-frame stop codons (not recommended).
--eval=reference.gtf Reference set to evaluate predictions
against (using evaluation scripts from GaTech)
--eval_pseudo=pseudo.gff3 File with pseudogenes that will be excluded
from accuracy evaluation (may be empty file)
--AUGUSTUS_hints_preds=s File with AUGUSTUS hints predictions; will
use this file as basis for UTR training;
only UTR training and prediction is
performed if this option is given.
--flanking_DNA=n Size of flanking region, must only be
specified if --AUGUSTUS_hints_preds is given
(for UTR training in a separate braker.pl
run that builds on top of an existing run)
--verbosity=n 0 -> run braker.pl quiet (no log)
1 -> only log warnings
2 -> also log configuration
3 -> log all major steps
4 -> very verbose, log also small steps
--downsampling_lambda=d The distribution of introns in training
gene structures generated by GeneMark-EX
has a huge weight on single-exon and
few-exon genes. Specifying the lambda
parameter of a Poisson distribution will
make braker call a script for downsampling
of training gene structures according to
the distribution of their intron counts, i.e.
genes with no or few introns will be
downsampled, genes with many introns will be
kept. Default value is 2.
If you want to avoid downsampling, you have
to specify 0.
--checkSoftware Only check whether all required software
is installed, no execution of BRAKER
--nocleanup Skip deletion of all files that are typically not
used in an annotation project after
running braker.pl. (For tracking any
problems with a braker.pl run, you
might want to keep these files, therefore
nocleanup can be activated.)
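The 20% multiplicity threshold described under --filterOutShort can be illustrated with integer arithmetic. This sketch covers only the multiplicity comparison; the distance condition (the upstream intron lying within the maximum CDS size of the gene) is assumed to have been checked already, and the function name and numbers are hypothetical.

```shell
# flag_short_gene UPSTREAM_MULT GENE_INTRON_MULT...
# Succeeds (gene would be discarded as possibly too short) when the upstream
# intron's multiplicity is at least 20% of the gene's average intron
# multiplicity.
flag_short_gene() {
    upstream=$1
    shift
    n=$#
    [ "$n" -gt 0 ] || return 1
    sum=0
    for m in "$@"; do
        sum=$((sum + m))
    done
    # upstream >= 0.2 * sum / n  is equivalent to  5 * upstream * n >= sum
    [ $((5 * upstream * n)) -ge "$sum" ]
}

# Gene introns with multiplicities 40, 50, 60 (average 50, threshold 10).
if flag_short_gene 10 40 50 60; then echo "discard"; else echo "keep"; fi   # prints discard
if flag_short_gene 9 40 50 60; then echo "discard"; else echo "keep"; fi    # prints keep
```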
DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)
--splice_sites=patterns list of splice site patterns for UTR
prediction; default: GTAG, extend like this:
--splice_sites=GTAG,ATAC,...
this option only affects UTR training
example generation, not gene prediction
by AUGUSTUS
--overwrite Overwrite existing files (except for
species parameter files) Beware, currently
not implemented properly!
--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl
is executed, it may require one or several
extrinsicCfgFiles. Don't use this option
unless you know what you are doing!
--stranded=+,-,+,-,... If UTRs are trained, i.e. strand-specific
bam-files are supplied and coverage
information is extracted for gene prediction,
create stranded ep hints. The order of
strand specifications must correspond to the
order of bam files. Possible values are
+, -, .
If stranded data is provided, ONLY coverage
data from the stranded data is used to
generate UTR examples! Coverage data from
unstranded data is used in the prediction
step, only.
The stranded label is applied to coverage
data, only. Intron hints are generated
from all libraries treated as "unstranded"
(because splice site filtering eliminates
intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg Optional custom config file for AUGUSTUS
for running PPX (currently not
implemented)
--grass Switch this flag on if you are using braker.pl
for predicting genes in grasses with
GeneMark-EX. The flag will enable
GeneMark-EX to handle GC heterogeneity
within genes more properly.
NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa Transmasked genome FASTA file for GeneMark-EX
(to be used instead of the regular genome
FASTA file).
--min_contig=INT Minimal contig length for GeneMark-EX, could
for example be set to 10000 if transmasked_fasta
option is used because transmasking might
introduce many very short contigs.
--translation_table=INT Change translation table from non-standard
to something else.
DOES NOT WORK YET BECAUSE BRAKER DOESN'T
SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!
--gc_probability=DECIMAL Probability for donor splice site pattern GC
for gene prediction with GeneMark-EX,
default value is 0.001
--gm_max_intergenic=INT Adjust maximum allowed size of intergenic
regions in GeneMark-EX. If not used, the value
is automatically determined by GeneMark-EX.
EXAMPLE
To run with RNA-Seq
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--hints=rnaseq.gff
To run with protein sequences
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
--hints=prothint_augustus.gff
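The examples above cover RNA-Seq and protein input separately. How the available evidence maps to the mode flags described earlier (--etpmode, --epmode, --esmode) can be sketched with a helper that only prints the resulting command. The function name braker_example_cmd is hypothetical, and the file names follow the placeholders used in this help text.

```shell
# braker_example_cmd BAM PROT
# Pass an empty string for evidence that is not available.
braker_example_cmd() {
    bam=$1
    prot=$2
    cmd="braker.pl --genome=genome.fa --species=speciesname"
    if [ -n "$bam" ] && [ -n "$prot" ]; then
        # RNA-Seq plus proteins: GeneMark-ETP+ training
        cmd="$cmd --bam=$bam --prot_seq=$prot --etpmode"
    elif [ -n "$bam" ]; then
        # RNA-Seq only: no mode flag needed
        cmd="$cmd --bam=$bam"
    elif [ -n "$prot" ]; then
        # Proteins only: EP mode (stated explicitly here for clarity)
        cmd="$cmd --prot_seq=$prot --epmode"
    else
        # Genome only: GeneMark-ES, final predictions are ab initio
        cmd="$cmd --esmode"
    fi
    echo "$cmd"
}

braker_example_cmd rnaseq.bam proteins.fa
braker_example_cmd "" ""
```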