Braker2 Genomethreader

Conda
See the ‘Configuring the conda environment’ section below to access this software.
Configuration required
See the relevant section below to configure this software before use.

braker2 with genomethreader

The CQLS has installed braker2 and genomethreader into a conda environment for you to use.

In order for it to work, you will have to:

  1. Copy the augustus config directory using the script below
  2. Activate the conda environment

You will be able to use the updated braker2, augustus, and genomethreader from within the conda environment.

Future updates will be for braker3 in a singularity image. More details to come.

The BRAKER documentation starts now.

What is BRAKER?

The rapidly growing number of sequenced genomes requires fully automated methods for accurate gene structure annotation. With this goal in mind, we have developed BRAKER1, a combination of GeneMark-ET and AUGUSTUS, that uses genomic and RNA-Seq data to automatically generate full gene structure annotations in novel genomes.

However, the quality of RNA-Seq data that is available for annotating a novel genome is variable, and in some cases, RNA-Seq data is not available at all.

BRAKER2 is an extension of BRAKER1 which allows for fully automated training of the gene prediction tools GeneMark-EX and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.

In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.

In this user guide, we will refer to BRAKER1 and BRAKER2 simply as BRAKER because they are executed by the same script (braker.pl).

Keys to successful gene prediction

  • Use a high quality genome assembly. If you have a huge number of very short scaffolds in your genome assembly, those short scaffolds will likely increase runtime dramatically but will not increase prediction accuracy.

  • Use simple scaffold names in the genome file (e.g. >contig1 will work better than >contig1my custom species namesome putative function /more/information/  and lots of special characters %&!*(){}). Make the scaffold names in all your fasta files simple before running any alignment program.

  • In order to predict genes accurately in a novel genome, the genome should be masked for repeats. This will avoid the prediction of false positive gene structures in repetitive and low complexity regions. Repeat masking is also essential for mapping RNA-Seq data to a genome with some tools (other RNA-Seq mappers, such as HISAT2, ignore masking information). In case of GeneMark-EX and AUGUSTUS, softmasking (i.e. putting repeat regions into lower case letters and all other regions into upper case letters) leads to better results than hardmasking (i.e. replacing letters in repetitive regions by the letter N for unknown nucleotide). If the genome is softmasked, use the --softmasking flag of braker.pl.

  • Many genomes have gene structures that will be predicted accurately with standard parameters of GeneMark-EX and AUGUSTUS within BRAKER. However, some genomes have clade-specific features, e.g. a special branch point model in fungi, or non-standard splice-site patterns. Please read the options listed in the help message section below in order to determine whether any of the custom options may improve gene prediction accuracy in the genome of your target species.

  • Always check gene prediction results before further usage! You can e.g. use a genome browser for visual inspection of gene models in context with extrinsic evidence data. BRAKER supports the generation of track data hubs for the UCSC Genome Browser with MakeHub for this purpose.
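The advice on scaffold names and softmasking can be checked before a run. The following is a minimal sketch, not part of BRAKER: the helper names simplify_fasta_headers and softmasked_percent are hypothetical, and the renaming rule (keep the first word of each header, replace unusual characters with underscores) is only one possible convention.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not BRAKER tools.

# Keep only the first whitespace-delimited word of each FASTA header and
# replace any character outside [A-Za-z0-9_.-] with an underscore.
simplify_fasta_headers() {
    awk '/^>/ { name = substr($1, 2)
                gsub(/[^A-Za-z0-9_.-]/, "_", name)
                print ">" name
                next }
         { print }' "$1"
}

# Print the percentage of lower-case (softmasked) bases in a FASTA file.
softmasked_percent() {
    awk '!/^>/ { total += length($0); lower += gsub(/[acgtn]/, "") }
         END   { if (total > 0) printf "%.1f\n", 100 * lower / total
                 else print "0.0" }' "$1"
}
```

For example, simplify_fasta_headers genome.fa > genome.simple.fa writes a renamed copy, and softmasked_percent genome.simple.fa reports how much of it is in lower case. Shortening headers can produce duplicate names, so it is worth confirming that the new names are still unique, and applying the same renaming to every file that refers to the scaffolds.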


Configuring the conda environment

In order to use augustus with this software, run the /local/cluster/conda/setup_braker2-gth_config.sh script and provide a destination path for the copy of the augustus config directory. Choose a location where you have write permission, because BRAKER needs to write to this directory.

Then, you can run the command printed to the screen to activate the environment, or check out a node with qrsh and run:

bash
source ~/activate_braker2-gth.sh

To use over SGE, include the above source command in your shell script prior to your braker2 commands.
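A job script along these lines might look as follows. This is a sketch under assumptions: the input file names, the species name, the job name, and the output directory are placeholders, and the options for requesting cores depend on how the cluster's parallel environments are configured, so none are shown. The snippet writes the job script to braker_job.sh and only checks its syntax; it does not run braker2.

```shell
#!/usr/bin/env bash
# Sketch: write an SGE job script for braker2 and syntax-check it.
# File names, the species name, and the job name are placeholders.
cat > braker_job.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -N braker2_gth
#$ -S /bin/bash

# Activate the conda environment before any braker2 command.
source ~/activate_braker2-gth.sh || exit 1
set -eo pipefail

mkdir -p braker_out
# NSLOTS is set by SGE when a parallel environment is requested;
# fall back to a single core otherwise.
braker.pl --genome=genome.fa --species=my_species \
    --bam=rnaseq.bam --softmasking \
    --cores="${NSLOTS:-1}" --workingdir=braker_out
EOF

# Parse the script without executing it.
bash -n braker_job.sh && echo "braker_job.sh: syntax OK"
```

After adjusting the placeholders, the script can be submitted with qsub braker_job.sh. Drop --softmasking if the genome is not softmasked.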

Location and version

$ source ~/activate_braker2-gth.sh
$ which augustus
/local/cluster/braker2-gth/bin/augustus
$ augustus --version
AUGUSTUS (3.4.0) is a gene prediction tool.
Sources and documentation at https://github.com/Gaius-Augustus/Augustus
$ which braker.pl
/local/cluster/braker2-gth/BRAKER-2.1.6/scripts/braker.pl
$ braker.pl --version
braker.pl version 2.1.6
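A quick sanity check after activation can catch a missing environment or an unwritable config directory before a long run. The sketch below uses two hypothetical helpers, missing_tools and config_dir_ok. It assumes that the activation script exports AUGUSTUS_CONFIG_PATH and that the GenomeThreader executable is called gth, as in the help message below; if your setup differs, adjust the names.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands are missing from $PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Sketch: succeed only if the directory in $1 looks like a writable
# AUGUSTUS config directory (it must contain a species/ subdirectory).
config_dir_ok() {
    [ -n "${1:-}" ] && [ -d "$1/species" ] && [ -w "$1/species" ]
}

missing=$(missing_tools augustus braker.pl gth)
if [ -n "$missing" ]; then
    echo "Not on PATH (is the environment activated?):" $missing
fi

# Assumption: the activation script exports AUGUSTUS_CONFIG_PATH.
if ! config_dir_ok "${AUGUSTUS_CONFIG_PATH:-}"; then
    echo "AUGUSTUS_CONFIG_PATH is unset or not writable: ${AUGUSTUS_CONFIG_PATH:-<unset>}"
fi
```

The help message also lists a --checkSoftware option, which asks braker.pl itself to check whether all required software is installed.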

help message

$ braker.pl -help

DESCRIPTION

braker.pl   Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
            RNA-Seq and/or proteins

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

INPUT FILE OPTIONS

--genome=genome.fa                  fasta file with DNA sequences
--bam=rnaseq.bam                    bam file with spliced alignments from
                                    RNA-Seq
--prot_seq=prot.fa                  A protein sequence file in multi-fasta
                                    format used to generate protein hints.
                                    Unless otherwise specified, braker.pl will
                                    run in "EP mode" which uses ProtHint to
                                    generate protein hints and GeneMark-EP+ to
                                    train AUGUSTUS.
--hints=hints.gff                   Alternatively to calling braker.pl with a
                                    bam or protein fasta file, it is possible to
                                    call it with a .gff file that contains
                                    introns extracted from RNA-Seq and/or
                                    protein hints (most frequently coming
                                    from ProtHint). If you wish to use the
                                    ProtHint hints, use its
                                    "prothint_augustus.gff" output file.
                                    This flag also allows the usage of hints
                                    from additional extrinsic sources for gene
                                    prediction with AUGUSTUS. To consider such
                                    additional extrinsic information, you need
                                    to use the flag --extrinsicCfgFiles to
                                    specify parameters for all sources in the
                                    hints file (including the source "E" for
                                    intron hints from RNA-Seq)
--prot_aln=prot.aln                 Alignment file generated from aligning
                                    protein sequences against the genome with
                                    either Exonerate (--prg=exonerate), or
                                    Spaln (--prg=spaln), or GenomeThreader
                                    (--prg=gth). This option can be used as
                                    an alternative to --prot_seq file or protein
                                    hints in the --hints file.
                                    To prepare alignment file, run Spaln2 with
                                    the following command:
                                    spaln -O0 ... > spalnfile
                                    To prepare alignment file, run Exonerate
                                    with the following command:
                                    exonerate --model protein2genome \
                                        --showtargetgff T ... > exfile
                                    To prepare alignment file, run
                                    GenomeThreader with the following command:
                                    gth -genomic genome.fa  -protein \
                                        protein.fa -gff3out \
                                        -skipalignmentout ... -o gthfile
                                    A valid option prg=... must be specified
                                    in combination with --prot_aln. Generating
                                    tool will not be guessed.
                                    Currently, hints from protein alignment
                                    files are only used in the prediction step
                                    with AUGUSTUS.

FREQUENTLY USED OPTIONS

--species=sname                     Species name. Existing species will not be
                                    overwritten. Uses Sp_1 etc., if no species
                                    is assigned
--AUGUSTUS_ab_initio                output ab initio predictions by AUGUSTUS
                                    in addition to predictions with hints by
                                    AUGUSTUS
--softmasking                       Softmasking option for soft masked genome
                                    files. (Disabled by default.)
--esmode                            Run GeneMark-ES (genome sequence only) and
                                    train AUGUSTUS on long genes predicted by
                                    GeneMark-ES. Final predictions are ab initio
--epmode                            Run ProtHint to generate protein hints (if
                                    not already specified with --hints option)
                                    and use the hints in GeneMark-EP+ to create
                                    a training set for AUGUSTUS.
--etpmode                           Use RNA-Seq and protein hints in GeneMark-ETP+
                                    to create a training set for AUGUSTUS. The
                                    protein hints are generated by ProtHint (see
                                    --epmode).
--gff3                              Output in GFF3 format (default is gtf
                                    format)
--cores                             Specifies the maximum number of cores that
                                    can be used during computation. Be aware:
                                    optimize_augustus.pl will use max. 8
                                    cores; augustus will use max. nContigs in
                                    --genome=file cores.
--workingdir=/path/to/wd/           Set path to working directory. In the
                                    working directory results and temporary
                                    files are stored
--nice                              Execute all system calls within braker.pl
                                    and its submodules with bash "nice"
                                    (default nice value)

--alternatives-from-evidence=true   Output alternative transcripts based on
                                    explicit evidence from hints (default is
                                    true).
--fungus                            GeneMark-EX option: run algorithm with
                                    branch point model (most useful for fungal
                                    genomes)
--crf                               Execute CRF training for AUGUSTUS;
                                    resulting parameters are only kept for
                                    final predictions if they show higher
                                    accuracy than HMM parameters.
--keepCrf                           keep CRF parameters even if they are not
                                    better than HMM parameters
--UTR=on                            create UTR training examples from RNA-Seq
                                    coverage data; requires options
                                    --bam=rnaseq.bam and --softmasking.
                                    Alternatively, if UTR parameters already
                                    exist, training step will be skipped and
                                    those pre-existing parameters are used.
--addUTR=on                         Adds UTRs from RNA-Seq coverage data to
                                    augustus.hints.gtf file. Does not perform
                                    training of AUGUSTUS or gene prediction with
                                    AUGUSTUS and UTR parameters.
--prg=gth|exonerate|spaln           Specify an alternative method for generating
                                    hints from similarity of protein sequence
                                    data to genome data (alternative to the
                                    default --epmode/--etpmode in which ProtHint
                                    is used to generate the protein hints).
                                    Available methods  are: gth (GenomeThreader),
                                    exonerate (Exonerate), or spaln (Spaln2).
                                    Note that this option is suitable only for
                                    proteins of closely related species (while
                                    the --epmode is generally applicable).
                                    This option is required in case --prot_aln
                                    option is used.
--gth2traingenes                    Generate training gene structures for
                                    AUGUSTUS from GenomeThreader alignments.
                                    (These genes can either be used for
                                    training AUGUSTUS alone with
                                    --trainFromGth; or in addition to
                                    GeneMark-ET training genes if also a
                                    bam-file is supplied.)
--trainFromGth                      No GeneMark-Training, train AUGUSTUS from
                                    GenomeThreader alignments
--makehub                           Create track data hub with make_hub.py
                                    for visualizing BRAKER results with the
                                    UCSC GenomeBrowser
--email                             E-mail address for creating track data hub
--version                           Print version number of braker.pl
--help                              Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/       Set path to config directory of AUGUSTUS
                                    (if not specified as environment
                                    variable). BRAKER1 will assume that the
                                    directories ../bin and ../scripts of
                                    AUGUSTUS are located relative to the
                                    AUGUSTUS_CONFIG_PATH. If this is not the
                                    case, please specify AUGUSTUS_BIN_PATH
                                    (and AUGUSTUS_SCRIPTS_PATH if required).
                                    The braker.pl commandline argument
                                    --AUGUSTUS_CONFIG_PATH has higher priority
                                    than the environment variable with the
                                    same name.
--AUGUSTUS_BIN_PATH=/path/          Set path to the AUGUSTUS directory that
                                    contains binaries, i.e. augustus and
                                    etraining. This variable must only be set
                                    if AUGUSTUS_CONFIG_PATH does not have
                                    ../bin and ../scripts of AUGUSTUS relative
                                     to its location i.e. for global AUGUSTUS
                                    installations. BRAKER1 will assume that
                                    the directory ../scripts of AUGUSTUS is
                                    located relative to the AUGUSTUS_BIN_PATH.
                                    If this is not the case, please specify
                                    --AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/      Set path to AUGUSTUS directory that
                                    contains scripts, i.e. splitMfasta.pl.
                                    This variable must only be set if
                                    AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH
                                    do not contains the ../scripts directory
                                    of AUGUSTUS relative to their location,
                                    i.e. for special cases of a global
                                    AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/           Set path to bamtools (if not specified as
                                    environment BAMTOOLS_PATH variable). Has
                                    higher priority than the environment
                                    variable.
--GENEMARK_PATH=/path/to/           Set path to GeneMark-ET (if not specified
                                    as environment GENEMARK_PATH variable).
                                    Has higher priority than environment
                                    variable.
--SAMTOOLS_PATH=/path/to/           Optionally set path to samtools (if not
                                    specified as environment SAMTOOLS_PATH
                                    variable) to fix BAM files automatically,
                                    if necessary. Has higher priority than
                                    environment variable.
--PROTHINT_PATH=/path/to/           Set path to the directory with prothint.py.
                                    (if not specified as PROTHINT_PATH
                                    environment variable). Has higher priority
                                    than environment variable.
--ALIGNMENT_TOOL_PATH=/path/to/tool Set path to alignment tool
                                    (GenomeThreader, Spaln, or Exonerate)
                                    if not specified as environment
                                    ALIGNMENT_TOOL_PATH variable. Has higher
                                    priority than environment variable.
--DIAMOND_PATH=/path/to/diamond     Set path to diamond, this is an alternative
                                    to NCIB blast; you only need to specify one
                                    out of DIAMOND_PATH or BLAST_PATH, not both.
                                    DIAMOND is a lot faster that BLAST and yields
                                    highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall      Set path to NCBI blastall and formatdb
                                    executables if not specified as
                                    environment variable. Has higher priority
                                    than environment variable.
--PYTHON3_PATH=/path/to             Set path to python3 executable (if not
                                    specified as envirnonment variable and if
                                    executable is not in your $PATH).
--JAVA_PATH=/path/to                Set path to java executable (if not
                                    specified as environment variable and if
                                    executable is not in your $PATH), only
                                    required with flags --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to               Set path to gushr.py exectuable (if not
                                    specified as an environment variable and if
                                    executable is not in your $PATH), only required
                                    with the flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to             Set path to make_hub.py (if option --makehub
                                    is used).
--CDBTOOLS_PATH=/path/to            cdbfasta/cdbyank are required for running
                                    fix_in_frame_stop_codon_genes.py. Usage of
                                    that script can be skipped with option
                                    '--skip_fixing_broken_genes'.


EXPERT OPTIONS

--augustus_args="--some_arg=bla"    One or several command line arguments to
                                    be passed to AUGUSTUS, if several
                                    arguments are given, separate them by
                                    whitespace, i.e.
                                    "--first_arg=sth --second_arg=sth".
--skipGeneMark-ES                   Skip GeneMark-ES and use provided
                                    GeneMark-ES output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ET                   Skip GeneMark-ET and use provided
                                    GeneMark-ET output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-EP                   Skip GeneMark-EP and use provided
                                    GeneMark-EP output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP                  Skip GeneMark-ETP and use provided
                                    GeneMark-ETP output (e.g. provided with
                                    --geneMarkGtf=genemark.gtf)
--geneMarkGtf=file.gtf              If skipGeneMark-ET is used, braker will by
                                    default look in the working directory in
                                    folder GeneMarkET for an already existing
                                    gtf file. Instead, you may provide such a
                                    file from another location. If geneMarkGtf
                                    option is set, skipGeneMark-ES/ET/EP/ETP is
                                    automatically also set. Note that gene and
                                    transcript ids in the final output may not
                                    match the ids in the input genemark.gtf
                                    because BRAKER internally re-assigns these
                                    ids.
--rounds                            The number of optimization rounds used in
                                    optimize_augustus.pl (default 5)
--skipAllTraining                   Skip GeneMark-EX (training and
                                    prediction), skip AUGUSTUS training, only
                                    runs AUGUSTUS with pre-trained and already
                                    existing parameters (not recommended).
                                    Hints from input are still generated.
                                    This option automatically sets
                                    --useexisting to true.
--useexisting                       Use the present config and parameter files
                                    if they exist for 'species'; will overwrite
                                    original parameters if BRAKER performs
                                    an AUGUSTUS training.
--filterOutShort                    It may happen that a "good" training gene,
                                    i.e. one that has intron support from
                                    RNA-Seq in all introns predicted by
                                    GeneMark-EX, is in fact too short. This flag
                                    will discard such genes that have
                                    supported introns and a neighboring
                                    RNA-Seq supported intron upstream of the
                                    start codon within the range of the
                                    maximum CDS size of that gene and with a
                                    multiplicity that is at least as high as
                                    20% of the average intron multiplicity of
                                    that gene.
--skipOptimize                      Skip optimize parameter step (not
                                    recommended).
--skipIterativePrediction           Skip iterative prediction in --epmode (does
                                    not affect other modes, saves a bit of runtime)
--skipGetAnnoFromFasta              Skip calling the python3 script
                                    getAnnoFastaFromJoingenes.py from the
                                    AUGUSTUS tool suite. This script requires
                                    python3, biopython and re (regular
                                    expressions) to be installed. It produces
                                    coding sequence and protein FASTA files
                                    from AUGUSTUS gene predictions and provides
                                    information about genes with in-frame stop
                                    codons. If you enable this flag, these files
                                    will not be produced and python3 and
                                    the required modules will not be necessary
                                    for running braker.pl.
--skip_fixing_broken_genes          If you do not have python3, you can choose
                                    to skip the fixing of stop codon including
                                    genes (not recommended).
--eval=reference.gtf                Reference set to evaluate predictions
                                    against (using evaluation scripts from GaTech)
--eval_pseudo=pseudo.gff3           File with pseudogenes that will be excluded
                                    from accuracy evaluation (may be empty file)
--AUGUSTUS_hints_preds=s            File with AUGUSTUS hints predictions; will
                                    use this file as basis for UTR training;
                                    only UTR training and prediction is
                                    performed if this option is given.
--flanking_DNA=n                    Size of flanking region, must only be
                                    specified if --AUGUSTUS_hints_preds is given
                                    (for UTR training in a separate braker.pl
                                    run that builds on top of an existing run)
--verbosity=n                       0 -> run braker.pl quiet (no log)
                                    1 -> only log warnings
                                    2 -> also log configuration
                                    3 -> log all major steps
                                    4 -> very verbose, log also small steps
--downsampling_lambda=d             The distribution of introns in training
                                    gene structures generated by GeneMark-EX
                                    has a huge weight on single-exon and
                                    few-exon genes. Specifying the lambda
                                    parameter of a poisson distribution will
                                    make braker call a script for downsampling
                                    of training gene structures according to
                                    their number of introns distribution, i.e.
                                    genes with none or few exons will be
                                    downsampled, genes with many exons will be
                                    kept. Default value is 2.
                                    If you want to avoid downsampling, you have
                                    to specify 0.
--checkSoftware                     Only check whether all required software
                                    is installed, no execution of BRAKER
--nocleanup                         Skip deletion of all files that are typically not
                                    used in an annotation project after
                                    running braker.pl. (For tracking any
                                    problems with a braker.pl run, you
                                    might want to keep these files, therefore
                                    nocleanup can be activated.)


DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns             list of splice site patterns for UTR
                                    prediction; default: GTAG, extend like this:
                                    --splice_sites=GTAG,ATAC,...
                                    this option only affects UTR training
                                    example generation, not gene prediction
                                    by AUGUSTUS
--overwrite                         Overwrite existing files (except for
                                    species parameter files) Beware, currently
                                    not implemented properly!
--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl
                                    is executed, it may require one ore several
                                    extrinsicCfgFiles. Don't use this option
                                    unless you know what you are doing!
--stranded=+,-,+,-,...              If UTRs are trained, i.e.~strand-specific
                                    bam-files are supplied and coverage
                                    information is extracted for gene prediction,
                                    create stranded ep hints. The order of
                                    strand specifications must correspond to the
                                    order of bam files. Possible values are
                                    +, -, .
                                    If stranded data is provided, ONLY coverage
                                    data from the stranded data is used to
                                    generate UTR examples! Coverage data from
                                    unstranded data is used in the prediction
                                    step, only.
                                    The stranded label is applied to coverage
                                    data, only. Intron hints are generated
                                    from all libraries treated as "unstranded"
                                    (because splice site filtering eliminates
                                    intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg                Optional custom config file for AUGUSTUS
                                    for running PPX (currently not
                                    implemented)
--grass                             Switch this flag on if you are using braker.pl
                                    for predicting genes in grasses with
                                    GeneMark-EX. The flag will enable
                                    GeneMark-EX to handle GC-heterogenicity
                                    within genes more properly.
                                    NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa         Transmasked genome FASTA file for GeneMark-EX
                                    (to be used instead of the regular genome
                                    FASTA file).
--min_contig=INT                    Minimal contig length for GeneMark-EX, could
                                    for example be set to 10000 if transmasked_fasta
                                    option is used because transmasking might
                                    introduce many very short contigs.
--translation_table=INT             Change translation table from non-standard
                                    to something else.
                                    DOES NOT WORK YET BECAUSE BRAKER DOESNT
                                    SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!
--gc_probability=DECIMAL            Probablity for donor splice site pattern GC
                                    for gene prediction with GeneMark-EX,
                                    default value is 0.001
--gm_max_intergenic=INT             Adjust maximum allowed size of intergenic
                                    regions in GeneMark-EX. If not used, the value
                                    is automatically determined by GeneMark-EX.


EXAMPLE

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff
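
Because this environment bundles GenomeThreader, the GenomeThreader-related options from the help message above (--prg=gth, --trainFromGth, --gth2traingenes) are likely the ones of interest. The sketch below only prints candidate command lines so they can be reviewed before running. The helper name braker_gth_cmd and all file and species names are hypothetical, and --softmasking is included on the assumption that the genome is softmasked.

```shell
#!/usr/bin/env bash
# Sketch: build a braker.pl call that uses GenomeThreader (--prg=gth).
# Arguments: genome FASTA, protein FASTA, species name, optional BAM file.
# The command is printed, not executed.
braker_gth_cmd() {
    local genome=$1 proteins=$2 species=$3 bam=${4:-}
    local cmd=(braker.pl "--genome=$genome" "--species=$species"
               "--prot_seq=$proteins" --prg=gth --softmasking)
    if [ -n "$bam" ]; then
        # RNA-Seq available: add GenomeThreader-derived training genes
        # to the GeneMark-ET training genes.
        cmd+=("--bam=$bam" --gth2traingenes)
    else
        # Proteins only: no GeneMark training, train AUGUSTUS from
        # GenomeThreader alignments.
        cmd+=(--trainFromGth)
    fi
    printf '%s\n' "${cmd[*]}"
}

braker_gth_cmd genome.fa proteins.fa my_species
braker_gth_cmd genome.fa proteins.fa my_species rnaseq.bam
```

As the help message notes, --prg=gth is suitable only for proteins of closely related species. The printed line joins arguments with spaces, so paths containing spaces would need quoting before the command is pasted into a shell.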

software ref: https://github.com/Gaius-Augustus/BRAKER
research ref: https://github.com/Gaius-Augustus/BRAKER#citing-braker-and-software-called-by-braker