# Braker2 Genomethreader

{{< admonition tip "Conda" true >}}
See the 'Configuring the conda environment' section below to access this software.
{{< /admonition >}}

{{< admonition warning "Configuration required" true >}}
See the relevant section below to configure this software before use.
{{< /admonition >}}

## braker2 with genomethreader

The CQLS has installed braker2 and genomethreader into a conda environment for you to use. For it to work, you will have to:

1. Copy the augustus config directory using the script below
2. Activate the conda environment

You will then be able to use the updated braker2, augustus, and genomethreader from within the conda environment.

Future updates will be for braker3 in a singularity image. More details to come.

The BRAKER documentation starts now.

### What is BRAKER?

The rapidly growing number of sequenced genomes requires fully automated methods for accurate gene structure annotation. With this goal in mind, we have developed BRAKER1, a combination of GeneMark-ET and AUGUSTUS, that uses genomic and RNA-Seq data to automatically generate full gene structure annotations in novel genomes.

However, the quality of RNA-Seq data that is available for annotating a novel genome is variable, and in some cases, RNA-Seq data is not available at all.

BRAKER2 is an extension of BRAKER1 that allows for fully automated training of the gene prediction tools GeneMark-EX and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.

In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of annotations of very closely related species and in the absence of RNA-Seq data.

In this user guide, we will refer to BRAKER1 and BRAKER2 simply as BRAKER because they are executed by the same script (`braker.pl`).

### Keys to successful gene prediction

* Use a high quality genome assembly. If you have a huge number of very short scaffolds in your genome assembly, those short scaffolds will likely increase runtime dramatically but will not increase prediction accuracy.
* Use simple scaffold names in the genome file (e.g. `>contig1` will work better than `>contig1my custom species namesome putative function /more/information/  and lots of special characters %&!*(){}`). Make the scaffold names in all your fasta files simple before running any alignment program.
* In order to predict genes accurately in a novel genome, the genome should be masked for repeats. This will avoid the prediction of false positive gene structures in repetitive and low-complexity regions. Repeat masking is also essential for mapping RNA-Seq data to a genome with some tools (other RNA-Seq mappers, such as HISAT2, ignore masking information). In the case of GeneMark-EX and AUGUSTUS, softmasking (i.e. putting repeat regions into lower case letters and all other regions into upper case letters) leads to better results than hardmasking (i.e. replacing letters in repetitive regions by the letter N for unknown nucleotide). If the genome is softmasked, use the `--softmasking` flag of `braker.pl`.
* Many genomes have gene structures that will be predicted accurately with standard parameters of GeneMark-EX and AUGUSTUS within BRAKER. However, some genomes have clade-specific features, e.g. a special branch point model in fungi, or non-standard splice-site patterns. Please read the options listed in the help message below to determine whether any of the custom options may improve gene prediction accuracy in the genome of your target species.
* Always check gene prediction results before further usage! For example, you can use a genome browser for visual inspection of gene models in context with extrinsic evidence data. BRAKER supports the generation of track data hubs for the UCSC Genome Browser with MakeHub for this purpose.

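The advice on scaffold names can be applied with a short shell step before any alignment or BRAKER run. The following is a minimal sketch, not part of BRAKER itself: it assumes the sequence ID and the rest of the header are separated by whitespace, and the file names (`genome.fa`, `genome.simple.fa`, `names.tsv`) and the demo input are hypothetical.

```shell
# Hypothetical demo input: one verbose header and one already-simple header.
demo_dir=$(mktemp -d)
cat > "$demo_dir/genome.fa" <<'EOF'
>contig1 my custom species name /more/information/ %&!*(){}
ACGTacgtNN
>contig2
GGCCggcc
EOF

# Keep only the first whitespace-delimited token of each header line;
# sequence lines (including softmasked lower-case bases) pass through unchanged.
awk '/^>/ { print $1; next } { print }' \
    "$demo_dir/genome.fa" > "$demo_dir/genome.simple.fa"

# Record "new name <TAB> original header" so results can be mapped back later.
grep '^>' "$demo_dir/genome.fa" \
    | awk '{ print substr($1, 2) "\t" substr($0, 2) }' > "$demo_dir/names.tsv"

# Show the simplified headers.
grep '^>' "$demo_dir/genome.simple.fa"
```

Headers that contain no whitespace at all are left as they are by this sketch and would need a different renaming rule. Apply the same renaming to every fasta file that refers to these scaffolds so the names stay consistent.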
-------------------------------------------------------------------------------

## Configuring the conda environment

To use augustus with this software, run the `/local/cluster/conda/setup_braker2-gth_config.sh` script and provide a path to a directory that you can write to; the augustus config is copied there. Then run the command printed to the screen to activate the environment, or check out a node with `qrsh` and run:

```console
bash
source ~/activate_braker2-gth.sh
```

To use over SGE, include the above source command in your shell script prior to your braker2 commands.

## Location and version

```console
$ source ~/activate_braker2-gth.sh
$ which augustus
/local/cluster/braker2-gth/bin/augustus
$ augustus --version
AUGUSTUS (3.4.0) is a gene prediction tool.
Sources and documentation at https://github.com/Gaius-Augustus/Augustus
$ which braker.pl
/local/cluster/braker2-gth/BRAKER-2.1.6/scripts/braker.pl
$ braker.pl --version
braker.pl version 2.1.6
```

## help message

```console
$ braker.pl -help

DESCRIPTION

braker.pl   Pipeline for predicting genes with GeneMark-EX and AUGUSTUS with
            RNA-Seq and/or proteins

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

INPUT FILE OPTIONS

--genome=genome.fa
    fasta file with DNA sequences
--bam=rnaseq.bam
    bam file with spliced alignments from RNA-Seq
--prot_seq=prot.fa
    A protein sequence file in multi-fasta format used to generate protein
    hints. Unless otherwise specified, braker.pl will run in "EP mode" which
    uses ProtHint to generate protein hints and GeneMark-EP+ to train
    AUGUSTUS.
--hints=hints.gff
    Alternatively to calling braker.pl with a bam or protein fasta file, it
    is possible to call it with a .gff file that contains introns extracted
    from RNA-Seq and/or protein hints (most frequently coming from ProtHint).
    If you wish to use the ProtHint hints, use its "prothint_augustus.gff"
    output file.
    This flag also allows the usage of hints from additional extrinsic
    sources for gene prediction with AUGUSTUS. To consider such additional
    extrinsic information, you need to use the flag --extrinsicCfgFiles to
    specify parameters for all sources in the hints file (including the
    source "E" for intron hints from RNA-Seq)
--prot_aln=prot.aln
    Alignment file generated from aligning protein sequences against the
    genome with either Exonerate (--prg=exonerate), or Spaln (--prg=spaln),
    or GenomeThreader (--prg=gth). This option can be used as an alternative
    to --prot_seq file or protein hints in the --hints file.
    To prepare alignment file, run Spaln2 with the following command:
        spaln -O0 ... > spalnfile
    To prepare alignment file, run Exonerate with the following command:
        exonerate --model protein2genome \
            --showtargetgff T ... > exfile
    To prepare alignment file, run GenomeThreader with the following
    command:
        gth -genomic genome.fa -protein \
            protein.fa -gff3out \
            -skipalignmentout ... -o gthfile
    A valid option prg=... must be specified in combination with --prot_aln.
    Generating tool will not be guessed.
    Currently, hints from protein alignment files are only used in the
    prediction step with AUGUSTUS.

FREQUENTLY USED OPTIONS

--species=sname
    Species name. Existing species will not be overwritten. Uses Sp_1 etc.,
    if no species is assigned
--AUGUSTUS_ab_initio
    output ab initio predictions by AUGUSTUS in addition to predictions with
    hints by AUGUSTUS
--softmasking
    Softmasking option for soft masked genome files. (Disabled by default.)
--esmode
    Run GeneMark-ES (genome sequence only) and train AUGUSTUS on long genes
    predicted by GeneMark-ES. Final predictions are ab initio
--epmode
    Run ProtHint to generate protein hints (if not already specified with
    --hints option) and use the hints in GeneMark-EP+ to create a training
    set for AUGUSTUS.
--etpmode
    Use RNA-Seq and protein hints in GeneMark-ETP+ to create a training set
    for AUGUSTUS.
    The protein hints are generated by ProtHint (see --epmode).
--gff3
    Output in GFF3 format (default is gtf format)
--cores
    Specifies the maximum number of cores that can be used during
    computation. Be aware: optimize_augustus.pl will use max. 8 cores;
    augustus will use max. nContigs in --genome=file cores.
--workingdir=/path/to/wd/
    Set path to working directory. In the working directory results and
    temporary files are stored
--nice
    Execute all system calls within braker.pl and its submodules with bash
    "nice" (default nice value)
--alternatives-from-evidence=true
    Output alternative transcripts based on explicit evidence from hints
    (default is true).
--fungus
    GeneMark-EX option: run algorithm with branch point model (most useful
    for fungal genomes)
--crf
    Execute CRF training for AUGUSTUS; resulting parameters are only kept
    for final predictions if they show higher accuracy than HMM parameters.
--keepCrf
    keep CRF parameters even if they are not better than HMM parameters
--UTR=on
    create UTR training examples from RNA-Seq coverage data; requires
    options --bam=rnaseq.bam and --softmasking. Alternatively, if UTR
    parameters already exist, training step will be skipped and those
    pre-existing parameters are used.
--addUTR=on
    Adds UTRs from RNA-Seq coverage data to augustus.hints.gtf file. Does
    not perform training of AUGUSTUS or gene prediction with AUGUSTUS and
    UTR parameters.
--prg=gth|exonerate|spaln
    Specify an alternative method for generating hints from similarity of
    protein sequence data to genome data (alternative to the default
    --epmode/--etpmode in which ProtHint is used to generate the protein
    hints). Available methods are: gth (GenomeThreader), exonerate
    (Exonerate), or spaln (Spaln2). Note that this option is suitable only
    for proteins of closely related species (while the --epmode is generally
    applicable). This option is required in case --prot_aln option is used.
--gth2traingenes
    Generate training gene structures for AUGUSTUS from GenomeThreader
    alignments.
    (These genes can either be used for training AUGUSTUS alone with
    --trainFromGth; or in addition to GeneMark-ET training genes if also a
    bam-file is supplied.)
--trainFromGth
    No GeneMark-Training, train AUGUSTUS from GenomeThreader alignments
--makehub
    Create track data hub with make_hub.py for visualizing BRAKER results
    with the UCSC GenomeBrowser
--email
    E-mail address for creating track data hub
--version
    Print version number of braker.pl
--help
    Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/
    Set path to config directory of AUGUSTUS (if not specified as
    environment variable). BRAKER1 will assume that the directories ../bin
    and ../scripts of AUGUSTUS are located relative to the
    AUGUSTUS_CONFIG_PATH. If this is not the case, please specify
    AUGUSTUS_BIN_PATH (and AUGUSTUS_SCRIPTS_PATH if required). The braker.pl
    commandline argument --AUGUSTUS_CONFIG_PATH has higher priority than the
    environment variable with the same name.
--AUGUSTUS_BIN_PATH=/path/
    Set path to the AUGUSTUS directory that contains binaries, i.e. augustus
    and etraining. This variable must only be set if AUGUSTUS_CONFIG_PATH
    does not have ../bin and ../scripts of AUGUSTUS relative to its location
    i.e. for global AUGUSTUS installations. BRAKER1 will assume that the
    directory ../scripts of AUGUSTUS is located relative to the
    AUGUSTUS_BIN_PATH. If this is not the case, please specify
    --AUGUSTUS_SCRIPTS_PATH.
--AUGUSTUS_SCRIPTS_PATH=/path/
    Set path to AUGUSTUS directory that contains scripts, i.e.
    splitMfasta.pl. This variable must only be set if AUGUSTUS_CONFIG_PATH
    or AUGUSTUS_BIN_PATH do not contains the ../scripts directory of
    AUGUSTUS relative to their location, i.e. for special cases of a global
    AUGUSTUS installation.
--BAMTOOLS_PATH=/path/to/
    Set path to bamtools (if not specified as environment BAMTOOLS_PATH
    variable). Has higher priority than the environment variable.
--GENEMARK_PATH=/path/to/
    Set path to GeneMark-ET (if not specified as environment GENEMARK_PATH
    variable). Has higher priority than environment variable.
--SAMTOOLS_PATH=/path/to/
    Optionally set path to samtools (if not specified as environment
    SAMTOOLS_PATH variable) to fix BAM files automatically, if necessary.
    Has higher priority than environment variable.
--PROTHINT_PATH=/path/to/
    Set path to the directory with prothint.py. (if not specified as
    PROTHINT_PATH environment variable). Has higher priority than
    environment variable.
--ALIGNMENT_TOOL_PATH=/path/to/tool
    Set path to alignment tool (GenomeThreader, Spaln, or Exonerate) if not
    specified as environment ALIGNMENT_TOOL_PATH variable. Has higher
    priority than environment variable.
--DIAMOND_PATH=/path/to/diamond
    Set path to diamond, this is an alternative to NCIB blast; you only need
    to specify one out of DIAMOND_PATH or BLAST_PATH, not both. DIAMOND is a
    lot faster that BLAST and yields highly similar results for BRAKER.
--BLAST_PATH=/path/to/blastall
    Set path to NCBI blastall and formatdb executables if not specified as
    environment variable. Has higher priority than environment variable.
--PYTHON3_PATH=/path/to
    Set path to python3 executable (if not specified as envirnonment
    variable and if executable is not in your $PATH).
--JAVA_PATH=/path/to
    Set path to java executable (if not specified as environment variable
    and if executable is not in your $PATH), only required with flags
    --UTR=on and --addUTR=on
--GUSHR_PATH=/path/to
    Set path to gushr.py exectuable (if not specified as an environment
    variable and if executable is not in your $PATH), only required with the
    flags --UTR=on and --addUTR=on
--MAKEHUB_PATH=/path/to
    Set path to make_hub.py (if option --makehub is used).
--CDBTOOLS_PATH=/path/to
    cdbfasta/cdbyank are required for running
    fix_in_frame_stop_codon_genes.py. Usage of that script can be skipped
    with option '--skip_fixing_broken_genes'.

EXPERT OPTIONS

--augustus_args="--some_arg=bla"
    One or several command line arguments to be passed to AUGUSTUS, if
    several arguments are given, separate them by whitespace, i.e.
    "--first_arg=sth --second_arg=sth".
--skipGeneMark-ES
    Skip GeneMark-ES and use provided GeneMark-ES output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ET
    Skip GeneMark-ET and use provided GeneMark-ET output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-EP
    Skip GeneMark-EP and use provided GeneMark-EP output (e.g. provided with
    --geneMarkGtf=genemark.gtf)
--skipGeneMark-ETP
    Skip GeneMark-ETP and use provided GeneMark-ETP output (e.g. provided
    with --geneMarkGtf=genemark.gtf)
--geneMarkGtf=file.gtf
    If skipGeneMark-ET is used, braker will by default look in the working
    directory in folder GeneMarkET for an already existing gtf file.
    Instead, you may provide such a file from another location. If
    geneMarkGtf option is set, skipGeneMark-ES/ET/EP/ETP is automatically
    also set. Note that gene and transcript ids in the final output may not
    match the ids in the input genemark.gtf because BRAKER internally
    re-assigns these ids.
--rounds
    The number of optimization rounds used in optimize_augustus.pl
    (default 5)
--skipAllTraining
    Skip GeneMark-EX (training and prediction), skip AUGUSTUS training, only
    runs AUGUSTUS with pre-trained and already existing parameters (not
    recommended). Hints from input are still generated. This option
    automatically sets --useexisting to true.
--useexisting
    Use the present config and parameter files if they exist for 'species';
    will overwrite original parameters if BRAKER performs an AUGUSTUS
    training.
--filterOutShort
    It may happen that a "good" training gene, i.e. one that has intron
    support from RNA-Seq in all introns predicted by GeneMark-EX, is in fact
    too short.
    This flag will discard such genes that have supported introns and a
    neighboring RNA-Seq supported intron upstream of the start codon within
    the range of the maximum CDS size of that gene and with a multiplicity
    that is at least as high as 20% of the average intron multiplicity of
    that gene.
--skipOptimize
    Skip optimize parameter step (not recommended).
--skipIterativePrediction
    Skip iterative prediction in --epmode (does not affect other modes,
    saves a bit of runtime)
--skipGetAnnoFromFasta
    Skip calling the python3 script getAnnoFastaFromJoingenes.py from the
    AUGUSTUS tool suite. This script requires python3, biopython and re
    (regular expressions) to be installed. It produces coding sequence and
    protein FASTA files from AUGUSTUS gene predictions and provides
    information about genes with in-frame stop codons. If you enable this
    flag, these files will not be produced and python3 and the required
    modules will not be necessary for running braker.pl.
--skip_fixing_broken_genes
    If you do not have python3, you can choose to skip the fixing of stop
    codon including genes (not recommended).
--eval=reference.gtf
    Reference set to evaluate predictions against (using evaluation scripts
    from GaTech)
--eval_pseudo=pseudo.gff3
    File with pseudogenes that will be excluded from accuracy evaluation
    (may be empty file)
--AUGUSTUS_hints_preds=s
    File with AUGUSTUS hints predictions; will use this file as basis for
    UTR training; only UTR training and prediction is performed if this
    option is given.
--flanking_DNA=n
    Size of flanking region, must only be specified if
    --AUGUSTUS_hints_preds is given (for UTR training in a separate
    braker.pl run that builds on top of an existing run)
--verbosity=n
    0 -> run braker.pl quiet (no log)
    1 -> only log warnings
    2 -> also log configuration
    3 -> log all major steps
    4 -> very verbose, log also small steps
--downsampling_lambda=d
    The distribution of introns in training gene structures generated by
    GeneMark-EX has a huge weight on single-exon and few-exon genes.
    Specifying the lambda parameter of a poisson distribution will make
    braker call a script for downsampling of training gene structures
    according to their number of introns distribution, i.e. genes with none
    or few exons will be downsampled, genes with many exons will be kept.
    Default value is 2. If you want to avoid downsampling, you have to
    specify 0.
--checkSoftware
    Only check whether all required software is installed, no execution of
    BRAKER
--nocleanup
    Skip deletion of all files that are typically not used in an annotation
    project after running braker.pl. (For tracking any problems with a
    braker.pl run, you might want to keep these files, therefore nocleanup
    can be activated.)

DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns
    list of splice site patterns for UTR prediction; default: GTAG, extend
    like this: --splice_sites=GTAG,ATAC,... this option only affects UTR
    training example generation, not gene prediction by AUGUSTUS
--overwrite
    Overwrite existing files (except for species parameter files) Beware,
    currently not implemented properly!
--extrinsicCfgFiles=file1,file2,...
    Depending on the mode in which braker.pl is executed, it may require one
    ore several extrinsicCfgFiles. Don't use this option unless you know
    what you are doing!
--stranded=+,-,+,-,...
    If UTRs are trained, i.e.~strand-specific bam-files are supplied and
    coverage information is extracted for gene prediction, create stranded
    ep hints.
    The order of strand specifications must correspond to the order of bam
    files. Possible values are +, -, . If stranded data is provided, ONLY
    coverage data from the stranded data is used to generate UTR examples!
    Coverage data from unstranded data is used in the prediction step, only.
    The stranded label is applied to coverage data, only. Intron hints are
    generated from all libraries treated as "unstranded" (because splice
    site filtering eliminates intron hints from the wrong strand, anyway).
--optCfgFile=ppx.cfg
    Optional custom config file for AUGUSTUS for running PPX (currently not
    implemented)
--grass
    Switch this flag on if you are using braker.pl for predicting genes in
    grasses with GeneMark-EX. The flag will enable GeneMark-EX to handle
    GC-heterogenicity within genes more properly.
    NOTHING IMPLEMENTED FOR GRASS YET!
--transmasked_fasta=file.fa
    Transmasked genome FASTA file for GeneMark-EX (to be used instead of the
    regular genome FASTA file).
--min_contig=INT
    Minimal contig length for GeneMark-EX, could for example be set to 10000
    if transmasked_fasta option is used because transmasking might introduce
    many very short contigs.
--translation_table=INT
    Change translation table from non-standard to something else.
    DOES NOT WORK YET BECAUSE BRAKER DOESNT SWITCH TRANSLATION TABLE FOR
    GENEMARK-EX, YET!
--gc_probability=DECIMAL
    Probablity for donor splice site pattern GC for gene prediction with
    GeneMark-EX, default value is 0.001
--gm_max_intergenic=INT
    Adjust maximum allowed size of intergenic regions in GeneMark-EX. If not
    used, the value is automatically determined by GeneMark-EX.

EXAMPLE

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff
```

software ref:

research ref:
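
Combining the activation step with the GenomeThreader-related flags from the help message above, a job script for SGE might look like the following sketch. It is an illustration under stated assumptions, not a tested CQLS recipe: the file names, the species label, and the `DRY_RUN` switch are hypothetical, and it assumes the activated environment already points BRAKER at the `gth` binary (otherwise `--ALIGNMENT_TOOL_PATH` would be needed). `NSLOTS` is the slot count that SGE exports inside a job.

```shell
#!/bin/bash
# Placeholder inputs; override them via environment variables.
GENOME=${GENOME:-genome.softmasked.fa}
PROTEINS=${PROTEINS:-proteins.fa}
SPECIES=${SPECIES:-my_species}
CORES=${NSLOTS:-4}        # fall back to 4 cores outside of an SGE job
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the command, 0 = actually run it

# Argument list built only from flags documented in the help message:
# protein input aligned with GenomeThreader, AUGUSTUS trained from those
# alignments, softmasked genome.
set -- --genome="$GENOME" --prot_seq="$PROTEINS" --prg=gth --trainFromGth \
       --softmasking --species="$SPECIES" --cores="$CORES"

braker_cmd="braker.pl $*"
if [ "$DRY_RUN" = 1 ]; then
    # Print the command for inspection without running anything.
    echo "$braker_cmd"
else
    # Activate the environment first, as described in the configuration section.
    source ~/activate_braker2-gth.sh
    braker.pl "$@"
fi
```

Running the script with its defaults only prints the assembled command, which makes it easy to review before submitting; setting `DRY_RUN=0` runs `braker.pl` after sourcing the activation script. Because `--prg=gth` is described as suitable only for proteins of closely related species, the ProtHint-based default may be a better fit for more distant protein sets.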