# Bakta 1.5.1


{{< admonition tip "Conda" true >}}
See the 'activating the conda environment' section below to access this
software.
{{< /admonition >}}

## Bakta 1.5.1: rapid & standardized annotation of bacterial genomes, MAGs & plasmids

Bakta is a tool for the rapid & standardized annotation of bacterial genomes
and plasmids from both isolates and MAGs. It provides **dbxref**-rich and
**sORF**-including annotations in machine-readable `JSON` & bioinformatics
standard file formats for automatic downstream analysis.

### Description

- **Comprehensive & taxonomy-independent database** Bakta provides a large and
  taxonomy-independent database using UniProt's entire
  [UniRef](https://www.uniprot.org/uniref/) protein sequence cluster universe.
  Thus, it achieves favourable annotations in terms of sensitivity and
  specificity along the broad continuum ranging from well-studied species to
  unknown genomes from MAGs.

- **Protein sequence identification** Bakta exactly identifies known identical
  protein sequences (**IPS**) from RefSeq and UniProt allowing the
  fine-grained annotation of gene alleles (`AMR`) or closely related but
  distinct protein families. This is achieved via an alignment-free sequence
  identification (**AFSI**) approach using full-length `MD5` protein sequence
  hash digests.

- **Fast** This AFSI approach substantially accelerates the annotation
  process by avoiding computationally expensive homology searches for
  identified genes. Thus, Bakta can annotate a typical bacterial genome in
  10 ± 5 min on a laptop, and plasmids in seconds to minutes.

- **Database cross-references** Fostering the
  [FAIR](https://www.go-fair.org/fair-principles) principles, Bakta exploits
  its AFSI approach to annotate CDS with database cross-references (**dbxref**)
  to RefSeq (`WP_*`), UniRef100 (`UniRef100_*`) and UniParc (`UPI*`). In this
  way, IPS allow the surveillance of distinct gene alleles and streamline
  comparative analysis as well as later (external) annotation of `putative`
  & `hypothetical` protein sequences, which can be mapped back to existing CDS
  via these exact & stable identifiers (*E. coli* gene
  [ymiA](https://www.uniprot.org/uniprot/P0CB62)
  [...more](https://www.uniprot.org/help/dubious_sequences)). Currently, Bakta
  identifies ~214.8 million, ~199 million and ~161 million distinct protein
  sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for
  certain genomes, up to 99 % of all CDS can be identified this way, skipping
  computationally expensive sequence alignments.

- **FAIR annotations** To provide standardized annotations adhering to FAIR
  principles, Bakta utilizes a versioned custom annotation database comprising
  UniProt's [UniRef100 & UniRef90](https://www.uniprot.org/uniref/) protein
  clusters (FAIR ->
  [DOI](http://dx.doi.org/10.1038/s41597-019-0180-9)/[DOI](https://doi.org/10.1093/nar/gkaa1100))
  enriched with dbxrefs (`GO`, `COG`, `EC`) and annotated by specialized niche
  databases. For each db version we provide a comprehensive log file of all
  imported sequences and annotations.

- **Small proteins / short open reading frames** Bakta detects and annotates
  small proteins/short open reading frames (**sORF**) which are not predicted
  by tools like `Prodigal`.

- **Expert annotation systems** To provide high quality annotations for
  certain proteins of higher interest, *e.g.* AMR & VF genes, Bakta includes &
  merges different expert annotation systems. Currently, Bakta uses NCBI's
  AMRFinderPlus for AMR gene annotations as well as a generalized protein
  sequence expert system with distinct coverage, identity and priority values
  for each sequence, currently comprising the
  [VFDB](http://www.mgc.ac.cn/VFs/main.htm) as well as NCBI's
  [BlastRules](https://ftp.ncbi.nih.gov/pub/blastrules/).

- **Comprehensive workflow** Bakta annotates ncRNA cis-regulatory regions,
  oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA,
  tmRNA, rRNA, ncRNA genes, CRISPR, CDS and pseudogenes.

- **GFF3 & INSDC compliant annotations** Bakta writes GFF3 and INSDC-compliant
  (GenBank & EMBL) annotation files ready for submission (checked via
  [GenomeTools GFF3Validator](http://genometools.org/cgi-bin/gff3validator.cgi),
  [table2asn_GFF](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run) and
  [ENA Webin-CLI](https://github.com/enasequence/webin-cli) for GFF3 and EMBL
  file formats, respectively, for representative genomes of all ESKAPE species).

- **Bacteria & plasmids only** Bakta was designed to annotate bacteria
  (isolates & MAGs) and plasmids only. This design decision was made to tune
  the annotation process regarding tools, preferences & databases and to
  streamline further development & maintenance of the software.

- **Reasoning** By annotating bacterial genomes in a standardized,
  taxonomy-independent, high-throughput and local manner, Bakta aims at a
  well-balanced tradeoff between fully featured but computationally demanding
  pipelines like [PGAP](https://github.com/ncbi/pgap) and rapid, highly
  customizable offline tools like [Prokka](https://github.com/tseemann/prokka).
  Indeed, Bakta is heavily inspired by Prokka (kudos to
  [Torsten Seemann](https://github.com/tseemann)) and many command line
  options are compatible for the sake of interoperability and user convenience.
  Hence, if Bakta does not fit your needs, please consider trying Prokka.

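To make the alignment-free sequence identification (AFSI) idea from the list
above concrete, here is a toy shell sketch. It assumes GNU `md5sum` is
available. The sequences, the `toy_id_*` identifiers, and the helper names
`seq_digest` and `identify` are invented for illustration and do not reflect
Bakta's actual code or database layout. The only point is that an exact
full-length match reduces to a digest lookup, whereas a sequence differing by
a single residue yields a different digest and would be left to a homology
search.

```bash
# Toy AFSI sketch: identical protein sequences share one MD5 digest, so an
# exact match is a key lookup rather than an alignment.
# All sequences and identifiers here are made up for illustration.

# MD5 digest of a full-length protein sequence (no trailing newline hashed).
seq_digest() {
  printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}

# Tiny stand-in "database": one "digest<TAB>identifier" line per known protein.
table="$(mktemp)"
trap 'rm -f "$table"' EXIT
printf '%s\t%s\n' "$(seq_digest 'MKTAYIAKQRQISFVKSHFSRQ')" 'toy_id_0001' >  "$table"
printf '%s\t%s\n' "$(seq_digest 'MSDNELLKKAREALGW')" 'toy_id_0002' >> "$table"

# Print the identifier of an exact full-length match, or nothing at all.
# index() keeps the digest comparison string-based.
identify() {
  awk -F '\t' -v d="$(seq_digest "$1")" 'index($1, d) == 1 { print $2 }' "$table"
}

# The second query differs from a known sequence in its last residue only.
for query in 'MKTAYIAKQRQISFVKSHFSRQ' 'MKTAYIAKQRQISFVKSHFSRA'; do
  hit="$(identify "$query")"
  if [ -n "$hit" ]; then
    echo "$query -> $hit (exact match, no alignment needed)"
  else
    echo "$query -> no exact match, left for a homology search"
  fi
done
```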
-------------------------------------------------------------------------------

## Activating the conda environment

Check out a node with `qrsh`, then start `bash` and source the activation
script:

```console
bash
source /local/cluster/bakta/activate.sh
```

To run Bakta in an SGE batch job, add the `source` line above to your job
script before any `bakta` commands.

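As a rough sketch of the SGE usage described above, the snippet below writes a
job script to disk and checks its syntax. Only the `source` line and the
`BAKTA_DB` variable come from this page. The file name `bakta_job.sh`, the job
name, and the input, output and prefix names are placeholders, and the `#$`
directives are common SGE options that may need adjusting for this cluster.

```bash
# Write a minimal SGE job script (file and job names are placeholders).
cat > bakta_job.sh <<'EOF'
#!/usr/bin/env bash
#$ -N bakta_annot
#$ -cwd
#$ -S /bin/bash

# Load the Bakta environment before any bakta command.
source /local/cluster/bakta/activate.sh

# assembly.fasta, bakta_out and sample1 are hypothetical names.
# NSLOTS is set by SGE to the number of granted slots; fall back to 1.
bakta --db "$BAKTA_DB" \
      --output bakta_out \
      --prefix sample1 \
      --threads "${NSLOTS:-1}" \
      assembly.fasta
EOF

# Check the script for syntax errors without running it.
bash -n bakta_job.sh && echo "bakta_job.sh: syntax OK"
```

Submit the script with `qsub bakta_job.sh`, requesting slots through whatever
parallel environment your queue provides. Setting `--threads` explicitly is
worthwhile on shared nodes, because Bakta otherwise defaults to the number of
available CPUs.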
## Location, version, DB info

The Bakta database remains at version 4.0.

```console
$ which bakta
/local/cluster/bakta-1.5.1/bin/bakta
$ bakta --version
bakta 1.5.1
$ echo $BAKTA_DB
/nfs1/CGRB/databases/bakta/latest
```

Note: `deepsig` is included in this install:

```console
$ which deepsig
/local/cluster/bakta-1.5.1/bin/deepsig
```

## Help message

```console
$ bakta -h
usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT] [--genus GENUS] [--species SPECIES] [--strain STRAIN]
             [--plasmid PLASMID] [--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4}] [--gram {+,-,?}] [--locus LOCUS]
             [--locus-tag LOCUS_TAG] [--keep-contig-headers] [--replicons REPLICONS] [--compliant] [--proteins PROTEINS] [--skip-trna] [--skip-tmrna]
             [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region] [--skip-crispr] [--skip-cds] [--skip-pseudo] [--skip-sorf] [--skip-gap] [--skip-ori]
             [--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]
             <genome>

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

positional arguments:
  <genome>              Genome sequences in (zipped) fasta format

Input / Output:
  --db DB, -d DB        Database path (default = <bakta_path>/db). Can also be provided as BAKTA_DB environment variable.
  --min-contig-length MIN_CONTIG_LENGTH, -m MIN_CONTIG_LENGTH
                        Minimum contig size (default = 1; 200 in compliant mode)
  --prefix PREFIX, -p PREFIX
                        Prefix for output files
  --output OUTPUT, -o OUTPUT
                        Output directory (default = current working directory)

Organism:
  --genus GENUS         Genus name
  --species SPECIES     Species name
  --strain STRAIN       Strain name
  --plasmid PLASMID     Plasmid name

Annotation:
  --complete            All sequences are complete replicons (chromosome/plasmid[s])
  --prodigal-tf PRODIGAL_TF
                        Path to existing Prodigal training file to use for CDS prediction
  --translation-table {11,4}
                        Translation table: 11/4 (default = 11)
  --gram {+,-,?}        Gram type for signal peptide predictions: +/-/? (default = ?)
  --locus LOCUS         Locus prefix (default = 'contig')
  --locus-tag LOCUS_TAG
                        Locus tag prefix (default = autogenerated)
  --keep-contig-headers
                        Keep original contig headers
  --replicons REPLICONS, -r REPLICONS
                        Replicon information table (tsv/csv)
  --compliant           Force Genbank/ENA/DDJB compliance
  --proteins PROTEINS   Fasta file of trusted protein sequences for CDS annotation

Workflow:
  --skip-trna           Skip tRNA detection & annotation
  --skip-tmrna          Skip tmRNA detection & annotation
  --skip-rrna           Skip rRNA detection & annotation
  --skip-ncrna          Skip ncRNA detection & annotation
  --skip-ncrna-region   Skip ncRNA region detection & annotation
  --skip-crispr         Skip CRISPR array detection & annotation
  --skip-cds            Skip CDS detection & annotation
  --skip-pseudo         Skip pseudogene detection & annotation
  --skip-sorf           Skip sORF detection & annotation
  --skip-gap            Skip gap detection & annotation
  --skip-ori            Skip oriC/oriT detection & annotation

General:
  --help, -h            Show this help message and exit
  --verbose, -v         Print verbose information
  --debug               Run Bakta in debug mode. Temp data will not be removed.
  --threads THREADS, -t THREADS
                        Number of threads to use (default = number of available CPUs)
  --tmp-dir TMP_DIR     Location for temporary files (default = system dependent auto detection)
  --version             show program's version number and exit

Version: 1.5.1
DOI: 10.1099/mgen.0.000685
URL: github.com/oschwengers/bakta

Citation:
Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021).
Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification.
Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685
```

software ref: <https://github.com/oschwengers/bakta>  
research ref: <https://doi.org/10.1099/mgen.0.000685>