EnTAP 0.10.8

2022-11-22

Configuration required

See the "Sample configuration ini file" section below to configure this software before use.

EnTAP-0.10.8

EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins.

Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis.

This fully featured pipeline is easy to install and configure, and it runs significantly faster than comparable annotation packages. It is developed to contend with many of the issues in existing software solutions. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.

Full Documentation can be found at:

http://entap.readthedocs.io/en/latest/

For information/bug reports, contact Alexander Hart at entap.dev@gmail.com

Sample configuration ini file

I made a sample configuration file for the CQLS infrastructure. You can find it in /local/cluster/EnTAP-0.10.8-beta/share:


$ cat /local/cluster/EnTAP-0.10.8-beta/share/sample_config.ini
#-------------------------------
# [ini_instructions]
#When using this ini file keep the following in mind:
#	1. Do not edit the input keys to the left side of the '=' sign
#	2. Be sure to use the proper value type (either a string, list, or number)
#	3. Do not add unecessary spaces to your input
#	4. When inputting a list, only add a ',' between each entry
#-------------------------------
# [configuration]
#-------------------------------
#Specify which EnTAP database you would like to download/generate or use throughout execution. Only one is required.
#    0. Serialized Database (default)
#    1. SQLITE Database
#It is advised to use the default Serialized Database as this is fastest.
#type:list (integer)
data-type=0,
#-------------------------------
# [entap]
#-------------------------------
#Path to the EnTAP binary database
#type:string
entap-db-bin=/nfs1/CGRB/databases/EnTAP/latest/entap_database.bin
#Path to the EnTAP SQL database (not needed if you are using the binary database)
#type:string
entap-db-sql=
#Path to the EnTAP graphing script (entap_graphing.py)
#type:string
entap-graph=/local/cluster/EnTAP/src/entap_graphing.py
#-------------------------------
# [expression_analysis]
#-------------------------------
#Specify the FPKM threshold with expression analysis. EnTAP will filter out transcripts below this value. (default: 0.5)
#type:decimal
fpkm=0.5
#Specify this flag if your BAM/SAM file was generated through single-end reads
#Note: this is only required in expression analysis
#Default: paired-end
#type:boolean (true/false)
single-end=false
#-------------------------------
# [expression_analysis-rsem]
#-------------------------------
#Execution method of RSEM Calculate Expression.
#Example: rsem-calculate-expression
#type:string
rsem-calculate-expression=rsem-calculate-expression
#Execution method of RSEM SAM Validate.
#Example: rsem-sam-validator
#type:string
rsem-sam-validator=rsem-sam-validator
#Execution method of RSEM Prep Reference.
#Example: rsem-prepare-reference
#type:string
rsem-prepare-reference=rsem-prepare-reference
#Execution method of RSEM Convert SAM
#Example: convert-sam-for-rsem
#type:string
convert-sam-for-rsem=convert-sam-for-rsem
#-------------------------------
# [frame_selection]
#-------------------------------
#Select this option if all of your sequences are complete proteins.
#At this point, this option will merely flag the sequences in your output file
#type:boolean (true/false)
complete=false
#Specify the Frame Selection software you would like to use. Only one flag can be specified.
#Specify flags as follows:
#    1. GeneMarkS-T
#    2. Transdecoder (default)
#type:integer
frame-selection=2
#-------------------------------
# [frame_selection-genemarks-t]
#-------------------------------
#Method to execute GeneMarkST. This may be the path to the executable.
#type:string
genemarkst-exe=gmst.pl
#-------------------------------
# [frame_selection-transdecoder]
#-------------------------------
#Method to execute TransDecoder.LongOrfs. This may be the path to the executable or simply TransDecoder.LongOrfs
#type:string
transdecoder-long-exe=TransDecoder.LongOrfs
#Method to execute TransDecoder.Predict. This may be the path to the executable or simply TransDecoder.Predict
#type:string
transdecoder-predict-exe=TransDecoder.Predict
#Transdecoder only. Specify the minimum protein length
#type:integer
transdecoder-m=100
#Specify this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False
#This will 'start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.'
#type:boolean (true/false)
transdecoder-no-refine-starts=false
#-------------------------------
# [general]
#-------------------------------
#Specify the output format for the processed alignments.Multiple flags can be specified:
#    1. TSV Format (default)
#    2. CSV Format
#    3. FASTA Amino Acid (default)
#    4. FASTA Nucleotide (default)
#    5. Gene Enrichment Sequence ID vs. Effective Length TSV (default)
#    6. Gene Enrichment Sequence ID vs. GO Term TSV (default)
#type:list (integer)
output-format=1,3,4,5,6,
#-------------------------------
# [ontology]
#-------------------------------
# Specify the ontology software you would like to use
#Note: it is possible to specify more than one! Just usemultiple --ontology flags
#Specify flags as follows:
#    0. EggNOG (default)
#    1. InterProScan
#type:list (integer)
ontology=0,
#Specify the Gene Ontology levels you would like printed
#A level of 0 means that every term will be printed, while a level of 1 or higher
#means that that level and anything higher than it will be printed
#It is possible to specify multiple flags as well
#Example/Defaults: --level 0 --level 1
#type:list (integer)
level=0,1,
#-------------------------------
# [ontology-eggnog]
#-------------------------------
#Path to the EggNOG SQL database that was downloaded during the Configuration stage.
#type:string
eggnog-sql=/nfs1/CGRB/databases/EnTAP/latest/eggnog.db
#Path to EggNOG DIAMOND configured database that was generated during the Configuration stage.
#type:string
eggnog-dmnd=/nfs1/CGRB/databases/EnTAP/latest/eggnog_proteins.dmnd
#-------------------------------
# [ontology-interproscan]
#-------------------------------
#Execution method of InterProScan. This is how InterProScan is generally ran on your system.  It could be as simple as 'interproscan.sh' depending on if it is globally installed.
#type:string
interproscan-exe=interproscan.sh
#Select which databases you would like for InterProScan. Databases must be one of the following:
#    -tigrfam
#    -sfld
#    -prodom
#    -hamap
#    -pfam
#    -smart
#    -cdd
#    -prositeprofiles
#    -prositepatterns
#    -superfamily
#    -prints
#    -panther
#    -gene3d
#    -pirsf
#    -coils
#    -morbidblite
#Make sure the database is downloaded, EnTAP will not check!
#--protein tigrfam --protein pfam
#type:list (string)
protein=pfam,tigrfam,panther
#-------------------------------
# [similarity_search]
#-------------------------------
#Method to execute DIAMOND. This can be a path to the executable or simply 'diamond' if installed globally.
#type:string
diamond-exe=diamond
#Specify the type of species/taxon you are analyzing and would like alignments closer in taxonomic relevance to be favored (based on NCBI Taxonomic Database)
#Note: replace all spaces with underscores '_'
#type:string
taxon=
#Select the minimum query coverage to be allowed during similarity searching
#type:decimal
qcoverage=50
#Select the minimum target coverage to be allowed during similarity searching
#type:decimal
tcoverage=50
#Specify the contaminants you would like to flag for similarity searching. Contaminants can be selected by species or through a specific taxon (insecta) from the NCBI Taxonomy Database. If your taxon is more than one word just replace the spaces with underscores (_).
#Note: since hits are based upon a multitide of factors, a contaminant might end up being the best hit for an alignment. In this scenario, EnTAP will flagthe contaminant and it can be removed if you would like.
#type:list (string)
contam=
#Specify the E-Value that will be used as a cutoff during similarity searching.
#type:decimal
e-value=1e-05
#List of keywords that should be used to specify uninformativeness of hits during similarity searching. Generally something along the lines of 'hypothetical' or 'unknown' are used. Each term should be separated by a comma (,) This can be used if you would like to tag certain descriptions or would like to weigh certain alignments differently (see full documentation)
#Example (defaults):
#conserved, predicted, unknown, hypothetical, putative, unidentified, uncultured, uninformative, unnamed
#type:list (string)
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
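
The sample file leaves taxon= and contam= empty, so you will usually want a personal copy with those keys filled in. The following is a minimal sketch of one way to do that, not an official EnTAP tool: set_ini_key is a hypothetical helper, my_entap_config.ini is a hypothetical file name, and pinus_taeda and insecta are placeholder values only.

```shell
# Hypothetical helper: set KEY=VALUE in an EnTAP ini file.
# Only the line that starts with "KEY=" is rewritten; comment lines are untouched.
# Assumes the value contains no '|' or '&' characters (they are special to sed).
set_ini_key() {
    file=$1
    key=$2
    value=$3
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

sample=/local/cluster/EnTAP-0.10.8-beta/share/sample_config.ini
config=my_entap_config.ini   # hypothetical name for your personal copy

# Copy the sample and fill in the empty keys (only where the sample exists).
if [ -f "$sample" ]; then
    cp "$sample" "$config"
    set_ini_key "$config" taxon pinus_taeda   # placeholder; spaces become underscores
    set_ini_key "$config" contam insecta      # placeholder contaminant taxon
fi
```

Editing the copy by hand works just as well; the helper is only convenient when the same changes are applied to several projects.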

Database location

The most up-to-date versions of the required database files can be found here:


$ ls -1 /nfs1/CGRB/databases/EnTAP/latest/
eggnog.db
eggnog_proteins.dmnd
entap_database.bin
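
Before a long run it can help to confirm that the database paths named in your config actually exist. The sketch below reads the entap-db-bin, eggnog-sql, and eggnog-dmnd keys shown in the sample config and reports each one; ini_value and check_ini_paths are hypothetical helpers, and my_entap_config.ini is a hypothetical file name.

```shell
# Hypothetical helper: print the value of KEY from an ini file (first match only).
ini_value() {
    sed -n "s|^$2=||p" "$1" | head -n 1
}

# Report whether each database path named in the config exists on disk.
check_ini_paths() {
    for key in entap-db-bin eggnog-sql eggnog-dmnd; do
        path=$(ini_value "$1" "$key")
        if [ -n "$path" ] && [ -e "$path" ]; then
            echo "ok       $key=$path"
        else
            echo "MISSING  $key=$path"
        fi
    done
}

config=my_entap_config.ini   # hypothetical copy of the sample config
if [ -f "$config" ]; then
    check_ini_paths "$config"
fi
```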

Target database files (i.e. diamond index files [.dmnd]) can be found linked here:


$ ls -1 /nfs1/CGRB/databases/EnTAP/databases
plant.dmnd@
uniprot_sprot.dmnd@

New databases can be configured using the EnTAP --config flag.
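
As a rough sketch of what that could look like, the command below combines --config with -d, which the help message says expects FASTA input during configuration. All file and directory names here are hypothetical placeholders, and the thread count is arbitrary.

```shell
fasta=my_proteins.fasta      # hypothetical FASTA file to format for DIAMOND
ini=my_entap_config.ini      # hypothetical copy of the sample config
outdir=entap_config_out      # hypothetical output directory
threads=4                    # arbitrary; adjust to your allocation

# Assemble the arguments once so they can be inspected before launching.
set -- --config --ini "$ini" -d "$fasta" --out-dir "$outdir" -t "$threads"
cmd="EnTAP $*"
echo "$cmd"

# Launch only when EnTAP is on PATH and both input files exist.
if command -v EnTAP >/dev/null 2>&1 && [ -f "$fasta" ] && [ -f "$ini" ]; then
    EnTAP "$@"
fi
```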

Example run

You can find an example run here:

/nfs1/CGRB/databases/EnTAP/test-dir/
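
For orientation, a protein-based run (--runP, described in the help message as the typical mode) might be launched roughly as sketched below. This is an illustration built from the flags in the help message, not a record of how the example run was produced; the transcriptome, config, and output names are hypothetical, and the two -d paths are the linked databases listed above.

```shell
transcriptome=my_transcriptome.fasta   # hypothetical input transcriptome
ini=my_entap_config.ini                # hypothetical copy of the sample config
outdir=entap_outfiles                  # hypothetical output directory
threads=8                              # arbitrary; adjust to your allocation
db_dir=/nfs1/CGRB/databases/EnTAP/databases

# -d may be given more than once, one DIAMOND (.dmnd) database per flag.
set -- --runP --ini "$ini" -i "$transcriptome" \
    -d "$db_dir/uniprot_sprot.dmnd" -d "$db_dir/plant.dmnd" \
    --out-dir "$outdir" -t "$threads"
cmd="EnTAP $*"
echo "$cmd"

# Launch only when EnTAP is on PATH and both input files exist.
if command -v EnTAP >/dev/null 2>&1 && [ -f "$transcriptome" ] && [ -f "$ini" ]; then
    EnTAP "$@"
fi
```

To filter by expression first, the help message indicates that a BAM/SAM file can be supplied with -a; that flag is left out here to keep the sketch small.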

Location and version


$ which EnTAP
/local/cluster/bin/EnTAP
$ EnTAP --version

EnTAP  version: 0.10.8

Help message


$ EnTAP --help

USAGE:

   EnTAP  [-a <string>] [--data-generate] [--no-check] [--state <string>]
          [-t <integer>] [--no-trim] [--graph] [-d <string list>] ...  [-i
          <string>] [--ini <string>] [--overwrite] [--runN] [--runP]
          [--config] [--out-dir <string>] [--] [--version] [-h]


Where:

   -a <string>,  --align <string>
     Specify the path to the BAM/SAM file for expression analysis

   --data-generate
     Specify whether you would like to generate EnTAP databases locally
     instead of downloading them. By default, EnTAP will download the
     databases. This may be used if you are experiencing  errors with the
     default process.

   --no-check
     Use this flag if you don't want your input to EnTAP verifed. This is
     not advised to use! Your run may fail later on if inputs are not
     checked.

   --state <string>
     Specify the state of execution (EXPERIMENTAL). More information is
     available in the documentation. This flag may have undesired affects
     and may not run properly!

   -t <integer>,  --threads <integer>
     Specify the number of threads that will be used throughout EnTAP
     execution


   --no-trim
     By default, EnTAP will trim the sequence ID to the nearest space to
     help with compatibility across software. This command will instead
     remove the spaces in a sequence ID rather than trimming.

   --graph
     Check whether or not your system supports graphing. This option does
     not require any other flags and will exit EnTAP after it determined
     that the proper Python libraries are present

   -d <string list>,  --database <string list>  (accepted multiple times)
     Provide the paths to the databases you would like to use for either
     'run' or 'configuration'.

     For running/execution:

     - Ensure the databases selected are in a DIAMOND configured format
     with an extension of .dmnd

     For configuration:

     - Ensure the databases are in a typical FASTA format

     Note: if your databases do not have the typical NCBI or UniProt header
     format, taxonomic  information and filtering may not be utilized.
     Refer to the documentation to see how to properly format any data.

   -i <string>,  --input <string>
     Path to the input transcriptome file

   --ini <string>
     [REQUIRED] Specify path to the entap_config.ini file that will be used
     to find all of the configuration data.

   --overwrite
     Select this option if you would like to overwrite files from a
     previous execution of EnTAP. This will DISABLE 'picking up where you
     left off' which enables you to continue an annotation from where you
     left off before. Refer to the documentation for more information.

   --runN
     Execute EnTAP functionality with 'blastx'. This means that EnTAP will
     use nucleotide sequences for all annotation stages. If you input a
     nucleotide trancsriptome, that will be used to  annotate. Due to this,
     you will not be able to select this option and input a protein
     transcriptome.

   --runP
     Execute EnTAP functionality with 'blastp'. This means that EnTAP will
     use protein sequences for all annotation stages. If you input a
     nucleotide transcriptome, they will be frame selected and the
     subsequent protein file will be used for annotation.

     This is typically how EnTAP would be ran.

   --config
     Configure EnTAP for execution later. If this is your first time
     running EnTAP run this first!

     This will perform the following:

     - Downloading EnTAP/NCBI taxonomic database

     - Downloading Gene Ontology term database

     - Formatting any database you would like for diamond

     - Downloading UniProt Swiss-Prot information

   --out-dir <string>
     Specify the output directory you would like the data produced by EnTAP
     to be saved to.

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.


   EnTAP

   Alexander Hart and Dr. Jill Wegrzyn

   University of Connecticut

   Copyright 2017-2021

software ref: https://gitlab.com/enTAP/EnTAP
research ref: https://doi.org/10.1111/1755-0998.13106