# EnTAP 0.10.8

{{< admonition warning "Configuration required" true >}}
See the relevant section below to configure this software before use.
{{< /admonition >}}

## EnTAP-0.10.8

EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts.

Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis.

This fully featured pipeline is easy to install and configure, and it runs significantly faster than comparable annotation packages. It was developed to address many of the issues in existing software solutions. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.

Full Documentation can be found at: http://entap.readthedocs.io/en/latest/

For information/bug reports, contact Alexander Hart at entap.dev@gmail.com

-------------------------------------------------------------------------------

## Sample configuration ini file

I made a sample configuration file for the CQLS infrastructure. You can find it here `/local/cluster/EnTAP-0.10.8-beta/share`.

```console
$ cat /local/cluster/EnTAP-0.10.8-beta/share/sample_config.ini
#-------------------------------
# [ini_instructions]
#When using this ini file keep the following in mind:
#    1. Do not edit the input keys to the left side of the '=' sign
#    2. Be sure to use the proper value type (either a string, list, or number)
#    3. Do not add unecessary spaces to your input
#    4. When inputting a list, only add a ',' between each entry
#-------------------------------
# [configuration]
#-------------------------------
#Specify which EnTAP database you would like to download/generate or use throughout execution. Only one is required.
#    0. Serialized Database (default)
#    1. SQLITE Database
#It is advised to use the default Serialized Database as this is fastest.
#type:list (integer)
data-type=0,
#-------------------------------
# [entap]
#-------------------------------
#Path to the EnTAP binary database
#type:string
entap-db-bin=/nfs1/CGRB/databases/EnTAP/latest/entap_database.bin
#Path to the EnTAP SQL database (not needed if you are using the binary database)
#type:string
entap-db-sql=
#Path to the EnTAP graphing script (entap_graphing.py)
#type:string
entap-graph=/local/cluster/EnTAP/src/entap_graphing.py
#-------------------------------
# [expression_analysis]
#-------------------------------
#Specify the FPKM threshold with expression analysis. EnTAP will filter out transcripts below this value. (default: 0.5)
#type:decimal
fpkm=0.5
#Specify this flag if your BAM/SAM file was generated through single-end reads
#Note: this is only required in expression analysis
#Default: paired-end
#type:boolean (true/false)
single-end=false
#-------------------------------
# [expression_analysis-rsem]
#-------------------------------
#Execution method of RSEM Calculate Expression.
#Example: rsem-calculate-expression
#type:string
rsem-calculate-expression=rsem-calculate-expression
#Execution method of RSEM SAM Validate.
#Example: rsem-sam-validator
#type:string
rsem-sam-validator=rsem-sam-validator
#Execution method of RSEM Prep Reference.
#Example: rsem-prepare-reference
#type:string
rsem-prepare-reference=rsem-prepare-reference
#Execution method of RSEM Convert SAM
#Example: convert-sam-for-rsem
#type:string
convert-sam-for-rsem=convert-sam-for-rsem
#-------------------------------
# [frame_selection]
#-------------------------------
#Select this option if all of your sequences are complete proteins.
#At this point, this option will merely flag the sequences in your output file
#type:boolean (true/false)
complete=false
#Specify the Frame Selection software you would like to use. Only one flag can be specified.
#Specify flags as follows:
#    1. GeneMarkS-T
#    2. Transdecoder (default)
#type:integer
frame-selection=2
#-------------------------------
# [frame_selection-genemarks-t]
#-------------------------------
#Method to execute GeneMarkST. This may be the path to the executable.
#type:string
genemarkst-exe=gmst.pl
#-------------------------------
# [frame_selection-transdecoder]
#-------------------------------
#Method to execute TransDecoder.LongOrfs. This may be the path to the executable or simply TransDecoder.LongOrfs
#type:string
transdecoder-long-exe=TransDecoder.LongOrfs
#Method to execute TransDecoder.Predict. This may be the path to the executable or simply TransDecoder.Predict
#type:string
transdecoder-predict-exe=TransDecoder.Predict
#Transdecoder only. Specify the minimum protein length
#type:integer
transdecoder-m=100
#Specify this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False
#This will 'start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.'
#type:boolean (true/false)
transdecoder-no-refine-starts=false
#-------------------------------
# [general]
#-------------------------------
#Specify the output format for the processed alignments.Multiple flags can be specified:
#    1. TSV Format (default)
#    2. CSV Format
#    3. FASTA Amino Acid (default)
#    4. FASTA Nucleotide (default)
#    5. Gene Enrichment Sequence ID vs. Effective Length TSV (default)
#    6. Gene Enrichment Sequence ID vs. GO Term TSV (default)
#type:list (integer)
output-format=1,3,4,5,6,
#-------------------------------
# [ontology]
#-------------------------------
# Specify the ontology software you would like to use
#Note: it is possible to specify more than one! Just usemultiple --ontology flags
#Specify flags as follows:
#    0. EggNOG (default)
#    1. InterProScan
#type:list (integer)
ontology=0,
#Specify the Gene Ontology levels you would like printed
#A level of 0 means that every term will be printed, while a level of 1 or higher
#means that that level and anything higher than it will be printed
#It is possible to specify multiple flags as well
#Example/Defaults: --level 0 --level 1
#type:list (integer)
level=0,1,
#-------------------------------
# [ontology-eggnog]
#-------------------------------
#Path to the EggNOG SQL database that was downloaded during the Configuration stage.
#type:string
eggnog-sql=/nfs1/CGRB/databases/EnTAP/latest/eggnog.db
#Path to EggNOG DIAMOND configured database that was generated during the Configuration stage.
#type:string
eggnog-dmnd=/nfs1/CGRB/databases/EnTAP/latest/eggnog_proteins.dmnd
#-------------------------------
# [ontology-interproscan]
#-------------------------------
#Execution method of InterProScan. This is how InterProScan is generally ran on your system. It could be as simple as 'interproscan.sh' depending on if it is globally installed.
#type:string
interproscan-exe=interproscan.sh
#Select which databases you would like for InterProScan. Databases must be one of the following:
#    -tigrfam
#    -sfld
#    -prodom
#    -hamap
#    -pfam
#    -smart
#    -cdd
#    -prositeprofiles
#    -prositepatterns
#    -superfamily
#    -prints
#    -panther
#    -gene3d
#    -pirsf
#    -coils
#    -morbidblite
#Make sure the database is downloaded, EnTAP will not check!
#--protein tigrfam --protein pfam
#type:list (string)
protein=pfam,tigrfam,panther
#-------------------------------
# [similarity_search]
#-------------------------------
#Method to execute DIAMOND. This can be a path to the executable or simply 'diamond' if installed globally.
#type:string
diamond-exe=diamond
#Specify the type of species/taxon you are analyzing and would like alignments closer in taxonomic relevance to be favored (based on NCBI Taxonomic Database)
#Note: replace all spaces with underscores '_'
#type:string
taxon=
#Select the minimum query coverage to be allowed during similarity searching
#type:decimal
qcoverage=50
#Select the minimum target coverage to be allowed during similarity searching
#type:decimal
tcoverage=50
#Specify the contaminants you would like to flag for similarity searching. Contaminants can be selected by species or through a specific taxon (insecta) from the NCBI Taxonomy Database. If your taxon is more than one word just replace the spaces with underscores (_).
#Note: since hits are based upon a multitide of factors, a contaminant might end up being the best hit for an alignment. In this scenario, EnTAP will flagthe contaminant and it can be removed if you would like.
#type:list (string)
contam=
#Specify the E-Value that will be used as a cutoff during similarity searching.
#type:decimal
e-value=1e-05
#List of keywords that should be used to specify uninformativeness of hits during similarity searching. Generally something along the lines of 'hypothetical' or 'unknown' are used. Each term should be separated by a comma (,) This can be used if you would like to tag certain descriptions or would like to weigh certain alignments differently (see full documentation)
#Example (defaults):
#conserved, predicted, unknown, hypothetical, putative, unidentified, uncultured, uninformative, unnamed
#type:list (string)
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
```

## Database location

The most up-to-date versions of the required database files can be found here:

```console
$ ls -1 /nfs1/CGRB/databases/EnTAP/latest/
eggnog.db
eggnog_proteins.dmnd
entap_database.bin
```

Target database files (i.e. diamond index files [.dmnd]) can be found linked here:

```console
$ ls -1 /nfs1/CGRB/databases/EnTAP/databases
plant.dmnd@
uniprot_sprot.dmnd@
```

New databases can be configured using the `EnTAP --config` flag.

## Example run

You can find an example run here

```console
/nfs1/CGRB/databases/EnTAP/test-dir/
```

## Location and version

```console
$ which EnTAP
/local/cluster/bin/EnTAP
$ EnTAP --version
EnTAP version: 0.10.8
```

## help message

```console
$ EnTAP --help

USAGE:

   EnTAP  [-a ] [--data-generate] [--no-check] [--state ] [-t ]
          [--no-trim] [--graph] [-d ] ... [-i ] [--ini ] [--overwrite]
          [--runN] [--runP] [--config] [--out-dir ] [--] [--version]
          [-h]


Where:

   -a ,  --align
     Specify the path to the BAM/SAM file for expression analysis

   --data-generate
     Specify whether you would like to generate EnTAP databases locally
     instead of downloading them. By default, EnTAP will download the
     databases. This may be used if you are experiencing errors with the
     default process.

   --no-check
     Use this flag if you don't want your input to EnTAP verifed. This is
     not advised to use! Your run may fail later on if inputs are not
     checked.

   --state
     Specify the state of execution (EXPERIMENTAL). More information is
     available in the documentation. This flag may have undesired affects
     and may not run properly!

   -t ,  --threads
     Specify the number of threads that will be used throughout EnTAP
     execution

   --no-trim
     By default, EnTAP will trim the sequence ID to the nearest space to
     help with compatibility across software. This command will instead
     remove the spaces in a sequence ID rather than trimming.

   --graph
     Check whether or not your system supports graphing. This option does
     not require any other flags and will exit EnTAP after it determined
     that the proper Python libraries are present

   -d ,  --database   (accepted multiple times)
     Provide the paths to the databases you would like to use for either
     'run' or 'configuration'.
     For running/execution:
         - Ensure the databases selected are in a DIAMOND configured format
           with an extension of .dmnd
     For configuration:
         - Ensure the databases are in a typical FASTA format
     Note: if your databases do not have the typical NCBI or UniProt header
     format, taxonomic information and filtering may not be utilized. Refer
     to the documentation to see how to properly format any data.

   -i ,  --input
     Path to the input transcriptome file

   --ini
     [REQUIRED] Specify path to the entap_config.ini file that will be used
     to find all of the configuration data.

   --overwrite
     Select this option if you would like to overwrite files from a
     previous execution of EnTAP. This will DISABLE 'picking up where you
     left off' which enables you to continue an annotation from where you
     left off before. Refer to the documentation for more information.

   --runN
     Execute EnTAP functionality with 'blastx'. This means that EnTAP will
     use nucleotide sequences for all annotation stages. If you input a
     nucleotide trancsriptome, that will be used to annotate. Due to this,
     you will not be able to select this option and input a protein
     transcriptome.

   --runP
     Execute EnTAP functionality with 'blastp'. This means that EnTAP will
     use protein sequences for all annotation stages. If you input a
     nucleotide transcriptome, they will be frame selected and the
     subsequent protein file will be used for annotation. This is typically
     how EnTAP would be ran.

   --config
     Configure EnTAP for execution later. If this is your first time
     running EnTAP run this first! This will perform the following:
         - Downloading EnTAP/NCBI taxonomic database
         - Downloading Gene Ontology term database
         - Formatting any database you would like for diamond
         - Downloading UniProt Swiss-Prot information

   --out-dir
     Specify the output directory you would like the data produced by EnTAP
     to be saved to.

   --,  --ignore_rest
     Ignores the rest of the labeled arguments following this flag.

   --version
     Displays version information and exits.

   -h,  --help
     Displays usage information and exits.


   EnTAP
   Alexander Hart and Dr. Jill Wegrzyn
   University of Connecticut
   Copyright 2017-2021
```

software ref:

research ref:
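
As a rough illustration of how the pieces on this page fit together, the sketch below assembles a protein-mode (`--runP`) annotation command using only flags listed in the help message and the `uniprot_sprot.dmnd` target database shown under "Database location". The input file name, output directory, thread count, and the `DRY_RUN` variable are hypothetical placeholders for this example, and `entap_config.ini` is assumed to be your own copy of the sample configuration file. By default the script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical inputs: replace these with your own files and settings.
INPUT="transcripts.fasta"     # hypothetical transcriptome FASTA
CONFIG="entap_config.ini"     # assumed to be your copy of sample_config.ini
OUTDIR="entap_outfiles"       # hypothetical output directory
THREADS=8                     # hypothetical thread count

# Target DIAMOND database from the "Database location" section.
DB="/nfs1/CGRB/databases/EnTAP/databases/uniprot_sprot.dmnd"

# Assemble the command as positional parameters so each argument stays intact.
set -- EnTAP --runP --ini "$CONFIG" -i "$INPUT" -d "$DB" \
    --out-dir "$OUTDIR" -t "$THREADS"

# DRY_RUN is a variable of this sketch, not an EnTAP option.
# Print the command by default; set DRY_RUN=0 to actually run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
else
    "$@"
fi
```

Running the script with `DRY_RUN=0` executes the command instead of printing it. Before doing so, it is probably worth filling in `taxon=` and `contam=` in your copy of the ini file, since both are left empty in the sample configuration.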