# interproscan 5.53

## interproscan

## What is InterProScan?

[InterPro](http://www.ebi.ac.uk/interpro/) is a database that integrates predictive information about proteins' function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.

Users who have novel nucleotide or protein sequences that they wish to functionally characterise can use the software package InterProScan to run the scanning algorithms from the InterPro database in an integrated way. Sequences are submitted in FASTA format. Matches are then calculated against all of the required member databases' signatures, and the results are output in a variety of formats.

Location and version:

```console
$ /local/cluster/interproscan/interproscan/interproscan.sh --version
InterProScan version 5.53-87.0
InterProScan 64-Bit build (requires Java 11)
```

Help message:

```console
$ /local/cluster/interproscan/interproscan/interproscan.sh
28/12/2021 21:49:22:691 Welcome to InterProScan-5.53-87.0
28/12/2021 21:49:22:702 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to interhelp@ebi.ac.uk

 -appl,--applications
        Optional, comma separated list of analyses. If this option is not set,
        ALL analyses will be run.
 -b,--output-file-base
        Optional, base output filename (relative or absolute path). Note that
        this option, the --output-dir (-d) option and the --outfile (-o) option
        are mutually exclusive. The appropriate file extension for the output
        format(s) will be appended automatically. By default the input file
        path/name will be used.
 -cpu,--cpu
        Optional, number of cores for inteproscan.
 -d,--output-dir
        Optional, output directory.
        Note that this option, the --outfile (-o) option and the
        --output-file-base (-b) option are mutually exclusive. The output
        filename(s) are the same as the input filename, with the appropriate
        file extension(s) for the output format(s) appended automatically.
 -dp,--disable-precalc
        Optional. Disables use of the precalculated match lookup service. All
        match calculations will be run locally.
 -dra,--disable-residue-annot
        Optional, excludes sites from the XML, JSON output
 -etra,--enable-tsv-residue-annot
        Optional, includes sites in TSV output
 -exclappl,--excl-applications
        Optional, comma separated list of analyses you want to exclude.
 -f,--formats
        Optional, case-insensitive, comma separated list of output formats.
        Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for
        protein sequences are TSV, XML and GFF3, or for nucleotide sequences
        GFF3 and XML.
 -goterms,--goterms
        Optional, switch on lookup of corresponding Gene Ontology annotation
        (IMPLIES -iprlookup option)
 -help,--help
        Optional, display help information
 -i,--input
        Optional, path to fasta file that should be loaded on Master startup.
        Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
 -incldepappl,--incl-dep-applications
        Optional, comma separated list of deprecated analyses that you want
        included. If this option is not set, deprecated analyses will not run.
 -iprlookup,--iprlookup
        Also include lookup of corresponding InterPro annotation in the TSV and
        GFF3 output formats.
 -ms,--minsize
        Optional, minimum nucleotide size of ORF to report. Will only be
        considered if n is specified as a sequence type. Please be aware of the
        fact that if you specify a too short value it might be that the
        analysis takes a very long time!
 -o,--outfile
        Optional explicit output file name (relative or absolute path). Note
        that this option, the --output-dir (-d) option and the
        --output-file-base (-b) option are mutually exclusive.
        If this option is given, you MUST specify a single output format using
        the -f option. The output file name will not be modified. Note that
        specifying an output file name using this option OVERWRITES ANY
        EXISTING FILE.
 -pa,--pathways
        Optional, switch on lookup of corresponding Pathway annotation
        (IMPLIES -iprlookup option)
 -t,--seqtype
        Optional, the type of the input sequences (dna/rna (n) or protein (p)).
        The default sequence type is protein.
 -T,--tempdir
        Optional, specify temporary file directory (relative or absolute path).
        The default location is temp/.
 -verbose,--verbose
        Optional, display more verbose log output
 -version,--version
        Optional, display version number
 -vl,--verbose-level
        Optional, display verbose log output at level specified.
 -vtsv,--output-tsv-version
        Optional, includes a TSV version file along with any TSV output (when
        TSV output requested)

Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk)
The InterProScan software itself is provided under the Apache License, Version 2.0
(http://www.apache.org/licenses/LICENSE-2.0.html). Third party components (e.g. member
database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (2020_05) : High-quality Automated and Manual Annotation of Microbial Proteomes.
ProSiteProfiles (2021_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Coils (2.2.1) : Prediction of coiled coil regions in proteins.
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
PIRSR (2021_02) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
ProSitePatterns (2021_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
Pfam (34.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
TMHMM (2.0c) : Prediction of transmembrane helices in proteins.

Deactivated analyses:
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
/local/cluster/interproscan/interproscan/interproscan.sh 58.84s user 3.46s system 326% cpu 19.090 total
```

## Tips and Tricks

InterProScan is set to use only a single thread for each hmmsearch job. However, the master thread adds some overhead, so leave 2-4 threads free: set the `-cpu` (`--cpu`) option to 2-4 fewer than the number of threads you request for your job.

Specify the `-T /data` tempdir setting to take advantage of the local HDDs on the compute nodes. This will reduce bandwidth usage and also speed up writes so that your jobs complete faster.
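
The two tips above can be combined into a single command line. The following is a minimal sketch, not a site-provided script: the helper name `build_iprscan_cmd`, the input file `proteins.fasta`, the default reserve of 2 threads, and the choice of TSV output are all assumptions made for illustration. It only prints the command so that you can check it before running it.

```shell
#!/bin/sh
# Sketch: derive the -cpu value from the number of threads requested for the
# job, holding a few back for the InterProScan master thread.

IPRSCAN=/local/cluster/interproscan/interproscan/interproscan.sh

# Hypothetical helper (not part of InterProScan): print the command to run.
#   $1 = threads requested for the job
#   $2 = input FASTA file
#   $3 = threads to hold back for the master thread (assumed default: 2)
build_iprscan_cmd() {
    requested=$1
    input=$2
    reserve=${3:-2}
    cpu=$(( requested - reserve ))
    # Never pass fewer than one worker thread to -cpu.
    if [ "$cpu" -lt 1 ]; then
        cpu=1
    fi
    # -T /data places temporary files on the node-local disk.
    echo "$IPRSCAN -i $input -f tsv -cpu $cpu -T /data"
}

# Example: a job that requested 8 threads; the command is printed, not executed.
build_iprscan_cmd 8 proteins.fasta
```

Once the printed command looks right, it can be pasted into a job script or run with `eval "$(build_iprscan_cmd 8 proteins.fasta)"`. Reserving 2 threads is the low end of the 2-4 range suggested above; pass a third argument to hold back more.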