InterProScan 5.65

2023-11-17

Installed

This software should be available with no extra configuration.

interproscan-5.65

InterPro is a database that integrates predictive information about protein function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.

Users who have novel nucleotide or protein sequences that they wish to functionally characterise can use the software package InterProScan to run the scanning algorithms from the InterPro database in an integrated way. Sequences are submitted in FASTA format. Matches are then calculated against the signatures of all the required member databases, and the results are output in a variety of formats.

Tips and Tricks

InterProScan is set to use only a single thread for each hmmsearch job. However, the master thread adds some overhead. Therefore, set the --cpu (-cpu) option to 2-4 fewer threads than the number you request from the scheduler.
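The thread budget described above can be sketched as a small shell helper. This is only an illustration: the function name `ipr_cpus` and the default reserve of 3 threads are hypothetical, and the usage comment assumes a Grid Engine style scheduler that exports `NSLOTS`, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of InterProScan): derive a value for -cpu
# from the number of threads requested from the scheduler, leaving a few
# threads free for the master thread overhead.
ipr_cpus() {
    local slots="$1"          # threads requested from the scheduler
    local reserve="${2:-3}"   # threads to leave free (2-4 per the tip above)
    local n=$(( slots - reserve ))
    # Never hand InterProScan fewer than one thread.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example usage (assumes NSLOTS is set by the scheduler; proteins.fasta is a placeholder):
#   interproscan.sh -i proteins.fasta -cpu "$(ipr_cpus "${NSLOTS:-8}")"
```

With 8 requested threads and the default reserve, the helper prints 5.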

Specify the -T /data tempdir setting to take advantage of the local HDDs on the compute nodes. This will reduce bandwidth usage and also speed up writes so that your jobs complete faster.
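One way to use the local disk without colliding with other jobs is to give each run its own subdirectory under /data and pass that to -T. The sketch below is an assumption about local practice, not a site requirement: the function name `make_ipr_tmp` and the `interproscan.XXXXXX` naming pattern are hypothetical, and it presumes /data is writable by your user on the compute node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a private temporary directory for one
# InterProScan run under a base directory (default /data) and print its path.
make_ipr_tmp() {
    local base="${1:-/data}"
    # mktemp -d creates the directory and replaces XXXXXX with random characters.
    mktemp -d "${base%/}/interproscan.XXXXXX"
}

# Example usage (proteins.fasta is a placeholder input file):
#   tmp="$(make_ipr_tmp /data)"
#   trap 'rm -rf "$tmp"' EXIT        # clean up the local disk when the job ends
#   interproscan.sh -i proteins.fasta -T "$tmp"
```

Removing the directory on exit keeps the node's local disk from filling up with leftover temporary files.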

Activating the conda environment

Check out a node with qrsh and run:

bash
source /local/cluster/conda-envs/envs/interproscan-5/activate.sh

Location and version

$ which interproscan.sh
/local/cluster/bin/interproscan.sh
$ interproscan.sh -version
InterProScan version 5.65-97.0
InterProScan 64-Bit build  (requires Java 11)

Help message

$ interproscan.sh
17/11/2023 12:02:49:106 Welcome to InterProScan-5.65-97.0
17/11/2023 12:02:49:107 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
            -Xmx2048M -jar interproscan-5.jar


Please give us your feedback by sending an email to

interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>                           Optional, comma separated list of analyses.  If this option
                                                           is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>                  Optional, base output filename (relative or absolute path).
                                                           Note that this option, the --output-dir (-d) option and the
                                                           --outfile (-o) option are mutually exclusive.  The
                                                           appropriate file extension for the output format(s) will be
                                                           appended automatically. By default the input file path/name
                                                           will be used.
 -cpu,--cpu <CPU>                                          Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>                              Optional, output directory.  Note that this option, the
                                                           --outfile (-o) option and the --output-file-base (-b) option
                                                           are mutually exclusive. The output filename(s) are the same
                                                           as the input filename, with the appropriate file extension(s)
                                                           for the output format(s) appended automatically .
 -dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                           service.  All match calculations will be run locally.
 -dra,--disable-residue-annot                              Optional, excludes sites from the XML, JSON output
 -etra,--enable-tsv-residue-annot                          Optional, includes sites in TSV output
 -exclappl,--excl-applications <EXC-ANALYSES>              Optional, comma separated list of analyses you want to
                                                           exclude.
 -f,--formats <OUTPUT-FORMATS>                             Optional, case-insensitive, comma separated list of output
                                                           formats. Supported formats are TSV, XML, JSON, and GFF3.
                                                           Default for protein sequences are TSV, XML and GFF3, or for
                                                           nucleotide sequences GFF3 and XML.
 -goterms,--goterms                                        Optional, switch on lookup of corresponding Gene Ontology
                                                           annotation (IMPLIES -iprlookup option)
 -help,--help                                              Optional, display help information
 -i,--input <INPUT-FILE-PATH>                              Optional, path to fasta file that should be loaded on Master
                                                           startup. Alternatively, in CONVERT mode, the InterProScan 5
                                                           XML file to convert.
 -incldepappl,--incl-dep-applications <INC-DEP-ANALYSES>   Optional, comma separated list of deprecated analyses that
                                                           you want included.  If this option is not set, deprecated
                                                           analyses will not run.
 -iprlookup,--iprlookup                                    Also include lookup of corresponding InterPro annotation in
                                                           the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>                              Optional, minimum nucleotide size of ORF to report. Will only
                                                           be considered if n is specified as a sequence type. Please be
                                                           aware of the fact that if you specify a too short value it
                                                           might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>                   Optional explicit output file name (relative or absolute
                                                           path).  Note that this option, the --output-dir (-d) option
                                                           and the --output-file-base (-b) option are mutually
                                                           exclusive. If this option is given, you MUST specify a single
                                                           output format using the -f option.  The output file name will
                                                           not be modified. Note that specifying an output file name
                                                           using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                                            Optional, switch on lookup of corresponding Pathway
                                                           annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>                              Optional, the type of the input sequences (dna/rna (n) or
                                                           protein (p)).  The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>                                   Optional, specify temporary file directory (relative or
                                                           absolute path). The default location is temp/.
 -verbose,--verbose                                        Optional, display more verbose log output
 -version,--version                                        Optional, display version number
 -vl,--verbose-level <VERBOSE-LEVEL>                       Optional, display verbose log output at level specified.
 -vtsv,--output-tsv-version                                Optional, includes a TSV version file along with any TSV
                                                           output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
                       FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.
                         SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
        SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
                      PANTHER (18.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                       Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
                        Hamap (2023_01) : High-quality Automated and Manual Annotation of Microbial Proteomes.
                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
              ProSiteProfiles (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                        Coils (2.2.1) : Prediction of coiled coil regions in proteins.
                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
                        SMART (9.0) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
                          CDD (3.20) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
                        PIRSR (2023_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
              ProSitePatterns (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                      AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
                  SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
                         Pfam (36.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
                   MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
        SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
                        PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
                        TMHMM (2.0c) : Prediction of transmembrane helices in proteins.
                      NCBIfam (13.0) : NCBIfam is a collection of protein families based on Hidden Markov Models (HMMs).

Deactivated analyses:
                      Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
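
The help text states that -b, -d and -o are mutually exclusive, and that -o requires a single output format given with -f. As a rough illustration, a pre-flight check along these lines could catch such mistakes before a job is queued. The function `ipr_check_output_opts` is hypothetical and not part of InterProScan; it only inspects the option names listed in the help message above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: validate the output-related options of an
# intended interproscan.sh command line against the rules in the help text.
ipr_check_output_opts() {
    local n_out=0     # how many of -b / -d / -o were given
    local has_o=0     # 1 if -o / --outfile was given
    local formats=""  # value passed to -f / --formats, if any
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -b|--output-file-base|-d|--output-dir)
                n_out=$(( n_out + 1 )) ;;
            -o|--outfile)
                n_out=$(( n_out + 1 )); has_o=1 ;;
            -f|--formats)
                formats="${2:-}" ;;
        esac
        shift
    done
    if [ "$n_out" -gt 1 ]; then
        echo "error: -b, -d and -o are mutually exclusive" >&2
        return 1
    fi
    if [ "$has_o" -eq 1 ]; then
        case "$formats" in
            ""|*,*)
                echo "error: -o requires exactly one format via -f" >&2
                return 1 ;;
        esac
    fi
    return 0
}

# Example usage (file names are placeholders):
#   ipr_check_output_opts -i proteins.fasta -appl Pfam,PANTHER -f TSV -o proteins.tsv \
#       && interproscan.sh -i proteins.fasta -appl Pfam,PANTHER -f TSV -o proteins.tsv
```

The check is deliberately simple: it treats any argument matching an option name as that option, so an input file literally named `-d` would be miscounted.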

software ref: https://github.com/ebi-pf-team/interproscan
software ref: https://www.ebi.ac.uk/interpro/about/interproscan
software ref: https://interproscan-docs.readthedocs.io
research ref: https://doi.org/10.1093/bioinformatics/btu031