HMMER 3.3.2

2022-05-25

HMMER: biosequence analysis using profile hidden Markov models

HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).

HMMER is often used together with a profile database, such as Pfam or many of the databases that participate in InterPro. HMMER can also work with single query sequences rather than profiles, much as BLAST does. For example, you can search a protein query sequence against a database with phmmer, or run an iterative search with jackhmmer.
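
As a minimal sketch of the two sequence-query searches mentioned above, the script below writes a toy query and a toy target file and then runs phmmer and jackhmmer on them if they are installed. The file names and the sequences are made up for illustration; a real search would point at a proper sequence database. `--tblout` (tabular hit summary) and `-N` (maximum number of jackhmmer iterations) are standard options of these programs, but check `phmmer -h` and `jackhmmer -h` on your installation.

```shell
#!/bin/sh
# Hypothetical file names; the sequences are toy data, not real proteins.

# A single protein query in FASTA format.
cat > query.fa <<'EOF'
>query1 toy query (made up)
MKVLAAGIVGLLLAQPSFAADKKTEEAPKA
EOF

# A tiny stand-in for a sequence database.
cat > targets.fa <<'EOF'
>target1 toy target (made up)
MKVLAAGIVALLLAQPAFAADKKSEEAPKA
>target2 toy target (made up)
MSTNPKPQRKTKRNTNRRPQDVKFPGGGQI
EOF

if command -v phmmer >/dev/null 2>&1; then
    # Single-pass search: one query sequence against the target sequences.
    phmmer --tblout phmmer_hits.tbl query.fa targets.fa > phmmer.out
    # Iterative search: -N caps the number of search rounds.
    jackhmmer -N 3 --tblout jackhmmer_hits.tbl query.fa targets.fa > jackhmmer.out
else
    echo "phmmer not found on PATH; skipping the searches" >&2
fi
```

The full reports go to `phmmer.out` and `jackhmmer.out`, while the `--tblout` files hold one line per hit and are easier to parse in scripts.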

HMMER is designed to detect remote homologs as sensitively as possible, relying on the strength of its underlying probability models. In the past, this strength came at significant computational expense, but since the HMMER3 project, HMMER has been essentially as fast as BLAST.

HMMER can be downloaded and installed as a command line tool on your own hardware, and now it is also more widely accessible to the scientific community via new search servers at the European Bioinformatics Institute.

Location:

$ which hmmbuild
/local/cluster/hmmer/bin/hmmbuild

Help message:

$ hmmbuild -h
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmbuild [-options] <hmmfile_out> <msafile>

Basic options:
  -h     : show brief help on version and usage
  -n <s> : name the HMM <s>
  -o <f> : direct summary output to file <f>, not stdout
  -O <f> : resave annotated, possibly modified MSA to file <f>

Options for selecting alphabet rather than guessing it:
  --amino : input alignment is protein sequence data
  --dna   : input alignment is DNA sequence data
  --rna   : input alignment is RNA sequence data

Alternative model construction strategies:
  --fast           : assign cols w/ >= symfrac residues as consensus  [default]
  --hand           : manual construction (requires reference annotation)
  --symfrac <x>    : sets sym fraction controlling --fast construction  [0.5]
  --fragthresh <x> : if L <= x*alen, tag sequence as a fragment  [0.5]

Alternative relative sequence weighting strategies:
  --wpb     : Henikoff position-based weights  [default]
  --wgsc    : Gerstein/Sonnhammer/Chothia tree weights
  --wblosum : Henikoff simple filter weights
  --wnone   : don't do any relative weighting; set all to 1
  --wgiven  : use weights as given in MSA file
  --wid <x> : for --wblosum: set identity cutoff  [0.62]  (0<=x<=1)

Alternative effective sequence weighting strategies:
  --eent       : adjust eff seq # to achieve relative entropy target  [default]
  --eclust     : eff seq # is # of single linkage clusters
  --enone      : no effective seq # weighting: just use nseq
  --eset <x>   : set eff seq # for all models to <x>
  --ere <x>    : for --eent: set minimum rel entropy/position to <x>
  --esigma <x> : for --eent: set sigma param to <x>  [45.0]
  --eid <x>    : for --eclust: set fractional identity cutoff to <x>  [0.62]

Alternative prior strategies:
  --pnone    : don't use any prior; parameters are frequencies
  --plaplace : use a Laplace +1 prior

Handling single sequence inputs:
  --singlemx    : use substitution score matrix for single-sequence inputs
  --mx <s>      : substitution score matrix (built-in matrices, with --singlemx)
  --mxfile <f>  : read substitution score matrix from file <f> (with --singlemx)
  --popen <x>   : force gap open prob. (w/ --singlemx, aa default 0.02, nt 0.031)
  --pextend <x> : force gap extend prob. (w/ --singlemx, aa default 0.4, nt 0.75)

Control of E-value calibration:
  --EmL <n> : length of sequences for MSV Gumbel mu fit  [200]  (n>0)
  --EmN <n> : number of sequences for MSV Gumbel mu fit  [200]  (n>0)
  --EvL <n> : length of sequences for Viterbi Gumbel mu fit  [200]  (n>0)
  --EvN <n> : number of sequences for Viterbi Gumbel mu fit  [200]  (n>0)
  --EfL <n> : length of sequences for Forward exp tail tau fit  [100]  (n>0)
  --EfN <n> : number of sequences for Forward exp tail tau fit  [200]  (n>0)
  --Eft <x> : tail mass for Forward exponential tail tau fit  [0.04]  (0<x<1)

Other options:
  --cpu <n>          : number of parallel CPU workers for multithreads  [2]
  --stall            : arrest after start: for attaching debugger to process
  --informat <s>     : assert input alifile is in format <s> (no autodetect)
  --seed <n>         : set RNG seed to <n> (if 0: one-time arbitrary seed)  [42]
  --w_beta <x>       : tail mass at which window length is determined
  --w_length <n>     : window length
  --maxinsertlen <n> : pretend all inserts are length <= <n>

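Example usage (a sketch, not a recommended protocol): following the `Usage:` line above, the script below writes a toy protein alignment in Stockholm format, builds a profile from it with hmmbuild, and searches a toy sequence file with the result. All file names, the model name `toy`, and the sequences are made up for illustration. The hmmbuild options are taken from the help message above; `hmmsearch --tblout` is a standard hmmsearch option not shown in this help, so check `hmmsearch -h` on your installation.

```shell
#!/bin/sh
# Hypothetical file names; the alignment and sequences are toy data.

# A three-sequence protein alignment in Stockholm format.
cat > toy.sto <<'EOF'
# STOCKHOLM 1.0
seq1 ACDEFGHIKLMNPQRSTVWY
seq2 ACDEFGHIKLMNPQRSTVWY
seq3 ACDEYGHIKLMNPQ-STVWY
//
EOF

# A tiny stand-in for a sequence database to search with the profile.
cat > toy_db.fa <<'EOF'
>db1 toy sequence (made up)
MMACDEFGHIKLMNPQRSTVWYKK
>db2 toy sequence (made up)
MSTNPKPQRKTKRNTNRRPQDVKFPGGGQI
EOF

if command -v hmmbuild >/dev/null 2>&1; then
    # --amino : skip alphabet guessing and treat the input as protein
    # -n      : name the model
    # -o      : send the build summary to a file instead of stdout
    # --cpu   : number of worker threads (2 is the default shown above)
    hmmbuild --amino -n toy -o toy_build.log --cpu 2 toy.hmm toy.sto
    # Search the toy sequence file with the new profile.
    hmmsearch --tblout toy_hits.tbl toy.hmm toy_db.fa > toy_search.out
else
    echo "hmmbuild not found on PATH; skipping build and search" >&2
fi
```

The output profile comes first on the hmmbuild command line and the input alignment second, matching `<hmmfile_out> <msafile>` in the usage line.
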
software ref: http://hmmer.org/
research ref: https://doi.org/10.1093/nar/gkt263