Conda
See the ‘Activating the conda environment’ section below to access this
software.
PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States)
PICRUSt2 is software for predicting functional abundances based only on
marker gene sequences. The paper is listed as the research reference at the
end of this page.
“Function” usually refers to gene families such as KEGG orthologs and Enzyme
Classification numbers, but predictions can be made for any arbitrary trait.
Similarly, predictions are typically based on 16S rRNA gene sequencing data,
but other marker genes can also be used.
PICRUSt2 includes these and other improvements over the original version:
- Allows users to predict functions for any 16S sequences. Representative
sequences from OTUs or amplicon sequence variants (e.g. DADA2 and deblur
output) can be used as input by taking a sequence placement approach.
- Database of reference genomes used for prediction has been expanded by >10X.
- Addition of hidden-state prediction algorithms from the castor R package.
- Allows output of MetaCyc ontology predictions that will be comparable with
common shotgun metagenomics outputs.
- Inference of pathway abundances now relies on MinPath, which makes these
predictions more stringent.
PICRUSt2 Flowchart
Citations
PICRUSt2 wraps a number of tools to generate functional predictions from
amplicon sequences. The PICRUSt2 paper is listed as the research reference at
the end of this page. However, if you use PICRUSt2 you also need to cite the
tools below.
For phylogenetic placement of reads:
For hidden state prediction:
For pathway inference:
- MinPath (paper, website) - A modified version of this tool from the HMP
project is packaged with PICRUSt2. This tool was released under the GNU
General Public License.
Frequently asked questions
FAQs
Activating the conda environment
bash
source /local/cluster/picrust2/activate.sh
To use in SGE, add the source line above to your shell script before the
commands you would like to run.
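As a rough sketch of the SGE usage described above, the following writes a job script to disk and leaves submission to you. The job name, the `thread` parallel environment, the slot count, and the script file name are assumptions for illustration; the input and output file names are borrowed from the usage examples in the help message further down. Check your site's SGE configuration for the actual parallel environment name.

```shell
# Write a hypothetical SGE job script; every name below is a placeholder.
cat > picrust2_job.sh <<'EOF'
#!/bin/bash
# Job name, working directory, shell, and parallel environment (site-specific).
#$ -N picrust2_run
#$ -cwd
#$ -S /bin/bash
#$ -pe thread 10

# Activate the PICRUSt2 conda environment before any PICRUSt2 command.
source /local/cluster/picrust2/activate.sh

# NSLOTS is set by SGE to the number of granted slots; fall back to 1.
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out \
    --processes "${NSLOTS:-1}"
EOF

# Submit with: qsub picrust2_job.sh
```

Tying `--processes` to `NSLOTS` keeps the number of PICRUSt2 processes in line with the slots the scheduler actually granted.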
Location and version
$ which picrust2_pipeline.py
/local/cluster/picrust2/bin/picrust2_pipeline.py
$ picrust2_pipeline.py --version
picrust2_pipeline.py 2.5.0
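In a job script it can help to fail early if the environment was not activated. The helper below is a minimal sketch of such a check; `require_tool` is a hypothetical name, not part of PICRUSt2.

```shell
# Return 0 if the named program is on PATH; otherwise print a message
# to stderr and return 1. (require_tool is a hypothetical helper.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH; was the environment activated?" >&2
    return 1
}

# Typical use near the top of a job script, after sourcing activate.sh:
#   require_tool picrust2_pipeline.py || exit 1
#   picrust2_pipeline.py --version
```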
Help message
$ picrust2_pipeline.py --help
usage: picrust2_pipeline.py [-h] -s PATH -i PATH -o PATH [-p PROCESSES] [-t epa-ng|sepp] [-r PATH]
[--in_traits IN_TRAITS] [--custom_trait_tables PATH] [--marker_gene_table PATH]
[--pathway_map MAP] [--reaction_func MAP] [--no_pathways] [--regroup_map ID_MAP]
[--no_regroup] [--stratified] [--max_nsti FLOAT] [--min_reads INT]
[--min_samples INT] [-m {mp,emp_prob,pic,scp,subtree_average}] [-e EDGE_EXPONENT]
[--min_align MIN_ALIGN] [--skip_nsti] [--skip_minpath] [--no_gap_fill] [--coverage]
[--per_sequence_contrib] [--wide_table] [--skip_norm] [--remove_intermediate]
[--verbose] [-v]
Wrapper for full PICRUSt2 pipeline. Run sequence placement with EPA-NG and GAPPA to place study sequences (i.e. OTUs and ASVs) into a reference tree. Then runs hidden-state prediction with the castor R package to predict genome for each study sequence. Metagenome profiles are then generated, which can be optionally stratified by the contributing sequence. Finally, pathway abundances are predicted based on metagenome profiles. By default, output files include predictions for Enzyme Commission (EC) numbers, KEGG Orthologs (KOs), and MetaCyc pathway abundances. However, this script enables users to use custom reference and trait tables to customize analyses.
optional arguments:
-h, --help show this help message and exit
-s PATH, --study_fasta PATH
FASTA of unaligned study sequences (e.g. ASVs). The headerline should be only one field
(i.e. no additional whitespace-delimited fields).
-i PATH, --input PATH
Input table of sequence abundances (BIOM, TSV or mothur shared file format).
-o PATH, --output PATH
Output folder for final files.
-p PROCESSES, --processes PROCESSES
Number of processes to run in parallel (default: 1).
-t epa-ng|sepp, --placement_tool epa-ng|sepp
Placement tool to use when placing sequences into reference tree. One of "epa-ng" or
"sepp" must be input (default: epa-ng)
-r PATH, --ref_dir PATH
Directory containing reference sequence files (default:
/local/cluster/picrust2/lib/python3.8/site-
packages/picrust2/default_files/prokaryotic/pro_ref). Please see the online documentation
for how to name the files in this directory.
--in_traits IN_TRAITS
Comma-delimited list (with no spaces) of which gene families to predict from this set:
COG, EC, KO, PFAM, TIGRFAM. Note that EC numbers will always be predicted unless
--no_pathways is set (default: EC,KO).
--custom_trait_tables PATH
Optional path to custom trait tables with gene families as columns and genomes as rows
(overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables
can be specified by delimiting filenames by commas. Importantly, the first custom table
specified will be used for inferring pathway abundances. Typically this command would be
used with a custom marker gene table (--marker_gene_table) as well.
--marker_gene_table PATH
Path to marker gene copy number table (16S copy numbers by default).
--pathway_map MAP MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways
(default: /local/cluster/picrust2/lib/python3.8/site-
packages/picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
--reaction_func MAP Functional database to use as reactions for inferring pathway abundances (default: EC).
This should be either the short-form of the database as specified in --in_traits, or the
path to the file as would be specified for --custom_trait_tables. Note that when
functions besides the default EC numbers are used typically the --no_regroup option would
also be set.
--no_pathways Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by
default). Predicted EC number abundances are used to infer pathways when the default
reference files are used.
--regroup_map ID_MAP Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is
for regrouping EC numbers to MetaCyc reactions (default:
/local/cluster/picrust2/lib/python3.8/site-
packages/picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
--no_regroup Do not regroup input gene families to reactions as specified in the regrouping mapfile.
This option should only be used if you are using custom reference and/or mapping files.
--stratified Flag to indicate that stratified tables should be generated at all steps (will increase
run-time).
--max_nsti FLOAT Sequences with NSTI values above this value will be excluded (default: 2).
--min_reads INT Minimum number of reads across all samples for each input ASV. ASVs below this cut-off
will be counted as part of the "RARE" category in the stratified output (default: 1).
--min_samples INT Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-
off will be counted as part of the "RARE" category in the stratified output (default: 1).
-m {mp,emp_prob,pic,scp,subtree_average}, --hsp_method {mp,emp_prob,pic,scp,subtree_average}
HSP method to use."mp": predict discrete traits using max parsimony. "emp_prob": predict
discrete traits based on empirical state probabilities across tips. "subtree_average":
predict continuous traits using subtree averaging. "pic": predict continuous traits with
phylogentic independent contrast. "scp": reconstruct continuous traits using squared-
change parsimony (default: mp).
-e EDGE_EXPONENT, --edge_exponent EDGE_EXPONENT
Setting for maximum parisomony hidden-state prediction. Specifies weighting transition
costs by the inverse length of edge lengths. If 0, then edge lengths do not influence
predictions. Must be a non-negative real-valued number (default: 0.500000).
--min_align MIN_ALIGN
Proportion of the total length of an input query sequence that must align with reference
sequences. Any sequences with lengths below this value after making an alignment with
reference sequences will be excluded from the placement and all subsequent steps.
(default: 0).
--skip_nsti Do not calculate nearest-sequenced taxon index (NSTI).
--skip_minpath Do not run MinPath to identify which pathways are present as a first pass (on by
default).
--no_gap_fill Do not perform gap filling before predicting pathway abundances (Gap filling is on
otherwise by default.
--coverage Calculate pathway coverages as well as abundances, which are experimental and only useful
for advanced users.
--per_sequence_contrib
Flag to specify that MinPath is run on the genes contributed by each sequence (i.e. a
predicted genome) individually. Note this will greatly increase the runtime. The output
will be the predicted pathway abundance contributed by each individual sequence. This is
in contrast to the default stratified output, which is the contribution to the community-
wide pathway abundances. Pathway coverage stratified by contributing sequence will also
be output when --coverage is set (default: False).
--wide_table Output wide-format stratified table of metagenome and pathway predictions when "--
stratified" is set. This is the deprecated method of generating stratified tables since
it is extremely memory intensive. The stratified filenames contain "strat" rather than
"contrib" when this option is used.
--skip_norm Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S
rRNA genes). This step will be performed automatically unless this option is specified.
--remove_intermediate
Remove the intermediate outfiles of the sequence placement and pathway inference steps.
--verbose Print out details as commands are run.
-v, --version show program's version number and exit
Run full default pipeline with 10 cores (only unstratified output):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10
Run full default pipeline with 10 cores with stratified output (including pathway stratified output based on per-sequence contributions):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10 --stratified --per_sequence_contrib
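The help message above states two input constraints that are easy to trip over: each FASTA header line should be a single field, and `--in_traits` takes a comma-delimited list with no spaces drawn from COG, EC, KO, PFAM, and TIGRFAM. The functions below are a minimal pre-flight sketch of those two checks; `fasta_headers_ok` and `in_traits_ok` are hypothetical helper names, and PICRUSt2 itself may validate its inputs differently.

```shell
# Succeed only if every FASTA header line (starting with ">") consists of
# exactly one whitespace-delimited field.
fasta_headers_ok() {
    awk 'BEGIN { bad = 0 } /^>/ && NF != 1 { bad = 1 } END { exit bad }' "$1"
}

# Succeed only if the argument is a comma-delimited list, with no spaces,
# of trait names from the set listed in the --in_traits help text.
in_traits_ok() {
    printf '%s\n' "$1" |
        grep -Eq '^(COG|EC|KO|PFAM|TIGRFAM)(,(COG|EC|KO|PFAM|TIGRFAM))*$'
}

# Example use before launching the pipeline (file names are placeholders):
#   fasta_headers_ok study_seqs.fna && in_traits_ok EC,KO,COG &&
#       picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom \
#           -o picrust2_out --in_traits EC,KO,COG
```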
software ref: https://huttenhower.sph.harvard.edu/picrust
software ref: https://github.com/picrust/picrust2/wiki
research ref: https://www.nature.com/articles/s41587-020-0548-6