PICRUSt2 2.5.0

2022-06-13

Conda

See the ‘activating the conda environment’ section below to access this software.

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States)

PICRUSt2 is software for predicting functional abundances based only on marker gene sequences. Check out the paper here.

“Function” usually refers to gene families such as KEGG orthologs and Enzyme Classification numbers, but predictions can be made for any arbitrary trait. Similarly, predictions are typically based on 16S rRNA gene sequencing data, but other marker genes can also be used.

PICRUSt2 includes these and other improvements over the original version:

Allows users to predict functions for any 16S sequences. Representative sequences from OTUs or amplicon sequence variants (e.g. DADA2 and deblur output) can be used as input by taking a sequence placement approach.
Database of reference genomes used for prediction has been expanded by >10X.
Addition of hidden-state prediction algorithms from the castor R package.
Allows output of MetaCyc ontology predictions that will be comparable with common shotgun metagenomics outputs.
Inference of pathway abundances now relies on MinPath, which makes these predictions more stringent.

PICRUSt2 Flowchart

Citations

PICRUSt2 wraps a number of tools to generate functional predictions from amplicon sequences. The PICRUSt2 paper can be found here. However, if you use PICRUSt2, you also need to cite the tools below.

For phylogenetic placement of reads:

EPA-NG (paper, website) - Default placement option.
gappa (paper, website)
SEPP (paper, website) - If alternative placement option used.

For hidden state prediction:

castor (paper, website)

For pathway inference:

MinPath (paper, website) - A modified version of this tool from the HMP project is packaged with PICRUSt2. This tool was released under the GNU General Public License.

Frequently asked questions

FAQs

Activating the conda environment


bash
source /local/cluster/picrust2/activate.sh

To use in SGE, add the source line above to your shell script before the commands you would like to run.
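
As a rough illustration of the SGE note above, the following sketch writes a small job script and checks its syntax without submitting it. The job name `picrust2_example`, the parallel environment name `thread`, and the input file names are assumptions for illustration only; parallel environment names are site-specific, so check your cluster's configuration before using one.

```shell
#!/usr/bin/env bash
# Sketch: write a hypothetical SGE job script to picrust2_job.sh.
# The job name and the "thread" parallel environment are assumed, not taken from this site.
cat > picrust2_job.sh <<'EOF'
#!/bin/bash
#$ -N picrust2_example
#$ -cwd
#$ -S /bin/bash
#$ -pe thread 10

# Activate the PICRUSt2 environment before running any PICRUSt2 commands.
source /local/cluster/picrust2/activate.sh

# NSLOTS is set by SGE to the number of granted slots; fall back to 1 outside SGE.
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes "${NSLOTS:-1}"
EOF

# Parse the generated script without executing it.
bash -n picrust2_job.sh && echo "picrust2_job.sh: syntax OK"

# To submit (not run here): qsub picrust2_job.sh
```

Using `${NSLOTS:-1}` keeps the number of PICRUSt2 processes in step with the slots requested by the `-pe` line, so the two values do not have to be edited separately.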

Location and version


$ which picrust2_pipeline.py
/local/cluster/picrust2/bin/picrust2_pipeline.py
$ picrust2_pipeline.py --version
picrust2_pipeline.py 2.5.0
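
The same check can be scripted so that a job stops early when the environment has not been activated. This is a minimal sketch; the function name `require_on_path` is hypothetical and not part of PICRUSt2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on PATH.
require_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 (was activate.sh sourced?)" >&2
        return 1
    fi
}

# Example call; "|| true" keeps this demonstration from aborting when the tool is absent.
require_on_path picrust2_pipeline.py || true
```

In a real job script you would typically drop the `|| true` and let the non-zero return status end the job.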

Help message


$ picrust2_pipeline.py --help
usage: picrust2_pipeline.py [-h] -s PATH -i PATH -o PATH [-p PROCESSES] [-t epa-ng|sepp] [-r PATH]
                            [--in_traits IN_TRAITS] [--custom_trait_tables PATH] [--marker_gene_table PATH]
                            [--pathway_map MAP] [--reaction_func MAP] [--no_pathways] [--regroup_map ID_MAP]
                            [--no_regroup] [--stratified] [--max_nsti FLOAT] [--min_reads INT]
                            [--min_samples INT] [-m {mp,emp_prob,pic,scp,subtree_average}] [-e EDGE_EXPONENT]
                            [--min_align MIN_ALIGN] [--skip_nsti] [--skip_minpath] [--no_gap_fill] [--coverage]
                            [--per_sequence_contrib] [--wide_table] [--skip_norm] [--remove_intermediate]
                            [--verbose] [-v]

Wrapper for full PICRUSt2 pipeline. Run sequence placement with EPA-NG and GAPPA to place study sequences (i.e. OTUs and ASVs) into a reference tree. Then runs hidden-state prediction with the castor R package to predict genome for each study sequence. Metagenome profiles are then generated, which can be optionally stratified by the contributing sequence. Finally, pathway abundances are predicted based on metagenome profiles. By default, output files include predictions for Enzyme Commission (EC) numbers, KEGG Orthologs (KOs), and MetaCyc pathway abundances. However, this script enables users to use custom reference and trait tables to customize analyses.

optional arguments:
  -h, --help            show this help message and exit
  -s PATH, --study_fasta PATH
                        FASTA of unaligned study sequences (e.g. ASVs). The headerline should be only one field
                        (i.e. no additional whitespace-delimited fields).
  -i PATH, --input PATH
                        Input table of sequence abundances (BIOM, TSV or mothur shared file format).
  -o PATH, --output PATH
                        Output folder for final files.
  -p PROCESSES, --processes PROCESSES
                        Number of processes to run in parallel (default: 1).
  -t epa-ng|sepp, --placement_tool epa-ng|sepp
                        Placement tool to use when placing sequences into reference tree. One of "epa-ng" or
                        "sepp" must be input (default: epa-ng)
  -r PATH, --ref_dir PATH
                        Directory containing reference sequence files (default:
                        /local/cluster/picrust2/lib/python3.8/site-
                        packages/picrust2/default_files/prokaryotic/pro_ref). Please see the online documentation
                        for how to name the files in this directory.
  --in_traits IN_TRAITS
                        Comma-delimited list (with no spaces) of which gene families to predict from this set:
                        COG, EC, KO, PFAM, TIGRFAM. Note that EC numbers will always be predicted unless
                        --no_pathways is set (default: EC,KO).
  --custom_trait_tables PATH
                        Optional path to custom trait tables with gene families as columns and genomes as rows
                        (overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables
                        can be specified by delimiting filenames by commas. Importantly, the first custom table
                        specified will be used for inferring pathway abundances. Typically this command would be
                        used with a custom marker gene table (--marker_gene_table) as well.
  --marker_gene_table PATH
                        Path to marker gene copy number table (16S copy numbers by default).
  --pathway_map MAP     MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways
                        (default: /local/cluster/picrust2/lib/python3.8/site-
                        packages/picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
  --reaction_func MAP   Functional database to use as reactions for inferring pathway abundances (default: EC).
                        This should be either the short-form of the database as specified in --in_traits, or the
                        path to the file as would be specified for --custom_trait_tables. Note that when
                        functions besides the default EC numbers are used typically the --no_regroup option would
                        also be set.
  --no_pathways         Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by
                        default). Predicted EC number abundances are used to infer pathways when the default
                        reference files are used.
  --regroup_map ID_MAP  Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is
                        for regrouping EC numbers to MetaCyc reactions (default:
                        /local/cluster/picrust2/lib/python3.8/site-
                        packages/picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
  --no_regroup          Do not regroup input gene families to reactions as specified in the regrouping mapfile.
                        This option should only be used if you are using custom reference and/or mapping files.
  --stratified          Flag to indicate that stratified tables should be generated at all steps (will increase
                        run-time).
  --max_nsti FLOAT      Sequences with NSTI values above this value will be excluded (default: 2).
  --min_reads INT       Minimum number of reads across all samples for each input ASV. ASVs below this cut-off
                        will be counted as part of the "RARE" category in the stratified output (default: 1).
  --min_samples INT     Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-
                        off will be counted as part of the "RARE" category in the stratified output (default: 1).
  -m {mp,emp_prob,pic,scp,subtree_average}, --hsp_method {mp,emp_prob,pic,scp,subtree_average}
                        HSP method to use."mp": predict discrete traits using max parsimony. "emp_prob": predict
                        discrete traits based on empirical state probabilities across tips. "subtree_average":
                        predict continuous traits using subtree averaging. "pic": predict continuous traits with
                        phylogentic independent contrast. "scp": reconstruct continuous traits using squared-
                        change parsimony (default: mp).
  -e EDGE_EXPONENT, --edge_exponent EDGE_EXPONENT
                        Setting for maximum parisomony hidden-state prediction. Specifies weighting transition
                        costs by the inverse length of edge lengths. If 0, then edge lengths do not influence
                        predictions. Must be a non-negative real-valued number (default: 0.500000).
  --min_align MIN_ALIGN
                        Proportion of the total length of an input query sequence that must align with reference
                        sequences. Any sequences with lengths below this value after making an alignment with
                        reference sequences will be excluded from the placement and all subsequent steps.
                        (default: 0).
  --skip_nsti           Do not calculate nearest-sequenced taxon index (NSTI).
  --skip_minpath        Do not run MinPath to identify which pathways are present as a first pass (on by
                        default).
  --no_gap_fill         Do not perform gap filling before predicting pathway abundances (Gap filling is on
                        otherwise by default.
  --coverage            Calculate pathway coverages as well as abundances, which are experimental and only useful
                        for advanced users.
  --per_sequence_contrib
                        Flag to specify that MinPath is run on the genes contributed by each sequence (i.e. a
                        predicted genome) individually. Note this will greatly increase the runtime. The output
                        will be the predicted pathway abundance contributed by each individual sequence. This is
                        in contrast to the default stratified output, which is the contribution to the community-
                        wide pathway abundances. Pathway coverage stratified by contributing sequence will also
                        be output when --coverage is set (default: False).
  --wide_table          Output wide-format stratified table of metagenome and pathway predictions when "--
                        stratified" is set. This is the deprecated method of generating stratified tables since
                        it is extremely memory intensive. The stratified filenames contain "strat" rather than
                        "contrib" when this option is used.
  --skip_norm           Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S
                        rRNA genes). This step will be performed automatically unless this option is specified.
  --remove_intermediate
                        Remove the intermediate outfiles of the sequence placement and pathway inference steps.
  --verbose             Print out details as commands are run.
  -v, --version         show program's version number and exit

Run full default pipeline with 10 cores (only unstratified output):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10

Run full default pipeline with 10 cores with stratified output (including pathway stratified output based on per-sequence contributions):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10 --stratified --per_sequence_contrib
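
The help text for `-s` notes that each FASTA header line should contain only one field. A minimal sketch of one way to trim extra whitespace-delimited fields before running the pipeline is shown below; the function name `strip_fasta_header_fields` and the example file names are hypothetical, and the example sequences are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the first whitespace-delimited field of each FASTA header.
# Sequence lines (those not starting with ">") are passed through unchanged.
strip_fasta_header_fields() {
    sed -E '/^>/ s/[[:space:]].*$//' "$1"
}

# Small made-up input with extra header fields (space- and tab-separated).
printf '>ASV1 size=120\nACGT\n>ASV2\tsample=A\nGGCC\n' > example_study_seqs.fna

strip_fasta_header_fields example_study_seqs.fna > example_study_seqs.clean.fna
cat example_study_seqs.clean.fna
```

The sequence IDs that remain after trimming still need to match the IDs in the abundance table passed with `-i`, so check that before discarding the original file.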

software ref: https://huttenhower.sph.harvard.edu/picrust
software ref: https://github.com/picrust/picrust2/wiki
research ref: https://www.nature.com/articles/s41587-020-0548-6