# PICRUSt2 2.5.0

{{< admonition tip "Conda" true >}}
See the 'activating the conda environment' section below to access this software.
{{< /admonition >}}

## PICRUSt2 (**P**hylogenetic **I**nvestigation of **C**ommunities by **R**econstruction of **U**nobserved **St**ates)

PICRUSt2 is software for predicting functional abundances based only on marker gene sequences. Check out the paper [here](https://www.nature.com/articles/s41587-020-0548-6).

"Function" usually refers to gene families such as KEGG orthologs and Enzyme Commission numbers, but predictions can be made for any arbitrary trait. Similarly, predictions are typically based on 16S rRNA gene sequencing data, but other marker genes can also be used.

PICRUSt2 includes these and other improvements over the original version:

* Allows users to predict functions for any 16S sequences. Representative sequences from OTUs or amplicon sequence variants (e.g. DADA2 and deblur output) can be used as input by taking a sequence placement approach.
* Database of reference genomes used for prediction has been expanded by >10X.
* Addition of hidden-state prediction algorithms from the `castor` R package.
* Allows output of MetaCyc ontology predictions that will be comparable with common shotgun metagenomics outputs.
* Inference of pathway abundances now relies on MinPath, which makes these predictions more stringent.

### PICRUSt2 Flowchart

![PICRUSt2 flowchart](../images/PICRUSt2_flowchart.png)

### Citations

PICRUSt2 wraps a number of tools to generate functional predictions from amplicon sequences. **The PICRUSt2 paper can be found [here](https://www.nature.com/articles/s41587-020-0548-6). However, if you use PICRUSt2 you also need to cite the tools below**.

#### For phylogenetic placement of reads:

* **EPA-NG** ([paper](https://academic.oup.com/sysbio/advance-article/doi/10.1093/sysbio/syy054/5079844), [website](https://github.com/Pbdas/epa-ng)) - Default placement option.
* **gappa** ([paper](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa070/5722201), [website](https://github.com/lczech/gappa))
* **SEPP** ([paper](https://www.worldscientific.com/doi/abs/10.1142/9789814366496_0024), [website](https://github.com/smirarab/sepp)) - If alternative placement option used.

#### For hidden state prediction:

* **castor** ([paper](https://academic.oup.com/bioinformatics/article/34/6/1053/4582279), [website](https://cran.r-project.org/package=castor))

#### For pathway inference:

* **MinPath** ([paper](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000465), [website](http://omics.informatics.indiana.edu/MinPath/)) - A modified version of this tool from the HMP project is packaged with PICRUSt2. This tool was released under the GNU General Public License.

#### Frequently asked questions

[FAQs](https://github.com/picrust/picrust2/wiki/Frequently-Asked-Questions)

## Activating the conda environment

```console
bash
source /local/cluster/picrust2/activate.sh
```

To use in SGE, add the source line above to your shell script before the commands you would like to run.
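As a minimal sketch of the SGE usage described above, the snippet below writes a job script that sources the activation script before calling the pipeline. The script name `run_picrust2.sh`, the job name, and the input file names are placeholders chosen for illustration. The `#$` lines are generic SGE directives, and any queue or parallel environment settings your site requires are not shown here.

```shell
# Write a hypothetical SGE job script; file names and the job name are placeholders.
cat > run_picrust2.sh <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N picrust2_job

# Activate the PICRUSt2 conda environment before running any commands.
source /local/cluster/picrust2/activate.sh

# NSLOTS is set by SGE when a parallel environment is requested; fall back to 1.
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out \
    --processes "${NSLOTS:-1}"
EOF
```

The resulting script would typically be submitted with `qsub run_picrust2.sh`, adding whatever resource requests your cluster expects.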
## Location and version

```console
$ which picrust2_pipeline.py
/local/cluster/picrust2/bin/picrust2_pipeline.py
$ picrust2_pipeline.py --version
picrust2_pipeline.py 2.5.0
```

## Help message

```console
$ picrust2_pipeline.py --help
usage: picrust2_pipeline.py [-h] -s PATH -i PATH -o PATH [-p PROCESSES]
                            [-t epa-ng|sepp] [-r PATH] [--in_traits IN_TRAITS]
                            [--custom_trait_tables PATH]
                            [--marker_gene_table PATH] [--pathway_map MAP]
                            [--reaction_func MAP] [--no_pathways]
                            [--regroup_map ID_MAP] [--no_regroup]
                            [--stratified] [--max_nsti FLOAT]
                            [--min_reads INT] [--min_samples INT]
                            [-m {mp,emp_prob,pic,scp,subtree_average}]
                            [-e EDGE_EXPONENT] [--min_align MIN_ALIGN]
                            [--skip_nsti] [--skip_minpath] [--no_gap_fill]
                            [--coverage] [--per_sequence_contrib]
                            [--wide_table] [--skip_norm]
                            [--remove_intermediate] [--verbose] [-v]

Wrapper for full PICRUSt2 pipeline. Run sequence placement with EPA-NG and
GAPPA to place study sequences (i.e. OTUs and ASVs) into a reference tree.
Then runs hidden-state prediction with the castor R package to predict genome
for each study sequence. Metagenome profiles are then generated, which can be
optionally stratified by the contributing sequence. Finally, pathway
abundances are predicted based on metagenome profiles. By default, output
files include predictions for Enzyme Commission (EC) numbers, KEGG Orthologs
(KOs), and MetaCyc pathway abundances. However, this script enables users to
use custom reference and trait tables to customize analyses.

optional arguments:
  -h, --help            show this help message and exit
  -s PATH, --study_fasta PATH
                        FASTA of unaligned study sequences (e.g. ASVs). The
                        headerline should be only one field (i.e. no
                        additional whitespace-delimited fields).
  -i PATH, --input PATH
                        Input table of sequence abundances (BIOM, TSV or
                        mothur shared file format).
  -o PATH, --output PATH
                        Output folder for final files.
  -p PROCESSES, --processes PROCESSES
                        Number of processes to run in parallel (default: 1).
  -t epa-ng|sepp, --placement_tool epa-ng|sepp
                        Placement tool to use when placing sequences into
                        reference tree. One of "epa-ng" or "sepp" must be
                        input (default: epa-ng)
  -r PATH, --ref_dir PATH
                        Directory containing reference sequence files
                        (default:
                        /local/cluster/picrust2/lib/python3.8/site-packages/picrust2/default_files/prokaryotic/pro_ref).
                        Please see the online documentation for how to name
                        the files in this directory.
  --in_traits IN_TRAITS
                        Comma-delimited list (with no spaces) of which gene
                        families to predict from this set: COG, EC, KO, PFAM,
                        TIGRFAM. Note that EC numbers will always be predicted
                        unless --no_pathways is set (default: EC,KO).
  --custom_trait_tables PATH
                        Optional path to custom trait tables with gene
                        families as columns and genomes as rows (overrides
                        --in_traits setting) to be used for hidden-state
                        prediction. Multiple tables can be specified by
                        delimiting filenames by commas. Importantly, the first
                        custom table specified will be used for inferring
                        pathway abundances. Typically this command would be
                        used with a custom marker gene table
                        (--marker_gene_table) as well.
  --marker_gene_table PATH
                        Path to marker gene copy number table (16S copy
                        numbers by default).
  --pathway_map MAP     MinPath mapfile. The default mapfile maps MetaCyc
                        reactions to prokaryotic pathways (default:
                        /local/cluster/picrust2/lib/python3.8/site-packages/picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
  --reaction_func MAP   Functional database to use as reactions for inferring
                        pathway abundances (default: EC). This should be
                        either the short-form of the database as specified in
                        --in_traits, or the path to the file as would be
                        specified for --custom_trait_tables. Note that when
                        functions besides the default EC numbers are used
                        typically the --no_regroup option would also be set.
  --no_pathways         Flag to indicate that pathways should NOT be inferred
                        (otherwise they will be inferred by default).
                        Predicted EC number abundances are used to infer
                        pathways when the default reference files are used.
  --regroup_map ID_MAP  Mapfile of ids to regroup gene families to before
                        running MinPath. The default mapfile is for regrouping
                        EC numbers to MetaCyc reactions (default:
                        /local/cluster/picrust2/lib/python3.8/site-packages/picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
  --no_regroup          Do not regroup input gene families to reactions as
                        specified in the regrouping mapfile. This option
                        should only be used if you are using custom reference
                        and/or mapping files.
  --stratified          Flag to indicate that stratified tables should be
                        generated at all steps (will increase run-time).
  --max_nsti FLOAT      Sequences with NSTI values above this value will be
                        excluded (default: 2).
  --min_reads INT       Minimum number of reads across all samples for each
                        input ASV. ASVs below this cut-off will be counted as
                        part of the "RARE" category in the stratified output
                        (default: 1).
  --min_samples INT     Minimum number of samples that an ASV needs to be
                        identfied within. ASVs below this cut-off will be
                        counted as part of the "RARE" category in the
                        stratified output (default: 1).
  -m {mp,emp_prob,pic,scp,subtree_average}, --hsp_method {mp,emp_prob,pic,scp,subtree_average}
                        HSP method to use."mp": predict discrete traits using
                        max parsimony. "emp_prob": predict discrete traits
                        based on empirical state probabilities across tips.
                        "subtree_average": predict continuous traits using
                        subtree averaging. "pic": predict continuous traits
                        with phylogentic independent contrast. "scp":
                        reconstruct continuous traits using squared-change
                        parsimony (default: mp).
  -e EDGE_EXPONENT, --edge_exponent EDGE_EXPONENT
                        Setting for maximum parisomony hidden-state
                        prediction. Specifies weighting transition costs by
                        the inverse length of edge lengths. If 0, then edge
                        lengths do not influence predictions. Must be a
                        non-negative real-valued number (default: 0.500000).
  --min_align MIN_ALIGN
                        Proportion of the total length of an input query
                        sequence that must align with reference sequences. Any
                        sequences with lengths below this value after making
                        an alignment with reference sequences will be excluded
                        from the placement and all subsequent steps. (default:
                        0).
  --skip_nsti           Do not calculate nearest-sequenced taxon index (NSTI).
  --skip_minpath        Do not run MinPath to identify which pathways are
                        present as a first pass (on by default).
  --no_gap_fill         Do not perform gap filling before predicting pathway
                        abundances (Gap filling is on otherwise by default.
  --coverage            Calculate pathway coverages as well as abundances,
                        which are experimental and only useful for advanced
                        users.
  --per_sequence_contrib
                        Flag to specify that MinPath is run on the genes
                        contributed by each sequence (i.e. a predicted genome)
                        individually. Note this will greatly increase the
                        runtime. The output will be the predicted pathway
                        abundance contributed by each individual sequence.
                        This is in contrast to the default stratified output,
                        which is the contribution to the community-wide
                        pathway abundances. Pathway coverage stratified by
                        contributing sequence will also be output when
                        --coverage is set (default: False).
  --wide_table          Output wide-format stratified table of metagenome and
                        pathway predictions when "--stratified" is set. This
                        is the deprecated method of generating stratified
                        tables since it is extremely memory intensive. The
                        stratified filenames contain "strat" rather than
                        "contrib" when this option is used.
  --skip_norm           Skip normalizing sequence abundances by predicted
                        marker gene copy numbers (typically 16S rRNA genes).
                        This step will be performed automatically unless this
                        option is specified.
  --remove_intermediate
                        Remove the intermediate outfiles of the sequence
                        placement and pathway inference steps.
  --verbose             Print out details as commands are run.
  -v, --version         show program's version number and exit

Run full default pipeline with 10 cores (only unstratified output):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10

Run full default pipeline with 10 cores with stratified output (including pathway stratified output based on per-sequence contributions):
picrust2_pipeline.py -s study_seqs.fna -i seqabun.biom -o picrust2_out --processes 10 --stratified --per_sequence_contrib
```
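
The help text above notes that each header line in the FASTA passed to `-s` should contain a single field. The following is a minimal sketch of one way to check for, and trim, extra header fields with `awk`. The file name `study_seqs_raw.fna` and the example sequences are made up for illustration, and this step is not part of PICRUSt2 itself.

```shell
# Create a small example FASTA (made-up sequences) with one multi-field header.
cat > study_seqs_raw.fna <<'EOF'
>ASV1 size=120
ACGTACGTAC
>ASV2
TTGACCGTAA
EOF

# Count header lines that contain more than one whitespace-delimited field.
n_bad=$(awk '/^>/ && NF > 1 { n++ } END { print n + 0 }' study_seqs_raw.fna)
echo "Headers with extra fields: ${n_bad}"

# Keep only the first field of each header; sequence lines pass through unchanged.
awk '/^>/ { print $1; next } { print }' study_seqs_raw.fna > study_seqs.fna
```

If headers are trimmed this way, the sequence IDs in the abundance table given to `-i` presumably need to match the trimmed IDs, so it is worth comparing the two before running the pipeline.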