# panX 1.5.1

## panX: microbial pan-genome analysis and exploration

Overview:

panX is a software package for microbial pan-genome analysis, visualization and exploration. The analysis pipeline is based on DIAMOND, MCL and phylogeny-aware post-processing. It takes a set of annotated bacterial strains as input (e.g. NCBI RefSeq records or the user's own data in GenBank format). All genes from all strains are compared to each other via DIAMOND and then clustered into orthologous groups using MCL and adaptive phylogenetic post-processing, which splits distantly related genes and paralogs if necessary. For each gene cluster, the corresponding alignment and phylogeny are constructed. All core gene SNPs are then used to build the strain/species phylogeny.

The results can be explored interactively in a web-based visualization application (either hosted on a web server or run locally on a desktop). The web application integrates several interconnected components (pan-genome statistical charts, gene cluster table, alignment, comparative phylogenies, metadata table) and allows rapid searching and filtering of gene clusters by gene name, annotation, duplication, diversity, gene gain/loss events, etc. Strain-specific metadata are integrated into the strain phylogeny so that genes related to adaptation, antibiotic resistance and virulence can be readily identified.

Location and version:

```console
$ which panX.py
/local/cluster/bin/panX.py
$ panX.py --version
panX analysis v1.5.1
```

Help message:

```console
$ panX.py --help
usage: ./panX.py -h (help)

panX: Software for computing core and pan-genome from a set of genome
sequences. The results will be exported as json files for visualization in
the browser.

optional arguments:
  -h, --help            show this help message and exit
  -fn , --folder_name   the absolute path for project folder
  -sl , --species_name  species name as prefix for some temporary folders
                        (e.g.: P_aeruginosa)
  -ngbk, --gbk_present  use nucleotide/amino acid sequence files (fna/faa)
                        when no genBank files given (this option does not
                        consider annotations)
  -st [ ...], --steps [ ...]
                        run specific steps or run all steps by default
  -mo, --metainfo_organism
                        add organism information in metadata table.
  -mr, --metainfo_reconcile
                        use reconciled metadata (redundancy removed) instead
                        of original metadata.
  -rt , --raxml_max_time
                        RAxML tree optimization: maximal runing time
                        (minutes, default:30min)
  -t , --threads        number of threads
  -v, --version         show program's version number and exit
  -bp , --blast_file_path
                        the absolute path for blast result (e.g.:
                        /path/blast.out)
  -rp , --roary_file_path
                        the absolute path for roary result (e.g.:
                        /path/roary.out)
  -op , --orthofinder_file_path
                        the absolute path for orthofinder result (e.g.:
                        /path/orthofinder.out)
  -otp , --other_tool_fpath
                        the absolute path for result from other orthology
                        inference tool (e.g.: /path/other_tool.out)
  -mi , --metainfo_fpath
                        the absolute path for meta_information file (e.g.:
                        /path/meta.out)
  -dmp , --diamond_path
                        alternative diamond path provided by user
  -dme , --diamond_evalue
                        default: e-value threshold below 0.001
  -dmt , --diamond_max_target_seqs
                        Diamond: maximum number of target sequences per
                        query Estimation: #strain * #max_duplication
                        (50*10=500)
  -dmi , --diamond_identity
                        Diamond: sequence identity threshold to report an
                        alignment. Default: no restriction (0)
  -dmqc , --diamond_query_cover
                        Diamond: query sequence coverage threshold to report
                        an alignment. Default: no restriction (0)
  -dmsc , --diamond_subject_cover
                        Diamond: subject sequence coverage threshold to
                        report an alignment.
                        Default: no restriction (0)
  -dmdc, --diamond_divide_conquer
                        running diamond alignment in divide-and-conquer(DC)
                        algorithm for large dataset
  -dcs , --subset_size  subset_size (number of strains in a subset) for
                        divide-and-conquer(DC) algorithm. Default:50
  -dmsi , --diamond_identity_subproblem
                        Diamond divide-and-conquer subproblem: sequence
                        identity threshold to report an alignment.
  -dmsqc , --diamond_query_cover_subproblem
                        Diamond divide-and-conquer subproblem: query
                        sequence coverage threshold to report an alignment
  -dmssc , --diamond_subject_cover_subproblem
                        Diamond divide-and-conquer subproblem: subject
                        sequence coverage threshold to report an alignment
  -imcl , --mcl_inflation
                        MCL: inflation parameter (this parameter affects
                        granularity)
  -bmt , --blastn_RNA_max_target_seqs
                        Blastn on RNAs: the maximum number of target
                        sequences per query Estimation: #strain *
                        #max_duplication
  -np, --disable_cluster_postprocessing
                        disable postprocessing (split overclustered genes
                        and paralogs, and cluster unclustered genes)
  -nsl, --disable_long_branch_splitting
                        disable splitting long branch
  -rna, --enable_RNA_clustering
                        cluster rRNAs
  -fcd , --factor_core_diversity
                        default: factor used to refine raw core genome
                        diversity, apply
                        (0.1+2.0*core_diversity)/(1+2.0*core_diversity) to
                        decide split_long_branch_cutoff
  -slb , --split_long_branch_cutoff
                        split long branch cutoff provided by user (by
                        default: 0.0 as not given):
  -pep, --explore_paralog_plot
                        default: not plot paralog statistics
  -pfc , --paralog_frac_cutoff
                        fraction of strains required for splitting paralogy.
                        Default: 0.33
  -pbc , --paralog_branch_cutoff
                        branch_length cutoff used in paralogy splitting
  -ws , --window_size_smoothed
                        postprocess_unclustered_genes: window size for
                        smoothed cluster length distribution
  -spr , --strain_proportion
                        postprocess_unclustered_genes: strain proportion
  -ss , --sigma_scale   postprocess_unclustered_genes: sigma scale
  -cg , --core_genome_threshold
                        percentage of strains used to decide whether a gene
                        is core. Default: 1.0 for strictly core gene; < 1.0
                        for soft core genes
  -csf , --core_gene_strain_fpath
                        file path for user-provided subset of strains (core
                        genes should be present in all strains in this list)
  -sitr, --simple_tree  simple tree: does not use treetime for ancestral
                        inference
  -dgl, --disable_gain_loss
                        disable enable gene gain and loss inference (not
                        recommended)
  -mglo, --merged_gain_loss_output
                        not split gene presence/absence and gain/loss
                        pattern into separate files for each cluster
  -iba, --infer_branch_association
                        infer branch association
  -bamin , --min_strain_fraction_branch_association
                        minimal fraction of the total number of strains for
                        branch association
  -pamin , --min_strain_fraction_presence_association
                        minimal fraction of the total number of strains for
                        presence/absence association
  -pamax , --max_strain_fraction_presence_association
                        maximal fraction of the total number of strains for
                        presence/absence association
  -slt, --store_locus_tag
                        store locus_tags in a separate file instead of
                        saving locus_tags in gene cluster json for large
                        dataset
  -rlt, --raw_locus_tag
                        use raw locus_tag from GenBank instead of strain_ID
                        + locus_tag
  -otc, --optional_table_column
                        add customized column in gene cluster json file for
                        visualization.
  -mtf , --meta_data_config
                        file path for pre-defined metadata structure
                        (discrete/continuous data type, etc.)
  -rxm , --raxml_path   absolute path of raxml
  -ct, --clean_temporary_files
                        default: keep temporary files
```

software ref:

research ref:
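
Example (illustrative sketch):

The help text quotes two small calculations: the estimate for `--diamond_max_target_seqs` (`#strain * #max_duplication`, e.g. `50*10=500`) and the expression `(0.1+2.0*core_diversity)/(1+2.0*core_diversity)` used when deciding `split_long_branch_cutoff`. The Python sketch below evaluates both and assembles a command line from flags listed in the help message. The function names, the project path and the thread count are hypothetical and chosen only for illustration. The sketch evaluates the quoted expression as written; how panX combines it with the `-fcd` factor internally is not stated in the help text and is not modelled here.

```python
import shlex


def split_long_branch_expression(core_diversity: float) -> float:
    """Evaluate the expression quoted in the -fcd help text.

    Assumption: this is only the literal formula from the help message,
    not panX's full internal cutoff logic.
    """
    return (0.1 + 2.0 * core_diversity) / (1 + 2.0 * core_diversity)


def estimate_max_target_seqs(n_strains: int, max_duplication: int = 10) -> int:
    """Estimate for -dmt as described in the help: #strain * #max_duplication."""
    return n_strains * max_duplication


def build_panx_command(folder: str, species: str, threads: int, n_strains: int) -> list:
    """Assemble an argument list using only flags shown in the help message."""
    return [
        "panX.py",
        "-fn", folder,    # absolute path of the project folder
        "-sl", species,   # species name used as a prefix for temporary folders
        "-t", str(threads),
        "-dmt", str(estimate_max_target_seqs(n_strains)),
    ]


# Hypothetical project: 50 strains, 8 threads, placeholder path.
command = build_panx_command("/path/to/project", "P_aeruginosa", 8, 50)
print(shlex.join(command))
print(split_long_branch_expression(0.05))
```

The command is only printed, not executed, so the sketch can be run on a machine without panX installed; pass the list to `subprocess.run` once the paths point at a real project folder.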