panX 1.5.1

2021-09-16

panX: microbial pan-genome analysis and exploration

Overview: panX is a software package for microbial pan-genome analysis, visualization and exploration. The analysis pipeline is based on DIAMOND, MCL and phylogeny-aware post-processing. It takes a set of annotated bacterial strains as input (e.g. NCBI RefSeq records or the user's own data in GenBank format). All genes from all strains are compared to each other with DIAMOND and then clustered into orthologous groups using MCL and adaptive phylogenetic post-processing, which splits distantly related genes and paralogs where necessary. For each gene cluster, an alignment and a phylogeny are constructed. The SNPs from all core genes are then used to build the strain/species phylogeny.

The results can be explored interactively in a web-based visualization application, either hosted on a web server or run locally on a desktop. The web application integrates several interconnected components (pan-genome statistical charts, gene cluster table, alignment, comparative phylogenies, metadata table) and supports rapid searching and filtering of gene clusters by gene name, annotation, duplication, diversity, gene gain/loss events, etc. Strain-specific metadata are integrated into the strain phylogeny so that genes related to adaptation, antibiotic resistance or virulence can be readily identified.
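As a rough illustration of how a run on a folder of GenBank files might be launched, the sketch below assembles a command line and prints it instead of executing it. It uses only flags listed by `panX.py --help` (`-fn`, `-sl`, `-t`, `-dmt`). The project path, the thread count and the `*.gbk` file naming are assumptions made for this example, and the `-dmt` value follows the "#strain * #max_duplication" estimate from the help text.

```shell
# Dry-run sketch: build a panX command line for a project folder.
# Assumptions (not stated on this page): the folder path, the thread count,
# and that the input GenBank files in the folder are named *.gbk.
project_dir="/path/to/project"   # hypothetical absolute path
species="P_aeruginosa"           # prefix example taken from the help text
threads=8                        # hypothetical thread count
max_dup=10                       # duplication guess used in the -dmt help text

# Count the input strains; stays 0 if the folder is missing or empty.
n_strains=0
for f in "$project_dir"/*.gbk; do
  if [ -e "$f" ]; then
    n_strains=$(( n_strains + 1 ))
  fi
done

# -dmt estimate: #strain * #max_duplication, falling back to the 50*10 example.
if [ "$n_strains" -gt 0 ]; then
  dmt=$(( n_strains * max_dup ))
else
  dmt=500
fi

cmd="panX.py -fn $project_dir -sl $species -t $threads -dmt $dmt"

# Print the command rather than running it.
echo "$cmd"
```

Once the printed command looks right for the dataset, it can be pasted into the shell or a job script. Paths containing spaces would need quoting, which this sketch does not handle.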

Location and version:

$ which panX.py
/local/cluster/bin/panX.py
$ panX.py --version
panX analysis v1.5.1

Help message:

$ panX.py --help
usage: ./panX.py -h (help)

panX: Software for computing core and pan-genome from a set of genome
sequences. The results will be exported as json files for visualization in the
browser.

optional arguments:
  -h, --help            show this help message and exit
  -fn , --folder_name   the absolute path for project folder
  -sl , --species_name
                        species name as prefix for some temporary folders
                        (e.g.: P_aeruginosa)
  -ngbk, --gbk_present  use nucleotide/amino acid sequence files (fna/faa)
                        when no genBank files given (this option does not
                        consider annotations)
  -st  [ ...], --steps  [ ...]
                        run specific steps or run all steps by default
  -mo, --metainfo_organism
                        add organism information in metadata table.
  -mr, --metainfo_reconcile
                        use reconciled metadata (redundancy removed) instead
                        of original metadata.
  -rt , --raxml_max_time
                        RAxML tree optimization: maximal runing time (minutes,
                        default:30min)
  -t , --threads        number of threads
  -v, --version         show program's version number and exit
  -bp , --blast_file_path
                        the absolute path for blast result (e.g.:
                        /path/blast.out)
  -rp , --roary_file_path
                        the absolute path for roary result (e.g.:
                        /path/roary.out)
  -op , --orthofinder_file_path
                        the absolute path for orthofinder result (e.g.:
                        /path/orthofinder.out)
  -otp , --other_tool_fpath
                        the absolute path for result from other orthology
                        inference tool (e.g.: /path/other_tool.out)
  -mi , --metainfo_fpath
                        the absolute path for meta_information file (e.g.:
                        /path/meta.out)
  -dmp , --diamond_path
                        alternative diamond path provided by user
  -dme , --diamond_evalue
                        default: e-value threshold below 0.001
  -dmt , --diamond_max_target_seqs
                        Diamond: maximum number of target sequences per query
                        Estimation: #strain * #max_duplication (50*10=500)
  -dmi , --diamond_identity
                        Diamond: sequence identity threshold to report an
                        alignment. Default: no restriction (0)
  -dmqc , --diamond_query_cover
                        Diamond: query sequence coverage threshold to report
                        an alignment. Default: no restriction (0)
  -dmsc , --diamond_subject_cover
                        Diamond: subject sequence coverage threshold to report
                        an alignment. Default: no restriction (0)
  -dmdc, --diamond_divide_conquer
                        running diamond alignment in divide-and-conquer(DC)
                        algorithm for large dataset
  -dcs , --subset_size
                        subset_size (number of strains in a subset) for
                        divide-and-conquer(DC) algorithm. Default:50
  -dmsi , --diamond_identity_subproblem
                        Diamond divide-and-conquer subproblem: sequence
                        identity threshold to report an alignment.
  -dmsqc , --diamond_query_cover_subproblem
                        Diamond divide-and-conquer subproblem: query sequence
                        coverage threshold to report an alignment
  -dmssc , --diamond_subject_cover_subproblem
                        Diamond divide-and-conquer subproblem: subject
                        sequence coverage threshold to report an alignment
  -imcl , --mcl_inflation
                        MCL: inflation parameter (this parameter affects
                        granularity)
  -bmt , --blastn_RNA_max_target_seqs
                        Blastn on RNAs: the maximum number of target sequences
                        per query Estimation: #strain * #max_duplication
  -np, --disable_cluster_postprocessing
                        disable postprocessing (split overclustered genes and
                        paralogs, and cluster unclustered genes)
  -nsl, --disable_long_branch_splitting
                        disable splitting long branch
  -rna, --enable_RNA_clustering
                        cluster rRNAs
  -fcd , --factor_core_diversity
                        default: factor used to refine raw core genome
                        diversity, apply
                        (0.1+2.0*core_diversity)/(1+2.0*core_diversity) to
                        decide split_long_branch_cutoff
  -slb , --split_long_branch_cutoff
                        split long branch cutoff provided by user (by default:
                        0.0 as not given):
  -pep, --explore_paralog_plot
                        default: not plot paralog statistics
  -pfc , --paralog_frac_cutoff
                        fraction of strains required for splitting paralogy.
                        Default: 0.33
  -pbc , --paralog_branch_cutoff
                        branch_length cutoff used in paralogy splitting
  -ws , --window_size_smoothed
                        postprocess_unclustered_genes: window size for
                        smoothed cluster length distribution
  -spr , --strain_proportion
                        postprocess_unclustered_genes: strain proportion
  -ss , --sigma_scale   postprocess_unclustered_genes: sigma scale
  -cg , --core_genome_threshold
                        percentage of strains used to decide whether a gene is
                        core. Default: 1.0 for strictly core gene; < 1.0 for
                        soft core genes
  -csf , --core_gene_strain_fpath
                        file path for user-provided subset of strains (core
                        genes should be present in all strains in this list)
  -sitr, --simple_tree  simple tree: does not use treetime for ancestral
                        inference
  -dgl, --disable_gain_loss
                        disable enable gene gain and loss inference (not
                        recommended)
  -mglo, --merged_gain_loss_output
                        not split gene presence/absence and gain/loss pattern
                        into separate files for each cluster
  -iba, --infer_branch_association
                        infer branch association
  -bamin , --min_strain_fraction_branch_association
                        minimal fraction of the total number of strains for
                        branch association
  -pamin , --min_strain_fraction_presence_association
                        minimal fraction of the total number of strains for
                        presence/absence association
  -pamax , --max_strain_fraction_presence_association
                        maximal fraction of the total number of strains for
                        presence/absence association
  -slt, --store_locus_tag
                        store locus_tags in a separate file instead of saving
                        locus_tags in gene cluster json for large dataset
  -rlt, --raw_locus_tag
                        use raw locus_tag from GenBank instead of strain_ID +
                        locus_tag
  -otc, --optional_table_column
                        add customized column in gene cluster json file for
                        visualization.
  -mtf , --meta_data_config
                        file path for pre-defined metadata structure
                        (discrete/continuous data type, etc.)
  -rxm , --raxml_path   absolute path of raxml
  -ct, --clean_temporary_files
                        default: keep temporary files
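
The `-fcd` help text gives the expression `(0.1+2.0*core_diversity)/(1+2.0*core_diversity)` for deriving `split_long_branch_cutoff`. To get a feel for how it behaves, the sketch below evaluates it for a few diversity values. Reading the `2.0` as the `-fcd` factor is an assumption, the function name `split_cutoff` is hypothetical, and the diversity values are illustrative rather than taken from a dataset.

```shell
# Evaluate the cutoff expression from the -fcd help text.
# Assumption: the 2.0 in the expression is the -fcd factor.
split_cutoff() {
  # $1 = core genome diversity, $2 = factor (defaults to 2.0)
  awk -v d="$1" -v f="${2:-2.0}" \
    'BEGIN { printf "%.4f\n", (0.1 + f * d) / (1 + f * d) }'
}

# Illustrative diversity values only.
for d in 0.001 0.01 0.1; do
  printf 'core_diversity=%s -> cutoff=%s\n' "$d" "$(split_cutoff "$d")"
done
# core_diversity=0.001 -> cutoff=0.1018
# core_diversity=0.01 -> cutoff=0.1176
# core_diversity=0.1 -> cutoff=0.2500
```

Under this reading, the cutoff starts at 0.1 for nearly identical strains and rises towards 1 as core genome diversity grows. A larger factor makes it rise faster.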

software ref: https://github.com/neherlab/pan-genome-analysis
research ref: https://doi.org/10.1093/nar/gkx977