# GetOrganelle 1.7.5.0


## GetOrganelle

This toolkit assemblies organelle genome from genomic skimming data.

It achieved the best performance overall both on simulated and real data and
was recommended as the default for chloroplast genome assembly in a
third-party comparison paper ([Freudenthal et al. 2020. Genome
Biology](https://doi.org/10.1186/s13059-020-02153-6)).

## Conda env activation

You must activate the conda env to use this software.

To activate:

```console
bash
source /local/cluster/getorganelle/activate.sh
```

Location and version:

```console
$ which get_organelle_from_reads.py
/local/cluster/getorganelle/bin/get_organelle_from_reads.py
$ get_organelle_from_reads.py --version
GetOrganelle v1.7.5
$ echo $GETORG_PATH
/nfs1/CGRB/databases/GetOrganelle/latest
```

help message:

```console
$ get_organelle_from_reads.py --help
usage:
###  Embryophyta plant plastome, 2*(1G raw data, 150 bp) reads
get_organelle_from_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_seed.fasta -o plastome_output  -R 15 -k21,45,65,85,105 -F embplant_pt
###  Embryophyta plant mitogenome
get_organelle_from_reads.py -1 sample_1.fq -2 sample_2.fq -s mt_seed.fasta -o mitogenome_output  -R 30 -k 21,45,65,85,105 -F embplant_mt

GetOrganelle v1.7.5 get_organelle_from_reads.py assembles organelle genomes from genome skimming data.
Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

optional arguments:
  -v, --version         show program's version number and exit
  -h                    print brief introduction for frequently-used options.
  --help                print verbose introduction for all options.

IN-OUT OPTIONS:
  Options on inputs and outputs

  -1 FQ_FILE_1          Input file with forward paired-end reads (format:
                        fastq/fastq.gz/fastq.tar.gz).
  -2 FQ_FILE_2          Input file with reverse paired-end reads (format:
                        fastq/fastq.gz/fastq.tar.gz).
  -u UNPAIRED_FQ_FILES  Input file(s) with unpaired (single-end) reads (format:
                        fastq/fastq.gz/fastq.tar.gz). files could be comma-separated lists such as
                        'seq1.fq,seq2.fq'.
  -o OUTPUT_BASE        Output directory. Overwriting files if directory exists.
  -s SEED_FILE          Seed sequence(s). Input fasta format file as initial seed. A seed sequence in
                        GetOrganelle is only used for identifying initial organelle reads. The
                        assembly process is purely de novo. Should be a list of files split by
                        comma(s) on a multi-organelle mode, with the same list length to
                        organelle_type (followed by '-F'). Default:
                        '/nfs1/CGRB/databases/GetOrganelle/latest/SeedDatabase/*.fasta' (* depends on
                        the value followed with flag '-F')
  -a ANTI_SEED          Anti-seed(s). Not suggested unless what you really know what you are doing.
                        Input fasta format file as anti-seed, where the extension process stop.
                        Typically serves as excluding plastid reads when extending mitochondrial
                        reads, or the other way around. You should be cautious about using this
                        option, because if the anti-seed includes some word in the target but not in
                        the seed, the result would have gaps. For example, use the embplant_mt and
                        embplant_pt from the same plant-species as seed and anti-seed.
  --max-reads MAXIMUM_N_READS
                        Hard bound for maximum number of reads to be used per file. A input larger
                        than 536870911 will be treated as infinity (INF). Default: 1.5E7 (-F
                        embplant_pt/embplant_nr/fungus_mt/fungus_nr); 7.5E7 (-F
                        embplant_mt/other_pt/anonym); 3E8 (-F animal_mt)
  --reduce-reads-for-coverage REDUCE_READS_FOR_COV
                        Soft bound for maximum number of reads to be used according to target-hitting
                        base coverage. If the estimated target-hitting base coverage is too high and
                        over this VALUE, GetOrganelle automatically reduce the number of reads to
                        generate a final assembly with base coverage close to this VALUE. This design
                        could greatly save computational resources in many situations. A mean base
                        coverage over 500 is extremely sufficient for most cases. This VALUE must be
                        larger than 10. Set this VALUE to inf to disable reducing. Default: 500.
  --max-ignore-percent MAXIMUM_IGNORE_PERCENT
                        The maximum percent of bases to be ignore in extension, due to low quality.
                        Default: 0.01
  --phred-offset PHRED_OFFSET
                        Phred offset for spades-hammer. Default: GetOrganelle-autodetect
  --min-quality-score MIN_QUALITY_SCORE
                        Minimum quality score in extension. This value would be automatically
                        decreased to prevent ignoring too much raw data (see --max-ignore-
                        percent).Default: 1 ('"' in Phred+33; 'A' in Phred+64/Solexa+64)
  --prefix PREFIX       Add extra prefix to resulting files under the output directory.
  --out-per-round       Enable output per round. Choose to save memory but cost more time per round.
  --zip-files           Choose to compress fq/sam files using gzip.
  --keep-temp           Choose to keep the running temp/index files.
  --config-dir GET_ORGANELLE_PATH
                        The directory where the configuration file and default databases were placed.
                        The default value also can be changed by adding 'export
                        GETORG_PATH=your_favor' to the shell script (e.g. ~/.bash_profile or
                        ~/.bashrc) Default: /nfs1/CGRB/databases/GetOrganelle/latest

SCHEME OPTIONS:
  Options on running schemes.

  -F ORGANELLE_TYPE     This flag should be followed with embplant_pt (embryophyta plant plastome),
                        other_pt (non-embryophyta plant plastome), embplant_mt (plant mitogenome),
                        embplant_nr (plant nuclear ribosomal RNA), animal_mt (animal mitogenome),
                        fungus_mt (fungus mitogenome), fungus_nr (fungus nuclear ribosomal RNA)or
                        embplant_mt,other_pt,fungus_mt (the combination of any of above organelle
                        genomes split by comma(s), which might be computationally more intensive than
                        separate runs), or anonym (uncertain organelle genome type). The anonym should
                        be used with customized seed and label databases ('-s' and '--genes'). For
                        easy usage and compatibility of old versions, following redirection would be
                        automatically fulfilled without warning: plant_cp->embplant_pt;
                        plant_pt->embplant_pt; plant_mt->embplant_mt; plant_nr->embplant_nr
  --fast                ="-R 10 -t 4 -J 5 -M 7 --max-n-words 3E7 --larger-auto-ws --disentangle-time-
                        limit 360" This option is suggested for homogeneously and highly covered data
                        (very fine data). You can overwrite the value of a specific option listed
                        above by adding that option along with the "--fast" flag. You could try
                        GetOrganelle with this option for a list of samples and run a second time
                        without this option for the rest with incomplete results.
  --memory-save         ="--out-per-round -P 0 --remove-duplicates 0" You can overwrite the value of a
                        specific option listed above by adding that option along with the "--memory-
                        save" flag. A larger '-R' value is suggested when "--memory-save" is chosen.
  --memory-unlimited    ="-P 1E7 --index-in-memory --remove-duplicates 2E8 --min-quality-score -5
                        --max-ignore-percent 0" You can overwrite the value of a specific option
                        listed above by adding that option along with the "--memory-unlimited" flag.

EXTENDING OPTIONS:
  Options on the performance of extending process

  -w WORD_SIZE          Word size (W) for pre-grouping (if not assigned by '--pre-w') and extending
                        process. This script would try to guess (auto-estimate) a proper W using an
                        empirical function based on average read length, reads quality, target genome
                        coverage, and other variables that might influence the extending process. You
                        could assign the ratio (1>input>0) of W to read_length, based on which this
                        script would estimate the W for you; or assign an absolute W value (read
                        length>input>=35). Default: auto-estimated.
  --pre-w PREGROUP_WORD_SIZE
                        Word size (W) for pre-grouping. Used to reproduce result when word size is a
                        certain value during pregrouping process and later changed during reads
                        extending process. Similar to word size. Default: the same to word size.
  -R MAX_ROUNDS, --max-rounds MAX_ROUNDS
                        Maximum number of extending rounds (suggested: >=2). Default: 15 (-F
                        embplant_pt), 30 (-F embplant_mt/other_pt), 10 (-F
                        embplant_nr/animal_mt/fungus_mt/fungus_nr), inf (-P 0).
  --max-n-words MAXIMUM_N_WORDS
                        Maximum number of words to be used in total.Default: 4E8 (-F embplant_pt), 2E8
                        (-F embplant_nr/fungus_mt/fungus_nr/animal_mt), 2E9 (-F embplant_mt/other_pt)
  -J JUMP_STEP          The length of step for checking words in reads during extending process
                        (integer >= 1). When you have reads of high quality, the larger the number is,
                        the faster the extension will be, the more risk of missing reads in low
                        coverage area. Choose 1 to choose the slowest but safest extension strategy.
                        Default: 3
  -M MESH_SIZE          (Beta parameter) The length of step for building words from seeds during
                        extending process (integer >= 1). When you have reads of high quality, the
                        larger the number is, the faster the extension will be, the more risk of
                        missing reads in low coverage area. Another usage of this mesh size is to
                        choose a larger mesh size coupled with a smaller word size, which makes
                        smaller word size feasible when memory is limited.Choose 1 to choose the
                        slowest but safest extension strategy. Default: 2
  --bowtie2-options BOWTIE2_OPTIONS
                        Bowtie2 options, such as '--ma 3 --mp 5,2 --very-fast -t'. Default: --very-
                        fast -t.
  --larger-auto-ws      By using this flag, the empirical function for estimating W would tend to
                        produce a relative larger W, which would speed up the matching in extending,
                        reduce the memory cost in extending, but increase the risk of broken final
                        graph. Suggested when the data is good with high and homogenous coverage.
  --target-genome-size TARGET_GENOME_SIZE
                        Hypothetical value(s) of target genome size. This is only used for estimating
                        word size when no '-w word_size' is given. Should be a list of INTEGER numbers
                        split by comma(s) on a multi-organelle mode, with the same list length to
                        organelle_type (followed by '-F'). Default: 130000 (-F embplant_pt) or 390000
                        (-F embplant_mt) or 13000 (-F embplant_nr) or 39000 (-F other_pt) or 13000 (-F
                        animal_mt) or 65000 (-F fungus_mt) or 13000 (-F fungus_nr) or
                        39000,390000,65000 (-F other_pt,embplant_mt,fungus_mt)
  --max-extending-len MAX_EXTENDING_LEN
                        Maximum extending length(s) derived from the seed(s). A single value could be
                        a non-negative number, or inf (infinite) or auto (automatic estimation). This
                        is designed for properly stopping the extending from getting too long and
                        saving computational resources. However, empirically, a maximum extending
                        length value larger than 6000 would not be helpful for saving computational
                        resources. This value would not be precise in controlling output size,
                        especially when pre-group (followed by '-P') is turn on.In the auto mode, the
                        maximum extending length is estimated based on the sizes of the gap regions
                        that not covered in the seed sequences. A sequence of a closely related
                        species would be preferred for estimating a better maximum extending length
                        value. If you are using limited loci, e.g. rbcL gene as the seed for
                        assembling the whole plastome (with extending length ca. 75000 >> 6000), you
                        should set maximum extending length to inf. Should be a list of
                        numbers/auto/inf split by comma(s) on a multi-organelle mode, with the same
                        list length to organelle_type (followed by '-F'). Default: inf.

ASSEMBLY OPTIONS:
  These options are about the assembly and graph disentangling

  -k SPADES_KMER        SPAdes kmer settings. Use the same format as in SPAdes. illegal kmer values
                        would be automatically discarded by GetOrganelle. Default: 21,55,85,115
  --spades-options OTHER_SPADES_OPTIONS
                        Other SPAdes options. Use double quotation marks to include all the arguments
                        and parameters.
  --no-spades           Disable SPAdes.
  --ignore-k IGNORE_KMER_RES
                        A kmer threshold below which, no slimming/disentangling would be executed on
                        the result. Default: 40
  --genes GENES_FASTA   Followed with a customized database (a fasta file or the base name of a blast
                        database) containing or made of ONE set of protein coding genes and ribosomal
                        RNAs extracted from ONE reference genome that you want to assemble. Should be
                        a list of databases split by comma(s) on a multi-organelle mode, with the same
                        list length to organelle_type (followed by '-F'). This is optional for any
                        organelle mentioned in '-F' but required for 'anonym'. By default, certain
                        database(s) in /nfs1/CGRB/databases/GetOrganelle/latest/LabelDatabase would be
                        used contingent on the organelle types chosen (-F). The default value become
                        invalid when '--genes' or '--ex-genes' is used.
  --ex-genes EXCLUDE_GENES
                        This is optional and Not suggested, since non-target contigs could contribute
                        information for better downstream coverage-based clustering. Followed with a
                        customized database (a fasta file or the base name of a blast database)
                        containing or made of protein coding genes and ribosomal RNAs extracted from
                        reference genome(s) that you want to exclude. Could be a list of databases
                        split by comma(s) but NOT required to have the same list length to
                        organelle_type (followed by '-F'). The default value will become invalid when
                        '--genes' or '--ex-genes' is used.
  --disentangle-df DISENTANGLE_DEPTH_FACTOR
                        Depth factor for differentiate genome type of contigs. The genome type of
                        contigs are determined by blast. Default: 10.0
  --contamination-depth CONTAMINATION_DEPTH
                        Depth factor for confirming contamination in parallel contigs. Default: 3.0
  --contamination-similarity CONTAMINATION_SIMILARITY
                        Similarity threshold for confirming contaminating contigs. Default: 0.9
  --no-degenerate       Disable making consensus from parallel contig based on nucleotide degenerate
                        table.
  --degenerate-depth DEGENERATE_DEPTH
                        Depth factor for confirming parallel contigs. Default: 1.5
  --degenerate-similarity DEGENERATE_SIMILARITY
                        Similarity threshold for confirming parallel contigs. Default: 0.98
  --disentangle-time-limit DISENTANGLE_TIME_LIMIT
                        Time limit (second) for each try of disentangling a graph file as a circular
                        genome. Disentangling a graph as contigs is not limited. Default: 1800
  --expected-max-size EXPECTED_MAX_SIZE
                        Expected maximum target genome size(s) for disentangling. Should be a list of
                        INTEGER numbers split by comma(s) on a multi-organelle mode, with the same
                        list length to organelle_type (followed by '-F'). Default: 250000 (-F
                        embplant_pt/fungus_mt), 25000 (-F embplant_nr/animal_mt/fungus_nr), 1000000
                        (-F embplant_mt/other_pt),1000000,1000000,250000 (-F
                        other_pt,embplant_mt,fungus_mt)
  --expected-min-size EXPECTED_MIN_SIZE
                        Expected minimum target genome size(s) for disentangling. Should be a list of
                        INTEGER numbers split by comma(s) on a multi-organelle mode, with the same
                        list length to organelle_type (followed by '-F'). Default: 10000 for all.
  --reverse-lsc         For '-F embplant_pt' with complete circular result, by default, the direction
                        of the starting contig (usually the LSC region) is determined as the direction
                        with less ORFs. Choose this option to reverse the direction of the starting
                        contig when result is circular. Actually, both directions are biologically
                        equivalent to each other. The reordering of the direction is only for easier
                        downstream analysis.
  --max-paths-num MAX_PATHS_NUM
                        Repeats would dramatically increase the number of potential isomers (paths).
                        This option was used to export a certain amount of paths out of all possible
                        paths per assembly graph. Default: 1000

ADDITIONAL OPTIONS:

  -t THREADS            Maximum threads to use.
  -P PRE_GROUPED        The maximum number (integer) of high-covered reads to be pre-grouped before
                        extending process. pre_grouping is suggested when the whole genome coverage is
                        shallow but the organ genome coverage is deep. The default value is 2E5. For
                        personal computer with 8G memory, we suggest no more than 3E5. A larger number
                        (ex. 6E5) would run faster but exhaust memory in the first few minutes. Choose
                        0 to disable this process.
  --which-blast WHICH_BLAST
                        Assign the path to BLAST binary files if not added to the path. Default: try
                        "/local/cluster/getorganelle-1.7.5.0/lib/python3.9/site-
                        packages/GetOrganelleDep/linux/ncbi-blast" first, then $PATH
  --which-bowtie2 WHICH_BOWTIE2
                        Assign the path to Bowtie2 binary files if not added to the path. Default: try
                        "/local/cluster/getorganelle-1.7.5.0/lib/python3.9/site-
                        packages/GetOrganelleDep/linux/bowtie2" first, then $PATH
  --which-spades WHICH_SPADES
                        Assign the path to SPAdes binary files if not added to the path. Default: try
                        "/local/cluster/getorganelle-1.7.5.0/lib/python3.9/site-
                        packages/GetOrganelleDep/linux/SPAdes" first, then $PATH
  --which-bandage WHICH_BANDAGE
                        Assign the path to bandage binary file if not added to the path. Default: try
                        $PATH
  --continue            Several check points based on files produced, rather than on the log file, so
                        keep in mind that this script will NOT detect the difference between this
                        input parameters and the previous ones.
  --overwrite           Overwrite previous file if existed.
  --index-in-memory     Keep index in memory. Choose save index in memory than in disk.
  --remove-duplicates RM_DUPLICATES
                        By default this script use unique reads to extend. Choose the number of
                        duplicates (integer) to be saved in memory. A larger number (ex. 2E7) would
                        run faster but exhaust memory in the first few minutes. Choose 0 to disable
                        this process. Note that whether choose or not will not disable the calling of
                        replicate reads. Default: 10000000.0.
  --flush-step ECHO_STEP
                        Flush step (INTEGER OR INF) for presenting progress. For running in the
                        background, you could set this to inf, which would disable this. Default:
                        54321
  --random-seed RANDOM_SEED
                        Default: 12345
  --verbose             Verbose output. Choose to enable verbose running log_handler.
```

Please denote the version of GetOrganelle as well as the dependencies in your
manuscript for reproducible science.

<b>Citation:</b> Jian-Jun Jin*, Wen-Bin Yu*, Jun-Bo Yang, Yu Song, Claude W.
dePamphilis, Ting-Shuang Yi, De-Zhu Li. <b>GetOrganelle: a fast and versatile
toolkit for accurate de novo assembly of organelle genomes.</b> <i>Genome
Biology</i> <b>21</b>, 241 (2020).
[https://doi.org/10.1186/s13059-020-02154-5](https://doi.org/10.1186/s13059-020-02154-5)

<b>License:</b> GPL https://www.gnu.org/licenses/gpl-3.0.html

Please also cite the dependencies if used:

SPAdes: [Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, A. V. Pyshkin, A. V. Sirotkin, N. Vyahhi, G. Tesler, M. A. Alekseyev and P. A. Pevzner. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455-477.](https://www.liebertpub.com/doi/abs/10.1089/cmb.2012.0021)

Bowtie2: [Langmead, B. and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357-359.](https://www.nature.com/articles/nmeth.1923)

BLAST+: [Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer and T. L. Madden. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10: 421.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-421)

Bandage: [Wick, R. R., M. B. Schultz, J. Zobel and K. E. Holt. 2015. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31: 3350-3352.](https://academic.oup.com/bioinformatics/article/31/20/3350/196114)

software ref: <https://github.com/Kinggerm/GetOrganelle/wiki>  
software ref: <https://github.com/Kinggerm/GetOrganelle>  
research ref: <https://doi.org/10.1186/s13059-020-02154-5>