# Genomescope2


{{< admonition tip "Conda" true >}}
See the 'activating the conda environment' section below to access this
software.
{{< /admonition >}}

## GenomeScope 2.0: Reference-free profiling of polyploid genomes

Many genomics analyses require first establishing a reference genome. However,
*de novo* genome assembly is a complicated and computationally intensive
process with many assumptions hidden to the user. A popular assessment prior
to genome assembly is genome profiling, where the k-mer frequencies within the
sequencing reads are analyzed to efficiently estimate major genome
characteristics such as genome size, heterozygosity, and repetitiveness.
However, current genome profiling tools are suited only for diploid genomes
and use heuristic approaches.

We have developed GenomeScope 2.0, which applies classical insights from
combinatorial theory to establish a detailed mathematical model of how k-mer
frequencies will be distributed in heterozygous and polyploid genomes.
GenomeScope 2.0 employs a polyploid-aware mixture model that, within seconds,
accurately infers genome properties from unassembled sequencing data.
GenomeScope 2.0 uses the k-mer count distribution, e.g. from KMC or Jellyfish,
and produces a report and several informative plots describing the genome
properties. We validate the approach on simulated polyploid data created using
a generative model with parameters for genome size, heterozygosity,
repetitiveness, ploidy, and sequencing coverage, and find GenomeScope 2.0
retains accuracy across a broad range of realistic and extreme parameter
values. We also validate GenomeScope 2.0 by analyzing genuine sequence data
from 11 diverse polyploid genomes with known genome characteristics.

### Getting Started

Before running GenomeScope 2.0, you must first compute the histogram of k-mer
frequencies. We recommend
[KMC](https://academic.oup.com/bioinformatics/article/33/17/2759/3796399) that
is available at
http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download,
or
[jellyfish](https://academic.oup.com/bioinformatics/article/27/6/764/234905/A-fast-lock-free-approach-for-efficient-parallel)
that is available at http://www.genome.umd.edu/jellyfish.html.

After compiling KMC, you can count the k-mers with these commands:

```console
mkdir tmp
ls *.fastq > FILES
kmc -k21 -t10 -m64 -ci1 -cs10000 @FILES reads tmp/
```

Note you should adjust the memory (-m) and threads (-t) parameters according
to your server. This example will use 10 threads and 64GB of RAM. The k-mer
length (-k) may need to be scaled if you have low coverage or high error rate.
The lower (-ci) and upper (-cs) bounds exclude k-mers with counts outside
these boundaries. FILES is a file with a list of input files.

After compiling jellyfish, you can count the k-mers with this command:

```console
jellyfish count -C -m 21 -s 1000000000 -t 10 *.fastq -o reads.jf
```
  
Note you should adjust the memory (-s) and threads (-t) parameters according
to your server. This example will use 10 threads and 1GB of RAM. The k-mer
length (-m) may need to be scaled if you have low coverage or a high error
rate. You should always use "canonical k-mers" (-C) since the sequencing reads
will come from both the forward and reverse strand of DNA.

We recommend using a k-mer length of 21 for most genomes, as this length is
sufficiently long that most k-mers are not repetitive and is short enough that
the analysis will be more robust to sequencing errors. Extremely large
(haploid size >>10GB) and/or very repetitive genomes may benefit from larger
k-mer lengths to increase the number of unique k-mers. Accurate inferences
requires a minimum amount of coverage, at least 15x coverage of the haploid
genome or greater, otherwise the model fit will be poor or not converge.
GenomeScope also requires relatively low error rate sequencing, such as
Illumina sequencing, so that most k-mers do not have errors in them. For
example, a 2% error corresponds to an error every 50bp on average, which is
greater than the typical k-mer size used (k=21). Notably, raw single molecule
sequencing reads from Oxford Nanopore or Pacific Biosciences, which currently
average 5-15% error, are not supported as an error will occur on average every
6 to 20 bp.

After counting the k-mers, you then export the k-mer count histogram. With KMC:

```console
kmc_tools transform reads histogram reads.histo -cx10000
```

The upper bound (-cx) gives the cutoff for the histogram.

With jellyfish:

```console
jellyfish histo -t 10 reads.jf > reads.histo
```

Again the thread count (-t) should be scaled according to your server.

After you have the histogram file, you can run GenomeScope within the online
web tool, or at the command line.

After installation of the R package, command line users can run the modeling
with the R script genomescope.R, making sure that genomescope.R is in your
PATH.

```console
genomescope.R -i histogram_file -o output_dir -k k-mer_length
```

The input histogram_file (from KMC or jellyfish), output_dir, and k-mer_length
are required parameters. The optional parameter *-p ploidy* sets the ploidy of
the model for GenomeScope to use. The optional parameter *-l lambda* sets the
initial guess for the average k-mer coverage of the sequencing. The optional
parameter *-n 'name_prefix'* sets the prefix for the output files. The
optional parameter *-m max_kmercov* specifies the cutoff for excluding high
frequence k-mers from the analysis. The output plots and a text file of the
inferred genome characteristics will be output to the specified output_dir
directory.

For example, you can download the histogram from the Arabidopsis F1 described
in the manuscript here:
https://raw.githubusercontent.com/schatzlab/genomescope/master/analysis/real_data/ara_F1_21.hist

Then run GenomeScope like this:

```console
genomescope.R -i ara_F1_21.hist -o output -k 21
```

This should complete in less than 1 minute, and report:

```console
Model converged het:0.0104 kcov:22.2 err:0.0035 model fit:0.458 len:151978519
```

The plots and the full results will be in output directory, showing the
estimated genome size to be 151.9Mbp and a 1.04% heterozygosity rate (the
exact values may slightly differ due to the randomization within the
modeling).

**Note: `genomescope.R` should be in your `$PATH` when you activate the conda
env below.**

-------------------------------------------------------------------------------

## Activating the conda environment

Check out a node with `qrsh` and run:

```console
bash
source /local/cluster/conda-envs/envs/genomescope2/activate.sh
```

### Using conda activate genomescope2

If your conda is set up as
[here](https://software.cqls.oregonstate.edu/tips/posts/using-the-system-miniconda-3-install/),
you can run `conda activate genomescope2` instead of the `source` line above as well.

## Running over SGE

To use over SGE, include the source line above in a shell script prior to your
commands, e.g.

```bash
$ cat run_genomescope2.sh
#!/usr/bin/env bash
source /local/cluster/conda-envs/envs/genomescope2/activate.sh
genomescope.R ...
```

And then run `hqsub 'bash ./run_genomescope2.sh' -r sge.genomescope2 ...`.

### Using hqsub --conda

You can also run:

`hqsub 'conda activate genomescope2; genomescope.R ...' -r sge.genomescope2 --conda ...`

Without having to write a separate shell script.

## Location and version

```console
$ which genomescope.R
/local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R
$ genomescope.R --version
GenomeScope 2.0
```

## help message

```console
$ genomescope.R -h
usage: /local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R
       [-h] [-v] [-i INPUT] [-o OUTPUT] [-p PLOIDY] [-k KMER_LENGTH]
       [-n NAME_PREFIX] [-l LAMBDA] [-m MAX_KMERCOV] [--verbose]
       [--no_unique_sequence] [-t TOPOLOGY]
       [--initial_repetitiveness INITIAL_REPETITIVENESS]
       [--initial_heterozygosities INITIAL_HETEROZYGOSITIES]
       [--transform_exp TRANSFORM_EXP] [--testing] [--true_params TRUE_PARAMS]
       [--trace_flag] [--num_rounds NUM_ROUNDS] [--fitted_hist]
       [--start_shift START_SHIFT] [--typical_error TYPICAL_ERROR]

options:
  -h, --help            show this help message and exit
  -v, --version         print the version and exit
  -i INPUT, --input INPUT
                        input histogram file
  -o OUTPUT, --output OUTPUT
                        output directory name
  -p PLOIDY, --ploidy PLOIDY
                        ploidy (1, 2, 3, 4, 5, or 6) for model to use [default
                        2]
  -k KMER_LENGTH, --kmer_length KMER_LENGTH
                        kmer length used to calculate kmer spectra [default
                        21]
  -n NAME_PREFIX, --name_prefix NAME_PREFIX
                        optional name_prefix for output files
  -l LAMBDA, --lambda LAMBDA, --kcov LAMBDA, --kmercov LAMBDA
                        optional initial kmercov estimate for model to use
  -m MAX_KMERCOV, --max_kmercov MAX_KMERCOV
                        optional maximum kmer coverage threshold (kmers with
                        coverage greater than max_kmercov are ignored by the
                        model)
  --verbose             optional flag to print messages during execution
  --no_unique_sequence  optional flag to turn off yellow unique sequence line
                        in plots
  -t TOPOLOGY, --topology TOPOLOGY
                        ADVANCED: flag for topology for model to use
  --initial_repetitiveness INITIAL_REPETITIVENESS
                        ADVANCED: flag to set initial value for repetitiveness
  --initial_heterozygosities INITIAL_HETEROZYGOSITIES
                        ADVANCED: flag to set initial values for nucleotide
                        heterozygosity rates
  --transform_exp TRANSFORM_EXP
                        ADVANCED: parameter for the exponent when fitting a
                        transformed (x**transform_exp*y vs. x) kmer histogram
                        [default 1]
  --testing             ADVANCED: flag to create testing.tsv file with model
                        parameters
  --true_params TRUE_PARAMS
                        ADVANCED: flag to state true simulated parameters for
                        testing mode
  --trace_flag          ADVANCED: flag to turn on printing of iteration
                        progress of nlsLM function
  --num_rounds NUM_ROUNDS
                        ADVANCED: parameter for the number of optimization
                        rounds
  --fitted_hist         ADVANCED: generates a fitted histogram for kmer
                        multiplicity 0-4 and a lookup table of probabilities
  --start_shift START_SHIFT
                        ADVANCED: coverage shifts to exclude between fitting
                        rounds
  --typical_error TYPICAL_ERROR
                        ADVANCED: typical level of sequencing error
```

software ref: <https://github.com/tbenavi1/genomescope2.0>  
research ref: <https://doi.org/10.1038/s41467-020-14998-3>