Contents

Genomescope2

Conda
See the ‘activating the conda environment’ section below to access this software.

GenomeScope 2.0: Reference-free profiling of polyploid genomes

Many genomics analyses require first establishing a reference genome. However, de novo genome assembly is a complicated and computationally intensive process with many assumptions hidden to the user. A popular assessment prior to genome assembly is genome profiling, where the k-mer frequencies within the sequencing reads are analyzed to efficiently estimate major genome characteristics such as genome size, heterozygosity, and repetitiveness. However, current genome profiling tools are suited only for diploid genomes and use heuristic approaches.

We have developed GenomeScope 2.0, which applies classical insights from combinatorial theory to establish a detailed mathematical model of how k-mer frequencies will be distributed in heterozygous and polyploid genomes. GenomeScope 2.0 employs a polyploid-aware mixture model that, within seconds, accurately infers genome properties from unassembled sequencing data. GenomeScope 2.0 uses the k-mer count distribution, e.g. from KMC or Jellyfish, and produces a report and several informative plots describing the genome properties. We validate the approach on simulated polyploid data created using a generative model with parameters for genome size, heterozygosity, repetitiveness, ploidy, and sequencing coverage, and find GenomeScope 2.0 retains accuracy across a broad range of realistic and extreme parameter values. We also validate GenomeScope 2.0 by analyzing genuine sequence data from 11 diverse polyploid genomes with known genome characteristics.

Getting Started

Before running GenomeScope 2.0, you must first compute the histogram of k-mer frequencies. We recommend KMC that is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download, or jellyfish that is available at http://www.genome.umd.edu/jellyfish.html.

After compiling KMC, you can count the k-mers with these commands:

1
2
3
mkdir tmp
ls *.fastq > FILES
kmc -k21 -t10 -m64 -ci1 -cs10000 @FILES reads tmp/

Note you should adjust the memory (-m) and threads (-t) parameters according to your server. This example will use 10 threads and 64GB of RAM. The k-mer length (-k) may need to be scaled if you have low coverage or high error rate. The lower (-ci) and upper (-cs) bounds exclude k-mers with counts outside these boundaries. FILES is a file with a list of input files.

After compiling jellyfish, you can count the k-mers with this command:

1
jellyfish count -C -m 21 -s 1000000000 -t 10 *.fastq -o reads.jf

Note you should adjust the memory (-s) and threads (-t) parameters according to your server. This example will use 10 threads and 1GB of RAM. The k-mer length (-m) may need to be scaled if you have low coverage or a high error rate. You should always use “canonical k-mers” (-C) since the sequencing reads will come from both the forward and reverse strand of DNA.

We recommend using a k-mer length of 21 for most genomes, as this length is sufficiently long that most k-mers are not repetitive and is short enough that the analysis will be more robust to sequencing errors. Extremely large (haploid size »10GB) and/or very repetitive genomes may benefit from larger k-mer lengths to increase the number of unique k-mers. Accurate inferences requires a minimum amount of coverage, at least 15x coverage of the haploid genome or greater, otherwise the model fit will be poor or not converge. GenomeScope also requires relatively low error rate sequencing, such as Illumina sequencing, so that most k-mers do not have errors in them. For example, a 2% error corresponds to an error every 50bp on average, which is greater than the typical k-mer size used (k=21). Notably, raw single molecule sequencing reads from Oxford Nanopore or Pacific Biosciences, which currently average 5-15% error, are not supported as an error will occur on average every 6 to 20 bp.

After counting the k-mers, you then export the k-mer count histogram. With KMC:

1
kmc_tools transform reads histogram reads.histo -cx10000

The upper bound (-cx) gives the cutoff for the histogram.

With jellyfish:

1
jellyfish histo -t 10 reads.jf > reads.histo

Again the thread count (-t) should be scaled according to your server.

After you have the histogram file, you can run GenomeScope within the online web tool, or at the command line.

After installation of the R package, command line users can run the modeling with the R script genomescope.R, making sure that genomescope.R is in your PATH.

1
genomescope.R -i histogram_file -o output_dir -k k-mer_length

The input histogram_file (from KMC or jellyfish), output_dir, and k-mer_length are required parameters. The optional parameter -p ploidy sets the ploidy of the model for GenomeScope to use. The optional parameter -l lambda sets the initial guess for the average k-mer coverage of the sequencing. The optional parameter -n ‘name_prefix’ sets the prefix for the output files. The optional parameter -m max_kmercov specifies the cutoff for excluding high frequence k-mers from the analysis. The output plots and a text file of the inferred genome characteristics will be output to the specified output_dir directory.

For example, you can download the histogram from the Arabidopsis F1 described in the manuscript here: https://raw.githubusercontent.com/schatzlab/genomescope/master/analysis/real_data/ara_F1_21.hist

Then run GenomeScope like this:

1
genomescope.R -i ara_F1_21.hist -o output -k 21

This should complete in less than 1 minute, and report:

1
Model converged het:0.0104 kcov:22.2 err:0.0035 model fit:0.458 len:151978519

The plots and the full results will be in output directory, showing the estimated genome size to be 151.9Mbp and a 1.04% heterozygosity rate (the exact values may slightly differ due to the randomization within the modeling).

Note: genomescope.R should be in your $PATH when you activate the conda env below.


Activating the conda environment

Check out a node with qrsh and run:

1
2
bash
source /local/cluster/conda-envs/envs/genomescope2/activate.sh

Using conda activate genomescope2

If your conda is set up as here, you can run conda activate genomescope2 instead of the source line above as well.

Running over SGE

To use over SGE, include the source line above in a shell script prior to your commands, e.g.

1
2
3
4
$ cat run_genomescope2.sh
#!/usr/bin/env bash
source /local/cluster/conda-envs/envs/genomescope2/activate.sh
genomescope.R ...

And then run hqsub 'bash ./run_genomescope2.sh' -r sge.genomescope2 ....

Using hqsub –conda

You can also run:

hqsub 'conda activate genomescope2; genomescope.R ...' -r sge.genomescope2 --conda ...

Without having to write a separate shell script.

Location and version

1
2
3
4
$ which genomescope.R
/local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R
$ genomescope.R --version
GenomeScope 2.0

help message

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
$ genomescope.R -h
usage: /local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R
       [-h] [-v] [-i INPUT] [-o OUTPUT] [-p PLOIDY] [-k KMER_LENGTH]
       [-n NAME_PREFIX] [-l LAMBDA] [-m MAX_KMERCOV] [--verbose]
       [--no_unique_sequence] [-t TOPOLOGY]
       [--initial_repetitiveness INITIAL_REPETITIVENESS]
       [--initial_heterozygosities INITIAL_HETEROZYGOSITIES]
       [--transform_exp TRANSFORM_EXP] [--testing] [--true_params TRUE_PARAMS]
       [--trace_flag] [--num_rounds NUM_ROUNDS] [--fitted_hist]
       [--start_shift START_SHIFT] [--typical_error TYPICAL_ERROR]

options:
  -h, --help            show this help message and exit
  -v, --version         print the version and exit
  -i INPUT, --input INPUT
                        input histogram file
  -o OUTPUT, --output OUTPUT
                        output directory name
  -p PLOIDY, --ploidy PLOIDY
                        ploidy (1, 2, 3, 4, 5, or 6) for model to use [default
                        2]
  -k KMER_LENGTH, --kmer_length KMER_LENGTH
                        kmer length used to calculate kmer spectra [default
                        21]
  -n NAME_PREFIX, --name_prefix NAME_PREFIX
                        optional name_prefix for output files
  -l LAMBDA, --lambda LAMBDA, --kcov LAMBDA, --kmercov LAMBDA
                        optional initial kmercov estimate for model to use
  -m MAX_KMERCOV, --max_kmercov MAX_KMERCOV
                        optional maximum kmer coverage threshold (kmers with
                        coverage greater than max_kmercov are ignored by the
                        model)
  --verbose             optional flag to print messages during execution
  --no_unique_sequence  optional flag to turn off yellow unique sequence line
                        in plots
  -t TOPOLOGY, --topology TOPOLOGY
                        ADVANCED: flag for topology for model to use
  --initial_repetitiveness INITIAL_REPETITIVENESS
                        ADVANCED: flag to set initial value for repetitiveness
  --initial_heterozygosities INITIAL_HETEROZYGOSITIES
                        ADVANCED: flag to set initial values for nucleotide
                        heterozygosity rates
  --transform_exp TRANSFORM_EXP
                        ADVANCED: parameter for the exponent when fitting a
                        transformed (x**transform_exp*y vs. x) kmer histogram
                        [default 1]
  --testing             ADVANCED: flag to create testing.tsv file with model
                        parameters
  --true_params TRUE_PARAMS
                        ADVANCED: flag to state true simulated parameters for
                        testing mode
  --trace_flag          ADVANCED: flag to turn on printing of iteration
                        progress of nlsLM function
  --num_rounds NUM_ROUNDS
                        ADVANCED: parameter for the number of optimization
                        rounds
  --fitted_hist         ADVANCED: generates a fitted histogram for kmer
                        multiplicity 0-4 and a lookup table of probabilities
  --start_shift START_SHIFT
                        ADVANCED: coverage shifts to exclude between fitting
                        rounds
  --typical_error TYPICAL_ERROR
                        ADVANCED: typical level of sequencing error

software ref: https://github.com/tbenavi1/genomescope2.0
research ref: https://doi.org/10.1038/s41467-020-14998-3