# Genomescope2 {{< admonition tip "Conda" true >}} See the 'activating the conda environment' section below to access this software. {{< /admonition >}} ## GenomeScope 2.0: Reference-free profiling of polyploid genomes Many genomics analyses require first establishing a reference genome. However, *de novo* genome assembly is a complicated and computationally intensive process with many assumptions hidden to the user. A popular assessment prior to genome assembly is genome profiling, where the k-mer frequencies within the sequencing reads are analyzed to efficiently estimate major genome characteristics such as genome size, heterozygosity, and repetitiveness. However, current genome profiling tools are suited only for diploid genomes and use heuristic approaches. We have developed GenomeScope 2.0, which applies classical insights from combinatorial theory to establish a detailed mathematical model of how k-mer frequencies will be distributed in heterozygous and polyploid genomes. GenomeScope 2.0 employs a polyploid-aware mixture model that, within seconds, accurately infers genome properties from unassembled sequencing data. GenomeScope 2.0 uses the k-mer count distribution, e.g. from KMC or Jellyfish, and produces a report and several informative plots describing the genome properties. We validate the approach on simulated polyploid data created using a generative model with parameters for genome size, heterozygosity, repetitiveness, ploidy, and sequencing coverage, and find GenomeScope 2.0 retains accuracy across a broad range of realistic and extreme parameter values. We also validate GenomeScope 2.0 by analyzing genuine sequence data from 11 diverse polyploid genomes with known genome characteristics. ### Getting Started Before running GenomeScope 2.0, you must first compute the histogram of k-mer frequencies. We recommend [KMC](https://academic.oup.com/bioinformatics/article/33/17/2759/3796399) that is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download, or [jellyfish](https://academic.oup.com/bioinformatics/article/27/6/764/234905/A-fast-lock-free-approach-for-efficient-parallel) that is available at http://www.genome.umd.edu/jellyfish.html. After compiling KMC, you can count the k-mers with these commands: ```console mkdir tmp ls *.fastq > FILES kmc -k21 -t10 -m64 -ci1 -cs10000 @FILES reads tmp/ ``` Note you should adjust the memory (-m) and threads (-t) parameters according to your server. This example will use 10 threads and 64GB of RAM. The k-mer length (-k) may need to be scaled if you have low coverage or high error rate. The lower (-ci) and upper (-cs) bounds exclude k-mers with counts outside these boundaries. FILES is a file with a list of input files. After compiling jellyfish, you can count the k-mers with this command: ```console jellyfish count -C -m 21 -s 1000000000 -t 10 *.fastq -o reads.jf ``` Note you should adjust the memory (-s) and threads (-t) parameters according to your server. This example will use 10 threads and 1GB of RAM. The k-mer length (-m) may need to be scaled if you have low coverage or a high error rate. You should always use "canonical k-mers" (-C) since the sequencing reads will come from both the forward and reverse strand of DNA. We recommend using a k-mer length of 21 for most genomes, as this length is sufficiently long that most k-mers are not repetitive and is short enough that the analysis will be more robust to sequencing errors. Extremely large (haploid size >>10GB) and/or very repetitive genomes may benefit from larger k-mer lengths to increase the number of unique k-mers. Accurate inferences requires a minimum amount of coverage, at least 15x coverage of the haploid genome or greater, otherwise the model fit will be poor or not converge. GenomeScope also requires relatively low error rate sequencing, such as Illumina sequencing, so that most k-mers do not have errors in them. For example, a 2% error corresponds to an error every 50bp on average, which is greater than the typical k-mer size used (k=21). Notably, raw single molecule sequencing reads from Oxford Nanopore or Pacific Biosciences, which currently average 5-15% error, are not supported as an error will occur on average every 6 to 20 bp. After counting the k-mers, you then export the k-mer count histogram. With KMC: ```console kmc_tools transform reads histogram reads.histo -cx10000 ``` The upper bound (-cx) gives the cutoff for the histogram. With jellyfish: ```console jellyfish histo -t 10 reads.jf > reads.histo ``` Again the thread count (-t) should be scaled according to your server. After you have the histogram file, you can run GenomeScope within the online web tool, or at the command line. After installation of the R package, command line users can run the modeling with the R script genomescope.R, making sure that genomescope.R is in your PATH. ```console genomescope.R -i histogram_file -o output_dir -k k-mer_length ``` The input histogram_file (from KMC or jellyfish), output_dir, and k-mer_length are required parameters. The optional parameter *-p ploidy* sets the ploidy of the model for GenomeScope to use. The optional parameter *-l lambda* sets the initial guess for the average k-mer coverage of the sequencing. The optional parameter *-n 'name_prefix'* sets the prefix for the output files. The optional parameter *-m max_kmercov* specifies the cutoff for excluding high frequence k-mers from the analysis. The output plots and a text file of the inferred genome characteristics will be output to the specified output_dir directory. For example, you can download the histogram from the Arabidopsis F1 described in the manuscript here: https://raw.githubusercontent.com/schatzlab/genomescope/master/analysis/real_data/ara_F1_21.hist Then run GenomeScope like this: ```console genomescope.R -i ara_F1_21.hist -o output -k 21 ``` This should complete in less than 1 minute, and report: ```console Model converged het:0.0104 kcov:22.2 err:0.0035 model fit:0.458 len:151978519 ``` The plots and the full results will be in output directory, showing the estimated genome size to be 151.9Mbp and a 1.04% heterozygosity rate (the exact values may slightly differ due to the randomization within the modeling). **Note: `genomescope.R` should be in your `$PATH` when you activate the conda env below.** ------------------------------------------------------------------------------- ## Activating the conda environment Check out a node with `qrsh` and run: ```console bash source /local/cluster/conda-envs/envs/genomescope2/activate.sh ``` ### Using conda activate genomescope2 If your conda is set up as [here](https://software.cqls.oregonstate.edu/tips/posts/using-the-system-miniconda-3-install/), you can run `conda activate genomescope2` instead of the `source` line above as well. ## Running over SGE To use over SGE, include the source line above in a shell script prior to your commands, e.g. ```bash $ cat run_genomescope2.sh #!/usr/bin/env bash source /local/cluster/conda-envs/envs/genomescope2/activate.sh genomescope.R ... ``` And then run `hqsub 'bash ./run_genomescope2.sh' -r sge.genomescope2 ...`. ### Using hqsub --conda You can also run: `hqsub 'conda activate genomescope2; genomescope.R ...' -r sge.genomescope2 --conda ...` Without having to write a separate shell script. ## Location and version ```console $ which genomescope.R /local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R $ genomescope.R --version GenomeScope 2.0 ``` ## help message ```console $ genomescope.R -h usage: /local/cluster/conda-envs/envs/genomescope2/bin/genomescope.R [-h] [-v] [-i INPUT] [-o OUTPUT] [-p PLOIDY] [-k KMER_LENGTH] [-n NAME_PREFIX] [-l LAMBDA] [-m MAX_KMERCOV] [--verbose] [--no_unique_sequence] [-t TOPOLOGY] [--initial_repetitiveness INITIAL_REPETITIVENESS] [--initial_heterozygosities INITIAL_HETEROZYGOSITIES] [--transform_exp TRANSFORM_EXP] [--testing] [--true_params TRUE_PARAMS] [--trace_flag] [--num_rounds NUM_ROUNDS] [--fitted_hist] [--start_shift START_SHIFT] [--typical_error TYPICAL_ERROR] options: -h, --help show this help message and exit -v, --version print the version and exit -i INPUT, --input INPUT input histogram file -o OUTPUT, --output OUTPUT output directory name -p PLOIDY, --ploidy PLOIDY ploidy (1, 2, 3, 4, 5, or 6) for model to use [default 2] -k KMER_LENGTH, --kmer_length KMER_LENGTH kmer length used to calculate kmer spectra [default 21] -n NAME_PREFIX, --name_prefix NAME_PREFIX optional name_prefix for output files -l LAMBDA, --lambda LAMBDA, --kcov LAMBDA, --kmercov LAMBDA optional initial kmercov estimate for model to use -m MAX_KMERCOV, --max_kmercov MAX_KMERCOV optional maximum kmer coverage threshold (kmers with coverage greater than max_kmercov are ignored by the model) --verbose optional flag to print messages during execution --no_unique_sequence optional flag to turn off yellow unique sequence line in plots -t TOPOLOGY, --topology TOPOLOGY ADVANCED: flag for topology for model to use --initial_repetitiveness INITIAL_REPETITIVENESS ADVANCED: flag to set initial value for repetitiveness --initial_heterozygosities INITIAL_HETEROZYGOSITIES ADVANCED: flag to set initial values for nucleotide heterozygosity rates --transform_exp TRANSFORM_EXP ADVANCED: parameter for the exponent when fitting a transformed (x**transform_exp*y vs. x) kmer histogram [default 1] --testing ADVANCED: flag to create testing.tsv file with model parameters --true_params TRUE_PARAMS ADVANCED: flag to state true simulated parameters for testing mode --trace_flag ADVANCED: flag to turn on printing of iteration progress of nlsLM function --num_rounds NUM_ROUNDS ADVANCED: parameter for the number of optimization rounds --fitted_hist ADVANCED: generates a fitted histogram for kmer multiplicity 0-4 and a lookup table of probabilities --start_shift START_SHIFT ADVANCED: coverage shifts to exclude between fitting rounds --typical_error TYPICAL_ERROR ADVANCED: typical level of sequencing error ``` software ref: research ref: