
vechat 1.1.1

Conda
See the ‘Activating the conda environment’ section below to access this software.

vechat-1.1.1

Correcting Errors in Noisy Long Reads Using Variation Graphs

Description

Error correction is the canonical first step in long-read sequencing data analysis. The current standard is to use a consensus sequence as a template. However, in mixed samples, such as metagenomes or organisms of higher ploidy, consensus-induced biases can mask true variants affecting haplotypes of lower frequencies, because these variants are mistaken for errors.

The novelty presented here is to use a graph-based, rather than a sequence-based, consensus as the template for identifying errors. The advantage is that graph-based reference systems also capture variants of lower frequencies and therefore do not mistakenly mask them as errors. We present VeChat, a novel approach that implements this idea: VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, a popular type of data structure for pangenome reference systems. Upon initial construction of an ad-hoc variation graph from the raw input reads, nodes and edges that are due to errors are pruned from that graph by an iterative procedure based on principles from frequent itemset mining. Upon termination, the graph exclusively contains nodes and edges reflecting true sequential phenomena. Final re-alignments of the raw reads indicate where and how the reads need to be corrected. VeChat is implemented in C++ and Python3.


Activating the conda environment

Check out a node with qrsh and run:

bash
source /local/cluster/vechat/activate.sh

To run over SGE, include the source line above in your shell script prior to the vechat commands.
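
For example, a minimal SGE submission script might look like the following sketch; the job name, parallel environment, slot count, and input/output file names are placeholders to adapt to your own run:

#!/bin/bash
#$ -cwd
#$ -N vechat_correct
#$ -pe thread 8    # parallel environment name and slot count vary by cluster

# load the vechat conda environment before calling vechat
source /local/cluster/vechat/activate.sh

# correct raw PacBio reads with 8 threads (placeholder file names)
vechat raw_reads.fastq.gz --platform pb -t 8 -o reads.corrected.fa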

Location

$ which vechat
/local/cluster/vechat-1.1.1/bin/vechat

Help message

$ vechat --help
usage: vechat [-h] [-o OUTFILE] [--platform PLATFORM] [--split] [--split-size SPLIT_SIZE] [--scrub] [-u] [--base] [--min-identity MIN_IDENTITY]
              [--linear] [-d MIN_CONFIDENCE] [-s MIN_SUPPORT] [--min-ovlplen-cns MIN_OVLPLEN_CNS] [--min-identity-cns MIN_IDENTITY_CNS]
              [-w WINDOW_LENGTH] [-q QUALITY_THRESHOLD] [-e ERROR_THRESHOLD] [-t THREADS] [-m MATCH] [-x MISMATCH] [-g GAP]
              [--cudaaligner-batches CUDAALIGNER_BATCHES] [-c CUDAPOA_BATCHES] [-b]
              sequences

Haplotype-aware Error Correction for Noisy Long Reads Using Variation Graphs

positional arguments:
  sequences             input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences used for correction

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        output file (default: reads.corrected.fa)
  --platform PLATFORM   sequencing platform: pb/ont (default: pb)
  --split               split target sequences into chunks (recommend for FASTQ > 20G or FASTA > 10G) (default: False)
  --split-size SPLIT_SIZE
                        split target sequences into chunks of desired size in lines, only valid when using --split (default: 1000000)
  --scrub               scrub chimeric reads (default: False)
  -u, --include-unpolished
                        output unpolished target sequences (default: False)
  --base                perform base level alignment when computing read overlaps in the first iteration (default: False)
  --min-identity MIN_IDENTITY
                        minimum identity used for filtering overlaps, only works combined with --base (default: 0.8)
  --linear              perform linear based fragment correction rather than variation graph based fragment correction (default: False)
  -d MIN_CONFIDENCE, --min-confidence MIN_CONFIDENCE
                        minimum confidence for keeping edges in the graph (default: 0.2)
  -s MIN_SUPPORT, --min-support MIN_SUPPORT
                        minimum support for keeping edges in the graph (default: 0.2)
  --min-ovlplen-cns MIN_OVLPLEN_CNS
                        minimum read overlap length in the consensus round (default: 1000)
  --min-identity-cns MIN_IDENTITY_CNS
                        minimum sequence identity between read overlaps in the consensus round (default: 0.99)
  -w WINDOW_LENGTH, --window-length WINDOW_LENGTH
                        size of window on which POA is performed (default: 500)
  -q QUALITY_THRESHOLD, --quality-threshold QUALITY_THRESHOLD
                        threshold for average base quality of windows used in POA (default: 10.0)
  -e ERROR_THRESHOLD, --error-threshold ERROR_THRESHOLD
                        maximum allowed error rate used for filtering overlaps (default: 0.3)
  -t THREADS, --threads THREADS
                        number of threads (default: 1)
  -m MATCH, --match MATCH
                        score for matching bases (default: 5)
  -x MISMATCH, --mismatch MISMATCH
                        score for mismatching bases (default: -4)
  -g GAP, --gap GAP     gap penalty (must be negative) (default: -8)
  --cudaaligner-batches CUDAALIGNER_BATCHES
                        number of batches for CUDA accelerated alignment (default: 0)
  -c CUDAPOA_BATCHES, --cudapoa-batches CUDAPOA_BATCHES
                        number of batches for CUDA accelerated polishing (default: 0)
  -b, --cuda-banded-alignment
                        use banding approximation for polishing on GPU. Only applicable when -c is used. (default: False)
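
Example usage

As a usage sketch (the input read file names and thread counts below are placeholders; the options themselves are taken from the help message above):

source /local/cluster/vechat/activate.sh

# correct Oxford Nanopore reads with 16 threads
vechat ont_reads.fastq.gz --platform ont -t 16 -o ont_reads.corrected.fa

# for very large inputs (FASTQ > 20G or FASTA > 10G), process in chunks
vechat big_reads.fastq.gz --platform pb -t 16 --split -o big_reads.corrected.fa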

software ref: https://github.com/HaploKit/vechat
research ref: https://doi.org/10.1038/s41467-022-34381-8