SyRI 1.5

2021-09-07

Synteny and Rearrangement Identifier (SyRI)

SyRI is a comprehensive tool for predicting genomic differences between related genomes using whole-genome assemblies (WGA). The assemblies are aligned using whole-genome alignment tools, and these alignments are then used as input to SyRI. SyRI identifies the syntenic path (the longest set of co-linear regions), structural rearrangements (inversions, translocations, and duplications), local variations (SNPs, indels, CNVs, etc.) within syntenic and structurally rearranged regions, and unaligned regions.

SyRI's approach is to start by identifying the longest syntenic path (the longest set of co-linear regions). Since all non-syntenic regions correspond to genomic regions that have rearranged between the two genomes, identifying the syntenic path simultaneously identifies all structural rearrangements (SRs) as well. All aligned non-syntenic regions are then classified as inversions, translocations, or duplications based on the conformation of their constituent alignments. This approach transforms the challenging problem of SR identification into the comparatively easier problem of SR classification.
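The idea of a "longest set of co-linear regions" can be illustrated with a toy chaining routine. The sketch below is not SyRI's implementation; it assumes a hypothetical four-column input (`ref_start ref_end qry_start qry_end`), considers forward-strand alignments only, and picks the chain of non-overlapping alignments that increase in both genomes and cover the most reference bases. The function name `syntenic_chain` is made up for this example.

```shell
# Toy longest co-linear chain (illustration only, not SyRI's algorithm).
# Input columns (assumed format): ref_start ref_end qry_start qry_end
syntenic_chain() {
  sort -k1,1n | awk '
    { rs[NR] = $1; re[NR] = $2; qs[NR] = $3; qe[NR] = $4; len[NR] = $2 - $1 + 1 }
    END {
      best = 0; last = 0
      for (i = 1; i <= NR; i++) {
        # Best chain ending at alignment i: start with i alone ...
        score[i] = len[i]; prev[i] = 0
        # ... then try to extend any earlier chain that is co-linear with i.
        for (j = 1; j < i; j++)
          if (re[j] < rs[i] && qe[j] < qs[i] && score[j] + len[i] > score[i]) {
            score[i] = score[j] + len[i]; prev[i] = j
          }
        if (score[i] > best) { best = score[i]; last = i }
      }
      # Walk back from the best chain end, then print in reference order.
      n = 0
      for (i = last; i > 0; i = prev[i]) path[++n] = i
      for (k = n; k >= 1; k--) print rs[path[k]], re[path[k]], qs[path[k]], qe[path[k]]
    }
  '
}

# Four toy alignments; the second one maps far away in the query.
printf '%s\n' "1 100 1 100" "101 200 501 600" "201 300 101 200" "301 400 201 300" | syntenic_chain
```

On this toy input the chain keeps the first, third, and fourth alignments. The alignment left out (`101 200 501 600`) is the kind of aligned non-syntenic region that would then be passed on for classification as a rearrangement.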

SyRI also identifies local variations within all syntenic and structurally rearranged regions. Local variations consist of short variations, such as SNPs and small indels, as well as structural variations, such as large indels, CNVs (copy-number variations), and HDRs (highly diverged regions). Short variations are parsed from the constituent alignments, whereas structural variations are predicted by comparing the overlaps and gaps between consecutive alignments of a syntenic or rearranged region.
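Comparing consecutive alignments for gaps and overlaps can be sketched as follows. This is a simplified illustration, not SyRI's code: it assumes a hypothetical two-column input (`ref_start ref_end`) and looks at reference coordinates only, whereas SyRI compares both genomes. The function name `consecutive_gaps` is made up for this example.

```shell
# Report gaps and overlaps between consecutive alignments on the reference.
# Input columns (assumed format): ref_start ref_end
consecutive_gaps() {
  sort -k1,1n | awk '
    NR > 1 {
      d = $1 - prev_end - 1                  # bases between the two alignments
      if (d > 0)      print "gap", prev_end + 1, $1 - 1, d
      else if (d < 0) print "overlap", $1, prev_end, -d
    }
    # Track the furthest reference position covered so far.
    { if ($2 > prev_end) prev_end = $2 }
  '
}

printf '%s\n' "1 100" "151 300" "291 400" | consecutive_gaps
# gap 101 150 50
# overlap 291 300 10
```

In this toy reading, a gap on one genome without a matching gap on the other would hint at an insertion or deletion, and an overlap would hint at a repeated sequence; how SyRI actually scores these cases is described in the paper linked at the end of this page.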

Activate conda env:

bash
source /local/cluster/syri/activate.sh

Location and version:

$ which syri
/local/cluster/syri-1.5/bin/syri

Software has been tested! See output here:

/nfs1/CGRB/databases/software/syri

Help message:

$ syri -h
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k]
            [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
            [--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [-f]
            [--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME]
            [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT] [--inc INCREASEBY]
            [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET]
            [--cigar] [-s SSPATH]

Input Files:
  -c INFILE             File containing alignment coordinates (default: None)
  -r REF                Genome A (which is considered as reference for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -q QRY                Genome B (which is considered as query for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -d DELTA              .delta file from mummer. Required for short variation
                        (SNPs/indels) identification when CIGAR string is not
                        available (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -F {T,S,B}            Input file type. T: Table, S: SAM, B: BAM (default: T)
  -k                    Keep intermediate output files (default: False)
  --log {DEBUG,INFO,WARN}
                        log level (default: INFO)
  --lf LOG_FIN          Name of log file (default: syri.log)
  --dir DIR             path to working directory (if not current directory).
                        All files must be in this directory. (default: None)
  --prefix PREFIX       Prefix to add before the output file Names (default: )
  --seed SEED           seed for generating random numbers (default: 1)
  --nc NCORES           number of cores to use in parallel (max is number of
                        chromosomes) (default: 1)
  --novcf               Do not combine all files into one output file
                        (default: False)
  -f                    Filter out low quality alignments (default: True)

SR identification:
  --nosr                Set to skip structural rearrangement identification
                        (default: False)
  --tdgaplen TDGL       Maximum allowed gap-length between two alignments of a
                        multi-alignment translocation or duplication (TD).
                        Larger values increases TD identification sensitivity
                        but also runtime. (default: 500000)
  --tdmaxolp TDOLP      Maximum allowed overlap between two translocations.
                        Value should be in range (0,1]. (default: 0.8)
  -b BRUTERUNTIME       Cutoff to restrict brute force methods to take too
                        much time (in seconds). Smaller values would make
                        algorithm faster, but could have marginal effects on
                        accuracy. In general case, would not be required.
                        (default: 60)
  --unic TRANSUNICOUNT  Number of uniques bps for selecting translocation.
                        Smaller values would select smaller TLs better, but
                        may increase time and decrease accuracy. (default:
                        1000)
  --unip TRANSUNIPERCENT
                        Percent of unique region requried to select
                        translocation. Value should be in range (0,1]. Smaller
                        values would allow selection of TDs which are more
                        overlapped with other regions. (default: 0.5)
  --inc INCREASEBY      Minimum score increase required to add another
                        alignment to translocation cluster solution (default:
                        1000)
  --no-chrmatch         Do not allow SyRI to automatically match chromosome
                        ids between the two genomes if they are not equal
                        (default: False)

ShV identification:
  --nosv                Set to skip structural variation identification
                        (default: False)
  --nosnp               Set to skip SNP/Indel (within alignment)
                        identification (default: False)
  --all                 Use duplications too for variant identification
                        (default: False)
  --allow-offset OFFSET
                        BPs allowed to overlap (default: 5)
  --cigar               Find SNPs/indels using CIGAR string. Necessary for
                        alignment generated using aligners other than nucmers
                        (default: False)
  -s SSPATH             path to show-snps from mummer (default: show-snps)
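
Example run (sketch):

A typical run aligns the two assemblies with nucmer from MUMmer, filters the alignments, converts them to a coordinate table, and passes the table to `syri` with the `-c`, `-d`, `-r`, and `-q` options listed above. The helper below only prints the four commands so they can be reviewed first. The function name `syri_plan`, the file names, and the `out` prefix are placeholders, and the nucmer and delta-filter thresholds are starting values recalled from the SyRI documentation's example, not settings tuned for this cluster.

```shell
# Print a nucmer -> delta-filter -> show-coords -> syri command sequence.
# Arguments: reference FASTA, query FASTA, optional output prefix (default: out).
syri_plan() {
  local ref="$1" qry="$2" prefix="${3:-out}"
  printf '%s\n' \
    "nucmer --maxmatch -c 100 -b 500 -l 50 -p ${prefix} ${ref} ${qry}" \
    "delta-filter -m -i 90 -l 100 ${prefix}.delta > ${prefix}.filtered.delta" \
    "show-coords -THrd ${prefix}.filtered.delta > ${prefix}.filtered.coords" \
    "syri -c ${prefix}.filtered.coords -d ${prefix}.filtered.delta -r ${ref} -q ${qry}"
}

# Review the commands for two hypothetical assemblies.
syri_plan refgenome.fa qrygenome.fa
```

After activating the environment as shown earlier, the printed commands can be run one at a time, or piped to `bash` once they look right. For alignments in SAM or BAM format from another aligner, the help text above indicates `-F S` or `-F B` together with `--cigar` in place of the `-d` delta file.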

software ref: https://schneebergerlab.github.io/syri/
software ref: https://github.com/schneebergerlab/syri
research ref: https://doi.org/10.1186/s13059-019-1911-0