Synteny and Rearrangement Identifier (SyRI)
SyRI is a comprehensive tool for predicting genomic differences between related
genomes using whole-genome assemblies (WGA). The assemblies are aligned using
whole-genome alignment tools, and these alignments are then used as input to
SyRI. SyRI identifies the syntenic path (the longest set of co-linear regions),
structural rearrangements (inversions, translocations, and duplications), local
variations (SNPs, indels, CNVs, etc.) within syntenic and structurally
rearranged regions, and unaligned regions.
SyRI takes a synteny-first approach: it starts by identifying the longest
syntenic path (the longest set of co-linear regions). Since all non-syntenic
regions correspond to genomic regions that have been rearranged between the two
genomes, identifying the syntenic regions simultaneously identifies all
structural rearrangements (SRs) as well. All aligned non-syntenic regions are
then classified as inversions, translocations, or duplications based on the
conformation of their constituent alignments. This approach transforms the
challenging problem of SR identification into the comparatively easier problem
of SR classification.
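The idea of a "longest set of co-linear regions" can be sketched with a small chaining example. This is a toy illustration, not SyRI's implementation: the function name `syntenic_chain`, the four-column input layout (ref_start ref_end qry_start qry_end), and the scoring by reference length are all assumptions made for the example, and it only handles forward-strand alignments (qry_start < qry_end).

```bash
# Toy sketch: select the highest-scoring chain of co-linear alignments.
# Input (stdin): ref_start ref_end qry_start qry_end, one alignment per line.
# Hypothetical helper for illustration; SyRI itself works differently.
syntenic_chain() {
  sort -k1,1n | awk '
    { rs[NR] = $1; re[NR] = $2; qs[NR] = $3; qe[NR] = $4; len[NR] = $2 - $1 + 1 }
    END {
      best_end = 0
      for (i = 1; i <= NR; i++) {
        score[i] = len[i]; prev[i] = 0
        for (j = 1; j < i; j++) {
          # alignment j can precede i only if it ends before i starts in both genomes
          if (re[j] < rs[i] && qe[j] < qs[i] && score[j] + len[i] > score[i]) {
            score[i] = score[j] + len[i]; prev[i] = j
          }
        }
        if (best_end == 0 || score[i] > score[best_end]) best_end = i
      }
      # walk back from the end of the best chain, then print in reference order
      n = 0
      for (i = best_end; i > 0; i = prev[i]) chain[++n] = i
      for (k = n; k >= 1; k--) { i = chain[k]; print rs[i], re[i], qs[i], qe[i] }
    }'
}

# Four made-up alignments; the second one is out of order in the query.
printf '%s\n' '1 100 1 100' '150 300 500 650' '350 500 150 300' '550 700 350 500' \
  | syntenic_chain
```

In this made-up input, the alignment `150 300 500 650` is left out of the chain, so it would be the one passed on for classification as a rearrangement.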
Further, SyRI identifies local variations within all syntenic and structurally
rearranged regions. Local variations consist of short variations, such as SNPs
and small indels, and structural variations, such as large indels, CNVs
(copy-number variations), and HDRs (highly diverged regions). Short variations
are parsed from the constituent alignments, whereas structural variations are
predicted by comparing the overlaps and gaps between consecutive alignments of
a syntenic or rearranged region.
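The comparison of gaps and overlaps between consecutive alignments can be illustrated with a short sketch. The function name `alignment_gaps`, the input layout (ref_start ref_end qry_start qry_end, already sorted along the region), and the category labels are hypothetical and chosen for readability; they are not SyRI's own annotation types.

```bash
# Toy sketch: report the gap (positive) or overlap (negative) between each
# pair of consecutive alignments, in both the reference and the query.
# Hypothetical helper for illustration only.
alignment_gaps() {
  awk '
    NR > 1 {
      ref_gap = $1 - prev_ref_end - 1   # negative means the alignments overlap
      qry_gap = $3 - prev_qry_end - 1
      if (ref_gap < 0 || qry_gap < 0)      kind = "overlap"
      else if (ref_gap > 0 && qry_gap > 0) kind = "gap_in_both"
      else if (ref_gap > 0)                kind = "gap_in_ref_only"
      else if (qry_gap > 0)                kind = "gap_in_qry_only"
      else                                 kind = "adjacent"
      print ref_gap, qry_gap, kind
    }
    { prev_ref_end = $2; prev_qry_end = $4 }'
}

# Made-up alignments of one syntenic region.
printf '%s\n' '1 100 1 100' '151 300 101 250' '301 400 351 450' '391 500 451 560' \
  | alignment_gaps
```

A gap present in only one genome suggests sequence missing from the other (an indel-like difference), while an overlap suggests a copy-number-like difference; how SyRI assigns its final variant types is described in the linked documentation.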
Activate conda env:
|
bash
source /local/cluster/syri/activate.sh
|
Location and version:
|
$ which syri
/local/cluster/syri-1.5/bin/syri
|
Software has been tested! See output here:
|
/nfs1/CGRB/databases/software/syri
|
Help message:
|
$ syri -h
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k]
[--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
[--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [-f]
[--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME]
[--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT] [--inc INCREASEBY]
[--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET]
[--cigar] [-s SSPATH]
Input Files:
-c INFILE File containing alignment coordinates (default: None)
-r REF Genome A (which is considered as reference for the
alignments). Required for local variation (large
indels, CNVs) identification. (default: None)
-q QRY Genome B (which is considered as query for the
alignments). Required for local variation (large
indels, CNVs) identification. (default: None)
-d DELTA .delta file from mummer. Required for short variation
(SNPs/indels) identification when CIGAR string is not
available (default: None)
optional arguments:
-h, --help show this help message and exit
-F {T,S,B} Input file type. T: Table, S: SAM, B: BAM (default: T)
-k Keep intermediate output files (default: False)
--log {DEBUG,INFO,WARN}
log level (default: INFO)
--lf LOG_FIN Name of log file (default: syri.log)
--dir DIR path to working directory (if not current directory).
All files must be in this directory. (default: None)
--prefix PREFIX Prefix to add before the output file Names (default: )
--seed SEED seed for generating random numbers (default: 1)
--nc NCORES number of cores to use in parallel (max is number of
chromosomes) (default: 1)
--novcf Do not combine all files into one output file
(default: False)
-f Filter out low quality alignments (default: True)
SR identification:
--nosr Set to skip structural rearrangement identification
(default: False)
--tdgaplen TDGL Maximum allowed gap-length between two alignments of a
multi-alignment translocation or duplication (TD).
Larger values increases TD identification sensitivity
but also runtime. (default: 500000)
--tdmaxolp TDOLP Maximum allowed overlap between two translocations.
Value should be in range (0,1]. (default: 0.8)
-b BRUTERUNTIME Cutoff to restrict brute force methods to take too
much time (in seconds). Smaller values would make
algorithm faster, but could have marginal effects on
accuracy. In general case, would not be required.
(default: 60)
--unic TRANSUNICOUNT Number of uniques bps for selecting translocation.
Smaller values would select smaller TLs better, but
may increase time and decrease accuracy. (default:
1000)
--unip TRANSUNIPERCENT
Percent of unique region requried to select
translocation. Value should be in range (0,1]. Smaller
values would allow selection of TDs which are more
overlapped with other regions. (default: 0.5)
--inc INCREASEBY Minimum score increase required to add another
alignment to translocation cluster solution (default:
1000)
--no-chrmatch Do not allow SyRI to automatically match chromosome
ids between the two genomes if they are not equal
(default: False)
ShV identification:
--nosv Set to skip structural variation identification
(default: False)
--nosnp Set to skip SNP/Indel (within alignment)
identification (default: False)
--all Use duplications too for variant identification
(default: False)
--allow-offset OFFSET
BPs allowed to overlap (default: 5)
--cigar Find SNPs/indels using CIGAR string. Necessary for
alignment generated using aligners other than nucmers
(default: False)
-s SSPATH path to show-snps from mummer (default: show-snps)
|
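As a rough orientation for the options above, the sketch below prints one possible nucmer-based command sequence ending in a `syri` call. The helper name `syri_pipeline_commands` and the file names are hypothetical, and the nucmer, delta-filter, and show-coords parameter values are example settings assumed here rather than recommendations from this page; adjust them for your genomes. The function only prints the commands so they can be reviewed before being run.

```bash
# Hypothetical helper: print (not run) an example alignment + SyRI workflow.
# Usage: syri_pipeline_commands REF.fa QRY.fa [PREFIX]
syri_pipeline_commands() {
  local ref=$1 qry=$2 prefix=${3:-out}
  # Whole-genome alignment with MUMmer (example parameter values, an assumption)
  printf 'nucmer --maxmatch -c 100 -b 500 -l 50 -p %s %s %s\n' "$prefix" "$ref" "$qry"
  # Remove short or low-identity alignments
  printf 'delta-filter -m -i 90 -l 100 %s.delta > %s.filtered.delta\n' "$prefix" "$prefix"
  # Tab-separated coordinate table, the default input type for syri (-F T)
  printf 'show-coords -THrd %s.filtered.delta > %s.filtered.coords\n' "$prefix" "$prefix"
  # -c, -d, -r, -q and --prefix are taken from the help message above
  printf 'syri -c %s.filtered.coords -d %s.filtered.delta -r %s -q %s --prefix %s.\n' \
    "$prefix" "$prefix" "$ref" "$qry" "$prefix"
}

syri_pipeline_commands ref.fa qry.fa demo
```

After activating the environment as shown earlier, the printed commands could be run one at a time, or the output piped to `bash` once the file names are correct. For alignments produced by other aligners, the help message indicates that `-F S` or `-F B` together with `--cigar` would be used instead of the `-d` delta file.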
software ref: https://schneebergerlab.github.io/syri/
software ref: https://github.com/schneebergerlab/syri
research ref: https://doi.org/10.1186/s13059-019-1911-0