# Syri 1.5

## Synteny and Rearrangement Identifier (SyRI)

SyRI is a comprehensive tool for predicting genomic differences between related genomes using whole-genome assemblies (WGA). The assemblies are aligned using whole-genome alignment tools, and these alignments are then used as input to SyRI. SyRI identifies the syntenic path (the longest set of co-linear regions), structural rearrangements (inversions, translocations, and duplications), local variations (SNPs, indels, CNVs, etc.) within syntenic and structurally rearranged regions, and unaligned regions.

SyRI takes a novel approach: it starts by identifying the longest syntenic path (set of co-linear regions). Since all non-syntenic regions correspond to genomic regions that have rearranged between the two genomes, identifying the syntenic regions simultaneously identifies all structural rearrangements (SRs) as well. After this step, all aligned non-syntenic regions are classified as inversions, translocations, or duplications based on the conformation of their constituent alignments. This approach transforms the challenging problem of SR identification into the comparatively easier problem of SR classification.

SyRI also identifies local variations within all syntenic and structurally rearranged regions. Local variations consist of short variations, such as SNPs and small indels, and structural variations, such as large indels, CNVs (copy-number variations), and HDRs (highly diverged regions). Short variations are parsed out of the constituent alignments, whereas structural variations are predicted by comparing the overlaps and gaps between consecutive alignments of a syntenic or rearranged region.

Activate conda env:

```console
bash
source /local/cluster/syri/activate.sh
```

Location and version:

```console
$ which syri
/local/cluster/syri-1.5/bin/syri
```

Software has been tested!
See output here:

```console
/nfs1/CGRB/databases/software/syri
```

help message:

```console
$ syri -h
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k]
            [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
            [--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [-f]
            [--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME]
            [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT] [--inc INCREASEBY]
            [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET]
            [--cigar] [-s SSPATH]

Input Files:
  -c INFILE             File containing alignment coordinates (default: None)
  -r REF                Genome A (which is considered as reference for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -q QRY                Genome B (which is considered as query for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -d DELTA              .delta file from mummer. Required for short variation
                        (SNPs/indels) identification when CIGAR string is not
                        available (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -F {T,S,B}            Input file type. T: Table, S: SAM, B: BAM (default: T)
  -k                    Keep intermediate output files (default: False)
  --log {DEBUG,INFO,WARN}
                        log level (default: INFO)
  --lf LOG_FIN          Name of log file (default: syri.log)
  --dir DIR             path to working directory (if not current directory).
                        All files must be in this directory. (default: None)
  --prefix PREFIX       Prefix to add before the output file Names (default: )
  --seed SEED           seed for generating random numbers (default: 1)
  --nc NCORES           number of cores to use in parallel (max is number of
                        chromosomes) (default: 1)
  --novcf               Do not combine all files into one output file
                        (default: False)
  -f                    Filter out low quality alignments (default: True)

SR identification:
  --nosr                Set to skip structural rearrangement identification
                        (default: False)
  --tdgaplen TDGL       Maximum allowed gap-length between two alignments of a
                        multi-alignment translocation or duplication (TD).
                        Larger values increases TD identification sensitivity
                        but also runtime. (default: 500000)
  --tdmaxolp TDOLP      Maximum allowed overlap between two translocations.
                        Value should be in range (0,1]. (default: 0.8)
  -b BRUTERUNTIME       Cutoff to restrict brute force methods to take too
                        much time (in seconds). Smaller values would make
                        algorithm faster, but could have marginal effects on
                        accuracy. In general case, would not be required.
                        (default: 60)
  --unic TRANSUNICOUNT  Number of uniques bps for selecting translocation.
                        Smaller values would select smaller TLs better, but
                        may increase time and decrease accuracy. (default:
                        1000)
  --unip TRANSUNIPERCENT
                        Percent of unique region requried to select
                        translocation. Value should be in range (0,1]. Smaller
                        values would allow selection of TDs which are more
                        overlapped with other regions. (default: 0.5)
  --inc INCREASEBY      Minimum score increase required to add another
                        alignment to translocation cluster solution (default:
                        1000)
  --no-chrmatch         Do not allow SyRI to automatically match chromosome
                        ids between the two genomes if they are not equal
                        (default: False)

ShV identification:
  --nosv                Set to skip structural variation identification
                        (default: False)
  --nosnp               Set to skip SNP/Indel (within alignment)
                        identification (default: False)
  --all                 Use duplications too for variant identification
                        (default: False)
  --allow-offset OFFSET
                        BPs allowed to overlap (default: 5)
  --cigar               Find SNPs/indels using CIGAR string. Necessary for
                        alignment generated using aligners other than nucmers
                        (default: False)
  -s SSPATH             path to show-snps from mummer (default: show-snps)
```

software ref:

software ref:

research ref:
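
Illustrative sketch of the syntenic-path idea:

The overview describes SyRI as first finding the longest set of co-linear alignments and then treating every aligned region outside that set as a candidate rearrangement. The toy Python example below shows that idea on a handful of alignments. It is not SyRI's implementation: the `(ref_start, ref_end, qry_start, qry_end)` tuple layout, the function name `longest_colinear_chain`, and the scoring by reference length are all assumptions made for illustration, and it only handles forward-strand alignments between one pair of chromosomes.

```python
from typing import List, Tuple

# Hypothetical alignment record: (ref_start, ref_end, qry_start, qry_end),
# forward strand only, coordinates inclusive.
Alignment = Tuple[int, int, int, int]


def longest_colinear_chain(alignments: List[Alignment]) -> List[Alignment]:
    """Return the co-linear chain with the largest total reference length.

    Simple O(n^2) dynamic programme; an assumed stand-in for the
    syntenic-path step, not SyRI's actual algorithm.
    """
    if not alignments:
        return []
    alns = sorted(alignments)  # order by reference start
    n = len(alns)
    length = [a[1] - a[0] + 1 for a in alns]
    score = length[:]          # best chain score ending at each alignment
    prev = [-1] * n            # back-pointer to the previous chain member
    for j in range(n):
        for i in range(j):
            # i may precede j only if it ends before j starts in BOTH genomes
            if alns[i][1] < alns[j][0] and alns[i][3] < alns[j][2]:
                if score[i] + length[j] > score[j]:
                    score[j] = score[i] + length[j]
                    prev[j] = i
    # Backtrack from the best-scoring chain end
    k = max(range(n), key=score.__getitem__)
    chain = []
    while k != -1:
        chain.append(alns[k])
        k = prev[k]
    chain.reverse()
    return chain


# Made-up alignments: the second one maps far out of order in the query.
toy = [
    (1, 1000, 1, 1000),
    (1100, 1500, 5000, 5400),
    (1600, 2600, 1100, 2100),
    (2700, 3500, 2200, 3000),
]
syntenic = longest_colinear_chain(toy)
rearranged = [a for a in toy if a not in syntenic]
print("syntenic:", syntenic)
print("candidate rearrangements:", rearranged)
```

In this toy input, the alignment whose query coordinates fall out of order is left off the chain, which mirrors how non-syntenic aligned regions become the input to the later classification step (inversion, translocation, or duplication).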