SQANTI3 5.0.0

2022-07-18

SQANTI3

SQANTI3 is the newest version of the SQANTI tool that merges features from SQANTI and SQANTI2, together with new additions. SQANTI3 will continue as an integrated development aiming to provide the best characterization for your new long read-defined transcriptome.

SQANTI3 is the first module of the Functional IsoTranscriptomics (FIT) framework, which also includes IsoAnnot and tappAS.

Latest updates

Latest SQANTI3 release (01/06/2022) is version 5.0.

WARNING: v5.0 constitutes a major release of the SQANTI3 software. Versions of SQANTI3 >= 5.0 are not backward compatible with previous releases and their output (v4.3 and earlier). Users who wish to apply any of the new functionalities in v5.0 to output files from older versions will therefore need to re-run SQANTI3 QC.

New features implemented in SQANTI3 v5.0:

Implemented new machine learning-based filter.
Updated rules filter: users can now define their own set of rules using a JSON file. By default, the rules filter applies the same set of rules that were implemented in the old sqanti3_RulesFilter.py script.
The sqanti3_RulesFilter.py script is now deprecated and has been replaced by sqanti3_filter.py, which works as a wrapper for both filters (see details in the documentation).
IsoAnnotLite updated to version 2.7.3.
Substantial modification of the SQANTI3 directory structure: the utilities folder is now divided into subfolders that group the scripts by function.
Added a column in the classification file to indicate whether a polyA motif was found, which adds to the existing column detailing the detected motif (details here).
Changed CAGE argument and CAGE/polyA columns to capital letters (for consistency across columns and arguments).
The example folder now includes sample commands and output files for SQANTI3 QC, rules filter and machine learning filter.
Added new supported transcript model (STM) plots to the SQANTI3 QC report.
Minor fixes/enhancements:
- Included cython (cDNA_cupcake dependency) as a dependency in the SQANTI3 conda environment.
- pip installed in conda environment.
- When supplied, the new sqanti3_filter.py filters the sqanti3_qc.py output files using the filter result (rules or ML). This was not previously done by sqanti3_RulesFilter.py.
- Antisense vs intergenic bug: fixed inconsistencies in classification of isoforms across the two categories.
- Fixed deprecation warnings in calculation of ratioTSS.
- Minor report updates.

Documentation

For detailed documentation, please visit the SQANTI3 wiki.

Wiki contents:

Activating the conda env

Check out a node with qrsh and then run these commands:

bash
source /local/cluster/SQANTI3/activate.sh

To run under SGE, write a bash script that contains the source line above followed by the SQANTI3 commands you wish to run, then submit it with qsub.
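As a rough illustration of the SGE workflow described above, the sketch below writes a minimal job script to the current directory without submitting it. The activation path comes from this page. The job name, input file names, output prefix and output directory are hypothetical placeholders. The `--gtf`, `-t`, `-o` and `-d` flags are taken from the help message shown later on this page and may differ in other SQANTI3 versions.

```shell
#!/bin/sh
# Writes a minimal SGE job script to the current directory; it does not submit it.
set -eu

job_script="sqanti3_job.sh"   # hypothetical file name

# The quoted 'EOF' keeps ${NSLOTS} and the "#$" directives literal in the output file.
cat > "$job_script" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N sqanti3_qc

# Activate the SQANTI3 conda environment (path from this page).
source /local/cluster/SQANTI3/activate.sh

# Hypothetical input files and output names; replace with your own.
# NSLOTS is set by SGE to the number of granted slots; fall back to 1 if unset.
sqanti3_qc.py isoforms.gtf annotation.gtf genome.fasta \
    --gtf -t "${NSLOTS:-1}" -o sample -d sqanti3_out
EOF

echo "Wrote $job_script (submit with: qsub $job_script)"
```

Requesting more than one slot depends on the parallel environments configured on your cluster, so no `-pe` directive is included here.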

Location and version

$ which sqanti3_qc.py
/local/cluster/SQANTI3/bin/sqanti3_qc.py
$ sqanti3_qc.py --version
R scripting front-end version 3.6.1 (2019-07-05)
SQANTI3 2.0.0

Help message

$ sqanti3_qc.py --help
R scripting front-end version 3.6.1 (2019-07-05)
usage: sqanti3_qc.py [-h] [--min_ref_len MIN_REF_LEN] [--force_id_ignore]
                     [--aligner_choice {minimap2,deSALT,gmap}]
                     [--cage_peak CAGE_PEAK]
                     [--polyA_motif_list POLYA_MOTIF_LIST]
                     [--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED]
                     [--skipORF] [--is_fusion] [--orf_input ORF_INPUT] [-g]
                     [-e EXPRESSION] [-x GMAP_INDEX] [-t CPUS] [-n CHUNKS]
                     [-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
                     [--genename] [-fl FL_COUNT] [-v] [--isoAnnotLite]
                     [--gff3 GFF3]
                     isoforms annotation genome

Structural and Quality Annotation of Novel Transcript Isoforms

positional arguments:
  isoforms              Isoforms (FASTA/FASTQ) or GTF format. Recommend
                        provide GTF format with the --gtf option.
  annotation            Reference annotation file (GTF format)
  genome                Reference genome (Fasta format)

optional arguments:
  -h, --help            show this help message and exit
  --min_ref_len MIN_REF_LEN
                        Minimum reference transcript length (default: 200 bp)
  --force_id_ignore     Allow the usage of transcript IDs non related with
                        PacBio's nomenclature (PB.X.Y)
  --aligner_choice {minimap2,deSALT,gmap}
  --cage_peak CAGE_PEAK
                        FANTOM5 Cage Peak (BED format, optional)
  --polyA_motif_list POLYA_MOTIF_LIST
                        Ranked list of polyA motifs (text, optional)
  --polyA_peak POLYA_PEAK
                        PolyA Peak (BED format, optional)
  --phyloP_bed PHYLOP_BED
                        PhyloP BED for conservation score (BED, optional)
  --skipORF             Skip ORF prediction (to save time)
  --is_fusion           Input are fusion isoforms, must supply GTF as input
                        using --gtf
  --orf_input ORF_INPUT
                        Input fasta to run ORF on. By default, ORF is run on
                        genome-corrected fasta - this overrides it. If input
                        is fusion (--is_fusion), this must be provided for ORF
                        prediction.
  -g, --gtf             Use when running SQANTI by using as input a gtf of
                        isoforms
  -e EXPRESSION, --expression EXPRESSION
                        Expression matrix (supported: Kallisto tsv)
  -x GMAP_INDEX, --gmap_index GMAP_INDEX
                        Path and prefix of the reference index created by
                        gmap_build. Mandatory if using GMAP unless -g option
                        is specified.
  -t CPUS, --cpus CPUS  Number of threads used during alignment by aligners.
                        (default: 10)
  -n CHUNKS, --chunks CHUNKS
                        Number of chunks to split SQANTI3 analysis in for
                        speed up (default: 1).
  -o OUTPUT, --output OUTPUT
                        Prefix for output files.
  -d DIR, --dir DIR     Directory for output files. Default: Directory where
                        the script was run.
  -c COVERAGE, --coverage COVERAGE
                        Junction coverage files (provide a single file, comma-
                        delmited filenames, or a file pattern, ex:
                        "mydir/*.junctions").
  -s SITES, --sites SITES
                        Set of splice sites to be considered as canonical
                        (comma-separated list of splice sites). Default:
                        GTAG,GCAG,ATAC.
  -w WINDOW, --window WINDOW
                        Size of the window in the genomic DNA screened for
                        Adenine content downstream of TTS
  --genename            Use gene_name tag from GTF to define genes. Default:
                        gene_id used to define genes
  -fl FL_COUNT, --fl_count FL_COUNT
                        Full-length PacBio abundance file
  -v, --version         Display program version number.
  --isoAnnotLite        Run isoAnnot Lite to output a tappAS-compatible gff3
                        file
  --gff3 GFF3           Precomputed tappAS species specific GFF3 file. It will
                        serve as reference to transfer functional attributes
$ sqanti3_RulesFilter.py
R scripting front-end version 3.6.1 (2019-07-05)
usage: sqanti3_RulesFilter.py [-h] [--sam SAM] [--faa FAA] [-a INTRAPRIMING]
                              [-r RUNALENGTH] [-m MAX_DIST_TO_KNOWN_END]
                              [-c MIN_COV] [--filter_mono_exonic] [--skipGTF]
                              [--skipFaFq] [--skipJunction] [-v]
                              sqanti_class isoforms gtf_file
sqanti3_RulesFilter.py: error: the following arguments are required: sqanti_class, isoforms, gtf_file
$ sqanti3_RulesFilter.py -h
R scripting front-end version 3.6.1 (2019-07-05)
usage: sqanti3_RulesFilter.py [-h] [--sam SAM] [--faa FAA] [-a INTRAPRIMING]
                              [-r RUNALENGTH] [-m MAX_DIST_TO_KNOWN_END]
                              [-c MIN_COV] [--filter_mono_exonic] [--skipGTF]
                              [--skipFaFq] [--skipJunction] [-v]
                              sqanti_class isoforms gtf_file

Filtering of Isoforms based on SQANTI3 attributes

positional arguments:
  sqanti_class          SQANTI classification output file.
  isoforms              fasta/fastq isoform file to be filtered by SQANTI3
  gtf_file              GTF of the input fasta/fastq

optional arguments:
  -h, --help            show this help message and exit
  --sam SAM             (Optional) SAM alignment of the input fasta/fastq
  --faa FAA             (Optional) ORF prediction faa file to be filtered by
                        SQANTI3
  -a INTRAPRIMING, --intrapriming INTRAPRIMING
                        Adenine percentage at genomic 3' end to flag an
                        isoform as intra-priming (default: 0.6)
  -r RUNALENGTH, --runAlength RUNALENGTH
                        Continuous run-A length at genomic 3' end to flag an
                        isoform as intra-priming (default: 6)
  -m MAX_DIST_TO_KNOWN_END, --max_dist_to_known_end MAX_DIST_TO_KNOWN_END
                        Maximum distance to an annotated 3' end to preserve as
                        a valid 3' end and not filter out (default: 50bp)
  -c MIN_COV, --min_cov MIN_COV
                        Minimum junction coverage for each isoform (only used
                        if min_cov field is not 'NA'), default: 3
  --filter_mono_exonic  Filter out all mono-exonic transcripts (default: OFF)
  --skipGTF             Skip output of GTF
  --skipFaFq            Skip output of isoform fasta/fastq
  --skipJunction        Skip output of junctions file
  -v, --version         Display program version number.
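
The two help messages above can be combined into a QC-then-filter run. The following is a minimal dry-run sketch, not a definitive pipeline: it prints the commands by default and only executes them when `DRY_RUN=0` is set. The input file names, the `sample` prefix and the `sqanti3_out` directory are hypothetical. The names of the classification, corrected FASTA and corrected GTF files passed to the filter are assumptions about what the QC step writes, so check your own output directory. The flags are those listed in the help output above. Releases from v5.0 onward replace sqanti3_RulesFilter.py with sqanti3_filter.py, whose options are not shown on this page.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands by default; set DRY_RUN=0 to execute them.
set -eu

# Hypothetical inputs and output locations; replace with your own files.
isoforms="isoforms.gtf"
annotation="annotation.gtf"
genome="genome.fasta"
prefix="sample"
outdir="sqanti3_out"

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: QC. --gtf marks the isoforms input as GTF, per the help text above.
run sqanti3_qc.py "$isoforms" "$annotation" "$genome" \
    --gtf -t 4 -o "$prefix" -d "$outdir"

# Step 2: rules filter, with the documented defaults written out explicitly.
# The three input file names are assumptions about what step 1 produced.
run sqanti3_RulesFilter.py \
    "$outdir/${prefix}_classification.txt" \
    "$outdir/${prefix}_corrected.fasta" \
    "$outdir/${prefix}_corrected.gtf" \
    -a 0.6 -r 6 -m 50 -c 3
```

Writing the defaults (`-a 0.6 -r 6 -m 50 -c 3`) on the command line changes nothing functionally, but it records the thresholds used alongside the results.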

software ref: https://github.com/ConesaLab/SQANTI3
research ref: https://doi.org/10.1101/gr.222976.117
research ref: https://github.com/ConesaLab/SQANTI3#how-to-cite-sqanti3