SQANTI3 5.1.0

2022-07-29

Conda

See the ‘activating the conda environment’ section below to access this software.

SQANTI3-5.1.0

SQANTI3 is the newest version of the SQANTI tool that merges features from SQANTI and SQANTI2, together with new additions. SQANTI3 will continue as an integrated development aiming to provide the best characterization for your new long read-defined transcriptome.

SQANTI3 is the first module of the Functional IsoTranscriptomics (FIT) framework, which also includes IsoAnnot and tappAS.

New features in SQANTI3 v5.1 [LATEST]:

Major changes:

Implemented new rescue strategy to recover transcriptome diversity lost after filtering (see details at the SQ rescue wiki).
Updated conda environment to include rescue dependencies. We recommend creating the environment again in order for SQANTI3 to run without error.
Fixed behavior of mono-exon transcripts during ML filter:
- FSM now undergo intra-priming evaluation if they are mono-exons.
- Corrected ML filter output when --force_multi_exon option is supplied: mono-exon transcripts will now be labeled as Artifacts.
Fixed reasons file output by rules filter: the table now includes correct filtering reasons for mono-exon transcripts.
Added an option to rules filter to control for mono-exon transcripts (previously available in ML filter).
Modified the output of SQANTI3 QC to incorporate the creation of a complete params.txt file, i.e. including all arguments and the full paths of all supplied files.

Minor fixes/enhancements:

Fixed output path for IsoAnnotLite GFF3 that prevented writing the file to the correct output directory when the --gff3 option was not used.
Set temporary file dir for HTML report creation (fixes Singularity container error).

New features in SQANTI3 v5.0:

Major changes:

Implemented new machine learning-based filter.
Updated rules filter: users can now define their own set of rules using a JSON file. By default, the rules filter applies the same set of rules that were implemented in the old sqanti3_RulesFilter.py script.
The sqanti3_RulesFilter.py script is now deprecated and has been replaced by sqanti3_filter.py, which works as a wrapper for both filters (see details in the documentation).
IsoAnnotLite updated to version 2.7.3.
Substantial modification of the SQANTI3 directory structure, with utilities folder now being divided into subfolders that group the scripts by their function.
Added a column in the classification file to indicate whether a polyA motif was found, which adds to the existing column detailing the detected motif (details here).
Changed CAGE argument and CAGE/polyA columns to capital letters (for consistency across columns and arguments).
The example folder now includes sample commands and output files for SQANTI3 QC, rules filter and machine learning filter.
Added new supported transcript model (STM) plots to the SQANTI3 QC report.

Minor fixes/enhancements:

Included cython (cDNA_cupcake dependency) as a dependency in the SQANTI3 conda environment.
pip installed in conda environment.
When supplied, the new sqanti3_filter.py filters the sqanti3_qc.py output files using the filter result (rules or ML). This was not previously done by sqanti3_RulesFilter.py.
Antisense vs intergenic bug: fixed inconsistencies in classification of isoforms across the two categories.
Fixed deprecation warnings in calculation of ratioTSS.
Minor report updates.

Documentation

For detailed documentation, please visit the SQANTI3 wiki.

Please note that we are currently updating and expanding the wiki to provide as much information as possible and enhance the SQANTI3 user experience. Pages that are under construction, or where information is still missing, will be indicated where appropriate. Thank you for your patience!

How to cite SQANTI3

The SQANTI3 paper is currently in preparation. In the meantime, when using SQANTI3 in your research, please cite the original SQANTI paper as well as this repository:

Tardaguila M, de la Fuente L, Marti C, et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res, 2018. 28(3):396-411. doi:10.1101/gr.222976.117

Activating the conda environment

Check out a node with qrsh and then run these commands:

bash
source /local/cluster/SQANTI3/activate.sh

To use SQANTI3 in an SGE batch job, write a bash script that contains the source line above, followed by the SQANTI3 commands you wish to run.
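
As a minimal sketch of such a job script, the snippet below writes a file called sqanti3_job.sh (a hypothetical name) into the current directory. The `#$` lines are standard SGE directives, and `NSLOTS` is the variable SGE sets to the number of slots granted to a job. The job name, output prefix, output directory and input file names are placeholders, and any queue or parallel environment request your site requires is not shown.

```shell
#!/bin/bash
# Write a minimal SGE job script for SQANTI3 QC.
# Job name, output prefix/directory and input file names are placeholders
# (assumptions); replace them with your own values.
cat > sqanti3_job.sh <<'EOF'
#!/bin/bash
#$ -N sqanti3_qc
#$ -cwd
#$ -S /bin/bash

# Activate the SQANTI3 conda environment (path given on this page)
source /local/cluster/SQANTI3/activate.sh

# SQANTI3 command(s) to run. NSLOTS is set by SGE inside a job;
# fall back to a single thread if it is unset.
sqanti3_qc.py \
    -t "${NSLOTS:-1}" \
    -o sample \
    -d sqanti3_out \
    isoforms.gtf annotation.gtf genome.fasta
EOF

echo "Wrote sqanti3_job.sh; submit it with: qsub sqanti3_job.sh"
```

The heredoc delimiter is quoted so that `${NSLOTS:-1}` is written literally and only expanded when the job runs on the compute node.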

Location and version

$ which sqanti3_qc.py
/local/cluster/SQANTI3/bin/sqanti3_qc.py
$ sqanti3_qc.py --version
R scripting front-end version 4.1.3 (2022-03-10)
SQANTI3 5.0
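
If `which` prints nothing, the environment is probably not active in the current shell. The following is a small guard that could be placed at the top of a script to catch this early. The function name `require_tool` is purely illustrative and is not part of SQANTI3 or the cluster setup.

```shell
#!/bin/bash
# Return 0 if the named program is on PATH, non-zero otherwise.
# require_tool is a hypothetical helper, not part of SQANTI3.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1 (was the activate script sourced in this shell?)" >&2
        return 1
    fi
}

# Warn, but do not abort, when SQANTI3 is not available.
require_tool sqanti3_qc.py || echo "SQANTI3 is not available in this shell" >&2
```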

Help message

$ sqanti3_qc.py --help
R scripting front-end version 4.1.3 (2022-03-10)
usage: sqanti3_qc.py [-h] [--min_ref_len MIN_REF_LEN] [--force_id_ignore]
                     [--aligner_choice {minimap2,deSALT,gmap,uLTRA}] [--CAGE_peak CAGE_PEAK]
                     [--polyA_motif_list POLYA_MOTIF_LIST] [--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED]
                     [--skipORF] [--is_fusion] [--orf_input ORF_INPUT] [--fasta] [-e EXPRESSION] [-x GMAP_INDEX]
                     [-t CPUS] [-n CHUNKS] [-o OUTPUT] [-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW] [--genename]
                     [-fl FL_COUNT] [-v] [--saturation] [--report {html,pdf,both,skip}] [--isoAnnotLite]
                     [--gff3 GFF3] [--short_reads SHORT_READS] [--SR_bam SR_BAM]
                     isoforms annotation genome

Structural and Quality Annotation of Novel Transcript Isoforms

positional arguments:
  isoforms              Isoforms (FASTA/FASTQ) or GTF format. It is recommended to provide them in GTF format,
                        but if it is needed to map the sequences to the genome use a FASTA/FASTQ file with the
                        --fasta option.
  annotation            Reference annotation file (GTF format)
  genome                Reference genome (Fasta format)

optional arguments:
  -h, --help            show this help message and exit
  --min_ref_len MIN_REF_LEN
                        Minimum reference transcript length (default: 200 bp)
  --force_id_ignore     Allow the usage of transcript IDs non related with PacBio's nomenclature (PB.X.Y)
  --aligner_choice {minimap2,deSALT,gmap,uLTRA}
  --CAGE_peak CAGE_PEAK
                        FANTOM5 Cage Peak (BED format, optional)
  --polyA_motif_list POLYA_MOTIF_LIST
                        Ranked list of polyA motifs (text, optional)
  --polyA_peak POLYA_PEAK
                        PolyA Peak (BED format, optional)
  --phyloP_bed PHYLOP_BED
                        PhyloP BED for conservation score (BED, optional)
  --skipORF             Skip ORF prediction (to save time)
  --is_fusion           Input are fusion isoforms, must supply GTF as input
  --orf_input ORF_INPUT
                        Input fasta to run ORF on. By default, ORF is run on genome-corrected fasta - this
                        overrides it. If input is fusion (--is_fusion), this must be provided for ORF prediction.
  --fasta               Use when running SQANTI by using as input a FASTA/FASTQ with the sequences of isoforms
  -e EXPRESSION, --expression EXPRESSION
                        Expression matrix (supported: Kallisto tsv)
  -x GMAP_INDEX, --gmap_index GMAP_INDEX
                        Path and prefix of the reference index created by gmap_build. Mandatory if using GMAP
                        unless -g option is specified.
  -t CPUS, --cpus CPUS  Number of threads used during alignment by aligners. (default: 10)
  -n CHUNKS, --chunks CHUNKS
                        Number of chunks to split SQANTI3 analysis in for speed up (default: 1).
  -o OUTPUT, --output OUTPUT
                        Prefix for output files.
  -d DIR, --dir DIR     Directory for output files. Default: Directory where the script was run.
  -c COVERAGE, --coverage COVERAGE
                        Junction coverage files (provide a single file, comma-delmited filenames, or a file
                        pattern, ex: "mydir/*.junctions").
  -s SITES, --sites SITES
                        Set of splice sites to be considered as canonical (comma-separated list of splice sites).
                        Default: GTAG,GCAG,ATAC.
  -w WINDOW, --window WINDOW
                        Size of the window in the genomic DNA screened for Adenine content downstream of TTS
  --genename            Use gene_name tag from GTF to define genes. Default: gene_id used to define genes
  -fl FL_COUNT, --fl_count FL_COUNT
                        Full-length PacBio abundance file
  -v, --version         Display program version number.
  --saturation          Include saturation curves into report
  --report {html,pdf,both,skip}
                        select report format --html --pdf --both --skip
  --isoAnnotLite        Run isoAnnot Lite to output a tappAS-compatible gff3 file
  --gff3 GFF3           Precomputed tappAS species specific GFF3 file. It will serve as reference to transfer
                        functional attributes
  --short_reads SHORT_READS
                        File Of File Names (fofn, space separated) with paths to FASTA or FASTQ from Short-Read
                        RNA-Seq. If expression or coverage files are not provided, Kallisto (just for pair-end
                        data) and STAR, respectively, will be run to calculate them.
  --SR_bam SR_BAM       Directory or fofn file with the sorted bam files of Short Reads RNA-Seq mapped against
                        the genome
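
Based on the usage line above, a basic QC run takes three positional arguments (isoforms, annotation, genome) plus optional flags. The sketch below assembles such a command from options that appear in the help message (`-t`, `-o`, `-d`, `--report`) and prints it. It only executes the command when the tool and all input files are actually present. Every file name, the output prefix and the thread count are placeholders rather than recommended values.

```shell
#!/bin/bash
# Placeholder inputs (assumptions): replace with real paths.
isoforms="isoforms.gtf"
annotation="annotation.gtf"
genome="genome.fasta"
outdir="sqanti3_out"
prefix="sample"
threads=4

# Store the command in the positional parameters so that each argument
# stays a separate word, even if a path contains spaces.
set -- sqanti3_qc.py \
    -t "$threads" -o "$prefix" -d "$outdir" --report html \
    "$isoforms" "$annotation" "$genome"

cmd_line="$*"
echo "Command: $cmd_line"

# Run only when the tool is on PATH and all inputs exist;
# otherwise just show the command.
if command -v sqanti3_qc.py >/dev/null 2>&1 \
   && [ -f "$isoforms" ] && [ -f "$annotation" ] && [ -f "$genome" ]; then
    mkdir -p "$outdir"
    "$@"
else
    echo "Dry run: tool or input files not found, nothing executed." >&2
fi
```

Printing the command before running it makes it easy to copy the exact invocation into a job script or a lab notebook.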

software ref: https://github.com/ConesaLab/SQANTI3
research ref: https://doi.org/10.1101/gr.222976.117
research ref: https://github.com/ConesaLab/SQANTI3#how-to-cite-sqanti3