TransDecoder 5.5

2022-03-17

TransDecoder (Find Coding Regions Within Transcripts)

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed from RNA-Seq alignments to the genome using TopHat and Cufflinks.

TransDecoder identifies likely coding sequences based on the following criteria:

a minimum-length open reading frame (ORF) is found in a transcript sequence.
a log-likelihood score, similar to what is computed by the GeneID software, is > 0.
the above coding score is greatest when the ORF is scored in the 1st reading frame, as compared to the scores in the other 2 forward reading frames.
if a candidate ORF is fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc.).
a PSSM is built/trained/used to refine the start codon prediction.
(optional) the putative peptide has a match to a Pfam domain above the noise cutoff score.
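
In practice these criteria are applied across two commands: TransDecoder.LongOrfs extracts the minimum-length ORFs, and TransDecoder.Predict scores them and reports the likely coding regions. The following is a minimal sketch of that two-step run, using only the `-t` and `-m` flags from the help output. The input name `transcripts.fasta`, the `run` helper, and the `DRY_RUN` variable are illustrative assumptions and not part of TransDecoder.

```shell
#!/bin/sh
# Sketch of the two-step TransDecoder run. Prints commands by default;
# set DRY_RUN=0 to execute them.
set -eu

transcripts="transcripts.fasta"   # placeholder input file name (assumption)
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: echo the command, and execute it only when DRY_RUN=0
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@"
}

# Step 1: extract ORFs of at least 100 amino acids (-m 100 is the documented default).
run TransDecoder.LongOrfs -t "$transcripts" -m 100

# Step 2: score the candidate ORFs and report the likely coding regions.
run TransDecoder.Predict -t "$transcripts"
```

Both steps take the same `-t` file, so Predict can find the output directory that LongOrfs wrote (by default the base name of the `-t` value plus `.transdecoder_dir`).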

Location:

$ which TransDecoder.LongOrfs
/local/cluster/bin/TransDecoder.LongOrfs
$ TransDecoder.LongOrfs --version
TransDecoder.LongOrfs 5.5.0
$ which TransDecoder.Predict
/local/cluster/bin/TransDecoder.Predict
$ TransDecoder.Predict --version
TransDecoder.Predict 5.5.0

Util dir:

$ ls /local/cluster/TransDecoder/util/
bin/                                  gff3_file_to_proteins.pl*     refine_gff3_group_iso_strip_utrs.pl*
cdna_alignment_orf_to_genome_orf.pl*  gff3_gene_to_gtf_format.pl*   refine_hexamer_scores.pl*
compute_base_probs.pl*                gtf_genome_to_cdna_fasta.pl*  remove_eclipsed_ORFs.pl*
exclude_similar_proteins.pl*          gtf_to_alignment_gff3.pl*     score_CDS_likelihood_all_6_frames.pl*
fasta_prot_checker.pl*                gtf_to_bed.pl*                select_best_ORFs_per_transcript.pl*
ffindex_resume.pl*                    misc/                         seq_n_baseprobs_to_loglikelihood_vals.pl*
gene_list_to_gff.pl*                  nr_ORFs_gff3.pl*              start_codon_refinement.pl*
get_FL_accs.pl*                       pfam_mpi.pbs*                 train_start_PWM.pl*
get_longest_ORF_per_transcript.pl*    pfam_runner.pl*               uri_unescape.pl*
get_top_longest_fasta_entries.pl*     PWM/
gff3_file_to_bed.pl*                  __pwm_tests/
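
For transcripts built from genome alignments (the Cufflinks case mentioned above), several of these util scripts are typically chained around the two main commands. The sketch below shows one plausible sequence. The file names `transcripts.gtf` and `genome.fasta`, the `run_to` helper, and the `DRY_RUN` variable are assumptions for illustration. The argument order for each script is as recalled from the TransDecoder wiki and should be checked against the script's own usage message before running.

```shell
#!/bin/sh
# Sketch: genome-based transcripts -> ORF predictions in genome coordinates.
# Prints commands by default; set DRY_RUN=0 to execute them.
set -eu

util="/local/cluster/TransDecoder/util"   # util directory listed above
gtf="transcripts.gtf"                     # placeholder: Cufflinks transcript structures
genome="genome.fasta"                     # placeholder: reference genome sequence
DRY_RUN="${DRY_RUN:-1}"

# run_to FILE CMD...: echo the command, and when DRY_RUN=0 run it with stdout sent to FILE
run_to() {
    out="$1"
    shift
    echo "+ $* > $out"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@" > "$out"
}

# 1. Build a FASTA file of transcript sequences from the GTF and the genome.
run_to transcripts.fasta "$util/gtf_genome_to_cdna_fasta.pl" "$gtf" "$genome"

# 2. Convert the transcript structures to an alignment-style GFF3 file.
run_to transcripts.gff3 "$util/gtf_to_alignment_gff3.pl" "$gtf"

# 3. Run TransDecoder.LongOrfs and TransDecoder.Predict on transcripts.fasta here.

# 4. Project the transcript-level ORF predictions back onto genome coordinates.
run_to transcripts.fasta.transdecoder.genome.gff3 \
    "$util/cdna_alignment_orf_to_genome_orf.pl" \
    transcripts.fasta.transdecoder.gff3 transcripts.gff3 transcripts.fasta
```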

Help message:

$ TransDecoder.Predict -h

########################################################################################
#             ______                 ___                  __
#            /_  __/______ ____ ___ / _ \___ _______  ___/ /__ ____
#             / / / __/ _ `/ _\(_-</ // / -_) __/ _ \/ _  / -_) __/
#            /_/ /_/ \_,_/_//_/___/____/\__/\__/\___/\_,_/\__/_/   .Predict
#
########################################################################################
#
#  Transdecoder.LongOrfs|http://transdecoder.github.io> - Transcriptome Protein Prediction
#
#
#  Required:
#
#   -t <string>                            transcripts.fasta
#
#  Common options:
#
#
#   --retain_long_orfs_mode <string>        'dynamic' or 'strict' (default: dynamic)
#                                        In dynamic mode, sets range according to 1%FDR in random sequence of sameGC content.
#
#
#   --retain_long_orfs_length <int>         under 'strict' mode, retain all ORFs found that are equal or longer than these many nucleotides even if no other evidence
#                                         marks it as coding (default: 1000000) so essentially turned off by default.)
#
#   --retain_pfam_hits <string>            domain table output file from running hmmscan to search Pfam (see transdecoder.github.io for info)
#                                        Any ORF with a pfam domain hit will be retained in the final output.
#
#   --retain_blastp_hits <string>          blastp output in '-outfmt 6' format.
#                                        Any ORF with a blast match will be retained in the final output.
#
#   --single_best_only                     Retain only the single best orf per transcript (prioritized by homologythen orf length)
#
#   --output_dir | -O  <string>            output directory from the TransDecoder.LongOrfs step (default: basename( -t val ) + ".transdecoder_dir")
#
#   -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia, ...)
#
#   --no_refine_starts                     start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.
#
##  Advanced options
#
#    -T <int>                            Top longest ORFs to train Markov Model (hexamer stats) (default: 500)
#                                        Note, 10x this value are first selected for removing redundancies,
#                                        and then this -T value of longest ORFs are selected from the non-redundant set.
#  Genetic Codes
#
#
#   --genetic_code <string>                Universal (default)
#
#        Genetic Codes (derived from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
#
#
Acetabularia
Candida
Ciliate
Dasycladacean
Euplotid
Hexamita
Mesodinium
Mitochondrial-Ascidian
Mitochondrial-Chlorophycean
Mitochondrial-Echinoderm
Mitochondrial-Flatworm
Mitochondrial-Invertebrates
Mitochondrial-Protozoan
Mitochondrial-Pterobranchia
Mitochondrial-Scenedesmus_obliquus
Mitochondrial-Thraustochytrium
Mitochondrial-Trematode
Mitochondrial-Vertebrates
Mitochondrial-Yeast
Pachysolen_tannophilus
Peritrich
SR1_Gracilibacteria
Tetrahymena
Universal
#
#  --version                           show version (5.5.0)
#
#########################################################################################
$ TransDecoder.LongOrfs

########################################################################################
#             ______                 ___                  __
#            /_  __/______ ____ ___ / _ \___ _______  ___/ /__ ____
#             / / / __/ _ `/ _\(_-</ // / -_) __/ _ \/ _  / -_) __/
#            /_/ /_/ \_,_/_//_/___/____/\__/\__/\___/\_,_/\__/_/   .LongOrfs
#
########################################################################################
#
#  Transdecoder.LongOrfs|http://transdecoder.github.io> - Transcriptome Protein Prediction
#
#
#  Required:
#
#    -t <string>                            transcripts.fasta
#
#  Optional:
#
#   --gene_trans_map <string>              gene-to-transcript identifier mapping file (tab-delimited, gene_id<tab>trans_id<return> )
#
#   -m <int>                               minimum protein length (default: 100)
#
#   -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia)
#
#   -S                                     strand-specific (only analyzes top strand)
#
#   --output_dir | -O  <string>            path to intended output directory (default:  basename( -t val ) + ".transdecoder_dir")
#
#   --genetic_code <string>                Universal (default)
#
#        Genetic Codes (derived from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
#
Acetabularia
Candida
Ciliate
Dasycladacean
Euplotid
Hexamita
Mesodinium
Mitochondrial-Ascidian
Mitochondrial-Chlorophycean
Mitochondrial-Echinoderm
Mitochondrial-Flatworm
Mitochondrial-Invertebrates
Mitochondrial-Protozoan
Mitochondrial-Pterobranchia
Mitochondrial-Scenedesmus_obliquus
Mitochondrial-Thraustochytrium
Mitochondrial-Trematode
Mitochondrial-Vertebrates
Mitochondrial-Yeast
Pachysolen_tannophilus
Peritrich
SR1_Gracilibacteria
Tetrahymena
Universal
#
#
#   --version                              show version tag (5.5.0)
#
#########################################################################################
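
The `--retain_pfam_hits` and `--retain_blastp_hits` options expect search results computed between the two TransDecoder steps. The sketch below shows how that might be wired together. It assumes that LongOrfs writes its candidate peptides to `longest_orfs.pep` inside the output directory, and that `hmmscan` (HMMER) and `blastp` (BLAST+) are installed. The database names `Pfam-A.hmm` and `uniprot_sprot.fasta`, the `run` helper, and the `DRY_RUN` variable are placeholders, not site-provided paths.

```shell
#!/bin/sh
# Sketch: keep ORFs with homology support. Prints commands by default;
# set DRY_RUN=0 to execute them.
set -eu

transcripts="transcripts.fasta"                          # placeholder input (assumption)
outdir="$(basename "$transcripts").transdecoder_dir"     # default output dir per the help text
pep="$outdir/longest_orfs.pep"                           # assumed LongOrfs peptide output
pfam_db="Pfam-A.hmm"                                     # placeholder Pfam HMM database
blast_db="uniprot_sprot.fasta"                           # placeholder BLAST protein database
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: echo the command, and execute it only when DRY_RUN=0
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@"
}

# 1. Extract the candidate long ORFs.
run TransDecoder.LongOrfs -t "$transcripts"

# 2. Search the candidate peptides against Pfam (domain table output).
run hmmscan --domtblout pfam.domtblout "$pfam_db" "$pep"

# 3. Search the candidate peptides against a protein database (tabular output).
run blastp -query "$pep" -db "$blast_db" -outfmt 6 -evalue 1e-5 -out blastp.outfmt6

# 4. Predict coding regions, retaining any ORF with a Pfam or blastp hit.
run TransDecoder.Predict -t "$transcripts" \
    --retain_pfam_hits pfam.domtblout \
    --retain_blastp_hits blastp.outfmt6
```

Adding `--single_best_only` to the final command would keep one ORF per transcript, prioritized by homology and then ORF length, as described in the help text.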

software ref: https://github.com/TransDecoder/TransDecoder/wiki
research ref: <>