Transomics pipeline help

When use please reference:
Igor Seledtsov, Vladimir Molodtsov, Peter Kosarev, Victor Solovyev: Transomics pipeline - tools for analysis of RNASeq data. http://linux5.softberry.com/cgi-bin/berry/programs/Transomics

Description

Transomics computational pipeline 1) maps RNA-Seq data to a reference genome, 2) assembles them into transcripts and 3) quantifies the abundance of these transcripts in particular datasets. The pipeline will include the following key components: ReadsMap tools for alignment of RNA-Seq reads to genome, Fgenesh++R identification of alternative transcripts gene prediction tolls, and ExpLevel tool for measuring expression levels of predicted transcript isoforms.

ReadsMap tools

ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map "exonic" reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining "non-mapped" group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these "non-mapped" reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.

Accuracy of reads mapping was estimated on a set of read pairs (each read is 76 base long and generated with 2% of random mutations) from 1356 mRNA sequences of human chromosome 22. We computed performance of ReadsMap and popular Bowtie read mapping program (Genome Biology, 2009, 10:R25) in aligning 1963694 of such reads to chromosome 22 sequence.

Table 1. ReadsMap performance
READS Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.92
Spliced 484311 (99.95%) 507887 477170 0.985 0.94
All 1963441 (99.99) 2118534 1952916 0.994 0.92

Extra alignments affecting specificity of mapping are often produced due to occurrence of alternatively spliced variant of mRNA.

On the same data we have estimated Bowtie (v2) performance. As the Bowtie gap penalty is proportional to gap size it does not produce full alignment across introns that are thousands nucleotides long. Therefore we made comparison by accounting the accuracy of alignment of the most long alignment fragment.

Table 2. ReadsMap vs Bowtie performance
ReadsMap Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 1479129 1.00 0.92
Spliced 484311 (99.95%) 507887 477170 0.999 0.94
All 1963441 (99.99) 2118534 1952916 0.999 0.93
Bowtie Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.99
Spliced 484311 (99.95%) 507887 477170 0.375 0.99
All 1963441 (99.99) 2118534 1952916 0.834 0.99

We can conclude that the current version of Bowtie is not suitable for mapping spliced reads from RNASeq data having Sn=0.38 compared with ReadsMap Sn=0.99 (Table2).

In Fig 1 a typical example of a spliced read alignment is presented. While the longest fragment of alignment is located by Bowtie in the right place, the short fragment is aligned incorrectly.

Fig 1. ReadsMap alignment:

          nnnnnnnnnnnnnnnn(..)gtcagaaagtaactggCAAATT]ctatgtataaaattgt(..)taatgtaaacacttac[
          ................(..)................|||||| ................(..)................ 
          ----------------(..)----------------CAAATT ----------------(..)---------------- 
          1         1         1         1         5         7         7         7         
                        
   16277748  16277758  16277768  16277778  16277788  16277798  16277808  16277818
          ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAAtgttttgctc
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||..........
          ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA----------
                                                                         

Bowtie alignment:

          1        11  16277729  16277739  16277747  16277752  16277762  16277772
          nnnnnnnnnnnnnnnn(..)ttttaatgtaaacact?[TACA]---[CATTATGACGACTAGAAACAGCATACTCTCTGG
          ................(..)................  0|0| ... |||||||||||||||||||||||||||||||||
          ----------------(..)----------------  CAAA tta CATTATGACGACTAGAAACAGCATACTCTCTGG
          1         1         1         1         3        11        21        31
   16277782  16277792  16277802  16277812  16277820  16277830  51304553  51304563
          CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA]?tgttttgctcaagtag(..)nnnnnnnnnnnnnnnn
          ||||||||||||||||||||||||||||||||||||  ................(..)................
          CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA  ----------------(..)----------------
         41        51        61        71        77        77        77        77

ReadsMap can exactly locate even very short alignment fragments separated by long introns as it uses information from alignments of all set of reads.

Fgenesh++R - gene prediction using RNASeq data

We build the gene identification with RNA-Seq data based on components of FGENESH++ gene finding pipeline (Salamov, Solovyev, 2000; Solovyev et al., 2006; Solovyev, 2007), which is known as one of the most a ccurate eukaryotic gene identification tool and has been used in annotation of a dozen new genomes.

The general scheme of Transomics pipeline.

ReadsMap provide information on splice sites and intron positions that forces modified version of Fgenesh++ to produce gene models that are supported by RNASeq data.
Fgenesh++R provides possibility to predict tissue specific gene variant or even produce alternatively spliced gene models. These models can be visualized in the ReadsMap Viewer.

Alternative gene models predicted by Fgenesh++R pipeline.

Finally, Transomics pipeline includes a module to compute a relative abundance (expression level) of alternative transcripts generated from the same gene locus using a solution of a system of linear equations.

Relative accuracy of spike-in transcript quantification submitted by 11 participants of the RGASP assessment experiment (presented at the workshop by as in Figue below., our group is the rightmost one. Dr. Kokocinski, The Sanger Institute, Cambridge, member of the assessor's group). The group names are the same.