ReadsMap help

When use please reference:
Seledtsov I., Molodtsov V., Solovyev V. ReadMap: Aligning next generation sequences to the genome sequence. http://linux5.softberry.com/cgi-bin/berry/programs/ReadsMap

Description

ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map “exonic” reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining “non-mapped” group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these “non-mapped” reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.

Accuracy of reads mapping was estimated on a set of read pairs (each read is 76 base long and generated with 2% of random mutations) from 1356 mRNA sequences of human chromosome 22. We computed performance of ReadsMap and popular Bowtie read mapping program (Genome Biology, 2009, 10:R25) in aligning 1963694 of such reads to chromosome 22 sequence.

Table 1. ReadsMap performance
READS Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.92
Spliced 484311 (99.95%) 507887 477170 0.985 0.94
All 1963441 (99.99) 2118534 1952916 0.994 0.92

Extra alignments affecting specificity of mapping are often produced due to occurrence of alternatively spliced variant of mRNA.

On the same data we have estimated Bowtie (v2) performance. As the Bowtie gap penalty is proportional to gap size it does not produce full alignment across introns that are thousands nucleotides long. Therefore we made comparison by accounting the accuracy of alignment of the most long alignment fragment.

Table 2. ReadsMap vs Bowtie performance
ReadsMap Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 1479129 1.00 0.92
Spliced 484311 (99.95%) 507887 477170 0.999 0.94
All 1963441 (99.99) 2118534 1952916 0.999 0.93
Bowtie Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.99
Spliced 484311 (99.95%) 507887 477170 0.375 0.99
All 1963441 (99.99) 2118534 1952916 0.834 0.99

We can conclude that the current version of Bowtie is not suitable for mapping spliced reads from RNASeq data having Sn=0.38 compared with ReadsMap Sn=0.99 (Table2).

In Fig 1 a typical example of a spliced read alignment is presented. While the longest fragment of alignment is located by Bowtie in the right place, the short fragment is aligned incorrectly.

Fig 1. ReadsMap alignment:

          nnnnnnnnnnnnnnnn(..)gtcagaaagtaactggCAAATT]ctatgtataaaattgt(..)taatgtaaacacttac[
          ................(..)................|||||| ................(..)................ 
          ----------------(..)----------------CAAATT ----------------(..)---------------- 
          1         1         1         1         5         7         7         7         
                        
   16277748  16277758  16277768  16277778  16277788  16277798  16277808  16277818
          ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAAtgttttgctc
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||..........
          ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA----------
                                                                         

Bowtie alignment:

          1        11  16277729  16277739  16277747  16277752  16277762  16277772
          nnnnnnnnnnnnnnnn(..)ttttaatgtaaacact?[TACA]---[CATTATGACGACTAGAAACAGCATACTCTCTGG
          ................(..)................  0|0| ... |||||||||||||||||||||||||||||||||
          ----------------(..)----------------  CAAA tta CATTATGACGACTAGAAACAGCATACTCTCTGG
          1         1         1         1         3        11        21        31
   16277782  16277792  16277802  16277812  16277820  16277830  51304553  51304563
          CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA]?tgttttgctcaagtag(..)nnnnnnnnnnnnnnnn
          ||||||||||||||||||||||||||||||||||||  ................(..)................
          CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA  ----------------(..)----------------
         41        51        61        71        77        77        77        77

ReadsMap can exactly locate even very short alignment fragments separated by long introns as it uses information from alignments of all set of reads.

Steps of pipeline

Mapping of reads to chromosomes (contigs) is performed in several stages.
In the beginning, the search for possible variants of mapping subject to masking of sequences and cutting off “unreliable” fragments is to be done.
Further, the obtained “draft” alignments are being sorted and processed in order to restore the coordinates lost during cutting off. At the same stage occurs the initial “rough” filtering of alignments.
The next step is the search for introns at ends of alignments based on the list of valid introns supported by other reads.
Further processing allows to rebuild the most possible link of read (or pair of reads) to a certain chromosome. To do this, the value of maximal chromosome coverage for every read (as a measure of alignment's quality) is calculated. All alignments of a certain read with coverage values below the reliability threshold are to be removed. Thus, the alignments with the best quality (the one best alignment in limiting case) remain.
The “cleared” during this procedure list of alignments is used for reiterated attempt to find introns at the ends of reads. The obtained result can be used for further work.
Sequences that are used as initial data can be preliminary processed in order to “mask” unreliable fragments. When the “masking” option is used, the following occurs:
1. Symbols N at the ends of sequences (both target and query) are being removed;
2. All symbols in lowercase are being considered as unknown and thus removed from matching calculation;
3. During the postprocessing, the corrupted coordinates are being restored.
Thus, the masking occurs at data preparation stage. Data prepared in such a way can be utilized both as masked and as unmasked.
1. Preliminary data processing;
2. Run with use of shell;
3. Step-by-step run;
4. Conversion.

Input data

As raw data, the file(s) with chromosomal sequence in FASTA format and file with reads' sequences (MULTI FASTA) are used. Sequences of paired reads are supposed to be mutually sequential, i.e. the first read of the first pair, the second read of the first pair, the first read of the second pair, the second read of the second pair etc.

Masked - enable masking (do not consider lowercase letters).
Paired - perform processing subject to pairness of reads. In our tests, taking the pairness of reads into account had increased a specificity (ratio of correct alignments to the total ones), but slightly decreased sensitivity (ratio of correct alignments to the total number of reads). By default this option is disabled. Paired reads are to be mutually sequential (see above).
No splice - disable spliced aligning (accuracy for non-spliced reads will be increased).
To SAM - convert result into the SAM format.
To Simple - provide results in Simple format.

Output data

The result of calculations is a set of binary files for every chromosome with the name of the following type — xxx:yyy:000:rri.da (xxx:yyy:000:rpi.da in case paired reads), where xxx — the name of file with chromosome, yyy — the name of file with reads.