When use please reference: Seledtsov I., Molodtsov V., Solovyev V. ReadMap: Aligning next generation sequences to the genome sequence. http://linux5.softberry.com/cgi-bin/berry/programs/ReadsMap
ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map “exonic” reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining “non-mapped” group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these “non-mapped” reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.
Accuracy of reads mapping was estimated on a set of read pairs (each read is 76 base long and generated with 2% of random mutations) from 1356 mRNA sequences of human chromosome 22. We computed performance of ReadsMap and popular Bowtie read mapping program (Genome Biology, 2009, 10:R25) in aligning 1963694 of such reads to chromosome 22 sequence.
Extra alignments affecting specificity of mapping are often produced due to occurrence of alternatively spliced variant of mRNA.
On the same data we have estimated Bowtie (v2) performance. As the Bowtie gap penalty is proportional to gap size it does not produce full alignment across introns that are thousands nucleotides long. Therefore we made comparison by accounting the accuracy of alignment of the most long alignment fragment.
We can conclude that the current version of Bowtie is not suitable for mapping spliced reads from RNASeq data having Sn=0.38 compared with ReadsMap Sn=0.99 (Table2).
In Fig 1 a typical example of a spliced read alignment is presented. While the longest fragment of alignment is located by Bowtie in the right place, the short fragment is aligned incorrectly.
Fig 1. ReadsMap alignment:
nnnnnnnnnnnnnnnn(..)gtcagaaagtaactggCAAATT]ctatgtataaaattgt(..)taatgtaaacacttac[ ................(..)................|||||| ................(..)................ ----------------(..)----------------CAAATT ----------------(..)---------------- 1 1 1 1 5 7 7 7 16277748 16277758 16277768 16277778 16277788 16277798 16277808 16277818 ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAAtgttttgctc ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||.......... ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA----------
1 11 16277729 16277739 16277747 16277752 16277762 16277772 nnnnnnnnnnnnnnnn(..)ttttaatgtaaacact?[TACA]---[CATTATGACGACTAGAAACAGCATACTCTCTGG ................(..)................ 0|0| ... ||||||||||||||||||||||||||||||||| ----------------(..)---------------- CAAA tta CATTATGACGACTAGAAACAGCATACTCTCTGG 1 1 1 1 3 11 21 31 16277782 16277792 16277802 16277812 16277820 16277830 51304553 51304563 CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA]?tgttttgctcaagtag(..)nnnnnnnnnnnnnnnn |||||||||||||||||||||||||||||||||||| ................(..)................ CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA ----------------(..)---------------- 41 51 61 71 77 77 77 77
ReadsMap can exactly locate even very short alignment fragments separated by long introns as it uses information from alignments of all set of reads.
Mapping of reads to chromosomes (contigs) is performed in several stages.
In the beginning, the search for possible variants of mapping subject to masking of sequences and cutting off “unreliable” fragments is to be done.
Further, the obtained “draft” alignments are being sorted and processed in order to restore the coordinates lost during cutting off. At the same stage occurs the initial “rough” filtering of alignments.
The next step is the search for introns at ends of alignments based on the list of valid introns supported by other reads.
Further processing allows to rebuild the most possible link of read (or pair of reads) to a certain chromosome. To do this, the value of maximal chromosome coverage for every read (as a measure of alignment's quality) is calculated. All alignments of a certain read with coverage values below the reliability threshold are to be removed. Thus, the alignments with the best quality (the one best alignment in limiting case) remain.
The “cleared” during this procedure list of alignments is used for reiterated attempt to find introns at the ends of reads. The obtained result can be used for further work.
Sequences that are used as initial data can be preliminary processed in order to “mask” unreliable fragments. When the “masking” option is used, the following occurs:
1. Symbols N at the ends of sequences (both target and query) are being removed;
2. All symbols in lowercase are being considered as unknown and thus removed from matching calculation;
3. During the postprocessing, the corrupted coordinates are being restored.
Thus, the masking occurs at data preparation stage. Data prepared in such a way can be utilized both as masked and as unmasked.
1. Preliminary data processing;
2. Run with use of shell;
3. Step-by-step run;
As raw data, the file(s) with chromosomal sequence in FASTA format and file with reads' sequences (MULTI FASTA) are used. Sequences of paired reads are supposed to be mutually sequential, i.e. the first read of the first pair, the second read of the first pair, the first read of the second pair, the second read of the second pair etc.
Masked - enable masking (do not consider lowercase letters).
Paired - perform processing subject to pairness of reads. In our tests, taking the pairness of reads into account had increased a specificity (ratio of correct alignments to the total ones), but slightly decreased sensitivity (ratio of correct alignments to the total number of reads). By default this option is disabled. Paired reads are to be mutually sequential (see above).
No splice - disable spliced aligning (accuracy for non-spliced reads will be increased).
To SAM - convert result into the SAM format.
To Simple - provide results in Simple format.
The result of calculations is a set of binary files for every chromosome with the name of the following type — xxx:yyy:000:rri.da (xxx:yyy:000:rpi.da in case paired reads), where xxx — the name of file with chromosome, yyy — the name of file with reads.