OligoZip help

When using OligoZip, please cite:
Vorobyev D., Seledtsov I., Solovyev V. De novo assembling next generation sequences. http://linux5.softberry.com/cgi-bin/berry/programs/OligoZip

OligoZip general algorithm

OligoZip implements an algorithm for ab initio genome assembly from data produced by next-generation sequencing machines (Illumina/Solexa, etc.).
In this description, a group of assembled reads is called a read “cluster”, an unused read a “free read”, and the set of unused reads the “free reads pool”.
The algorithm begins with an empty cluster.
1) First, the cluster accepts the top read from the free reads pool {P} (which at the start of the calculation contains all reads). The consensus of a single-read cluster coincides with the sequence of that read.
2) Next, a hashing technique is used to select the subset of the free pool whose reads share some similarity with the consensus.
3) An attempt is made to add each read of this subset to the cluster. A read is included if its alignment to the consensus meets certain criteria.
4) After the second (and each subsequent) read is added to the cluster, the cluster consensus is recalculated. Added reads are removed from the free pool.
This process (steps 1 to 4) is repeated as long as the free pool contains reads similar enough to the consensus to be included in the current cluster. The cluster consensus is then output as a contig sequence.
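
The candidate selection in step 2 can be illustrated with a k-mer hash index. This is only a minimal sketch under stated assumptions, not OligoZip's actual implementation: the function names, the choice of exact k-mer matching as the "similarity" test, and the value of k are all hypothetical, and reverse-complement matches are ignored.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map every k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index

def candidate_reads(consensus, index, free, k):
    """Return the free read ids sharing at least one k-mer with the consensus."""
    hits = set()
    for i in range(len(consensus) - k + 1):
        # .get() avoids inserting missing k-mers into the defaultdict
        hits |= index.get(consensus[i:i + k], set()) & free
    return hits

# Toy data (assumed): read 0 seeds the cluster, reads 1 and 2 are still free.
reads = ["ACGTAC", "GTACGG", "TTTTTT"]
index = build_kmer_index(reads, k=4)
print(candidate_reads(reads[0], index, free={1, 2}, k=4))  # {1}
```

Only the reads returned by the lookup need to be aligned to the consensus in step 3, which is what makes the hashing step worthwhile on large read sets.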
The next clusters (and contigs) are assembled in the same way. Each is formed from the pool of reads left unused after the assembly of the previous clusters.
The process stops when either no free reads remain or every free read has already been tested as a cluster-initiating read.
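
The overall loop described above can be sketched as a toy greedy assembler. The sketch makes strong simplifying assumptions that are not taken from OligoZip: the acceptance criterion is an exact suffix/prefix overlap of at least `min_len` bases, the consensus is "recalculated" by simply appending the non-overlapping part of the read, clusters are extended to the right only, and the seed read leaves the pool immediately, so only the "no free reads remain" stopping condition arises. All names are hypothetical.

```python
def overlap(consensus, read, min_len):
    """Longest exact overlap between a consensus suffix and a read prefix.

    Returns the overlap length, or 0 if it is shorter than min_len.
    """
    for n in range(min(len(consensus), len(read)), min_len - 1, -1):
        if consensus.endswith(read[:n]):
            return n
    return 0

def assemble(reads, min_len=3):
    """Greedy clustering: seed a cluster, grow it, emit a contig, repeat."""
    free = list(range(len(reads)))      # free reads pool, in input order
    contigs = []
    while free:
        seed = free.pop(0)              # step 1: top read starts the cluster
        consensus = reads[seed]
        extended = True
        while extended:                 # repeat while some read can be added
            extended = False
            for r in list(free):        # iterate over a copy; free is modified
                n = overlap(consensus, reads[r], min_len)
                if n:                   # step 3: read meets the criterion
                    consensus += reads[r][n:]   # step 4: update consensus
                    free.remove(r)              # read leaves the free pool
                    extended = True
        contigs.append(consensus)       # cluster consensus becomes a contig
    return contigs

print(assemble(["ACGTAC", "TACGGA", "GGATTT", "CCCCCC"]))
# ['ACGTACGGATTT', 'CCCCCC']
```

A real assembler would replace `overlap` with an alignment that tolerates mismatches and would compute the consensus from all reads covering each position.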

Input data

FASTA file - a FASTA file of reads in which each even-numbered sequence and the odd-numbered sequence that follows it form one read pair (the 0th and 1st sequences form a pair, the 2nd and 3rd form the next, and so on).
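
As an illustration of this pairing convention, the following minimal sketch groups consecutive FASTA records into pairs. It is not part of OligoZip; the function names and the read names `r0` to `r3` are made up for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)         # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)

def read_pairs(lines):
    """Pair record 0 with 1, 2 with 3, and so on."""
    records = list(read_fasta(lines))
    if len(records) % 2:
        raise ValueError("odd number of sequences: last read has no mate")
    return [(records[i], records[i + 1]) for i in range(0, len(records), 2)]

example = [">r0", "ACGT", ">r1", "TTGA", ">r2", "CC", "GG", ">r3", "AA"]
for first, second in read_pairs(example):
    print(first, second)
```

With a real file, `read_pairs(open(path))` would work the same way, since an open text file is an iterable of lines.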

Pair info file - a file describing the read types, in the following format:

READS# 3711740 - reads number
TYPES# 2  - number of read types
0 – 1855869       PAIR_ENDS -60..60  - string describes certain read type
1855870 – 3711739 PAIR_ENDS  10..190  - string describes certain read type. 

Here 1855870 - 3711739 is the range of read numbers (counting from 0) belonging to this read type; -60..60 or 10..190 is the interval of possible distances between the internal ends of the reads in a pair. A negative value means that the reads overlap.
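
To make the layout concrete, here is a minimal parsing sketch for a file of this shape. It rests on assumptions not stated above: that the trailing explanations ("- reads number", etc.) are annotations of this help text rather than part of the file, and that the range separator may be either a hyphen or an en dash. The function name, the dictionary keys, and the small 8-read example are hypothetical.

```python
import re

# "<first> - <last> <KIND> <min>..<max>", hyphen or en dash as separator
TYPE_LINE = re.compile(
    r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s+(\w+)\s+(-?\d+)\.\.(-?\d+)"
)

def header_value(line, key):
    """Extract the integer from a header line such as 'READS# 8'."""
    m = re.match(r"\s*" + re.escape(key) + r"\s*(\d+)", line)
    if m is None:
        raise ValueError(f"expected {key!r} line, got {line!r}")
    return int(m.group(1))

def parse_pair_info(text):
    """Return (number_of_reads, list_of_read_type_dicts)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_reads = header_value(lines[0], "READS#")
    n_types = header_value(lines[1], "TYPES#")
    types = []
    for ln in lines[2:2 + n_types]:
        m = TYPE_LINE.match(ln)
        if m is None:
            raise ValueError(f"unrecognized read-type line: {ln!r}")
        first, last, kind, lo, hi = m.groups()
        types.append({"first": int(first), "last": int(last), "kind": kind,
                      "min_dist": int(lo), "max_dist": int(hi)})
    return n_reads, types

# Made-up miniature file: 8 reads in two types of 4 reads each.
example = """READS# 8
TYPES# 2
0 - 3 PAIR_ENDS -60..60
4 - 7 PAIR_ENDS  10..190
"""
n_reads, types = parse_pair_info(example)
print(n_reads, types)
```

A negative `min_dist`, as in the first type, corresponds to read pairs whose internal ends may overlap.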

Output data

A FASTA file containing the assembled contigs.