Fgenesh pipeline help

Fgenesh pipeline - a pipeline for fully automatic prediction of genes in eukaryotic genomes, with no human intervention to modify results, based on Softberry gene-finding software.

References to Fgenesh pipeline software:

Solovyev V., Kosarev P., Seledsov I., Vorobyev D. (2006) Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7 (Suppl 1), S10.1-12.

Solovyev V.V. (2002) Finding genes by computer: probabilistic and discriminative approaches. In Current Topics in Computational Biology (eds. T. Jiang, T. Smith, Y. Xu, M. Zhang), The MIT Press, p. 365-401.

Salamov A., Solovyev V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10(4), 516-522.

The pipeline can also predict genes located in long introns of other genes.

If the "mapping known mRNAs" step is used, the user should prepare three RefSeq files for the query organism (*.cdna, *.pro, *.dat) and put their locations on the three corresponding lines of the configuration file.

If the "mapping known mRNAs" step is not used, the three corresponding lines must still be present, but their content is ignored (the user can put "n/a" or "none" in place of the paths to indicate that the data are unavailable or unused).

If the "using ESTs to improve gene models" step is used, a file with ESTs must be provided.

If the "prediction of genes based on homology to known proteins" step is used, the user should prepare a protein database (NR or a custom protein database) and put its location into "paths.list".

The user can run the "prediction of genes based on homology to known proteins" step with either the 'prot_map' (default) or the 'BLAST' (alternative) method:

The "prot_map" (default) method runs prot_map to map proteins from the protein database to genomic sequences and selects good-quality mappings; fgenesh+ then predicts more refined gene models in regions with good mapped proteins.

The "BLAST" (alternative) method first predicts genes ab initio (with fgenesh), then finds homologs of the predicted proteins in a database (with BLAST), and finally tries to refine the gene models (with fgenesh+) using the protein homologs found.

While both methods give similar accuracy in our tests, we recommend using the "prot_map" method because it is more straightforward, whereas the "BLAST" method relies on heuristics to merge genes that were split during ab initio prediction.

Please note that if you run the "prediction of genes based on homology to known proteins" step, you must choose whether to run it with the 'prot_map' or the 'BLAST' method. If both values are set to 1, the program asks you to set only one of them to 1.
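
The selection rule can be illustrated with a small check. This is only a sketch: the option names `use_protein_homology`, `prot_map` and `blast` are hypothetical and do not reflect the actual syntax of the configuration file, and rejecting the case where neither method is set is an assumption based on the "you must choose" wording.

```python
def check_homology_method(settings):
    """Return the selected method name, or None if the homology step is off.

    `settings` maps hypothetical option names to 0 or 1.
    """
    if not settings.get("use_protein_homology", 0):
        return None  # step disabled, method flags are not inspected

    # Collect every method flag that is switched on.
    chosen = [name for name in ("prot_map", "blast") if settings.get(name, 0) == 1]
    if len(chosen) != 1:
        # Both set (or, by assumption, neither set): ask for exactly one.
        raise ValueError("set exactly one of 'prot_map' or 'blast' to 1")
    return chosen[0]
```

For example, `check_homology_method({"use_protein_homology": 1, "prot_map": 1, "blast": 0})` returns `"prot_map"`, while setting both flags to 1 raises an error.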

The pipeline always runs the "ab initio gene prediction" step, so it does not need to be set up in the configuration file.

The final step, prediction of genes in long introns of other genes, can be switched ON and OFF by setting the corresponding value to 1 or 0.

On the last two lines, put the locations of the parameters file (e.g., "mammals.par") and the "paths.list" file that you use for the task.

Pipeline steps: additional information

I. Mapping known genes (mRNAs) and selecting good mappings.

Known mRNAs, e.g., from RefSeq, are mapped to genomic sequences. Genes predicted at this step are stored in tmp_work/*.kg1 files.

II. Mapping known proteins and selecting good mappings.

The "prot_map" (default) method predicts gene models using a combination of prot_map and Fgenesh+, with additional selection of reliable models through blast2 alignments between predicted proteins and protein homologs.

First, prot_map maps the given protein database (for example, NR) to a genomic sequence, and good-quality mappings are selected. Then Fgenesh+ predicts more refined gene models in regions with good mapped proteins.

After that, the predicted gene models are additionally filtered by a script that analyses the blast2 alignment between each predicted model and its protein homolog. Only models with a blast score > 100 and coverage > 80% for both the model and the homolog are selected.
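
The filtering criterion can be sketched as follows. This is not the pipeline's own script: the `Blast2Hit` record and its field names are hypothetical, and coverage is assumed here to mean the aligned length divided by the full protein length.

```python
from dataclasses import dataclass


@dataclass
class Blast2Hit:
    """Hypothetical summary of one blast2 alignment (model vs. homolog)."""
    score: float
    aligned_model: int    # aligned residues on the predicted protein
    model_length: int     # full length of the predicted protein
    aligned_homolog: int  # aligned residues on the homolog
    homolog_length: int   # full length of the homolog


def passes_filter(hit, min_score=100.0, min_coverage=0.80):
    """Keep a model only if score and both coverages exceed the thresholds."""
    model_cov = hit.aligned_model / hit.model_length
    homolog_cov = hit.aligned_homolog / hit.homolog_length
    return (hit.score > min_score
            and model_cov > min_coverage
            and homolog_cov > min_coverage)
```

With these assumptions, a hit with score 250 that covers 90% of the model but only 70% of the homolog is rejected, because both coverages must exceed 80%.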

The alternative "BLAST" method first predicts genes ab initio (with fgenesh), then finds homologs of the predicted proteins in a database (with BLAST), and finally tries to refine the gene models (with fgenesh+) using the protein homologs found.

The user can choose which of the two methods to use for gene prediction; see the section "Task configuration file".

III. Ab initio gene prediction.

At this step, a special script looks for *.kg1 and *.pm files, combines them, and forms sequence fragments that contain no gene models from either mRNA or protein mapping. If both the *.kg1 and the *.pm files are missing, the script skips fragmenting the sequences. Gene models are then predicted ab initio by Fgenesh.
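
The idea of forming gene-free fragments can be illustrated as computing the gaps between already mapped gene models. This is a minimal sketch under stated assumptions: the function name is hypothetical, coordinates are taken to be 1-based and inclusive, and the actual script may treat fragment boundaries and the *.kg1 and *.pm file formats differently.

```python
def gene_free_fragments(seq_length, gene_models):
    """Return (start, end) intervals of the sequence not covered by any model.

    `gene_models` is a list of (start, end) pairs combined from mRNA and
    protein mappings; coordinates are 1-based and inclusive (an assumption).
    """
    if not gene_models:
        # Nothing was mapped: no fragmenting, the whole sequence is used.
        return [(1, seq_length)]

    fragments = []
    next_free = 1  # first position not yet covered by a gene model
    for start, end in sorted(gene_models):
        if start > next_free:
            fragments.append((next_free, start - 1))
        next_free = max(next_free, end + 1)  # also handles overlapping models
    if next_free <= seq_length:
        fragments.append((next_free, seq_length))
    return fragments
```

For a 1000 bp sequence with mapped models at 100-200, 150-300 and 600-700, the sketch yields the fragments 1-99, 301-599 and 701-1000, which would then be passed to ab initio prediction.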

Prediction of genes in long introns of other genes runs optionally as the last step.