SNP-select help

SNP-select description

SNP-select is an ANNOVAR-like program. Its main idea, like that of ANNOVAR itself (see http://www.openbioinformatics.org/annovar/), is stepwise filtering of an input set of mutations to discard those unlikely to determine a given hereditary disease or any other "mutational" feature. The user can specify which filtering steps to apply.

SNP-select reads input files in which every line corresponds to a single mutation in a genome: a single-nucleotide or block substitution, an insertion, or a deletion. The first five tab-separated values in a line are the chromosome index (optionally prefixed with 'chr'), the start and end positions of the mutation, the reference nucleotides, and the alternative (observed) nucleotides. Additional columns may follow; they are copied unchanged to the output files. A reference nucleotide that is unknown is written as '0'. In insertions and deletions, the '-' symbol stands for the missing nucleotides. Table 1 shows several examples.

Table 1. Example of an input file with five genetic variants.
Chromosome Start End Ref Obs Comments
16 49303427 49303427 C T R702W (NOD2)
16 49321279 49321279 - C c.3016_3017insC (NOD2)
13 19661685 19661685 G - 35delG (GJB2)
1 105293754 105293754 0 ATAAA Block substitution
1 13133880 13133881 TC - 2-bp deletion (rs59770105)
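
The input format described above can be read with a few lines of code. The following is a minimal sketch in Python, not the parser used by SNP-select: the names Variant, parse_variant_line and variant_kind are hypothetical, and the classification labels returned by variant_kind are chosen here only for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Variant(NamedTuple):
    chrom: str            # chromosome index without the optional 'chr' prefix
    start: int
    end: int
    ref: Optional[str]    # None when the reference is unknown ('0' in the file)
    alt: str
    extra: Tuple[str, ...]  # additional columns, kept unchanged


def parse_variant_line(line: str) -> Variant:
    """Parse one tab-separated input line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated values")
    chrom, start, end, ref, alt = fields[:5]
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return Variant(chrom, int(start), int(end),
                   None if ref == "0" else ref, alt, tuple(fields[5:]))


def variant_kind(v: Variant) -> str:
    """Rough classification; the labels are assumptions of this sketch."""
    if v.ref == "-":
        return "insertion"
    if v.alt == "-":
        return "deletion"
    if v.ref is not None and len(v.ref) == 1 and len(v.alt) == 1:
        return "single substitution"
    return "block substitution"
```

For example, parsing the first row of Table 1 with a 'chr' prefix added gives chromosome '16', and the trailing comment column is preserved in the extra field.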

To assess the functional effect of variants, SNP-select uses a number of databases that are loaded and checked beforehand (Table 2). Some filters can be omitted at the user's discretion. Every variant (mutation) is compared against the content of each database selected as a filter and is discarded if it hits particular genome regions or exceeds a user-specified score threshold.

At every step, two files are written to the folder specified in the initialization file: 'step_i', with the mutations that passed the filter at the i-th step, and 'step_i.dropped', with the mutations dropped at that step. The last value in each tab-delimited mutation line is a comment on the feature examined by the current filter; as a rule, it is a score value. For the first filter (genomic annotation), it is the localization of the mutation: exonic or splicing for accepted mutations; noncoding_exonic, intronic, 3-prim noncoding, 5-prim noncoding, or intergenic for filtered-out mutations; and '-' for discarded mutations that were not found in the annotation DB (that is, located outside the annotated part of the genome).
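
The per-step output described above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not SNP-select's own code: the function run_step and the FilterFn type are hypothetical, the file names are assumed to be formed by substituting the step number for i (for example 'step_1' and 'step_1.dropped'), and the mutations handed to the next step are assumed to be the original lines without the appended comment.

```python
import os
from typing import Callable, Iterable, List, Tuple

# A filter takes one tab-separated mutation line and returns
# (keep, comment); the comment is usually a score or a localization.
FilterFn = Callable[[str], Tuple[bool, str]]


def run_step(i: int, lines: Iterable[str], filter_fn: FilterFn,
             out_dir: str) -> List[str]:
    """Apply one filter and write 'step_<i>' and 'step_<i>.dropped'."""
    passed_path = os.path.join(out_dir, f"step_{i}")   # assumed naming
    dropped_path = passed_path + ".dropped"
    survivors = []
    with open(passed_path, "w") as passed, open(dropped_path, "w") as dropped:
        for line in lines:
            line = line.rstrip("\n")
            keep, comment = filter_fn(line)
            annotated = f"{line}\t{comment}\n"  # comment becomes the last value
            if keep:
                passed.write(annotated)
                survivors.append(line)          # assumption: pass on unannotated
            else:
                dropped.write(annotated)
    return survivors
```

Chaining the steps then amounts to calling run_step repeatedly, feeding the list returned by one step into the next.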

Once all filters have been applied, the remaining mutations form the list of those most promising in terms of association with a given feature or disease.

Table 2. List of filtering steps.
Order  Filter name                                          Database file               Score threshold or rule for removing non-promising mutations
1      Genic annotation of a mutation                       knownGene.fg2               All mutations outside exons or splicing sites removed
2      Conserved regions among 46 species                   phastConsElements46way.txt  All not found removed
3      Segmental duplication regions                        genomicSuperDups.txt        All found removed**)
4      1000 Genomes Project Pilot data 2011 May release     ALL.sites.2010_11.txt       All found removed**)
5      dbSNP, NCBI short genetic variations                 snp135.txt                  All found removed**)
6      Whole-exome SIFT scores for non-synonymous variants  avsift.txt                  >0.05*)
7      PolyPhen-2 scores                                    ljb_pp2.txt                 >0.85*)
8      PhyloP conservation scores                           ljb_phylop.txt              >0.95*)
9      MutationTaster scores                                ljb_mt.txt                  >0.5*)
10     Whole-exome LRT (likelihood ratio test) scores       ljb_lrt.txt                 >0.5*)
11     Whole-exome GERP++ scores                            ljb_gerp++.txt              >0*)
12     Conserved genomic regions by GERP++                  gerp++elem.txt              All not found removed
13     NHLBI Exome Sequencing Project                       esp6500_all.txt             >0*)
14     UCSC repeats                                         rmask.txt                   All found removed**)

*) The threshold can be changed by the user.
**) A database hit alone is enough for removal.
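
The two region-based rules in Table 2, "All found removed" and "All not found removed", can be sketched in Python as follows. This is only an illustration: the names Regions, overlaps_any and keep_variant and the rule labels 'remove_found' and 'remove_not_found' are hypothetical, and both the mutation and the database regions are assumed to use the same inclusive coordinates, which may not match the actual database files. Score-based filters are not covered here.

```python
from typing import Dict, List, Tuple

# chromosome index -> list of (start, end) regions, inclusive coordinates
Regions = Dict[str, List[Tuple[int, int]]]


def overlaps_any(regions: Regions, chrom: str, start: int, end: int) -> bool:
    """True if [start, end] overlaps at least one region on the chromosome."""
    return any(s <= end and start <= e for s, e in regions.get(chrom, []))


def keep_variant(found: bool, rule: str) -> bool:
    """Decide whether a mutation survives a region-based filter."""
    if rule == "remove_found":        # e.g. segmental duplications, repeats
        return not found
    if rule == "remove_not_found":    # e.g. conserved regions
        return found
    raise ValueError(f"unknown rule: {rule}")
```

With this sketch, a mutation that falls inside a conserved region survives a 'remove_not_found' filter, while the same hit in a segmental duplication database removes it under 'remove_found'.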