Department of Computer Science

Presentation : Kevin Charles Paruchuri Padmavathi Department of Computer Science UTSA 11/1/2010

Introduction
GASSST: global alignment short sequence search tool.
A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags.

GASSST: global alignment short sequence search tool

Current Sequence Aligners
As technology improves, next-generation sequencing machines are able to produce huge amounts of data. The common alignment techniques that we currently have often restrict indels in the alignment to improve speed, while flexible aligners are too slow for large-scale applications.

GASSST
The purpose of this search tool is 2-fold: achieving high performance with no restrictions on the number of indels, with a design that is still effective on long reads. This method compares with BLAST, one of the algorithms we saw in class, with a new efficient filtering step that discards most alignments coming from the seed phase. It uses a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments. The algorithm manipulates a small pre-computed table of 64 KB, which easily fits into the cache memory.
The goal is to globally align short sequences to local regions of complete genomes in a very short time. Furthermore, to increase sensitivity, a few alignment errors are permitted. A first step is generally to map these short reads over a reference genome. In a seed-and-extend method, one or more exactly matching k-mers ("seeds" or "hot-spots") provide initial evidence of possible similarity. The seeds are then extended into sequence alignments. The extension step is more accurate than the seeding step, but it is computationally expensive.
GASSST's originality comes from the use of a small lookup table. More precisely, the precomputed alignment scores of all possible pairs of words of length w can be stored in a memory of size 4^{2w} bytes, if each score is stored in a single byte. For PASS, the size of the lookup table is 4^{14} bytes = 256 MB.
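The seed phase of a seed-and-extend method can be sketched roughly as below. This is only an illustration under simple assumptions (an in-memory dictionary index, exact k-mer matches, no filtering), not GASSST's actual implementation; the function names build_kmer_index, find_seeds and candidate_starts are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seeds(read, index, k):
    """Return (read_offset, genome_position) pairs for exact k-mer matches."""
    hits = []
    for offset in range(len(read) - k + 1):
        # .get() avoids adding empty entries to the defaultdict.
        for genome_pos in index.get(read[offset:offset + k], ()):
            hits.append((offset, genome_pos))
    return hits


def candidate_starts(read, index, k):
    """Collapse seeds to distinct candidate start positions of the read on the genome."""
    return sorted({genome_pos - offset for offset, genome_pos in find_seeds(read, index, k)})


genome = "ACGTACGTTTGACCA"
index = build_kmer_index(genome, 4)
print(candidate_starts("CGTTTG", index, 4))
```

Each candidate start would then be handed to the filter and extend steps; the seeds alone are only initial evidence of similarity.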

The last step, extend, receives the alignments that passed the filter step.
It is computed using a traditional banded Needleman-Wunsch (NW) algorithm. Significant alignments are then printed with their full description.
Provides a lower bound only.
It should be noted that if the filter step provides good efficiency, no optimization of the extend step is required. Indeed, if most false positive alignments have already been ruled out, the extend step should only take a negligible fraction of the total execution time.
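A banded NW computation can be sketched as follows. The sketch assumes unit costs (each mismatch or indel counts as one error) and a fixed band half-width; the scoring GASSST actually uses is not given in these slides, and the name banded_nw_errors is hypothetical.

```python
def banded_nw_errors(a, b, band):
    """Global alignment of a and b with unit costs, restricted to cells |i - j| <= band.

    Returns the minimum number of errors found inside the band, or None when
    the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: aligning the empty prefix of a against b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                best = i  # a[:i] against the empty prefix: i deletions
            else:
                best = inf
                if j - 1 in prev:  # diagonal: match or mismatch
                    best = prev[j - 1] + (a[i - 1] != b[j - 1])
                if j in prev:      # vertical: gap in b
                    best = min(best, prev[j] + 1)
                if j - 1 in cur:   # horizontal: gap in a
                    best = min(best, cur[j - 1] + 1)
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_nw_errors("ACGT", "AGT", 1))
```

Only about (2 * band + 1) cells per row are filled, which is why the banded variant is cheaper than the full dynamic programming matrix.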

Tiled Algorithm
Tiled algorithm, with a pre-computed score table of size 4. Here, the score given by the tiled algorithm is the same as that of the full dynamic programming algorithm.
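The small pre-computed table that the tiled algorithm relies on can be illustrated with the sketch below. It assumes a 2-bit encoding per base and unit-cost edit distance as the stored score; both are assumptions made for illustration, and the names table_size_bytes, encode, build_table and lookup are hypothetical. The sketch covers only the table itself, not the tiled filter.

```python
import itertools

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def table_size_bytes(w):
    """One byte per pair of words of length w over a 4-letter alphabet."""
    return 4 ** (2 * w)


def encode(word):
    """Pack a DNA word into an integer, 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def edit_distance(x, y):
    """Unit-cost edit distance by row-wise dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def build_table(w):
    """Pre-compute the score of every pair of words of length w."""
    words = ["".join(p) for p in itertools.product("ACGT", repeat=w)]
    table = bytearray(table_size_bytes(w))
    for x in words:
        for y in words:
            table[(encode(x) << (2 * w)) | encode(y)] = edit_distance(x, y)
    return table


def lookup(table, x, y, w):
    """Fetch the pre-computed score of two words with a single memory access."""
    return table[(encode(x) << (2 * w)) | encode(y)]


print(table_size_bytes(4), table_size_bytes(7))  # 64 KB versus 256 MB
small = build_table(2)
print(lookup(small, "AC", "AG", 2))
```

With w = 4 the table needs 4^8 = 65536 bytes, which matches the 64 KB figure, while 4^14 bytes corresponds to the 256 MB quoted for PASS.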

A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags.

Gibbs Sampling for Ambiguous Seq
Maps ambiguous tags to individual genomic sites.
From our previous lectures, we already know about Gibbs sampling and the problem of mapping ambiguous tags to a reference genome. This method maps ambiguous tags to individual genomic sites, and the algorithm takes advantage of local genomic context.
There are 2 aspects:
1) Mapping ambiguous tags to all probable sites.
2) Calculating the likelihood ratio (LR) of each map site.
For each map site, the number of co-located tags is counted, and this count is used to calculate the LR:
LR_j = P_s(k_j) / P_n(k_j)
P_s is the estimated target distribution of tag counts (initially, a Normal distribution is used to model this).
P_n is the background distribution of tag counts (a Poisson distribution is used).
k_j is the tag count at site j: k_j = k_u + f k_a, where k_u is the count of unique tags, k_a is the count of ambiguous tags, and f = 1/(mean number of associated sites of the ambiguous tags). The more sites the ambiguous tags are associated with, the smaller f is, so unique tags are weighted more heavily for the mapping.
A higher LR means higher confidence, and the LR increases non-linearly with tag counts.
From a practical point of view, the LR calculates the conditional probability of assigning an ambiguous tag to a specific site given the assignments of all the other tags. For some set of ambiguous tags (σ is the set of tag counts for all genomic sites), it approaches the relative entropy between P_s and P_n (they become one).
These two steps are circular, which led to adopting Gibbs sampling.
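The LR calculation and one sampling move can be sketched as below. The parameter values (mu, sigma, lam) are made up for illustration, the Poisson probability is extended to fractional counts with the gamma function because k_j can be non-integer (an assumption of this sketch), and the function names are hypothetical. It is not the authors' implementation.

```python
import math
import random


def weighted_count(k_unique, k_ambiguous, f):
    """k_j = k_u + f * k_a, with f down-weighting ambiguous tags."""
    return k_unique + f * k_ambiguous


def normal_pdf(k, mu, sigma):
    """P_s: Normal model of tag counts at true sites."""
    return math.exp(-0.5 * ((k - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def poisson_pmf(k, lam):
    """P_n: Poisson background; lgamma(k + 1) stands in for log(k!) so k may be fractional."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def likelihood_ratio(k, mu, sigma, lam):
    """LR_j = P_s(k_j) / P_n(k_j)."""
    return normal_pdf(k, mu, sigma) / poisson_pmf(k, lam)


def sample_site(counts, mu, sigma, lam, rng=random):
    """One Gibbs-style move: pick a site for an ambiguous tag with probability proportional to LR."""
    weights = [likelihood_ratio(k, mu, sigma, lam) for k in counts]
    return rng.choices(range(len(counts)), weights=weights, k=1)[0]


# Three candidate sites for one ambiguous tag, with hypothetical weighted counts.
counts = [weighted_count(8, 4, 0.5), weighted_count(1, 4, 0.5), weighted_count(0, 2, 0.5)]
print([likelihood_ratio(k, mu=10.0, sigma=3.0, lam=1.0) for k in counts])
print(sample_site(counts, mu=10.0, sigma=3.0, lam=1.0, rng=random.Random(0)))
```

In a full sampler this move would be repeated over all ambiguous tags, updating the site counts after each reassignment, which is the circular dependence between the two steps.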

Diagrammatic representation of previous slide.

Comparison
The method was compared against the approach used by the MAQ software, which randomly selects a site for each ambiguous tag. A comparison on the eight sequence tag libraries (20 bp tags, 35 bp tags) shows that Gibbs sampling correctly maps from 49% to 71%, while the MAQ method correctly maps 8% to 23%.

Thank you for listening.
Questions

Results
We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners.
