Department of Computer Science

Presentation transcript:

Presentation: Kevin Charles Paruchuri Padmavathi, Department of Computer Science, UTSA, 11/1/2010

Introduction
GASSST: global alignment short sequence search tool.
A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags.

GASSST: global alignment short sequence search tool

Current Sequence Aligners
Next-generation sequencing machines produce huge amounts of data, and the volume keeps growing as the technology improves. The common techniques we currently have often restrict indels in the alignment to improve speed, while flexible aligners are too slow for large-scale applications.

GASSST
The purpose of this search tool is 2-fold: achieving high performance with no restrictions on the number of indels, with a design that is still effective on long reads. The goal is to globally align short sequences to local regions of complete genomes in a very short time. Furthermore, to increase sensitivity, a few alignment errors are permitted.

The method is comparable to BLAST, one of the algorithms we saw in class, but it adds a new, efficient filtering step that discards most alignments coming from the seed phase. The filter is a carefully designed series of filters of increasing complexity and efficiency that quickly eliminates most candidate alignments.

A first step is generally to map the short reads onto a reference genome. In a seed-and-extend method, one or more exactly matching k-mers ("seeds" or "hot-spots") provide initial evidence of possible similarity. The seeds are then extended into sequence alignments. The extension step is more accurate than the seeding step, but it is computationally expensive.

GASSST's originality comes from the use of a small lookup table: the algorithm manipulates a pre-computed table of 64 KB, which easily fits into cache memory. More precisely, the precomputed alignment scores of all possible pairs of words of length w can be stored in a memory of size 4^(2w) bytes if each score is held in a single byte (for example, w = 4 gives 4^8 bytes = 64 KB). For PASS, the size of the lookup table is 4^14 bytes = 256 MB.
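The seed phase described above can be sketched in a few lines of Python. This is only an illustration of generic seed-and-extend seeding, not GASSST's actual implementation; the function names build_kmer_index and find_seeds, the toy sequences, and the choice k = 4 are all assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def find_seeds(read, index, k):
    """Return (read_offset, genome_position) pairs for exact k-mer matches."""
    seeds = []
    for j in range(len(read) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for pos in index.get(read[j:j + k], []):
            seeds.append((j, pos))
    return seeds

# Toy data (hypothetical): the read occurs in the genome starting at position 5.
genome = "ACGTACGTTAGC"
read = "CGTTAG"
index = build_kmer_index(genome, 4)
seeds = find_seeds(read, index, 4)
print(seeds)
```

Each seed is a candidate location only. In a seed-and-extend aligner, the candidates would next go through the filter step and then the more expensive extension step.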

The last step, extend, receives the alignments that passed the filter step. It is computed using a traditional banded NW (Needleman-Wunsch) algorithm. Significant alignments are then printed with their full description. Provides a lower bound only. It should be noted that if the filter step is efficient enough, no optimization of the extend step is required: if most false positive alignments have already been ruled out, the extend step should take only a negligible fraction of the total execution time.
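As a rough illustration of what a banded NW computation looks like, here is a minimal Python sketch. The scoring values (match = 1, mismatch = -1, gap = -1), the function name banded_nw, and the band parameter are assumptions chosen for the example; the slides do not state which scores or band width GASSST uses.

```python
def banded_nw(a, b, band, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b using only cells with |i - j| <= band.

    Returns None when the length difference exceeds the band, because the
    bottom-right cell of the matrix is then unreachable.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    neg = float("-inf")
    # Row 0: only the first `band` gap cells are inside the band.
    prev = [j * gap if j <= band else neg for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap  # leading gaps in b
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = cur[j - 1] + gap
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]

print(banded_nw("ACGT", "AGT", band=1))
```

For simplicity the sketch still allocates full rows and only fills the cells inside the band; a tuned implementation would store just the 2 * band + 1 cells per row.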

Tiled Algorithm
Tiled algorithm with a pre-computed score table of size 4. Here, the score given by the tiled algorithm is the same as that of the full dynamic programming algorithm.
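To make the idea of a pre-computed score table concrete, the sketch below builds a table holding one byte per pair of DNA words of length w, which gives 4^(2w) entries. It does not implement the tiled algorithm itself. Using edit distance as the stored score, the 2-bit encoding, and the names encode and build_score_table are assumptions for illustration; the demo uses w = 2 to keep the table tiny.

```python
from itertools import product

def edit_distance(a, b):
    """Classic dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]

def encode(word):
    """Pack a DNA word into an integer using 2 bits per base."""
    code = 0
    for base in word:
        code = code * 4 + "ACGT".index(base)
    return code

def build_score_table(w):
    """One byte per pair of words of length w: 4**(2*w) bytes in total."""
    words = ["".join(p) for p in product("ACGT", repeat=w)]
    table = bytearray(4 ** (2 * w))
    for x in words:
        for y in words:
            table[encode(x) * 4 ** w + encode(y)] = edit_distance(x, y)
    return table

w = 2
table = build_score_table(w)
lookup = lambda x, y: table[encode(x) * 4 ** w + encode(y)]
print(len(table), lookup("AC", "AG"))
```

With w = 4 the same layout would need 4^8 = 65,536 bytes (64 KB), and with w = 7 it would need 4^14 bytes (256 MB), which matches the sizes quoted for GASSST and PASS.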

A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags.

Gibbs Sampling for Ambiguous Seq
From our previous lectures, we already know about Gibbs sampling and the problem of mapping ambiguous tags to a reference genome. This method maps ambiguous tags to individual genomic sites and takes advantage of the local genomic context. There are 2 aspects:
1) Mapping ambiguous tags to all probable sites.
2) Calculating the likelihood ratio (LR) of each map site.

For each map site, the number of co-located tags is counted, and this count is used to calculate the likelihood ratio:

LR_j = P_s(k_j) / P_n(k_j)

Here P_s is the estimated target distribution of tag counts (initially, a Normal distribution is used to model it), P_n is the background distribution of tag counts (a Poisson distribution is used), and k_j is the tag count at site j. The count combines unique and ambiguous tags, k_j = k_u + k_a f, where f = 1 / (mean number of sites associated with the ambiguous tags). When k_a is large, f is small, which weights unique tags more heavily in the mapping.

A higher likelihood ratio means higher confidence, and the ratio increases non-linearly with tag counts. From a practical point of view, the LR gives the conditional probability of assigning an ambiguous tag to a specific site given the assignments of all the other tags. For some set of ambiguous tags (σ is the set of tag counts for all genomic sites), it approaches the relative entropy between P_s and P_n (they become one). These two steps are circular, which led to the adoption of Gibbs sampling.
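The likelihood ratio above can be tried out numerically with a short Python sketch. The parameter values, the function names, and the use of the gamma function to evaluate the Poisson term at a fractional weighted count are all assumptions made for illustration. The last function, which draws a site with probability proportional to its LR, is one plausible reading of a single sampling step and is not taken from the paper.

```python
import math
import random

def normal_pdf(k, mean, sd):
    """Density of a Normal(mean, sd) target distribution at count k."""
    return math.exp(-0.5 * ((k - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def poisson_pmf(k, lam):
    """Poisson background term; lgamma lets k be fractional (assumption)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def weighted_count(k_u, k_a, f):
    """k_j = k_u + k_a * f: ambiguous tags are down-weighted by f."""
    return k_u + k_a * f

def likelihood_ratio(k, target_mean, target_sd, background_rate):
    """LR_j = P_s(k_j) / P_n(k_j)."""
    return normal_pdf(k, target_mean, target_sd) / poisson_pmf(k, background_rate)

def sample_site(ratios, rng):
    """Pick a candidate site index with probability proportional to its LR."""
    total = sum(ratios)
    return rng.choices(range(len(ratios)), weights=[r / total for r in ratios])[0]

# Hypothetical example: three candidate sites for one ambiguous tag.
counts = [weighted_count(1, 2, 0.5), weighted_count(6, 4, 0.5), weighted_count(0, 2, 0.5)]
ratios = [likelihood_ratio(k, target_mean=10, target_sd=3, background_rate=1) for k in counts]
chosen = sample_site(ratios, random.Random(0))
print(counts, chosen)
```

In a full Gibbs sampler this step would be repeated over all ambiguous tags, with the tag counts and the estimate of P_s updated after each reassignment, which is the circular dependency the slide refers to.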

Diagrammatic representation of previous slide.

Comparison
The method was compared against the MAQ software's method, which randomly selects a site for each ambiguous tag. The comparison on the eight sequence tag libraries (20 bp tags, 35 bp tags) shows that Gibbs sampling correctly maps from 49% to 71%, against 8% to 23% for the MAQ method.

Thank you for listening. Questions?

Results We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners.