Q-gram Based Database Searching Using A Suffix Array (QUASAR). S. Burkhardt, A. Crauser, H-P. Lenhof. Max-Planck Institut f. Informatik, Saarbrücken; Deutsches...

Presentation transcript:

q-gram Based Database Searching Using A Suffix Array (QUASAR)
S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron
Max-Planck Institut f. Informatik, Saarbrücken; Deutsches Krebsforschungszentrum, Heidelberg

Outline
- Existing Work
- Motivation
- Problem
- Algorithm
- Results

- Examples: BLAST, FASTA
- Linear scan (no index)
- Good sensitivity

- Today: new applications
- Examples: EST clustering, large-scale shotgun assembly
- Low sensitivity
- Multiple searches
- Specialized algorithms needed

Pattern P: TCGATTACAGTGAAT
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
- Local alignment, minimum length w (w = 8)
- Low error rate (<10% edit distance)

Filter Step: Identify hotspots
Scan Step: Scan hotspots with BLAST

q-gram Filtration
Algorithm steps: q-gram Filtration, Block Addressing, Suffix Array, Window Shifting
Query P: TCGATTACAGTGAAT
Window: TCGATTAC (w = 8)
q-grams of the window (q = 3): TCG, CGA, GAT, ATT, TTA, TAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
- Number of q-grams: |P| - q + 1
- Edit distance e: at least t = |P| - q + 1 - qe common q-grams
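The two counts on this slide can be checked with a few lines of Python. This is a minimal sketch, not the authors' code; the function names `qgrams` and `threshold` are illustrative assumptions.

```python
def qgrams(s, q):
    """Return all overlapping q-grams of s, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """q-gram lemma bound: two strings of this length within edit
    distance e share at least length - q + 1 - q*e q-grams."""
    return length - q + 1 - q * e


window = "TCGATTAC"          # first window of the query, w = 8
grams = qgrams(window, 3)    # the six 3-grams listed on the slide
t = threshold(len(window), 3, 1)
print(grams)
print(t)
```

With w = 8, q = 3 and e = 1 the bound is t = 3, the value used on the window-shifting slide.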

Block Addressing
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Window: TCGATTAC
How to find the matching q-grams?
- Divide D into blocks
- Count matching q-grams per block
- Scan blocks with counter ≥ t
Example counters for two blocks of D: 4 and 0
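The block counters can be sketched as follows. This is an illustration under assumptions: the database is scanned linearly here, and the block size of 15 (two blocks for the 30-letter example) is chosen for the example only. Matches that straddle a block boundary are ignored.

```python
from collections import Counter


def block_counters(db, window, q, block_size):
    """Count, per block of db, the occurrences of the window's q-grams."""
    # Multiplicity of each q-gram in the window.
    wanted = Counter(window[i:i + q] for i in range(len(window) - q + 1))
    n_blocks = -(-len(db) // block_size)  # ceiling division
    counters = [0] * n_blocks
    for pos in range(len(db) - q + 1):
        gram = db[pos:pos + q]
        if gram in wanted:
            counters[pos // block_size] += wanted[gram]
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counters = block_counters(D, "TCGATTAC", q=3, block_size=15)
t = 3
candidates = [b for b, c in enumerate(counters) if c >= t]
print(counters, candidates)
```

Only blocks whose counter reaches t are passed on to the scan step.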

Suffix Array
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Window: TCGATTAC
- Sorted list of pointers to suffixes, O(log |D|) access time
- Precompute searches for q-grams, O(1) time access

AAA : 0    AAC : 0    AAG : 0    AAT : 0
ACA : 1    ACC : 1    ACG : 1    ACT : 1
AGA : 3    AGC : 3    AGG : 3    AGT : 3
ATA : 4    ATC : 4    ATG : 4    ATT : 5
...
TGA : 26   TGC : 27   TGG : 27   TGT : 29
TTA : 29   TTC : 29   TTG : 30   TTT :
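A small sketch of the two structures: a suffix array plus a precomputed q-gram lookup into it. Two simplifications are assumptions of this example. The suffix array is built by naive sorting, and a Python dict stands in for the direct-address table the slide describes. The indices it produces need not match the table on the slide.

```python
def build_suffix_array(db):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(db)), key=lambda i: db[i:])


def build_qgram_table(db, sa, q):
    """Map each q-gram in db to its half-open range [lo, hi) in sa.

    Suffixes sharing a q-gram prefix are contiguous in the suffix array,
    so one pass over sa is enough.
    """
    table = {}
    for idx, pos in enumerate(sa):
        gram = db[pos:pos + q]
        if len(gram) < q:
            continue  # suffix too short to start a q-gram
        lo, _ = table.get(gram, (idx, idx))
        table[gram] = (lo, idx + 1)
    return table


def occurrences(gram, sa, table):
    """All start positions of gram in db, found with one table lookup."""
    lo, hi = table.get(gram, (0, 0))
    return sorted(sa[lo:hi])


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_qgram_table(D, sa, 3)
print(occurrences("ACT", sa, table))
```

After the one-time precomputation, finding all database positions of a query q-gram costs a single lookup plus the size of the answer.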

Window Shifting
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Query: TCGATTACAGTGAAT
Window: TCGATTAC
Parameters: q = 3, w = 8, e = 1, t = 3
- Move window over query
- Mark full blocks for each window
- Scan marked blocks
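The window-shifting loop can be sketched as below. This is a hedged illustration, not the published implementation: a dict index stands in for the suffix array, the block size of 15 is an assumption made for the 30-letter example, and block-boundary effects are ignored. Each shift adds the q-gram entering the window and removes the one leaving it, so the counters are never rebuilt from scratch.

```python
def marked_blocks(db, query, q, w, e, block_size):
    """Return the blocks whose counter reaches t for some query window."""
    t = w - q + 1 - q * e
    n_blocks = -(-len(db) // block_size)  # ceiling division

    # q-gram -> blocks of its occurrences (stand-in for the suffix array).
    index = {}
    for pos in range(len(db) - q + 1):
        index.setdefault(db[pos:pos + q], []).append(pos // block_size)

    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1          # q-grams in one window
    counters = [0] * n_blocks
    marked = set()
    for i, gram in enumerate(grams):
        for b in index.get(gram, []):          # q-gram entering the window
            counters[b] += 1
        if i >= per_window:                    # q-gram leaving the window
            for b in index.get(grams[i - per_window], []):
                counters[b] -= 1
        if i >= per_window - 1:                # a full window is in place
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
P = "TCGATTACAGTGAAT"
hits = marked_blocks(D, P, q=3, w=8, e=1, block_size=15)
print(hits)
```

In this example the first windows mark block 0 (the region around TCGAT) and the later windows mark block 1 (the exact occurrence of AGTGAAT).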

Results
- Influence of the block size
- Sensitivity
- Running times
- Overhead for loading the index
Benchmark system: UltraSPARC processor, 333 MHz, 4 GB RAM

Influence of Block Size

Sensitivity
- 1000 queries
- BLAST cutoff E =
- Number of identical hit lists
  - Mouse EST DB: 91.4%
  - Human EST DB: 97.1%
- QUASAR finds many hits below the selected error level

Running Times
- Test parameters:
  - 6% error
  - w = 50
  - q = 11
  - block size 2048
  - scan with BLAST
  - time averaged over 1000 queries
- ~30 times faster than BLAST

Overhead for Loading the Index
- 1000 queries
- Human EST DB, 280 Mbps
- BLAST test run: 5 seconds load time, seconds search time
- QUASAR test run: 90 seconds load time, 380 seconds search time