05.04.2008 SSAHA: A Fast Search Method For Large DNA Databases Zemin Ning, Anthony J. Cox and James C. Mullikin Seminar by: Gerry Kammerer © ETH Zürich.

Presentation transcript:

SSAHA: A Fast Search Method For Large DNA Databases Zemin Ning, Anthony J. Cox and James C. Mullikin Seminar by: Gerry Kammerer © ETH Zürich | Gerry Kammerer

Gerry Kammerer – ETH Zürich 2 Human Genome

Gerry Kammerer – ETH Zürich 3 Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions

Gerry Kammerer – ETH Zürich 5 DNA
- Deoxyribonucleic acid
- Contains genetic instructions
- Double helix
- Long polymer of simple units (nucleotides)
- Backbone made of sugars and phosphate
- One of four types of bases is attached to each sugar
- The sequence of these four bases encodes information

Gerry Kammerer – ETH Zürich 6 DNA sequence
- Base pair
  - Bases from each strand form bonds
- DNA sequence
  - Succession of letters
  - Adenine, Cytosine, Guanine, Thymine
  - Measured in gigabases (Gb) or giga base pairs (Gbp)

Gerry Kammerer – ETH Zürich 7 The Problem
- Sequence comparison (exact / approximate)
- Through comparison, draw conclusions about
  - Structure
  - Function
  - Cooperation of components
- Sequencing
  - Produces multiple megabytes of data per day
- Large amounts of queries and data overwhelm existing techniques
  - Results are not found in reasonable time or are not exact enough

Gerry Kammerer – ETH Zürich 8 Approaches
- Dynamic programming (first approaches)
  - Needleman & Wunsch, 1970
  - Refinement: Smith & Waterman, 1981 (most popular)
- BLAST (Basic Local Alignment Search Tool)
  - Altschul et al., 1990
  - Faster / less accurate
  - Family of programs
- Suffix tree algorithms
  - Need too much memory

Gerry Kammerer – ETH Zürich 10 SSAHA-approach
- Uses hash table structures
  - Needs much memory (nowadays we have more RAM!)
  - But significantly less than suffix tree methods!
- Orders of magnitude faster than BLAST

Gerry Kammerer – ETH Zürich 11 Definitions
- Query Q = "GGATCCCCTG"
- DB = S1, S2, S3, S4, ... (DNA sequences)
- k-tuple: a run of k consecutive bases, e.g. the 4-tuple "GGAT"
- A sequence S of length n has (n - k + 1) overlapping k-tuples
- (i, j) references a k-tuple
  - i is the index of the sequence
  - j is the offset in the sequence
- Example: (2,3) references a 2-tuple in S2
Example DB:
S1 = "GGATCCCCTG"
S2 = "TGCAACAT"
S3 = "AACATCCTGGG"
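The definitions above can be tried out with a minimal Python sketch. The helper name `k_tuples` and the dictionary `refs` are illustrative only, and offsets are assumed to be 0-based, which matches the later 1-tuple example in these slides.

```python
def k_tuples(seq, k):
    """Yield (offset, k-tuple) for every overlapping k-tuple in seq (0-based offsets)."""
    for j in range(len(seq) - k + 1):
        yield j, seq[j:j + k]

# Example DB from the slide: S1, S2, S3.
db = ["GGATCCCCTG", "TGCAACAT", "AACATCCTGGG"]

# Map each reference (i, j) to its 2-tuple; i is the 1-based sequence index.
refs = {(i, j): t for i, s in enumerate(db, start=1) for j, t in k_tuples(s, 2)}

print(len(list(k_tuples(db[0], 4))))  # n - k + 1 = 10 - 4 + 1 = 7
print(refs[(2, 3)])                   # AA (2-tuple at offset 3 of S2)
```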

Gerry Kammerer – ETH Zürich 12 Hash table construction
- k-tuples
  - Only 4^k possible k-tuples (as we have four bases)
- List of positions L
  - Positions of all k-tuples in the DB (sorted by k-tuple)
- Array A
  - Pointers into L
  - (Which positions in L belong to which k-tuple)
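Because there are only 4^k possible k-tuples, each one can serve directly as an index into array A. The sketch below shows one possible mapping, reading a k-tuple as a base-4 number with A=0, C=1, G=2, T=3. The function name `tuple_index` and the digit order are assumptions for illustration, not necessarily the encoding used in the paper.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per base

def tuple_index(word):
    """Interpret a k-tuple as a base-4 number in the range 0 .. 4**k - 1."""
    idx = 0
    for base in word:
        idx = idx * 4 + CODE[base]
    return idx

print(tuple_index("AA"))    # 0
print(tuple_index("GGAT"))  # 163 = 2*64 + 2*16 + 0*4 + 3
print(tuple_index("TT"))    # 15 = 4**2 - 1
```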

Gerry Kammerer – ETH Zürich 13 Hash table construction (ctd.)
Example DB (1-tuples):
S1 = "GGATCC"
S2 = "TGCAAC"
S3 = "AATA"
Array A: A = [0,6,10,13]
- A = 0, C = 6, G = 10, T = 13
List of positions L:
0:(1,2) 1:(2,3) 2:(2,4) 3:(3,0) 4:(3,1) 5:(3,3)
6:(1,4) 7:(1,5) 8:(2,2) 9:(2,5)
10:(1,0) 11:(1,1) 12:(2,1)
13:(1,3) 14:(2,0) 15:(3,2)
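The construction of A and L for this example DB can be reproduced with a short sketch. It assumes 1-based sequence indices, 0-based offsets, and sequences made up only of A, C, G and T. `build_hash_table` is a hypothetical helper that groups positions in a dictionary instead of using a compact bit encoding.

```python
from itertools import product

def build_hash_table(db, k):
    """Return (A, L): L holds (i, j) positions grouped by k-tuple in A, C, G, T order;
    A holds, for each k-tuple, the index in L where its positions start."""
    buckets = {"".join(p): [] for p in product("ACGT", repeat=k)}
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            buckets[seq[j:j + k]].append((i, j))
    A, L = [], []
    for word in sorted(buckets):
        A.append(len(L))         # pointer to the first position of this k-tuple
        L.extend(buckets[word])  # positions stay ordered by (i, j)
    return A, L

A, L = build_hash_table(["GGATCC", "TGCAAC", "AATA"], k=1)
print(A)        # [0, 6, 10, 13]
print(L[6:10])  # [(1, 4), (1, 5), (2, 2), (2, 5)]
```

The T entries start at index 13 because the DB contains 6 A, 4 C and 3 G positions before them.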

Gerry Kammerer – ETH Zürich 14 Sequence Search
- Query Q = "GAAT..." (a DNA sequence)
- Process the k-tuples of Q base by base
  - E.g. with 2-tuples: "GA", "AA", "AT", ...
- Construct hits: (i,k,j)
  - (i, j) is a position of the current k-tuple in the DB (from the hash table)
  - k = j - (offset of the current k-tuple in Q); in a hit, k denotes this shift, not the tuple length
  - n entries in the DB for a k-tuple give n hits
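As a rough illustration of hit construction, the sketch below looks up every k-tuple of the query and emits one (i, shift, j) hit per stored position. The names `index_db` and `find_hits` are hypothetical, the index is a plain dictionary, and offsets are assumed to be 0-based.

```python
from collections import defaultdict

def index_db(db, k):
    """Map each k-tuple to the list of (i, j) positions where it occurs."""
    table = defaultdict(list)
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            table[seq[j:j + k]].append((i, j))
    return table

def find_hits(query, table, k):
    """Step through the query base by base and build (i, shift, j) hits."""
    hits = []
    for t in range(len(query) - k + 1):           # t = offset of the k-tuple in Q
        for i, j in table.get(query[t:t + k], []):
            hits.append((i, j - t, j))            # shift = j - t
    return hits

table = index_db(["GGATCC", "TGCAAC", "AATA"], k=1)
hits = find_hits("AT", table, k=1)
print(len(hits))  # 9 (six hits for "A", three for "T")
```

Note that a shift can be negative, for example when a k-tuple from later in the query matches the very start of a database sequence.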

Gerry Kammerer – ETH Zürich 15 Sequence Search (ctd.)
- Sorting the hits
  - (i,k,j): first by i, then k, then j
- Let us have a look at a small example!
Query Q = "AT"

Gerry Kammerer – ETH Zürich 16 Remember
Example DB (1-tuples):
S1 = "GGATCC"
S2 = "TGCAAC"
S3 = "AATA"
Array A: A = [0,6,10,13]
- A = 0, C = 6, G = 10, T = 13
List of positions L:
0:(1,2) 1:(2,3) 2:(2,4) 3:(3,0) 4:(3,1) 5:(3,3)
6:(1,4) 7:(1,5) 8:(2,2) 9:(2,5)
10:(1,0) 11:(1,1) 12:(2,1)
13:(1,3) 14:(2,0) 15:(3,2)
Query Q = "AT"

Gerry Kammerer – ETH Zürich 17 Sequence Search Example
Example DB (1-tuples)
List of positions L:
0:(1,2) 1:(2,3) 2:(2,4) 3:(3,0) 4:(3,1) 5:(3,3)
6:(1,4) 7:(1,5) 8:(2,2) 9:(2,5)
10:(1,0) 11:(1,1) 12:(2,1)
13:(1,3) 14:(2,0) 15:(3,2)
Query Q = "AT"
Proceed base-by-base
Hits for "A" (offset 0 in Q): (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)

Gerry Kammerer – ETH Zürich 18 Sequence Search Example (ctd.)
Example DB (1-tuples)
Query Q = "AT"
Proceed base-by-base
Hits for "A" (offset 0 in Q): (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
Hits for "T" (offset 1 in Q): (1,2,3) (2,-1,0) (3,1,2)

Gerry Kammerer – ETH Zürich 19 Sequence Search Example (ctd.)
Example DB (1-tuples)
Query Q = "AT"
Proceed base-by-base
Sorted hits: (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)

Gerry Kammerer – ETH Zürich 20 Sequence Search Example (ctd.)
Query Q = "AT"
Example DB (1-tuples):
S1 = "GGATCC"
S2 = "TGCAAC"
S3 = "AATA"
Sorted hits: (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
Same i,k in hits: run of matching bases
- (1,2,2) (1,2,3): "AT" at offsets 2-3 of S1
- (3,1,1) (3,1,2): "AT" at offsets 1-2 of S3
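Sorting and run detection can be sketched as follows. The hit list is recomputed for Q = "AT" against the example DB, assuming 0-based offsets. The function `runs` and its `min_len` parameter are illustrative names, not part of SSAHA itself.

```python
from itertools import groupby

# Hits (i, shift, j) for Q = "AT" against S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA".
hits = [(1, 2, 2), (2, 3, 3), (2, 4, 4), (3, 0, 0), (3, 1, 1), (3, 3, 3),
        (1, 2, 3), (2, -1, 0), (3, 1, 2)]

def runs(hits, min_len=2):
    """Sort hits by (i, shift, j) and return the groups that share the same (i, shift)."""
    out = []
    for (i, shift), grp in groupby(sorted(hits), key=lambda h: h[:2]):
        offsets = [h[2] for h in grp]
        if len(offsets) >= min_len:
            out.append((i, shift, offsets))
    return out

print(runs(hits))  # [(1, 2, [2, 3]), (3, 1, [1, 2])]
```

Tuple comparison in Python already orders by i, then shift, then j, so a plain `sorted` gives the order described on the slides.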

Gerry Kammerer – ETH Zürich 21 Sequence Search Summary
- Run of matching bases
  - Region of exact matches
  - Gapped matches
- Only finds matches in the forward direction!
  - Search with the reverse complement of the query to find matches in the reverse direction
Example (3-tuples, 9-base query), hits:
(3,9,9) (3,9,12) (3,9,15)
(5,3,3) (5,3,9)
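For the reverse direction, a common approach with double-stranded DNA is to repeat the search with the reverse complement of the query. The sketch below assumes that this is what reversing the query means here. `reverse_complement` is an illustrative helper that handles only the upper-case letters A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("AATA"))    # TATT
print(reverse_complement("GGATCC"))  # GGATCC (this sequence is its own reverse complement)
```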

Gerry Kammerer – ETH Zürich 22 Memory Requirements
- Array A: 4 * 4^k = 4^(k+1) bytes
  - 32-bit pointers, 4^k possible k-tuples
- List L: 8 * W bytes
  - W = number of k-tuples in the database
- Reduce memory usage
  - Only consider non-overlapping k-tuples
  - Discard highly frequent k-tuples
  - Loss of accuracy!
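The two formulas are easy to evaluate. In the sketch below, the function name `ssaha_memory_bytes` and its parameters are illustrative. The choice k = 12 and the estimate of W as 2.7 gigabases split into non-overlapping 12-tuples are assumptions made for the sake of a worked number, not figures reported in the slides.

```python
def ssaha_memory_bytes(k, num_tuples, pointer_bytes=4, position_bytes=8):
    """Return (bytes for array A, bytes for list L) following the slide's formulas."""
    array_a = pointer_bytes * 4 ** k      # one 32-bit pointer per possible k-tuple
    list_l = position_bytes * num_tuples  # 8 bytes per stored (i, j) position
    return array_a, list_l

# Assumed example: k = 12, W = number of non-overlapping 12-tuples in 2.7 Gb.
a_bytes, l_bytes = ssaha_memory_bytes(k=12, num_tuples=2_700_000_000 // 12)
print(a_bytes)  # 67108864, i.e. 4**13 bytes (64 MiB)
print(l_bytes)  # 1800000000, i.e. 1.8 GB
```

Under these assumptions the list L dominates, which is why reducing W has a much larger effect on memory than the size of A.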

Gerry Kammerer – ETH Zürich 23 Search speed
- Search speed depends on
  - T_hash: building the hash table
  - T_search: processing a specific query
- T_hash does not matter much
  - Computed once per DB (save to disk, server usage)

Gerry Kammerer – ETH Zürich 24 Optimise Search speed
- Sorting algorithm
  - In practice: close to linear with quicksort
- Parameters k and W (tradeoff with accuracy)
  - Increase k (loss of sensitivity)
  - Reduce W by cutting off very frequently occurring k-tuples
  - Strong effect! (There exist highly repetitive k-tuples)
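Cutting off frequent k-tuples can be illustrated on the small 1-tuple example DB. This is only a sketch: `index_with_cutoff` and its `max_occurrences` parameter are hypothetical names, and the threshold of 4 is chosen purely to show the effect.

```python
from collections import defaultdict

def index_with_cutoff(db, k, max_occurrences):
    """Index all k-tuples, then drop those that occur more than max_occurrences times."""
    table = defaultdict(list)
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            table[seq[j:j + k]].append((i, j))
    return {w: pos for w, pos in table.items() if len(pos) <= max_occurrences}

kept = index_with_cutoff(["GGATCC", "TGCAAC", "AATA"], k=1, max_occurrences=4)
print(sorted(kept))                          # ['C', 'G', 'T'] ("A" occurs 6 times and is dropped)
print(sum(len(p) for p in kept.values()))    # 10 stored positions instead of 16
```

Every query k-tuple that was dropped produces no hits at all, which is where both the speed-up and the loss of sensitivity come from.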

Gerry Kammerer – ETH Zürich 25 Experimental results (from paper)
- 2.7 Gb of human genome DNA
  - 292,016 sequences
- 177 query sequences
  - Containing 104,755 bases
- Compaq EV6 500 MHz processor, 16 GB RAM

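Gerry Kammerer – ETH Zürich 26 Experimental results (ctd.)
Columns are grouped under 90%, 95% and 100%. "?" marks a value that is missing from the transcript, including the entire k column.

k | 90% T_hash | 90% T_search | 95% T_hash | 95% T_search | 100% T_hash | 100% T_search
? | ? | 102.5s | 842.4s | 128.8s | 868.5s | 389.5s
? | ? | 26.3s | 810.5s | 36.1s | 808.8s | 199.1s
? | ? | 7.3s | 969.9s | 11.0s | 961.2s | 119.0s
? | ? | 2.2s | ? | ? | 851.4s (column uncertain) | 78.7s
? | ? | 0.9s | 932.0s | 2.5s | 927.1s | 51.6s
? | ? | 0.1s | 1015.5s | 1.7s | 999.2s | 35.4s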

Gerry Kammerer – ETH Zürich 28 Reasons for the speed
- Hashing the database
  - Search time is nearly independent of database size
  - BLAST, for example, hashes the query and scans the DB
- The human genome is far from random
  - Discarding highly repetitive k-tuples has a big effect

Gerry Kammerer – ETH Zürich 29 Conclusions
- Computers improved quickly
  - Cheaper, more powerful
  - More RAM available
- Hash the database

Gerry Kammerer – ETH Zürich 30 Questions?