PatternHunter: A Fast and Highly Sensitive Homology Search Method
Bin Ma, Department of Computer Science, University of Western Ontario



A homology between mouse and human genomes
[Figure: local alignment of a mouse genomic sequence against its human homolog, shown as four stacked blocks.]
Smith-Waterman is the most accurate method. Time complexity: O(mn) for sequences of lengths m and n.

[Figure: the mouse/human alignment from the previous slide.]
BLAST finds a "hit" (a seed match) and then extends it.

Example of missing a target
Fail:
GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
Dilemma:
Sensitivity (the success rate of finding a homology) needs shorter seeds.
Speed needs longer seeds.
Mega-BLAST uses seeds of length 28.

PatternHunter uses "spaced seeds" (a seed shape is called a model)
Eleven required matches (weight = 11)
Seven "don't care" positions
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT…
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT…
Hit = all the required matches are satisfied.
BLAST seed model =
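The hit definition on this slide can be tried directly on the slide's own sequence pair. The sketch below is illustrative only: the function name `seed_hits` is made up, and the model string `111010010100110111` is an assumption, since the slide's model did not survive extraction. It has eleven 1s and seven 0s, and it agrees with the `CAA?A??A?C??TA?TGG` pattern shown on a later slide. The contiguous model is taken to be eleven consecutive 1s.

```python
def seed_hits(a, b, seed):
    """Return start offsets where every required ('1') seed position matches.

    a, b: two aligned sequences (no gaps); seed: string of '1' (required
    match) and '0' (don't care). Hypothetical helper for illustration.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_start = min(len(a), len(b)) - len(seed)
    return [s for s in range(last_start + 1)
            if all(a[s + i] == b[s + i] for i in required)]


# Sequence pair copied from the slide (without the trailing ellipsis).
TOP = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
BOTTOM = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"

SPACED = "111010010100110111"   # assumed weight-11, length-18 model
CONTIGUOUS = "1" * 11           # assumed BLAST-style weight-11 model

print(seed_hits(TOP, BOTTOM, CONTIGUOUS))  # [] : longest exact run is 9
print(seed_hits(TOP, BOTTOM, SPACED))      # [7]
```

On this pair the contiguous model finds nothing, while the spaced model of the same weight finds one hit, which is the situation the slide illustrates.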

Observations on spaced seeds
Seed models with different shapes can detect different homologies. Two consequences:
1. Some models may detect more homologies than others, giving a more sensitive homology search (PatternHunter I).
2. Several seed models can be used simultaneously to hit more homologies, approaching a 100% sensitive homology search (PatternHunter II).

Spaced Seed – PatternHunter I:

Weight of a seed
Lemma: The expected number of hits of a weight-W, length-M seed model within a length-L region with similarity p is $(L-M+1)\,p^W$.
Proof: There are (L-M+1) positions where a hit can occur, and at each position a hit occurs with probability $p^W$. Q.E.D.
Seed models with the same weight generate approximately the same number of hits, so speed is approximately the same.
Sensitivity is not necessarily the same: what matters is the number of regions that contain hits, not the number of hits.
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT
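As a small worked example of the lemma, the sketch below evaluates the formula for two weight-11 models in a length-64 region with similarity 0.7 (the setting used later in the slides). The function name `expected_hits` and the choice of lengths 18 and 11 are illustrative assumptions.

```python
def expected_hits(L, M, W, p):
    """Expected number of hits of a weight-W, length-M seed in a length-L
    region where each position matches independently with probability p."""
    return (L - M + 1) * p ** W


# Two weight-11 models: a length-18 spaced shape and a length-11 contiguous one.
spaced = expected_hits(L=64, M=18, W=11, p=0.7)
contiguous = expected_hits(L=64, M=11, W=11, p=0.7)
print(round(spaced, 2), round(contiguous, 2))  # roughly 0.93 and 1.07
```

The two values differ only through the factor (L-M+1), which is why models of equal weight cost about the same in hit processing.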

Simulated sensitivity curves

Why are spaced seeds better?
TTGACCTCACC?
|||||||||||?
TTGACCTCACC?

CAA?A??A?C??TA?TGG?
|||?|??|?|??||?|||?
CAA?A??A?C??TA?TGG?
BLAST's seed usually uses more than one hit to detect one homology (redundant).
Spaced seeds use fewer hits to detect one homology (efficient).

PH's seed does not overlap heavily
PH's seed does not overlap heavily with shifted copies of itself, so hits at different positions are nearly independent. The probability of having a second hit is $5p^6 + \dots$, compared with $p + p^2 + p^3 + p^4 + \dots$ for BLAST's model.
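The overlap argument can be checked by counting how many required positions of a shifted seed are not already covered by the original placement. Given a hit at one position, a second hit at a given shift needs exactly that many additional matches. This is a minimal sketch; the helper name is hypothetical, and the spaced model string is an assumption taken from the `CAA?A??A?C??TA?TGG` pattern on the preceding slide.

```python
def new_matches_after_shift(seed, shift):
    """Number of required positions of the shifted seed that are not
    required positions of the unshifted seed."""
    ones = {i for i, c in enumerate(seed) if c == "1"}
    shifted = {i + shift for i in ones}
    return len(shifted - ones)


SPACED = "111010010100110111"   # assumed PatternHunter weight-11 model
CONTIGUOUS = "1" * 11           # assumed BLAST-style weight-11 model

for shift in (1, 2, 3):
    print(shift,
          new_matches_after_shift(SPACED, shift),
          new_matches_after_shift(CONTIGUOUS, shift))
```

For the contiguous model a shift of k needs only k new matches (probability p^k), whereas the spaced model needs six new matches for a shift of 1 or 2, which is where the p^6 term comes from.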

Indeed, under the condition that there is at least one hit in a length-64 homology with 70% similarity, the average number of hits in that region is 2.0 for PH's weight-11 seed and 3.6 for the contiguous weight-11 seed.

A dynamic programming algorithm to compute sensitivity
R[1..n]: a random homology, with Pr(R[i]=1) = p.
We want Pr(R is hit by a seed model x).
DP[i,s] denotes Pr(R[1..i] is hit | R[1..i] ends with s).
$$
DP[i,s] = \begin{cases}
1 & \text{if } |s|=|x| \text{ and } s \text{ is hit} \\
DP[i-1,\ s[1..|s|-1]] & \text{if } |s|=|x| \text{ and } s \text{ is not hit} \\
p \cdot DP[i,(1s)] + (1-p) \cdot DP[i,(0s)] & \text{otherwise}
\end{cases}
$$
Running time: $O(n \cdot 2^{|x|})$. Better algorithms exist.
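The recurrence can be sketched as a forward computation over the last |x| positions of R. This is not the slide's exact formulation: instead of conditioning on a suffix, the version below tracks the probability mass of prefixes that have not been hit yet, which gives the same answer within the same O(n 2^|x|) bound. The function name and the requirement that the model start with a 1 are assumptions of this sketch.

```python
from collections import defaultdict


def hit_probability(seed, n, p):
    """Probability that a random 0/1 region of length n (each position is 1
    with probability p) contains at least one hit of the seed model."""
    if not seed or seed[0] != "1":
        raise ValueError("this sketch assumes the seed starts with '1'")
    m = len(seed)
    required = int(seed, 2)        # bit i set where a match is required
    window = (1 << m) - 1          # keep only the last m positions
    # alive[w]: probability that the last m positions equal w and no hit
    # has occurred so far. Leading zeros stand for "before the region".
    alive = {0: 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for w, prob in alive.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                w2 = ((w << 1) | bit) & window
                if w2 & required == required:
                    continue       # a hit ends here: leave the no-hit mass
                nxt[w2] += prob * q
        alive = nxt
    return 1.0 - sum(alive.values())


print(hit_probability("1101", 10, 0.7))
```

For a length-18 model the state space has up to 2^18 windows, so this plain-Python version is only practical for short models; it is meant to make the recurrence concrete, not to be fast.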

PatternHunter I performance
Entries are running time / memory use.

                                            Blastn      MB28        PH
E. coli (4.7M) vs. H. inf (1.8M)            716s/158M   5s/561M     34s/78M
Arabidopsis chr2 (19.6M) vs. chr4 (17.5M)               s/1087M     5020s/279M
Human chr21 (26.2M) vs. chr22 (35M)                                 s/419M

All runs used a 700 MHz Pentium III PC with 1 GB of memory.
Human (3G) vs. Mouse (3G)*
* Using a 2-hit, weight-12 seed, PH took 6 days on a 1 GHz Pentium III PC with 2 GB of memory. With BLAST, the same comparison would take months on parallel computers.

Multiple Seeds – PatternHunter II:

PatternHunter II: optimized multiple seeds
Basic searching algorithm:
1. Select a group of spaced seed models.
2. For each hit of each model, conduct an extension to find a homology.
Selecting an optimal set of multiple seeds is NP-hard.

Seed selection algorithm
1. Let A be an empty set.
2. Let s be the seed such that A ∪ {s} has the highest hit probability.
3. A = A ∪ {s}; if |A| < K, go to 2.
The approximation ratio is 1-1/e.
Computing the hit probability of multiple seeds is NP-hard. An efficient algorithm exists when the number of zeros is limited, and there is a PTAS to compute the probability approximately.
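The three greedy steps can be sketched on a toy scale. Everything below is an illustrative assumption rather than PatternHunter II's implementation: the candidate pool is every weight-3 model of length at most 5, the region is 16 positions at similarity 0.7, and the hit probability of a seed set is computed exactly with a small exponential-time dynamic program, which is only feasible because the models are short.

```python
from collections import defaultdict
from itertools import combinations


def hit_probability(seeds, n, p):
    """Exact probability that at least one seed in `seeds` hits a random
    length-n region (each position matches with probability p)."""
    m = max(len(s) for s in seeds)
    masks = [int(s, 2) for s in seeds]   # each seed right-aligned in the window
    window = (1 << m) - 1
    alive = {0: 1.0}                     # no-hit mass per window of last m bits
    for _ in range(n):
        nxt = defaultdict(float)
        for w, prob in alive.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                w2 = ((w << 1) | bit) & window
                if any(w2 & r == r for r in masks):
                    continue             # some seed hits, ending here
                nxt[w2] += prob * q
        alive = nxt
    return 1.0 - sum(alive.values())


def candidate_seeds(weight, max_len):
    """All models of the given weight (>= 2) that start and end with '1'."""
    out = []
    for length in range(weight, max_len + 1):
        for inner in combinations(range(1, length - 1), weight - 2):
            bits = ["0"] * length
            bits[0] = bits[-1] = "1"
            for i in inner:
                bits[i] = "1"
            out.append("".join(bits))
    return out


def greedy_select(candidates, k, n, p):
    """Steps 1-3 of the slide: repeatedly add the seed that maximizes the
    hit probability of the set chosen so far."""
    chosen = []                                      # step 1: A is empty
    for _ in range(k):                               # step 3: until |A| = K
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: hit_probability(chosen + [c], n, p))
        chosen.append(best)                          # step 2 and A = A ∪ {s}
    return chosen


pool = candidate_seeds(weight=3, max_len=5)
picked = greedy_select(pool, k=2, n=16, p=0.7)
print(picked, hit_probability(picked, 16, 0.7))
```

Each greedy round evaluates every remaining candidate, so the cost is dominated by the hit-probability oracle; in practice that oracle is the hard part, as the slide notes.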

Randomly generate m homologies independently. Suppose n of them are hit by our seeds, and let p be the sensitivity of our seeds. If …, then with probability 1-2/K, … This can be proved using Chernoff bounds.
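The sampling scheme described here, estimating the sensitivity as n/m, can be sketched as follows. The slide's bound did not survive extraction, so the code makes no claim about how large m must be; the function name, the independent-position homology model, and the use of a seeded `random.Random` for reproducibility are assumptions of this sketch.

```python
import random


def estimate_sensitivity(seeds, length, p, m, rng):
    """Estimate the sensitivity of a seed set as n/m: generate m random
    homologies of the given length (each position matches with probability
    p) and count the n that are hit by at least one seed."""
    required = [[i for i, c in enumerate(s) if c == "1"] for s in seeds]

    def is_hit(region):
        for seed, req in zip(seeds, required):
            for start in range(length - len(seed) + 1):
                if all(region[start + i] for i in req):
                    return True
        return False

    n = 0
    for _ in range(m):
        region = [rng.random() < p for _ in range(length)]
        if is_hit(region):
            n += 1
    return n / m


rng = random.Random(0)  # fixed seed so repeated runs give the same estimate
print(estimate_sensitivity(["1101", "10011"], length=20, p=0.7, m=5000, rng=rng))
```

The estimate converges to the true sensitivity as m grows, which is the kind of guarantee a Chernoff-style argument quantifies.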

The seeds obtained under a simple homology distribution (homology identity = 0.7, homology length=64) , , , , ……

Simulated sensitivity curves:
Solid curves: multiple (1, 2, 4, 8, 16) weight-12 spaced seeds.
Dashed curves: optimal spaced seeds with weight 11, 10, 9, 8.
Typically, doubling the number of seeds gains more sensitivity than decreasing the weight by 1.
[Figure labels: one weight-12, two weight-12, one weight-11.]

Coding region seeds
The first two bases of a codon are more conserved than the third base. Coding region matches have patterns like ……
Seeds trained under a coding region homology distribution are called coding region seeds. PHII's default seeds were trained under a simple distribution (0.8, 0.8, 0.5).

Experiments on real data
About 30k mouse ESTs (25 Mb) and 4k human ESTs (3 Mb) were downloaded from NCBI GenBank. "Low complexity" regions were filtered out.
SSearch (the Smith-Waterman method) finds "all" pairs of ESTs with significant local alignments.
The experiment measures what percentage of those pairs can be "found" by BLAST and by different configurations of PatternHunter.

Sensitivity curves:

Recent development
Can 100% sensitivity be achieved with reasonable speed? Yes. For homologies with at least 80% similarity, 100% sensitivity can be achieved with approximately 40 weight-9 seeds.

Open questions:
Can the hit probability of one seed (or a constant number of seeds) be computed in polynomial time? Currently, polynomial-time algorithms exist when the number of 0s in a seed is O(log n), and there is a PTAS.
Can the optimal seed (or set of seeds) be found in polynomial time?
For general distributions of the homologies, these problems are NP-hard.

How are the hits found efficiently?
Put all the seed words of the database in a lookup table. For each seed word in the query, find all of its occurrences in the database by looking it up in the table.
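The lookup-table idea can be sketched with a dictionary keyed by the letters at the required positions of the model. This is a minimal illustration, not PatternHunter's data structure: the function names are hypothetical, the model string is the assumed weight-11 shape used for the earlier slides, and the two sequences are the pair from the spaced-seed slide, treated here as an ungapped database and query.

```python
from collections import defaultdict


def seed_key(seq, start, offsets):
    """Letters of seq at the required seed positions, starting at `start`."""
    return "".join(seq[start + o] for o in offsets)


def build_index(database, seed):
    """Lookup table: seed word -> list of database start positions."""
    offsets = [i for i, c in enumerate(seed) if c == "1"]
    index = defaultdict(list)
    for start in range(len(database) - len(seed) + 1):
        index[seed_key(database, start, offsets)].append(start)
    return index, offsets


def find_hits(query, index, offsets, seed_len):
    """All (query_pos, database_pos) pairs whose seed words are identical."""
    hits = []
    for q in range(len(query) - seed_len + 1):
        for d in index.get(seed_key(query, q, offsets), ()):
            hits.append((q, d))
    return hits


SEED = "111010010100110111"                          # assumed model
DATABASE = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"    # lower sequence on the slide
QUERY = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"       # upper sequence on the slide

index, offsets = build_index(DATABASE, SEED)
print(find_hits(QUERY, index, offsets, len(SEED)))
```

Building the table takes one pass over the database, and each query position then costs one dictionary lookup plus the number of hits it returns, so total work scales with the number of hits rather than with the product of the sequence lengths.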