Pattern Matching in Weighted Sequences
Oren Kapah, Bar-Ilan University
Joint work with: Amihood Amir, Costas S. Iliopoulos, Ely Porat

2. Weighted Sequences
A weighted sequence T of length n over an alphabet Σ is a |Σ|×n matrix that contains the probability of each symbol appearing at each position (e.g., for DNA, one row for each of A, C, G, T). Also known as a Position Weight Matrix.
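A minimal sketch of one way to hold such a matrix (Python with NumPy, both choices of this sketch rather than the talk's), using the running example from slide 6 with alphabet {a, b, c}:

```python
import numpy as np

# Rows are the alphabet symbols, columns are text positions;
# each column is a probability distribution (sums to 1).
ALPHABET = "abc"
pwm = np.array([
    [0.5, 0.2, 0.4, 0.0, 0.1],  # a
    [0.5, 0.7, 0.0, 1.0, 0.0],  # b
    [0.0, 0.1, 0.6, 0.0, 0.9],  # c
])

def prob(symbol: str, position: int) -> float:
    """Probability that `symbol` appears at `position` in the weighted text."""
    return float(pwm[ALPHABET.index(symbol), position])

assert np.allclose(pwm.sum(axis=0), 1.0)  # columns are distributions
```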

3. Pattern Matching in Weighted Sequences
Problem Definition: Given a threshold probability ε, find all occurrences of the pattern P (|P| = m) in the weighted sequence T (|T| = n), i.e., all text positions j where:
  Pr_T(P[0], j) · Pr_T(P[1], j+1) · … · Pr_T(P[m-1], j+m-1) ≥ ε
By applying the logarithm, the product becomes a sum:
  log Pr_T(P[0], j) + log Pr_T(P[1], j+1) + … + log Pr_T(P[m-1], j+m-1) ≥ log ε
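Slide 5 below refers to "the trivial algorithm"; the following is a sketch of what such a direct O(nm) check might look like (Python, reusing the pwm and ALPHABET of the previous sketch; treating log 0 as -inf is this sketch's choice):

```python
import math

def weighted_occurrences(pwm, alphabet, pattern, eps):
    """Trivial O(n*m) algorithm: for every alignment j, sum the
    log-probabilities of the aligned pattern symbols and compare
    the total against log(eps)."""
    n, m = pwm.shape[1], len(pattern)
    log_eps = math.log(eps)
    hits = []
    for j in range(n - m + 1):
        score = 0.0
        for i, sym in enumerate(pattern):
            p = pwm[alphabet.index(sym), j + i]
            score += math.log(p) if p > 0 else float("-inf")
        if score >= log_eps:
            hits.append(j)
    return hits

# Running example: P = "abc" has probability 0.5*0.7*0.6 = 0.21 at
# position 0 and 0.4*1.0*0.9 = 0.36 at position 2, so with eps = 0.2:
# weighted_occurrences(pwm, ALPHABET, "abc", 0.2) -> [0, 2]
```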

4. Naïve Algorithm (Bounded Alphabet Size)
For each σ in Σ:
1. Construct a vector P_σ such that P_σ[i] = 1 if σ occurs at position i in P, and P_σ[i] = 0 otherwise.
2. Calculate the sum of probabilities by convolving the row of σ in T with P_σ.
For each text position, sum the results over all σ.
Time: O(n|Σ| log m)
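A sketch of this convolution-per-symbol algorithm (Python with SciPy's fftconvolve; the NEG_INF sentinel for log 0 is an assumption of this sketch, not part of the talk):

```python
import numpy as np
from scipy.signal import fftconvolve

NEG_INF = -1e9  # sentinel for log(0); an assumption of this sketch

def naive_weighted_matching(pwm, alphabet, pattern, eps):
    """O(n|Sigma| log m): one convolution per alphabet symbol.
    Convolving sigma's log-probability row with the reversed 0/1
    indicator of sigma's positions in P sums, for every alignment,
    exactly the log-probabilities contributed by sigma."""
    n, m = pwm.shape[1], len(pattern)
    log_pwm = np.where(pwm > 0, np.log(np.maximum(pwm, 1e-300)), NEG_INF)
    scores = np.zeros(n - m + 1)
    for k, sigma in enumerate(alphabet):
        p_sigma = np.array([1.0 if c == sigma else 0.0 for c in pattern])
        full = fftconvolve(log_pwm[k], p_sigma[::-1])
        scores += full[m - 1 : n]  # keep entries for full overlaps only
    return [j for j, s in enumerate(scores) if s >= np.log(eps)]

# naive_weighted_matching(pwm, ALPHABET, "abc", 0.2) -> [0, 2]
```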

5. Matching in Weighted Sequences (Unbounded Alphabet Size)
Input: triplets (C, I, P) — character, position, probability — given whenever P ≠ 0. Let s = the number of triplets.
Applying the naïve algorithm in this case results in an O(n|Σ| log m) = O(nm log m) algorithm, since |Σ| can be as large as m. This is worse than the trivial algorithm.
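A sketch of this sparse triplet input, derived from the pwm above (the triplet order is arbitrary):

```python
# Sparse representation: one triplet (character, position, probability)
# per non-zero entry of the weighted sequence.
triplets = [(sym, j, float(pwm[k, j]))
            for k, sym in enumerate(ALPHABET)
            for j in range(pwm.shape[1])
            if pwm[k, j] > 0]
# Running example: s = len(triplets) = 10, e.g. ('a', 0, 0.5).
```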

6. Example
T (probability triplets, grouped by position):
  0: (a,0,0.5) (b,0,0.5)
  1: (a,1,0.2) (b,1,0.7) (c,1,0.1)
  2: (a,2,0.4) (c,2,0.6)
  3: (b,3,1.0)
  4: (a,4,0.1) (c,4,0.9)
P: a b c
R (base-10 logarithms of the probabilities):
  0: (a,0,-0.3) (b,0,-0.3)
  1: (a,1,-0.7) (b,1,-.15) (c,1,-1.0)
  2: (a,2,-0.4) (c,2,-.22)
  3: (b,3,0.0)
  4: (a,4,-1.0) (c,4,-.05)

7. Step 1: Subset Matching
Observation 1: A weighted match can only appear at positions where a subset match can be found.
Step 1a: Build a new text T_s in which each text position holds the set of all letters that have non-zero probability there.
Step 1b: Mark all the positions where a subset match is found.
Time: O(s log² s) (Cole & Hariharan, STOC 2002)
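Cole & Hariharan's subset-matching algorithm is involved; the following naive O(nm) stand-in (not their method) computes the same filter on the running example:

```python
def subset_match_positions(triplets, pattern, n):
    """Naive subset matching: position j matches if every pattern symbol
    P[i] is in the set of letters with non-zero probability at j+i."""
    m = len(pattern)
    letters_at = [set() for _ in range(n)]
    for sym, pos, _prob in triplets:
        letters_at[pos].add(sym)
    return [j for j in range(n - m + 1)
            if all(pattern[i] in letters_at[j + i] for i in range(m))]

# Running example: the per-position letter sets are
# {a,b}, {a,b,c}, {a,c}, {b}, {a,c}, so "abc" subset-matches at [0, 2].
```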

8. Example
T (log-probability triplets, grouped by position):
  0: (a,0,-0.3) (b,0,-0.3)
  1: (a,1,-0.7) (b,1,-.15) (c,1,-1.0)
  2: (a,2,-0.4) (c,2,-.22)
  3: (b,3,0.0)
  4: (a,4,-1.0) (c,4,-.05)
P: a b c
T': {a,b}, {a,b,c}, {a,c}, {b}, {a,c}
P': {a}, {b}, {c}
Subset match positions: 0, 2

9. Step 2: Main Idea
Linearize the input into raw vectors T' and P' of size O(s), such that:
- T' contains the probabilities.
- P' contains 1's and 0's.
Then sum the probabilities using convolution.
The linearization is done by shifting, where each symbol is assigned a different shift. The same shifts are used in both the text and the pattern.

10. Example
Shifts: a → 0, b → 3, c → 1 (a triplet at position i with symbol σ is placed in slot i + shift(σ)).
T' (slots 0–6; two triplets placed in the same slot collide):
  0: (a,-0.3)
  1: (a,-0.7)
  2: (a,-0.4) (c,-1.0)
  3: (b,-0.3) (c,-.22)
  4: (a,-1.0) (b,-.15)
  5: (c,-.05)
  6: (b,0.0)
P': a _ _ c b _ _

11. Step 2: Linearization
Definitions: a singleton is a slot to which only one triplet is assigned; a multiple is a slot to which more than one triplet is assigned.
Text: replace every singleton with the probability of its triplet; empty and multiple slots are replaced by 0.
Pattern: replace every singleton with 1; empty and multiple slots are replaced by 0.
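A sketch of this linearization (Python; linearize, shifts, and size are names invented here, with collisions handled exactly as defined above):

```python
from collections import defaultdict

def linearize(triplets, pattern, shifts, size):
    """Build T'' and P'': place triplet (sym, i, value) at slot
    i + shifts[sym]; a slot keeps its value only if it is a singleton,
    while empty and multiple slots become 0."""
    t_slots, p_slots = defaultdict(list), defaultdict(list)
    for sym, i, value in triplets:
        t_slots[i + shifts[sym]].append(value)
    for i, sym in enumerate(pattern):
        p_slots[i + shifts[sym]].append(1.0)
    t2 = [t_slots[k][0] if len(t_slots[k]) == 1 else 0.0 for k in range(size)]
    p2 = [p_slots[k][0] if len(p_slots[k]) == 1 else 0.0 for k in range(size)]
    return t2, p2

# Slide 10's shifts: {'a': 0, 'b': 3, 'c': 1}. With the log-probability
# triplets of slide 6 this yields T'' = [-0.3, -0.7, 0, 0, 0, -.05, 0.0]
# and P'' = [1, 0, 0, 1, 1, 0, 0]; convolving T'' with the reversed P''
# then sums the values of aligned singletons per alignment.
```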

12. Example
From slide 10's T' and P' (singletons at slots 0, 1, 5, 6; multiples at slots 2, 3, 4):
T'': -0.3, -0.7, 0, 0, 0, -.05, 0.0
P'': 1, 0, 0, 1, 1, 0, 0
This allows us to sum the probabilities using convolution.
Question: Are we summing the right values?

13. Step 2: Correctness
Lemma: At any position where a subset match exists, two aligned singletons must originate from the same letter.
Proof: Assume there is a subset match at position i of the text, and there are two aligned singletons T'(i+j) and P'(j). The pattern singleton P'(j) originated from some letter σ at pattern position j − shift(σ). Since position i is a subset match, the text has a triplet with letter σ at text position i + j − shift(σ), and that triplet is mapped to slot i + j. But T'(i+j) is a singleton, so this σ-triplet is the only one there; hence both singletons originate from σ.

14. Step 2: Completeness
Problem: We did not sum all the probabilities! Triplets that fall in multiple slots contribute nothing.
Solution: Use a set of O(log s) such shifting sets, so that every triplet appears as a singleton in at least one of them.
Problem: Using several shifting sets can cause a probability to be added more than once!
Solution: Zero the probability of a triplet after the first time it appears as a singleton.
Time: O(s log² s)

15. Caution: Do Not Delete the Triplet
With slide 10's T' and P': zero the triplet's probability, but do not remove it from its slot. Deleting (c,-.22) from slot 3 would cause (b,-0.3) to appear as a singleton!

16. Hamming Distance – Text Errors (Bounded Alphabet Size)
Problem Definition: Given a threshold probability ε, find for each text position j the minimal number of probabilities which, by changing them to 1, yield:
  Pr_T(P[0], j) · Pr_T(P[1], j+1) · … · Pr_T(P[m-1], j+m-1) ≥ ε
In the case of errors in the text, a match can always be found. This does not apply to the case of errors in the pattern.
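A direct O(nm log m) sketch of this definition (Python; the per-position greedy, not the block-based algorithm outlined on the next slide; NEG_INF stands in for log 0):

```python
import math

NEG_INF = -1e9  # sentinel for log(0); an assumption of this sketch

def min_text_errors(pwm, alphabet, pattern, eps):
    """For every alignment j, the minimal number of aligned probabilities
    that must be raised to 1 so the product reaches eps. Greedy over the
    sorted log-probabilities is optimal: raising a probability to 1
    deletes its log term, and deleting the most negative term helps most."""
    n, m = pwm.shape[1], len(pattern)
    log_eps = math.log(eps)
    result = []
    for j in range(n - m + 1):
        logs = sorted(
            (math.log(p) if p > 0 else NEG_INF)
            for p in (pwm[alphabet.index(sym), j + i]
                      for i, sym in enumerate(pattern))
        )
        total, k = sum(logs), 0
        while total < log_eps and k < m:
            total -= logs[k]  # change this probability to 1 (log 1 = 0)
            k += 1
        result.append(k)
    return result

# Running example with eps = 0.3: position 0 has product 0.21 < 0.3, and
# raising the 0.5 to 1 gives 0.7*0.6 = 0.42, so one change suffices there.
```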

17. Hamming Distance – Text Errors: Algorithm Outline
1. Sort the probabilities in the weighted sequence.
2. Divide the sorted list of probabilities into blocks of size (n|Σ|)^(1/2).
3. Calculate the sum of probabilities for each block.
4. For each text location:
   a. Add whole blocks until the sum goes below the threshold.
   b. Then add probabilities from the last block, one by one, until the sum goes below the threshold.
Time:

18. Unbounded Alphabet Size: Algorithm 1
1. Divide the list of probabilities into blocks of size s^(1/2).
2. For each block, calculate the sums of probabilities using the shifting algorithm.
3. For each text position and each block:
   - If a subset match exists, use the shifting algorithm's result.
   - Else, use brute force.
Time: a function of k, the number of blocks per text position for which there is no subset match.

19. Unbounded Alphabet Size: Algorithm 2
1. Sort the probabilities in the weighted sequence.
2. Divide the list of probabilities into blocks of size s^(2/3).
3. For each block:
   a. Calculate the sum of the non-frequent letters' probabilities: O(s·m^(2/3)).
   b. Calculate the sum of the frequent letters' probabilities: O(s^(1/3)·m^(1/3)·n·log m).
4. Continue as in the previous algorithm.
Time: O(s·m^(2/3) + s^(1/3)·m^(1/3)·n·log m)

20. Unbounded Alphabet Size: Combined Algorithm
Start with the first algorithm.
If k is small, complete the first algorithm; else, apply the second algorithm.