P-value Calculating Problem Ph. D. Thesis by Jing Zhang Presented by Chao Wang.


Problem Description Given an independent and identically distributed (i.i.d.) model R over an alphabet Σ, a pattern m over the same alphabet, and an integer k, compute the probability that m hits a sequence generated by R at least k times.

Problem Description Note that overlapping matches are not counted in this problem. For example, the pattern “ACGACG” matches the target “TACGACGACGG” only once, because the two occurrences starting at the 2nd and 5th positions overlap.
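The counting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `count_non_overlapping` is made up here, and it assumes matches are taken greedily from left to right, which for a single pattern gives the largest possible number of non-overlapping matches.

```python
def count_non_overlapping(target: str, pattern: str) -> int:
    """Count occurrences of pattern in target, scanning left to right and
    jumping past each match so that counted occurrences never overlap."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    pos = 0
    while True:
        pos = target.find(pattern, pos)
        if pos == -1:
            return count
        count += 1
        pos += len(pattern)  # resume after the match: overlaps are not counted


if __name__ == "__main__":
    # The slide's example: the occurrences at positions 2 and 5 overlap.
    print(count_non_overlapping("TACGACGACGG", "ACGACG"))
```

Python's built-in `str.count` follows the same non-overlapping convention, so it could be used instead of the explicit loop.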

An instance of the problem Alphabet: {A, C, G, T} Target sequence: …… Pattern: ACTTGG Each position has the same distribution: A: 0.5, C: 0.3, G: 0.1, T: 0.1

Two cases of the target sequence The length of the target sequence is either infinite or finite.

Infinite case The target sequence has infinite length. Define f(k) as the probability that m hits the sequence at least k times. This case is easy to calculate because f(k) satisfies a simple equality.

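Infinite case Intuitive meaning of the equation: consider whether or not m hits the first |m| positions, and divide the probability into these two cases. [Figure: m = ACCGT aligned with the start of R, once for the case where m doesn't hit R[0, |m|-1] and once for the case where m hits R[0, |m|-1].]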

Infinite case A trivial observation: f(0) = 1. From this, we can easily prove by mathematical induction that f(k) = 1 for every positive integer k.
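The equality for the infinite case is not preserved in this transcript. One form consistent with the two-case description and with f(0) = 1 is sketched below. The symbol p (the probability that m matches a fixed window of |m| positions) and the notation Pr(m[t]) for the model's probability of generating the character m[t] are assumptions introduced here, not taken from the slides.

```latex
\begin{align*}
p &= \prod_{t=0}^{|m|-1} \Pr\bigl(m[t]\bigr) > 0, \\
f(k) &= \underbrace{p \, f(k-1)}_{m \text{ hits } R[0,\,|m|-1]}
      + \underbrace{(1-p) \, f(k)}_{m \text{ doesn't hit } R[0,\,|m|-1]}, \\
\Longrightarrow \quad p \, f(k) &= p \, f(k-1)
\quad\Longrightarrow\quad f(k) = f(k-1) = \dots = f(0) = 1 .
\end{align*}
```

Under this reading, the induction step is just the cancellation in the last line: since p > 0, f(k) = f(k-1).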

Infinite case An interesting example: a monkey is hitting the keys of a keyboard at random. If it types for a sufficiently long time, the output contains the great drama “Macbeth” with probability 1. Does this agree with our intuition?

Finite case The equality from the infinite case no longer holds. Why? Because the f(k) on the left-hand side is not the same quantity as the f(k) on the right-hand side: they refer to target regions of different lengths.

Finite case [Figure: the target sequence ACGGTATTGCCAATG, with the regions covered by the f(k) on the left-hand side and by the f(k) on the right-hand side marked.]

Finite case How to solve the puzzle? Dynamic Programming.

Trial 1 Pr(i,k) denotes the probability that m hits R[i,n] at least k times. m = R[i] means that m[t] = R[i+t] for all 0 ≤ t < |m|, and m ≠ R[i] means the opposite.

Trial 1 Why did it fail? Because the condition m ≠ R[i] covers many different cases, so DP doesn't work.

Basic Idea We need to compare every position of m with R[i, i+|m|-1]. The number of cases is 2^|m| (each pair of positions may be equal or not). In fact, only the prefixes of m need to be considered, and there are only |m| of them. Thus, DP can work well.

Calculating the P-value for a word motif A simple algorithm for the case k = 1. Basic idea: calculate a series of conditional probabilities instead of the target probability directly. For a string w over the alphabet Σ and a position i, the conditional probability f(i, w) is the probability that m hits R[i, n], given that R[i, n] begins with w.

The Definition of Conditional Probability [Figure: region R = ACTTGGTACCACTCG with positions 1, i and n marked; w = GTA, m = ACCAC.]

Calculating the P-value for a word motif Then the target hit probability of m in region R equals f(1, ε), where ε is the empty string. For any f(i, w), we decompose it according to the character following w in region R[i, n].

Calculating the P-value for a word motif Next we define the longest suffix. Let P(m) be the set of all prefixes of a word m. For any string w, consider the longest suffix of w which is in P(m). For example, take m = ACCAC and w = CCAC. The prefixes of m are the empty string, A, AC, ACC, ACCA and ACCAC, so the longest suffix of w in P(m) is AC.
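A minimal Python sketch of this definition follows. The function name `longest_suffix_in_prefixes` is made up for illustration, and the scan is deliberately naive; a KMP-style failure function would do the same job faster.

```python
def longest_suffix_in_prefixes(w: str, m: str) -> str:
    """Return the longest suffix of w that is also a prefix of m.

    The empty string is a prefix of every m, so a result always exists.
    """
    # Try candidate lengths from longest to shortest.
    for length in range(min(len(w), len(m)), -1, -1):
        suffix = w[len(w) - length:]
        if m.startswith(suffix):
            return suffix
    return ""  # not reached: length 0 always matches


if __name__ == "__main__":
    # The slide's example: m = ACCAC, w = CCAC.
    print(longest_suffix_in_prefixes("CCAC", "ACCAC"))
```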

Calculating the P-value for a word motif Then the following observation helps to constrain the domain of w to P(m). For w that does not belong to P(m), f(i, w) = f(i′, w′), where w′ is the longest suffix of w which is in P(m) and i′ = i + |w| − |w′|.

Calculating the P-value for a word motif Case 1: No non-empty prefix of m is a suffix of w. [Figure: R = ACTTGGTGCCACTCG, m = ACCAC, w = GTG; positions 1, i, n before the reduction and 1, i′, n after it.]

Calculating the P-value for a word motif Case 2: A prefix of m is a suffix of w; here AC is the longest such prefix. [Figure: R = ACTTGGTACCACTCG, m = ACCAC, w = GTAC, reduced to w = AC; positions 1, i, n before the reduction and 1, i′, n after it.]

Calculating the P-value for a word motif Algorithm 1 shows that f(i, w) can be computed by DP in polynomial time. Algorithm 2 shows how to calculate f(i, w).

Algorithm 1

Algorithm 2: calculate f(i,w)
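The listings for Algorithms 1 and 2 are not reproduced in this transcript. What follows is therefore not the thesis's algorithm but a small memoized sketch of the same idea. It assumes, as the worked example with m = 101 suggests, that f(i, w) is the probability that m hits R[i, n] given that R[i, n] begins with w. The names `hit_probability`, `probs` and `longest_suffix` are introduced here for illustration.

```python
from functools import lru_cache


def hit_probability(m: str, n: int, probs: dict) -> float:
    """Probability that m occurs at least once in an i.i.d. sequence R[1..n].

    probs maps each character of the alphabet to its probability.
    """

    def longest_suffix(w: str) -> str:
        # Longest suffix of w that is a prefix of m (possibly empty).
        for length in range(min(len(w), len(m)), -1, -1):
            if m.startswith(w[len(w) - length:]):
                return w[len(w) - length:]
        return ""

    @lru_cache(maxsize=None)
    def f(i: int, w: str) -> float:
        # w is always a prefix of m and occupies R[i .. i+|w|-1].
        if w == m:
            return 1.0          # m has been matched completely
        if i + len(w) > n:
            return 0.0          # no position left for the next character
        total = 0.0
        for c, p in probs.items():
            wc = w + c
            s = longest_suffix(wc)
            # Reduce (i, wc) to (i', s) with i' = i + |wc| - |s|.
            total += p * f(i + len(wc) - len(s), s)
        return total

    return f(1, "")


if __name__ == "__main__":
    # The example from the slides: |R| = 4, m = 101, Pr(1) = 0.4.
    print(hit_probability("101", 4, {"0": 0.6, "1": 0.4}))
```

Only prefixes of m ever appear as w, so the table has about n · (|m| + 1) entries and each entry costs one pass over the alphabet, which is the polynomial bound the slides refer to.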

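Calculating the P-value for a word motif We generalize Algorithms 1 and 2 to arbitrary k by defining, for each j from 1 to k, the probability that m hits R[i, n] at least j times given that R[i, n] begins with w. The value for j = k, i = 1 and empty w is exactly the P-value we want to calculate.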

Calculating the P-value for a word motif Then the recursion formulae here are:

Calculating the P-value for a word motif Algorithm 3 shows how to compute the P-value
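Algorithm 3 and its recursion formulae are not reproduced in this transcript, so the code below is a hedged guess at one way the generalization can work, not the thesis's listing. It assumes an extra index j for the number of hits still required, and it assumes that after a complete match the search restarts with an empty w immediately after the match, so that overlapping matches are never counted. The names `p_value`, `probs` and `longest_suffix` are hypothetical.

```python
from functools import lru_cache


def p_value(m: str, n: int, k: int, probs: dict) -> float:
    """Probability of at least k non-overlapping hits of m in R[1..n]."""

    def longest_suffix(w: str) -> str:
        # Longest suffix of w that is a prefix of m (possibly empty).
        for length in range(min(len(w), len(m)), -1, -1):
            if m.startswith(w[len(w) - length:]):
                return w[len(w) - length:]
        return ""

    @lru_cache(maxsize=None)
    def f(j: int, i: int, w: str) -> float:
        if j == 0:
            return 1.0                     # nothing left to find
        if w == m:
            # One hit found: restart after it so hits cannot overlap.
            return f(j - 1, i + len(m), "")
        if i + len(w) > n:
            return 0.0                     # sequence exhausted
        total = 0.0
        for c, p in probs.items():
            wc = w + c
            s = longest_suffix(wc)
            total += p * f(j, i + len(wc) - len(s), s)
        return total

    return f(k, 1, "")


if __name__ == "__main__":
    # With m = 11 and |R| = 4, two non-overlapping hits require 1111.
    print(p_value("11", 4, 2, {"0": 0.6, "1": 0.4}))
```

For k = 1 this reduces to the single-hit recursion, and the table grows by a factor of k.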

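A simple example i.i.d. model, |R| = 4, m = 101, each position generates 1 with probability 0.4 (and 0 with probability 0.6). Compute the map first: for every prefix w of m and every character c, the longest suffix of wc that is a prefix of m.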
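The table for this map is not preserved in the transcript. The sketch below rebuilds it, assuming the map sends a pair (w, c) to the longest suffix of wc that is a prefix of m. The name `build_prefix_map` is made up for illustration.

```python
def build_prefix_map(m: str, alphabet: str) -> dict:
    """Map (w, c) -> longest suffix of w + c that is a prefix of m,
    for every prefix w of m (including the empty string and m itself)."""
    table = {}
    for j in range(len(m) + 1):
        w = m[:j]
        for c in alphabet:
            wc = w + c
            # Try candidate suffix lengths from longest to shortest.
            for length in range(min(len(wc), len(m)), -1, -1):
                suffix = wc[len(wc) - length:]
                if m.startswith(suffix):
                    table[(w, c)] = suffix
                    break
    return table


if __name__ == "__main__":
    for (w, c), target in sorted(build_prefix_map("101", "01").items()):
        print(repr(w), c, "->", repr(target))
```

This is the transition table of the usual string-matching automaton for m, so it only needs to be computed once before filling the DP table.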

A simple example Initialize the DP table f(i, w), indexed by w and i.

A simple example Compute the DP entries using the recursive representation (ε denotes the empty string):
f(2,10) = f(2,100)*p(generate 0) + f(2,101)*p(generate 1) = f(5,ε)*0.6 + 1*0.4 = 0.4
f(2,1) = f(2,10)*p(generate 0) + f(2,11)*p(generate 1) = f(2,10)*0.6 + f(3,1)*0.4 = 0.24
f(2,ε) = f(2,0)*p(generate 0) + f(2,1)*p(generate 1) = f(3,ε)*0.6 + f(2,1)*0.4 = 0.096

A simple example
f(1,10) = f(1,100)*p(generate 0) + f(1,101)*p(generate 1) = f(4,ε)*0.6 + 1*0.4 = 0.4
f(1,1) = f(1,10)*p(generate 0) + f(1,11)*p(generate 1) = 0.4*0.6 + f(2,1)*0.4 = 0.336

A simple example
f(1,ε) = f(1,0)*p(generate 0) + f(1,1)*p(generate 1) = f(2,ε)*0.6 + f(1,1)*0.4 = 0.096*0.6 + 0.336*0.4 = 0.192
The final result is f(1,ε) = 0.192
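The value 0.192 can be cross-checked without any DP by enumerating all 2^4 sequences of length 4. The helper below is only a sanity check written for this example. Its name and parameters are assumptions, and it is exponential in |R|, so it is not a substitute for the DP.

```python
from itertools import product


def brute_force_hit_probability(m: str, n: int, probs: dict, k: int = 1) -> float:
    """Sum the probabilities of all length-n sequences in which m occurs
    at least k times (str.count counts non-overlapping occurrences)."""
    total = 0.0
    for chars in product(probs, repeat=n):
        sequence = "".join(chars)
        if sequence.count(m) >= k:
            weight = 1.0
            for c in chars:
                weight *= probs[c]
            total += weight
    return total


if __name__ == "__main__":
    # Sequences containing 101: 0101, 1010, 1011, 1101.
    print(brute_force_hit_probability("101", 4, {"0": 0.6, "1": 0.4}))
```

The four matching sequences have probabilities 0.0576, 0.0576, 0.0384 and 0.0384, which sum to 0.192.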

A simple example The final DP table f(i, w), indexed by w and i, is as follows: