PODS Usefulness: Trees and graphs represent data in many domains, including linguistics, vision, chemistry, and the web (and even sociology). Tree and graph searching algorithms are used to retrieve information from such data.
PODS Tree Inclusion [Figure: two labeled trees, (a) and (b), with node labels including Book, Title, Editor, Chapter, Author, Name, XML, John, Mary, Jack, OLAP and an unknown node "?".]
PODS Vision Application: Handwriting Characters Representation. From pixels to a small attributed graph. [Figure: attributed graph with labels l1 to l5 and e1 to e5.] D. Geiger, R. Giugno, D. Shasha, ongoing work at New York University.
PODS Vision Application: Handwriting Characters Recognition. [Figure: a QUERY graph (labels l1 to l5, e1 to e5) is compared against several DATABASE graphs (labels l1 to l5, e1 to e7); one database graph is marked "Best Match".]
PODS Vision Application: Region Adjacency Graphs. J. Lladós, E. Martí, and J.J. Villanueva, Symbol Recognition by Error-Tolerant Subgraph Matching between Region Adjacency Graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(10):1137-1143, 2001.
PODS Chemistry Application: Protein Structure Search. Daylight, MDL, BCI.
PODS Algorithmic Questions Question: why can’t I search for trees or graphs at the speed of keyword searches? (Proper data structure) Why can’t I compare trees (or graphs) as easily as I can compare strings?
Zheng Zhang, Exact Matching, slide 21. Different forms of exact matching: Graph isomorphism – A one-to-one correspondence must be found between each node of the first graph and each node of the second graph. – Graphs G(V_G, E_G) and H(V_H, E_H) are isomorphic if there is an invertible mapping F from V_G to V_H such that, for all nodes u and v in V_G, (u,v) ∈ E_G if and only if (F(u),F(v)) ∈ E_H.
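A minimal sketch of the isomorphism definition, assuming undirected graphs given as a node list plus an edge list (the name `are_isomorphic` and this input format are illustrative, not from the slides). It simply tries every bijection F, so it is only usable on tiny graphs:

```python
from itertools import permutations


def are_isomorphic(nodes_g, edges_g, nodes_h, edges_h):
    """Brute-force isomorphism test for small undirected graphs."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    # Cheap necessary conditions: same number of nodes and of edges.
    if len(nodes_g) != len(nodes_h) or len(eg) != len(eh):
        return False
    for perm in permutations(nodes_h):
        f = dict(zip(nodes_g, perm))  # candidate bijection F: V_G -> V_H
        # F is injective and the edge counts agree, so mapping every
        # edge of G onto an edge of H also gives the "only if" direction.
        if all(frozenset(f[x] for x in e) in eh for e in eg):
            return True
    return False


path_g = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_h = ([1, 2, 3], [(1, 3), (3, 2)])
print(are_isomorphic(*path_g, *path_h))
```

The factorial running time is the point of the later slides: practical algorithms prune this search rather than enumerate it.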
Zheng Zhang, Exact Matching, slide 22. Different forms of exact matching: Subgraph isomorphism – It requires that an isomorphism holds between one of the two graphs and a node-induced subgraph of the other. Monomorphism – It requires that each node of the first graph is mapped to a distinct node of the second one, and each edge of the first graph has a corresponding edge in the second one; the second graph, however, may have both extra nodes and extra edges.
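The difference between the two notions can be sketched as a check on a given node mapping. This assumes undirected graphs as edge lists and a mapping `f` stored as a dict; the helper name `check_mapping` and its `induced` flag are hypothetical:

```python
def check_mapping(f, edges_g, edges_h, induced):
    """Check a node mapping f from the first graph G into the second graph H.

    induced=False tests the monomorphism condition;
    induced=True additionally forbids extra edges among the image nodes,
    which is the node-induced subgraph isomorphism condition.
    """
    # Each node of G must go to a distinct node of H.
    if len(set(f.values())) != len(f):
        return False
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    image_edges = {frozenset(f[x] for x in e) for e in eg}
    # Every edge of G needs a corresponding edge in H.
    if not image_edges <= eh:
        return False
    if induced:
        image_nodes = set(f.values())
        # H may not contain an edge between image nodes that G lacks.
        for e in eh:
            if e <= image_nodes and e not in image_edges:
                return False
    return True


path = [("a", "b"), ("b", "c")]
triangle = [(1, 2), (2, 3), (1, 3)]
f = {"a": 1, "b": 2, "c": 3}
print(check_mapping(f, path, triangle, induced=False))
print(check_mapping(f, path, triangle, induced=True))
```

A path maps into a triangle as a monomorphism, but not as an induced subgraph, because the triangle has one extra edge between the image nodes.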
Zheng Zhang, Exact Matching, slide 23. Different forms of exact matching: Homomorphism – It drops the condition that nodes in the first graph are to be mapped to distinct nodes of the other; hence, the correspondence can be many-to-one. – A graph homomorphism F from graph G(V_G, E_G) to graph H(V_H, E_H) is a mapping F from V_G to V_H such that {x,y} ∈ E_G implies {F(x),F(y)} ∈ E_H.
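As a small illustration of the homomorphism condition (a sketch with a hypothetical helper name, assuming undirected graphs as edge lists), a 4-cycle can be folded many-to-one onto a single edge, while a triangle cannot:

```python
def is_homomorphism(f, edges_g, edges_h):
    """True if {x, y} in E_G implies {f[x], f[y]} in E_H."""
    eh = {frozenset(e) for e in edges_h}
    # If f[x] == f[y], the image is a one-element set, which is an
    # edge of H only when H has a self-loop at that node.
    return all(frozenset((f[x], f[y])) in eh for x, y in edges_g)


cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
triangle = [(0, 1), (1, 2), (2, 0)]
single_edge = [("a", "b")]

print(is_homomorphism({0: "a", 1: "b", 2: "a", 3: "b"}, cycle4, single_edge))
print(is_homomorphism({0: "a", 1: "b", 2: "a"}, triangle, single_edge))
```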
Zheng Zhang, Exact Matching, slide 24. Different forms of exact matching: Maximum common subgraph (MCS) – A subgraph of the first graph is mapped to an isomorphic subgraph of the second one. – There are two possible definitions of the problem, depending on whether node-induced subgraphs or plain subgraphs are used. – The problem of finding the MCS of two graphs can be reduced to the problem of finding the maximum clique (i.e. a fully connected subgraph) in a suitably defined association graph.
Zheng Zhang, Exact Matching, slide 25. Different forms of exact matching: Properties – The matching problems are all NP-complete except for graph isomorphism, which is in NP but has not been shown to be either NP-complete or solvable in polynomial time. – Exact graph matching has exponential time complexity in the worst case. However, in many pattern recognition (PR) applications the actual computation time can still be acceptable. – Exact isomorphism is very seldom used in PR. Subgraph isomorphism and monomorphism can be effectively used in many contexts. – The MCS problem is receiving much attention.
Zheng Zhang, Exact Matching, slide 26. Necessary concepts: Epimorphism – a surjective homomorphism. Monomorphism – an injective homomorphism. Endomorphism – a homomorphism from an object to itself. Automorphism – an endomorphism which is also an isomorphism, i.e. an isomorphism of an object with itself.
Zheng Zhang, Exact Matching, slide 30. Algorithms for exact matching: Techniques based on tree search – mostly based on some form of tree search with backtracking: Ullmann's algorithm; Ghahraman's algorithm; VF and VF2 algorithms; Bron and Kerbosch's algorithm; other algorithms for the MCS problem. Other techniques – based on the A* algorithm; Demko's algorithm – based on group theory: the Nauty algorithm.
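To make "tree search with backtracking" concrete, here is a generic sketch that searches for a monomorphism from a small graph into a larger one. It is not Ullmann's, VF, or VF2 (those add refinement and look-ahead rules); the function name and input format are assumptions, and simple undirected graphs without self-loops are assumed:

```python
def find_monomorphism(nodes_g, edges_g, nodes_h, edges_h):
    """Return a dict mapping G's nodes into H, or None if none exists."""
    adj_g = {u: set() for u in nodes_g}
    for u, v in edges_g:
        adj_g[u].add(v)
        adj_g[v].add(u)
    adj_h = {u: set() for u in nodes_h}
    for u, v in edges_h:
        adj_h[u].add(v)
        adj_h[v].add(u)

    order = list(nodes_g)
    mapping, used = {}, set()

    def extend(k):
        # Each level of the search tree assigns one more node of G.
        if k == len(order):
            return True
        u = order[k]
        for cand in nodes_h:
            if cand in used:
                continue
            # Feasibility: already-mapped neighbours of u must be
            # mapped to neighbours of the candidate node.
            if all(mapping[w] in adj_h[cand] for w in adj_g[u] if w in mapping):
                mapping[u] = cand
                used.add(cand)
                if extend(k + 1):
                    return True
                # Backtrack: undo the assignment and try the next candidate.
                del mapping[u]
                used.discard(cand)
        return False

    found = extend(0)
    return dict(mapping) if found else None


triangle = [("a", "b"), ("b", "c"), ("c", "a")]
square_with_diagonal = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(find_monomorphism(["a", "b", "c"], triangle, [1, 2, 3, 4], square_with_diagonal))
```

The feasibility test is where the named algorithms differ most; stronger pruning at that step reduces how much of the search tree is visited.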
36 Finite Markov Chain. An integer-time stochastic process, consisting of a domain D of m states {s_1, …, s_m} and: 1. An m-dimensional initial distribution vector (p(s_1), …, p(s_m)). 2. An m×m transition probability matrix M = (a_{s_i s_j}). For example, D can be the letters {A, C, T, G}, p(A) the probability that A is the first letter in a sequence, and a_{AG} the probability that G follows A in a sequence.
37 Simple Model - Markov Chains. Markov Property: The state of the system at time t+1 depends only on the state of the system at time t. [Figure: chain X_1 → X_2 → X_3 → X_4 → X_5.]
38 Markov Chain (cont.) [Figure: chain X_1 → X_2 → … → X_{n-1} → X_n.] For each integer n, a Markov chain assigns probability to sequences (x_1 … x_n) over D (i.e., x_i ∈ D) as follows: $p(x_1, \dots, x_n) = p(X_1 = x_1) \prod_{i=2}^{n} a_{x_{i-1} x_i}$. Similarly, (X_1, …, X_i, …) is a sequence of probability distributions over D.
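A minimal sketch of how a Markov chain assigns a probability to a sequence: the initial probability of the first state times one transition probability per consecutive pair. The two-state chain and all of its numbers are hypothetical, chosen only for illustration:

```python
def sequence_probability(seq, initial, transition):
    """p(x_1..x_n) = p(x_1) * product of a[x_{i-1}][x_i] for i = 2..n."""
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# Hypothetical chain over the states "a" and "b".
initial = {"a": 0.5, "b": 0.5}
transition = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.4, "b": 0.6},
}
print(sequence_probability("aab", initial, transition))  # 0.5 * 0.9 * 0.1
```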
39 Matrix Representation. The initial distribution vector (u_1, …, u_m) defines the distribution of X_1: p(X_1 = s_i) = u_i. The transition probability matrix is M = (a_{st}). [Figure: a 4×4 transition matrix with rows and columns labeled A, B, C, D; entries not recoverable.] M is a stochastic matrix: $\sum_{t} a_{st} = 1$ for every state s. After one move, the distribution changes to X_2 = X_1 M. In general, after i−1 moves the distribution is X_i = X_1 M^{i-1}.
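The update X_2 = X_1 M is a row vector times a matrix; repeating it gives the later distributions. A small sketch in plain Python, with a hypothetical two-state matrix (the helper names are not from the slides):

```python
def step(dist, M):
    """One move: returns the row vector dist * M."""
    m = len(dist)
    return [sum(dist[s] * M[s][t] for s in range(m)) for t in range(m)]


def distribution_at(x1, M, i):
    """Distribution X_i, obtained from X_1 by i - 1 moves."""
    dist = list(x1)
    for _ in range(i - 1):
        dist = step(dist, M)
    return dist


# Hypothetical stochastic matrix; each row sums to 1.
M = [[0.5, 0.5],
     [0.25, 0.75]]
x1 = [1.0, 0.0]
print(distribution_at(x1, M, 2))
print(distribution_at(x1, M, 3))
```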
40 Simple Example. Weather: – raining today → rain tomorrow: p_rr = 0.4 – raining today → no rain tomorrow: p_rn = 0.6 – no rain today → rain tomorrow: p_nr = 0.2 – no rain today → no rain tomorrow: p_nn = 0.8
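41 Simple Example. Transition matrix for the weather example (rows: today; columns: tomorrow; state order: rain, no rain): $P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$. Note that the rows sum to 1. Such a matrix is called a stochastic matrix. If the rows and the columns of a matrix all sum to 1, we have a doubly stochastic matrix.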
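The two definitions can be checked mechanically. A short sketch (helper names are illustrative) applied to the weather probabilities 0.4, 0.6, 0.2, 0.8 from the example:

```python
import math


def is_stochastic(M):
    """Non-negative entries and every row sums to 1."""
    return all(
        all(x >= 0 for x in row) and math.isclose(sum(row), 1.0)
        for row in M
    )


def is_doubly_stochastic(M):
    """Stochastic, and every column sums to 1 as well."""
    columns = list(zip(*M))
    return is_stochastic(M) and all(
        math.isclose(sum(col), 1.0) for col in columns
    )


weather = [[0.4, 0.6],
           [0.2, 0.8]]
print(is_stochastic(weather), is_doubly_stochastic(weather))
```

The weather matrix is stochastic but not doubly stochastic, since its columns sum to 0.6 and 1.4.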
42 Gambler's Example. – At each play: the gambler wins $1 with probability p; the gambler loses $1 with probability 1-p. – The game ends when the gambler goes broke or gains a fortune of $100. – Both $0 and $100 are absorbing states. [Figure: chain of states 0, 1, 2, …, N-1, N; each interior state moves up by one with probability p and down by one with probability 1-p; the start state is $10.]
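The slide does not ask for it, but a natural quantity here is the probability of reaching the fortune before going broke. The sketch below uses the standard closed-form solution of the gambler's ruin recurrence h(k) = p·h(k+1) + (1-p)·h(k-1) with h(0) = 0 and h(N) = 1; the function name and parameter names are assumptions:

```python
def win_probability(start, target, p):
    """Probability of hitting `target` before 0, starting from `start`."""
    if p == 0.5:
        # Fair game: the probability is linear in the starting capital.
        return start / target
    r = (1 - p) / p
    return (1 - r ** start) / (1 - r ** target)


# Start with $10, stop at $0 or $100, as in the example.
print(win_probability(10, 100, 0.5))
print(win_probability(10, 100, 0.49))
```

Even a slightly unfavourable p makes reaching $100 from $10 far less likely than in the fair game.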
43 Coke vs. Pepsi. Given that a person's last cola purchase was Coke, there is a 90% chance that her next cola purchase will also be Coke. If a person's last cola purchase was Pepsi, there is an 80% chance that her next cola purchase will also be Pepsi. [Figure: two-state transition diagram with states coke and pepsi.]
44 Coke vs. Pepsi. Given that a person is currently a Pepsi purchaser, what is the probability that she will purchase Coke two purchases from now? The transition matrix (corresponding to one purchase ahead; state order: Coke, Pepsi) is: $P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$.
45 Coke vs. Pepsi Given that a person is currently a Coke drinker, what is the probability that she will purchase Pepsi three purchases from now?
46 Coke vs. Pepsi. Assume each person makes one cola purchase per week. Suppose 60% of all people now drink Coke, and 40% drink Pepsi. What fraction of people will be drinking Coke three weeks from now? We regard Coke as state 0 and Pepsi as state 1, and let (Q_0, Q_1) = (0.6, 0.4) be the initial probabilities. We want to find $P(X_3 = 0) = Q_0\, p^{(3)}_{00} + Q_1\, p^{(3)}_{10}$, where $p^{(3)}_{ij}$ is the three-step transition probability from state i to state j.
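The three Coke vs. Pepsi questions reduce to powers of the one-step matrix built from the stated 90% and 80% loyalty figures. A sketch in plain Python (helper names are illustrative; state 0 is Coke, state 1 is Pepsi):

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(P, n):
    """n-step transition matrix P^n."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result


P = [[0.9, 0.1],   # Coke  -> Coke, Pepsi
     [0.2, 0.8]]   # Pepsi -> Coke, Pepsi
P2, P3 = matpow(P, 2), matpow(P, 3)

pepsi_to_coke_in_2 = P2[1][0]
coke_to_pepsi_in_3 = P3[0][1]
Q = [0.6, 0.4]
coke_share_after_3 = Q[0] * P3[0][0] + Q[1] * P3[1][0]
print(pepsi_to_coke_in_2, coke_to_pepsi_in_3, coke_share_after_3)
```

Up to floating-point rounding, the three answers are 0.34, 0.219 and 0.6438.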
47 “Good” Markov chains. For certain Markov chains, the distributions X_i, as i → ∞: (1) converge to a unique distribution, independent of the initial distribution; (2) in that unique distribution, each state has a positive probability. Call these Markov chains “good”. We describe these “good” Markov chains by considering the graph representation of stochastic matrices.
48 Representation as a Digraph. Each directed edge A → B is associated with the positive transition probability from A to B. [Figure: digraph on states A, B, C, D next to the corresponding 4×4 transition matrix.] We now define properties of this graph which guarantee: 1. Convergence to a unique distribution. 2. In that distribution, each state has positive probability.
49 Examples of “Bad” Markov Chains. Markov chains are not “good” if either: 1. They do not converge to a unique distribution. 2. They do converge to a unique distribution, but some states in this distribution have zero probability.
50 Bad case 1: Mutual Unreachability. [Figure: digraph on states A, B, C, D.] Consider two initial distributions: a) p(X_1 = A) = 1 (so p(X_1 = x) = 0 if x ≠ A); b) p(X_1 = C) = 1. In case a), the sequence will stay at A forever. In case b), it will stay in {C, D} forever. Fact 1: If G has two states which are unreachable from each other, then {X_i} cannot converge to a distribution which is independent of the initial distribution.
51 Bad case 2: Transient States. [Figure: digraph on states A, B, C, D.] Def: A state s is recurrent if it can be reached from any state reachable from s; otherwise it is transient. A and B are transient states; C and D are recurrent states. Once the process moves from B to D, it will never come back.
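The definition of recurrent and transient states translates directly into a reachability test on the digraph. In the sketch below the edge set is hypothetical (the figure's edges are not recoverable beyond B → D), and the function names are illustrative:

```python
def reachable(graph, s):
    """States reachable from s in one or more steps."""
    seen, stack = set(), [s]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def classify(graph):
    """Label each state using the definition: s is recurrent if s can be
    reached from every state that is reachable from s."""
    reach = {s: reachable(graph, s) for s in graph}
    return {
        s: "recurrent" if all(s in reach[t] for t in reach[s]) else "transient"
        for s in graph
    }


# Hypothetical digraph: A <-> B, B -> D, C <-> D.
graph = {"A": ["B"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
print(classify(graph))
```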
52 Bad case 2: Transient States. [Figure: digraph on states A, B, C, D.] Fact 2: For each initial distribution, with probability 1 a transient state will be visited only a finite number of times.
53 Bad case 3: Periodic States. A state s has period k if k is the GCD of the lengths of all the cycles that pass through s. [Figure: digraph on states A, B, C, D, E.] A Markov chain is periodic if all the states in it have a period k > 1. It is aperiodic otherwise. Example: Consider the initial distribution p(B) = 1. Then states {B, C} are visited (with positive probability) only in odd steps, and states {A, D, E} are visited only in even steps.
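For a strongly connected digraph, the common period can be computed without enumerating cycles: take breadth-first distances from any start state and form the GCD of dist[u] + 1 - dist[v] over all edges u → v. The sketch below uses this standard technique; the five-state edge set is hypothetical, chosen so that {B, C} and {A, D, E} alternate as in the example:

```python
from collections import deque
from math import gcd


def period(graph, start):
    """Period of a strongly connected digraph (assumption: every state
    is reachable from `start` and can reach it back)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    g = 0
    for u in dist:
        for v in graph[u]:
            # Every edge closes walks whose lengths differ by this amount.
            g = gcd(g, dist[u] + 1 - dist[v])
    return g


# Hypothetical digraph: every edge goes between {B, C} and {A, D, E}.
graph = {"A": ["B"], "B": ["D", "E"], "C": ["A"], "D": ["C", "B"], "E": ["C"]}
print(period(graph, "A"))
```

Adding a self-loop at any state makes the chain aperiodic, because a cycle of length 1 forces the GCD down to 1.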
54 Bad case 3: Periodic States. [Figure: digraph on states A, B, C, D, E.] Fact 3: In a periodic Markov chain (of period k > 1) there are initial distributions under which the states are visited in a periodic manner. Under such initial distributions X_i does not converge as i → ∞.
55 Ergodic Markov Chains. A Markov chain is ergodic if: 1. All states are recurrent (i.e., the graph is strongly connected). 2. It is not periodic. [Figure: digraph on states A, B, C, D.] The Fundamental Theorem of Finite Markov Chains: If a Markov chain is ergodic, then 1. It has a unique stationary distribution vector V > 0, which is an eigenvector of the transition matrix (V = VM). 2. The distributions X_i, as i → ∞, converge to V.
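The theorem can be observed numerically: for an ergodic chain, repeatedly applying the transition matrix drives any initial distribution to the same vector V. The sketch below uses the Coke vs. Pepsi matrix from the earlier example; the function name, tolerance and iteration cap are arbitrary choices:

```python
def stationary(M, x1, tol=1e-12, max_iter=10_000):
    """Iterate X_{i+1} = X_i M until the distribution stops changing."""
    dist = list(x1)
    m = len(M)
    for _ in range(max_iter):
        nxt = [sum(dist[s] * M[s][t] for s in range(m)) for t in range(m)]
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("no convergence; the chain may not be ergodic")


P = [[0.9, 0.1],
     [0.2, 0.8]]
print(stationary(P, [1.0, 0.0]))
print(stationary(P, [0.0, 1.0]))
```

Both starting points end up near (2/3, 1/3), which solves V = VP for this matrix. On a periodic chain the same loop would hit the iteration cap instead of converging.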
56 Use of Markov Chains in Genome Search: Modeling CpG Islands. In human genomes the pair CG often transforms to (methyl-C)G, which often transforms to TG. Hence the pair CG appears less often than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of genomes, such as in the start regions of many genes. These areas are called CpG islands (p denotes “pair”).
57 Example: CpG Island (Cont.) We consider two questions (and some variants): Question 1: Given a short stretch of genomic data, does it come from a CpG island? Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length? We “solve” the first question by modeling strings with and without CpG islands as Markov chains over the same states {A, C, G, T} but with different transition probabilities:
58 Example: CpG Island (Cont.) The “+” model: use transition matrix A^+ = (a^+_{st}), where a^+_{st} = (the probability that t follows s in a CpG island). The “-” model: use transition matrix A^- = (a^-_{st}), where a^-_{st} = (the probability that t follows s in a non-CpG island).
59 Example: CpG Island (Cont.) With this model, to solve Question 1 we need to decide whether a given short sequence of letters is more likely to come from the “+” model or from the “-” model. This is done by using the definition of a Markov chain. [To solve Question 2 we need to decide which parts of a given long sequence of letters are more likely to come from the “+” model, and which parts are more likely to come from the “-” model. This is done by using the Hidden Markov Model, to be defined later.] We start with Question 1:
60 Question 1: Using two Markov chains. A^+ (for CpG islands): we need to specify p+(x_i | x_{i-1}), where + stands for CpG island. From Durbin et al. we have (rows: X_{i-1}; columns: X_i):
  X_{i-1} \ X_i   A          C          G          T
  A               p+(A|A)    p+(C|A)    p+(G|A)    p+(T|A)
  C               0.17       p+(C|C)    0.274      p+(T|C)
  G               0.16       p+(C|G)    p+(G|G)    p+(T|G)
  T               0.08       p+(C|T)    p+(G|T)    p+(T|T)
(Recall: rows must add up to one; columns need not.)
61 Question 1: Using two Markov chains. A^- (for non-CpG islands): …and for p-(x_i | x_{i-1}), where “-” stands for non-CpG island, we have (rows: X_{i-1}; columns: X_i):
  X_{i-1} \ X_i   A          C          G          T
  A               p-(A|A)    p-(C|A)    p-(G|A)    p-(T|A)
  C               0.32       p-(C|C)    0.078      p-(T|C)
  G               0.25       p-(C|G)    p-(G|G)    p-(T|G)
  T               0.18       p-(C|T)    p-(G|T)    p-(T|T)
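62 Discriminating between the two models. [Figure: chain X_1 → X_2 → … → X_{L-1} → X_L.] Given a string x = (x_1, …, x_L), compute the ratio $\mathrm{RATIO} = \frac{p(x \mid +\ \text{model})}{p(x \mid -\ \text{model})} = \frac{\prod_{i=1}^{L} p^{+}(x_i \mid x_{i-1})}{\prod_{i=1}^{L} p^{-}(x_i \mid x_{i-1})}$. If RATIO > 1, a CpG island is more likely. In practice, the log of this ratio is computed. Note: p+(x_1 | x_0) is defined for convenience as p+(x_1), and p-(x_1 | x_0) is defined for convenience as p-(x_1).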
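63 Log Odds-Ratio test. Taking the logarithm yields $\log Q = \log \frac{p(x_1 \dots x_L \mid +)}{p(x_1 \dots x_L \mid -)} = \sum_{i=1}^{L} \log \frac{p^{+}(x_i \mid x_{i-1})}{p^{-}(x_i \mid x_{i-1})}$. If log Q > 0, then + is more likely (CpG island). If log Q < 0, then - is more likely (non-CpG island).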
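A sketch of the log odds-ratio test in Python. The two transition tables are written out as recalled from Durbin et al., Biological Sequence Analysis; the slides show only some of these entries, so treat the full tables as an assumption to be checked against the book. The sketch also assumes the same initial distribution in both models, so the i = 1 term cancels and is omitted:

```python
import math

# Assumed values (see note above); rows are x_{i-1}, columns are x_i.
PLUS = {
    "A": {"A": 0.180, "C": 0.274, "G": 0.426, "T": 0.120},
    "C": {"A": 0.171, "C": 0.368, "G": 0.274, "T": 0.188},
    "G": {"A": 0.161, "C": 0.339, "G": 0.375, "T": 0.125},
    "T": {"A": 0.079, "C": 0.355, "G": 0.384, "T": 0.182},
}
MINUS = {
    "A": {"A": 0.300, "C": 0.205, "G": 0.285, "T": 0.210},
    "C": {"A": 0.322, "C": 0.298, "G": 0.078, "T": 0.302},
    "G": {"A": 0.248, "C": 0.246, "G": 0.298, "T": 0.208},
    "T": {"A": 0.177, "C": 0.239, "G": 0.292, "T": 0.292},
}


def log_odds(seq, plus=PLUS, minus=MINUS):
    """Sum of log(p+(x_i|x_{i-1}) / p-(x_i|x_{i-1})) over consecutive pairs."""
    return sum(
        math.log(plus[prev][cur] / minus[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )


for s in ("CGCG", "TATA"):
    score = log_odds(s)
    print(s, "CpG island" if score > 0 else "non-CpG island")
```

Summing logs rather than multiplying probabilities avoids numerical underflow on long sequences.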
64 Where do the parameters (transition probabilities) come from? Learning from complete data, namely, when the label is given and every x_i is measured. Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands. Input: tuples of the form (x_1, …, x_L, h), where h is + or -. Output: maximum likelihood estimates (MLE) of the parameters. Count all pairs (X_i = a, X_{i-1} = b) with label + and with label -; say the numbers are N_{ba,+} and N_{ba,-}.
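65 Maximum Likelihood Estimate (MLE) of the parameters (using labeled data). [Figure: chain X_1 → X_2 → … → X_{L-1} → X_L.] The needed parameters are: p+(x_1), p+(x_i | x_{i-1}), p-(x_1), p-(x_i | x_{i-1}). The ML estimates are given by: $p^{+}(X_1 = a) = \frac{N_{a,+}}{\sum_{a'} N_{a',+}}$, where N_{a,+} is the number of times letter a appears in CpG islands in the dataset, and $p^{+}(X_i = a \mid X_{i-1} = b) = \frac{N_{ba,+}}{\sum_{a'} N_{ba',+}}$, where N_{ba,+} is the number of times letter a appears after letter b in CpG islands in the dataset. The estimates for the “-” model are obtained in the same way from N_{a,-} and N_{ba,-}.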
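The counting procedure can be sketched as follows: count every adjacent pair (previous letter b, current letter a) and normalise each row b by its total. The function name and the toy training sequences are hypothetical; in practice the function would be run once on the "+" sequences and once on the "-" sequences:

```python
from collections import Counter


def estimate_transitions(sequences, alphabet="ACGT"):
    """Maximum likelihood estimates p(a | b) = N_ba / sum over a' of N_ba'."""
    counts = {b: Counter() for b in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for b in alphabet:
        total = sum(counts[b].values())
        if total == 0:
            # Letter b never had a successor in the data; the MLE for
            # this row is undefined, so it is left out.
            continue
        probs[b] = {a: counts[b][a] / total for a in alphabet}
    return probs


# Toy data standing in for sequences labeled "+".
plus_sequences = ["ACGC", "CGA"]
print(estimate_transitions(plus_sequences))
```

With little data many counts are zero, which makes the corresponding log ratios undefined; adding pseudocounts is a common remedy, though the slides do not cover it.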