CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 8: TIME SERIES AND GRAPH MINING.


1 CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 8: TIME SERIES AND GRAPH MINING

2 Definition of Time Series Motifs
A motif is characterized by:
1. Length of the motif
2. Support of the motif
3. Similarity of the pattern
4. Relative position of the pattern
Given a length, the motif is the most similar (least distant) pair of non-overlapping subsequences.
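The definition above can be sketched as a brute-force search. This is a minimal illustration, not the lecture's algorithm: the function names are hypothetical, and plain Euclidean distance on raw values is an assumption (the slide does not say whether subsequences are normalized first).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    length-m subsequences starting at positions i < j."""
    n = len(series) - m + 1          # number of subsequences
    best = (math.inf, None, None)
    for i in range(n):
        for j in range(i + m, n):    # j >= i + m guarantees no overlap
            d = euclidean(series[i:i + m], series[j:j + m])
            if d < best[0]:
                best = (d, i, j)
    return best

# Assumed toy data: the pattern 0,1,2,1,0 occurs at positions 0 and 7.
series = [0, 1, 2, 1, 0, 5, 7, 0, 1, 2, 1, 0, 3]
print(brute_force_motif(series, 5))  # -> (0.0, 0, 7)
```

Every pair of subsequences is compared, so the cost grows quadratically with the length of the series.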

3 Problem Formulation
Find the most similar pair of non-overlapping subsequences.
[Figure: a time series over time 1 to 1000, with its subsequences indexed 1 through 873]
This is the problem of finding the closest pair of points in high-dimensional space.
• The optimal algorithm in two dimensions is Θ(n log n).
• For large dimensionality d, the optimal algorithm is effectively Θ(n²d).

4 Lower Bound
• If P, Q and R are three points in a d-dimensional space, the triangle inequality gives d(P,Q) + d(Q,R) ≥ d(P,R), and therefore d(P,Q) ≥ |d(Q,R) - d(P,R)|.
• A third point R provides a very inexpensive lower bound on the true distance.
• If the lower bound is not smaller than the existing best, skip computing d(P,Q): d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance.
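A minimal sketch of how this lower bound can prune a closest-pair search. The function name, the choice of the first point as the reference R, and the sample points are all assumptions made for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_pair_with_reference(points, ref_index=0):
    """Closest pair search that skips pairs using the triangle-inequality
    lower bound |d(Q,R) - d(P,R)|. Returns (best, pair, computed), where
    computed counts the full pairwise distance computations performed."""
    ref = points[ref_index]
    ref_dist = [dist(p, ref) for p in points]   # one cheap pass over the data
    best, best_pair, computed = math.inf, None, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            lower = abs(ref_dist[i] - ref_dist[j])
            if lower >= best:
                continue                        # cannot beat the current best
            computed += 1
            d = dist(points[i], points[j])
            if d < best:
                best, best_pair = d, (i, j)
    return best, best_pair, computed

points = [(0, 0), (10, 0), (10.5, 0), (0, 7), (20, 20)]
best, pair, computed = closest_pair_with_reference(points)
print(best, pair, computed)
```

The bound costs one subtraction per pair, while a full distance costs d operations, so each skipped pair saves most of its work when d is large.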

5 Circular Projection
Pick a reference point r. Circularly project all points onto a line passing through r. This is equivalent to computing the distance of every point from r and then sorting the points by that distance.
[Figure: 24 numbered points around the reference point r, projected onto a line through r]

6 The Order Line
On the order line, the gap between two points P and Q is |d(Q, r) - d(P, r)|, which is a lower bound on d(P, Q).
For k = 1 to n-1: compare every pair having k-1 points in between. This takes k scans of the order line, starting at the 1st through k-th points. A pair needs an actual distance computation only if its gap is smaller than BestPairDistance.
[Figure: the 24 points sorted on the order line starting from r at 0, with pairs at offsets k = 1, 2, 3 and BestPairDistance marked]

7 Correctness
• If we search all offsets k = 1, 2, …, n-1, then all n(n-1)/2 possible pairs are considered.
• For any offset k, if none of the k scans needs an actual distance computation, then no distance computation is needed for the remaining offsets k+1, …, n-1, because gaps on the order line can only grow as the offset grows.
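The circular projection, the order-line scan and the early-termination rule can be combined in a short sketch. It is an illustration under assumptions: the function name and sample points are hypothetical, the first point is used as the reference, and a single loop over start positions stands in for the k separate scans (it visits the same offset-k pairs).

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def order_line_closest_pair(points, ref_index=0):
    """Closest pair via a reference point and the sorted order line."""
    n = len(points)
    ref = points[ref_index]
    ref_d = [dist(p, ref) for p in points]
    order = sorted(range(n), key=ref_d.__getitem__)  # circular projection
    proj = [ref_d[i] for i in order]                 # positions on the order line
    best, best_pair = math.inf, None
    for k in range(1, n):                            # offset between the two points
        computed_any = False
        for a in range(n - k):
            b = a + k
            if proj[b] - proj[a] < best:             # lower bound does not prune
                computed_any = True
                d = dist(points[order[a]], points[order[b]])
                if d < best:
                    best = d
                    best_pair = tuple(sorted((order[a], order[b])))
        if not computed_any:
            break                                    # larger offsets are pruned too
    return best, best_pair

pts = [(0, 0), (3, 4), (3.2, 4.1), (-5, 1), (8, 8), (1, -6)]
print(order_line_closest_pair(pts))
```

The early exit is safe for the reason given on the correctness slide: if every offset-k gap is already at least the best distance, every offset-(k+1) gap is at least as large.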

8 Graph Similarity
• Edit distance/graph isomorphism
  ◦ Tree edit distance
• Feature extraction
  ◦ In/out degree
  ◦ Diameter
• Iterative methods
  ◦ SimRank

9 Diameter
The diameter is the largest shortest-path distance in the graph. All-pairs shortest paths can be computed with the Floyd–Warshall algorithm; for a connected graph, the diameter is then the largest entry of dist.

    let dist be a |V| × |V| array of minimum distances initialized to ∞ (infinity)
    for each vertex v
        dist[v][v] ← 0
    for each edge (u,v)
        dist[u][v] ← w(u,v)  // the weight of the edge (u,v)
    for k from 1 to |V|
        for i from 1 to |V|
            for j from 1 to |V|
                if dist[i][j] > dist[i][k] + dist[k][j]
                    dist[i][j] ← dist[i][k] + dist[k][j]
                end if

http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
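A runnable Python version of the pseudocode, extended with the final step of taking the largest shortest-path distance. The function names and the edge-list format (u, v, weight) are assumptions, and ignoring unreachable pairs is a design choice of this sketch rather than something the slide specifies.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest paths for vertices 0..n-1.
    edges is a list of directed (u, v, weight) triples."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)      # keep the lightest parallel edge
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def diameter(n, edges):
    """Largest finite shortest-path distance (unreachable pairs are ignored)."""
    dist = floyd_warshall(n, edges)
    return max(d for row in dist for d in row if d != math.inf)

# Assumed example: the undirected path 0 - 1 - 2 - 3 with unit weights,
# written as directed edges in both directions.
edges = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (2, 3, 1), (3, 2, 1)]
print(diameter(4, edges))  # -> 3
```

The three nested loops make this Θ(|V|³), which is practical only for small graphs.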

10 SimRank
For a node v in a graph, we denote by I(v) and O(v) the set of in-neighbors and out-neighbors of v, respectively.
http://www-cs-students.stanford.edu/~glenj/simrank.pdf
1. A solution s(∗, ∗) ∈ [0, 1] to the n² SimRank equations always exists and is unique.
2. Symmetric: s(a, b) = s(b, a)
3. Reflexive: s(a, a) = 1
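The slide's equation did not survive in this transcript. The sketch below assumes the standard recurrence from the cited SimRank paper: s(a, a) = 1, s(a, b) = 0 if either node has no in-neighbors, and otherwise s(a, b) = C / (|I(a)| |I(b)|) times the sum of s(i, j) over all i in I(a) and j in I(b). The function name, the decay constant C = 0.8, the iteration count and the toy graph are all assumptions.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Iteratively approximate SimRank scores.
    in_neighbors maps each node to the list of its in-neighbors I(v)."""
    nodes = list(in_neighbors)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[i][j]
                                for i in in_neighbors[a]
                                for j in in_neighbors[b])
                    new[a][b] = C * total / (
                        len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim

# Assumed toy graph: node "u" points to both "a" and "b".
sim = simrank({"u": [], "a": ["u"], "b": ["u"]})
print(sim["a"]["b"])  # -> 0.8
```

Here "a" and "b" are similar because they share their only in-neighbor, while "u" has no in-neighbors and so has similarity 0 with every other node.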

11 Tree Edit Distance http://grfia.dlsi.ua.es/ml/algorithms/references/editsurvey_bille.pdf

12 Tree Edit Distance

13 Applications Find the most frequent tree structure in a phylogenetic tree. Match a query subtree with a set of XML documents.

14 Ranking Nodes: PageRank
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR(A) is the PageRank of page A, PR(Ti) is the PageRank of the pages Ti that link to page A, C(Ti) is the number of outbound links on page Ti, and d is a damping factor that can be set between 0 and 1.

15 Example
Three pages with d = 0.5: A links to B and C, B links to C, and C links to A.
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations can easily be solved. We get the following PageRank values for the individual pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
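Besides solving the linear system directly, the PageRank formula can be applied repeatedly until the values settle. The sketch below does this for the three-page example; the function name, the starting value of 1.0 per page and the iteration count are assumptions.

```python
def pagerank(out_links, d=0.85, iterations=60):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.
    out_links maps each page to the list of pages it links to."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                   # assumed starting values
    for _ in range(iterations):
        new = {}
        for a in pages:
            incoming = sum(pr[t] / len(out_links[t])
                           for t in pages if a in out_links[t])
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr

# Link structure implied by the example equations, with d = 0.5.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(graph, d=0.5)
print(round(pr["A"], 8), round(pr["B"], 8), round(pr["C"], 8))
# -> 1.07692308 0.76923077 1.15384615
```

With d = 0.5 the error at least halves on every pass, so a few dozen iterations are enough to reproduce the exact fractions to eight decimal places.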

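16 Matlab Script
Matlab script for the example in the previous slide, with x = PR(A), y = PR(B), z = PR(C):

    syms x y z;
    % The three PageRank equations from the example
    eqn1 = x == 0.5 + 0.5*z
    eqn2 = y == 0.5 + 0.25*x
    eqn3 = z == 0.5 + 0.25*x + 0.5*y
    % Rewrite the system as A*[x; y; z] = B and solve it
    [A,B] = equationsToMatrix([eqn1, eqn2, eqn3], [x, y, z])
    X = linsolve(A,B)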

17 HITS: Hyperlink-Induced Topic Search http://www.cs.cornell.edu/home/kleinber/auth.pdf

