1
Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data
Reza Sherkat and Davood Rafiei
Department of Computing Science, University of Alberta, Canada
Reza Sherkat, ICDE06
Travel assistance provided by the Mary Louise Imrie Graduate Student Award
2
Overview
Introduction
–Histories and time-series
–Similarity model for histories
Problem Definition
Proposed Approach
Result Highlights
Conclusions
3
Querying Histories: Introduction
Querying multiple snapshots of data
–Temporal selection, projection, and join queries
Finding similar time-series
–Finding companies having similar stocks
Is it possible to define a notion of similarity for objects based on the similarity of their histories?
4
Histories
History: a sequence of time-stamped observations
–Time-series: observations are real values
–Observations can be more general, e.g. a bag of words
–Examples: the history of a web page, the history of a patient

when     observation
day 1    {a, b}
day 2    {a, b, c}
day 3    { }
day 4    {h, i}
6
Similarity Model for Histories
Similarity of two histories depends on:
–Pair-wise similarity of their observations
–The order in which similar observations are recorded
–Constraints on time-stamps of observations

History for 3 patients
day    h1           h2           h3
1      {a, b}       {a, b, e}    {f, g, h}
2      {a, b, c}    {b, c, z}    {a, b, c}
3      {f, g}       { }          {f, g, h}
4      {h, i}       {f, g, h}    {b, c}
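The pair-wise similarity of observations can be illustrated on the three-patient table. The sketch below assumes the plain Jaccard coefficient on sets as the observation similarity; the function names `jaccard` and `day_by_day_similarity`, and the choice to score two empty observations as 0, are assumptions made for illustration and are not taken from the slides.

```python
def jaccard(x, y):
    """Jaccard coefficient of two sets; two empty sets score 0 (an assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def day_by_day_similarity(a, b):
    """Similarity of observations recorded on the same day in two histories."""
    return [jaccard(x, y) for x, y in zip(a, b)]


# Histories h1 and h2 from the table (one set of items per day).
h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]

# day 1: 2/3, day 2: 1/2, day 3: 0, day 4: 1/4
print(day_by_day_similarity(h1, h2))
```

Comparing only same-day observations misses that {f, g} on day 3 of h1 resembles {f, g, h} on day 4 of h2, which is why the order of observations, rather than exact time-stamps, matters in the model.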
7
Problem Definition
Given a history as a query:
–Evaluate k-NN and range queries efficiently.
–For each history in the result, find its common signature with the query, i.e. where the similarity comes from.
8
Similarity Measure for Histories
Alignment of histories:
–An approach to line up subsequences of two histories
–Denoted by a sequence of matches
–Each side of a match is an observation in A (respectively B) or a gap
–Each match has a score
–The alignment score measures the quality of an alignment
11
Alignments of Histories
The alignment score can be the sum of the scores of the matches in the alignment.
[Figure: matrix of match scores (entries 0 and 1) for two histories, with the best alignment highlighted]
The best alignment of two histories:
–What is the best alignment of length 3?
–If a given match could not be considered, what would be the best alignment of length 2?
12
Constraints on the Alignments of Histories
1. The number of matches in the alignment
–l-alignment: an alignment with l matches
2. The r-neighborhood constraint
–Applies to each match in the alignment
r, l: parameters of the similarity query.
13
Principle of Optimality
Histories A and B are split into p(A), s(A) and p(B), s(B).
The principle of optimality holds if the optimal alignment of A and B is the concatenation of:
–the optimal alignment of p(A) and p(B)
–the optimal alignment of s(A) and s(B)
14
Score of Optimal l-alignment
An optimal l-alignment of suffixes can be formed by:
–Concatenating a match with an optimal (l-1)-alignment of suffixes
–Matching with a gap, and considering an l-alignment of suffixes
[Figure: a history shown as the sequence b_1, …, b_j, b_{j+1}, …, b_n]
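The two cases above suggest a dynamic program over suffixes. The following is a minimal sketch under stated assumptions, not the authors' implementation: the score of an alignment is taken to be the sum of its match scores, the observation similarity is the plain Jaccard coefficient, gaps score 0, and the r-neighborhood constraint is left out. The names `jaccard` and `l_alignment_score` are hypothetical.

```python
def jaccard(x, y):
    """Jaccard coefficient of two sets; two empty sets score 0 (an assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def l_alignment_score(a, b, l, sim=jaccard):
    """Best total score of an order-preserving alignment with exactly l matches.

    Returns -inf when no alignment with l matches exists.
    """
    n, m = len(a), len(b)
    neg = float("-inf")
    # prev[i][j]: best score with k-1 matches between suffixes a[i:] and b[j:].
    prev = [[0.0] * (m + 1) for _ in range(n + 1)]  # zero matches always score 0
    for _ in range(l):
        cur = [[neg] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                # Case 2: align a[i] or b[j] with a gap.
                best = max(cur[i + 1][j], cur[i][j + 1])
                # Case 1: match a[i] with b[j], then align the remaining suffixes.
                if prev[i + 1][j + 1] != neg:
                    best = max(best, sim(a[i], b[j]) + prev[i + 1][j + 1])
                cur[i][j] = best
        prev = cur
    return prev[0][0]


# Histories h1 and h2 from the three-patient example.
h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]

print(l_alignment_score(h1, h2, 2))  # 4/3: day 1 with day 1, day 3 with day 4
print(l_alignment_score(h1, h2, 3))  # 11/6: adds day 2 with day 2
```

The table for l matches depends only on the table for l-1 matches, so two layers are kept at a time; the running time is proportional to l times the product of the two history lengths.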
15
Similarity Measure for Histories
The similarity measure is the score of the optimal l-alignment of two histories.
It can be used to find the common signature of histories:
–A sequence of observations that appear in the same order in two histories.
–Generalizes the notion of longest common subsequence.
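One way to recover a common signature is to keep track of which matches an optimal l-alignment uses. The sketch below is an illustration under assumptions rather than the method from the talk: it uses the Jaccard coefficient as the observation similarity, ignores the r-neighborhood constraint, and reports the signature as the items shared by each matched pair. The names `jaccard` and `common_signature` are hypothetical.

```python
from functools import lru_cache


def jaccard(x, y):
    """Jaccard coefficient of two sets; two empty sets score 0 (an assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def common_signature(a, b, l, sim=jaccard):
    """Return (score, matches) for an optimal alignment with exactly l matches.

    matches is a tuple of (i, j) index pairs in increasing order; it is empty
    and the score is -inf when no alignment with l matches exists.
    """

    @lru_cache(maxsize=None)
    def best(i, j, k):
        if k == 0:
            return 0.0, ()
        if i == len(a) or j == len(b):
            return float("-inf"), ()
        tail_score, tail = best(i + 1, j + 1, k - 1)
        candidates = [
            best(i + 1, j, k),  # a[i] aligned with a gap
            best(i, j + 1, k),  # b[j] aligned with a gap
            (sim(a[i], b[j]) + tail_score, ((i, j),) + tail),  # a[i] matched with b[j]
        ]
        return max(candidates, key=lambda c: c[0])

    return best(0, 0, l)


h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]

score, matches = common_signature(h1, h2, 3)
# Items shared by each matched pair of observations, in order.
signature = [h1[i] & h2[j] for i, j in matches]
print(matches)    # ((0, 0), (1, 1), (2, 3))
print(signature)  # the sets {a, b}, {b, c}, {f, g}
```

The recursion is memoized on indices only, so it is suitable for short histories like the example; long histories would call for an iterative table with back-pointers.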
16
Similarity Queries over Collection of Histories
Straightforward (not practical) approach: naïve scan
Indexing techniques have been proposed for metric spaces, but the similarity measure is not metric:
–when the distance between observations is not metric.
–when an r-neighborhood constraint is specified.
We propose upper bounds to prune the history search space.
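How an upper bound prunes a scan can be sketched with a generic filter-and-refine loop for k-NN queries. This is an assumed, textbook-style formulation rather than the algorithm from the talk: `exact_score` and `upper_bound` are hypothetical callables supplied by the caller, and the only requirement is that `upper_bound` never returns less than `exact_score` for the same pair.

```python
import heapq


def knn(query, histories, k, exact_score, upper_bound):
    """Return the k highest-scoring (score, id) pairs and the number of exact evaluations."""
    heap = []  # min-heap holding the k best (score, id) pairs seen so far
    examined = 0
    for hid, history in histories.items():
        # Skip the exact computation when the bound cannot beat the k-th best.
        if len(heap) == k and upper_bound(query, history) <= heap[0][0]:
            continue
        examined += 1
        score = exact_score(query, history)
        if len(heap) < k:
            heapq.heappush(heap, (score, hid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, hid))
    return sorted(heap, reverse=True), examined


# Toy stand-ins: the exact score counts shared items, and the bound is the
# length of the shorter sequence, which can never be smaller than that count.
def toy_exact(q, h):
    return len(set(q) & set(h))


def toy_bound(q, h):
    return min(len(q), len(h))


collection = {"a": [1, 2, 3], "b": [1], "c": [2, 3, 9], "d": [7]}
print(knn([1, 2, 3], collection, 1, toy_exact, toy_bound))  # ([(3, 'a')], 1)
```

In this loop every history is still visited to evaluate the bound; only the exact score computation is skipped.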
17
A General Upper Bound for the Similarity Measure
Intuition: the score of an optimal relaxed l-alignment is not less than the score of the optimal l-alignment.
1. For each observation, find an optimal match.
2. Aggregate the scores of the top l optimal matches to obtain an upper bound for the similarity measure.
This upper bound can prune some extra computations, but all histories are still accessed to evaluate a query.
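The two steps above can be sketched as follows. This is a reading of the slide under assumptions: "relaxed" is taken to mean that the order of matches is ignored and an observation of the second history may be reused, the Jaccard coefficient stands in for the observation similarity, and the names `jaccard` and `relaxed_upper_bound` are hypothetical.

```python
def jaccard(x, y):
    """Jaccard coefficient of two sets; two empty sets score 0 (an assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def relaxed_upper_bound(a, b, l, sim=jaccard):
    """Sum of the l largest best-match scores of the observations in a.

    Assumes b is non-empty. Each observation of a picks its best partner in b
    independently, so the result cannot be smaller than the score of any
    order-preserving alignment with l matches.
    """
    best_per_observation = sorted(
        (max(sim(x, y) for y in b) for x in a), reverse=True
    )
    return sum(best_per_observation[:l])


h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]

print(relaxed_upper_bound(h1, h2, 4))  # 25/12, about 2.083
```

For these two histories the only order-preserving alignment with 4 matches pairs the observations day by day and scores 17/12 under the Jaccard coefficient, so the bound of 25/12 is valid but not tight for l = 4.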
18
An Index-based Upper Bound for the Similarity Measure
Intuitions:
–Observations are sparse in real-life applications.
–The score of an optimal relaxed match is not less than the score of an optimal match.
–The score of an optimal relaxed alignment is not less than the score of the optimal relaxed l-alignment.
This upper bound can be evaluated efficiently by exploiting an inverted index if the observation similarity is Cosine or the Extended Jaccard Coefficient.
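The role of an inverted index can be illustrated with a simple candidate filter. This sketch does not reproduce the upper bound from the talk; it only shows the underlying observation that, for similarities such as Cosine or Jaccard, two observations with no item in common score 0, so a history sharing no item with the query cannot contribute a positive score. The names `build_inverted_index` and `candidate_histories`, and the extra history h4, are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(histories):
    """Map each item to the ids of the histories in which it occurs."""
    index = defaultdict(set)
    for hid, history in histories.items():
        for observation in history:
            for item in observation:
                index[item].add(hid)
    return index


def candidate_histories(index, query):
    """Ids of histories that share at least one item with the query history."""
    found = set()
    for observation in query:
        for item in observation:
            found |= index.get(item, set())
    return found


histories = {
    "h1": [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}],
    "h2": [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}],
    "h3": [{"f", "g", "h"}, {"a", "b", "c"}, {"f", "g", "h"}, {"b", "c"}],
    "h4": [{"x"}, {"y"}],  # made-up history with no item in common with the others
}
index = build_inverted_index(histories)

print(candidate_histories(index, [{"z"}]))  # {'h2'}
```

With sparse observations each item's posting list is short, so only a small part of the collection needs to be touched for a given query.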
19
Experiments
Experiments performed on an AMD/XP 2600 with 512 MB RAM
Datasets:
–DBLP
–Synth1: our synthetic data
–Synth2: modified IBM synthetic data generator
Investigated:
–Effectiveness of the similarity measure
–Efficiency of our approach: pruning power, running time, scalability
21
Effectiveness of our Similarity Measure
Synth2 dataset contains:
–20,000 histories
–A per-history parameter selected randomly from {1,…,10}
–Length of histories: {32,…,64}
Observation: a document modeled as a bit string
–First observation: randomly selected
–Sequence of observations: V(1), …, V(i), V(i+1), …, V(n)
–Poisson distribution
–V(i+1): the bit string following V(i) in a pre-determined order [Cho et al. VLDB 2000]
22
Effectiveness of our Similarity Measure (cont.)
[Figure: mean deviation for k-NN queries, over 2,000 randomly generated queries]
23
Pruning Power vs. k
[Figure: fraction of database examined against the number of neighbours in the k-NN query (log scale)]
24
Running Time vs. k
Dataset: Synth2, 8,000 histories, 1,000 items
[Figure: time (msec) against the number of neighbours in the k-NN query (log scale)]
25
Scalability for 1-NN queries
[Figure: time (msec) against the number of histories in the collection, from 8,000 to 64,000]
26
Running Time vs. Sparseness of Observations
[Figure: time (msec) against the number of items (log scale)]
27
Conclusions
Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.
Generalized a few concepts, including edit distance and longest common subsequence, to histories.
Developed upper bounds to evaluate queries efficiently.
One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.
Our experiments confirm the effectiveness and efficiency of our approach.
28
Thank you for your attention!
29
Related Works
Detecting, representing, querying histories
–[Chawathe 1998], [Chien 2001]
Similarity-based sequence matching
–[Altschul 1990], [Pearson 1990], [Bieganski 1994]
Finding similar sequences of events
–[Wang 2003]
Finding similar time series
–[Agrawal 1995], [Rafiei 1997], [Keogh 2002], [Vlachos 2002, 2003], ...