Similarity based Retrieval from Sequence Databases using Automata as Queries. Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry. Source: CIKM 2002, ACM. Advisor: Prof. 郭煌政.


1 Similarity based Retrieval from Sequence Databases using Automata as Queries. Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry. Source: CIKM 2002, ACM. Advisor: Prof. 郭煌政. Student: 林奕森.

2 Outline: Introduction; Related work; Definitions and Examples; Algorithms for Infinite Norm Distance; Algorithms for Average Block Distance; Experimental Results; Conclusion and Discussion

3 Introduction (1/4) Sequence databases occur in many areas of research in database management systems. Temporal databases, time-series databases and video databases are some examples of sequence databases. In this paper we consider similarity based retrieval from sequence databases.

4 Introduction (2/4) Similarity based retrieval consists of retrieving those subsequences that closely satisfy the query based on a similarity measure. In this paper, we consider a language based on finite state automata for specifying queries on sequences, and develop similarity based methods for retrieval.

5 Introduction (3/4) We consider the following problems for a given database sequence d and a specification automaton A: (i) retrieval of the k closest subsequences of d with respect to the automaton A (called a "nearest neighbor query"); (ii) retrieval of all subsequences of d within a given distance from A (called a "range query").

6 Introduction (4/4) We have implemented the proposed methods on top of Sequel Server. We also consider a restricted class of automata, called cycle-restricted automata. We present more efficient algorithms for these automata.

7 Related work (1/3) There has been much work done on querying time-series and other sequence databases. For example, methods for similarity based retrieval from such databases have been proposed in [11, 2, 3, 5, 15, 14].

8 Related work (2/3) The paper [1] presents a language called SDL. Retrieval there is based on exact match; it is not similarity based retrieval using a global distance measure, as ours is. There has also been much work done on data mining over time-series data [4, 12, 6] and other databases. Among these works, [6] uses automata.

9 Related work (3/3) All these works mostly consider discovery of patterns that have a given minimum level of support. They do not consider similarity based retrieval. A temporal query language and efficient algorithms for similarity based retrieval have been presented in [18].

10 Definitions and Examples Basic Automata and Similarity Values Automata (1) An automaton A is a 5-tuple (Q, Σ, δ, I, F) where Q is a finite set of states, Σ is a finite set of symbols called the input alphabet, δ is the set of transitions, and I, F ⊆ Q are the sets of initial and final states, respectively. (2) Each input symbol represents an atomic predicate (also called an atomic query in some places) on a single database state.

11 Definitions and Examples Automata example (1) Each transition of A, i.e. each member of δ, is a triple of the form (q, a, q') where q, q' ∈ Q and a ∈ Σ; this triple denotes that the automaton makes a transition from state q to q' on input a; we also write such a transition as q -a-> q'. (2) For example, in a stock market database, price(ibm) = 100 represents an atomic predicate.

12 Definitions and Examples Automata example Consider the automaton B defined as follows. It has three states 1, 2, 3. Its input symbols are the atomic queries time = 10AM, time = 4PM and price(IBM) < 100. States 1 and 3 are the start and final states respectively. The automaton has the following transitions: from state 1 to 2 on the input symbol time = 10AM, from state 2 back to 2 on the symbol price(IBM) < 100, and from state 2 to 3 on the symbol time = 4PM.
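The automaton B described above can be written down directly as a 5-tuple. The following is a minimal sketch, not code from the paper: the `Automaton` class, its `accepts` method and the constant names are hypothetical, and the atomic queries are treated as plain labels, so this checks exact acceptance only, not similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    """Hypothetical container for the 5-tuple (Q, Sigma, delta, I, F)."""
    states: frozenset
    alphabet: frozenset
    transitions: frozenset  # set of (q, a, q') triples
    initial: frozenset
    final: frozenset

    def accepts(self, symbols):
        """Exact-match acceptance: track the set of reachable states."""
        current = set(self.initial)
        for a in symbols:
            current = {q2 for (q1, sym, q2) in self.transitions
                       if q1 in current and sym == a}
        return bool(current & self.final)


# Atomic queries of automaton B, used here only as labels.
OPEN, BELOW, CLOSE = "time = 10AM", "price(IBM) < 100", "time = 4PM"

B = Automaton(
    states=frozenset({1, 2, 3}),
    alphabet=frozenset({OPEN, BELOW, CLOSE}),
    transitions=frozenset({(1, OPEN, 2), (2, BELOW, 2), (2, CLOSE, 3)}),
    initial=frozenset({1}),
    final=frozenset({3}),
)
```

With this encoding, `B.accepts([OPEN, BELOW, BELOW, CLOSE])` holds, while a sequence that never reaches the 4PM symbol is rejected.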

13 Definitions and Examples Similarity Measure A database sequence d is a finite sequence of database states. A database state represents, for example, an image (in the case of video databases) or a document (in the case of textual databases). For a database state c and an atomic query c', we let sim(c', c) denote the similarity value with which c satisfies the query c'.

14 Definitions and Examples Similarity Measure We let dist(c, c') = 1 - sim(c, c') represent the distance between c and c'. To define the similarity of a database sequence d = (d0, ..., dn-1) with respect to an automaton A, we first define a distance measure dist(d, a) between d and an input sequence a = (a0, ..., an-1) of equal length.

15 Definitions and Examples Similarity Measure Let sim_vec(d, a) be the sequence (s0, ..., sn-1) where for each i = 0, ..., n-1, si = sim(ai, di). We assume that all similarity values and distances are normalized, i.e. they lie in the interval [0, 1]. Let F be a vector distance function which, given two vectors x, y as arguments, returns a real number lying in the interval [0, 1].

16 Definitions and Examples Similarity Measure We define dist(d, a) = F(sim_vec(d, a), 1), where 1 denotes the vector of all ones. Next, we define a distance measure dist(d, C) between the database sequence d and a set of input sequences C ⊆ Σ*. dist(d, C) is the minimum of dist(d, α), where the minimum is taken over all α ∈ C such that |α| = |d|; if there is no sequence α ∈ C such that |α| = |d|, then we take dist(d, C) to be equal to 1.
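These definitions translate almost line by line into code. The sketch below is illustrative only: the function names are hypothetical, and the toy `sim` and `F` used in the example (database states and atomic queries as numbers in [0, 1], `F` as the mean absolute difference) are assumptions chosen to keep the example small.

```python
def sim_vec(d, a, sim):
    """Similarity vector (s0, ..., sn-1) with si = sim(ai, di)."""
    return [sim(a_i, d_i) for d_i, a_i in zip(d, a)]


def dist_to_string(d, a, sim, F):
    """dist(d, a) = F(sim_vec(d, a), 1), with 1 the all-ones vector."""
    s = sim_vec(d, a, sim)
    return F(s, [1.0] * len(s))


def dist_to_set(d, C, sim, F):
    """dist(d, C): minimum over strings in C of the same length as d,
    or 1 if C contains no string of that length."""
    candidates = [dist_to_string(d, a, sim, F) for a in C if len(a) == len(d)]
    return min(candidates, default=1.0)


# Toy assumptions: states and queries are numbers in [0, 1].
def toy_sim(query, state):
    return 1.0 - abs(query - state)


def toy_F(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


d = [0.2, 0.8]
C = [(0.2, 1.0), (0.0, 0.0), (0.5,)]
best = dist_to_set(d, C, toy_sim, toy_F)  # about 0.1, reached by (0.2, 1.0)
```

The string `(0.5,)` is ignored because its length differs from that of `d`.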

17 Definitions and Examples Similarity Measure We define the distance of d with respect to A, denoted by dist(d, A), to be dist(d, L(A)), where L(A) is the language accepted by A. We define the similarity of d with respect to the automaton A, denoted by sim(d, A), to be 1 - dist(d, L(A)).

18 Definitions and Examples Similarity Measure

19 Definitions and Examples Similarity Measure We call F1 the average block distance function, F2 the mean square distance function, and F∞ the infinite norm distance function.
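The formulas defining these functions are not present in this transcript. One family that matches the names used here, and is offered only as an assumption, is F_p(x, y) = ((1/n) Σ |x_i - y_i|^p)^(1/p), with F∞ taken as the largest componentwise difference. The function names below are hypothetical.

```python
def f_p(x, y, p):
    """Assumed form of F_p: p-th root of the mean of |x_i - y_i|^p."""
    n = len(x)
    return (sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) / n) ** (1.0 / p)


def f_inf(x, y):
    """Assumed form of the infinite norm distance: largest difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))


similarities = [1.0, 0.5, 0.0]
ones = [1.0, 1.0, 1.0]
avg_block = f_p(similarities, ones, 1)    # mean of 0, 0.5 and 1
mean_square = f_p(similarities, ones, 2)
worst = f_inf(similarities, ones)
```

Under this assumed form the values are ordered `avg_block <= mean_square <= worst`, so the infinite norm is governed by the single worst-matching database state, while the average block distance spreads the penalty over the whole sequence.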

20 Definitions and Examples Wild Card Symbol We assume that there is a special input symbol φ which denotes a wild card symbol, i.e. it denotes an atomic query which is always satisfied. Cycle-Restricted Automata Let A = (Q, Σ, δ, I, F) be an automaton. A path of the automaton is a sequence of transitions of the form q0 -a0-> q1, q1 -a1-> q2, ..., qn-1 -an-1-> qn. We call such a sequence a path from q0 to qn.

21 Definitions and Examples Cycle-Restricted Automata We call a path q0, ..., qn a φ-path if all input symbols appearing in it are wild cards, i.e., for each i = 0, ..., n-1, ai = φ. The path is called a cycle if qn = q0 and q0, q1, ..., qn-1 are all distinct. A φ-path which is also a cycle is called a φ-cycle. We say that an automaton is cycle-restricted if it has no φ-cycles of length greater than 1.
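The cycle-restricted condition can be checked mechanically. The sketch below is one possible approach, not the paper's: it keeps only the wild-card transitions between distinct states (a wild-card self-loop is a φ-cycle of length 1 and is allowed) and looks for a directed cycle with a depth-first search. The function name and the encoding of φ as the string `"φ"` are assumptions.

```python
WILDCARD = "φ"  # assumed encoding of the wild card symbol


def is_cycle_restricted(transitions):
    """True if no wild-card cycle of length greater than 1 exists.

    transitions: iterable of (q, a, q') triples.
    """
    # Wild-card edges between distinct states; self-loops are dropped
    # because phi-cycles of length 1 are permitted.
    graph = {}
    for q, a, q2 in transitions:
        if a == WILDCARD and q != q2:
            graph.setdefault(q, set()).add(q2)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def has_cycle(q):
        colour[q] = GREY
        for nxt in graph.get(q, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: a cycle through nxt
                return True
            if state == WHITE and has_cycle(nxt):
                return True
        colour[q] = BLACK
        return False

    return not any(colour.get(q, WHITE) == WHITE and has_cycle(q)
                   for q in list(graph))
```

For example, an automaton whose only wild-card transition is a self-loop passes the check, while one with wild-card transitions from state 1 to 2 and from 2 back to 1 does not.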

22 Definitions and Examples Nearest Neighbor and Range Queries In this paper, we consider the evaluation of two types of queries, assuming that we are given a query automaton A and a database sequence d.

23 Definitions and Examples Nearest Neighbor and Range Queries The first type of queries are called nearest neighbor queries. Here we have to retrieve k subsequences of d having the lowest distances with respect to A where k is an additional input which is a positive integer.

24 Definitions and Examples Nearest Neighbor and Range Queries The second type of queries are called range queries. Here we have to retrieve all subsequences of d whose distance with respect to A is less than or equal to ε, where ε is an additional input which is a positive fraction.

25 ALGORITHMS FOR INFINITE NORM DISTANCE Definitions and lemma Lemma 4.1 Let q be any state in Q and i be an integer such that 1 ≤ i ≤ n. Further, let q1, ..., qm be the successor states of q on input symbols a1, ..., am respectively.
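The statement of the lemma is cut off in this transcript, so the recurrence below is a guess at the kind of relation it sets up, not a reproduction of it. Assume best(q, i) is the smallest infinite norm distance between a subsequence starting at position i and a string accepted from state q. Then best(q, i) is the minimum, over transitions (q, a, q'), of max(dist(d_i, a), t), where t is 0 if q' is final and best(q', i+1) otherwise. The function name and argument layout are hypothetical, and positions are 0-indexed.

```python
import math


def inf_norm_min_distance(d, transitions, initial, final, dist):
    """Smallest max-distance over all subsequences of d and all accepted
    strings of matching length (assumed recurrence, see lead-in)."""
    states = ({q for q, _, _ in transitions}
              | {q2 for _, _, q2 in transitions})
    nxt = {q: math.inf for q in states}  # best(q, i + 1); nothing past the end
    overall = math.inf
    for i in range(len(d) - 1, -1, -1):
        cur = {q: math.inf for q in states}
        for q, a, q2 in transitions:
            # Stop at q2 if it is final, otherwise continue from i + 1.
            tail = 0.0 if q2 in final else nxt[q2]
            cur[q] = min(cur[q], max(dist(d[i], a), tail))
        for q in initial:
            overall = min(overall, cur.get(q, math.inf))
        nxt = cur
    # By the definition of dist(d, C), no matching string means distance 1.
    return min(overall, 1.0)


# Toy data: each database state lists its distance to each symbol.
ab_star_c = [(1, "a", 2), (2, "b", 2), (2, "c", 3)]
d = [
    {"a": 0.9, "b": 0.5, "c": 0.5},
    {"a": 0.1, "b": 0.9, "c": 0.9},
    {"a": 1.0, "b": 0.2, "c": 0.6},
    {"a": 1.0, "b": 1.0, "c": 0.3},
]
result = inf_norm_min_distance(d, ab_star_c, {1}, {3},
                               lambda state, sym: state[sym])
```

In this toy run the best match starts at the second database state and reads a, b, c with distances 0.1, 0.2 and 0.3, so the result is 0.3. The loop does work proportional to the length of d times the number of transitions.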

26 ALGORITHMS FOR INFINITE NORM DISTANCE

27

28

29 Employing Indices for fast retrieval For each i = 1, ..., m, we can retrieve a list Li of entries of the form (I, val) where I is an interval of the form [u, v] such that 1 ≤ u ≤ v ≤ n and 0 ≤ val < 1. The entry ([u, v], val) on the list Li denotes that the distance, with respect to ai, of all database states whose indices fall within the range [u, v] is val; that is, for all j such that u ≤ j ≤ v, dist(dj, ai) = val.
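A list Li of this shape can be queried with a binary search. The sketch below assumes the entries are sorted by u and the intervals do not overlap, and it treats any database state not covered by an entry as having distance 1; that default is an assumption suggested by the bound val < 1, not something stated here. The function name is hypothetical.

```python
from bisect import bisect_right


def lookup(entries, j, default=1.0):
    """Return dist(dj, ai) from a list of ((u, v), val) entries.

    Assumes entries are sorted by u with disjoint intervals; indices not
    covered by any interval get `default` (an assumed convention).
    """
    starts = [u for (u, _), _ in entries]
    k = bisect_right(starts, j) - 1  # last interval starting at or before j
    if k >= 0:
        (u, v), val = entries[k]
        if u <= j <= v:
            return val
    return default


L_i = [((1, 3), 0.2), ((5, 5), 0.0), ((8, 10), 0.7)]
```

Here `lookup(L_i, 2)` falls inside [1, 3], while `lookup(L_i, 4)` is not covered by any interval and falls back to the default. Storing one entry per run of equal distances keeps the list much shorter than the database sequence when many consecutive states share a value.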

30 Algorithms for Average Block Distance For any subsequence σ = (di, ..., di+l-1) of d and any string a = (a1, ..., al) ∈ Σ* of the same length, let bd(σ, a) be the sum of dist(di+j, aj+1) over j = 0, ..., l-1; it denotes the block distance between σ and a.
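As a small worked instance of bd, the sketch below sums the per-position distances for a subsequence starting at index i. The function names and the toy distance are hypothetical, and the code is 0-indexed, so `a[j]` plays the role of a_(j+1) in the slide's notation.

```python
def block_distance(d, i, a, dist):
    """bd(sigma, a) for sigma = (d[i], ..., d[i + len(a) - 1])."""
    return sum(dist(d[i + j], a[j]) for j in range(len(a)))


def average_block_distance(d, i, a, dist):
    """Block distance divided by the common length of sigma and a."""
    return block_distance(d, i, a, dist) / len(a)


# Toy assumption: states and symbols are numbers, distance is |state - symbol|.
def toy_dist(state, symbol):
    return abs(state - symbol)


d = [0.0, 0.5, 1.0, 0.25]
a = (0.5, 1.0)
exact = block_distance(d, 1, a, toy_dist)   # matches d[1], d[2] exactly
shifted = block_distance(d, 0, a, toy_dist)
```

Starting at index 1 the string matches exactly and the block distance is 0; starting at index 0 each position is off by 0.5, giving a block distance of 1.0 and an average of 0.5.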

31 Algorithms for Average Block Distance Let val(q, i, r) = min{bd(σ, a) : σ is a subsequence of d starting from di, and a is any string in T(q) of the same length as σ whose pseudo length is r}. Here T(q) is the set of strings accepted by A starting from the state q.

32 Algorithms for Average Block Distance AVG-DIST: computes the minimum of the distances of all the subsequences of the database sequence with respect to the automaton A. AVG-DIST-RESTR-AUT: the corresponding algorithm for cycle-restricted automata.

33 Experimental Results We have implemented all the algorithms: INF-NORM, INFNORM-INDX, AVG-DIST, INF-NORM-RESTR-AUT and AVG-DIST-RESTR-AUT. They use SQL to run the algorithms on a stock market database.

34 Experimental Results The database stored the end-of-day Dow Jones Industrial Averages over the last 98 years, giving a database sequence of length 26,716 (the length is the total number of trading days during that period). The query is specified by an automaton that accepts the language given by the regular expression ab*c.

35 Experimental Results

36 Conclusion and Discussion We introduced a powerful formalism based on automata for expressing queries on sequence databases. We have also given efficient algorithms for similarity based retrieval that employ indices. We implemented the algorithms for time-series databases on a PC using Sequel Server.

37 Conclusion and Discussion Experimental results showing the effectiveness of our methods are presented. It will also be interesting to see if and how the techniques of the paper can be extended for data mining over sequences.

