Similarity based Retrieval from Sequence Databases using Automata as Queries. Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry. Source: CIKM 2002, ACM. Advisor: Prof. 郭煌政.

Similarity based Retrieval from Sequence Databases using Automata as Queries. Authors: A. Prasad Sistla, Tao Hu, Vikas Chowdhry. Source: CIKM 2002, ACM. Advisor: Prof. 郭煌政. Student: 林奕森.

Outline: Introduction; Related Work; Definitions and Examples; Algorithms for Infinite Norm Distance; Algorithms for Average Block Distance; Experimental Results; Conclusion and Discussion.

Introduction (1/4) Sequence databases occur in many areas of research in database management systems. Temporal databases, time-series databases, and video databases are all examples of sequence databases. In this paper we consider similarity based retrieval from sequence databases.

Introduction (2/4) Similarity based retrieval consists of retrieving those subsequences that closely satisfy the query according to a similarity measure. In this paper, we consider a language based on finite state automata for specifying queries on sequences, and develop similarity based methods for retrieval.

Introduction (3/4) We consider the following problems for a given database sequence d and a specification automaton A: (i) retrieval of the k closest subsequences of d with respect to the automaton A (called a "nearest neighbor query"); (ii) retrieval of all subsequences of d within a given distance from A (called a "range query").

Introduction (4/4) We have implemented the proposed methods on top of Sequel Server. We also consider a restricted class of automata, called cycle-restricted automata. We present more efficient algorithms for these automata.

Related work (1/3) Much work has been done on querying time-series and other sequence databases. For example, methods for similarity based retrieval from such databases have been proposed in [11, 2, 3, 5, 15, 14].

Related work (2/3) The paper [1] presents a language called SDL. Its retrieval is based on exact match; it is not similarity based retrieval using a global distance measure, as ours is. Much work has also been done on data mining over time-series data [4, 12, 6] and other databases. Among these works, [6] uses automata.

Related work (3/3) These works mostly consider the discovery of patterns that have a given minimum level of support. They do not consider similarity based retrieval. A temporal query language and efficient algorithms for similarity based retrieval have been presented in [18].

Definitions and Examples Basic Automata and Similarity Values Automata (1) An automaton A is a 5-tuple (Q, Σ, δ, I, F), where Q is a finite set of states, Σ is a finite set of symbols called the input alphabet, δ is the set of transitions, and I, F ⊆ Q are the sets of initial and final states, respectively. (2) Each input symbol represents an atomic predicate (also called an atomic query) on a single database state.

Definitions and Examples Automata example (1) Each transition of A, i.e. each member of δ, is a triple of the form (q, a, q′) where q, q′ ∈ Q and a ∈ Σ; this triple denotes that the automaton makes a transition from state q to q′ on input a; we also write such a transition as q →[a] q′. (2) For example, in a stock market database, price(ibm) = 100 represents an atomic predicate.

Definitions and Examples Automata example Consider the automaton B defined as follows. It has three states 1, 2, 3. Its input symbols are the atomic queries time = 10AM, time = 4PM and price(IBM) < 100. States 1 and 3 are the start and final states, respectively. The automaton has the following transitions: from state 1 to 2 on the input symbol time = 10AM, from state 2 back to 2 on the symbol price(IBM) < 100, and from state 2 to 3 on the symbol time = 4PM.
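The automaton B can be written down directly as a set of transition triples. The following is a minimal Python sketch; the `Automaton` class and its `accepts` method are illustrative names chosen here, not something taken from the paper, and `accepts` checks exact acceptance of a symbol string only (no similarity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    """A nondeterministic automaton (Q, Sigma, delta, I, F)."""
    states: frozenset
    alphabet: frozenset
    transitions: frozenset  # triples (q, a, q2)
    initial: frozenset
    final: frozenset

    def accepts(self, symbols):
        """Return True if some path over `symbols` leads from an initial to a final state."""
        current = set(self.initial)
        for a in symbols:
            current = {q2 for (q, sym, q2) in self.transitions
                       if q in current and sym == a}
        return bool(current & self.final)


# The automaton B from the slide: 1 --10AM--> 2, 2 --price<100--> 2, 2 --4PM--> 3.
B = Automaton(
    states=frozenset({1, 2, 3}),
    alphabet=frozenset({"time = 10AM", "price(IBM) < 100", "time = 4PM"}),
    transitions=frozenset({
        (1, "time = 10AM", 2),
        (2, "price(IBM) < 100", 2),
        (2, "time = 4PM", 3),
    }),
    initial=frozenset({1}),
    final=frozenset({3}),
)
```

With this encoding, `B.accepts(["time = 10AM", "price(IBM) < 100", "time = 4PM"])` is true, while a string that stops before time = 4PM is rejected.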

Definitions and Examples Similarity Measure A database sequence d is a finite sequence of database states. A database state represents, for example, an image (in the case of video databases) or a document (in the case of textual databases). For a database state c and an atomic query c′, we let sim(c′, c) denote the similarity value with which c satisfies the query c′.

Definitions and Examples Similarity Measure We let dist(c, c′) = 1 − sim(c′, c) represent the distance between c and c′. To define the similarity of a database sequence d = (d0, ..., dn−1) with respect to an automaton A, we first define a distance measure dist(d, a) between d and an input sequence a = (a0, ..., an−1) of equal length.

Definitions and Examples Similarity Measure Let sim_vec(d, a) be the sequence (s0, ..., sn−1) where, for each i = 0, ..., n−1, si = sim(ai, di). We assume that all similarity values and distances are normalized, i.e. they lie in the interval [0, 1]. Let F be a vector distance function which, given two vectors x, y as arguments, returns a real number lying in the interval [0, 1].

Definitions and Examples Similarity Measure We define dist(d, a) = F(sim_vec(d, a), 1), where 1 denotes the vector of all ones. Now, we define a distance measure dist(d, C) between the database sequence d and a set C ⊆ Σ* of input sequences. dist(d, C) is the minimum of dist(d, α), where the minimum is taken over all α ∈ C such that |α| = |d|; if there is no sequence α ∈ C such that |α| = |d|, then we take dist(d, C) to be equal to 1.

Definitions and Examples Similarity Measure We define the distance of d with respect to A, denoted by dist(d, A), to be dist(d, L(A)), where L(A) is the set of input sequences accepted by A. We define the similarity of d with respect to the automaton A, denoted by sim(d, A), to be 1 − dist(d, L(A)).

Definitions and Examples Similarity Measure Note that F1 is the average block distance function, F2 is the mean square distance function, and F∞ is the infinite norm distance function.
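The definition of the family Fp is not reproduced in this transcript. The sketch below assumes the usual normalized p-norm form, Fp(x, y) = ((1/n) Σ |xi − yi|^p)^(1/p) with F∞(x, y) = max |xi − yi|, which is consistent with the names given above but is an assumption, not a quotation from the paper. It then chains the definitions of sim_vec, dist(d, a) and dist(d, C). All function names and the toy similarity `toy_sim` are illustrative.

```python
import math


def f_p(x, y, p):
    """Assumed normalized p-norm distance between equal-length vectors in [0, 1]^n."""
    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return (sum(v ** p for v in diffs) / len(diffs)) ** (1.0 / p)


def sim_vec(d, a, sim):
    """Vector (s_0, ..., s_{n-1}) with s_i = sim(a_i, d_i)."""
    return [sim(ai, di) for ai, di in zip(a, d)]


def dist_seq(d, a, sim, p):
    """dist(d, a) = F(sim_vec(d, a), 1), where 1 is the all-ones vector."""
    return f_p(sim_vec(d, a, sim), [1.0] * len(d), p)


def dist_to_set(d, C, sim, p):
    """dist(d, C): minimum over alpha in C with |alpha| = |d|, or 1 if there is none."""
    same_len = [alpha for alpha in C if len(alpha) == len(d)]
    if not same_len:
        return 1.0
    return min(dist_seq(d, alpha, sim, p) for alpha in same_len)


# Toy similarity (an assumption for illustration only): database states and
# atomic queries are numbers, and similarity decays linearly with their gap.
def toy_sim(query, state):
    return max(0.0, 1.0 - abs(query - state) / 10.0)


d = [100, 98, 95]
a = [100, 100, 100]
avg_block = dist_seq(d, a, toy_sim, 1)        # average of the per-state distances
inf_norm = dist_seq(d, a, toy_sim, math.inf)  # largest per-state distance
```

Here the per-state distances are 0, 0.2 and 0.5, so the average block distance is their mean and the infinite norm distance is the largest of them.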

Definitions and Examples Wild Card Symbol We assume that there is a special input symbol φ which denotes a wild card symbol, i.e. it denotes an atomic query which is always satisfied. Cycle-Restricted Automata Let A = (Q, Σ, δ, I, F) be an automaton. A path of the automaton is a sequence of transitions of the following form: q0 →[a0] q1, q1 →[a1] q2, ..., qn−1 →[an−1] qn. We call such a sequence a path from q0 to qn.

Definitions and Examples Cycle-Restricted Automata We call the path a φ-path if all input symbols appearing in it are wild cards, i.e., for each i = 0, ..., n−1, ai = φ. The above path is called a cycle if qn = q0 and q0, q1, ..., qn−1 are all distinct. A φ-path which is also a cycle is called a φ-cycle. We say that an automaton is cycle-restricted if it has no φ-cycles of length greater than 1; in other words, the only φ-cycles allowed are self-loops.
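The cycle-restricted condition can be checked mechanically: keep only the wild card transitions, drop self-loops (cycles of length 1), and test whether the remaining directed graph has a cycle. The sketch below does this with a depth-first search; the function name, the `"φ"` string used for the wild card, and the triple encoding of transitions are assumptions made for illustration.

```python
WILDCARD = "φ"


def is_cycle_restricted(transitions, wildcard=WILDCARD):
    """True if the wildcard-labelled transitions form no cycle of length > 1."""
    # Graph of wildcard transitions, ignoring self-loops (cycles of length 1).
    graph = {}
    for q, a, q2 in transitions:
        if a == wildcard and q != q2:
            graph.setdefault(q, set()).add(q2)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {}

    def has_cycle_from(q):
        colour[q] = GREY
        for nxt in graph.get(q, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: a wildcard cycle of length > 1
            if state == WHITE and has_cycle_from(nxt):
                return True
        colour[q] = BLACK
        return False

    return not any(colour.get(q, WHITE) == WHITE and has_cycle_from(q)
                   for q in list(graph))
```

A wild card self-loop such as (2, φ, 2) is accepted, while a pair of transitions (1, φ, 2) and (2, φ, 1) is rejected.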

Definitions and Examples Nearest Neighbor and Range Queries In this paper, we consider the evaluation of two types of queries, assuming that we are given a query automaton A and a database sequence d.

Definitions and Examples Nearest Neighbor and Range Queries Queries of the first type are called nearest neighbor queries. Here we have to retrieve the k subsequences of d having the lowest distances with respect to A, where k is an additional input which is a positive integer.

Definitions and Examples Nearest Neighbor and Range Queries Queries of the second type are called range queries. Here we have to retrieve all subsequences of d whose distance with respect to A is less than or equal to ε, where ε is an additional input which is a positive fraction.
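The two query types can be stated as a brute-force baseline, treating subsequences as contiguous runs of d. This is only a reference sketch that evaluates the distance for every subsequence; it is not one of the paper's algorithms. The parameter `dist_to_query` stands in for dist(·, A), and `toy_dist` is a made-up distance used purely to exercise the functions.

```python
import heapq


def contiguous_subsequences(d):
    """Yield (start, end, d[start:end]) for every non-empty contiguous subsequence."""
    n = len(d)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield start, end, d[start:end]


def nearest_neighbor_query(d, dist_to_query, k):
    """Return the k subsequences with the lowest distance, as (distance, start, end)."""
    scored = ((dist_to_query(sub), s, e)
              for s, e, sub in contiguous_subsequences(d))
    return heapq.nsmallest(k, scored)


def range_query(d, dist_to_query, epsilon):
    """Return (start, end) for all subsequences whose distance is at most epsilon."""
    return [(s, e) for s, e, sub in contiguous_subsequences(d)
            if dist_to_query(sub) <= epsilon]


# Hypothetical distance: the fraction of states that violate "price < 100".
def toy_dist(sub):
    return sum(1 for x in sub if x >= 100) / len(sub)


prices = [101, 99, 98, 102]
closest = nearest_neighbor_query(prices, toy_dist, 2)
within = range_query(prices, toy_dist, 0.0)
```

Indices are 0-based and half-open, so `(1, 3)` denotes the subsequence `prices[1:3]`.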

ALGORITHMS FOR INFINITE NORM DISTANCE Definitions and lemma Lemma 4.1 Let q be any state in Q and i be an integer such that 1 ≤ i ≤ n. Further, let q1, ..., qm be the successor states of q on input symbols a1, ..., am, respectively.
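The conclusion of Lemma 4.1 is cut off in this transcript. One recurrence that fits its setup, given here only as an assumption and not as the lemma itself, is a backward dynamic program: the best value at state q and position i is the minimum, over successors qj on symbol aj, of the larger of dist(di, aj) and the best value at qj and position i + 1, where matching may stop as soon as a final state is reached. The sketch below implements that recurrence with 0-based positions; all names are illustrative.

```python
def inf_norm_distances(d, transitions, initial, final, dist):
    """For each start index i, return the smallest maximum per-state distance over
    all subsequences d[i:j] (j > i) and equal-length strings accepted by the automaton."""
    n = len(d)
    states = set(initial) | set(final)
    for q, _a, q2 in transitions:
        states.update((q, q2))
    no_match = 1.0  # distance used when no accepted string fits
    # best[i][q]: best value for subsequences starting at i with the automaton in q.
    best = [{q: no_match for q in states} for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for q, a, q2 in transitions:
            step = dist(d[i], a)
            # Stopping in a final state adds nothing to the maximum;
            # otherwise the match must continue from position i + 1.
            rest = 0.0 if q2 in final else best[i + 1][q2]
            best[i][q] = min(best[i][q], max(step, rest))
    return [min(best[i][q] for q in initial) for i in range(n)]


# Toy setup: an automaton for a b* c and a hypothetical per-state distance.
toy_transitions = [(1, "a", 2), (2, "b", 2), (2, "c", 3)]


def toy_dist(state, symbol):
    if state == symbol:
        return 0.0
    return 0.4 if state == "x" else 1.0


per_start = inf_norm_distances(["a", "b", "x", "c"], toy_transitions, {1}, {3}, toy_dist)
```

The table has one entry per position and state, and each position visits every transition once, so the running time is proportional to the sequence length times the number of transitions.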

Employing Indices for Fast Retrieval For each i = 1, ..., m, we can retrieve a list Li of entries of the form (I, val), where I is an interval of the form [u, v] such that 1 ≤ u ≤ v ≤ n and 0 ≤ val < 1. The entry ([u, v], val) on the list Li denotes that the distance, with respect to ai, of all database states whose indices fall within the range [u, v] is val; that is, for all j such that u ≤ j ≤ v, dist(dj, ai) = val.
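An interval list of this shape supports a simple binary-search lookup of dist(dj, ai). The sketch below assumes that the intervals on one list are disjoint and that positions not covered by any entry have distance 1 (suggested by the bound val < 1, but not stated in the slides). The function names are illustrative.

```python
import bisect


def build_lookup(entries):
    """Sort index entries ((u, v), val) by interval start for binary search."""
    ordered = sorted(entries)
    starts = [u for (u, _v), _val in ordered]
    return starts, ordered


def lookup_distance(index, j, default=1.0):
    """Return dist(d_j, a_i) from the interval list, or `default` if j is not covered."""
    starts, ordered = index
    pos = bisect.bisect_right(starts, j) - 1  # last interval starting at or before j
    if pos >= 0:
        (u, v), val = ordered[pos]
        if u <= j <= v:
            return val
    return default


# A hypothetical list L_i for one input symbol a_i.
L_i = build_lookup([((1, 3), 0.2), ((7, 9), 0.0), ((4, 4), 0.5)])
```

Each lookup costs a logarithmic number of comparisons in the number of entries, and runs of consecutive states with the same distance are stored only once.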

Algorithms for Average Block Distance For any subsequence σ = (d_i, ..., d_{i+l−1}) of d and any string a = (a_1, ..., a_l) ∈ Σ* of the same length, let bd(σ, a) = Σ_{j=0}^{l−1} dist(d_{i+j}, a_{j+1}); it denotes the block distance between σ and a.
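The block distance is a plain sum of per-position distances, as the following sketch shows. It uses 0-based indexing, so `a[j]` plays the role of a_{j+1}; dividing by the length to obtain an average is an assumption based on the name "average block distance". The helper names and the toy distance are illustrative.

```python
def block_distance(d, i, a, dist):
    """bd(sigma, a) for sigma = d[i : i + len(a)]: the sum of per-position distances."""
    length = len(a)
    if i + length > len(d):
        raise ValueError("subsequence runs past the end of d")
    return sum(dist(d[i + j], a[j]) for j in range(length))


def average_block_distance(d, i, a, dist):
    """Block distance divided by the subsequence length (assumed normalization)."""
    return block_distance(d, i, a, dist) / len(a)


# Hypothetical per-state distance: scaled absolute difference, capped at 1.
def toy_dist(state, query):
    return min(1.0, abs(state - query) / 10.0)


bd = block_distance([100, 98, 95, 100], 1, [100, 100], toy_dist)
```

In this example the subsequence (98, 95) is compared with the string (100, 100), giving per-position distances 0.2 and 0.5.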

Algorithms for Average Block Distance Let val(q, i, r) = min{bd(σ, a) : σ is a subsequence of d starting from d_i, and a is any string in T(q) of the same length as σ whose pseudo length is r}, where T(q) is the set of strings accepted by A starting from the state q.

Algorithms for Average Block Distance AVG-DIST computes the minimum of the distances of all subsequences of the database sequence with respect to the automaton A. AVG-DIST-RESTR-AUT is the algorithm for cycle-restricted automata.

Experimental Results We have implemented all the algorithms: INF-NORM, INF-NORM-INDX, AVG-DIST, INF-NORM-RESTR-AUT and AVG-DIST-RESTR-AUT. They use SQL to run the algorithms on a stock market database.

Experimental Results The database stores the end-of-day Dow Jones Industrial Average over the last 98 years, giving a database sequence of length 26,716 (the length is the total number of trading days during that period). The query is specified by an automaton that accepts the language given by the regular expression ab*c.

Conclusion and Discussion The paper introduces a powerful formalism based on automata for expressing queries on sequence databases. It also gives efficient algorithms for similarity based retrieval that employ indices. The algorithms were implemented for time-series databases on a PC using Sequel Server.

Conclusion and Discussion Experimental results showing the effectiveness of our methods are presented. It will also be interesting to see if and how the techniques of the paper can be extended for data mining over sequences.