Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Leila Kaghazian.

Similarity. Exact queries versus similarity queries. Typical similarity queries: identify companies with similar patterns of growth; determine products with similar selling patterns; discover stocks with similar movements in stock prices.

Similarity Queries. Whole matching: the sequences to be compared have the same length n. Range query: given a query sequence, find the sequences that are similar to it within distance ε. All-pairs query: given N sequences, find the pairs of sequences that are within ε of each other. Subsequence matching: the query sequence is shorter; we look for a subsequence in the large sequence that best matches the query sequence.
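The two whole-matching query types can be sketched as brute-force scans. This is only a minimal illustration, assuming Euclidean distance and NumPy; the function names `range_query` and `all_pairs`, the threshold name `eps`, and the toy data are hypothetical and do not come from the slides.

```python
import numpy as np

def range_query(db, q, eps):
    """Return indices of sequences in db within Euclidean distance eps of q."""
    dists = np.linalg.norm(db - q, axis=1)
    return [int(i) for i in np.flatnonzero(dists <= eps)]

def all_pairs(db, eps):
    """Return index pairs (i, j), i < j, whose sequences are within eps."""
    pairs = []
    for i in range(len(db)):
        # Compare sequence i only against the sequences that follow it.
        dists = np.linalg.norm(db[i + 1:] - db[i], axis=1)
        pairs.extend((i, i + 1 + int(j)) for j in np.flatnonzero(dists <= eps))
    return pairs

# Toy database: three sequences of equal length n = 4 (hypothetical data).
db = np.array([[1.0, 2.0, 3.0, 4.0],
               [1.0, 2.0, 3.0, 5.0],
               [9.0, 9.0, 9.0, 9.0]])
hits = range_query(db, np.array([1.0, 2.0, 3.0, 4.0]), eps=1.5)
close_pairs = all_pairs(db, eps=1.5)
```

Both scans touch every full-length sequence, which is the cost an index over a few extracted features is meant to avoid.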

Extracting Features from Sequences. For numerical sequences, extract k features from each sequence, map the sequence to a point in k-dimensional space, and use a multidimensional index method (R*-tree, R-tree, grid files, …) to store and search these points. Two concerns: completeness of feature extraction (no qualifying sequence may be missed) and the dimensionality "curse" (index performance degrades as k grows).

Discrete Fourier Transform. All periodic waves can be generated by combining sine and cosine waves of different frequencies. The number of frequencies may not be finite. The Fourier transform decomposes a periodic wave into its component frequencies.
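For reference, one common way to write the DFT of a sequence x_0, …, x_{n-1} and its inverse is given below. The slides do not spell out a formula, so the symbols (X_f for the coefficient at frequency f, j for the imaginary unit) and the 1/√n scaling, chosen here because it makes the transform orthonormal, are assumptions.

```latex
X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi f t / n}, \qquad f = 0, 1, \ldots, n-1

x_t = \frac{1}{\sqrt{n}} \sum_{f=0}^{n-1} X_f \, e^{\,j 2\pi f t / n}, \qquad t = 0, 1, \ldots, n-1
```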

DFT Concept I

DFT Concept II

DFT Characteristics. Completeness of feature extraction. Dimensionality curse. Parseval's theorem states that the Euclidean distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domain.
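The distance-preservation property can be checked numerically. The sketch below assumes NumPy and uses `norm="ortho"` so that the DFT carries the 1/√n scaling that makes it orthonormal; the random test signals are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
y = rng.standard_normal(128)

# norm="ortho" applies the 1/sqrt(n) scaling, making the DFT orthonormal.
X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")

d_time = np.linalg.norm(x - y)   # Euclidean distance in the time domain
d_freq = np.linalg.norm(X - Y)   # same distance over complex coefficients
```

With NumPy's default (unscaled) forward transform the two distances would differ by a factor of √n, so the choice of normalization matters for this check.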

Proposed Technique. Obtain the DFT coefficients of each sequence in the database. Build a multidimensional index (F-index) using the first fc (< 5) Fourier coefficients. For a range query, obtain the first fc Fourier coefficients of the query and search the F-index with them. For an all-pairs query, do a spatial join using the F-index. In both cases the index returns a superset of the answer set; the actual answer set is obtained in a post-processing step.
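The filter-and-refine flow of a range query can be sketched as follows. This is not the F-index itself: a linear scan over the feature vectors stands in for the R*-tree, and the names `features` and `f_index_range_query`, the synthetic random-walk data, and the parameter values are all hypothetical.

```python
import numpy as np

def features(seqs, fc):
    """First fc orthonormal DFT coefficients of each row (complex-valued)."""
    return np.fft.fft(seqs, axis=1, norm="ortho")[:, :fc]

def f_index_range_query(db, db_feats, q, eps, fc):
    # Filter step: the distance over fc coefficients never exceeds the true
    # distance, so the candidates form a superset of the answer set.
    # A linear scan replaces the multidimensional index in this sketch.
    q_feat = features(q[None, :], fc)[0]
    cand = np.flatnonzero(np.linalg.norm(db_feats - q_feat, axis=1) <= eps)
    # Post-processing step: discard false hits using the full sequences.
    answer = [int(i) for i in cand if np.linalg.norm(db[i] - q) <= eps]
    return [int(i) for i in cand], answer

rng = np.random.default_rng(1)
db = np.cumsum(rng.standard_normal((200, 64)), axis=1)  # synthetic random walks
fc = 3
db_feats = features(db, fc)
q = db[0] + 0.1 * rng.standard_normal(64)  # noisy copy of sequence 0
eps = 5.0
candidates, answer = f_index_range_query(db, db_feats, q, eps, fc)

# Brute-force reference answer over the full sequences.
brute = [int(i) for i in np.flatnonzero(np.linalg.norm(db - q, axis=1) <= eps)]
```

Each complex coefficient contributes two real dimensions (real and imaginary part), so an index built on fc coefficients would hold points with up to 2·fc coordinates.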

Euclidean Distance Features. Euclidean distance is useful in many cases. It can be used with any other type of similarity measure. It is the optimal distance measure for estimation if signals are corrupted by additive Gaussian noise. It is preserved under orthonormal transforms.

DFT Characteristics. The DFT preserves distance, is easy to compute, and concentrates the energy of the signal in a few coefficients. It is an orthonormal transform. Orthonormal transforms are either data-dependent (advantage: better performance; drawback: expensive data reorganization if the data set evolves over time) or data-independent (DFT, DCT, wavelet).

Number of Fourier Coefficients. The worst-case signal is white noise, where x_t is completely independent of its neighbors. It has the same energy in every frequency, which means all frequencies are equally important; this is bad for the F-index. Random walks (brown noise) model stock movements and exchange rates. Primary and secondary trends correspond to strong, low-frequency signals, while minor trends correspond to weak, high-frequency signals.
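The contrast between white noise and a random walk can be illustrated by measuring how much signal energy falls in the first few coefficients. The sketch assumes NumPy; the helper `energy_fraction`, the sequence length, and the choice of three coefficients are hypothetical.

```python
import numpy as np

def energy_fraction(x, fc):
    """Fraction of signal energy in the first fc orthonormal DFT coefficients."""
    X = np.fft.fft(x, norm="ortho")
    energy = np.abs(X) ** 2
    return float(energy[:fc].sum() / energy.sum())

rng = np.random.default_rng(2)
n = 1024
white = rng.standard_normal(n)             # white noise: flat spectrum
walk = np.cumsum(rng.standard_normal(n))   # random walk (brown noise)

frac_white = energy_fraction(white, 3)
frac_walk = energy_fraction(walk, 3)
```

For white noise the expected share of three out of 1024 coefficients is tiny, while the random walk places a much larger share of its energy at the lowest frequencies, which is why a handful of coefficients can serve as features for such data.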

Performance Experiments. How should the number of Fourier coefficients retained in the F-index (the cut-off frequency fc) be chosen? A larger fc reduces the false hits but increases the search time. How does the search time grow as a function of the number of sequences in the database? How does the length n of the sequences affect performance?
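The effect of fc on false hits can be sketched on synthetic data. This only counts candidates produced by the filter step; it does not measure index search time, and the data set, threshold, and range of fc values are hypothetical rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
db = np.cumsum(rng.standard_normal((500, 128)), axis=1)  # synthetic random walks
q = db[0]
eps = 10.0

# By linearity, the DFT of (db - q) equals DFT(db) - DFT(q).
spectrum_diff = np.fft.fft(db - q, axis=1, norm="ortho")
true_hits = int(np.sum(np.linalg.norm(db - q, axis=1) <= eps))

candidate_counts = []
for fc in range(1, 6):
    # Distance over the first fc coefficients lower-bounds the true distance.
    lower_bound = np.linalg.norm(spectrum_diff[:, :fc], axis=1)
    candidate_counts.append(int(np.sum(lower_bound <= eps)))

# False hits: candidates that the post-processing step would discard.
false_hits = [c - true_hits for c in candidate_counts]
```

Because each added coefficient can only raise the lower bound, the candidate count cannot grow with fc; the price is a higher-dimensional feature space for the index to search.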

Figure: Number of Fourier coefficients (panels: Range Queries, All-Pairs Queries).

Figure: Different Sequence Set Size (panels: All-Pairs Queries, Range Queries).

Figure: Varying Sequence Length (panels: All-Pairs Queries, Range Queries).

Discussion. The minimum execution time for both range and all-pairs queries is achieved for a small value of fc. Increasing the number of sequences in the database results in higher gains for this method. Increasing the sequence length n also results in higher gains for the method.

Summary. Use the DFT to extract sequence features. Only the first few coefficients are strong enough to keep. The DFT is orthonormal. Use an R*-tree for indexing. Use Euclidean distance. Computing the DFT (via the FFT) has complexity O(n log n).