Efficient Similarity Search: Arbitrary Similarity Measures, Arbitrary Composition. Date: 2011/10/31. Source: Dustin Lange et al. (CIKM’11). Speaker: Chiang, Guang-ting.

Presentation transcript:

Efficient Similarity Search: Arbitrary Similarity Measures, Arbitrary Composition. Date: 2011/10/31. Source: Dustin Lange et al. (CIKM’11). Speaker: Chiang, Guang-ting. Advisor: Dr. Koh, Jia-ling.

Index: Introduction; Problem Definition; Evaluating Query Plans; Query Plan Optimization; Evaluation; Conclusion.

Introduction: Many commercial applications maintain a person/customer data set that typically contains information such as the name, the date of birth, and the address of individuals who are somehow related to the company. Searching such a data set has two requirements: 1. Queries must be answered extremely fast. 2. The results returned should be relevant to the query.

Introduction: Motivation: For these and similar problem settings, a common approach is to define several similarity measures for different aspects of the problem. There exist different approaches to combine such similarity measures. Goal: perform efficient search based on a set of similarity measures and an arbitrary composition technique.

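Introduction: Filter-and-refine search. [Figure: a query is evaluated against the database using a filter criterion (the query plan), which yields the result list.] Requirements for the filter criterion: 1. The list contains the correct similar records. 2. The number of missed records should be as low as possible. 3. The list should be as short as possible.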

Problem Definition: Composed similarity measures; Query plan optimization.

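Problem Definition: Composed similarity measures. Each attribute has its own base similarity measure: FirstName: Jaro-Winkler distance; LastName: Jaro-Winkler distance; Birthdate: relative distance; zip: numeric difference. Each base similarity lies in [0,1], and the base similarities are combined by a weighted sum.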
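The weighted-sum composition can be sketched as follows. This is a minimal illustration that assumes the four base similarities have already been computed and normalized to [0,1]; the weights and the example values are invented for illustration and are not taken from the slides.

```python
# Hypothetical weights; the slides do not state the actual values.
WEIGHTS = {"FirstName": 0.3, "LastName": 0.3, "Birthdate": 0.2, "zip": 0.2}


def composed_similarity(base_sims, weights=WEIGHTS):
    """Combine per-attribute similarities in [0, 1] into one weighted sum."""
    for attr, sim in base_sims.items():
        if not 0.0 <= sim <= 1.0:
            raise ValueError(f"similarity for {attr} must lie in [0, 1]")
    total_weight = sum(weights.values())
    # Dividing by the total weight keeps the composed value in [0, 1].
    return sum(weights[a] * base_sims[a] for a in weights) / total_weight


# Invented base similarities for one query/record pair.
example = {"FirstName": 0.9, "LastName": 1.0, "Birthdate": 0.5, "zip": 0.8}
score = composed_similarity(example)
```

Any other composition technique could replace the weighted sum here, since the approach only requires base similarities per attribute.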

Problem Definition: Query plan optimization. Query plan template: a combination of attribute predicates with the logical operators conjunction and disjunction. Query plan: a query plan template in which all threshold variables are assigned a value.
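One possible way to represent a template and an instantiated plan is sketched below. The nested-tuple encoding, the function name covers, and all threshold values are assumptions made for illustration; the slides do not prescribe a data structure.

```python
# A template is either an attribute name or ("and" | "or", child, child, ...).
# This encoding is an assumption made for illustration.
TEMPLATE = ("or", ("and", "FirstName", "LastName"), ("and", "Birthdate", "zip"))


def covers(template, thresholds, sims):
    """Return True if the base similarities satisfy the instantiated plan."""
    if isinstance(template, str):
        # Attribute predicate: the similarity must reach the threshold.
        return sims[template] >= thresholds[template]
    op, *children = template
    results = (covers(child, thresholds, sims) for child in children)
    return all(results) if op == "and" else any(results)


# Assigning a value to every threshold variable turns the template into a plan.
plan_thresholds = {"FirstName": 0.8, "LastName": 0.9, "Birthdate": 0.7, "zip": 1.0}
pair = {"FirstName": 0.85, "LastName": 0.95, "Birthdate": 0.2, "zip": 0.5}
is_covered = covers(TEMPLATE, plan_thresholds, pair)
```

Here the pair is covered through the first conjunction even though the second one fails, which is the effect of the disjunction at the top of the template.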

Problem Definition: Query plan optimization. To evaluate query plans, we define two performance metrics. 1. Completeness comp(qp): the expected proportion of correct query/result record pairs that are covered by the query plan qp. 2. Costs costs(qp): the expected average number of records in R that are covered by qp. Given a set R of records and a cost threshold C, the query plan optimization problem is to find the query plan qp that maximizes comp(qp) subject to costs(qp) < C.
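Stated as code, the optimization objective is a constrained maximum. The sketch below assumes that completeness and costs have already been estimated for each candidate plan; the plan names and numbers are invented for illustration.

```python
def best_plan(candidates, cost_threshold):
    """Pick the plan with the highest completeness among plans with costs < C.

    `candidates` is an iterable of (name, completeness, costs) tuples.
    Returns None when no plan satisfies the cost constraint.
    """
    feasible = [c for c in candidates if c[2] < cost_threshold]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c[1])


# Invented estimates for three hypothetical query plans.
candidates = [
    ("qp1", 0.95, 1500.0),  # most complete, but too expensive
    ("qp2", 0.90, 800.0),
    ("qp3", 0.70, 200.0),
]
chosen = best_plan(candidates, cost_threshold=1000)
```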

Evaluating Query Plans: Completeness estimation: How many matches can be found with this query plan? Cost estimation: How large is the cardinality of an average query result with this query plan?

Evaluating Query Plans: Completeness Estimation. Gathering training data: Real training data: In our use case, we have a set of 2m queries, most of them with results manually labeled as correct. This is the ideal situation, where we can rely on a manually labeled set of query/result record pairs.

Evaluating Query Plans: Completeness Estimation. Basic algorithm: iterate over the list T and count all query/result record pairs that are covered by qp. Speeding up the basic algorithm: for each distinct combination of base similarity values, save its count, thus eliminating all duplicate combinations. Reduce the number of value combinations further by rounding to two decimal places. This reduces the number of value combinations to be checked from 2.0m to 4.4k (a reduction of 99.8 %).
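The speed-up can be sketched with a counter over rounded similarity vectors. This is a minimal illustration on invented toy data; the attribute order (FirstName, LastName, Birthdate, zip) and the helper names are assumptions.

```python
from collections import Counter


def compress(training_pairs, digits=2):
    """Round each base-similarity vector and count identical combinations."""
    return Counter(tuple(round(s, digits) for s in pair) for pair in training_pairs)


def completeness(counts, plan):
    """Share of correct query/result pairs that `plan` covers.

    `plan` is any callable taking a similarity tuple and returning a bool.
    Each distinct combination is checked once and weighted by its count.
    """
    total = sum(counts.values())
    covered = sum(n for combo, n in counts.items() if plan(combo))
    return covered / total if total else 0.0


def first_name_plan(combo):
    """Query plan (FirstName >= 0.8); FirstName is position 0 of the tuple."""
    return combo[0] >= 0.8


# Toy list T of base-similarity vectors for correct pairs (invented values).
T = [
    (0.912, 1.0, 0.5, 1.0),
    (0.909, 1.0, 0.5, 1.0),  # rounds to the same combination as the first
    (0.40, 0.95, 1.0, 1.0),
    (0.85, 0.30, 0.2, 0.0),
]
counts = compress(T)
comp = completeness(counts, first_name_plan)
```

On this toy list the four pairs collapse into three distinct combinations, and the plan covers three of the four pairs.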

Evaluating Query Plans: Cost Estimation. Determining costs is a costly task. Naïve: [content not recoverable]. Proposed: [content not recoverable], computed quickly.

Evaluating Query Plans: Cost Estimation. Similarity histograms: For each analyzed value, we calculate the similarity to all other values of the attribute.

Evaluating Query Plans: Cost Estimation. Similarity histograms: For each possible similarity, average the measured numbers of records with similar values.
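A similarity histogram for a single attribute can be sketched as below. Two assumptions are made for illustration: difflib's SequenceMatcher ratio stands in for the Jaro-Winkler measure named in the slides, and every value of the attribute is analyzed rather than a sample. How histograms of several attributes are combined for conjunctions and disjunctions is not covered here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure in [0, 1]; NOT the Jaro-Winkler measure from the slides.
    return SequenceMatcher(None, a, b).ratio()


def similarity_histogram(values, steps=100):
    """hist[k] = average number of other records with similarity >= k/steps."""
    buckets = [0] * (steps + 1)
    for i, probe in enumerate(values):
        for j, other in enumerate(values):
            if i != j:  # compare each analyzed value to all *other* values
                buckets[int(similarity(probe, other) * steps + 1e-9)] += 1
    # Suffix sums turn "falls in bucket k" into "similarity >= k/steps".
    hist = [0.0] * (steps + 1)
    running = 0
    for k in range(steps, -1, -1):
        running += buckets[k]
        hist[k] = running / len(values)
    return hist


first_names = ["JACK", "JACK", "KOBE", "EDDY"]  # invented toy column
hist = similarity_histogram(first_names)
estimated_cost = hist[80]  # expected records for the predicate (FirstName >= 0.8)
```

Once built, the histogram answers the cost question for any threshold of that attribute with a single lookup instead of a scan over all records.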

Query Plan Optimization: The best query plan is the plan with the highest completeness below a cost threshold. How can it be found? 1. Iterate over the set of all possible query plan templates and optimize the thresholds for each template. 2. This yields one optimal query plan per query plan template. 3. From these query plans, select the overall best query plan. This exhaustive approach costs too much.

Query Plan Optimization: Observations (cost).

Query Plan Optimization: Observations (completeness).

Query Plan Optimization: Top neighborhood algorithm. The algorithm should evaluate as few query plans as possible and still determine a good overall solution. [Figure: grid of values W(q1), W(q2), ..., W(qn) for query plans of the template (A ∧ B) ∨ (C ∧ D) over the attributes A, B, C, D.]

Query Plan Optimization: Template (A ∧ B) ∨ (C ∧ D) over the attributes A, B, C, D. A query plan qp has a neighborhood N(qp) that contains all query plans that can be constructed by lowering one threshold of qp by one step (for example, by 0.01).
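The neighborhood definition can be sketched as follows. The dictionary representation of a plan's thresholds and the lower bound of 0.0 are assumptions made for illustration; the step of 0.01 follows the example above.

```python
def neighborhood(thresholds, step=0.01, minimum=0.0):
    """All plans obtained by lowering exactly one threshold by one step."""
    neighbors = []
    for attr, value in thresholds.items():
        # Rounding removes floating-point noise such as 0.8900000000000001.
        lowered = round(value - step, 10)
        if lowered >= minimum:  # assumed lower bound: thresholds stay in [0, 1]
            neighbor = dict(thresholds)
            neighbor[attr] = lowered
            neighbors.append(neighbor)
    return neighbors


# Invented thresholds for the template (A and B) or (C and D).
qp = {"A": 1.0, "B": 0.9, "C": 0.0, "D": 0.5}
neighbors = neighborhood(qp)
```

Because the template uses only conjunction and disjunction, lowering a threshold can only add covered records, so a neighbor never has lower completeness or lower costs than the plan it was derived from.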

Query Plan Optimization: [Figure: grid of neighborhoods N(W1), N(W2), ..., N(Wm) for the template (A ∧ B) ∨ (C ∧ D) over the attributes A, B, C, D.]

Real-world Similarity Search: 210 query plan templates. t = 100 (a window size that results in an acceptable runtime of our algorithm). C = 1000 (a cost threshold that is acceptable for our use case). 50,000 queries were randomly selected from our test data set.

Conclusion: Introduced a novel approach to efficient similarity search for composed similarity measures. Showed how to efficiently compute completeness and costs, two important performance metrics for query plans, based on a set of similarity indexes. The resulting trade-off between completeness and costs is handled with the approximate top neighborhood algorithm. Showed that the algorithm efficiently determines near-optimal query plans for real-world data.

Thank you for listening.

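Speed up the basic algorithm (example). Columns: Amount, FirstName, LastName, Birthdate, zip. Rows 1 to 5: EDDY, MA, 1/; JACK, JOHNSON, 2/; KOBE, CHU, 7/; JACK, JOHNSON, 2/; JACK, JOHNSON, 2/. Query plan: (FirstName ≥ 0.8). Query: Jackson, Mark, Susan, Jackson, Maggie. The duplicate JACK JOHNSON rows are deleted and kept as a single row with amount 3.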