Supporting Ranking and Clustering as Generalized Order-By and Group-By

Slides:



Advertisements
Similar presentations
Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis.
Advertisements

RankSQL: Supporting Ranking Queries in RDBMS Chengkai Li (UIUC) Mohamed A. Soliman (Univ. of Waterloo) Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (Univ.
1 RankSQL: Query Algebra and Optimization for Relational Top-k Queries Chengkai Li (UIUC) joint work with Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (U.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Querying for Information Integration: How to go from an Imprecise Intent to a Precise Query? Aditya Telang Sharma Chakravarthy, Chengkai Li.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
Analyzing Minerva1 AUTORI: Antonello Ercoli Alessandro Pezzullo CORSO: Seminari di Ingegneria del SW DOCENTE: Prof. Giuseppe De Giacomo.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.
XML, distributed databases, and OLAP/warehousing The semantic web and a lot more.
NiagaraCQ A Scalable Continuous Query System for Internet Databases Jianjun Chen, David J DeWitt, Feng Tian, Yuan Wang University of Wisconsin – Madison.
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign Light-weight Domain-based Form Assistant: Querying Web Databases On the.
The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB integration Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint.
DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.
Physical Database Design Chapter 6. Physical Design and implementation 1.Translate global logical data model for target DBMS  1.1Design base relations.
Querying Structured Text in an XML Database By Xuemei Luo.
Advanced Databases: Lecture 6 Query Optimization (I) 1 Introduction to query processing + Implementing Relational Algebra Advanced Databases By Dr. Akhtar.
Data Warehouse Design Xintao Wu University of North Carolina at Charlotte Nov 10, 2008.
1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
1 Information Retrieval LECTURE 1 : Introduction.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.
Boolean + Ranking: Querying a Database by K-Constrained Optimization Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
1 VLDB, Background What is important for the user.
1 Chengkai Li Kevin-Chen-Chuan Chang Ihab Ilyas Sumin Song Presented by: Mariam John CSE /20/2006 RankSQL: Query Algebra and Optimization for Relational.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
1 CS122A: Introduction to Data Management Lecture #15: Physical DB Design Instructor: Chen Li.
CHAPTER 19 Query Optimization. CHAPTER 19 Query Optimization.
University of Maryland Baltimore County
More SQL: Complex Queries, Triggers, Views, and Schema Modification
On-Line Application Processing
Boolean + Ranking: Querying a Database by K-Constrained Optimization
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
CS 540 Database Management Systems
Indexing Structures for Files and Physical Database Design
CS 540 Database Management Systems
Parallel Databases.
Supporting Ad-Hoc Ranking Aggregates
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Database Performance Tuning &
Parameter Sniffing in SQL Server Stored Procedures
Why the interest in Queries?
COMP 430 Intro. to Database Systems
Ishan Sharma Abhishek Mittal Vivek Raj
Physical Database Design for Relational Databases Step 3 – Step 8
Chapter 12: Query Processing
Database Performance Tuning and Query Optimization
Evaluation of Relational Operations
Chapter 15 QUERY EXECUTION.
Lecture 16: Probabilistic Databases
CS222: Principles of Data Management Notes #09 Indexing Performance
The Relational Model Textbook /7/2018.
CS222P: Principles of Data Management Notes #09 Indexing Performance
Chapters 15 and 16b: Query Optimization
Introduction to Information Retrieval
Overview of Query Evaluation
Chapter 11 Database Performance Tuning and Query Optimization
Chen Li Information and Computer Science
Evaluation of Relational Operations: Other Techniques
Information Retrieval and Web Design
Prefer: A System for the Efficient Execution
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #08 Comparisons of Indexes and Indexing Performance Instructor: Chen Li.
Presentation transcript:

Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin Chang (UIUC)

Boolean database queries (Relational Algebra and System R papers)

Data Retrieval

Example: What Boolean queries provide SELECT * FROM Houses H WHERE 200K<price<400K AND #bedroom = 4 query semantics results organization Boolean query hard constraints (True or False) a flat table too many (few) answers

Example: What may be desirable < 500K is more acceptable but willing to pay more for big house close to the lake is a plus avoid locations near airport … query semantics results organization Boolean query hard constraints (True or False) a flat table too many (few) answers fuzzy retrieval “soft” constraints (preference,similarity,relevance,…) a ranked list a grouping of results etc.

“Retrieval” of DATA: From Boolean query to fuzzy retrieval retrieval engine data text search engine

Retrieval mechanisms: Learning from Web search Ranking Clustering Navigation Map Facets Categorization …

Generalizing SQL constructs for data retrieval Order-By Group-By

From crispy ordering to fuzzy ranking Order By Houses.size, Houses.price by attribute values equality of values order ≠ desirability Fuzzy ranking Order By Houses.size − 4*Houses.price Limit 5 by ranking function combine matching criteria order => desirability : top-k

From crispy grouping to fuzzy clustering Group By Houses.size, Houses.price by attribute values equality partition no limit on output size Fuzzy clustering Group By k-means(H.size, H.price) Into 5 by distance function proximity of values number of clusters

Need for combining ranking with clustering Clustering-only A group can be big “too many answers” problem persists How to compare things within each group? Ranking-only Lack of global view top-k results may come from same underlying group (e.g., cheap and big houses come from a less nice area.) Different groups may not be comparable

Contributions Concepts Efficient processing framework generalize Group-By to fuzzy clustering, parallel to the generalization from Order-By to ranking integrate ranking with clustering in database queries Efficient processing framework summary-based approach

Related works Clustering use summary (grid with buckets) not on dynamic query results use summary (grid with buckets) (e.g., STING [WangYM97] WaveCluster [SheikholeslamiCZ98] ) Ranking (top-k) in DB: many instances (e.g., [LiCIS05]) use summary (histogram) in top-k to range query translation [ ChaudhuriG99] Categorization of query results [ChakrabartiCH04] different from clustering no integration with ranking focus on reducing navigation overhead, not processing Web search and IR (e.g., [ZamirE99] [LeuskiA00])

Integrate the two generalizations Boolean conditions Clustering Ranking ? semantics evaluation

Query semantics: ClusterRank queries SELECT * FROM Houses WHERE area=“Chicago” GROUP BY longitude, latitude INTO 5 ORDER BY size – 4*price LIMIT 3 Boolean clustering ranking Semantics: order-within-groups Return the top k tuples within each group (cluster).

Several notes Non-deterministic semantics clustering is non-deterministic by nature sacrificing the crispiness of SQL queries worthy for exploring the fuzziness in data retrieval? Language syntax isn’t our focus current SQL semantics: order-among-groups Select… From… Where…Group By… Order By…(RankAgg[LiCI06]) OLAP function Clustering function algorithm, distance measure hidden behind Other semantics e.g., cluster the global top k

Query evaluation: Straightforward Materialize-Cluster-Sort approach SELECT * FROM Houses WHERE area=“Chicago” GROUP BY longitude, latitude INTO 5 ORDER BY size – 4*price LIMIT 3 Boolean clustering sorting

Query evaluation: Straightforward Materialize-Cluster-Sort approach Overkill: cluster and rank all, only top 10 in each cluster are requested Inefficient: fully generate Boolean results clustering large amount of results is expensive sorting big group is costly Boolean clustering sorting

Query evaluation: Summary-driven approach use summary to cluster use summary for pruning in ranking use bitmap-index to construct query-dependant summary to bring together Boolean, clustering, and ranking

Summary for clustering latitude 2 1 5 4 longitude longitude latitude K-means on original tuples  weighted K-means on virtual tuples 

Summary for ranking 2 1 - 4*price - 4*price [ -40,10] size size size 10 20 30 40 -30 -20 -10 (20,40] 1 (10,30] 2 (0,20] - 4*price -40 [ -40,10] - 4*price -10 -20 -30 -40 10 20 30 40 size ORDER BY size – 4 *price LIMIT 3

Construct summary by bitmap-index TID t1 t2 t3 t4 t5 … size(10,20] 1 … -4*price[-40,-30] 1 … & 1 … - 4*price t1 t3 -10 t4 -20 -30 t2 t5 -40 10 20 30 40 size The advantages of using bitmap index: Small Bit operations (&, |, ~, count) are fast Easily integrate Boolean, clustering, and ranking

Integrating Boolean, clustering, and ranking longitude latitude - 4*price size Vec(area=“chicago”) 00111111… - 4*price size 10 20 30 40 & longitude latitude & - 4*price size 10 20 30 40 & longitude latitude 2 1 5 4 Vec(cluster1) 00101001… Vec(cluster2) 00010110…

Experiments ClusterRank (summary-driven approach) vs. StraightFwd (materialize-cluster-rank) Processing efficiency: ClusterRank >> StrightFwd Clustering Quality: ClusterRank ≈ StrightFwd synthetic data various configuration parameters (#tuples, #clusters, #clustering attr, #ranking attr, #paritions per attr, k)

Efficiency t: #clusters s: #tuples 4M tuples, 5 clustering attr, 3 ranking attr, top 5 s: #tuples 10 clusters, 5 clustering attr, 3 ranking attr, top 5

Clustering quality t: #clusters s: #tuples close(res_SF, res_CR): closeness of results from StraightFwd and ClusterRank close(res_SF, res_SF): closeness of results from different runs of StraightFwd t: #clusters 4M tuples, 8 clustering attr s: #tuples 10 clusters, 3 clustering attr

Conclusions Borrow innovative mechanisms from other areas to support data retrieval applications Ranking and clustering as generalized Order-By and Group-By, integrated in database queries Query semantics: ClusterRank queries Query evaluation: summary-driven approch vs. materialize-cluster-sort evaluation efficiency: ClusterRank >> StraightFwd clustering quality: ClusterRank ≈ StraightFwd

Acknowledgement Rishi Rakesh Sinha: source code of bitmap index Jiawei Han: discussions regarding presentation

Alternative semantics? global clustering / local ranking (focus of this paper) clustering: Boolean results ranking: local top k in each cluster local clustering / global ranking clustering: global top k ranking: Boolean results global clustering / global ranking ranking: in each cluster, return those belonging to global top k rank the clusters? (by average of local top k?)

Join queries Star-schema fact table, dimension tables, key and foreign key Bitmap join-index index the fact table by the attributes in dimension tables