1 VLDB,2014. 2 Background What is important for the user.

Slides:



Advertisements
Similar presentations
MapReduce.
Advertisements

CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Copyright © 2011 Ramez Elmasri and Shamkant Navathe Algorithms for SELECT and JOIN Operations (8) Implementing the JOIN Operation: Join (EQUIJOIN, NATURAL.
Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
SUPPORTING TOP-K QUERIES IN RELATIONAL DATABASES. PROCEEDINGS OF THE 29TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, MARCH 2004 Sowmya Muniraju.
Clydesdale: Structured Data Processing on MapReduce Jackie.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
1 Chapter 10 Query Processing: The Basics. 2 External Sorting Sorting is used in implementing many relational operations Problem: –Relations are typically.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
1 Optimization - Selection. 2 The Selection Operation Table: Reserves(sid, bid, day, agent) A page (block) can hold 100 Reserves tuples There are 1,000.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
1 Query Processing: The Basics Chapter Topics How does DBMS compute the result of a SQL queries? The most often executed operations: –Sort –Projection,
A Search-based Method for Forecasting Ad Impression in Contextual Advertising Defense.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Query Optimization. overview Histograms A histogram is a data structure maintained by a DBMS to approximate a data distribution Equiwidth vs equidepth.
Advanced Databases: Lecture 8 Query Optimization (III) 1 Query Optimization Advanced Databases By Dr. Akhtar Ali.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Advanced Databases: Lecture 6 Query Optimization (I) 1 Introduction to query processing + Implementing Relational Algebra Advanced Databases By Dr. Akhtar.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Copyright © Curt Hill Query Evaluation Translating a query into action.
Online aggregation Joseph M. Hellerstein University of California, Berkley Peter J. Haas IBM Research Division Helen J. Wang University of California,
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney.
1University of Texas at Arlington.  Introduction  Motivation  Requirements  Paper’s Contribution.  Related Work  Overview of Ripple Join  Rank.
Lecture 1- Query Processing Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Segmented Hash: An Efficient Hash Table Implementation for High Performance Networking Subsystems Sailesh Kumar Patrick Crowley.
CS4432: Database Systems II Query Processing- Part 2.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Relational Operator Evaluation. Overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g.,
Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
MapReduce and the New Software Stack. Outline  Algorithm Using MapReduce  Matrix-Vector Multiplication  Matrix-Vector Multiplication by MapReduce 
Lecture 3 - Query Processing (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
Cost Estimation For each plan considered, must estimate cost: –Must estimate cost of each operation in plan tree. Depends on input cardinalities. –Must.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Systems, 8 th Edition SQL Performance Tuning Evaluated from client perspective –Most current relational DBMSs perform automatic query optimization.
Chapter 10 The Basics of Query Processing. Copyright © 2005 Pearson Addison-Wesley. All rights reserved External Sorting Sorting is used in implementing.
CS4432: Database Systems II Query Processing- Part 1 1.
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Information Retrieval in Practice
Information Retrieval in Practice
MapReduce Types, Formats and Features
Chapter 12: Query Processing
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Evaluation of Relational Operations: Other Operations
Lecture 2- Query Processing (continued)
Charles Tappert Seidenberg School of CSIS, Pace University
Evaluation of Relational Operations: Other Techniques
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Computational Advertising and
Evaluation of Relational Operations: Other Techniques
Map Reduce, Types, Formats and Features
Presentation transcript:

1 VLDB,2014

2 Background What is important for the user

3 Background Quality of results. Processing Time. Top-k results.

4 Usually user doesn't need all the results, But needs Top-k results. Rank top-k query play a key role in modern analytics tasks. Top-k

5 Rank join queries Usually user doesn't write query on one relation.

6 Problem Formulation SELECT select-list FROM R1,R2,……,Rn WHERE equi-join-expression(R1,R2,…..,Rn) ORDER BY f(R1,R2,…..,Rn) STOP AFTER k

7 Contributions A study of how to efficiently process top-k join queries in NoSQL cloudstores. We will use as a reference point the baseline techniques of using Hive or Pig. However, acknowledging their disadvantages we will contribute and study the performance of a number of different approaches.

8 Solutions INDEXED RANK JOINS Inverse Join List MapReduce RankJoin. Inverse Score List RankJoin. STATISTICAL RANKJOINS BFHM Rank-Join

9 Rank Joins with Hive and Pig In Hive, rank join processing consists of two MapReduce jobs plus a final stage. The first job computes and materializes the join result set. The second one computes the score of the join result set tuples and stores them sorted on their score. A third, non-MapReduce stage then fetches the highest-k ranked results from the final list.

10 Rank Joins with Hive and Pig Pig takes a smarter approach. Its query plan optimizer pushes projections and top-k (STOP AFTER) operators as early in the physical plan as possible. And takes extra measures to better balance the load caused by the join result ordering (ORDER BY) operator.

11 Input for running example Scoring of individual rows based on either a(predefined) function on attribute values. or some explicit\score" attribute (e.g., movie rating, PageRank score, etc.).

12 Returning to Map Reduce

13 IJLMR Index Inverse Join List MapReduce RankJoin The IJLMR index is built with a map-only MapReduce job. Mappers create an inverted list of input tuples keyed by their join. The output of mappers is written directly into the NoSQL store.

14 IJLMR Index for our running example The IJLMR index for each indexed table is stored as a separate column family in one big table. Index entries for the same join values across all indexed tables are stored next to each other on the same node.

15 IJLMR Query Processing Input: Rows from the IJLMR index for column families A and B, of the form frow.joinValue: row.rowKey,row.scoreg. Output: Top-k join result set. The score of tuples in the join result set is computed using a monotonic aggregate function f() on their individual scores.

16 IJLMR Query Processing For each row, it computes the join result and join score of index entries from the different column families. The mappers store in-memory only the top-k ranking result Tuples. And emit their final top-k list when their input data is exhausted. The single reducer then combines the individual top-k lists and emits the global top-k result.

17 IJLMR Query Processing

18 ISL Index Inverse Score List RankJoin The ISL index is built with a map-only MapReduce job. ISL is based on the existence of inverted score lists. The output of mappers is written directly into the NoSQL store.

19 ISL Index running example The ISL index for each indexed table is stored as a separate column family in one big table. Index entries for the same score values across all indexed tables are stored next to each other on the same node.

20 ISL Query Processing Input: Rows from the ISL index for column families A and B, of the form frow.joinValue: row.rowKey,row.scoreg. Output: Top-k join result set. The score of tuples in the join result set is computed using a monotonic aggregate function f() on their individual scores.

21 ISL Query Processing The coordinator stores all retrieved tuples in separate hash tables, with a join value key. Maintains a list of the current top-k results. With every new tuple fetched and processed, the coordinator computes the current threshold value. And terminates when it is below the score of the k'th tuple in the result set.

22 ISL Query Processing

23 STATISTICAL RANK JOINS Both of the previous algorithms, ship tuples even though they may not participate in the top-k result set. Our next contribution aims to avoid this. we need not only estimate which tuples will produce the join result. But also to predict whether these tuples can have a top-k score.

24 Returning to Bloom Filter

25

26 The BFHM Data Structure The BFHM index is a two-level statistical data structure, encompassing histograms and Bloom filters. At the first level, we have an equi-width histogram on score axis; That is, all histogram buckets have the same spread and each such bucket stores information for tuples.

27 The BFHM Data Structure At the second level, instead of a simple counter per bucket (plus the actual min and max scores of tuples recorded in the bucket). Choose to maintain a Bloom filter-like data structure, recording the join values of the tuples belonging to the bucket. This will then be used to estimate the cardinality of the join result set during query processing.

28 BFHM bucket structure BFHM bucket maintain: (i)The minimum and maximum score values of tuples recorded in the bucket. (ii) A single-hash-function Bloom filter of size m (bits). (iii) A hash table of counters for each non-zero bit of the Bloom filter.

29 BFHM Index Creation

30 BFHM Index Creation In the Map phase, the mappers partition incoming tuples into the various histogram buckets. Each reducer operates on the mapped tuples for one BFHM bucket at a time. Each incoming tuple is first added to the BFHM hybrid filter based on its join value, and its corresponding bit position is recorded. The reducer emits a reverse mapping entry for each such tuple, and keeps track of the min and max scores of all tuples in the bucket. When the bucket tuples are exhausted, the reducer finally emits the BFHM bucket blob row.

31 BFHM Index Creation

32 BFHM Query Processing Query processing consists of two phases: (i) Estimating the result. (ii) Reverse mapping and computation of the true result.

33 BFHM Query Processing In the first phase: The coordinator fetches BFHM bucket rows for the joined relations, one at a time. With newly fetched buckets being \joined“ with older ones. The bucket join result { an estimation of the join result for tuples recorded in the joined buckets} is then added to the list of estimated results.

34 BFHM Query Processing In the first phase: When the estimated number of result tuples in this list is above k. The algorithm tests for the BFHM termination condition. If the latter is satisfied, processing continues with the reverse mapping/final result set computation phase.

35 BFHM Query Processing Termination condition : First, we compute the minimum score of the k'th estimated result. The estimation phase terminates if there are more than k estimated results And there is no combination of buckets not examined so far that could have a maximum score above that of the k'th estimated result.

36 BFHM join result estimation

37 BFHM Query Processing In the second phase: Examines the estimated results of the first phase. And purges all estimated results whose maximum score is below that of the (estimated) k'th tuple. Then fetches the reverse mapping rows and compute the final result set.

38 BFHM Query Processing

39 BFHM Query Processing First, we compute the bitwise-AND of the Bloom filter bitmaps from the two buckets. IF the resulting bitmap is empty then there are no joining tuples recorded in these two buckets. Otherwise, we compute an estimation of the cardinality of the join.

40 BFHM Query Processing

41 EXPERIMENTAL EVALUATION

42 Query Processing Time The number of BFHM buckets was set to 100 and ISL was configured with batching sizes matching the number of BFHM buckets; 1% and 0.1%.

43 Query Processing Bandwidth Consumption The number of BFHM buckets was set to 100 and 1000 ISL was configured with batching sizes matching the number of BFHM buckets; 1% and 0.1 The number of bytes transferred through the network

44 Query Processing Dollarcost The number of BFHM buckets was set to 100 and 1000 ISL was configured with batching sizes matching the number of BFHM buckets; 1% and 0.1 The number of tuples read from the cloud store during query processing

45 Indexing time

46 CONCLUSIONS Top-k join queries arise naturally in many real-world settings. We studied algorithms for rank joins in NoSQL stores. The BFHM Rank Join algorithm is very desirable.

47

48