Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research.

Presentation transcript:

Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research and University of Wisconsin – Madison

2 Motivation Consider this query: “For each year and each country, compute the ratio of the average personal income between the richest city and the poorest city. Then find the number of countries where this ratio continuously decreases between “ It is hard to write in SQL, and the resulting SQL query is hard to optimize or understand. Queries of this kind are increasingly common: they involve multi-step aggregation and must scale to very large datasets, often distributed.

3 Contributions A new framework for expressing such compositional aggregate queries: the key contribution is how we look at the computation, in terms of aggregating over related regions in “cube space”. An efficient evaluation framework based on sorted scans that takes multiple aggregation steps into account. Experimental results.

4 Background Computing “measures”  Measures summarize some characteristic of data subsets (e.g., SUM, std dev, beta-value of a portfolio)  Approaches: Group by, data cubes, Hancock, Sawzall Cube space  Partition feature space using attribute values; domain hierarchies organize this space into nested collections of regions  Regions: (2006, Korea), (2006/09, Seoul)  Region sets: (Year, Country), (Month, City)

5 Composite Subset Measures The measure of a cube region is computed by:  Aggregating data in a region directly (e.g., sales volumes for each day), or  Summarizing the measures for related regions, e.g.: The maximum of daily volumes within a year The ratio of average personal incomes between the richest and poorest cities in a country
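
The two ways of computing a measure can be illustrated with a small sketch. The record layout `(year, day, amount)` and the function names are assumptions made for illustration; they do not come from the slides. The first function aggregates raw data in each region directly, and the second summarizes the measures of related (child) regions.

```python
from collections import defaultdict

# Hypothetical fact records: (year, day_of_year, sales_amount).
sales = [
    (2005, 1, 10.0), (2005, 1, 5.0), (2005, 2, 7.0),
    (2006, 1, 3.0), (2006, 2, 4.0), (2006, 2, 9.0),
]

def daily_volumes(records):
    """Basic measure: aggregate the raw records inside each (year, day) region."""
    volumes = defaultdict(float)
    for year, day, amount in records:
        volumes[(year, day)] += amount
    return dict(volumes)

def yearly_max_daily_volume(daily):
    """Composite measure: summarize the measures of the child (day) regions per year."""
    result = {}
    for (year, _day), volume in daily.items():
        result[year] = max(result.get(year, volume), volume)
    return result

daily = daily_volumes(sales)
yearly = yearly_max_daily_volume(daily)
```

Here `yearly` is computed only from `daily`, never from the raw records, which is what makes it a composite measure in the sense of the slide.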

6 What is “Related” in Cube Space Focus on relationships that are commonly used and can be efficiently evaluated: Self; Parent/Child (e.g., Year/Day); Child/Parent (e.g., Day/Year); Sibling (e.g., Today/Tomorrow)
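
As a rough illustration of these relationships on the time dimension, the sketch below maps a region to its related regions. Representing day regions as `datetime.date` values and year regions as integers is an assumption for this example, as are the function names.

```python
from datetime import date, timedelta

def self_of(day: date) -> date:
    """Self: a region is related to itself."""
    return day

def parent_of(day: date) -> int:
    """Child/Parent: the year region that contains a given day region."""
    return day.year

def children_of(year: int) -> list[date]:
    """Parent/Child: all day regions nested inside a year region."""
    first = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - first).days
    return [first + timedelta(days=i) for i in range(n_days)]

def next_sibling(day: date) -> date:
    """Sibling: the neighboring region at the same granularity (tomorrow)."""
    return day + timedelta(days=1)
```

Each function returns regions from a single region set, so a measure defined over the result stays well typed: days map to days, or to years, but never to a mixture.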

7 Examples (Network Analysis) Data involved:  Stream of data records for IP packet information  Time (t), Source (U), Destination (D), Size (s) Queries:  For every minute, the number of outgoing packets from each given source IP  For every hour, the maximum per-minute number of outgoing packets from a given source IP
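
The two queries can be sketched as follows. The sample packets, the use of seconds as the time unit, and the function names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical packet records: (time_in_seconds, source, destination, size_bytes).
packets = [
    (5, "10.0.0.1", "10.0.0.9", 100),
    (30, "10.0.0.1", "10.0.0.8", 200),
    (70, "10.0.0.1", "10.0.0.9", 50),
    (65, "10.0.0.2", "10.0.0.9", 80),
    (3700, "10.0.0.1", "10.0.0.9", 60),
]

def per_minute_counts(records):
    """Query 1: number of outgoing packets per (minute, source) region."""
    counts = Counter()
    for t, src, _dst, _size in records:
        counts[(t // 60, src)] += 1
    return counts

def hourly_max_of_minute_counts(minute_counts):
    """Query 2: per (hour, source), the maximum of the per-minute counts."""
    result = {}
    for (minute, src), n in minute_counts.items():
        key = (minute // 60, src)
        result[key] = max(result.get(key, 0), n)
    return result

minute_counts = per_minute_counts(packets)
hourly_max = hourly_max_of_minute_counts(minute_counts)
```

The second query consumes the output of the first, which is the parent/child pattern: each hour region summarizes its sixty minute regions.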

8 Expression Algebra Each measure entity is defined as a collection of region/value pairs  Regions should belong to same region set Fact Table Aggregation Selection Match join Combine join

9 Example: Aggregation For every hour and every unique IP, compute the number of outgoing packets

10 Example: Selection For every hour, compute the sum of outgoing packets from those source IPs with at least five packets in that hour (High traffic count)

11 Example: Match For each six hour time window, compute the average of the high traffic count

12 Example: Combine For each hour, compute the ratio between the six hour average and the high traffic count
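
The four examples above (aggregation, selection, match, combine) can be chained in a short sketch. This is not the authors' algebra implementation: the function names are hypothetical, packets are reduced to `(hour, source)` pairs, and the six-hour window is assumed to be the trailing window ending at each hour, which the slides do not specify.

```python
from collections import Counter

def aggregate(packets):
    """Aggregation: count packets per (hour, source) region."""
    counts = Counter()
    for hour, src in packets:
        counts[(hour, src)] += 1
    return counts

def high_traffic(counts, threshold=5):
    """Selection, then aggregation: per hour, sum counts of sources at or above threshold."""
    per_hour = Counter()
    for (hour, _src), n in counts.items():
        if n >= threshold:
            per_hour[hour] += n
    return dict(per_hour)

def six_hour_average(per_hour, window=6):
    """Match join: pair each hour with the sibling hours in its window, then average."""
    result = {}
    for hour in per_hour:
        matched = [per_hour[h] for h in range(hour - window + 1, hour + 1) if h in per_hour]
        result[hour] = sum(matched) / len(matched)
    return result

def combine_ratio(avg, per_hour):
    """Combine join: join two measures on the same region and apply a formula."""
    return {hour: avg[hour] / per_hour[hour] for hour in per_hour if hour in avg}

# Hypothetical data: source A sends 5 packets in hour 0 and 10 in hour 1; B sends 2 in hour 0.
packets = [(0, "A")] * 5 + [(0, "B")] * 2 + [(1, "A")] * 10
counts = aggregate(packets)
traffic = high_traffic(counts)
average = six_hour_average(traffic)
ratio = combine_ratio(average, traffic)
```

Each step yields a collection of region/value pairs over a single region set, so the output of one operator can feed the next.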

13 Aggregation Workflows A diagrammatic way to express multiple composite subset measure expressions  Semantically equivalent to the algebra Rectangles: Region sets Ellipses: Measures associated with the Region sets Arcs: Computational dependencies among measures

14 Example Region set Measure name Aggregation formula Selection condition Match condition

15 Example (cont.)

16 Multi-step Execution Plan Evaluation based on the topological order of the aggregation workflow Materialize non-dependent measures Then evaluate dependent measures  following the arcs of the aggregation workflow  May need to perform joins Problem  Intermediate measures: extra I/O
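
A minimal sketch of this plan, assuming the workflow is given as a mapping from each measure to the measures it depends on. The measure names and formulas are hypothetical, and the standard-library `graphlib.TopologicalSorter` stands in for whatever ordering routine the authors used.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: measure name -> set of measures it depends on (the arcs).
workflow = {
    "count": set(),
    "high": {"count"},
    "ratio": {"count", "high"},
}

# Hypothetical formulas; each receives the already materialized inputs it depends on.
formulas = {
    "count": lambda deps: {0: 5, 1: 2},  # stands in for a scan of the fact table
    "high": lambda deps: {h: n for h, n in deps["count"].items() if n >= 5},
    "ratio": lambda deps: {h: deps["high"][h] / deps["count"][h] for h in deps["high"]},
}

def evaluate(workflow, formulas):
    """Materialize every measure, visiting them in topological order."""
    materialized = {}
    for name in TopologicalSorter(workflow).static_order():
        inputs = {dep: materialized[dep] for dep in workflow[name]}
        materialized[name] = formulas[name](inputs)
    return materialized

order = list(TopologicalSorter(workflow).static_order())
results = evaluate(workflow, formulas)
```

Every intermediate measure is fully materialized before its dependents run, which is the source of the extra I/O that the slide names as the problem.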

17 Simple Scan Execution [*] Build one hash table for each measure “Insert” data into hash tables of low-level measures Propagate the measures upwards after the scan is over Distributive or algebraic aggregation function Problem  Each hash table keeps all the entries  Bottleneck: Memory capacity [*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999.
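
The idea can be sketched with two measures, a per-day count and a per-month count. The record layout `(day_number, ip)` and the rule "month = day // 30" are simplifying assumptions for illustration, not part of the cited approach.

```python
from collections import Counter

def simple_scan(records):
    """One pass over the data, one hash table per measure."""
    # Hash table of the low-level measure: filled during the scan.
    day_counts = Counter()
    for day, ip in records:
        day_counts[(day, ip)] += 1

    # After the scan, propagate upwards. COUNT is distributive, so the
    # month-level value is the sum of the day-level values.
    month_counts = Counter()
    for (day, ip), n in day_counts.items():
        month_counts[(day // 30, ip)] += n
    return day_counts, month_counts

records = [(0, "a"), (0, "a"), (1, "a"), (31, "a"), (31, "b")]
day_counts, month_counts = simple_scan(records)
```

Note that `day_counts` holds every `(day, ip)` entry until the scan ends, which is the memory bottleneck named on the slide.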

18 Sort/Scan Execution Simple scan requires large memory  For each hash table, we need to keep all the entries during the scan When the data is ordered  Some hash entries can be flushed out before the scan is finished  The memory footprint can be reduced  One pass scan becomes feasible  CPU cost is reduced
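
A simplified sketch of the flushing idea, assuming records of the form `(day, ip)` that are already sorted by day. The function name and the peak-size bookkeeping are illustrative additions and do not reproduce the authors' order/slack mechanism.

```python
def sort_scan_counts(sorted_records):
    """Count per (day, ip); flush a day's entries as soon as the scan moves past that day."""
    output, table, peak = [], {}, 0
    current_day = None
    for day, ip in sorted_records:
        if current_day is not None and day != current_day:
            # Input is sorted by day, so no later record can touch these entries.
            output.extend(sorted(table.items()))
            table.clear()
        current_day = day
        table[(day, ip)] = table.get((day, ip), 0) + 1
        peak = max(peak, len(table))
    output.extend(sorted(table.items()))  # flush the last day
    return output, peak

records = sorted([(1, "a"), (1, "b"), (1, "a"), (2, "a"), (3, "b"), (3, "b")])
output, peak = sort_scan_counts(records)
```

In this toy input the result has four entries but the hash table never holds more than two at once, and the output itself comes out ordered by day.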

19 Evaluation Region sets: (t:Day, U:IP), (t:Month, U:IP). Measures: COUNT0 count(*), COUNT2 count(*), COUNT3 count(*). Sort by day (month 1, month 2): the output stream of each hash table is still ordered!

20 Evaluation Region sets: (t:Day, U:IP), (t:Month, U:IP). Measures: COUNT0 count(*), COUNT2 count(*), COUNT3 count(*). Sort by month (month 1, month 2): all output streams are ordered by month!

21 Evaluation Region set: (t:Month, U:IP). Measure: COUNT3 count(*). Data are sorted by (t:month, U:IP) (month 1, month 2). By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint.

22 Order and Slack Order  How the records are ordered in the stream  E.g., Slack  The gap between the output stream of the measure and the scan progress of raw data  E.g., We have developed a mechanism to  Calculate the order/slack  Take advantage of the order/slack information during evaluation

23 Evaluation Network Scan sorted data

24 Optimization How to find a good sort order?  Enumerate all possible orders  For each order estimate the memory usage  Use sort orders with minimal usage Evaluation with multiple passes  What measure to compute during each pass?  What order to use in each pass?
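
The enumerate-and-estimate loop can be sketched as below. The slides do not say how memory usage is estimated; this sketch assumes a brute-force simulation on sample data that counts, for each measure, the largest number of groups that are simultaneously "open" (seen but not yet finished) under a given sort order. All names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def peak_entries(records, order, group_attrs):
    """Peak number of live hash entries for one measure under one sort order."""
    ordered = sorted(records, key=lambda r: tuple(r[a] for a in order))
    first, last = {}, {}
    for pos, rec in enumerate(ordered):
        key = tuple(rec[a] for a in group_attrs)
        first.setdefault(key, pos)  # position where the entry is created
        last[key] = pos             # position after which it can be flushed
    opens, closes = Counter(first.values()), Counter(last.values())
    live = peak = 0
    for pos in range(len(ordered)):
        live += opens[pos]
        peak = max(peak, live)
        live -= closes[pos]
    return peak

def best_sort_order(records, attrs, measures):
    """Enumerate all sort orders and keep the one with the smallest total estimate."""
    costs = {
        order: sum(peak_entries(records, order, m) for m in measures)
        for order in permutations(attrs)
    }
    return min(costs, key=costs.get), costs

# Hypothetical sample: three time values crossed with two source IPs.
records = [{"t": t, "ip": ip} for t in range(3) for ip in ("a", "b")]
measures = [("t", "ip"), ("t",)]  # grouping attributes of two measures
best, costs = best_sort_order(records, ("t", "ip"), measures)
```

Enumerating every permutation grows factorially with the number of attributes, so a sketch like this is only practical for a handful of dimensions.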

25 Experiments 64 million records Synthetic data set Scenario 1  The measures of a region are computed by combining the aggregated measures for different kinds of child region sets Scenario 2  The measures of a region are computed by aggregating the measures of multiple chained siblings

26 Experimental Results (cont.)

27 Experimental Results

28 Conclusions Composite measures serve as building blocks for complicated analysis processes. The algebra provides the semantic foundation. The aggregation workflow offers an intuitive interface. The sort/scan execution plan evaluates multiple dependent measures in the same run and hence improves evaluation performance.