Mining for Empty Rectangles in Large Data Sets

Presentation transcript:

Mining for Empty Rectangles in Large Data Sets Jeff Edmonds Jarek Gryz Dongming Liang Renee Miller

Matrix representation A,B(R S) 1 1 2 3 6 7 8 A B 3 1 6 7 8

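Find All Maximal 0-Rectangles: in the matrix for π_{A,B}(R ⋈ S), find every rectangle that contains only 0s and cannot be extended in any direction. [Figure: example matrix with axes A and B.]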
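To make the definition concrete, here is a brute-force reference sketch, not the algorithm from the talk. It tries every rectangle and keeps those that are all 0 and cannot grow by one row or column in any direction. The names and the 0-based `(left, top, right, bottom)` convention are assumptions of this sketch, and it is far too slow for large data.

```python
from itertools import product


def is_empty(matrix, left, top, right, bottom):
    """True when every cell of the inclusive rectangle is 0."""
    return all(
        matrix[y][x] == 0
        for y in range(top, bottom + 1)
        for x in range(left, right + 1)
    )


def brute_force_maximal(matrix):
    """Return all maximal 0-rectangles of a non-empty 0-1 matrix, sorted."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    found = []
    for top, left in product(range(n_rows), range(n_cols)):
        for bottom, right in product(range(top, n_rows), range(left, n_cols)):
            if not is_empty(matrix, left, top, right, bottom):
                continue
            # The four rectangles obtained by growing one side by one cell.
            grown = [
                (left - 1, top, right, bottom),
                (left, top - 1, right, bottom),
                (left, top, right + 1, bottom),
                (left, top, right, bottom + 1),
            ]
            extendable = any(
                l >= 0 and t >= 0 and r < n_cols and b < n_rows
                and is_empty(matrix, l, t, r, b)
                for l, t, r, b in grown
            )
            if not extendable:
                found.append((left, top, right, bottom))
    return sorted(found)


example = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
]
print(brute_force_maximal(example))
```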

Example: π_{A,B}(R ⋈ S) over Car (…, BMW Z3, Honda L2, Toyota 6A, …) and Year (95, 96, 97). First BMW Z3 series cars were made in 1997.

Relation to Previous Work
Compared: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]; Our Work
Problem: find all maximal empty rectangles. Previous work: between points in the real plane. Our work: within a 0-1 matrix.
Purpose: Machine Learning and Computational Geometry (previous work); Query Optimization (our work)
# of maximal 0-rectangles: O( (# 1's)^2 ) (previous work); O( #0's ) (our work)

Relation to Previous Work
Compared: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]; Our Work
Time: O( # 1's log(#1's) + # rectangles ) = O(|X||Y|) (previous work); O( #0's ) = O(|X||Y|) (our work)
Space: O(|X||Y|) (previous work); O(min(|X|, |Y|)) (our work), with only two rows of the matrix kept in memory

Relation to Previous Work
Compared: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]; Our Work
Practical implementation: intensive random memory access (previous work); requires a single scan of the sorted data (our work)
Scalable: scales badly (previous work); scales well wrt # of tuples in join, # of maximal rectangles, and # of values |X| & |Y| (our work)
Practical? IBM paid us $25,000 to patent it!

Structure of Algorithm
loop y = 1..|Y|
    loop x = 1..|X|
        Construct staircase(x,y)
        Output all maximal 0-rectangles with <x,y> as bottom-right corner
Timing: O(1) amortized time per <x,y>
[Figure: matrix with axes X and Y and the point <x,y> marked.]

Structure of Algorithm
loop y = 1..|Y|
    loop x = 1..|X|
        Construct staircase(x,y)
        Output all maximal 0-rectangles with <x,y> as bottom-right corner
Later in the talk: Query Optimization & Experimental Results
[Figure: matrix with axes X and Y and the point <x,y> marked.]

Staircase(x,y): a stack of steps (x_1,y_1), …, (x_r,y_r). [Figure: staircase at <x,y> with one step highlighted, axes X and Y.]

Constructing Maximal Rectangles [Figure: staircase with <x,y> marked.]

Constructing Maximal Rectangles: rectangles with <x,y> as bottom-right corner are labelled Too Narrow, Maximal, or Too Short. [Figure: staircase with <x,y> marked.]

Constructing staircase(x,y) from staircase(x-1,y): Case 1 [Figure: staircases at <x-1,y> and <x,y>.]

Constructing staircase(x,y) from staircase(x-1,y): Case 2 [Figure: staircases at <x-1,y> and <x,y>.]

Constructing staircase(x,y) from staircase(x-1,y): steps of staircase(x-1,y) are marked Delete or Keep, and the rectangles at steps (x_1,y_1), …, (x_r,y_r) are labelled Too Narrow, Maximal, or Too Short. [Figure: staircases at <x-1,y> and <x,y>, axes X and Y.]

Constructing x*(x,y) & y*(x,y) [Figure: staircase at <x-1,y> with steps (x_1,y_1), …, (x_r,y_r) and the values x*(x-1,y) and y*(x-1,y) marked.]

Constructing x*(x,y) & y*(x,y) from x*(x-1,y) & y*(x,y-1): x*(x,y) is derived from x*(x-1,y), and y*(x,y) from the saved value y*(x,y-1). [Figure: staircases at <x-1,y> and <x,y> with x*(x-1,y), x*(x,y), y*(x,y-1), and y*(x,y) marked.]
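One plausible reading of these recurrences, sketched under stated assumptions: x*(x,y) is the column where the run of 0s ending at <x,y> starts in row y, y*(x,y) is the row where the run of 0s ending at <x,y> starts in column x, and both are infinity when <x,y> holds a 1. The function name `run_starts` and the 0-based indexing are choices of this sketch. It stores full tables for readability, whereas the slide only needs the previous value in the row and one saved row.

```python
INF = float("inf")


def run_starts(matrix):
    """Return (x_star, y_star) tables for a non-empty 0-1 matrix.

    x_star[y][x]: first column of the run of 0s ending at (x, y) in row y.
    y_star[y][x]: first row of the run of 0s ending at (x, y) in column x.
    Cells that hold a 1 get INF in both tables (assumed convention).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    x_star = [[INF] * n_cols for _ in range(n_rows)]
    y_star = [[INF] * n_cols for _ in range(n_rows)]
    for y in range(n_rows):
        for x in range(n_cols):
            if matrix[y][x] == 1:
                continue
            # Extend the run from the left neighbour, or start a new one here.
            left_is_zero = x > 0 and matrix[y][x - 1] == 0
            x_star[y][x] = x_star[y][x - 1] if left_is_zero else x
            # Extend the run from the saved row above, or start a new one here.
            above_is_zero = y > 0 and matrix[y - 1][x] == 0
            y_star[y][x] = y_star[y - 1][x] if above_is_zero else y
    return x_star, y_star


x_star, y_star = run_starts([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
print(x_star)
print(y_star)
```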

Structure of Algorithm
loop y = 1..|Y|
    loop x = 1..|X|
        Construct staircase(x,y)
        Output all maximal 0-rectangles with <x,y> as bottom-right corner
Timing: O(1) amortized time per <x,y>
[Figure: matrix with axes X and Y and the point <x,y> marked.]

Timing: the only work that is not constant time is deleting steps (Too Narrow or Maximal). [Figure: staircase at <x,y> with steps (x_1,y_1), …, (x_r,y_r) labelled Too Narrow, Maximal, or Too Short, axes X and Y.]

Timing: amortized # of steps deleted (per <x,y>) = # of steps created (per <x,y>) ≤ 1
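The amortized argument can be illustrated with a stack-based sketch. This is a standard histogram-stack formulation written in the spirit of the slides, not the authors' implementation. The names `heights`, `below`, and the `(left, height)` step encoding are assumptions of the sketch. Each cell pushes at most one step, and a step is examined only when it is deleted, so total work is linear in the matrix size. A deleted step cannot be extended right, left, or up, so only the row below needs checking.

```python
INF = float("inf")


def maximal_zero_rectangles(matrix):
    """Return maximal 0-rectangles as (left, top, right, bottom), 0-based."""
    if not matrix or not matrix[0]:
        return []
    n_rows, n_cols = len(matrix), len(matrix[0])
    heights = [0] * n_cols  # number of consecutive 0s ending at row y
    found = []
    for y in range(n_rows):
        for x in range(n_cols):
            heights[x] = 0 if matrix[y][x] else heights[x] + 1
        # below[x]: start column of the run of 0s ending at x in row y + 1,
        # or INF when that cell is a 1 or there is no row below.
        below = [INF] * n_cols
        if y + 1 < n_rows:
            for x in range(n_cols):
                if matrix[y + 1][x] == 0:
                    extends = x > 0 and below[x - 1] != INF
                    below[x] = below[x - 1] if extends else x
        stack = []  # steps (left, height) with strictly increasing height
        for x in range(n_cols + 1):
            h = heights[x] if x < n_cols else 0  # sentinel column of height 0
            left = x
            while stack and stack[-1][1] > h:
                # Deleted step: its rectangle ends at column x - 1.
                left, height = stack.pop()
                if below[x - 1] > left:  # the row below blocks extension
                    found.append((left, y - height + 1, x - 1, y))
            if h > 0 and (not stack or stack[-1][1] < h):
                stack.append((left, h))  # at most one step created per cell
    return found


example = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
]
print(sorted(maximal_zero_rectangles(example)))
```

The sketch takes the whole matrix as a list for convenience but reads only rows y and y + 1 in each iteration, which is why a single scan of sorted data with two rows in memory would suffice.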

Number of Maximal Rectangles
# of maximal 0-rectangles: O( (# 1's)^2 ) [Namaad, Hsu, Lee]
# of maximal 0-rectangles ≤ running time of alg = O( #0's )

How many empty rectangles are there? Tests were done on 4 pairs of attributes with numerical domains that appear in typical joins in a real-world workload of a health insurance company.

How big are the rectangles?

Query rewrite: simple case
Original: select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten: select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...
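A hedged sketch of how the simple rewrite could be derived. It assumes a known empty rectangle in the join result that covers the query's whole R.A range and one end of its S.B range. The function `trim_range`, the open/closed interval conventions, and the rectangle bounds are illustrative assumptions. Only the predicate ranges come from the slide.

```python
def trim_range(query_a, query_b, empty_a, empty_b):
    """Shrink the query's B range using one empty rectangle.

    query_a, query_b: open intervals (low, high) from the WHERE clause.
    empty_a, empty_b: closed intervals of a rectangle known to hold no
    tuples of the join result (assumed to be given).
    Returns the new open B interval, or None if the query region is empty.
    """
    a_low, a_high = query_a
    b_low, b_high = query_b
    ea_low, ea_high = empty_a
    eb_low, eb_high = empty_b
    # The rectangle must span the query's whole A range to be usable here.
    if not (ea_low <= a_low and a_high <= ea_high):
        return query_b
    if eb_low <= b_low and b_high <= eb_high:
        return None  # the whole query region lies inside the empty rectangle
    if eb_high >= b_high and b_low < eb_low < b_high:
        return (b_low, eb_low)  # cut off the empty upper end
    if eb_low <= b_low and b_low < eb_high < b_high:
        return (eb_high, b_high)  # cut off the empty lower end
    return query_b  # rectangle lies in the middle: not a simple rewrite


# Slide predicates: 60 < R.A < 80 and 20 < S.B < 80.
# Hypothetical empty rectangle: 60 <= A <= 80 and 60 <= B <= 80.
print(trim_range((60, 80), (20, 80), (60, 80), (60, 80)))
```

With these inputs the S.B predicate shrinks to 20<S.B<60, matching the rewritten query. A rectangle that falls strictly inside the B range would need a disjunction, which presumably corresponds to the complex case.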

Query rewrite: complex case
Original: select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten: select … from R, S,... where R.C=S.C and (… and …) or ...

How much do the rectangles overlap with queries?

Query optimization experiments
real-world workload of 26 queries
5 of the queries “qualified” for the rewrite
only simple rewrites were considered
all rewrites led to improved performance