Mining for Empty Rectangles in Large Data Sets
Jeff Edmonds, Jarek Gryz, Dongming Liang, Renee Miller
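Matrix representation
  The join result projected on attributes A and B, π_{A,B}(R ⋈ S), is represented as a 0-1 matrix.
  [Figure: matrix over the values of A and B, with a 1 for each (A, B) pair present in the join result]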
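A minimal sketch of the matrix representation, assuming the join result has already been projected onto (A, B) pairs. The function name `to_matrix`, the choice of A for columns and B for rows, and the sample pairs in the usage note are illustrative assumptions, not taken from the slides.

```python
def to_matrix(pairs):
    """Build a 0-1 matrix from (a, b) pairs.

    Returns (a_values, b_values, matrix), where matrix[j][i] == 1 exactly
    when the pair (a_values[i], b_values[j]) occurs in `pairs`.
    """
    a_values = sorted({a for a, _ in pairs})
    b_values = sorted({b for _, b in pairs})
    col = {a: i for i, a in enumerate(a_values)}   # A value -> column index
    row = {b: j for j, b in enumerate(b_values)}   # B value -> row index
    matrix = [[0] * len(a_values) for _ in b_values]
    for a, b in pairs:
        matrix[row[b]][col[a]] = 1
    return a_values, b_values, matrix
```

For example, `to_matrix([(1, 3), (6, 1), (7, 8)])` yields a 3 x 3 matrix with one 1 per row. Only distinct values get a row or column, so the matrix has |X| x |Y| cells regardless of the number of tuples.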
Find All Maximal 0-Rectangles
  [Figure: the 0-1 matrix of π_{A,B}(R ⋈ S), in which all maximal 0-rectangles are to be found]
Example
  π_{A,B}(R ⋈ S) over Car (BMW Z3, Honda L2, Toyota 6A, …) and Year (95, 96, 97, …)
  [Figure: 0-1 matrix of Car against Year, with a 1 for each combination that occurs]
  First BMW Z3 series cars were made in 1997.
Relation to Previous Work
  Previous work: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]
  Problem: find all maximal empty rectangles, between points in the real plane (previous work) or within a 0-1 matrix (our work)
  Purpose: Machine Learning, Computational Geometry, Query Optimization
  # of maximal 0-rectangles: O((# 1's)^2), O(# 0's)
Relation to Previous Work
  Previous work: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]
  Time: O(# 1's log(# 1's) + # rectangles) = O(|X||Y|) for previous work; O(# 0's) = O(|X||Y|) for our work
  Space: O(|X||Y|) for previous work; O(min(|X|, |Y|)) for our work (only two rows of matrix kept in memory)
Relation to Previous Work
  Previous work: [Namaad, Hsu, Lee]; [Lui, Ku, Hsu] & [Orlowski]
  Practical implementation: intensive random memory access (previous work); requires a single scan of the sorted data (our work)
  Scalable: scales badly (previous work); scales well (our work) wrt # of tuples in join, # of maximal rectangles, # of values |X| & |Y|
  Practical? IBM paid us $25,000 to patent it!
Structure of Algorithm
  loop y = 1..|Y|
    loop x = 1..|X|
      Construct staircase(x,y)
      Output all maximal 0-rectangles with <x,y> as bottom-right corner
  Timing: O(1) amortized time per <x,y>
  [Figure: 0-1 matrix over X and Y with the current cell <x,y> marked *; callouts First, Second, Third, Fourth]
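The loop structure on this slide can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it keeps the staircase as a histogram-style monotone stack of (leftmost column, height) pairs, holds the whole matrix in memory to look one row ahead instead of keeping only two rows, and reports each rectangle when its step is popped. The name `maximal_zero_rectangles` and the 0-based (left, top, right, bottom) output convention are assumptions.

```python
def maximal_zero_rectangles(matrix):
    """Return all maximal 0-rectangles as (left, top, right, bottom),
    0-based and inclusive, for a 0-1 matrix indexed as matrix[y][x]."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    heights = [0] * n_cols            # zeros ending at row y, per column
    result = []
    for y in range(n_rows):
        row = matrix[y]
        below = matrix[y + 1] if y + 1 < n_rows else None
        for x in range(n_cols):
            heights[x] = heights[x] + 1 if row[x] == 0 else 0
        stack = []                    # steps (start_x, height), heights increasing
        run_below = 0                 # zero-run in row y+1 ending at column x-1
        for x in range(n_cols + 1):   # x == n_cols acts as a sentinel column
            h = heights[x] if x < n_cols else 0
            start = x
            while stack and stack[-1][1] > h:
                s, k = stack.pop()
                # Columns s..x-1, rows y-k+1..y: cannot grow left, up or right.
                width = x - s
                if run_below < width:           # cannot grow down either
                    result.append((s, y - k + 1, x - 1, y))
                start = s
            if h > 0 and (not stack or stack[-1][1] < h):
                stack.append((start, h))
            if x < n_cols:
                is_zero_below = below is not None and below[x] == 0
                run_below = run_below + 1 if is_zero_below else 0
    return result
```

On the matrix `[[0, 1], [0, 0]]` this reports the left column and the bottom row. Each cell pushes at most one step, and every pop removes a previously pushed step, which is where the amortized bound per cell comes from.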
Structure of Algorithm
  loop y = 1..|Y|
    loop x = 1..|X|
      Construct staircase(x,y)
      Output all maximal 0-rectangles with <x,y> as bottom-right corner
  Fifth: Query Optimization & Experimental Results
  [Figure: 0-1 matrix over X and Y with the current cell <x,y> marked *]
Staircase(x,y)
  The staircase is a stack of steps (x_r, y_r).
  [Figure: staircase(x,y) in the 0-1 matrix over X and Y, with the current cell <x,y> marked * and steps numbered 1 to 5]
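To make the notion of a staircase concrete, here is a direct (non-incremental) sketch that recomputes the steps for one cell by scanning left from it. It assumes a step is the top-left corner of a 0-rectangle with bottom-right corner <x,y> that cannot be extended up or to the left; the helper names `column_top` and `staircase`, and the 0-based `matrix[y][x]` indexing with row 0 at the top, are assumptions for illustration.

```python
def column_top(matrix, x, y):
    """Topmost row of the run of 0s that ends at (x, y) in column x."""
    t = y
    while t > 0 and matrix[t - 1][x] == 0:
        t -= 1
    return t


def staircase(matrix, x, y):
    """Steps (left, top) for bottom-right corner (x, y), tallest first.

    Returns [] when the cell itself holds a 1.
    """
    if matrix[y][x] != 0:
        return []
    steps = []
    c = x
    limit = column_top(matrix, x, y)   # top row usable by columns c..x
    while True:
        at_left_edge = c == 0 or matrix[y][c - 1] != 0
        if at_left_edge:
            steps.append((c, limit))
            break
        next_limit = max(limit, column_top(matrix, c - 1, y))
        if next_limit > limit:         # going further left forces a shorter rectangle
            steps.append((c, limit))
        limit = next_limit
        c -= 1
    return steps
```

This scan costs time proportional to the area it inspects, so it only illustrates the definition. The slides that follow build staircase(x,y) incrementally from staircase(x-1,y) instead.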
Constructing Maximal Rectangles
  [Figure: candidate rectangles with <x,y>, marked *, as bottom-right corner]
Constructing Maximal Rectangles
  [Figure: candidate rectangles with bottom-right corner <x,y>, marked *, labelled Too Narrow, Maximal, and Too Short]
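The three labels on this slide can be read as a test on each staircase candidate. The sketch below assumes that "too narrow" means the rectangle can still be extended one column to the right, and "too short" means it can still be extended one row down; that reading, the function name `classify`, and the choice to report "too narrow" first when both apply are assumptions. The candidate is taken to be already maximal to the left and upward, as staircase steps are.

```python
def classify(matrix, left, top, right, bottom):
    """Classify a 0-rectangle (inclusive, 0-based, matrix[y][x] indexing)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    extends_right = right + 1 < n_cols and all(
        matrix[r][right + 1] == 0 for r in range(top, bottom + 1)
    )
    extends_down = bottom + 1 < n_rows and all(
        matrix[bottom + 1][c] == 0 for c in range(left, right + 1)
    )
    if extends_right:
        return "too narrow"
    if extends_down:
        return "too short"
    return "maximal"
```

For the matrix `[[0, 0, 0], [0, 1, 0], [0, 0, 0]]`, the full top row `(0, 0, 2, 0)` is maximal, while its first two cells `(0, 0, 1, 0)` are too narrow.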
Constructing staircase(x,y) from staircase(x-1,y)
  Case 1
  [Figure: the staircases at <x-1,y> and <x,y>, each marked *]
Constructing staircase(x,y) from staircase(x-1,y)
  Case 2
  [Figure: the staircases at <x-1,y> and <x,y>, each marked *]
Constructing staircase(x,y) from staircase(x-1,y)
  Steps of staircase(x-1,y) are marked Delete or Keep.
  [Figure: steps (x_r, y_r) of the staircase at <x-1,y> and <x,y>, each marked *, over axes X and Y, with rectangles labelled Too Narrow, Maximal, and Too Short]
Constructing x*(x,y) & y*(x,y)
  [Figure: x*(x-1,y) and y*(x-1,y) at <x-1,y>, marked *, with steps (x_r, y_r)]
Constructing x*(x,y) & y*(x,y) from x*(x-1,y) & y*(x,y-1)
  x*(x,y) and y*(x,y) at <x,y> are obtained from x*(x-1,y) and from y*(x,y-1), which is saved.
  [Figure: x*(x,y) and y*(x,y) at <x,y>, x*(x-1,y) at <x-1,y>, and the saved y*(x,y-1), with steps (x_r, y_r)]
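A small sketch of this recurrence, under an assumed reading of the notation: x*(x,y) is taken to be the leftmost column of the run of 0s ending at <x,y> in row y, and y*(x,y) the topmost row of the run of 0s ending at <x,y> in column x. The slides do not spell out these definitions, so treat them, the generator name `star_rows`, and the use of `None` for cells holding a 1 as assumptions. Only the previous row's y* values are saved between rows.

```python
def star_rows(rows):
    """Yield (y, x_star, y_star) for each row of a 0-1 matrix, streaming.

    rows: iterable of equal-length lists of 0/1 values, top row first.
    """
    prev_y_star = None                    # y* values saved from row y-1
    for y, row in enumerate(rows):
        x_star = [None] * len(row)
        y_star = [None] * len(row)
        for x, cell in enumerate(row):
            if cell != 0:
                continue                  # no run of 0s ends at a 1
            # Extend the horizontal run from (x-1, y), or start a new one.
            if x > 0 and x_star[x - 1] is not None:
                x_star[x] = x_star[x - 1]
            else:
                x_star[x] = x
            # Extend the vertical run from (x, y-1), or start a new one.
            if prev_y_star is not None and prev_y_star[x] is not None:
                y_star[x] = prev_y_star[x]
            else:
                y_star[x] = y
        yield y, x_star, y_star
        prev_y_star = y_star
```

Because each value depends only on the cell to the left and the cell above, the rows can be consumed in a single pass over sorted input.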
Structure of Algorithm
  loop y = 1..|Y|
    loop x = 1..|X|
      Construct staircase(x,y)
      Output all maximal 0-rectangles with <x,y> as bottom-right corner
  Third: Timing, O(1) amortized time per <x,y>
  [Figure: 0-1 matrix over X and Y with the current cell <x,y> marked *]
Timing
  Only work that is not constant time: Delete
  [Figure: steps (x_r, y_r) of the staircase at <x,y>, marked *, over axes X and Y, with rectangles labelled Too Narrow, Maximal, and Too Short]
Timing
  Amortized # of steps deleted (per <x,y>) = # of steps created (per <x,y>) ≤ 1
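The amortized argument can be checked empirically by counting stack operations. The sketch below is an illustration, not the authors' code: it models the staircase as a monotone stack of column heights (an assumption) and counts how many steps are created and deleted over a whole matrix. The name `count_stack_operations` is hypothetical.

```python
def count_stack_operations(matrix):
    """Return (pushes, pops) of a monotone height stack over all cells."""
    n_cols = len(matrix[0]) if matrix else 0
    heights = [0] * n_cols        # zeros ending at the current row, per column
    pushes = pops = 0
    for row in matrix:
        stack = []                # heights of the current steps, increasing
        for x in range(n_cols):
            heights[x] = heights[x] + 1 if row[x] == 0 else 0
            h = heights[x]
            while stack and stack[-1] > h:
                stack.pop()       # a step is deleted
                pops += 1
            if h > 0 and (not stack or stack[-1] < h):
                stack.append(h)   # at most one step is created per cell
                pushes += 1
    return pushes, pops
```

Since a step can only be deleted after it was created, the total number of deletions never exceeds the total number of creations, which in turn never exceeds the number of cells.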
Number of Maximal Rectangles
  # of maximal 0-rectangles ≤ O((# 1's)^2) [Namaad, Hsu, Lee]
  # of maximal 0-rectangles ≤ running time of alg = O(# 0's)
How many empty rectangles are there?
  Tests were done on 4 pairs of attributes with numerical domains, present in typical joins in a real-world workload of a health insurance company.
How big are the rectangles?
Query rewrite: simple case
  Original:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
  Rewritten:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...
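One way to read the simple case: an empty rectangle that spans the query's whole range on A and covers one end of its range on B lets the B range be shortened. The sketch below illustrates that reading and is not the optimizer's actual logic. The function name `trim_b_range`, the representation of query ranges as open intervals and of the empty rectangle as closed bounds `(a_lo, a_hi, b_lo, b_hi)`, and the specific rectangle in the usage note are all assumptions.

```python
def trim_b_range(query_a, query_b, empty_rect):
    """Shrink the open B range of a query using one empty rectangle.

    query_a, query_b: open intervals (lo, hi), as in lo < R.A < hi.
    empty_rect: (a_lo, a_hi, b_lo, b_hi), closed bounds of a region that
                contains no (A, B) pair of the join result.
    Returns the new open B range, the unchanged range if the rectangle
    does not help, or None if the query range is entirely empty.
    """
    (qa_lo, qa_hi), (qb_lo, qb_hi) = query_a, query_b
    a_lo, a_hi, b_lo, b_hi = empty_rect
    if not (a_lo <= qa_lo and qa_hi <= a_hi):
        return query_b                     # rectangle does not span the A range
    covers_top = b_hi >= qb_hi
    covers_bottom = b_lo <= qb_lo
    if covers_top and covers_bottom:
        return None                        # nothing can qualify
    if covers_top and b_lo < qb_hi:
        return (qb_lo, b_lo)               # cut off the upper end
    if covers_bottom and b_hi > qb_lo:
        return (b_hi, qb_hi)               # cut off the lower end
    return query_b
```

With a hypothetical empty rectangle covering A in [60, 80] and B in [60, 80], `trim_b_range((60, 80), (20, 80), (60, 80, 60, 80))` gives `(20, 60)`, matching the narrowed predicate on S.B in the rewritten query.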
Query rewrite: complex case
  Original:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
  Rewritten:
    select … from R, S,... where R.C=S.C and (… and …) or ...
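The slide elides the rewritten predicate, so the following is only one plausible shape for a disjunctive rewrite: remove an empty rectangle from the query box and keep the remaining boxes, each of which would become one "(… and …)" term joined by "or". The function name `subtract_rectangle`, the half-open `[lo, hi)` convention (chosen to keep the arithmetic simple, unlike the strict inequalities on the slide), and the example rectangle are all assumptions.

```python
def subtract_rectangle(query, empty):
    """Remove an empty rectangle from a query box.

    Both arguments are (a_lo, a_hi, b_lo, b_hi) with half-open ranges
    [lo, hi). Returns a list of disjoint boxes covering the query box
    minus the empty rectangle.
    """
    qa_lo, qa_hi, qb_lo, qb_hi = query
    ea_lo, ea_hi, eb_lo, eb_hi = empty
    # Overlap between the query box and the empty rectangle.
    oa_lo, oa_hi = max(qa_lo, ea_lo), min(qa_hi, ea_hi)
    ob_lo, ob_hi = max(qb_lo, eb_lo), min(qb_hi, eb_hi)
    if oa_lo >= oa_hi or ob_lo >= ob_hi:
        return [query]                        # no overlap, nothing to remove
    pieces = [
        (qa_lo, oa_lo, qb_lo, qb_hi),         # strip left of the overlap
        (oa_hi, qa_hi, qb_lo, qb_hi),         # strip right of the overlap
        (oa_lo, oa_hi, qb_lo, ob_lo),         # below the overlap
        (oa_lo, oa_hi, ob_hi, qb_hi),         # above the overlap
    ]
    return [p for p in pieces if p[0] < p[1] and p[2] < p[3]]
```

For a query box `(60, 80, 20, 80)` and a hypothetical empty rectangle `(70, 80, 60, 80)`, two boxes remain, which would correspond to two range conjunctions joined by "or". Whether such a rewrite pays off depends on the optimizer, which is why the experiments restrict themselves to simple rewrites.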
How much do the rectangles overlap with queries?
Query optimization experiments
  real-world workload of 26 queries
  5 of the queries “qualified” for the rewrite
  only simple rewrites were considered
  all rewrites led to improved performance