The Generalized MDL Approach for Summarization Laks V.S. Lakshmanan (UBC) Raymond T. Ng (UBC) Christine X. Wang (UBC) Xiaodong Zhou (UBC) Theodore J. Johnson.

Slides:



Advertisements
Similar presentations
MDL Summarization with Holes Shaofeng Bu Laks V.S. Lakshmanan Raymond T. Ng University of British Columbia, Canada.
Advertisements

Fast Algorithms For Hierarchical Range Histogram Constructions
DISCOVER: Keyword Search in Relational Databases Vagelis Hristidis University of California, San Diego Yannis Papakonstantinou University of California,
Dynamic Programming.
Merged Processes of Petri nets Victor Khomenko Joint work with Alex Kondratyev, Maciej Koutny and Walter Vogler.
Data Organization - B-trees. 11.2Database System Concepts A simple index Brighton A Downtown A Downtown A Mianus A Perry.
Multidimensional Data
Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan
Multiversion Access Methods - Temporal Indexing. Basics A data structure is called : Ephemeral: updates create a new version and the old version cannot.
Approximating Sensor Network Queries Using In-Network Summaries Alexandra Meliou Carlos Guestrin Joseph Hellerstein.
Spatial and Temporal Data Mining V. Megalooikonomou Introduction to Decision Trees ( based on notes by Jiawei Han and Micheline Kamber and on notes by.
Computational Methods for Management and Economics Carla Gomes
Temporal Indexing MVBT. Temporal Indexing Transaction time databases : update the last version, query all versions Queries: “Find all employees that worked.
Catriel Beeri Pls/Winter 2004/5 type reconstruction 1 Type Reconstruction & Parametric Polymorphism  Introduction  Unification and type reconstruction.
Temporal Indexing MVBT. Temporal Indexing Transaction time databases : update the last version, query all versions Queries: “Find all employees that worked.
Chapter 3: Data Storage and Access Methods
SubSea: An Efficient Heuristic Algorithm for Subgraph Isomorphism Vladimir Lipets Ben-Gurion University of the Negev Joint work with Prof. Ehud Gudes.
UNC Chapel Hill Lin/Manocha/Foskey Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject.
MAE 552 – Heuristic Optimization Lecture 5 February 1, 2002.
ECE 667 Synthesis and Verification of Digital Systems
Evaluation of Top-k OLAP Queries Using Aggregate R-trees Nikos Mamoulis (HKU) Spiridon Bakiras (HKUST) Panos Kalnis (NUS)
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
Stevenson and Ozgur First Edition Introduction to Management Science with Spreadsheets McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies,
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,
Mining Optimal Decision Trees from Itemset Lattices Dr, Siegfried Nijssen Dr. Elisa Fromont KDD 2007.
1 The TSP : NP-Completeness Approximation and Hardness of Approximation All exact science is dominated by the idea of approximation. -- Bertrand Russell.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Mutlidimensional Indices Instructor: Randal Burns Lecture for 29 November 2005 Computer Science Johns Hopkins University.
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
Prof. Bayer, DWH, Ch.4, SS Chapter 4: Dimensions, Hierarchies, Operations, Modeling.
Data Anonymization (1). Outline  Problem  concepts  algorithms on domain generalization hierarchy  Algorithms on numerical data.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
1 The MV3R-Tree: A Spatio- Temporal Access Method for Timestamp and Interval Queries Yufei Tao and Dimitris Papadias Hong Kong University of Science and.
CIS 350 – I Game Programming Instructor: Rolf Lakaemper.
1 Universidad de Buenos Aires Maestría en Data Mining y Knowledge Discovery Aprendizaje Automático 5-Inducción de árboles de decisión (2/2) Eduardo Poggi.
Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.
Optimizing Multidimensional Index Trees for Main Memory Access Author: Kihong Kim, Sang K. Cha, Keunjoo Kwon Members: Iris Zhang, Grace Yung, Kara Kwon,
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject to some constraints. (There may.
Efficient OLAP Operations in Spatial Data Warehouses Dimitris Papadias, Panos Kalnis, Jun Zhang and Yufei Tao Department of Computer Science Hong Kong.
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.
NPC.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Content based on Chapter 10 Database Management Systems, (3 rd.
Approximation Algorithms based on linear programming.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
1 Chapter 5 Branch-and-bound Framework and Its Applications.
1 Parallel Datacube Construction: Algorithms, Theoretical Analysis, and Experimental Evaluation Ruoming Jin Ge Yang Gagan Agrawal The Ohio State University.
Multidimensional Access Structures COMP3017 Advanced Databases Dr Nicholas Gibbins –
0 Integer Programming Introduction to Integer Programming (IP) Difficulties of LP relaxation IP Formulations Branch and Bound Algorithms Reference: Chapter.
1 R-Trees Guttman. 2 Introduction Range queries in multiple dimensions: Computer Aided Design (CAD) Geo-data applications Support special data objects.
Dense-Region Based Compact Data Cube
Spatial Data Management
CSE 554 Lecture 5: Contouring (faster)
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
RE-Tree: An Efficient Index Structure for Regular Expressions
Mean Shift Segmentation
Approximate Lineage for Probabilistic Databases
NP-Completeness Yin Tat Lee
Data Mining Concept Description
NP-Completeness Yin Tat Lee
Collision Detection.
Donghui Zhang, Tian Xia Northeastern University
Presentation transcript:

The Generalized MDL Approach for Summarization Laks V.S. Lakshmanan (UBC) Raymond T. Ng (UBC) Christine X. Wang (UBC) Xiaodong Zhou (UBC) Theodore J. Johnson (AT&T Research) (Work supported by NSERC and NCE/IRIS.)

Overview Introduction Motivation & Problem Statement Spatial Case – MDL & GMDL Experiments  X Categorical Case More Experiments  X Related work Summary and Related/Future Work

Introduction How best to convey large answer sets for queries? –Simple enumeration: accurate but not necessarily most useful –Summaries: not (necessarily) 100% accurate but can be more intuitive Why is this problem interesting? –OLAP queries over multi-dimensional data typically produce data intensive answers

Introduction (contd.) Example: (i) customer segmentation based on buying pattern age salary K frequency  t too many answers, in general solution: summarize description via range constraints  axis-parallel hyper-rectangles  most concise = MDL

Introduction (contd.) Example: (ii) aggregate sales performance analysis new york albany summit boston chicago minneapolis san francisco san jose edmonton vancouver NE MW NW location jkts tops wmn’s jns skirts blouses frml wear men’s jns ties dress pnts shorts women’s men’s clothes  2 * last year’s sales description via hierarchical ranges = tuples of nodes most concise = MDL

Motivation Examples: (i) customer segmentation based on buying pattern age salary K frequency  t X X frequency < t/2 “white” otherwise white budget = 2 white budget  10 XX

Motivation (contd.) Example: (ii) aggregate sales performance analysis new york albany summit boston chicago minneapolis san francisco san jose edmonton vancouver NE MW NW location jkts tops wmn’s jns skirts blouses frml wear men’s jns ties dress pnts shorts women’s men’s clothes  2 * last year’s sales description via hierarchical ranges = tuples of nodes most concise = MDL

Motivation (contd.) Example: (ii) aggregate sales performance analysis new york albany summit boston chicago minneapolis san francisco san jose edmonton vancouver NE MW NW location jkts tops wmn’s jns skirts blouses frml wear men’s jns ties dress pnts shorts women’s men’s clothes  2 * last year’s sales XX X X < ½ * last year’s sales white budget = 2 white budget  7

GMDL Problem Statement (spatial case) k totally ordered dimensions D i  S (set of all cells) B (blue) and R (red) – colored cells W = S – ( B  R ) (white cells) Find axis-parallel hyper-rectangles {R 1, …, R m } (i.e., GMDL covering) s.t.: –(R 1  …  R m )  R =  (validity) –|(R 1  …  R m )  W |  w (white budget) –m is the least possible (optimality)

(G)MDL Problem Statement (hierarchical case) k (tree) hierarchical dimensions cell = tuple of leaves region = tuple of nodes region R covers cell c iff c is a descendant of R, component-wise covering rules similar to spatial case MDL/GMDL problem formulations analogous

Algorithms for spatial GMDL challenges for spatial: even MDL 2D is NP-hard, so we must turn to heuristics important properties: –blue-maximality –non-redundancy Algorithms for spatial GMDL: –bottom-up pairwise (BP) merging –R-tree splitting (RTS) [based on Garcia+98] –color-aware splitting (CAS) –CAS corner

Algorithms for spatial GMDL (CAS) build indices I R, I B for red and blue cells start with C = region R covering all blue cells; curr-consum = # white cells in R while (  R  C containing a red cell) { –grow the red cell to a larger blue-free region (using I B ) –split R into at most 2k regions (excluding the grown red region) –replace R by new regions } while (curr-consum > w) { –split as above, but based on white cells } return C

CAS – An Example X X X trade-off non-overlapping regions  loss in quality overlapping regions  greater bookkeeping overhead Algorithms RTS, the two CAS’  non-redundant valid/feasible solutions BP  may produce redundant solution; can be made non-redundant

Categorical Case – MDL  key diff. between spatial and categorical? optimal covering  non-redundant optimal need not be blue-maximal, but can be expanded into one is blue-maximal non-redundant MDL covering unique? what about their size?

A spatial example two blue-maximal non-redundant coverings of diff. size

Categorical – fundamentals projection of regions on dimensions: e.g., (MW, women’s) – projection on location = {chicago, minneapolis}. Claim: R, S any categorical regions (tree hierarchies); R i – projection of R on dimension i;  i, R i  S i or S i  R i or R i  S i =  see violation in “tough” spatial example major factor in deciding complexity

Categorical – fundamentals (contd.) Theorem: space of k categorical dimensions with tree hierarchies  unique blue- maximal non-redundant MDL covering. Corollary: (i) the said covering can be obtained on a per hierarchy basis. (ii) furthermore, it can be done in polynomial time.

Categorical case – MDL algorithm illustrated a b c d e f g h i X X X a c d a b c d e f g h i a d a c d b c a before redundancy check after redundancy check c i a c d b c a a d initialize propagate

Categorical case – MDL Lemma: Optimal MDL covering for a categorical space with tree hierarchies can be obtained by visiting each node once and each node of last hierarchy twice. Key idea: for tree hierarchies, finding all blue-maximal regions and removing redundant ones yields the optimal covering.

Categorical case – GMDL Basic idea: for each internal node, determine the cost and gain of involving it in a GMDL covering; sort candidates in decreasing gain order and increasing cost. Pick greedily. Example: candidate occurrence max-gain cost (1,h)(2,h)(3,h)(4,h)(5,h) X3

Categorical Case – GMDL (contd.) Compile similar info. for other parents of leaves; sort and pick best w cells for color change. [drop candidates with cost X or 0.] Run MDL on the new data.

Related Work Substantial work on using MDL for summarization principle in data compression [Ristad & Thomas 95], decision trees [Quinaln & Rivest 89, Mehta+ 95], learning of patterns [Kilpelinen 95], etc. [Agrawal+ 98] – subspace clustering. Summarizing cube query answers and (G)MDL on categorical spaces – novel.

Summary & Future Work summarization using MDL/GMDL as a principle MDL on spatial – NP-complete even on 2D; utility of GMDL – trade compactness for quality (i.e., include “impurity” in answers) Heuristic algorithms Efficient algo. for MDL for categorical with tree hierarchies Heuristics for GMDL Experimental validation

Future Work What is the best we can do to summarize data with both spatial and categorical dimensions? How far can we push the poly time complexity? (e.g., almost-tree hierarchies? Can we impose restrictions on “allowable” intervals even on spatial dimensions?)