The Impact of Duality on Data Representation Problems. Panagiotis Karras, HKU, June 14th, 2007.


Introduction
Many data representation problems require the optimization of one parameter under a bound on one or more others. Classical approaches treat these problems directly, producing complicated solutions and sometimes resorting to heuristics. However, the parameters involved have a monotonic relationship, so an alternative approach is possible, based on dual problems.

Outline
Histograms.
Restricted Haar Wavelet Synopses.
Unrestricted Haar and Haar+ Synopses.
l-Diversification in 1D.
Compact Hierarchical Histograms.

Histograms
Approximate a data set [d_1, d_2, …, d_n] with B buckets, s_i = [b_i, e_i, v_i], so that a maximum-error metric is minimized.
Classical solutions: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005.
Recent solutions: Buragohain et al. ICDE 2007; Guha and Shim TKDE 19(7) 2007 (linear for …).

Histograms
Solve the error-bounded problem.
Maximum absolute error bound ε = … [ 4 ][ 16 ][ 4.5 ][ …
Generalized to any weighted maximum-error metric.
Each value d_i defines a tolerance interval.
A bucket is closed when the running intersection of the tolerance intervals becomes empty.
Complexity:
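The error-bounded pass described above can be sketched as follows. This is a minimal illustration for the unweighted maximum absolute error only, not the presenter's implementation. The function name `error_bounded_histogram`, the sample data and the bound `eps = 1` are all assumptions, chosen so that the resulting bucket values are 4, 16 and 4.5.

```python
from typing import List, Tuple


def error_bounded_histogram(data: List[float], eps: float) -> List[Tuple[int, int, float]]:
    """Greedy one-pass bucketing under a maximum absolute error bound eps.

    Returns buckets as (begin, end, value) with inclusive 0-based indices.
    Hypothetical helper written for illustration only.
    """
    buckets: List[Tuple[int, int, float]] = []
    if not data:
        return buckets
    begin = 0
    # Tolerance interval of the first value: any bucket value in [lo, hi] is within eps of it.
    lo, hi = data[0] - eps, data[0] + eps
    for i in range(1, len(data)):
        # Intersect the running interval with the tolerance interval of data[i].
        new_lo = max(lo, data[i] - eps)
        new_hi = min(hi, data[i] + eps)
        if new_lo > new_hi:
            # Intersection is empty: close the current bucket before position i.
            buckets.append((begin, i - 1, (lo + hi) / 2))
            begin = i
            lo, hi = data[i] - eps, data[i] + eps
        else:
            lo, hi = new_lo, new_hi
    buckets.append((begin, len(data) - 1, (lo + hi) / 2))
    return buckets


# Assumed sample data; with eps = 1 the three buckets take values 4, 16 and 4.5.
print(error_bounded_histogram([3, 5, 4, 15, 17, 4, 5], 1))
```

Each bucket value is taken as the midpoint of the surviving interval, which is one valid choice among all values in that interval.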

Histograms
Apply to the space-bounded problem.
Perform binary search in the domain of the error bound ε.
Complexity:
For error values requiring space, with actual error, run an optimality test: Error-bounded algorithm running under constraint instead of
If requires space, then optimal solution has been reached.
Independent of the number of buckets B.
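The binary search over ε might look like the sketch below. It simplifies the slide's method: it bisects numerically down to a tolerance `tol` instead of running the exact optimality test, and it handles only the unweighted maximum absolute error. The names `buckets_needed` and `space_bounded_error` are hypothetical, and the data is assumed to be non-empty.

```python
from typing import List


def buckets_needed(data: List[float], eps: float) -> int:
    """Number of buckets the greedy error-bounded pass uses for bound eps."""
    count = 1
    lo, hi = data[0] - eps, data[0] + eps
    for d in data[1:]:
        lo, hi = max(lo, d - eps), min(hi, d + eps)
        if lo > hi:
            # Running intersection became empty: start a new bucket at d.
            count += 1
            lo, hi = d - eps, d + eps
    return count


def space_bounded_error(data: List[float], B: int, tol: float = 1e-6) -> float:
    """Smallest error bound (up to tol) achievable with at most B buckets.

    Relies on monotonicity: a larger eps never needs more buckets.
    """
    lo, hi = 0.0, (max(data) - min(data)) / 2  # hi always fits in one bucket
    if buckets_needed(data, lo) <= B:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if buckets_needed(data, mid) <= B:
            hi = mid   # feasible: try a tighter bound
        else:
            lo = mid   # infeasible: relax the bound
    return hi


data = [3, 5, 4, 15, 17, 4, 5]  # assumed sample data
eps = space_bounded_error(data, B=3)
print(round(eps, 4), buckets_needed(data, eps))
```

The search is valid only because the number of buckets is a non-increasing function of ε, which is the monotonic relationship the dual approach exploits.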

Restricted Haar Wavelet Synopses
Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.
Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB
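To make the setting concrete, the sketch below computes an unnormalized Haar decomposition by pairwise averaging and differencing, keeps a chosen subset of coefficients (setting the rest to zero), and measures the resulting maximum absolute error. It does not select the optimal subset. The function names, the sample data and the retained index set are assumptions, and the input length is assumed to be a power of two.

```python
from typing import List, Set


def haar_decompose(data: List[float]) -> List[float]:
    """Unnormalized Haar coefficients: overall average first, then details, coarse to fine."""
    coeffs: List[float] = []
    current = [float(x) for x in data]
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs  # coarser levels end up in front
        current = averages
    return current + coeffs


def haar_reconstruct(coeffs: List[float]) -> List[float]:
    """Invert haar_decompose level by level."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        nxt: List[float] = []
        for a, d in zip(current, details):
            nxt.extend([a + d, a - d])
        current = nxt
    return current


def max_abs_error(data: List[float], coeffs: List[float], kept: Set[int]) -> float:
    """Maximum absolute error when only the coefficients indexed by `kept` are retained."""
    synopsis = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    approx = haar_reconstruct(synopsis)
    return max(abs(x - y) for x, y in zip(data, approx))


data = [2, 2, 0, 2, 3, 5, 4, 4]  # assumed sample data
coeffs = haar_decompose(data)
print(coeffs)
print(max_abs_error(data, coeffs, kept={0, 1}))
```

A restricted synopsis may only keep or drop these computed values, whereas the unrestricted variants discussed later may assign arbitrary values to the retained coefficients.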

Restricted Haar Wavelet Synopses
Solve the error-bounded problem. Muthukrishnan FSTTCS 2005.
Local search within each of subtrees in bottom Haar tree levels.
Complexity:
Apply to the space-bounded problem.
Complexity: no significant advantage.

Unrestricted Haar and Haar+ Synopses
Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.
Classical solutions: Guha and Harb KDD 2005, SODA 2006; Karras and Mamoulis ICDE 2007.
Time and space:
[Figure: coefficient tree with nodes c_0 to c_9, labels C_1 to C_3, over data values d_0 to d_3.]

Unrestricted Haar and Haar+ Synopses
Solve the error-bounded problem. Complexity:
Apply to the space-bounded problem. Complexity (time and space, for unrestricted Haar and for Haar+):
Significant time and space advantage.

l-Diversification in 1D
Given a database table T(A_1, A_2, …, A_n), a quasi-identifier attribute set Q_T is a subset of attributes which can reveal the personal identity of records.
An equivalence class with respect to the quasi-identifier attribute set Q_T is a set of records indistinguishable in the projection of T on Q_T.
A database table T with quasi-identifier set Q_T and sensitive attribute S conforms to the l-diversity property iff each equivalence class in T with respect to Q_T has at least l well-represented values of S [Machanavajjhala et al. ICDE 2006].
Utility metric: extent of the equivalence class (group).
Other parameter: outliers, i.e. records whose quasi-identifier values are suppressed.
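The definition above can be checked mechanically. The sketch below reads "well-represented" in its simplest sense, as at least l distinct sensitive values per equivalence class. That reading is an assumption, and the cited paper defines stricter variants. The function name `is_l_diverse`, the attribute names and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def is_l_diverse(records: List[Dict[str, Hashable]],
                 qi_attrs: Sequence[str],
                 sensitive_attr: str,
                 l: int) -> bool:
    """True iff every equivalence class on qi_attrs has at least l distinct sensitive values."""
    classes = defaultdict(set)
    for rec in records:
        # Records with identical quasi-identifier projections share an equivalence class.
        key = tuple(rec[a] for a in qi_attrs)
        classes[key].add(rec[sensitive_attr])
    return all(len(values) >= l for values in classes.values())


# Assumed, already generalized table with quasi-identifiers Age and Postcode.
table = [
    {"Age": "20-40", "Postcode": "10***", "Disease": "Flu"},
    {"Age": "20-40", "Postcode": "10***", "Disease": "Lead Poisoning"},
    {"Age": "20-40", "Postcode": "10***", "Disease": "Parkinson's"},
    {"Age": "40-60", "Postcode": "20***", "Disease": "Flu"},
    {"Age": "40-60", "Postcode": "20***", "Disease": "Hyperthyroidism"},
]
print(is_l_diverse(table, ["Age", "Postcode"], "Disease", 2))
print(is_l_diverse(table, ["Age", "Postcode"], "Disease", 3))
```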

l-Diversification in 1D
A two-dimensional example.
[Figure: records plotted by Age and Postcode, with sensitive values Lead Poisoning, Parkinson’s, Flu and Hyperthyroidism.]

l-Diversification in 1D
Study the problem in one dimension (a single quasi-identifier).
A total order exists.
Similar to histogram construction.
Polynomially tractable.
[Figure: axes are quasi-identifier and sensitive value.]

l-Diversification in 1D
Groups are consecutive in each sensitive value domain.
Group order is the same in each domain.
[Figure: example for l = 3, with records r_1 to r_6 placed along the quasi-identifier axis in sensitive value domains D_1 to D_4.]


l-Diversification in 1D
[Figure: axes are quasi-identifier and sensitive value, with extents e and E marked.]
Given an interval I of extent E, which includes c items with m different sensitive values, the number of possible boundaries/groups in I is:

l-Diversification in 1D
Solve the outlier minimization problem. Complexity (time, space):
Apply to the accuracy maximization problem. Complexity:
Apply to the privacy maximization problem. Complexity (time):

Compact Hierarchical Histograms
Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
Heuristic solutions: Reiss et al. VLDB 2006.
Time and space:
[Figure: hierarchy of coefficient nodes c_0 to c_6 over data values d_0 to d_3.]
The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node, and also on whether node C is a bucket node. [Reiss et al. VLDB 2006]

Compact Hierarchical Histograms
Solve the error-bounded problem: next-to-bottom level case.
[Figure: node c_i with children c_2i and c_2i+1.]

Compact Hierarchical Histograms
Solve the error-bounded problem: general, recursive case. Complexity (time, space):
Apply to the space-bounded problem. Complexity:
Polynomially tractable.

Conclusions
Offline data representation problems under constraints are more easily solvable through their counterparts that optimize another parameter.
Dual-problem-based algorithms are simpler, more scalable, more elegant, and more memory-parsimonious than the direct ones.
In the CHH case, the dual-problem-based algorithm achieves an optimal solution to the maximum-error longest-prefix-match CHH partitioning problem, which was considered intractable.
Future: assessment of privacy and CHH solutions.

Related Work
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998.
S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004.
M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004.
S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005.
S. Guha and B. Harb. Wavelet synopses for data streams: Minimizing non-Euclidean error. KDD 2005.
S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005.
S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006.
F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006.
A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006.
P. Karras and N. Mamoulis. The Haar+ tree: A refined synopsis data structure. ICDE 2007.

Thank you! Questions?