Sequential Dependencies Flip Korn, AT&T Lukasz Golab, AT&T Howard Karloff, AT&T Avishek Saha, University of Utah Divesh Srivastava, AT&T.

Presentation transcript:

Data Quality Cost to business: $600 billion Problems prevalent in measurement data – Equipment failure – Calibration/systemic errors – Configuration errors – Management errors Our goal: detection, not cleaning

Data Quality Principled approach: integrity constraints – Assert semantics – Deviations = quality issues – Allow approx for real-world data – Condition → tableau discovery Domain often gives rise to semantics – missing, extraneous, out-of-order

Sequential Dependency Definition
Sequential dependency $X \to_g Y$
– $(y_i - y_{i-1}) \in g$, with the y's sorted w.r.t. x-values
– extension of functional dependency; with $g = (0, \infty)$: $X \to Y: \forall t_1, t_2,\ t_1[X] < t_2[X] \Rightarrow t_1[Y] < t_2[Y]$
Example #1: date $\to_{[20,\infty)}$ price
– Prices increasing by at least 20 units
Example #2: poll# $\to_{[4,6]}$ time
– Consecutive polls 4 to 6 minutes apart
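The definition above can be checked directly on a small relation. The following is a minimal sketch, assuming a closed gap interval and rows given as (x, y) pairs; the function name `satisfies_sd` and the sample data are hypothetical and not taken from the slides.

```python
import math


def satisfies_sd(rows, g_low, g_high=math.inf):
    """Return True if, after sorting rows by x, every consecutive
    difference of y-values lies in the closed interval [g_low, g_high].

    Assumption: the gap interval is closed; pass math.inf for an
    unbounded upper end such as [20, inf).
    """
    ys = [y for _, y in sorted(rows, key=lambda r: r[0])]
    return all(g_low <= b - a <= g_high for a, b in zip(ys, ys[1:]))


# Hypothetical polling data: (poll number, time in minutes).
polls_ok = [(1, 0), (2, 5), (3, 9), (4, 15)]   # gaps 5, 4, 6
polls_bad = [(1, 0), (3, 12), (2, 5)]          # gaps 5, 7 once sorted by x

# Hypothetical price data: (day index, price).
prices = [(1, 100), (2, 120), (3, 150)]        # gaps 20, 30
```

Here `satisfies_sd(polls_ok, 4, 6)` holds, `satisfies_sd(polls_bad, 4, 6)` does not (one gap is 7), and `satisfies_sd(prices, 20)` holds for the gap interval [20, ∞).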

Approx Sequential Dependency [Figure: sequence with Start and End markers; Confidence = 67%, g = (0, ∞)]

Conditional Sequential Dependencies Confidence ≥ 80% [1,6] [2,11] [7,12]

Conditional Sequential Dependencies Example #1: g = (0, ∞)

Conditional Sequential Dependencies Example #2: g = [9,11] g = [20, ∞)

Contributions Introduce sequential dependencies (SDs) – algorithm for computing confidence Tableau Discovery for CSDs – problem definition – fast approximation algorithm

Approx Sequential Dependency
Confidence: (N − OPS)/N, where OPS = minimum number of insertions + deletions
– Edit distance
Ex: with [4,6]
– del 12, ins 15, ins 20, del 31 ⇒ conf = 4/8
Doesn't overpenalize for rare drops
– E.g., with [5,5]
Penalizes large gaps
– E.g., [3,5] with gap of 6 vs. 1000

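Approx Sequential Dependency
How to compute OPS for $g = [G_1, G_2]$?
dcost(d) = number of insertions (or $\infty$) needed to end in d, counting the element at d itself
– E.g., [4,6]: dcost(6) = 1, dcost(7) = $\infty$, dcost(8) = 2
– $\mathrm{dcost}(d) = \lceil d/G_2 \rceil$ when $\lceil (d+1)/G_1 \rceil > \lceil d/G_2 \rceil$; else $\infty$
Let T(i) := minimum OPS for the prefix $a_1, \ldots, a_i$ with $a_i$ kept
Suppose T(1), T(2), …, T(i-1) already computed
$T(i) = \min_j \{\, T(j) + (i-1-j) + [\mathrm{dcost}(a_i - a_j) - 1] \,\}$
– $O\left(\frac{G_2}{G_2 - G_1} \cdot N \log N\right)$ algorithm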
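The recurrence on this slide can be sketched as a plain quadratic dynamic program; this is not the faster algorithm whose running time the slide quotes. Assumptions in this sketch: values and gap bounds are integers with 1 ≤ G1 ≤ G2, dcost(d) is taken to be the smallest number k of gaps with k·G1 ≤ d ≤ k·G2 (which reproduces the slide's values for [4,6]), a prefix or suffix may be deleted outright, and T(i) means "fewest operations on the prefix ending at a_i, with a_i kept". The names `dcost`, `min_ops`, `confidence` and the sample sequence are hypothetical; the sequence is chosen so that the four operations named on the earlier slide apply, and it is not taken from the slides.

```python
import math


def dcost(d, g1, g2):
    """Smallest number of gaps, each in [g1, g2], that sum to d; inf if none."""
    if d <= 0:
        return math.inf
    k = -(-d // g2)                      # ceil(d / g2) with integers
    return k if k * g1 <= d else math.inf


def min_ops(a, g1, g2):
    """Minimum insertions + deletions so that consecutive gaps lie in [g1, g2]."""
    n = len(a)
    if n == 0:
        return 0
    t = [0] * n                          # t[i]: best cost with a[i] kept
    for i in range(n):
        best = i                         # delete every element before a[i]
        for j in range(i):
            c = dcost(a[i] - a[j], g1, g2)
            if c != math.inf:
                # keep a[j], delete a[j+1..i-1], insert c-1 new elements
                best = min(best, t[j] + (i - 1 - j) + (c - 1))
        t[i] = best
    # the last kept element may be followed by deleted elements
    return min(t[i] + (n - 1 - i) for i in range(n))


def confidence(a, g1, g2):
    """(N - OPS) / N for a non-empty sequence."""
    return (len(a) - min_ops(a, g1, g2)) / len(a)


# Hypothetical sequence: deleting 12 and 31 and inserting 15 and 20
# leaves 5, 10, 15, ..., 40, whose gaps all lie in [4, 6].
seq = [5, 10, 12, 25, 30, 31, 35, 40]
```

With this sequence, `min_ops(seq, 4, 6)` is 4 and `confidence(seq, 4, 6)` is 4/8 = 0.5.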

Tableau Discovery Assume underlying SD given – Data often suggest ordering semantics Good tableau = small set of intervals – Each interval satisfies confidence threshold – Union satisfies support threshold Find maximal time intervals [i,j] s.t. – Confidence satisfied in [i,j] Can we do better than testing all [i,j]’s?

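Tableau Discovery: Candidates
Relax constraint: confidence $\geq \hat{c}/(1+\varepsilon)$
For any interval I, there exists J s.t. (a) $I \subseteq J$ and (b) $|J| \leq (1+\varepsilon)|I|$
$\Rightarrow \mathrm{conf}(J) \geq \mathrm{conf}(I)/(1+\varepsilon)$
[Figure: interval I nested inside interval J]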

Tableau Discovery: Candidates
Test just enough intervals:
(a) lengths $1, (1+\delta), (1+\delta)^2, \ldots$
(b) starting points $\delta, \delta(1+\delta), \delta(1+\delta)^2, \ldots$

Tableau Discovery: Candidates
Processing cost:
– Intervals at level h have length $(1+\delta)^h$
– $N/(\delta(1+\delta)^h)$ intervals at level h
– $\log_{1+\delta} N$ total levels
– $\Rightarrow$ sum of lengths $= O((N/\delta)\log_{1+\delta} N) = O((N/\delta^2) \lg N)$
Improvement:
– Interval lengths in [A, 2A] start at $\delta A, 2\delta A, 3\delta A, \ldots$
– Prefix property
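The level-by-level candidate set can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: positions are integer indices 0..N-1, level h uses length ceil((1+δ)^h), and starting points are read as multiples of floor(δ(1+δ)^h) (at least 1), which matches the per-level count N/(δ(1+δ)^h) given on the slide. The rounding choices and the name `candidate_intervals` are hypothetical, and the "Improvement" variant is not implemented.

```python
import math


def candidate_intervals(n, delta):
    """Return sorted half-open (start, end) index intervals over 0..n-1,
    one family per level h with geometrically growing length and spacing."""
    seen = set()
    h = 0
    while True:
        scale = (1 + delta) ** h
        length = min(n, math.ceil(scale))            # interval length at level h
        step = max(1, math.floor(delta * scale))     # spacing of starting points
        for start in range(0, n, step):
            seen.add((start, min(n, start + length)))
        if length >= n:                              # top level reached
            break
        h += 1
    return sorted(seen)


cands = candidate_intervals(100, 0.5)
```

For N = 100 and δ = 0.5 the candidate set is far smaller than the 5050 intervals obtained by testing every [i, j], which is the point of the construction; smaller δ gives more candidates and a tighter approximation.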

Tableau Discovery: Assembly Optimal solution in quadratic time Greedy partial set cover Can implement in linear time Constant performance ratio
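The greedy assembly step can be illustrated with a naive version of partial set cover: repeatedly take the candidate interval that covers the most still-uncovered positions until the support threshold is met. This sketch is a straightforward loop, not the linear-time implementation the slide mentions, and it assumes every candidate already meets the confidence threshold. The function name `greedy_tableau` and the sample intervals are hypothetical.

```python
def greedy_tableau(candidates, n, support):
    """Pick half-open (start, end) intervals greedily until at least
    support * n of the positions 0..n-1 are covered.

    Returns (chosen intervals in pick order, fraction of positions covered).
    """
    covered = set()
    chosen = []
    remaining = list(candidates)
    while len(covered) < support * n and remaining:
        # interval adding the most uncovered positions (first one wins ties)
        best = max(remaining, key=lambda iv: len(set(range(*iv)) - covered))
        if not set(range(*best)) - covered:
            break                      # nothing left to gain; threshold unreachable
        chosen.append(best)
        covered |= set(range(*best))
        remaining.remove(best)
    return chosen, len(covered) / n


# Hypothetical candidates over 10 positions, support threshold 0.85.
tableau, coverage = greedy_tableau([(0, 6), (4, 9), (9, 10), (0, 2)], 10, 0.85)
```

In this example the first pick is (0, 6), which covers six positions, and the second is (4, 9), which adds three more, so the tableau has two intervals and covers 90% of the positions.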

Summary of Results Tableau almost identical at small δ Significant speedup at small δ “Inflating” ĉ to (1+δ)ĉ works well

Experiments: Sample Tableau Data: WeatherDates, conf ≥ 0.995, support ≥ 0.5, δ = 0.05

Experiments: Tableau Size Gaps in [0, ∞) Gaps in [0,5] DowJones data: support ≥ 0.5

Experiments: Scalability Gaps in [0, ∞) Gaps in [4,6] Network data: support ≥ 0.5, conf ≥ 0.99 WeatherDates: support ≥ 0.5, conf ≥ 0.9

Case Study: Polled Data conf ≥ 0.995, support ≥ 0.5, δ=0.05

Case Study: Stock Data conf ≥ 0.995, support ≥ 0.5, δ=0.05 Dow Jones 2-week moving average

Conclusions Constraint-driven approach – Define, discover, detect Use whatever semantics available – Domain knowledge, expectation, etc. Model errors carefully – Confidence measure Tableaux useful for summary

The End

Background
Functional Dependency
– $X \to Y: \forall t_1, t_2,\ t_1[X] = t_2[X] \Rightarrow t_1[Y] = t_2[Y]$
Example
– title $\to$ salary
– What happens when data is merged?

Background
ssn|name|title|salary
123|alice|manager|50
456|bob|sales|40
789|cathy|manager|50
title $\to$ salary
ssn|name|company|title|salary
123|alice|ATT|manager|50
456|bob|ATT|sales|40
789|cathy|ATT|manager|50
012|david|IBM|engineer|30
345|emily|IBM|engineer|35
[title,company] $\to$ salary?
– 100% support, 80% confidence
Hold Tableau
Company|Title|Salary
ATT|*|*
– 60% support, 100% confidence
Fail Tableau
Company|Title|Salary
IBM|*|*
– 40% support, 50% confidence
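The support and confidence figures on this slide can be reproduced with a short computation. The sketch assumes that confidence is the largest fraction of matching rows that can be kept so that the dependency holds exactly, and that support is the fraction of all rows matching the tableau pattern; these definitions are consistent with the slide's numbers but are stated here as assumptions. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# The merged relation from the slide.
ROWS = [
    {"ssn": "123", "name": "alice", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "456", "name": "bob", "company": "ATT", "title": "sales", "salary": 40},
    {"ssn": "789", "name": "cathy", "company": "ATT", "title": "manager", "salary": 50},
    {"ssn": "012", "name": "david", "company": "IBM", "title": "engineer", "salary": 30},
    {"ssn": "345", "name": "emily", "company": "IBM", "title": "engineer", "salary": 35},
]


def fd_confidence(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds.

    Within each group of equal lhs values, keep the most frequent rhs value.
    Assumes rows is non-empty.
    """
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(max(c.values()) for c in groups.values())
    return kept / len(rows)


def pattern_stats(rows, lhs, rhs, pattern):
    """(support, confidence) of lhs -> rhs on the rows matching pattern,
    where pattern maps attribute names to required constants."""
    matched = [r for r in rows if all(r[k] == v for k, v in pattern.items())]
    return len(matched) / len(rows), fd_confidence(matched, lhs, rhs)


overall = fd_confidence(ROWS, ["title", "company"], "salary")
att = pattern_stats(ROWS, ["title", "company"], "salary", {"company": "ATT"})
ibm = pattern_stats(ROWS, ["title", "company"], "salary", {"company": "IBM"})
```

On the slide's data this gives an overall confidence of 0.8, (0.6, 1.0) for the ATT pattern, and (0.4, 0.5) for the IBM pattern, matching the hold and fail tableaux.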

CFD Results Given FD, discover tableau: – min tableau size – subj. to support and confidence constraints Hardness: – global conf: inapproximable – local conf: NP-hard, fast approx algo