Barna Saha, AT&T Research Laboratory Joint work with: Lukasz Golab, Howard Karloff, Flip Korn, Divesh Srivastava.



Data Quality. Data Cleaning.

IRS vs. Federal Mathematician. "You owe us $10,000 plus accrued interest in taxes for last year. You earned $36,000 last year but only had $1 withheld from your paycheck for Federal taxes." "How could I work the entire year and only have $1 withheld? I do not have time to waste on this foolishness. Goodbye!" The Federal Government Agency had only allocated enough storage on the computer to handle withholding amounts of $ or less. Amount withheld was $ The last $1 made the crucial difference.

The Risk of Massive ID Fraud. In May 2004, Ryan Pirozzi of Edina, Minnesota opened his mailbox and found more than a dozen bank statements inside. None of the accounts were his! Because of a data entry error made by a clerk at the processing center of Wachovia Corp, a large bank headquartered in the Southeastern USA, Pirozzi received the financial statements of 73 strangers over the course of 9 months. Their names, SSNs, and bank account numbers constitute an identity thief's dream!

The Risk of Massive ID Fraud. Pirozzi began receiving completed 1099 tax forms belonging to many of these people. Finally, one day in January 2005, a strange thing happened. Mr. Pirozzi went to his mailbox and discovered an envelope from Wachovia that contained his own completed 1099 tax form. That was the first piece of correspondence he received from the bank that actually belonged to him.

Source of these stories. 800 houses in Montgomery County, Maryland, were put on the auction block in 2005 due to mistakes in the tax payment data of Washington Mutual Mortgage. FOR SALE!

Data Quality Tools. Real-world data is often dirty: inconsistent, inaccurate, incomplete, stale. Enterprises typically find data error rates of approximately 1%-5%; for some companies, the rate is above 30%. Dirty data costs US businesses 600 billion dollars annually. Data cleaning accounts for 30%-80% of development time and budget in most data warehouse projects. Data quality tools detect and repair errors, and differentiate between dirty and clean data.

A Systematic Approach to Improve Data Quality. Impose integrity constraints: semantic rules for data. Errors and inconsistencies in data emerge as violations of the constraints.

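Integrity Constraints: Functional Dependency [Codd, 1972]
Functional Dependency: (Name, Type, State) → (Price, Vat)

| Name | Type      | State      | Price | Vat |
|------|-----------|------------|-------|-----|
| CH1  | Clothing  | Maryland   | $50   | $3  |
| BK45 | Book      | New Jersey | $120  | $15 |
| FN30 | Furniture | Washington | $100  | $0  |
| CH1  | Clothing  | Maryland   | $100  | $6  |
| BK66 | Book      | Washington | $80   | $10 |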
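As a rough illustration of how a functional dependency exposes dirty rows, the sketch below groups the slide's five rows by (Name, Type, State) and reports any group that maps to more than one (Price, Vat) pair. The function name `fd_violations` and the tuple layout are assumptions made for this example, not part of the talk.

```python
from collections import defaultdict

# Rows copied from the slide's table: (Name, Type, State, Price, Vat).
ROWS = [
    ("CH1", "Clothing", "Maryland", 50, 3),
    ("BK45", "Book", "New Jersey", 120, 15),
    ("FN30", "Furniture", "Washington", 100, 0),
    ("CH1", "Clothing", "Maryland", 100, 6),
    ("BK66", "Book", "Washington", 80, 10),
]


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: set_of_rhs_values} for LHS values with more than one RHS value.

    `lhs` and `rhs` are tuples of column indexes (hypothetical helper for illustration).
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        seen[key].add(tuple(row[i] for i in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}


# Check (Name, Type, State) -> (Price, Vat).
violations = fd_violations(ROWS, lhs=(0, 1, 2), rhs=(3, 4))
print(violations)
```

On this table the only violating key is (CH1, Clothing, Maryland), which appears with both ($50, $3) and ($100, $6).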

New Integrity Constraints for Data Quality
Functional Dependency: (Name, Type, State) → (Price, Vat)

| Name | Type      | State      | Price | Vat |
|------|-----------|------------|-------|-----|
| CH1  | Clothing  | Maryland   | $50   | $3  |
| BK45 | Book      | New Jersey | $120  | $15 |
| FN30 | Furniture | Washington | $100  | $0  |
| CH1  | Clothing  | Maryland   | $100  | $6  |
| BK66 | Book      | Washington | $80   | $10 |

Conditional Functional Dependency:
1. If (Type = Book) then the above FD holds.
2. If (State = Washington) then (Vat = 0).
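The two conditional rules can be checked mechanically. The sketch below is a minimal illustration under the assumption that rule 1 means "the FD restricted to rows with Type = Book" and rule 2 is a per-row check; the helper `fd_holds` and the dictionary layout are hypothetical names introduced here.

```python
# Rows copied from the slide's table.
ROWS = [
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 50, "Vat": 3},
    {"Name": "BK45", "Type": "Book", "State": "New Jersey", "Price": 120, "Vat": 15},
    {"Name": "FN30", "Type": "Furniture", "State": "Washington", "Price": 100, "Vat": 0},
    {"Name": "CH1", "Type": "Clothing", "State": "Maryland", "Price": 100, "Vat": 6},
    {"Name": "BK66", "Type": "Book", "State": "Washington", "Price": 80, "Vat": 10},
]

LHS = ("Name", "Type", "State")
RHS = ("Price", "Vat")


def fd_holds(rows, lhs, rhs):
    """True if every LHS value maps to exactly one RHS value in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for a key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Rule 1: the FD is only required to hold on rows where Type = Book.
books = [r for r in ROWS if r["Type"] == "Book"]
rule1_holds = fd_holds(books, LHS, RHS)

# Rule 2: every row with State = Washington must have Vat = 0.
rule2_violators = [r["Name"] for r in ROWS
                   if r["State"] == "Washington" and r["Vat"] != 0]

print(rule1_holds, rule2_violators)
```

On the table as transcribed here, rule 1 holds for the two Book rows even though the unconditional FD fails, and rule 2 flags the row BK66.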

New Integrity Constraints for Data Quality. Conditional Functional Dependency. Sequential Dependency. Aggregation Dependency: Discovering Conservation Rules.

Motivations. Infrastructure networks are continuously monitored over time. Examples: Highway monitoring systems use road sensors to identify traffic buildup and suggest alternate routes. Routers in an IP telecommunications network maintain counters to keep track of the traffic flowing through them. Power meters measure electricity flowing through different systems. These networks are monitored to troubleshoot customer problems, check network performance, and understand provisioning requirements.

Data Quality Problems in Infrastructure Networks. Missing or delayed data, especially over long intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network. IP network monitoring typically uses UDP, so measurements can be delayed (or even lost) when there is high network congestion. Sometimes a new router interface is activated and traffic is flowing through it, but this interface is not known to the monitoring system; in this case, there is missing data that is hard to detect.

Data Quality Problems in Infrastructure Networks. Missing or delayed data, especially over long intervals of time, can be detrimental to any attempt to ensure a reliable and well-functioning network. Other examples: monitoring road networks in the presence of sensor failures or unmonitored road segments; monitoring electricity networks in the presence of hacked power meters or of someone diverting (stealing) electricity. Detecting data quality issues is difficult when monitoring large and complex networks.

Approach. Impose integrity constraints to capture the semantics of the data. Efficiently provide a concise summary of the data where the rules hold or fail.

Integrity Constraint: Conservation Rules. In many infrastructure networks, there exists a conservation law between related quantities in the monitored data. Kirchhoff's node law of conservation of electricity: the current flowing into a node in an electric circuit equals the current flowing out of the node. Road network monitoring: every car that enters an intersection must exit. Telecommunication networks: every packet entering a router must exit. And many more.

Conservation Rules: One-to-One Matching. Match each incoming event to an outgoing event, and report the average delay or loss as a measure of violation of the conservation law. Collecting individual packets or monitoring individual events is infeasible with respect to storage and processing costs. Monitoring systems instead provide aggregate counts at regular intervals.

Conservation Rules. [Figure: incoming traffic and outgoing traffic at a router, plotted over time.] We expect the two time series to be identical. Matching incoming and outgoing aggregated traffic at every time point may not reveal true data quality issues, owing to clock synchronization error and queuing delay. Instead, compare aggregated totals over time windows.

Conservation Rules: Confidence of an Interval. Confidence of an interval = $\sum_i b_i / \sum_i a_i$, where $a_1, \dots, a_5$ are the incoming (IN) counts and $b_1, \dots, b_5$ are the outgoing (OUT) counts at a router. This measure ignores the duration of a violation.

Conservation Rules: Confidence of an Interval. Rightward matching between IN and OUT. [Figure: two examples of incoming (IN, $a_1, \dots, a_5$) and outgoing (OUT, $b_1, \dots, b_5$) traffic at a router, one labeled Confidence=1 and the other with its confidence value missing.]

Earth Mover Distance. A measure of distance between two distributions over some region D. Interpret the distributions as two different ways of piling up a certain amount of dirt over the region D. EMD is the minimum cost of turning one pile into the other. The cost is assumed to be the amount of dirt moved times the distance by which it is moved. Also known as Wasserstein distance.
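For intuition, the one-dimensional case can be computed without an optimization solver. The sketch below assumes two histograms over the same equally spaced bins with equal total mass and unit distance between adjacent bins; under those assumptions the EMD equals the sum of absolute differences between the two running totals. The function name `emd_1d` is made up for this example.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover Distance between two histograms on the same unit-spaced bins.

    Assumes equal total mass. The running-total difference at bin k is the
    amount of dirt that has to cross the boundary between bin k and bin k+1.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if sum(p) != sum(q):
        raise ValueError("histograms must have the same total mass")
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))


# Moving 3 units of dirt from bin 0 to bin 2 costs 3 * 2 = 6.
print(emd_1d([3, 0, 0], [0, 0, 3]))
# Identical piles need no work.
print(emd_1d([1, 2, 1], [1, 2, 1]))
```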

Rightward Matching (RM): A special case of Earth Mover Distance (EMD). Only right shifting is allowed, so a simple greedy algorithm works. [Figure: two examples of incoming and outgoing traffic at a router. Labels: Confidence=1; Confidence= (value missing); EMD=114, Maximum EMD Possible=114; EMD=0, Maximum EMD Possible=114.]
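The slide does not spell out the greedy algorithm, so the following is only one plausible reading: process time steps left to right and pair each outgoing unit with the earliest still-unmatched incoming unit (first in, first out). The function name, the return convention, and the decision to raise an error when outgoing traffic outruns incoming traffic are all assumptions of this sketch.

```python
from collections import deque


def rightward_matching_cost(incoming, outgoing):
    """Greedy FIFO rightward matching over per-time-step counts.

    Each outgoing unit at time t is matched to the earliest unmatched incoming
    unit that arrived at time s <= t, at cost (t - s) per unit.
    Returns (total_cost, unmatched_incoming_units).
    """
    if len(incoming) != len(outgoing):
        raise ValueError("series must have the same length")
    queue = deque()  # entries are [arrival_time, remaining_units]
    cost = 0
    for t, (a, b) in enumerate(zip(incoming, outgoing)):
        if a > 0:
            queue.append([t, a])
        need = b
        while need > 0:
            if not queue:
                raise ValueError("outgoing exceeds incoming so far; no rightward matching")
            arrival, available = queue[0]
            moved = min(available, need)
            cost += moved * (t - arrival)
            need -= moved
            if moved == available:
                queue.popleft()
            else:
                queue[0][1] = available - moved
    unmatched = sum(units for _, units in queue)
    return cost, unmatched


print(rightward_matching_cost([2, 1, 0], [1, 1, 1]))
print(rightward_matching_cost([3, 0, 0], [0, 0, 3]))
```

When everything is matched, the cost returned here equals the sum over time of CUM-IN minus CUM-OUT, which is the area between the two cumulative curves.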

RM: Interpretation by area over cumulative counts. Confidence of an interval I = area(CUM-OUT(I)) / area(CUM-IN(I)). [Figure: cumulative count over time for CUM-IN and CUM-OUT.]
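A small sketch of the area ratio follows. It assumes integer counts, treats the area under a cumulative curve as the sum of its values at each time step, and restarts both cumulative counts from zero at the start of the interval; the talk may use a different baseline, so treat that last choice as an assumption. The function name `confidence` is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate


def confidence(incoming, outgoing, start, end):
    """area(CUM-OUT) / area(CUM-IN) over the inclusive interval [start, end].

    Cumulative counts restart from zero at `start` (an assumption of this sketch).
    """
    cum_in = accumulate(incoming[start:end + 1])
    cum_out = accumulate(outgoing[start:end + 1])
    area_in = sum(cum_in)
    area_out = sum(cum_out)
    if area_in == 0:
        raise ValueError("no incoming traffic in the interval")
    return Fraction(area_out, area_in)


IN = [2, 1, 0]
OUT = [1, 1, 1]
# CUM-IN = 2, 3, 3 (area 8); CUM-OUT = 1, 2, 3 (area 6).
print(confidence(IN, OUT, 0, 2))
```

The gap between the two areas (8 - 6 = 2) is the rightward-matching cost for this example, so the ratio can also be read as one minus that cost divided by the area under CUM-IN.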

RM: Interpretation by area over cumulative counts. Find all intervals with confidence >= 0.9 (say). [Figure: cumulative count over time for CUM-IN and CUM-OUT.]

RM: Interpretation by area over cumulative counts. Return a minimum collection of intervals with confidence >= 0.9 (say) covering at least 95% (say) of the data. [Figure: cumulative count over time for CUM-IN and CUM-OUT.]

Finding intervals with high confidence. Trivial using $O(n^3)$ time: try all possible $n^2$ intervals, and for each interval find the confidence in $O(n)$ time. [Figure: cumulative count over time for CUM-IN and CUM-OUT.]

Finding intervals with high confidence. Easy to do in $O(n^2)$ time: compute in linear time the confidence of all the intervals that start from a specific point in time. [Figure: cumulative count over time for CUM-IN and CUM-OUT.]
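The quadratic approach can be sketched as follows: fix a start point, then extend the end point one step at a time while updating running cumulative counts and running areas, so each extension costs constant time. As before, restarting the cumulative counts at the interval start is an assumption of this sketch, and the function name is hypothetical.

```python
from fractions import Fraction


def high_confidence_intervals(incoming, outgoing, threshold):
    """Return all inclusive intervals (i, j) whose confidence is >= threshold.

    Runs in O(n^2): for each start i, the areas for (i, j) are updated from
    those for (i, j - 1) in constant time.
    """
    n = len(incoming)
    result = []
    for i in range(n):
        cum_in = cum_out = 0    # cumulative counts since time i
        area_in = area_out = 0  # areas under the two cumulative curves
        for j in range(i, n):
            cum_in += incoming[j]
            cum_out += outgoing[j]
            area_in += cum_in
            area_out += cum_out
            if area_in > 0 and Fraction(area_out, area_in) >= threshold:
                result.append((i, j))
    return result


IN = [2, 2, 0, 0]
OUT = [2, 0, 1, 1]
intervals = high_confidence_intervals(IN, OUT, Fraction(7, 10))
print(intervals)
```

Intervals with no incoming traffic are skipped here because the ratio is undefined for them; how the talk treats that case is not stated.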

Finding intervals with high confidence. How do you solve it in sub-quadratic time? Only maximal intervals. [Figure: cumulative count over time for CUM-IN and CUM-OUT.]

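Finding intervals with high confidence. Relax confidence: if the algorithm outputs $I$, then $\mathrm{conf}(I) \ge c/(1+\varepsilon)$ (no false positives). If $\mathrm{conf}(I^*) \ge c$, the algorithm outputs some $I \supseteq I^*$ with $\mathrm{conf}(I) \ge c/(1+\varepsilon)$ (no false negatives).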

Algorithm 1 Finding intervals with high confidence

Generating Sparse Set of Intervals = Compute the confidence of intervals with growing geometrically by a factor of Finding intervals with high confidence


Running time depends on area B

Finding intervals with high confidence: Avoiding dependency on area. Main idea: consider each possible ending point of an interval instead of each starting point, and compute the confidence of intervals whose lengths grow geometrically by a factor of $1+\varepsilon$.
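The enumeration step of this idea can be sketched as below: for every end point, probe only interval lengths 1, (1+ε), (1+ε)², and so on, and evaluate each probe in constant time from prefix sums. This is an illustration of the candidate generation only. It does not reproduce the relaxed-threshold argument behind the approximation guarantee, and the helper names, the rounding of lengths, and the restart-at-start convention for cumulative counts are assumptions.

```python
from fractions import Fraction


def prefix_tables(series):
    """p[k] = sum of the first k values; q[k] = p[1] + ... + p[k]."""
    p = [0]
    for x in series:
        p.append(p[-1] + x)
    q = [0]
    for k in range(1, len(p)):
        q.append(q[-1] + p[k])
    return p, q


def area(p, q, start, end):
    """Area under the cumulative curve restarted at `start`, over [start, end]."""
    return q[end + 1] - q[start] - (end - start + 1) * p[start]


def geometric_lengths(n, eps):
    """Lengths 1, ~(1+eps), ~(1+eps)^2, ... up to n, always strictly increasing."""
    lengths, length = [], 1
    while length <= n:
        lengths.append(length)
        length = max(length + 1, int(length * (1 + eps)))
    return lengths


def candidate_intervals(incoming, outgoing, threshold, eps):
    """Probe O(log n / eps) interval lengths per end point instead of all n."""
    n = len(incoming)
    p_in, q_in = prefix_tables(incoming)
    p_out, q_out = prefix_tables(outgoing)
    found = []
    for end in range(n):
        for length in geometric_lengths(n, eps):
            start = end - length + 1
            if start < 0:
                break
            a_in = area(p_in, q_in, start, end)
            a_out = area(p_out, q_out, start, end)
            if a_in > 0 and Fraction(a_out, a_in) >= threshold:
                found.append((start, end))
    return found


IN = [2, 2, 0, 0]
OUT = [2, 0, 1, 1]
candidates = candidate_intervals(IN, OUT, Fraction(7, 10), eps=1)
print(candidates)
```

With ε = 1 the probed lengths are 1, 2, and 4, so the interval (0, 2) of length 3 is never examined even though its confidence is exactly 7/10. Skipped intervals like this are why a relaxed threshold is needed for an approximation guarantee.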

Finding intervals with high confidence: Avoiding dependency on area


Discount Models

Finding a minimum collection of maximal intervals with a support threshold. This is partial set cover on a line. It can be solved exactly in quadratic time using dynamic programming, or in linear time with a constant-factor approximation using a greedy algorithm. Greedy gives a 7-approximation.

Finding a minimum collection of maximal intervals with a support threshold. Partial set cover using the greedy algorithm: if OPT chooses t intervals, then greedy can choose at most t intervals that do not intersect any of the OPT intervals, and at most 6 intervals that intersect any particular OPT interval, for at most 7t intervals in total.
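The exact greedy rule is not given in the transcript. The sketch below uses the textbook greedy for partial set cover (repeatedly take the interval that covers the most still-uncovered points until the support target is met) and is written for clarity, so it is not the linear-time version the slides refer to. The function name and the integer `min_covered` parameter are assumptions.

```python
def greedy_partial_cover(intervals, n_points, min_covered):
    """Pick intervals greedily until at least `min_covered` points are covered.

    `intervals` are inclusive (start, end) pairs over points 0..n_points-1.
    Ties are broken in favor of the interval listed first.
    """
    covered = [False] * n_points
    n_covered = 0
    chosen = []
    while n_covered < min_covered:
        best, best_gain = None, 0
        for start, end in intervals:
            gain = sum(1 for t in range(start, end + 1) if not covered[t])
            if gain > best_gain:
                best, best_gain = (start, end), gain
        if best is None:
            raise ValueError("the intervals cannot reach the support target")
        for t in range(best[0], best[1] + 1):
            covered[t] = True
        n_covered += best_gain
        chosen.append(best)
    return chosen


candidates = [(0, 4), (3, 6), (8, 9), (5, 5)]
# Cover at least 8 of the 10 time points.
cover = greedy_partial_cover(candidates, n_points=10, min_covered=8)
print(cover)
```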

Credit Card Data [Figure; axis labels: Dec, Jan]

Entrance-Exit Data

Network Monitoring Data

Running Time on Job-log Data [Figure: area-based vs. non-area-based algorithms]

Summary. We study data quality problems that arise frequently in many infrastructure networks. We propose rules that express conservation laws between related quantities, such as those between the inbound and outbound counts reported by network monitoring systems. We present several confidence metrics for conservation rules. We give efficient approximation algorithms for finding a concise set of intervals that satisfy (or fail) a supplied conservation rule given a confidence threshold.