Xian Li  Cisco) Xin Luna Dong (AT&T  Google) Kenneth Lyons (AT&T Labs-Research) Weiyi Meng Divesh Srivastava (AT&T.

Slides:



Advertisements
Similar presentations
Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Advertisements

Números.
Trend for Precision Soil Testing % Zone or Grid Samples Tested compared to Total Samples.
Trend for Precision Soil Testing % Zone or Grid Samples Tested compared to Total Samples.
AGVISE Laboratories %Zone or Grid Samples – Northwood laboratory
Trend for Precision Soil Testing % Zone or Grid Samples Tested compared to Total Samples.
PDAs Accept Context-Free Languages
Fill in missing numbers or operations
AP STUDY SESSION 2.
1
EuroCondens SGB E.
Worksheets.
Slide 1Fig 26-CO, p.795. Slide 2Fig 26-1, p.796 Slide 3Fig 26-2, p.797.
Sequential Logic Design
STATISTICS HYPOTHESES TEST (I)
STATISTICS INTERVAL ESTIMATION Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University.
Addition and Subtraction Equations
By John E. Hopcroft, Rajeev Motwani and Jeffrey D. Ullman
David Burdett May 11, 2004 Package Binding for WS CDL.
Measurements and Their Uncertainty 3.1
Add Governors Discretionary (1G) Grants Chapter 6.
CALENDAR.
CHAPTER 18 The Ankle and Lower Leg
The 5S numbers game..
突破信息检索壁垒 -SciFinder Scholar 介绍
A Fractional Order (Proportional and Derivative) Motion Controller Design for A Class of Second-order Systems Center for Self-Organizing Intelligent.
Media-Monitoring Final Report April - May 2010 News.
St. Edward’s University
Break Time Remaining 10:00.
The basics for simulations
Factoring Quadratics — ax² + bx + c Topic
PP Test Review Sections 6-1 to 6-6
Network TV Scatter vs. Spot TV Cost Comparison A25-54 CPMs Q Q TVB Analysis of SQAD National and Local Market Cost Estimate Data 1 Updated.
Discrete Event (time) Simulation Kenneth.
Divesh Srivastava AT&T Labs-Research. The Web is Great.
Regression with Panel Data
Dynamic Access Control the file server, reimagined Presented by Mark on twitter 1 contents copyright 2013 Mark Minasi.
Xin Luna Dong (AT&T Labs  Google Inc.) Barna Saha, Divesh Srivastava (AT&T Labs-Research) VLDB’2013.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
Progressive Aerobic Cardiovascular Endurance Run
1..
Name of presenter(s) or subtitle Canadian Netizens February 2004.
MaK_Full ahead loaded 1 Alarm Page Directory (F11)
When you see… Find the zeros You think….
1 Joseph Ghafari Artificial Neural Networks Botnet detection for Stéphane Sénécal, Emmanuel Herbert.
2011 WINNISQUAM COMMUNITY SURVEY YOUTH RISK BEHAVIOR GRADES 9-12 STUDENTS=1021.
Before Between After.
2011 FRANKLIN COMMUNITY SURVEY YOUTH RISK BEHAVIOR GRADES 9-12 STUDENTS=332.
Subtraction: Adding UP
: 3 00.
5 minutes.
1 Non Deterministic Automata. 2 Alphabet = Nondeterministic Finite Accepter (NFA)
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Static Equilibrium; Elasticity and Fracture
Converting a Fraction to %
Resistência dos Materiais, 5ª ed.
Clock will move after 1 minute
Lial/Hungerford/Holcomb/Mullins: Mathematics with Applications 11e Finite Mathematics with Applications 11e Copyright ©2015 Pearson Education, Inc. All.
Chapter 2 Tutorial 2nd & 3rd LAB.
Physics for Scientists & Engineers, 3rd Edition
Select a time to count down from the clock above
Copyright Tim Morris/St Stephen's School
Patient Survey Results 2013 Nicki Mott. Patient Survey 2013 Patient Survey conducted by IPOS Mori by posting questionnaires to random patients in the.
1 Dr. Scott Schaefer Least Squares Curves, Rational Representations, Splines and Continuity.
1 Non Deterministic Automata. 2 Alphabet = Nondeterministic Finite Accepter (NFA)
Introduction Embedded Universal Tools and Online Features 2.
Xin Luna Dong (Google Inc.) Divesh Srivastava (AT&T 5/2013.
Schutzvermerk nach DIN 34 beachten 05/04/15 Seite 1 Training EPAM and CANopen Basic Solution: Password * * Level 1 Level 2 * Level 3 Password2 IP-Adr.
Presentation transcript:

Xian Li  Cisco) Xin Luna Dong (AT&T  Google) Kenneth Lyons (AT&T Labs-Research) Weiyi Meng Divesh Srivastava (AT&T Labs-Research) VLDB’2013

February 29, 1922

Study on Two Domains #SourcesPeriod#Objects#Local- attrs #Global- attrs Consider ed items Stock557/ * *20 Flight3812/ * *31  Stock  Search “stock price quotes” and “AAPL quotes”  Sources: 200 (search results)  89 (deep web)  76 (GET method)  55 (none JavaScript)  1000 “Objects”: a stock with a particular symbol on a particular day  30 from Dow Jones Index  100 from NASDAQ100 (3 overlaps)  873 from Russell 3000  Attributes: 333 (local)  153 (global)  21 (provided by > 1/3 sources)  16 (no change after market close)

Study on Two Domains #SourcesPeriod#Objects#Local- attrs #Global- attrs Consider ed items Stock557/ * *20 Flight3812/ * *31  Flight  Search “flight status”  Sources: 38  3 airline websites (AA, UA, Continental)  8 airport websites (SFO, DEN, etc.)  27 third-party websites (Orbitz, Travelocity, etc.)  1200 “Objects”: a flight with a particular flight number on a particular day from a particular departure city  Departing or arriving at the hub airports of AA/UA/Continental  Attributes: 43 (local)  15 (global)  6 (provided by > 1/3 sources)  scheduled dept/arr time, actual dept/arr time, dept/arr gate

Study on Two Domains  Why these two domains?  Belief of fairly clean data  Data quality can have big impact on people’s lives  Resolved heterogeneity at schema level and instance level #SourcesPeriod#Objects#Local- attrs #Global- attrs Consider ed items Stock557/ * *21 Flight3812/ * *31

Q1. Are There a Lot of Redundant Data on the Deep Web?

Q2. Are the Data Consistent?  Inconsistency on 70% data items  Tolerance to 1% difference 

Q2. Are the Data Consistent?  Inconsistency on 50% data items after removing StockSmart 

Q2. Are the Data Consistent? (II)  Entropy measures distribution of different values  Quite low entropy: one value provided more often than others

Q2. Are the Data Consistent? (III)  Deviation measures difference of numerical values  High deviation: 13.4 for Stock, 13.1 min for Flight

Why Such Inconsistency? — I. Semantic Ambiguity Yahoo! Finance Nasdaq Day’s Range: wk Range: Wk:

Why Such Inconsistency? — II. Instance Ambiguity

Why Such Inconsistency? — III. Out-of-Date Data 4:05 pm 3:57 pm

Why Such Inconsistency? — IV. Unit Error 76,821, B

Why Such Inconsistency? — V. Pure Error FlightViewFlightAware Orbitz 6:15 PM 6:22 PM 9:40 PM 8:33 PM 9:54 PM

Why Such Inconsistency?  Random sample of 20 data items and 5 items with the largest #values in each domain

Q3. Is Each Source of High Accuracy?  Not high on average:.86 for Stock and.8 for Flight  Gold standard  Stock: vote on data from Google Finance, Yahoo! Finance, MSN Money, NASDAQ, Bloomberg  Flight: from airline websites 

Q3-2. Are Authoritative Sources of High Accuracy?   Reasonable but not so high accuracy  Medium coverage

Q4. Is There Copying or Data Sharing Between Deep-Web Sources?

Q4-2. Is Copying or Data Sharing Mainly on Accurate Data? 

Basic Solution: Voting  Only 70% correct values are provided by over half of the sources  Voting precision: .908 for Stock; i.e., wrong values for 1500 data items .864 for Flight; i.e., wrong values for 1000 data items

Improvement I. Leveraging Source Accuracy S1S2S3 Flight 17:02PM6:40PM7:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM

Improvement I. Leveraging Source Accuracy S1S2S3 Flight 17:02PM6:40PM7:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM Naïve voting obtains an accuracy of 80% Higher accuracy; More trustable

Improvement I. Leveraging Source Accuracy S1S2S3 Flight 17:02PM6:40PM7:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM Considering accuracy obtains an accuracy of 100% Higher accuracy; More trustable Challenges: 1. How to decide source accuracy? 2. How to leverage accuracy in voting?

Results on Stock Data (I)  Sources ordered by recall (coverage * accuracy)  Among various methods, the Bayesian-based method (Accu) performs best at the beginning, but in the end obtains a final precision (=recall) of.900, worse than Vote (.908)

Results on Stock Data (II)  AccuSim obtains a final precision of.929, higher than Vote and any other method (around.908)  This translates to 350 more correct values

Results on Stock Data (III)

Results on Flight Data  Accu/AccuSim obtains a final precision of.831/.833, both lower than Vote (.857)  WHY??? What is that magic source?

Copying or Data Sharing Can Happen on Inaccurate Data

S1S2S3S4S5 Flight 17:02PM6:40PM7:02PM 8:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM Naïve voting works only if data sources are independent.

S1S2S3S4S5 Flight 17:02PM6:40PM7:02PM 8:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM Higher accuracy; More trustable Considering source accuracy can be worse when there is copying

Improvement II. Ignoring Copied Data It is important to detect copying and ignore copied values in fusion S1S2S3S4S5 Flight 17:02PM6:40PM7:02PM 8:02PM Flight 25:43PM 5:50PM Flight 39:20AM Flight 49:40PM9:52PM8:33PM Flight 56:15PM 6:22PM Challenges: 1. How to detect copying? 2. How to leverage copying in voting?

Results on Flight Data  AccuCopy obtains a final precision of.943, much higher than Vote (.864)  This translates to 570 more correct values

Results on Flight Data (II)

Take-Aways  Web data is not fully trustable, Web sources have different accuracy, and copying is common  Leveraging source accuracy, copying relationships, and value similarity can improve truth finding  Data sets downloadable from