SampleClean: Bringing Data Cleaning into the BDAS Stack Sanjay Krishnan and Daniel Haas In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg
Who publishes more? 2
Microsoft Academic Search Paper IdAffiliation 16Computer Science Division--University of California Berkeley CA 101University of California at Berkeley 102Department of Physics Stanford University California 116Lawrence Berkeley National Labs California 3
Microsoft Academic Search Paper IdAffiliation 16Computer Science Division--University of California Berkeley CA 101University of California at Berkeley 102Department of Physics Stanford University California 116Lawrence Berkeley National Labs California X 4
Microsoft Academic Search University of California at Berkeley Computer Science Division University of California at Berkeley Computer Science Division University of California at Berkeley Department of Physics Stanford University California 5
Data cleaning in BDAS. –Problem 1. Scale –Problem 2. Latency Sampling to cope with scale. Asynchrony to cope with latency. Enter SampleClean 6
Now it’s your turn! Be the crowd and help us decide 7
Dirty Data is Ubiquitous 8 Example: Missing, incomplete, inconsistent data
Data Cleaning is Hard 9 Costly Domain-specific Time consuming
Analytics 10 A New Data Cleaning Architecture Data Cleaning Data
Can it Scale? Crowd Machine Learning Regex Time People are slow and expensive 11
Insight 1: Asynchrony Hides Latency 12
Insight 2: Sampling Hides Scale Query Error Time BlinkDB 13
Query Error Time Data Error BlinkDB Insight 2: Sampling Hides Scale 14
Query Error Time Data Error SampleClean Insight 2: Sampling Hides Scale BlinkDB 15
SampleClean Data Flow Dirty Data Dirty Sample Dirty Sample Clean Sample Clean Sample Query Data Cleaning Data Cleaning 16 SamplingAsynchrony
The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 17
The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 18
Approximate Query Processing Estimate early results and bound with error bars SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014 BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013 Query Error Time
The SampleClean Architecture 20 Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample Data Cleaning Library
Crowds and Machines Work Together Extensible library of data cleaning tools Tools are: –Automated –Human-powered –Hybrid Crowd Machine Learning Regex Time 21
Active Learning and Crowds Choose informative training points Not Informative Are these the same? Stanford Department of IEOR UC Berkeley Stats Yes No Informative Are these the same? Department of Mathematics Stanford University University of California Berkeley Department of Mathematics Yes No 22
Active Learning and Crowds Choose informative training points Not Informative Are these the same? Stanford Department of IEOR UC Berkeley Stats Yes No Informative Are these the same? Department of Mathematics Stanford University University of California Berkeley Department of Mathematics Yes No 23
The SampleClean Architecture 24 Data Cleaning Library Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample Approximate Query Processing Asynchronous Pipelines
Putting it all together: Asynchronous Pipelines Users group data cleaning operations into pipelines 25
The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 26
Great, Now What? Prototype implementation complete! Significant research challenges remain: Crowd worker performance and quality Pipeline semantics and optimization Programming model and interface Open source release targeted for next year 27
Summary Data Cleaning is slow, costly, and domain-specific SampleClean brings data cleaning into the BDAS stack SampleClean uses asynchrony to hide latency, and sampling to hide scale SampleClean combines Algorithms, Machines, and People, all in one system 28
Asynchrony in Spark The Spark abstraction: blocking BSP So how do we achieve asynchrony? Multithreaded master Intermediate results materialized in Hive Standalone Finagle HTTP server for crowd work 29