SampleClean: Bringing Data Cleaning into the BDAS Stack Sanjay Krishnan and Daniel Haas In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim.

Slides:



Advertisements
Similar presentations
Deep Packet Inspection with DFA-trees and Parametrized Language Overapproximation Author: Daniel Luchaup, Lorenzo De Carli, Somesh Jha, Eric Bach Publisher:
Advertisements

High throughput chain replication for read-mostly workloads
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
“FENDER” AUTOMATIC MEMORY FENCE INFERENCE Presented by Michael Kuperstein, Technion Joint work with Martin Vechev and Eran Yahav, IBM Research 1.
Firewall Query Engine and Firewall Comparison Engine Mohamed Gouda Alex X. Liu Computer Science Department The University of Texas at Austin.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011.
Scaling Distributed Machine Learning with the BASED ON THE PAPER AND PRESENTATION: SCALING DISTRIBUTED MACHINE LEARNING WITH THE PARAMETER SERVER – GOOGLE,
Crowds, Clouds, and Algorithms: Exploring the Human Side of “Big Data” Panelists: Sihem Amer-Yahia (Yahoo! Research) AnHai Doan (Wisconsin) Jon Kleinberg.
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of.
Chess Review November 21, 2005 Berkeley, CA Edited and presented by Platform Based Design for Wireless Sensor Networks Alvise Bonivento UC Berkeley.
Chapter 6: Database Evolution Title: AutoAdmin “What-if” Index Analysis Utility Authors: Surajit Chaudhuri, Vivek Narasayya ACM SIGMOD 1998.
1998/5/21by Chang I-Ning1 ImageRover: A Content-Based Image Browser for the World Wide Web Introduction Approach Image Collection Subsystem Image Query.
STEPPING STONE PROJECT STEPPING STONE PROJECT designing a new engineering discipline presented by team 1.
Bioinformatics Tool Development Dong Xu Computer Science Department 109 Engineering Building West
Interactive Query Processing in Scientific Applications David Liu UC Berkeley Computer Science Division.
CSI Fall 2002 Dr. William A. Maniatty Assistant Prof. Dept. of Computer Science University At Albany Programming Languages and Systems Concepts Fall.
Abstract Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding.
The Power of Choice in Data-Aware Cluster Scheduling
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Cache-Conscious Runtime Optimization for Ranking Ensembles Xun Tang, Xin Jin, Tao Yang Department of Computer Science University of California at Santa.
PIQL: Success- Tolerant Query Processing in the Cloud Michael Armbrust, Kristal Curtis, Tim Kraska Armando Fox, Michael J. Franklin, David A. Patterson.
© What do bioinformaticians do?
© 2010 Cisco and/or its affiliates. All rights reserved. 1 Managing Microsoft Applications with Cisco UCS Manager & PowerTool.
Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.
Update on Selective Editing and Implications for Staff Skills International Trade Conference September 2008 Ken Smart.
Data-Centric Human Computation Jennifer Widom Stanford University.
The Design of Innovation is coming July 2002 a new book by David E. Goldberg Department of General Engineering University of Illinois at Urbana-Champaign.
© 2010 Cisco and/or its affiliates. All rights reserved. 1 Managing Microsoft Applications with Cisco UCS Manager & PowerTool.
- 1 - EE898_HW/SW Partitioning Hardware/software partitioning  Functionality to be implemented in software or in hardware? No need to consider special.
Tetherless World Constellation Open Government Data Jim Hendler Tetherless World Professor of Computer and Cognitive Science Assistant Dean of Information.
1 EECS 6083 Compiler Theory Based on slides from text web site: Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
Introduction to the Semantic Web and Linked Data
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Connecting with Computer Science2 Objectives Learn how software engineering is used to create applications Learn some of the different software engineering.
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
1 The Good  HPC brings a wealth of parallelization experience, petaflop scaling and hybrid architectures.  Analytics brings new algorithms and new markets.
Microsoft Partner Conference Integrated Innovation Don Kerr Partner Technology Specialist.
Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.
ECE 526 – Network Processing Systems Design Programming Model Chapter 21: D. E. Comer.
The Three E’s of Big Data and What DB People can do About Them UC BERKELEY Michael Franklin – UC Berkeley Beckman Database Get Together October 14, 2013.
Presented by SciDAC-2 Petascale Data Storage Institute Philip C. Roth Computer Science and Mathematics Future Technologies Group.
András Benczúr Head, “Big Data – Momentum” Research Group Big Data Analytics Institute for Computer.
4/18/2018 3:49 PM © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN.
Packing Tasks with Dependencies
Designing a Scalable Data Cleaning Infrastructure
Dynamo: A Runtime Codesign Environment
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
So, what was this course about?
A Deeper Look Into Reactive Streams with Akka Streams 1.0
Alan Williams, Donal Fellows, Finn Bacall,
for more information ... Performance Tuning
StreamApprox Approximate Stream Analytics in Apache Flink
StreamApprox Approximate Stream Analytics in Apache Spark
StreamApprox Approximate Computing for Stream Analytics
CMPT 733, SPRING 2016 Jiannan Wang
Interpret the execution mode of SQL query in F1 Query paper
Panagiotis G. Ipeirotis Luis Gravano
“To improve the life and business success of the farmer and rancher.”
“To improve the life and business success of the farmer and rancher.”
CMPT 733, SPRING 2017 Jiannan Wang
2019/11/12 Efficient Measurement on Programmable Switches Using Probabilistic Recirculation Presenter:Hung-Yen Wang Authors:Ran Ben Basat, Xiaoqi Chen,
Mulesoft Anypoint Connector for AS/400 and Web Transaction Framework
Presentation transcript:

SampleClean: Bringing Data Cleaning into the BDAS Stack Sanjay Krishnan and Daniel Haas In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

Who publishes more? 2

Microsoft Academic Search Paper IdAffiliation 16Computer Science Division--University of California Berkeley CA 101University of California at Berkeley 102Department of Physics Stanford University California 116Lawrence Berkeley National Labs California 3

Microsoft Academic Search Paper IdAffiliation 16Computer Science Division--University of California Berkeley CA 101University of California at Berkeley 102Department of Physics Stanford University California 116Lawrence Berkeley National Labs California X 4

Microsoft Academic Search University of California at Berkeley Computer Science Division University of California at Berkeley Computer Science Division University of California at Berkeley Department of Physics Stanford University California 5

Data cleaning in BDAS. –Problem 1. Scale –Problem 2. Latency Sampling to cope with scale. Asynchrony to cope with latency. Enter SampleClean 6

Now it’s your turn! Be the crowd and help us decide 7

Dirty Data is Ubiquitous 8 Example: Missing, incomplete, inconsistent data

Data Cleaning is Hard 9 Costly Domain-specific Time consuming

Analytics 10 A New Data Cleaning Architecture Data Cleaning Data

Can it Scale? Crowd Machine Learning Regex Time People are slow and expensive 11

Insight 1: Asynchrony Hides Latency 12

Insight 2: Sampling Hides Scale Query Error Time BlinkDB 13

Query Error Time Data Error BlinkDB Insight 2: Sampling Hides Scale 14

Query Error Time Data Error SampleClean Insight 2: Sampling Hides Scale BlinkDB 15

SampleClean Data Flow Dirty Data Dirty Sample Dirty Sample Clean Sample Clean Sample Query Data Cleaning Data Cleaning 16 SamplingAsynchrony

The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 17

The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 18

Approximate Query Processing Estimate early results and bound with error bars SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014 BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013 Query Error Time

The SampleClean Architecture 20 Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample Data Cleaning Library

Crowds and Machines Work Together Extensible library of data cleaning tools Tools are: –Automated –Human-powered –Hybrid Crowd Machine Learning Regex Time 21

Active Learning and Crowds Choose informative training points Not Informative Are these the same? Stanford Department of IEOR UC Berkeley Stats Yes  No Informative Are these the same? Department of Mathematics Stanford University University of California Berkeley Department of Mathematics Yes  No 22

Active Learning and Crowds Choose informative training points Not Informative Are these the same? Stanford Department of IEOR UC Berkeley Stats Yes  No Informative Are these the same? Department of Mathematics Stanford University University of California Berkeley Department of Mathematics Yes  No 23

The SampleClean Architecture 24 Data Cleaning Library Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample Approximate Query Processing Asynchronous Pipelines

Putting it all together: Asynchronous Pipelines Users group data cleaning operations into pipelines 25

The SampleClean Architecture Data Cleaning Library Approximate Query Processing Asynchronous Pipelines Clean Sample Issue Queries, Get Results Declare Cleaning Operations Dirty Sample 26

Great, Now What? Prototype implementation complete! Significant research challenges remain: Crowd worker performance and quality Pipeline semantics and optimization Programming model and interface Open source release targeted for next year 27

Summary Data Cleaning is slow, costly, and domain-specific SampleClean brings data cleaning into the BDAS stack SampleClean uses asynchrony to hide latency, and sampling to hide scale SampleClean combines Algorithms, Machines, and People, all in one system 28

Asynchrony in Spark The Spark abstraction: blocking BSP So how do we achieve asynchrony? Multithreaded master Intermediate results materialized in Hive Standalone Finagle HTTP server for crowd work 29