A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011.

Slides:



Advertisements
Similar presentations
Ali Ghodsi UC Berkeley & KTH & SICS
Advertisements

The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
Large Scale Computing Systems
Big Data: Analytics Platforms Donald Kossmann Systems Group, ETH Zurich 1.
THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Spark in the Hadoop Ecosystem Eric Baldeschwieler (a.k.a. Eric14)
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Big Data Management and Analytics Introduction Spring 2015 Dr. Latifur Khan 1.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Cancer Genomics and Computing UC Berkeley CS David Patterson (UCB) and Taylor Sittler (UCSF) August 28, 2011.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Tyson Condie.
PIQL: Success- Tolerant Query Processing in the Cloud Michael Armbrust, Kristal Curtis, Tim Kraska Armando Fox, Michael J. Franklin, David A. Patterson.
UC Berkeley 1 The Future of Cloud Computing Michael Franklin, UC Berkeley Reliable Adaptive Distributed Systems Lab Image: John Curley
Hosted on the Powerful Microsoft Azure Platform, Advent Countdown Lets Companies Run Reliable and Scalable Holiday Marketing Campaigns MICROSOFT AZURE.
Bleeding edge technology to transform Data into Knowledge HADOOP In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Chapter 3 DECISION SUPPORT SYSTEMS CONCEPTS, METHODOLOGIES, AND TECHNOLOGIES: AN OVERVIEW Study sub-sections: , 3.12(p )
LiquiData Platform Unleashes Powerful Cloud Analytics Capabilities with Integrated Reporting and Visualization from Diverse Sources of Data COMPANY PROFILE:
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
Increasing Manufacturing Uptime Is Made Easier with RtTech’s Industrial Facilities Application RtDuet, Powered by the Microsoft Azure Cloud MICROSOFT AZURE.
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
Bizfss File Sync and Sharing Solution, Built on Microsoft Azure, Allows Businesses to Sync, Share, Back Up Using Their Own Cloud Storage MICROSOFT AZURE.
CS 127 Introduction to Computer Science. What is a computer?  “A machine that stores and manipulates information under the control of a changeable program”
Data Science Background and Course Software setup Week 1.
Datalayer Notebook Allows Data Scientists to Play with Big Data, Build Innovative Models, and Share Results Easily on Microsoft Azure MICROSOFT AZURE ISV.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Powered by Microsoft Azure, PointMatter Is a Flexible Solution to Move and Share Data between Business Groups and IT MICROSOFT AZURE ISV PROFILE: LOGICMATTER.
SampleClean: Bringing Data Cleaning into the BDAS Stack Sanjay Krishnan and Daniel Haas In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim.
MLevel Is the Fully Microsoft Azure-Based, Industry-Leading Casual Learning Platform Used by Enterprises Worldwide to Make Learning Fun MICROSOFT AZURE.
SUPPLY CHAIN OF BIG DATA. WHAT IS BIG DATA?  A lot of data  Too much data for traditional methods  The 3Vs  Volume  Velocity  Variety.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
IoT Meets Big Data Standardization Considerations
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
+ Logentries Is a Real-Time Log Analytics Service for Aggregating, Analyzing, and Alerting on Log Data from Microsoft Azure Apps and Systems MICROSOFT.
A Platform for Fine-Grained Resource Sharing in the Data Center
Microsoft Azure and DataStax: Start Anywhere and Scale to Any Size in the Cloud, On- Premises, or Both with a Leading Distributed Database MICROSOFT AZURE.
Axis AI Solves Challenges of Complex Data Extraction and Document Classification through Advanced Natural Language Processing and Machine Learning MICROSOFT.
Built on the Powerful Microsoft Azure Platform, Forensic Advantage Helps Public Safety and National Security Agencies Collect, Analyze, Report, and Distribute.
Saasabi’s Analytical Processing Engine in the Cloud Makes Business Intelligence Affordable for Everyone COMPANY PROFILE: Saasabi Saasabi is a BizSpark.
Big Data Yuan Xue CS 292 Special topics on.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
An Introduction To Big Data For The SQL Server DBA.
The Future of Whole Human Genome Data Management and Analysis, Available on the Microsoft Azure Platform Today MICROSOFT AZURE APP BUILDER PROFILE: SPIRAL.
1 Cloud-Native Data Warehousing Bob Muglia. 2 Scenarios with affinity for cloud Gartner 2016 Predictions: By 2018, six billion connected things will be.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
Data Analytics Challenges Some faults cannot be avoided Decrease the availability for running physics Preventive maintenance is not enough Does not take.
What we mean by Big Data and Advanced Analytics
Big Data is a Big Deal!.
Big Data Enterprise Patterns
Big Data A Quick Review on Analytical Tools
Partner Logo Veropath Offers a Next-Gen Expense Management SaaS Technology Solution, Built Specifically to Harness Big Data Analytics Capabilities in Azure.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Hadoop Clusters Tess Fulkerson.
Data Warehousing and Data Mining
Tools for Processing Big Data Jinan Al Aridhee and Christian Bach
Overview of big data tools
The Future of Cloud Computing
Big DATA.
Data Analysis and R : Technology & Opportunity
V. Uddameri Texas Tech University
Big-Data Analytics with Azure HDInsight
Presentation transcript:

A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011

Big Data is Massive… Facebook: –130TB/day: user logs – TB/day: 83 million pictures Google: > 25 PB/day processed data Data generated by LHC: 1 PB/sec Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year –~60% increase every year 2

…and Grows Bigger and Bigger! More and more devices More and more people Cheaper and cheaper storage –~50% increase in GB/$ every year 3

…and Grows Bigger and Bigger! Log everything! –Don’t always know what question you’ll need to answer Hard to decide what to delete –Thankless decision: people know only when you are wrong! –“Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations” Stored data grows faster than GB/$ 4

What is Big Data? You don’t need to be big to have big data problem! –Inadequate tools to analyze data –Data management may dominate infrastructure cost 5 Data that is expensive to manage, and hard to extract value from

Big Data is not Cheap! Storing and managing 1PB data: $500K-$1M/ year –Facebook: 200 PB/year 6 “Typical” cloud-based service startup (e.g., Conviva) –Log storage dominates infrastructure cost Infrastructure cost ~1PB storage capacity

Hard to Extract Value from Data! Data is –Diverse, variety of sources –Uncurated, no schema, inconsistent semantics, syntax –Integration a huge challenge No easy way to get answers that are –High-quality –Timely Challenge: maximize value from data by getting best possible answers 7

Requires Multifaceted Approach Three dimensions to improve data analysis –Improving scale, efficiency, and quality of algorithms (Algorithms) –Scaling up datacenters (Machines) –Leverage human activity and intelligence (People) Need to adaptively and flexibly combine all three dimensions 8

Algorithms, Machines, People Today’s apps: fixed point in solution space 9 Algorithms Machines People Need techniques to dynamically pick best operating point search Watson/IBM

The AMP Lab 10 search Watson/IBM Machines People Algorithms sense of data at scale by tightly integrating algorithms, machines, and people Make sense of data at scale by tightly integrating algorithms, machines, and people

AMP Faculty and Sponsors Faculty –Alex Bayen (mobile sensing platforms) –Armando Fox (systems) –Michael Franklin (databases): Director –Michael Jordan (machine learning): Co-director –Anthony Joseph (security & privacy) –Randy Katz (systems) –David Patterson (systems) –Ion Stoica (systems): Co-director –Scott Shenker (networking) Sponsors: 11

Algorithms State-of-art Machine Learning (ML) algorithms do not scale –Prohibitive to process all data points 12 How do you know when to stop? true answer Estimate # of data points

Algorithms Given any problem, data and a budget –Immediate results with continuous improvement –Calibrate answer: provide error bars 13 Error bars on every answer! Estimate # of data points true answer

Algorithms Given any problem, data and a time budget –Immediate results with continuous improvement –Calibrate answer: provide error bars 14 Stop when error smaller than a given threshold Estimate # of data points time true answer

Algorithms Given any problem, data and a time budget –Automatically pick the best algorithm 15 Estimate time pick sophisticated pick simple error too high true answer sophisticated simple

Machines “The datacenter as a computer” still in its infancy –Special purpose clusters, e.g., Hadoop cluster –Highly variable performance –Hard to program –Hard to debug 16 = ?

Machines Make datacenter a real computer! 17 Node OS (e.g. Linux) Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Windows) Node OS (e.g. Linux) Node OS (e.g. Linux) … Datacenter “OS” (e.g., Mesos) Share datacenter between multiple cluster computing apps Provide new abstractions and services AMP stack Existing stack

Machines Make datacenter a real computer! 18 Node OS (e.g. Linux) Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Windows) Node OS (e.g. Linux) Node OS (e.g. Linux) … Datacenter “OS” (e.g., Mesos) Hadoop MPI Hypertbale … Cassandra Hive Support existing cluster computing apps AMP stack Existing stack

Machines Make datacenter a real computer! 19 Node OS (e.g. Linux) Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Windows) Node OS (e.g. Linux) Node OS (e.g. Linux) … Spark SCADS … Datacenter “OS” (e.g., Mesos) Hadoop MPI Hypertbale … Cassandra Hive PIQL Support interactive and iterative data analysis (e.g., ML algorithms) Consistency adjustable data store Predictive & insightful query language AMP stack Existing stack

Machines Make datacenter a real computer! 20 Node OS (e.g. Linux) Node OS (e.g. Linux) Node OS (e.g. Windows) Node OS (e.g. Windows) Node OS (e.g. Linux) Node OS (e.g. Linux) … Spark SCADS … Datacenter “OS” (e.g., Mesos) Applications, tools Hadoop MPI Hypertbale … Cassandra Hive PIQL Advanced ML algorithms Interactive data mining Collaborative visualization AMP stack Existing stack

People Humans can make sense of messy data! 21

People Make people an integrated part of the system! –Leverage human activity –Leverage human intelligence (crowdsourcing): Curate and clean dirty data Answer imprecise questions Test and improve algorithms Challenge –Inconsistent answer quality in all dimensions (e.g., type of question, time, cost) 22 Machines + Algorithms data, activity Questions Answers

Real Applications Mobile Millennium Project –Alex Bayen, Civil and Environment Engineering, UC Berkeley Microsimulation of urban development –Paul Waddell, College of Environment Design, UC Berkeley Crowd based opinion formation –Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley Personalized Sequencing –Taylor Sittler, UCSF 23

Personalized Sequencing 24

Sequencing Microsimulation Mobile Millennium The AMP Lab 25 Machines People Algorithms sense of data at scale by tightly integrating algorithms, machines, and people Make sense of data at scale by tightly integrating algorithms, machines, and people

Big Data in 2020 Almost Certainly: Create a new generation of big data scientist A real datacenter OS ML becoming an engineering discipline People deeply integrated in big data analysis pipeline If We’re Lucky: System will know what to throw away Generate new knowledge that an individual person cannot

Summary Goal: Tame Big Data Problem –Get results with right quality at the right time Approach: Holistically integrate Algorithms, Machines, and People Huge research issues across many domains 27