1 CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)

Slides:



Advertisements
Similar presentations
Ali Ghodsi UC Berkeley & KTH & SICS
Advertisements

The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
Introduction to Advanced Computing Platforms for Data Analysis Ruoming Jin.
THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Spark in the Hadoop Ecosystem Eric Baldeschwieler (a.k.a. Eric14)
INTRODUCTION TO CLOUD COMPUTING CS 595 LECTURE 6 2/13/2015.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Berkeley Data Analytics Stack
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Why Spark on Hadoop Matters
CS 268: Project Suggestions Ion Stoica February 6, 2003.
SESSION 9 THE INTERNET AND THE NEW INFORMATION NEW INFORMATIONTECHNOLOGYINFRASTRUCTURE.
Introducing the Platform Lab John Ousterhout Stanford University.
Welcome to CS 395/495 Measurement and Analysis of Online Social Networks.
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
1 CS : Cloud Computing, Systems, Networking, and Frameworks Fall 2011 (MW 1:00-2:30, 293 Cory Hall) Ion Stoica (
INTRODUCTION TO CLOUD COMPUTING Cs 595 Lecture 5 2/11/2015.
Project Proposal (Title + Abstract) Due Wednesday, September 4, 2013.
Apache Spark and the future of big data applications Eric Baldeschwieler.
Clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi.
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
Tyson Condie.
Introduction To Windows Azure Cloud
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
BIG DATA APPLICATIONS & ANALYTICS LOOKING AT INDIVIDUAL HPCABDS SOFTWARE LAYERS 1/26/2015 Cloud Computing Software 1 Geoffrey Fox January BigDat.
Tachyon: memory-speed data sharing Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica Good morning everyone. My name is Haoyuan,
1 Next Few Classes Networking basics Protection & Security.
DELIVERING THE ENTERPRISE FABRIC FOR BIG DATA Aiaz Kazi SVP, Platform Strategy and Adoption
1 CS6320 – SW Engineering of Web- Based Systems L. Grewe.
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
CS5412: SHEDDING LIGHT ON THE CLOUDY FUTURE Ken Birman 1 Lecture XXV.
1 CS : Project Suggestions Ion Stoica and Ali Ghodsi ( September 14, 2015.
Datalayer Notebook Allows Data Scientists to Play with Big Data, Build Innovative Models, and Share Results Easily on Microsoft Azure MICROSOFT AZURE ISV.
Welcome to the Intermountain Big Data Conference! 2 Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny U Student – Math major.
CMDBs: Above and Beyond…
Berkeley Data Analytics Stack Prof. Chi (Harold) Liu November 2015.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
Bridging the Gap Workshop George Brett Russ Hobby Internet2 Fall Member Meeting September 20, 2005 George Brett Russ Hobby Internet2 Fall Member Meeting.
Lesson Learned: Teacher time is $$$.  Laptops in class means no trips to computer labs  Fewer papers to hand out or collect.  Writing-revising process.
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.
Panel Discussion Software Defined Ecosystems June BigSystem Software-Defined Ecosystems at HPDC Vancouver Canada Geoffrey Fox.
LIMPOPO DEPARTMENT OF ECONOMIC DEVELOPMENT, ENVIRONMENT AND TOURISM The heartland of southern Africa – development is about people! 2015 ICT YOUTH CONFERENCE.
Distributed Process Discovery From Large Event Logs Sergio Hernández de Mesa {
Big Data Yuan Xue CS 292 Special topics on.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
Stream Processing with Tamás István Ujj
What is it and why it matters? Hadoop. What Is Hadoop? Hadoop is an open-source software framework for storing data and running applications on clusters.
Spark on Entropy : A Reliable & Efficient Scheduler for Low-latency Parallel Jobs in Heterogeneous Cloud Huankai Chen PhD Student at University of Kent.
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CLOUD COMPUTING
Big Data is a Big Deal!.
Give Your Data the Edge A Scalable Data Delivery Platform
Introduction to Spark Streaming for Real Time data analysis
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Berkeley Data Analytics Stack (BDAS) Overview
Applications SPIDAL MIDAS ABDS
CS 262a: Advanced Topics in Computer Systems
DeFacto Planning on the Powerful Microsoft Azure Platform Puts the Power of Intelligent and Timely Planning at Any Business Manager’s Fingertips Partner.
E2E Arguments & Project Suggestions (Lecture 4, cs262a)
Some Takeaways… (Lecture 26, cs262a)
Department of Intelligent Systems Engineering
Digital Science Center
Big-Data Analytics with Azure HDInsight
CS 239 – Big Data Systems Fall 2018
Presentation transcript:

1 CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (

Big Data First papers:  2003: The Google file system paper  2004: The MapReduce paper Today every major system & networking conference has Big Data sessions

Big Data Impact Already helped create new business Already helped disrupt existing businesses  Retail  Rental  Taxi  home appliances  …

Big Data Stack Data Processing Layer Resource Management Layer Storage Layer

Hadoop Stack Data Processing Layer Resource Management Layer Storage Layer … Hadoop MR HivePig ImpalaStorm Hadoop Yarn HDFS, S3, …

The Berkeley AMPLab January 2011 – 2017  8 faculty  > 40 students  3 software engineer team Organized for collaboration 3 day retreats (twice a year) 220 campers (100+ companies ) AMPCamp3 (August, 2013)

The Berkeley AMPLab Governmental and industrial funding: Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

BDAS Stack Data Processing Layer Resource Management Layer Storage Layer Mesos Spark Streaming Shark SQL BlinkDB GraphX MLlib MLBase HDFS, S3, … Tachyon

Mesos HDFS, S3, … Tachyon Spark Streaming Shark SQL BlinkDB GraphX MLlib MLBase BDAS & Hadoop fitting together Hadoop Yarn HDFS, S3, …

Mesos HDFS, S3, … Tachyon How do BDAS & Hadoop fit together? Hadoop Yarn HDFS, S3, … Spark Streaming Shark SQL BlinkDB GraphX MLlib MLBase Spark Stramin g Shark SQL Graph X ML library BlinkDB MLbas e SparkHadoop MR HivePig Impal a Storm

Mesos HDFS, S3, … Tachyon How do BDAS & Hadoop fit together? Hadoop Yarn HDFS, S3, … Spark Stramin g Shark SQL Graph X ML library BlinkDB MLbas e SparkHadoop MR HivePig Impal a Storm

This Class Learn about state-of-art research in Big Data Work on an exciting project Hopefully start next generation of impactful projects

13 Grading Project: 60% Class presentations: 40%  Around 2 papers per student  See Randy’s guidelines for leading discussion on papers F07/LeadingPapers.pdfhttp://bnrg.eecs.berkeley.edu/~randy/Courses/CS294. F07/LeadingPapers.pdf

Administrative Information Class website: Office Hours (Soda 465D):  TBA Create an (anonymized) blog account for paper reviews if you don’t have one yet (e.g.,  Sent me an by Monday, August 31, with your blog url  Preferred for the class list 14

15 Papers Is the problem real? What is the solution’s main idea (nugget)? Why is solution different from previous work?  Are system assumptions different?  Is workload different?  Is problem new? Does the paper (or do you) identify any fundamental/hard trade-offs?

16 Papers (cont’d) Do you think the work will be influential in 10 years?  Why or why not? Predicting the future hard, but worth a try  Look at past examples for inspiration

17 Streaming Over TCP Countless papers:  Why cannot be done…  New protocols to do it… Today  Virtually all streaming over TCP  Trend to stream over HTTP!

18 Why did it Succeed?

19 Multicast Countless papers:  Why world will come to a standstill without multicast…  New protocols to do it… Today  Multicast is used only in enterprise settings at best  Overlay multicast widely used in the Internet CDN based, e.g., WorldCup, March Madness, Iinagurations,... P2P, mostly popular outside US (e.g., China)

20 Why Did it Fail?

21 Shared Memory Countless papers:  How shared memory simplifies programming parallel computers  Many, many systems proposed and build Today:  Message passing (MPI) took over as the de facto standard for writing parallel applications

22 Why Did it Fail?

23 Network Computer Big in 90s  Promoted by an alliance of Sun, Oracle, Acorn Promise: many of advantages of cloud computing  Easy to manage  Application sharing …… Failed miserably

24 Why Did it Fail?

Coming Back: ChromeOS Will it succeed this time? 25

26 What are Hard/Fundamental Tradeoffs? Brewer’s CAP conjecture: “Consistency, Availability, Partition-tolerance”, you can have only two in a distributed system In a in-order, reliable communication protocol cannot minimize overhead and latency simultaneously Hard to simultaneously maximize evolvability and performance