Big Data Platforms. Mihai Budiu, Oct 6 2014.

Presentation transcript:

My work
- Ph.D. from Carnegie Mellon, 2003
  - Hardware synthesis
  - Reconfigurable hardware
  - Compilers and computer architecture
- Researcher at Microsoft Research Silicon Valley
  - Computer security
  - Cloud computing infrastructure: distributed computation platforms, monitoring and debugging, performance analysis
  - Big data analysis and visualization
  - Large scale machine learning

500 Years Ago
- Tycho Brahe
- Johannes Kepler

The Laws of Planetary Motion
- Tycho’s measurements
- Kepler’s laws

The Large Hadron Collider
- 25 PB/year
- WLHC Grid: 200K computing cores

Genetic Code

Astronomy

Weather

The Webs
- Internet
- Facebook friends graph

Big Data

Big Computers

Talk Outline
- Motivation
- Dryad: A distributed runtime
- DryadLINQ: A compiler for Dryad
- Tools and applications
- Sketch: A billion-row spreadsheet

Design Space
[Figure labels: Throughput (batch), Latency (interactive), Internet, Data center, Data-parallel, Shared memory]

Dryad
- EuroSys 2007
- Continuously deployed in Microsoft since 2006
- Execution engine of Bing analytics
- > 10^5 machines
- Many PB of data analyzed daily
[Image: Dryad painting by Evelyn de Morgan]

Dryad = Execution Layer
- Job (application) ≈ Pipeline
- Dryad ≈ Shell
- Cluster ≈ Machine

2-D Piping
- Unix pipes (1-D): grep | sed | sort | awk | perl
- Dryad (2-D): grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
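The superscripts on the slide give the number of parallel instances of each stage. The following is only a toy, single-process Python sketch of that idea, not Dryad itself: the stage functions, the `repartition` helper, and the CRC-based routing are all assumptions made for illustration. Each stage runs once per partition, and records are redistributed when the next stage has a different width.

```python
import zlib
from typing import Callable, List, Sequence, Tuple

Partition = List[str]
Stage = Callable[[Partition], Partition]


def repartition(parts: Sequence[Partition], width: int) -> List[Partition]:
    """Redistribute every record into `width` partitions by a stable hash."""
    out: List[Partition] = [[] for _ in range(width)]
    for part in parts:
        for record in part:
            out[zlib.crc32(record.encode("utf-8")) % width].append(record)
    return out


def run_1d(lines: Partition, stages: Sequence[Stage]) -> Partition:
    """Classic pipe: one instance of each stage, applied in order."""
    for stage in stages:
        lines = stage(lines)
    return lines


def run_2d(lines: Partition, pipeline: Sequence[Tuple[Stage, int]]) -> Partition:
    """2-D pipe: each stage runs `width` times, once per partition."""
    parts: List[Partition] = [lines]
    for stage, width in pipeline:
        parts = [stage(p) for p in repartition(parts, width)]
    return [record for part in parts for record in part]


# Hypothetical stand-ins for grep, sed, and sort.
def grep(part: Partition) -> Partition:
    return [line for line in part if "error" in line]


def sed(part: Partition) -> Partition:
    return [line.replace("error", "ERR") for line in part]


def sort(part: Partition) -> Partition:
    return sorted(part)


lines = [f"line {i:02d} {'error' if i % 3 == 0 else 'ok'}" for i in range(20)]
one_d = run_1d(lines, [grep, sed, sort])
# The last stage has width 1 so that this toy sort sees every record.
two_d = run_2d(lines, [(grep, 4), (sed, 2), (sort, 1)])
```

In this sketch the final stage is deliberately a single instance; a real distributed sort, as in the slide's wider sort stage, would need range partitioning rather than hash partitioning.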

Virtualized 2-D Pipelines
- 2D
- DAG
- multi-machine
- virtualized

Dryad Job Structure
[Figure: job graph built from grep, sed, sort, awk, and perl vertices. Labels: Input files, Vertices (processes), Channels, Stage, Output files]

Dryad System Architecture
[Figure labels: job manager, cluster, job schedule, control plane, data plane, NS, Sched, RE, V; channels: Files, TCP, FIFO, Network]

Staging
1. Build
2. Send .exe
3. Start manager
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: GM code, vertex code, Name server, Remote execution service]

Distributed Collections
- Partition
- Collection
- .NET objects

LINQ
- Dryad => DryadLINQ

LINQ = .NET + Queries

    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);

    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };

DryadLINQ = LINQ + Dryad
[Figure: the C# query over `collection` from the previous slide is turned into a query plan (Dryad job) and C# vertex code, which run over the data to produce `results`]

Language Summary
- Where
- Select
- GroupBy
- OrderBy
- Aggregate
- Join

Very expressive

    var result = input.SelectMany(r => Mapper(r))
                      .GroupBy(r => Key(r))
                      .Select(g => Reducer(g));

- Map-Reduce
- Distributed sorting
- Iterative machine-learning (EM)
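The three-operator query on this slide has the shape of Map-Reduce: flatten the mapper output, group by key, then reduce each group. A minimal Python analogue, assuming a word-count job, may make the correspondence concrete. The `map_reduce` helper and its argument names are hypothetical, and the reducer here receives the key and the group separately, which differs slightly from the C# `Reducer(g)`.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

In = TypeVar("In")
Mid = TypeVar("Mid")
Out = TypeVar("Out")


def map_reduce(
    records: Iterable[In],
    mapper: Callable[[In], Iterable[Mid]],
    key: Callable[[Mid], Hashable],
    reducer: Callable[[Hashable, List[Mid]], Out],
) -> List[Out]:
    # SelectMany: apply the mapper and flatten its outputs.
    mapped = [m for r in records for m in mapper(r)]
    # GroupBy: collect mapped values that share a key.
    groups: Dict[Hashable, List[Mid]] = defaultdict(list)
    for m in mapped:
        groups[key(m)].append(m)
    # Select: reduce each group to a single result.
    return [reducer(k, g) for k, g in groups.items()]


# Word count as the usual illustration.
counts = dict(
    map_reduce(
        ["a b a", "b c"],
        mapper=lambda line: line.split(),
        key=lambda word: word,
        reducer=lambda word, group: (word, len(group)),
    )
)
```

This version runs in one process; in DryadLINQ the same query shape would be compiled into a distributed plan.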

Debugging DryadLINQ jobs

Distributed performance counters

Training Kinect
[Figure labels: Depth map, Body parts, Classifier, Xbox, GPU]

Learn from Many Examples
[Figure labels: Decision Tree Classifier, Machine learning]

Bandwidth hierarchy

Principles
- Visualizations are bounded data displays
- All computations are sketches
- Sketch is a runtime for
  (1) running streaming (sketching) algorithms
  (2) implementing visualizations with bounded data renderings

Streaming algorithms
- Sketches = randomized streaming algorithms
- Input = set of size n
- Result same independent of the order
- Memory = O(log(n))
- Multi-pass
- Linear input transformations
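To illustrate the properties listed above (bounded memory, a result that does not depend on input order, and partial results that can be merged), here is a small Python sketch of a bottom-k sample: keep the k items with the smallest hash values. This particular algorithm is chosen for illustration and is not taken from the slides; the function names, the CRC-based hash, and the assumption that input items are distinct are all mine.

```python
import heapq
import zlib
from typing import Iterable, List, Tuple

Entry = Tuple[int, str]  # (hash value, item)


def stable_hash(item: str) -> int:
    """Deterministic hash, so results do not vary between runs."""
    return zlib.crc32(item.encode("utf-8"))


def create(items: Iterable[str], k: int) -> List[Entry]:
    """One streaming pass; keeps only the k smallest (hash, item) pairs.

    Assumes the input items are distinct, as in 'input = set of size n'.
    """
    return heapq.nsmallest(k, ((stable_hash(x), x) for x in items))


def combine(parts: Iterable[List[Entry]], k: int) -> List[Entry]:
    """Merge partial sketches; the set drops entries seen in several parts."""
    merged = {entry for part in parts for entry in part}
    return heapq.nsmallest(k, merged)


items = [f"row-{i}" for i in range(1000)]
whole = create(items, 8)
# Same input in a different order.
reordered = create(list(reversed(items)), 8)
# Same input split across four hypothetical machines.
shards = [items[i::4] for i in range(4)]
merged = combine([create(shard, 8) for shard in shards], 8)
```

Because the k smallest hashes of a union are always among the k smallest hashes of each part, merging the per-shard sketches gives the same answer as sketching the whole input.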

4 billion rows on 155 machines

Spreadsheet operations
- Browsing/scrolling
- Filtering
  - Using predicates
  - Heavy hitters
  - Sampling
- Searching
- Sorting
- Computing new columns
- Set operations (intersection, union, etc.)
- Charting

Histograms

Heat Maps

Sketch distributed service
[Figure: several servers, each holding local data behind a Sketch service]

DataSets = distributed objects
[Figure labels: Client, Application, DataSet, Network, Servers, T]

Sketch Spreadsheet architecture
[Figure: layered architecture]
- Storage layer: SQL Server, CSV files, column store, Cosmos
- DataSet: distributed objects
- Table operations: spreadsheet logic
- GUI: spreadsheet display

DataSet API

    interface IDataSet<T> {
        IDataSet<S> Map<S>(Func<T, S> f);
        IDataSet<Pair<T, S>> Zip<S>(IDataSet<S> other);
        R Sketch<R>(ISketch<T, R> sketch);
    }

    interface ISketch<T, R> {
        R Create(T data);
        R Combine(List<R> parts);
    }
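The interface above can be read as a tree of datasets: leaves hold local data, and inner nodes forward each operation to their children. Below is a rough Python rendering of that reading, not the actual Sketch implementation. The class names `LocalDataSet` and `ParallelDataSet` borrow the "Local" and "Parallel" labels from the later slides, and `RowCount` is a hypothetical example sketch.

```python
from abc import ABC, abstractmethod
from typing import Callable, Generic, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
S = TypeVar("S")
R = TypeVar("R")


class Sketch(ABC, Generic[T, R]):
    @abstractmethod
    def create(self, data: T) -> R:
        """Summarize one local piece of data."""

    @abstractmethod
    def combine(self, parts: List[R]) -> R:
        """Merge several summaries into one."""


class DataSet(ABC, Generic[T]):
    @abstractmethod
    def map(self, f: Callable[[T], S]) -> "DataSet[S]": ...

    @abstractmethod
    def zip(self, other: "DataSet[S]") -> "DataSet[Tuple[T, S]]": ...

    @abstractmethod
    def sketch(self, sk: Sketch[T, R]) -> R: ...


class LocalDataSet(DataSet[T]):
    """Leaf: holds one piece of data in the current address space."""

    def __init__(self, data: T) -> None:
        self.data = data

    def map(self, f):
        return LocalDataSet(f(self.data))

    def zip(self, other):
        return LocalDataSet((self.data, other.data))

    def sketch(self, sk):
        return sk.create(self.data)


class ParallelDataSet(DataSet[T]):
    """Inner node: forwards each operation to its children."""

    def __init__(self, children: Sequence[DataSet[T]]) -> None:
        self.children = list(children)

    def map(self, f):
        return ParallelDataSet([c.map(f) for c in self.children])

    def zip(self, other):
        # Assumes both trees have the same shape.
        pairs = zip(self.children, other.children)
        return ParallelDataSet([a.zip(b) for a, b in pairs])

    def sketch(self, sk):
        return sk.combine([c.sketch(sk) for c in self.children])


class RowCount(Sketch[List[int], int]):
    """Hypothetical sketch: counts rows."""

    def create(self, data):
        return len(data)

    def combine(self, parts):
        return sum(parts)


ds = ParallelDataSet([LocalDataSet([1, 2, 3]), LocalDataSet([4, 5])])
evens = ds.map(lambda rows: [x for x in rows if x % 2 == 0])
total_rows = ds.sketch(RowCount())
even_rows = evens.sketch(RowCount())
zipped = ds.zip(evens)
```

Children are visited sequentially here; a real implementation would run them concurrently and place some of them behind a network proxy.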

DataSet Implementations
[Figure labels: Application, GUI, Client, Network, RMI layer, Proxy ref, Dataset interface, Proxy, Parallel, Local, Core parallelism, Rack aggregation, Cluster parallelism, Server 0 ... Server n, Rack 0 ... Rack r, Address space, T]

Map(f)
[Figure: f applied to a Proxy/Parallel/Local dataset tree holding T yields a tree of the same shape holding S]

Sketch(s)
[Figure: s.Create turns each local T into a partial result R; s.Combine merges the partial results on the way back up through the Parallel and Proxy datasets]

Zip
[Figure: a dataset tree holding T and one holding S are zipped into a tree of the same shape holding (T, S) pairs]

Histograms
- CDF
- 2D histogram

Computing a histogram
[Figure: timeline across Client, Server 1 ... Server n. Labels: User click, Data range sketch, Histogram 1D + 2D composite sketch, Compute, Render, Display histogram. Timings: t_r, t_h, t_a]
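One way to read the labels on this slide is as a two-pass computation: a first sketch finds the data range, and a second sketch counts values into buckets over that range. The Python sketch below follows that reading for a 1D histogram only. The ordering, the function names, and the equal-width bucketing are assumptions, and each shard is assumed to be non-empty.

```python
from typing import List, Sequence, Tuple

Range = Tuple[float, float]


def range_create(values: Sequence[float]) -> Range:
    """Pass 1, per shard: local minimum and maximum."""
    return (min(values), max(values))


def range_combine(parts: Sequence[Range]) -> Range:
    """Pass 1, merge: global minimum and maximum."""
    return (min(lo for lo, _ in parts), max(hi for _, hi in parts))


def hist_create(values: Sequence[float], lo: float, hi: float, buckets: int) -> List[int]:
    """Pass 2, per shard: counts in equal-width buckets over [lo, hi]."""
    counts = [0] * buckets
    width = (hi - lo) / buckets or 1.0  # avoid dividing by zero when lo == hi
    for v in values:
        # The maximum value falls into the last bucket.
        index = min(int((v - lo) / width), buckets - 1)
        counts[index] += 1
    return counts


def hist_combine(parts: Sequence[List[int]]) -> List[int]:
    """Pass 2, merge: add bucket counts elementwise."""
    return [sum(column) for column in zip(*parts)]


# Three hypothetical servers, each holding one shard.
shards = [[1.0, 2.0, 3.0], [4.0, 9.0], [0.0, 10.0]]
lo, hi = range_combine([range_create(s) for s in shards])
histogram = hist_combine([hist_create(s, lo, hi, 5) for s in shards])
```

The result has a fixed number of buckets regardless of the number of rows, which matches the deck's point that visualizations are bounded data displays.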

Some numbers
- Windows Server 2012 R2
- 8-core 2.1GHz AMD Opteron 2373 EE
- > 16GB RAM
- 3 x 1TB disks using RAID
- Machines in 5 racks
- 1Gbps Ethernet

Null Sketch
[Plot: Time (ms) vs. Machines]

Histogram computation
- 26M rows/machine
- Scale-out
[Plot: Time (ms) vs. machines]

Conclusions
- Big data is here to stay
- Better tools are needed
- Quest for high-level abstractions for building distributed systems
  - Execution graphs
  - Distributed collections
  - Higher-order transformations
  - Distributed stateful objects
  - Sketching algorithms

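Data-Parallel Computation
[Figure: three software stacks compared layer by layer, with an Application layer on top]
- Language: Sawzall, FlumeJava (Sawzall, Java) | DryadLINQ, Scope (LINQ, SQL) | Pig, Hive (≈SQL)
- Execution: Map-Reduce | Dryad | Hadoop
- Storage: GFS, BigTable | Cosmos, Azure, SQL Server | HDFS, S3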