Big Data Platforms Mihai Budiu, Oct
My work Ph.D. from Carnegie Mellon, 2003 Hardware synthesis Reconfigurable hardware Compilers and computer architecture Researcher at Microsoft Research Silicon Valley Computer security Cloud computing infrastructure: distributed computation platforms monitoring and debugging performance analysis Big data analysis and visualization Large scale machine learning 2
500 Years Ago 3 Tycho Brahe ( ) Johannes Kepler ( )
The Laws of Planetary Motion 4 Tycho’s measurements Kepler’s laws
The Large Hadron Collider 5 25 PB/year WLHC Grid: 200K computing cores
Genetic Code 6
Astronomy 7
Weather 8
The Webs 9 Internet Facebook friends graph
Big Data 10
Big Computers 11
Talk Outline 12 Motivation Dryad: A distributed runtime DryadLINQ: A compiler for Dryad Tools and applications Sketch: A billion-row spreadsheet
Design Space 13 Throughput (batch) Latency (interactive) Internet Data center Data-parallel Shared memory
Dryad EuroSys 2007 Continuously deployed in Microsoft since 2006 Execution engine of Bing analytics > 10^5 machines Many PB of data analyzed daily 14 Dryad painting by Evelyn de Morgan
Dryad = Execution Layer 15 Job (application) ≈ Pipeline; Dryad ≈ Shell; Cluster ≈ Machine
2-D Piping 16 Unix Pipes: 1-D grep | sed | sort | awk | perl Dryad: 2-D grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
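The 2-D pipe idea can be illustrated with a small Python sketch. It is not Dryad code: the helper names (`repartition`, `run_stage`, `run_pipeline`) and the round-robin repartitioning between stages are assumptions made for illustration, and threads on one machine stand in for processes on a cluster. Each stage is run as several parallel copies, which is what the per-stage counts in the Dryad pipeline denote.

```python
from concurrent.futures import ThreadPoolExecutor


def repartition(parts, n):
    """Redistribute all records round-robin into n partitions (assumed policy)."""
    flat = [rec for part in parts for rec in part]
    return [flat[i::n] for i in range(n)]


def run_stage(func, parts, width):
    """Run `width` copies of one stage, each on its own partition."""
    parts = repartition(parts, width)
    with ThreadPoolExecutor(max_workers=width) as pool:
        return list(pool.map(func, parts))


def run_pipeline(stages, records):
    """stages is a list of (function, width) pairs, like `grep^3 | upper^2 | sort^1`."""
    parts = [list(records)]
    for func, width in stages:
        parts = run_stage(func, parts, width)
    return parts


def grep_a(part):
    return [r for r in part if "a" in r]


def upper(part):
    return [r.upper() for r in part]


def sort_part(part):
    return sorted(part)


words = ["pear", "apple", "kiwi", "banana", "fig", "grape"]
result = run_pipeline([(grep_a, 3), (upper, 2), (sort_part, 1)], words)
print(result)
```

The last stage has width 1, so its single copy sees every surviving record and the output is globally sorted.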
Virtualized 2-D Pipelines 17
Virtualized 2-D Pipelines 21 2D DAG multi-machine virtualized
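Dryad Job Structure 22 [Figure: a job graph in which input files feed vertices (processes such as grep, sed, sort, awk, perl), vertices are grouped into stages and connected by channels, and the final vertices write output files]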
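Dryad System Architecture 23 [Figure: the job manager holds the job schedule and talks over the control plane to the cluster services (NS, Sched, RE); vertices (V) run in the cluster and exchange data over the data plane: files, TCP, FIFO, network]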
Staging [Figure labels: GM code, vertex code, name server, remote execution service] 1. Build 2. Send .exe 3. Start manager 4. Query cluster resources (name server) 5. Generate graph 6. Initialize vertices (remote execution service) 7. Serialize vertices 8. Monitor vertex execution
Talk Outline 25 Motivation Dryad: A distributed runtime DryadLINQ: A compiler for Dryad Tools and applications Sketch: A billion-row spreadsheet
Distributed Collections 26 Partition Collection .NET objects
LINQ 27 Dryad => DryadLINQ
LINQ = .NET + Queries 28 Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad 29 Collection<T> collection; bool IsLegal(Key k); string Hash(Key k); var results = from c in collection where IsLegal(c.key) select new { hash = Hash(c.key), c.value }; [Figure: the C# query over collection is compiled into a query plan (Dryad job) and C# vertex code; the job reads the data and produces results]
Language Summary 30 Where Select GroupBy OrderBy Aggregate Join
Very expressive 31 var result = input.SelectMany(r => Mapper(r)).GroupBy(r => Key(r)).Select(g => Reducer(g)); Map-Reduce Distributed sorting Iterative machine-learning (EM)
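To make the Map-Reduce claim concrete, here is a Python analogue of the `SelectMany` / `GroupBy` / `Select` chain. The helper functions below are hypothetical stand-ins written for this sketch, not part of LINQ or DryadLINQ, and they run on a single in-memory list rather than a distributed collection.

```python
from collections import defaultdict


def select_many(items, mapper):
    """Apply mapper to each item and flatten the results (like SelectMany)."""
    return [out for item in items for out in mapper(item)]


def group_by(items, key):
    """Group items by key, preserving first-seen key order (like GroupBy)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return list(groups.items())


def select(items, f):
    """Apply f to each item (like Select)."""
    return [f(item) for item in items]


def map_reduce(source, mapper, key, reducer):
    """input.SelectMany(Mapper).GroupBy(Key).Select(Reducer) in one call."""
    return select(group_by(select_many(source, mapper), key), reducer)


# Word count: the mapper splits lines, the key is the word itself,
# and the reducer turns each (word, occurrences) group into a count.
lines = ["a b a", "b c"]
counts = map_reduce(
    lines,
    mapper=lambda line: line.split(),
    key=lambda word: word,
    reducer=lambda group: (group[0], len(group[1])),
)
print(counts)
```

Word count is the usual smoke test for this pattern: only the three lambdas change from one Map-Reduce job to another.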
Talk Outline 32 Motivation Dryad: A distributed runtime DryadLINQ: A compiler for Dryad Tools and applications Sketch: A billion-row spreadsheet
Debugging DryadLINQ jobs 33
Distributed performance counters 34
Training Kinect 35 Depth map Body parts Classifier Xbox GPU
Learn from Many Examples 36 Decision Tree Classifier Machine learning
Talk Outline 37 Motivation Dryad: A distributed runtime DryadLINQ: A compiler for Dryad Tools and applications Sketch: A billion-row spreadsheet
Bandwidth hierarchy
Principles 39 Visualizations are bounded data displays All computations are sketches Sketch is a runtime for (1) running streaming (sketching) algorithms (2) implementing visualizations with bounded data renderings
Streaming algorithms Sketches = randomized streaming algorithms Input = set of size n Result same independent of the order Memory = O(log(n)) Multi-pass Linear input transformations 40
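One small instance of an order-independent, mergeable streaming summary is bottom-k sampling of distinct values: keep the k values with the smallest hash. This technique is chosen here for illustration and is not named in the talk; the function names are hypothetical, and the memory used is O(k) rather than the O(log(n)) bound quoted on the slide.

```python
import hashlib
import heapq


def stable_hash(value):
    """Deterministic 64-bit hash, so every machine ranks values the same way."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def create(rows, k):
    """One pass over rows, keeping only the k distinct values with smallest hash."""
    sample = []
    for row in rows:
        if row not in sample:
            sample = heapq.nsmallest(k, sample + [row], key=stable_hash)
    return sample


def combine(parts, k):
    """Merge partial samples: the bottom-k of the union of the bottom-k's."""
    merged = []
    for part in parts:
        for row in part:
            if row not in merged:
                merged.append(row)
    return heapq.nsmallest(k, merged, key=stable_hash)


rows = list(range(100))
forward = create(rows, 5)
backward = create(reversed(rows), 5)
merged = combine([create(rows[:40], 5), create(rows[40:], 5)], 5)
print(forward == backward == merged)
```

Because the result depends only on the set of values and the hash function, it does not matter in what order rows arrive or how they are split across machines.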
4 billion rows on 155 machines
Spreadsheet operations Browsing/scrolling Filtering Using predicates Heavy hitters Sampling Searching Sorting Computing new columns Set operations (intersection, union, etc.) Charting 42
Histograms
Heat Maps
Sketch distributed service 45 data Sketch service data Sketch service data Sketch service data Sketch service
DataSets = distributed objects 46 [Figure: a client application holds a DataSet reference; across the network, the servers hold the partitions of type T]
Sketch Spreadsheet architecture 47 DataSet SQL Server, CSV Files, Column store, Cosmos Storage layer Table operations GUI Distributed objects Spreadsheet logic Spreadsheet display
DataSet API 48 interface IDataSet<T> { IDataSet<S> Map<S>(Func<T, S> f); IDataSet<Pair<T, S>> Zip<S>(IDataSet<S> other); R Sketch<R>(ISketch<T, R> sketch); } interface ISketch<T, R> { R Create(T data); R Combine(List<R> parts); }
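A minimal Python sketch of this API may help show how the three operations compose. It assumes two implementations, a leaf holding one partition and a parent holding children, which mirror the Local and Parallel datasets named in the talk; the class names, `RowCount`, and `Total` are hypothetical, and everything runs in one address space with no network layer.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Sketch(Generic[T, R]):
    """Python stand-in for ISketch: summarize one partition, then merge summaries."""

    def create(self, data: T) -> R:
        raise NotImplementedError

    def combine(self, parts: List[R]) -> R:
        raise NotImplementedError


class LocalDataSet(Generic[T]):
    """Leaf: holds one partition in this address space."""

    def __init__(self, data: T):
        self.data = data

    def map(self, f):
        return LocalDataSet(f(self.data))

    def zip(self, other):
        return LocalDataSet((self.data, other.data))

    def sketch(self, sk):
        return sk.create(self.data)


class ParallelDataSet(Generic[T]):
    """Parent: forwards each operation to its children and merges sketch results."""

    def __init__(self, children):
        self.children = list(children)

    def map(self, f):
        return ParallelDataSet([c.map(f) for c in self.children])

    def zip(self, other):
        pairs = zip(self.children, other.children)
        return ParallelDataSet([a.zip(b) for a, b in pairs])

    def sketch(self, sk):
        return sk.combine([c.sketch(sk) for c in self.children])


class RowCount(Sketch):
    def create(self, data):
        return len(data)

    def combine(self, parts):
        return sum(parts)


class Total(Sketch):
    def create(self, data):
        return sum(data)

    def combine(self, parts):
        return sum(parts)


ds = ParallelDataSet([LocalDataSet([1, 2, 3]), LocalDataSet([4, 5])])
doubled = ds.map(lambda rows: [2 * x for x in rows])
row_count = ds.sketch(RowCount())
total = doubled.sketch(Total())
zipped_total = (
    ds.zip(doubled)
    .map(lambda pair: [a + b for a, b in zip(pair[0], pair[1])])
    .sketch(Total())
)
print(row_count, total, zipped_total)
```

Only `sketch` returns data to the caller; `map` and `zip` return new datasets that stay where the partitions live, which is what keeps the client-side result small.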
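DataSet Implementations 49 [Figure: three implementations of the Dataset interface. In the client address space, the application and GUI use a Parallel dataset of Proxy datasets (cluster parallelism); each Proxy holds a reference across the network and RMI layer to a Parallel dataset on a server (Server 0 .. Server n, in Rack 0 .. Rack r, with rack aggregation); each server-side Parallel dataset holds Local datasets over partitions of type T (core parallelism)]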
Map(f) [Figure: Proxy, Local, and Parallel datasets before and after the call; f is applied to each partition of type T and yields a partition of type S]
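Sketch(s) 51 [Figure: s.Create runs on each partition of type T and produces a partial result R; s.Combine merges the partial results through the Local, Parallel, and Proxy datasets] interface ISketch<T, R> { R Create(T data); R Combine(List<R> parts); }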
Zip 52 [Figure: two datasets (Proxy, Local, Parallel) with partitions of type T and of type S are combined into one dataset with partitions of type (T, S)]
Histograms 53 CDF 2D histogram
Computing a histogram 54 [Figure: timeline between the client and Server 1 .. Server n: user click; data range sketch; histogram 1D + 2D composite sketch computed on the servers; render; display histogram]
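The two sketches in this flow can be approximated in a few lines of Python. This is a sketch under assumptions: plain functions stand in for the `Create`/`Combine` pairs, partitions are non-empty in-memory lists, buckets are equal-width, and only the 1-D histogram is shown. The range sketch runs first because the bucket boundaries must be the same on every server before counts can be added.

```python
def range_create(rows):
    """Per-partition data range: (min, max). Assumes a non-empty partition."""
    return (min(rows), max(rows))


def range_combine(parts):
    """Global range is the min of the mins and the max of the maxes."""
    return (min(lo for lo, _ in parts), max(hi for _, hi in parts))


def hist_create(rows, lo, hi, buckets):
    """Per-partition bucket counts over the agreed global range [lo, hi]."""
    counts = [0] * buckets
    width = (hi - lo) / buckets or 1.0  # avoid a zero width when lo == hi
    for x in rows:
        index = min(int((x - lo) / width), buckets - 1)  # hi goes in the last bucket
        counts[index] += 1
    return counts


def hist_combine(parts):
    """Histograms with identical buckets merge by element-wise addition."""
    return [sum(column) for column in zip(*parts)]


# Three partitions standing in for three servers.
partitions = [[1, 2, 9], [4, 5, 10], [0, 7]]

# Pass 1: data range sketch.
lo, hi = range_combine([range_create(p) for p in partitions])

# Pass 2: histogram sketch using the range from pass 1.
hist = hist_combine([hist_create(p, lo, hi, 5) for p in partitions])
print((lo, hi), hist)
```

The result has a fixed size (one count per bucket) no matter how many rows each server scans, which matches the principle that a visualization is a bounded data display.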
Some numbers Windows Server 2012 R2 8-core 2.1GHz AMD Opteron 2373 EE > 16GB RAM 3 x 1TB disks using RAID machines 5 racks 1Gbps Ethernet 55
Null Sketch 56 [Plot: time (ms) vs. number of machines]
Histogram computation 57 26M rows/machine Scale-out [Plot: time (ms) vs. number of machines]
Conclusions Big data is here to stay Better tools are needed Quest for high-level abstractions for building distributed systems Execution graphs Distributed collections Higher-order transformations Distributed stateful objects Sketching algorithms 58
Data-Parallel Computation 60 Four layers across three stacks: Application: Sawzall, Java | LINQ, SQL | ≈SQL; Language: Sawzall, FlumeJava | DryadLINQ, Scope | Pig, Hive; Execution: Map-Reduce | Dryad | Hadoop; Storage: GFS, BigTable | Cosmos, Azure, SQL Server | HDFS, S3