Published by Jeffry Reeves. Modified over 9 years ago.
1
Big Data Platforms Mihai Budiu, Oct 6 2014
2
My work
Ph.D. from Carnegie Mellon, 2003
  Hardware synthesis
  Reconfigurable hardware
  Compilers and computer architecture
Researcher at Microsoft Research Silicon Valley, 2004-2014
  Computer security
  Cloud computing infrastructure: distributed computation platforms, monitoring and debugging, performance analysis
  Big data analysis and visualization
  Large scale machine learning
3
500 Years Ago
Tycho Brahe (1546-1601)
Johannes Kepler (1571-1630)
4
The Laws of Planetary Motion
Tycho’s measurements
Kepler’s laws
5
The Large Hadron Collider
25 PB/year
WLHC Grid: 200K computing cores
6
Genetic Code
7
Astronomy
8
Weather
9
The Webs
Internet
Facebook friends graph
10
Big Data
11
Big Computers
12
Talk Outline
Motivation
Dryad: A distributed runtime
DryadLINQ: A compiler for Dryad
Tools and applications
Sketch: A billion-row spreadsheet
13
Design Space
[Figure: design-space diagram. Axis labels: Throughput (batch), Latency (interactive). Regions: Internet, Data center, Data-parallel, Shared memory]
14
Dryad
Eurosys 2007
Continuously deployed in Microsoft since 2006
Execution engine of Bing analytics
> 10^5 machines
Many PB of data analyzed daily
[Image: Dryad painting by Evelyn de Morgan]
15
Dryad = Execution Layer
Job (application) ≈ Pipeline
Dryad ≈ Shell
Cluster ≈ Machine
16
2-D Piping
Unix Pipes: 1-D
  grep | sed | sort | awk | perl
Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
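The 2-D piping idea can be illustrated with a small Python sketch. This is not Dryad code: the helper names (`repartition`, `run_2d_pipeline`) and the round-robin partitioning are assumptions made for illustration, and the "vertices" here run one after another in a single process instead of on separate machines.

```python
from typing import Callable, List, Tuple

# A stage is (function applied to one partition, number of parallel copies).
Stage = Tuple[Callable[[List[str]], List[str]], int]


def repartition(records: List[str], width: int) -> List[List[str]]:
    """Split records round-robin into `width` partitions, one per vertex."""
    parts: List[List[str]] = [[] for _ in range(width)]
    for i, rec in enumerate(records):
        parts[i % width].append(rec)
    return parts


def run_2d_pipeline(records: List[str], stages: List[Stage]) -> List[str]:
    """Run every stage as `width` independent copies, then feed the next stage."""
    for func, width in stages:
        partitions = repartition(records, width)
        # Each partition stands in for one vertex of the stage.
        outputs = [func(part) for part in partitions]
        records = [rec for out in outputs for rec in out]
    return records


def grep_error(lines: List[str]) -> List[str]:
    return [line for line in lines if "error" in line]


def to_upper(lines: List[str]) -> List[str]:
    return [line.upper() for line in lines]


logs = ["error: disk", "ok", "error: net", "ok", "error: cpu"]
# Analogue of grep^3 | tr^2 | sort^1: the widths differ from stage to stage.
result = run_2d_pipeline(logs, [(grep_error, 3), (to_upper, 2), (sorted, 1)])
print(result)
```

The horizontal dimension is the list of stages and the vertical dimension is the per-stage width, which is the sense in which the pipeline is two-dimensional.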
17
Virtualized 2-D Pipelines
18
Virtualized 2-D Pipelines
19
Virtualized 2-D Pipelines
20
Virtualized 2-D Pipelines
21
Virtualized 2-D Pipelines
2D DAG
multi-machine
virtualized
22
Dryad Job Structure
[Figure: job graph built from grep, sed, sort, awk, and perl vertices. Labels: Input files, Vertices (processes), Channels, Stage, Output files]
23
Dryad System Architecture
[Figure: a job manager holding the job schedule talks to the cluster over the control plane (NS, Sched, RE); vertices (V) exchange data over the data plane (Files, TCP, FIFO, Network)]
24
Staging
1. Build
2. Send .exe
3. Start manager
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
[Figure labels: GM code, vertex code, Name server, Remote execution service]
25
Talk Outline
Motivation
Dryad: A distributed runtime
DryadLINQ: A compiler for Dryad
Tools and applications
Sketch: A billion-row spreadsheet
26
Distributed Collections
Partition
Collection
.Net objects
27
LINQ
Dryad => DryadLINQ
28
LINQ = .Net + Queries
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);
// An anonymous-type member built from a method call needs an explicit name.
var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
29
DryadLINQ = LINQ + Dryad
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);
var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
[Figure labels: C# (collection, results), Query plan (Dryad job), C# vertex code, Data]
30
Language Summary
Where
Select
GroupBy
OrderBy
Aggregate
Join
31
Very expressive
var result = input.SelectMany(r => Mapper(r))
                  .GroupBy(r => Key(r))
                  .Select(g => Reducer(g));
Map-Reduce
Distributed sorting
Iterative machine-learning (EM)
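The SelectMany / GroupBy / Select chain on this slide is the Map-Reduce pattern. A minimal single-machine sketch in Python follows; it is not DryadLINQ, and the function name `map_reduce` and the word-count example are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(records: Iterable, mapper: Callable, key: Callable, reducer: Callable) -> List:
    """Mirror SelectMany -> GroupBy -> Select on an in-memory collection."""
    # SelectMany: each record may expand into zero or more items.
    mapped = [item for record in records for item in mapper(record)]
    # GroupBy: collect the items that share a key.
    groups: Dict = defaultdict(list)
    for item in mapped:
        groups[key(item)].append(item)
    # Select: reduce each group to one result.
    return [reducer(k, group) for k, group in groups.items()]


lines = ["a b a", "b c"]
counts: List[Tuple[str, int]] = map_reduce(
    lines,
    mapper=lambda line: line.split(),      # emit the words of a line
    key=lambda word: word,                 # group identical words
    reducer=lambda word, group: (word, len(group)),
)
print(sorted(counts))
```

In a distributed setting the grouping step is where data moves between machines; here it is just a dictionary.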
32
Talk Outline
Motivation
Dryad: A distributed runtime
DryadLINQ: A compiler for Dryad
Tools and applications
Sketch: A billion-row spreadsheet
33
Debugging DryadLINQ jobs
34
Distributed performance counters
35
Training Kinect
Depth map
Body parts
Classifier
Xbox GPU
36
Learn from Many Examples
Decision Tree Classifier
Machine learning
37
Talk Outline
Motivation
Dryad: A distributed runtime
DryadLINQ: A compiler for Dryad
Tools and applications
Sketch: A billion-row spreadsheet
38
Bandwidth hierarchy
39
Principles
Visualizations are bounded data displays
All computations are sketches
Sketch is a runtime for
  (1) running streaming (sketching) algorithms
  (2) implementing visualizations with bounded data renderings
40
Streaming algorithms
Sketches = randomized streaming algorithms
Input = set of size n
Result same independent of the order
Memory = O(log(n))
Multi-pass
Linear input transformations
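One well-known sketch with the order-independence and linearity properties listed here is the Count-Min sketch. The slides do not say which sketches the system uses, so the Python code below is only an illustrative assumption; the class name, table sizes, and hashing scheme are all hypothetical choices.

```python
import hashlib
from typing import List


class CountMin:
    """Fixed-size frequency sketch: estimates never undercount an item."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # Deterministic per-row hash derived from SHA-256.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += 1

    def estimate(self, item: str) -> int:
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other: "CountMin") -> "CountMin":
        """Linearity: the sketch of a union is the cell-wise sum of the sketches."""
        merged = CountMin(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)
        ]
        return merged


def sketch_of(items: List[str]) -> CountMin:
    cm = CountMin()
    for item in items:
        cm.add(item)
    return cm


left, right = sketch_of(["a", "b"]), sketch_of(["a", "c"])
whole = sketch_of(["a", "b", "a", "c"])
print(left.merge(right).table == whole.table)
```

Because merging is a cell-wise sum, partial sketches computed on different machines can be combined in any order, which is what makes this style of algorithm a good fit for a distributed runtime.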
41
4 billion rows on 155 machines
42
Spreadsheet operations
Browsing/scrolling
Filtering
Using predicates
Heavy hitters
Sampling
Searching
Sorting
Computing new columns
Set operations (intersection, union, etc.)
Charting
43
Histograms
44
Heat Maps
45
Sketch distributed service
[Figure: four machines, each holding data and running a Sketch service]
46
DataSets = distributed objects
[Figure: the Application on the Client holds a DataSet that refers, across the Network, to objects of type T on the Servers]
47
Sketch Spreadsheet architecture
[Figure labels: GUI, Spreadsheet display, Spreadsheet logic, Table operations, Distributed objects, DataSet, Storage layer (SQL Server, CSV files, Column store, Cosmos)]
48
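DataSet API
// Generic type parameters were lost in extraction and are reconstructed here;
// the name Pair<T, S> is an assumption.
interface IDataSet<T> {
    IDataSet<S> Map<S>(Func<T, S> f);
    IDataSet<Pair<T, S>> Zip<S>(IDataSet<S> other);
    R Sketch<R>(ISketch<T, R> sketch);
}
interface ISketch<T, R> {
    R Create(T data);
    R Combine(List<R> parts);
}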
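To make the three operations concrete, here is a small in-process Python analogue of the interface. It is a sketch under assumptions, not the talk's implementation: the class names `LocalDataSet`, `ParallelDataSet`, and `SumSketch` are hypothetical, there is no network or proxy layer, and `zip` assumes both datasets have the same tree shape.

```python
from typing import Any, Callable, List


class LocalDataSet:
    """Leaf node: wraps one in-memory value of type T."""

    def __init__(self, data: Any) -> None:
        self.data = data

    def map(self, f: Callable[[Any], Any]) -> "LocalDataSet":
        return LocalDataSet(f(self.data))

    def zip(self, other: "LocalDataSet") -> "LocalDataSet":
        return LocalDataSet((self.data, other.data))

    def sketch(self, sk: Any) -> Any:
        return sk.create(self.data)


class ParallelDataSet:
    """Inner node: forwards each operation to its children."""

    def __init__(self, children: List[Any]) -> None:
        self.children = children

    def map(self, f: Callable[[Any], Any]) -> "ParallelDataSet":
        return ParallelDataSet([c.map(f) for c in self.children])

    def zip(self, other: "ParallelDataSet") -> "ParallelDataSet":
        # Assumes `other` has the same shape as `self`.
        return ParallelDataSet([a.zip(b) for a, b in zip(self.children, other.children)])

    def sketch(self, sk: Any) -> Any:
        # Create at the leaves, combine on the way back up.
        return sk.combine([c.sketch(sk) for c in self.children])


class SumSketch:
    """Example sketch: T is a list of numbers, R is their sum."""

    def create(self, data: List[float]) -> float:
        return sum(data)

    def combine(self, parts: List[float]) -> float:
        return sum(parts)


ds = ParallelDataSet([LocalDataSet([1, 2]), LocalDataSet([3])])
doubled = ds.map(lambda xs: [2 * x for x in xs])
total = doubled.sketch(SumSketch())
pairs = ds.zip(doubled)
print(total, pairs.children[0].data)
```

Only `sketch` returns a plain value to the caller; `map` and `zip` return new datasets, so the data itself never has to leave the nodes that hold it.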
49
DataSet Implementations
[Figure: Dataset interface with three implementations. Local: core parallelism over objects of type T. Parallel: rack aggregation and cluster parallelism. Proxy: reference across the RMI layer and the Network. The Client (Application, GUI) sits in one address space; Server 0 ... Server n are grouped into Rack 0 ... Rack r]
50
Map(f)
[Figure: f is applied at each node, so Proxy, Local, and Parallel datasets holding T become datasets holding S]
51
Sketch(s)
[Figure: s.Create turns each T into an R at the Local nodes; s.Combine merges the R values at the Parallel and Proxy nodes]
interface ISketch<T, R> {
    R Create(T data);
    R Combine(List<R> parts);
}
52
Zip
[Figure: a dataset holding T and a dataset holding S with the same Proxy/Local/Parallel structure are zipped into one dataset holding (T, S)]
53
Histograms
CDF
2D histogram
54
Computing a histogram
[Figure: timeline between the Client and Server 1 ... Server n. Steps: User click, Data range sketch, Histogram 1D + 2D composite sketch, Compute, Render, Display histogram. Intervals labeled t_r, t_h, t_a]
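The two rounds in this timeline, a data range sketch followed by a histogram sketch, can be written as two create/combine pairs. The Python below is a simplified sketch under assumptions: function names and equal-width bucketing are illustrative choices, only the 1-D case is shown, and partitions are plain lists instead of data on remote servers.

```python
from typing import List, Tuple


def range_create(data: List[float]) -> Tuple[float, float]:
    """Round 1, per partition: local minimum and maximum."""
    return (min(data), max(data))


def range_combine(parts: List[Tuple[float, float]]) -> Tuple[float, float]:
    return (min(lo for lo, _ in parts), max(hi for _, hi in parts))


def hist_create(data: List[float], lo: float, hi: float, buckets: int) -> List[int]:
    """Round 2, per partition: counts in equal-width buckets over [lo, hi]."""
    counts = [0] * buckets
    width = (hi - lo) / buckets or 1.0  # avoid division by zero when lo == hi
    for x in data:
        idx = min(int((x - lo) / width), buckets - 1)  # hi falls in the last bucket
        counts[idx] += 1
    return counts


def hist_combine(parts: List[List[int]]) -> List[int]:
    return [sum(column) for column in zip(*parts)]


partitions = [[1, 2, 9], [4, 10]]
lo, hi = range_combine([range_create(p) for p in partitions])
histogram = hist_combine([hist_create(p, lo, hi, 3) for p in partitions])
print((lo, hi), histogram)
```

The bucket count is bounded by what the display can show, so the result sent back to the client stays small no matter how many rows each server holds.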
55
Some numbers
Windows Server 2012 R2
8-core 2.1GHz AMD Opteron 2373 EE
> 16GB RAM
3 x 1TB disks using RAID-0
155 machines
5 racks
1Gbps Ethernet
56
Null Sketch
[Plot: time (ms) versus number of machines]
57
Histogram computation
26M rows/machine
Scale-out
[Plot: time (ms) versus number of machines]
58
Conclusions
Big data is here to stay
Better tools are needed
Quest for high-level abstractions for building distributed systems
  Execution graphs
  Distributed collections
  Higher-order transformations
  Distributed stateful objects
  Sketching algorithms
60
Data-Parallel Computation
Application   Sawzall, Java        ≈SQL        LINQ, SQL
Language      Sawzall, FlumeJava   Pig, Hive   DryadLINQ, Scope
Execution     Map-Reduce           Hadoop      Dryad
Storage       GFS, BigTable        HDFS, S3    Cosmos, Azure, SQL Server