Presentation on theme: "A Framework for Machine Learning and Data Mining in the Cloud" (Yucheng Low, Aapo Kyrola, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Joe Hellerstein; Carnegie Mellon). Presentation transcript:

1 Carnegie Mellon. A Framework for Machine Learning and Data Mining in the Cloud. Yucheng Low, Aapo Kyrola, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Joe Hellerstein.

2 Big Data is Everywhere. YouTube: 72 hours of video uploaded per minute; 28 million Wikipedia pages; 900 million Facebook users; 6 billion Flickr photos. "… data a new class of economic asset, like currency or gold." "… growing at 50 percent a year …"

3 Big Learning. How will we design and implement big learning systems?

4 Shift Towards Use of Parallelism in ML: GPUs, multicore, clusters, clouds, supercomputers. ML experts (i.e., graduate students) repeatedly solve the same parallel design challenges: race conditions, distributed state, communication… The resulting code is very specialized: difficult to maintain, extend, debug… Avoid these problems by using high-level abstractions.

5 MapReduce – Map Phase. [Figure: CPU 1 through CPU 4 each begin processing their own input records.]

6 MapReduce – Map Phase. Embarrassingly parallel, independent computation; no communication needed. [Figure: each CPU emits a value for its record, e.g. 12.9, 42.3, 21.3, 25.8.]

7 MapReduce – Map Phase. Embarrassingly parallel, independent computation; no communication needed. [Figure: the CPUs emit further values, e.g. 24.1, 84.3, 18.4, 84.4.]

8 MapReduce – Map Phase. Embarrassingly parallel, independent computation; no communication needed. [Figure: more values emitted, e.g. 17.5, 67.5, 14.9, 34.3.]

9 MapReduce – Reduce Phase. [Figure: two CPUs aggregate the mapped image-feature values into attractive-face statistics and ugly-face statistics, from faces labeled A (attractive) and U (ugly).]
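To make the two phases concrete, here is a minimal, self-contained sketch of the slide's face-statistics example (the labels, the per-image feature values, and the scoring rule are invented for illustration; this is generic map/reduce, not Hadoop code). The map step is independent per record and needs no communication; the reduce step groups the mapped values by key and aggregates them.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Image { std::string label; double feature; };   // stand-ins for real image data

    // Map: one record in, one (key, value) pair out; embarrassingly parallel.
    std::pair<std::string, double> map_one(const Image& img) {
        double score = img.feature * 0.5;                   // placeholder "feature extraction"
        return { img.label, score };
    }

    int main() {
        std::vector<Image> images = {
            {"attractive", 25.8}, {"ugly", 84.4}, {"attractive", 12.9}, {"ugly", 42.3}
        };

        // Map phase: could run on CPU 1..4 independently; shown sequentially here.
        std::vector<std::pair<std::string, double>> mapped;
        for (const Image& img : images) mapped.push_back(map_one(img));

        // Reduce phase: group by key and aggregate, here into a per-class mean.
        std::map<std::string, std::pair<double, int>> acc;  // key -> (sum, count)
        for (const auto& kv : mapped) {
            acc[kv.first].first  += kv.second;
            acc[kv.first].second += 1;
        }
        for (const auto& kv : acc)
            std::printf("%s faces: mean score %.2f over %d images\n",
                        kv.first.c_str(), kv.second.first / kv.second.second, kv.second.second);
    }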

10 MapReduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Graph-parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization). Is there more to machine learning?

11 Carnegie Mellon Exploit Dependencies

12 [Figure: a user's likely interests (hockey, scuba diving, underwater hockey) inferred from the interests of their friends.]

13 Graphs are Everywhere. Netflix collaborative filtering: users and movies. Wiki text analysis: docs and words. Social networks: probabilistic analysis.

14 Properties of Computation on Graphs: a dependency graph (my interests depend on my friends' interests), local updates, and iterative computation.

15 ML Tasks Beyond Data-Parallelism. Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Graph-parallel: graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).

16 ML Tasks Beyond Data-Parallelism. Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Graph-parallel (Map Reduce?): graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), graph analysis (PageRank, triangle counting), collaborative filtering (tensor factorization).

17 Properties of Graph-Parallel Algorithms: a dependency graph (my interests depend on my friends' interests), local updates, and iterative computation.

18 Map-Reduce for Data-Parallel ML. Excellent for large data-parallel tasks! Data-parallel (MapReduce): cross validation, feature extraction, computing sufficient statistics. Graph-parallel (Map Reduce?): graphical models (Gibbs sampling, belief propagation, variational optimization), semi-supervised learning (label propagation, CoEM), data mining (PageRank, triangle counting), collaborative filtering (tensor factorization).

19 Carnegie Mellon What is GraphLab?

20 Shared Memory GraphLab. [Plots: speedup vs. number of processors (up to 16), with an ideal-speedup line, for BP + parameter learning, Gibbs sampling, CoEM, Lasso, and compressed sensing.] GraphLab: A New Parallel Framework for Machine Learning. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein (UAI 2010).

21 2010: Shared Memory. Implemented algorithms include Bayesian tensor factorization, Gibbs sampling, matrix factorization, Lasso, SVM, belief propagation, PageRank, CoEM, SVD, LDA, linear solvers, the Splash sampler, alternating least squares, and many others.

22 Limited CPU power, limited memory, limited scalability.

23 The Distributed Cloud: an unlimited amount of computation resources (up to funding limitations), but new challenges: distributing state, data consistency, and fault tolerance.

24 The GraphLab Framework: a graph-based data representation, update functions (user computation), and a consistency model.

25 Data Graph: data is associated with vertices and edges. Graph: social network. Vertex data: user profile, current interest estimates. Edge data: relationship (friend, classmate, relative).
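A minimal sketch of what such a data graph could look like in code (hypothetical types, not the GraphLab API): user data lives on vertices and edges, and the graph itself is just the adjacency structure connecting them.

    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct VertexData {
        std::string profile;                 // user profile
        std::vector<double> interests;       // current interest estimates
    };

    enum class Relationship { Friend, Classmate, Relative };

    struct EdgeData { Relationship type; };  // relationship carried on the edge

    struct Edge { int source, target; EdgeData data; };

    struct DataGraph {
        std::vector<VertexData> vertices;                     // vertex id -> vertex data
        std::vector<Edge> edges;                              // edge list with attached data
        std::unordered_map<int, std::vector<int>> adjacency;  // vertex id -> incident edge ids

        int add_vertex(VertexData v) {
            vertices.push_back(std::move(v));
            return static_cast<int>(vertices.size()) - 1;
        }
        void add_edge(int s, int t, EdgeData d) {
            edges.push_back({s, t, d});
            int eid = static_cast<int>(edges.size()) - 1;
            adjacency[s].push_back(eid);
            adjacency[t].push_back(eid);
        }
    };

    int main() {
        DataGraph g;
        int alice = g.add_vertex({"Alice", {0.2, 0.8}});
        int bob   = g.add_vertex({"Bob",   {0.5, 0.5}});
        g.add_edge(alice, bob, {Relationship::Friend});
    }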

26 Distributed Graph: partition the graph across multiple machines.

27 Distributed Graph: "ghost" vertices maintain the adjacency structure and replicate remote data.
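A toy illustration of the ghost-vertex idea, simulating two machines inside one process (structure assumed for illustration, not GraphLab's implementation): each partition holds the vertices it owns plus read-only ghost copies of remote neighbors, and a synchronization step refreshes every ghost from its owner.

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct VertexData { double value = 0.0; };

    struct Partition {
        std::unordered_map<int, VertexData> owned;   // vertices this machine owns
        std::unordered_map<int, VertexData> ghosts;  // local copies of remote neighbors
    };

    // Refresh every ghost from its owner's current data (stands in for network messages).
    void sync_ghosts(std::vector<Partition>& parts) {
        for (Partition& p : parts)
            for (auto& g : p.ghosts)
                for (const Partition& owner : parts)
                    if (owner.owned.count(g.first)) g.second = owner.owned.at(g.first);
    }

    int main() {
        // Path graph 0-1-2-3, cut between vertices 1 and 2, split over two "machines".
        std::vector<Partition> parts(2);
        parts[0].owned[0].value = 1.0;
        parts[0].owned[1].value = 2.0;
        parts[0].ghosts[2];                 // ghost of remote neighbor 2
        parts[1].owned[2].value = 3.0;
        parts[1].owned[3].value = 4.0;
        parts[1].ghosts[1];                 // ghost of remote neighbor 1

        parts[1].owned[2].value = 9.0;      // the owner updates its vertex...
        sync_ghosts(parts);                 // ...and the ghost catches up at synchronization
        std::printf("ghost of vertex 2 on machine 0 = %.1f\n", parts[0].ghosts[2].value);
    }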

28 Distributed Graph: cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …).

29 The GraphLab Framework: a graph-based data representation, update functions (user computation), and a consistency model.

30 Update Functions: a user-defined program, applied to a vertex, that transforms the data in the scope of that vertex. Pagerank(scope) { // update the current vertex data; // reschedule neighbors if needed: if vertex.PageRank changes then reschedule_all_neighbors; } Dynamic computation: the update function is applied (asynchronously) in parallel until convergence, and many schedulers are available to prioritize computation. Why dynamic?
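A single-threaded sketch of the dynamic PageRank computation sketched above (the tiny graph and tolerance are invented; the real engine applies these updates asynchronously in parallel under a consistency model): each update recomputes one vertex's rank from its in-neighbors and reschedules its out-neighbors only if the rank actually changed, and the loop runs until the schedule is empty.

    #include <cmath>
    #include <cstdio>
    #include <deque>
    #include <vector>

    int main() {
        // Tiny directed graph: in_neighbors[v] lists the vertices linking to v.
        std::vector<std::vector<int>> in_neighbors  = { {1, 2}, {2}, {0, 1} };
        std::vector<std::vector<int>> out_neighbors = { {2}, {0, 2}, {0, 1} };
        const double damping = 0.85, tolerance = 1e-6;
        std::vector<double> rank(3, 1.0);

        std::deque<int> schedule = {0, 1, 2};        // initially schedule every vertex
        std::vector<bool> scheduled(3, true);

        while (!schedule.empty()) {                  // repeat until the schedule is empty
            int v = schedule.front();
            schedule.pop_front();
            scheduled[v] = false;

            // Update the current vertex data from its in-neighbors' ranks.
            double sum = 0.0;
            for (int u : in_neighbors[v]) sum += rank[u] / out_neighbors[u].size();
            double new_rank = (1.0 - damping) + damping * sum;

            // Reschedule neighbors only if this vertex's rank changed noticeably.
            bool changed = std::fabs(new_rank - rank[v]) > tolerance;
            rank[v] = new_rank;
            if (changed)
                for (int u : out_neighbors[v])
                    if (!scheduled[u]) { schedule.push_back(u); scheduled[u] = true; }
        }
        for (int v = 0; v < 3; ++v) std::printf("rank[%d] = %.4f\n", v, rank[v]);
    }

Because converged vertices are never rescheduled, the work concentrates wherever ranks are still moving, which is the point of the "Why dynamic?" question on the slide.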

31 BSP ML problem: synchronous algorithms can be inefficient. [Plot: bulk synchronous vs. asynchronous Splash BP.]

32 Asynchronous Belief Propagation. [Figure: cumulative vertex updates on a graphical model, ranging from many updates to few updates.] The algorithm identifies and focuses on the hidden sequential structure. Challenge = boundaries.

33 Shared Memory Dynamic Schedule. [Figure: a scheduler holding vertices a, b, h, i, … feeds update tasks to CPU 1 and CPU 2 over a graph of vertices a through k.] The process repeats until the scheduler is empty.

34 Distributed Scheduling. Each machine maintains a schedule over the vertices it owns. [Figure: two machines, each with its own local scheduler and partition of the graph.] Distributed consensus is used to identify completion.

35 Ensuring Race-Free Code: how much can computation overlap?

36 Common Problem: Write-Write Race. Processors running adjacent update functions simultaneously modify shared data: CPU 1 writes, CPU 2 writes, and only one of the two writes survives as the final value.

37 The GraphLab Framework: a graph-based data representation, update functions (user computation), and a consistency model.

38 PageRank Revisited. Pagerank(scope) { … }

39 Racing PageRank: this was actually encountered in user code.

40 Bugs. Pagerank(scope) { … }

41 Bugs. Pagerank(scope) { … }, with a temporary variable tmp highlighted in the fix.
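The slides elide the actual user code, so the following is only a hedged illustration of the kind of hazard involved (not the code from the talk): accumulating partial sums directly into the shared rank field lets a concurrently running neighbor update read a half-computed value, whereas accumulating into a local tmp publishes one finished value with a single write.

    #include <vector>

    struct Vertex { double rank = 1.0; int out_degree = 1; std::vector<int> in_neighbors; };

    // Hazardous pattern: intermediate sums are visible in v.rank while neighbors may read it.
    void pagerank_racy(std::vector<Vertex>& g, int vid, double damping) {
        Vertex& v = g[vid];
        v.rank = 1.0 - damping;                       // partial value exposed to readers
        for (int u : v.in_neighbors)
            v.rank += damping * g[u].rank / g[u].out_degree;
    }

    // Safer pattern: compute into a local tmp, then write the vertex data once at the end.
    void pagerank_tmp(std::vector<Vertex>& g, int vid, double damping) {
        double tmp = 1.0 - damping;
        for (int u : g[vid].in_neighbors)
            tmp += damping * g[u].rank / g[u].out_degree;
        g[vid].rank = tmp;                            // single write of the finished value
    }

    int main() {
        std::vector<Vertex> g(2);
        g[0].in_neighbors = {1};
        g[1].in_neighbors = {0};
        pagerank_tmp(g, 0, 0.85);
        pagerank_racy(g, 1, 0.85);   // called sequentially here, so no race actually occurs
    }

Even the tmp version still races against concurrent writers of neighboring ranks; removing that read-write interference without giving up parallelism is what the consistency model on the following slides provides.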

42 Throughput != Performance. No consistency: higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

43 Racing Collaborative Filtering.

44 Serializability: for every parallel execution, there exists a sequential execution of update functions which produces the same result. [Figure: a parallel schedule over CPU 1 and CPU 2 shown alongside an equivalent sequential schedule on a single CPU, over time.]

45 Serializability Example. Edge consistency: update functions one vertex apart can be run in parallel, because the overlapping regions are only read. Stronger and weaker consistency levels are available; the user-tunable consistency level trades off parallelism and consistency.

46 Stronger / Weaker Consistency. Full consistency: parallel updates must be two vertices apart. Vertex consistency: an update does not read adjacent vertices, so parallel updates can be adjacent.

47 Distributed Consistency. Solution 1: graph coloring. Solution 2: distributed locking.

48 Edge Consistency via Graph Coloring. Vertices of the same color are all at least one vertex apart; therefore, all vertices of the same color can be run in parallel!

49 Chromatic Distributed Engine. Over time: execute tasks on all vertices of color 0, then ghost synchronization completion + barrier; execute tasks on all vertices of color 1, then ghost synchronization completion + barrier; and so on.
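A compact sketch of the chromatic idea (sequential and shared-memory for clarity, with made-up names; the distributed engine additionally synchronizes ghosts and runs a barrier between colors): greedily color the graph, then sweep color by color, since vertices of one color share no edge and their updates could safely run in parallel.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Small undirected graph as adjacency lists.
        std::vector<std::vector<int>> adj = { {1, 2}, {0, 2, 3}, {0, 1}, {1} };
        int n = static_cast<int>(adj.size());

        // Greedy coloring: give each vertex the smallest color unused by its neighbors.
        std::vector<int> color(n, -1);
        for (int v = 0; v < n; ++v) {
            std::vector<bool> used(n, false);
            for (int u : adj[v]) if (color[u] >= 0) used[color[u]] = true;
            color[v] = static_cast<int>(std::find(used.begin(), used.end(), false) - used.begin());
        }
        int num_colors = *std::max_element(color.begin(), color.end()) + 1;

        // Chromatic sweep: update every vertex of one color (parallelizable within the
        // color), then synchronize ghosts and wait at a barrier before the next color.
        for (int c = 0; c < num_colors; ++c) {
            for (int v = 0; v < n; ++v)
                if (color[v] == c) std::printf("update vertex %d (color %d)\n", v, c);
            std::printf("-- ghost synchronization, completion + barrier after color %d --\n", c);
        }
    }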

50 Matrix Factorization: Netflix collaborative filtering via alternating least squares. Model: 0.5 million nodes, 99 million edges; a bipartite graph of users and movies, each holding a latent factor of dimension d.
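For reference, the standard alternating least squares objective behind such a model (textbook form, not copied from the slide), with user factors x_u and movie factors y_v of dimension d, where \Omega is the set of observed (user, movie) ratings:

    \[
    \min_{X,Y} \sum_{(u,v)\in\Omega} \bigl(r_{uv} - x_u^\top y_v\bigr)^2
      + \lambda\Bigl(\sum_u \lVert x_u\rVert^2 + \sum_v \lVert y_v\rVert^2\Bigr)
    \]

Holding the movie factors fixed, each user factor has the closed-form update

    \[
    x_u = \Bigl(\sum_{v \in N(u)} y_v y_v^\top + \lambda I_d\Bigr)^{-1} \sum_{v \in N(u)} r_{uv}\, y_v,
    \]

and symmetrically for the movie factors; each update touches only one vertex and its adjacent edges, which is exactly the structure the graph-parallel abstraction exploits.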

51 Netflix Collaborative Filtering. [Plots vs. number of machines: speedup relative to ideal for d=100 and d=20, and a comparison of Hadoop, MPI, and GraphLab.]

52 The Cost of Hadoop.

53 CoEM (Rosie Jones, 2005): a named entity recognition task. Is "cat" an animal? Is "Istanbul" a place? The graph links noun phrases ("the cat", "Australia", "Istanbul") to the contexts they occur in ("ran quickly", "travelled to", "is pleasant"). Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time).

54 Distributed NER. Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. Distributed GraphLab: 32 EC2 nodes, 80 secs (0.3% of the Hadoop time).

55 Problems: the chromatic engine requires a graph coloring to be available, and its frequent barriers make it extremely inefficient for highly dynamic computations where only a small number of vertices are active in each round.

56 Distributed Consistency. Solution 1: graph coloring. Solution 2: distributed locking.

57 Edge consistency can be guaranteed through locking: each vertex carries a reader-writer (RW) lock.

58 Consistency Through Locking: acquire a write-lock on the center vertex and read-locks on the adjacent vertices.
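A shared-memory sketch of this locking discipline (hypothetical structure; the distributed version instead sends lock requests to the machines owning each vertex): acquire the locks of the scope in increasing vertex-id order so that two overlapping scopes can never deadlock, with a write lock on the center vertex and read (shared) locks on its neighbors.

    #include <algorithm>
    #include <memory>
    #include <shared_mutex>
    #include <vector>

    struct Graph {
        std::vector<double> data;                                  // per-vertex data
        std::vector<std::vector<int>> neighbors;
        std::vector<std::unique_ptr<std::shared_mutex>> locks;     // one RW lock per vertex
    };

    void locked_update(Graph& g, int center) {
        // Collect the scope (center + neighbors), sorted to give a canonical lock order.
        std::vector<int> scope = g.neighbors[center];
        scope.push_back(center);
        std::sort(scope.begin(), scope.end());

        for (int v : scope) {
            if (v == center) g.locks[v]->lock();          // write lock on the center vertex
            else             g.locks[v]->lock_shared();   // read locks on adjacent vertices
        }

        // Update function on the scope; here, replace the center by its neighbors' mean.
        double sum = 0.0;
        for (int u : g.neighbors[center]) sum += g.data[u];
        if (!g.neighbors[center].empty()) g.data[center] = sum / g.neighbors[center].size();

        for (int v : scope) {
            if (v == center) g.locks[v]->unlock();
            else             g.locks[v]->unlock_shared();
        }
    }

    int main() {
        Graph g;
        g.data = {1.0, 2.0, 3.0};
        g.neighbors = { {1, 2}, {0}, {0} };
        for (int i = 0; i < 3; ++i) g.locks.push_back(std::make_unique<std::shared_mutex>());
        locked_update(g, 0);
    }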

59 Consistency Through Locking. Multicore setting: PThread RW-locks. Distributed setting: distributed locks. Challenge: latency. Solution: pipelining. [Figure: vertices A, B, C, D spread across Machine 1 and Machine 2, with lock requests crossing machine boundaries.]

60 No Pipelining. Timeline: lock scope 1; process request 1; scope 1 acquired; update_function 1; release scope 1; process release 1.

61 Pipelining / Latency Hiding: hide latency using pipelining. Timeline: lock requests for scopes 1, 2, and 3 are issued back to back; while requests 2 and 3 are still being processed, scope 1 is acquired and update_function 1 runs; scope 1 is released as scopes 2 and 3 are acquired, and update_function 2 begins.

62 Latency Hiding: hide latency using request buffering (timeline as on the previous slide). Residual BP on a 190K-vertex, 560K-edge graph, 4 machines: no pipelining, 472 s; pipelining, 10 s; a 47x speedup.
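A toy simulation of the pipelining idea (entirely invented numbers: a 50 ms sleep stands in for the network round trip of a distributed lock request): instead of requesting one scope and blocking, several scope requests are kept in flight, so acquisition latency overlaps with other requests and with useful work.

    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <thread>
    #include <vector>

    // Simulated remote scope (lock set) acquisition with network latency.
    int acquire_scope(int vertex) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return vertex;
    }

    int main() {
        const int num_tasks = 8;
        const int pipeline_depth = 4;        // number of outstanding scope requests

        auto start = std::chrono::steady_clock::now();
        std::vector<std::future<int>> in_flight;
        int next = 0, done = 0;
        while (done < num_tasks) {
            // Keep the pipeline full: issue requests without waiting for earlier ones.
            while (static_cast<int>(in_flight.size()) < pipeline_depth && next < num_tasks)
                in_flight.push_back(std::async(std::launch::async, acquire_scope, next++));
            // Run the update for the oldest acquired scope (its release is omitted here).
            int v = in_flight.front().get();
            in_flight.erase(in_flight.begin());
            std::printf("update_function(%d)\n", v);
            ++done;
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("elapsed: %lld ms with pipeline depth %d\n",
                    static_cast<long long>(ms), pipeline_depth);
    }

With the assumed 50 ms latency, the eight updates finish in about 100 to 150 ms instead of roughly 400 ms sequentially; the slide's 47x figure comes from the real distributed engine, not from this toy.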

63 Video Cosegmentation: segments that mean the same thing, across 1740 frames. A probabilistic inference task; model: 10.5 million nodes, 31 million edges.

64 Video Cosegmentation Speedups. [Plot: GraphLab speedup vs. number of machines, with an ideal line.]

65 The GraphLab Framework: a graph-based data representation, update functions (user computation), and a consistency model.

66 What if machines fail? How do we provide fault tolerance?

67 Checkpoint. 1: Stop the world. 2: Write state to disk.

68 Snapshot Performance. [Plot: progress over time with no snapshot, with a snapshot, and with one slow machine; the snapshot time and the slow machine are marked.] Because we have to stop the world, one slow machine slows everything down!

69 How can we do better? Take advantage of consistency

70 Checkpointing. 1985: Chandy and Lamport invented an asynchronous snapshotting algorithm for distributed systems. [Figure: a graph in which some vertices have been snapshotted and others not yet.]

71 Checkpointing: fine-grained Chandy-Lamport, easily implemented within GraphLab as an update function!
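A single-threaded sketch of the snapshot-as-an-update-function idea (structure assumed for illustration; the real implementation also captures edge data and runs concurrently with regular updates under edge consistency): a scheduled vertex saves its own data, marks itself snapshotted, and schedules any neighbors not yet captured, so the snapshot spreads through the graph without stopping the world.

    #include <cstdio>
    #include <deque>
    #include <vector>

    int main() {
        std::vector<std::vector<int>> neighbors = { {1}, {0, 2}, {1, 3}, {2} };
        std::vector<double> vertex_data = {0.1, 0.2, 0.3, 0.4};
        int n = static_cast<int>(neighbors.size());

        std::vector<bool> snapshotted(n, false);
        std::vector<double> saved(n, 0.0);             // the checkpoint being built
        std::deque<int> schedule = {0};                // snapshot initiated at vertex 0

        while (!schedule.empty()) {
            int v = schedule.front();
            schedule.pop_front();
            if (snapshotted[v]) continue;              // already captured, nothing to do
            saved[v] = vertex_data[v];                 // save this vertex's state "to disk"
            snapshotted[v] = true;
            for (int u : neighbors[v])                 // propagate the snapshot marker
                if (!snapshotted[u]) schedule.push_back(u);
        }
        for (int v = 0; v < n; ++v)
            std::printf("saved[%d] = %.2f\n", v, saved[v]);
    }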

72 Async. Snapshot Performance. [Plot: progress over time with no snapshot, with a snapshot, and with one slow machine.] No penalty is incurred by the slow machine!

73 Summary. Extended the GraphLab abstraction to distributed systems, with two different methods of achieving consistency: graph coloring, and distributed locking with pipelining. Efficient implementations; asynchronous fault tolerance with fine-grained Chandy-Lamport snapshots. Performance, usability, efficiency, scalability.

74 Carnegie Mellon What’s next?

75 Natural Graphs

76 Natural Graphs → Power Law. Yahoo! Web Graph: 1.4B vertices, 6.7B edges; the top 1% of vertices is adjacent to 53% of the edges!

77 PowerGraph (OSDI): a revised abstraction, improved distributed graph placement, a new distributed locking protocol (fine-grained Chandy-Misra), demonstrated scaling to billions of vertices and tens of billions of edges, … and many more …

78 Carnegie Mellon University. Release 2.1 + many toolkits: http://graphlab.org. PageRank: 40x faster than Hadoop, 1B edges per second. Triangle counting: 282x faster than Hadoop, 400M triangles per second. Major improvements to be published in OSDI 2012.

79 Carnegie Mellon University. Release 2.1 available now: http://graphlab.org. On 64 EC2 HPC nodes: PageRank, 40x faster than Hadoop (1B edges per second); triangle counting, 282x faster than Hadoop (400M triangles per second). Toolkits: graph analytics, collaborative filtering, clustering, topic modeling, graphical models, computer vision. Major improvements to be published in OSDI 2012.

