Towards a Collective Layer in the Big Data Stack
Thilina Gunarathne, Judy Qiu, Dennis Gannon
Introduction Three disruptions – Big Data – MapReduce – Cloud Computing. MapReduce is used to process “Big Data” in cloud or cluster environments. Generalizing MapReduce and integrating it with HPC technologies.
Introduction Splits MapReduce into a Map phase and a Collective communication phase. Map-Collective communication primitives – Improve efficiency and usability – Map-AllGather, Map-AllReduce, MapReduceMergeBroadcast and Map-ReduceScatter patterns – Can be applied to multiple runtimes. Prototype implementations for Hadoop and Twister4Azure – Up to 33% performance improvement for KMeansClustering – Up to 50% for Multi-Dimensional Scaling
Outline: Introduction; Background; Collective communication primitives (Map-AllGather, Map-AllReduce); Performance analysis; Conclusion
Data Intensive Iterative Applications Growing class of applications – Clustering, data mining, machine learning & dimension reduction applications – Driven by data deluge & emerging computation fields – Lots of scientific applications
k ← 0; MAX_ITER ← maximum iterations
δ[0] ← initial delta value
while ( k < MAX_ITER && f(δ[k], δ[k-1]) )
    foreach datum in data
        β[datum] ← process(datum, δ[k])
    end foreach
    δ[k+1] ← combine(β[])
    k ← k+1
end while
Data Intensive Iterative Applications [Figure: iteration cycle with Compute, Communication, Reduce/barrier, Broadcast and New Iteration steps; labels: larger loop-invariant data, smaller loop-variant data]
Iterative MapReduce MapReduceMergeBroadcast. Extensions to support additional broadcast (+other) input data: Map(<key>, <value>, list_of <key,value>), Reduce(<key>, list_of <value>, list_of <key,value>), Merge(list_of <key, list_of <value>>, list_of <key,value>). Flow: Map → Combine → Shuffle → Sort → Reduce → Merge → Broadcast
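The Map, Reduce, Merge and Broadcast steps named on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not the Twister4Azure API: `run_iteration`, `kmeans_map`, `kmeans_reduce`, `kmeans_merge` and the toy 1-D data are hypothetical names chosen for the example, and the Combine and Sort steps are folded into a dictionary-based shuffle.

```python
from collections import defaultdict


def run_iteration(partitions, broadcast, map_fn, reduce_fn, merge_fn):
    """One Map -> Shuffle -> Reduce -> Merge pass; returns the next broadcast value."""
    # Map: every task reads its cached (loop-invariant) partition
    # plus the broadcast (loop-variant) data.
    shuffled = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            for out_key, out_value in map_fn(key, value, broadcast):
                shuffled[out_key].append(out_value)
    # Reduce: one call per intermediate key, also given the broadcast data.
    reduced = {k: reduce_fn(k, vs, broadcast) for k, vs in sorted(shuffled.items())}
    # Merge: sees all Reduce outputs; its result is broadcast to the next iteration.
    return merge_fn(reduced, broadcast)


# Hypothetical 1-D KMeans expressed with the three user functions.
def kmeans_map(_key, point, centroids):
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    yield nearest, (point, 1)


def kmeans_reduce(_key, values, _centroids):
    return sum(p for p, _ in values) / sum(c for _, c in values)


def kmeans_merge(reduced, centroids):
    # Keep the old centroid when no point was assigned to it.
    return [reduced.get(i, c) for i, c in enumerate(centroids)]


partitions = [[(0, 1.0), (1, 2.0)], [(2, 9.0), (3, 11.0)]]
centroids = [0.0, 5.0]
for _ in range(3):
    centroids = run_iteration(partitions, centroids, kmeans_map, kmeans_reduce, kmeans_merge)
print(centroids)
```

Every iteration here passes through a shuffle, a reduce and a merge before the new centroids reach the next wave of map tasks. That per-iteration chain of steps is what the Map-Collective primitives aim to shorten.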
Twister4Azure – Iterative MapReduce Decentralized iterative MR architecture for clouds – Utilize highly available and scalable Cloud services Extends the MR programming model Multi-level data caching – Cache aware hybrid scheduling Multiple MR applications per job Collective communication primitives Outperforms Hadoop in local cluster by 2 to 4 times Sustain features of MRRoles4Azure – dynamic scheduling, load balancing, fault tolerance, monitoring, local testing/debugging 9
Collective Communication Primitives for Iterative MapReduce Introducing All-to-All collective communications primitives to MapReduce Supports common higher-level communication patterns 11
Collective Communication Primitives for Iterative MapReduce Performance – Optimized group communication – Framework can optimize these operations transparently to the users: poly-algorithm (polymorphic) – Avoids unnecessary barriers and other steps in traditional MR and iterative MR – Scheduling using primitives. Ease of use – Users do not have to implement this logic manually – Preserves the Map & Reduce APIs – Easy to port applications using more natural primitives
Goals Fit with MapReduce data and computational model – Multiple Map task waves – Significant execution variations and inhomogeneous tasks. Retain scalability. Keep the programming model simple and easy to understand. Maintain the same excellent framework-managed fault tolerance. Backward compatibility with the MapReduce model – Only flip a configuration option
Map-AllGather Collective Traditional iterative MapReduce – The “reduce” step assembles the outputs of the Map tasks together in order – “merge” assembles the outputs of the Reduce tasks – Broadcast the assembled output to all the workers. Map-AllGather primitive – Broadcasts the Map task outputs to all the computational nodes – Assembles them together in the recipient nodes – Schedules the next iteration or the application. Eliminates the need for reduce, merge and monolithic broadcasting steps and unnecessary barriers. Examples: MDS BCCalc, PageRank with in-links matrix (matrix-vector multiplication)
Map-AllGather Collective [Figure]
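The Map-AllGather pattern described above can be illustrated with the matrix-vector multiplication example the slides mention. The following is a minimal single-process sketch, not the H-Collectives or Twister4Azure implementation: `mv_map_task`, `all_gather` and `iterate_matvec` are hypothetical names, and the "workers" are simulated with plain Python lists.

```python
def mv_map_task(row_block, vector):
    """Map task: multiply its cached block of matrix rows by the current vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in row_block]


def all_gather(partial_outputs):
    """Assemble the map outputs in task order and hand every worker a full copy."""
    assembled = [v for part in partial_outputs for v in part]
    return [list(assembled) for _ in partial_outputs]


def iterate_matvec(row_blocks, vector, iterations):
    """Repeated y = A x, where A is split by rows across the map tasks."""
    for _ in range(iterations):
        partials = [mv_map_task(block, vector) for block in row_blocks]
        copies = all_gather(partials)  # no reduce, merge or separate broadcast step
        vector = copies[0]             # every worker holds the same assembled vector
    return vector


# A = [[1, 1], [0, 1]] split into two single-row blocks, one per map task.
row_blocks = [[[1, 1]], [[0, 1]]]
result = iterate_matvec(row_blocks, [1, 1], 3)
print(result)
```

The matrix blocks play the role of the loop-invariant data that stays with each map task, while the assembled vector is the loop-variant data that every task needs for the next iteration.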
Map-AllReduce Map-AllReduce – Aggregates the results of the Map tasks (supports multiple keys and vector values) – Broadcasts the results – Uses the result to decide the loop condition – Schedules the next iteration if needed. Associative commutative operations – e.g. Sum, Max, Min. Examples: KMeans, PageRank, MDS stress calc
Map-AllReduce collective [Figure: outputs of Map 1, Map 2, … Map N in the nth iteration are combined by Op and delivered to Map 1, Map 2, … Map N of the (n+1)th iteration]
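A Map-AllReduce iteration with multiple keys and vector values can be sketched as follows, using KMeans with a Sum operation. This is a hedged single-process illustration, not the authors' implementation: `km_map_task`, `all_reduce`, `kmeans_allreduce`, the tolerance `tol` and the 1-D toy data are all assumptions made for the example.

```python
def km_map_task(points, centroids):
    """Map task: per-centroid partial result, key -> [sum_of_points, count]."""
    partial = {}
    for p in points:
        k = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        acc = partial.setdefault(k, [0.0, 0])
        acc[0] += p
        acc[1] += 1
    return partial


def all_reduce(partials, op):
    """Combine vector values key by key with an associative, commutative op."""
    result = {}
    for partial in partials:
        for key, vec in partial.items():
            if key in result:
                result[key] = [op(a, b) for a, b in zip(result[key], vec)]
            else:
                result[key] = list(vec)
    return result  # conceptually delivered to every worker


def kmeans_allreduce(partitions, centroids, max_iter=10, tol=1e-9):
    for iteration in range(max_iter):
        reduced = all_reduce([km_map_task(p, centroids) for p in partitions],
                             lambda a, b: a + b)
        new = [reduced[i][0] / reduced[i][1] if i in reduced else c
               for i, c in enumerate(centroids)]
        # The reduced result also decides the loop condition.
        converged = max(abs(a - b) for a, b in zip(new, centroids)) < tol
        centroids = new
        if converged:
            break
    return centroids, iteration + 1


final_centroids, iterations_run = kmeans_allreduce([[1.0, 9.0], [2.0, 11.0]], [0.0, 5.0])
print(final_centroids, iterations_run)
```

Because Sum is associative and commutative, the partial results can be combined in any order, which is what leaves a framework free to choose the aggregation algorithm.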
Implementations H-Collectives : Map-Collectives for Apache Hadoop – Node-level data aggregations and caching – Speculative iteration scheduling – Hadoop Mappers with only very minimal changes – Support dynamic scheduling of tasks, multiple map task waves, typical Hadoop fault tolerance and speculative executions. – Netty NIO based implementation Map-Collectives for Twister4Azure iterative MapReduce – WCF Based implementation – Instance level data aggregation and caching 18
MPI collectives and their counterparts (table columns: MPI | Hadoop | H-Collectives | Twister4Azure)
All-to-One
– Gather: shuffle-reduce* (Hadoop); shuffle-reduce-merge (Twister4Azure)
– Reduce: shuffle-reduce* (Hadoop); shuffle-reduce-merge (Twister4Azure)
One-to-All
– Broadcast: shuffle-reduce-distributedcache (Hadoop); merge-broadcast (Twister4Azure)
– Scatter: shuffle-reduce-distributedcache** (Hadoop); merge-broadcast** (Twister4Azure)
All-to-All
– AllGather: Map-AllGather
– AllReduce: Map-AllReduce
– Reduce-Scatter: Map-ReduceScatter (future work)
Synchronization
– Barrier: between Map & Reduce (Hadoop); between Map & Reduce and between iterations (H-Collectives); between Map, Reduce, Merge and between iterations (Twister4Azure)
KMeansClustering: Hadoop vs H-Collectives Map-AllReduce. 500 centroids (clusters). 20 dimensions. 10 iterations. [Figures: weak scaling; strong scaling]
KMeansClustering: Twister4Azure vs T4A-Collectives Map-AllReduce. 500 centroids (clusters). 20 dimensions. 10 iterations. [Figures: weak scaling; strong scaling]
MultiDimensional Scaling [Figures: Hadoop MDS – BCCalc only; Twister4Azure MDS]
Hadoop MDS Overheads [Figure series: Hadoop MapReduce MDS-BCCalc; H-Collectives AllGather MDS-BCCalc; H-Collectives AllGather MDS-BCCalc without speculative scheduling]
Conclusions Map-Collectives, collective communication operations for MapReduce inspired by MPI collectives – Improve the communication and computation performance: enable highly optimized group communication across the workers, get rid of unnecessary/redundant steps, enable poly-algorithm approaches – Improve usability: more natural patterns, decreased implementation burden. A future where many MapReduce and iterative MapReduce frameworks support a common set of portable Map-Collectives. Prototype implementations for Hadoop and Twister4Azure – Speedups of up to 33% (KMeansClustering) and up to 50% (Multi-Dimensional Scaling)
Future Work Map-ReduceScatter collective – Modeled after MPI ReduceScatter – e.g. PageRank. Explore ideal data models for the Map-Collectives model
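For readers unfamiliar with the MPI operation that Map-ReduceScatter is modeled after, the sketch below shows the general reduce-scatter semantics: contributions are reduced element-wise, then the result is split into blocks so that worker i receives only block i. It illustrates the MPI-style behavior only and makes no claim about how the planned Map-ReduceScatter collective will be designed; `reduce_scatter` and its parameters are hypothetical names.

```python
def reduce_scatter(contributions, block_sizes, op):
    """Element-wise reduce across workers, then give worker i the i-th block."""
    if sum(block_sizes) != len(contributions[0]):
        raise ValueError("block sizes must cover the whole vector")
    # Reduce: combine all contributions position by position.
    reduced = list(contributions[0])
    for vec in contributions[1:]:
        reduced = [op(a, b) for a, b in zip(reduced, vec)]
    # Scatter: slice the reduced vector into consecutive blocks.
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(reduced[start:start + size])
        start += size
    return blocks


# Two workers each contribute a 4-element vector; each receives 2 reduced elements.
scattered = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]], [2, 2], lambda a, b: a + b)
print(scattered)
```

Compared with an all-reduce, each worker ends up with only its own slice of the reduced result rather than the full vector.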
Acknowledgements Prof. Geoffrey C Fox for his many insights and feedback. Present and past members of SALSA group – Indiana University. Microsoft for Azure Cloud Academic Resources Allocation. National Science Foundation CAREER Award OCI. Persistent Systems for the fellowship
Thank You! 29
Backup Slides 30
Application Types [Figure] Slide from Geoffrey Fox, Advances in Clouds and their application to Data Intensive problems, University of Southern California Seminar, February
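Framework comparison (features: Programming Model, Data Storage, Communication, Scheduling & Load Balancing)
Hadoop – Programming model: MapReduce – Data storage: HDFS – Communication: TCP – Scheduling & load balancing: data locality, rack-aware dynamic task scheduling through a global queue, natural load balancing
Dryad [1] – Programming model: DAG based execution flows – Data storage: Windows shared directories – Communication: shared files / TCP pipes / shared memory FIFO – Scheduling & load balancing: data locality / network topology based run time graph optimizations, static scheduling
Twister [2] – Programming model: Iterative MapReduce – Data storage: shared file system / local disks – Communication: Content Distribution Network / direct TCP – Scheduling & load balancing: data locality based static scheduling
MPI – Programming model: variety of topologies – Data storage: shared file systems – Communication: low latency communication channels – Scheduling & load balancing: available processing capabilities / user controlled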
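Framework comparison (features: Failure Handling, Monitoring, Language Support, Execution Environment)
Hadoop – Failure handling: re-execution of map and reduce tasks – Monitoring: web based monitoring UI, API – Language support: Java; executables are supported via Hadoop Streaming; PigLatin – Execution environment: Linux cluster, Amazon Elastic MapReduce, FutureGrid
Dryad [1] – Failure handling: re-execution of vertices – Language support: C# + LINQ (through DryadLINQ) – Execution environment: Windows HPCS cluster
Twister [2] – Failure handling: re-execution of iterations – Monitoring: API to monitor the progress of jobs – Language support: Java; executables via Java wrappers – Execution environment: Linux cluster, FutureGrid
MPI – Failure handling: program level checkpointing – Monitoring: minimal support for task level monitoring – Language support: C, C++, Fortran, Java, C# – Execution environment: Linux/Windows cluster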
Iterative MapReduce Frameworks Twister [1] – Map->Reduce->Combine->Broadcast – Long running map tasks (data in memory) – Centralized driver based, statically scheduled. Daytona [3] – Iterative MapReduce on Azure using cloud services – Architecture similar to Twister. HaLoop [4] – On-disk caching, map/reduce input caching, reduce output caching. iMapReduce [5] – Asynchronous iterations, one-to-one map & reduce mapping, automatically joins loop-variant and loop-invariant data