Big Data Challenge: Mega (10^6), Giga (10^9), Tera (10^12), Peta (10^15)
Architecture stack (figure), from applications down to hardware:
– Applications: Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping; Security, Provenance, Portal
– Support: Scientific Simulations (Data Mining and Data Analysis)
– Programming Model: Cross-Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling); High Level Language
– Runtime: Distributed File Systems, Object Store, Data Parallel File System
– Storage: Services and Workflow
– Infrastructure: Linux HPC Bare-system, Windows Server HPC Bare-system, Amazon Cloud, Azure Cloud, Grid Appliance; Virtualization
– Hardware: CPU Nodes, GPU Nodes, Virtualization
Judy Qiu and Bingjing Zhang, School of Informatics and Computing, Indiana University
Contents
– Data mining and data analysis applications
– Twister: a runtime for iterative MapReduce
– Twister tutorial examples: K-means clustering, multidimensional scaling, matrix multiplication
Intel's Application Stack (figure)
Motivation
Iterative algorithms are commonly used in many domains, but traditional MapReduce and classical parallel runtimes cannot run them efficiently:
– Hadoop: repeated data access to HDFS; no optimization for data caching or data transfers
– MPI: no native support for fault tolerance; complicated programming interface
What is Iterative MapReduce?
Iterative MapReduce
– MapReduce is a programming model that instantiates the paradigm of bringing computation to data
– Iterative MapReduce extends the MapReduce programming model to support iterative algorithms for data mining and data analysis
Interoperability
– Is it possible to use the same computational tools on HPC and Cloud?
– Enables scientists to focus on science, not on programming distributed systems
Reproducibility
– Is it possible to use cloud computing for scalable, reproducible experimentation?
– Sharing results, data, and software
Twister
– Distinction between static and variable data
– Configurable, long-running (cacheable) map/reduce tasks
– Pub/sub messaging-based communication/data transfers
– Broker network for facilitating communication
– Version 0.9 (used in the hands-on); version beta
Iterative MapReduce Frameworks
– Twister [1]: Map->Reduce->Combine->Broadcast; long-running map tasks (data in memory); centralized driver-based, statically scheduled
– Daytona [2]: Microsoft's iterative MapReduce on Azure using cloud services; architecture similar to Twister
– Twister4Azure [3]: decentralized; Map/Reduce input caching; reduce output caching
– HaLoop [4]: on-disk caching; Map/Reduce input caching; reduce output caching
– Spark [5]: iterative MapReduce using Resilient Distributed Datasets to ensure fault tolerance
– Pregel [6]: graph processing from Google
– Mahout [7]: Apache project supporting data mining algorithms
Others
– Network Levitated Merge [8]: RDMA/InfiniBand-based shuffle & merge
– Mate-EC2 [9]: local reduction object
– Asynchronous Algorithms in MapReduce [10]: local & global reduce
– MapReduce Online [11]: online aggregation and continuous queries; pushes data from Map to Reduce
– Orchestra [12]: data transfer improvements for MapReduce
– iMapReduce [13]: asynchronous iterations; one-to-one map & reduce mapping; automatically joins loop-variant and invariant data
– CloudMapReduce [14] & Google App Engine MapReduce [15]: MapReduce frameworks utilizing cloud infrastructure services
Applications & Different Interconnection Patterns
– Map Only (Input -> map -> Output): CAP3 analysis; document conversion (PDF -> HTML); brute force searches in cryptography; parametric sweeps. Examples: CAP3 gene assembly, PolarGrid MATLAB data analysis
– Classic MapReduce (Input -> map -> reduce): High Energy Physics (HEP) histograms; SWG gene alignment; distributed search; distributed sorting; information retrieval. Examples: information retrieval, HEP data analysis, calculation of pairwise distances for ALU sequences
– Iterative MapReduce / Twister (Input -> map -> reduce, with iterations): expectation maximization algorithms; clustering; linear algebra. Examples: K-means, deterministic annealing clustering, multidimensional scaling (MDS)
– Loosely Synchronous (Pij): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions. Examples: solving differential equations, particle dynamics with short-range forces
The first three columns form the domain of MapReduce and its iterative extensions; the last is the domain of MPI.
Parallel Data Analysis using Twister
– MDS (multidimensional scaling)
– Clustering (K-means)
– SVM (support vector machine)
– Indexing
– …
Application #1: Twister MDS Output
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters, in preliminary work with Seattle Children's Research Institute
Application #2: Data-Intensive K-means Clustering
– Image classification: 1.5 TB of data; 500 features per image; 10k clusters
– 1000 map tasks; 1 GB data transfer per map task node
Expand the Applicability of MapReduce to More Classes of Applications
– Data Deluge (experienced in many domains) + MapReduce (data centered, QoS) + Classic Parallel Runtimes / MPI (efficient and proven techniques) -> Iterative MapReduce
– Spectrum: Map-Only (Input -> map -> Output) -> MapReduce (Input -> map -> reduce) -> Iterative MapReduce (Input -> map -> reduce, iterations) -> More Extensions (Pij)
Twister Programming Model
– The main program may contain many MapReduce invocations or iterative MapReduce invocations
– Main program's process space:
  configureMaps(..)
  configureReduce(..)
  while(condition){
    runMapReduce(..)
    monitorTillCompletion(..)
    Combine() operation
    updateCondition()
  } //end while
  close()
– Worker nodes run cacheable Map() and Reduce() tasks, with static data on local disk reused across iterations
– Communications/data transfers via the pub-sub broker network & direct TCP
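A minimal sketch of this main program structure is shown below. It is illustrative only: the cgl.imr.* package names, the JobConf setters, and the helper methods (readInitialValue, fetchCombinedResult, updateCondition) are assumptions drawn from the Twister 0.9 samples; only configureMaps, runMapReduceBCast, monitorTillCompletion, and close come from the API list on the next slides.

import cgl.imr.base.TwisterMonitor;
import cgl.imr.base.Value;
import cgl.imr.client.TwisterDriver;
import cgl.imr.types.JobConf;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf("iterative-job");
        job.setMapperClass(MyMapTask.class);          // long-running, cacheable map tasks
        job.setReducerClass(MyReduceTask.class);
        job.setCombinerClass(MyCombiner.class);

        TwisterDriver driver = new TwisterDriver(job);
        driver.configureMaps(args[0]);                // partition file locating the static data

        Value loopData = readInitialValue(args[1]);   // loop-variant data (assumed helper)
        boolean converged = false;
        while (!converged) {                          // the while(condition) loop on the slide
            TwisterMonitor monitor = driver.runMapReduceBCast(loopData);
            monitor.monitorTillCompletion();          // wait until combine() has collected results
            loopData = fetchCombinedResult(driver);   // output of the Combine() operation (assumed helper)
            converged = updateCondition(loopData);    // user-defined convergence test (assumed helper)
        }
        driver.close();
    }
}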
Concepts and Features in Twister
– Configure(): static data loaded only once; long-running map/reduce task threads (cached)
– Map(Key, Value) and Reduce(Key, List<Value>): direct data transfer via pub/sub
– Combine(Map<Key, Value>): collects all reduce outputs in the main program
– Iteration: direct TCP broadcast/scatter transfer of dynamic KeyValue pairs
– Fault detection and recovery support between iterations
Twister APIs
1. configureMaps()
2. configureMaps(String partitionFile)
3. configureMaps(List<Value> values)
4. configureReduce(List<Value> values)
5. addToMemCache(Value value)
6. addToMemCache(List<Value> values)
7. cleanMemCache()
8. runMapReduce()
9. runMapReduce(List pairs)
10. runMapReduceBCast(Value value)
11. map(MapOutputCollector collector, Key key, Value val)
12. reduce(ReduceOutputCollector collector, Key key, List<Value> values)
13. combine(Map<Key, Value> keyValues)
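The task-side of these APIs (items 11 to 13) can be sketched as below. The interface names (MapTask, ReduceTask, Combiner) and the configure() hook are assumptions; the method signatures are copied from the list above.

import java.util.List;
import java.util.Map;

// Each class would live in its own .java file.
public class MyMapTask implements MapTask {
    public void configure(JobConf job, MapperConf conf) {
        // runs once per long-running task: load and cache this task's static data partition
    }
    public void map(MapOutputCollector collector, Key key, Value val) {
        // val carries the broadcast (loop-variant) data for this iteration
        collector.collect(key, val);                  // emit intermediate key/value pairs
    }
}

public class MyReduceTask implements ReduceTask {
    public void reduce(ReduceOutputCollector collector, Key key, List<Value> values) {
        // merge the intermediate values for this key and emit a single result
    }
}

public class MyCombiner implements Combiner {
    public void combine(Map<Key, Value> keyValues) {
        // gathers every reduce output into the main program's process space
    }
}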
Twister Architecture
– Master node: Twister driver, main program
– Pub/sub broker network and collective communication service
– Worker nodes: Twister daemon, worker pool, cacheable map/reduce tasks, local disk
– Scripts perform data distribution, data collection, and partition file creation
Data Storage and Caching
– Use local disk to store static data files
– Use the Data Manipulation Tool to manage data
– Use the partition file to locate data
– Data on disk can be cached into local memory
– NFS is supported (version 0.9)
Data Manipulation Tool
– Provides basic functionality to manipulate data across the local disks of the compute nodes
– Data partitions are assumed to be files (in contrast to the fixed-size blocks in Hadoop)
– Supported commands: mkdir, rmdir, put, putall, get, ls; copy resources; create partition file
– Data lives in a common directory in the local disk of each node (Node 0 ... Node n), e.g. /tmp/twister_data; the tool also produces the partition file
Partition File
– Allows duplicates, to show replica availability; one data file and its replicas may reside on multiple nodes
– Provides the information required for caching
– Columns: File No | Node IP | Daemon No | File Partition Path; example partition paths:
  /tmp/zhangbj/data/kmeans/km_142.bin
  /tmp/zhangbj/data/kmeans/km_382.bin
  /tmp/zhangbj/data/kmeans/km_370.bin
  /tmp/zhangbj/data/kmeans/km_57.bin
Pub/Sub Messaging
– Used for small control messages only
– Currently supported brokers:
  – NaradaBrokering: single broker, manual configuration
  – ActiveMQ: multiple brokers, auto configuration
Broadcast
– Use the addToMemCache operation to broadcast dynamic data required by all tasks in each iteration
– Replaces the original broker-based methods with direct TCP data transfers
– Automatic algorithm selection
– Topology-aware broadcasting

MemCacheAddress memCacheKey = driver.addToMemCache(centroids);
TwisterMonitor monitor = driver.runMapReduceBCast(memCacheKey);
Twister Communications
– Broadcasting (data could be large): chain & MST algorithms
– Map Collectives: local merge
– Reduce Collectives: collect but no merge
– Combine: direct download or gather
– Pattern: Broadcast -> Map Tasks -> Map Collective -> Reduce Tasks -> Reduce Collective -> Combine / Gather
Twister Broadcast Comparison: Sequential vs. Parallel Implementations
Twister Broadcast Comparison: Ethernet vs. InfiniBand
Failure Recovery
– Recovery happens at iteration boundaries
– Does not handle individual task failures
– Any failure (hardware/daemons) results in the following handling sequence:
  – Terminate currently running tasks (remove from memory)
  – Poll for currently available worker nodes (& daemons)
  – Configure map/reduce using static data (re-assign data partitions to tasks depending on data locality)
  – Re-execute the failed iteration
Twister MDS demo
MDS workflow for gene sequences (N = 1 million)
– Select a reference sequence set (M = 100K) and the remaining N - M sequence set (900K)
– Pairwise alignment & distance calculation O(N^2) -> distance matrix -> multidimensional scaling (MDS) -> reference coordinates (x, y, z)
– Interpolative MDS with pairwise distance calculation -> N - M coordinates (x, y, z)
– Visualization: 3D plot
– Input data size: 680k; sample data size: 100k; out-of-sample data size: 580k
– Test environment: PolarGrid with 100 nodes, 800 workers
– (Figure: 100k sample data vs. 680k data)
Dimension Reduction Algorithms
Multidimensional Scaling (MDS) [1]
– Given the proximity information among points, solve an optimization problem to find a mapping in the target dimension that minimizes an objective function based on the pairwise proximities
– Objective functions: STRESS (1) or SSTRESS (2)
– Only needs the pairwise distances δ_ij between the original points (typically not Euclidean)
– d_ij(X) is the Euclidean distance between the mapped (3D) points
Generative Topographic Mapping (GTM) [2]
– Finds an optimal K-representation of the given data (in 3D), known as the K-cluster problem (NP-hard)
– The original algorithm uses the EM method for optimization
– A deterministic annealing algorithm can be used to find a global solution
– The objective is to maximize the log-likelihood
[1] I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A.
[2] C. Bishop, M. Svensén, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
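Written out in the notation above (weights w_ij, original dissimilarities δ_ij, mapped distances d_ij(X)), the standard forms of these objectives as given by Borg and Groenen [1] and Bishop et al. [2] are:

% STRESS (1) and SSTRESS (2) objectives for MDS
\sigma(X) = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2} \qquad (1)
\sigma^{2}(X) = \sum_{i<j \le N} w_{ij}\,\bigl(d_{ij}(X)^{2} - \delta_{ij}^{2}\bigr)^{2} \qquad (2)

% GTM log-likelihood, maximized over the mapping weights W and inverse variance beta
\mathcal{L}(W,\beta) = \sum_{n=1}^{N} \ln\!\left[\frac{1}{K}\sum_{k=1}^{K}\mathcal{N}\!\bigl(x_n \mid y(z_k; W),\, \beta^{-1} I\bigr)\right]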
High-Performance Data Visualization
– Developed parallel MDS and GTM algorithms to visualize large, high-dimensional data
– Processed 0.1 million PubChem data points having 166 dimensions
– Parallel interpolation can process up to 2M PubChem points
– MDS for 100k PubChem data: 100k PubChem points with 166 dimensions visualized in 3D space; colors represent 2 clusters separated by their structural proximity
– GTM for 930k genes and diseases: genes (green) and diseases (other colors) plotted in 3D space, aiming at finding cause-and-effect relationships
– GTM with interpolation for 2M PubChem data: 2M PubChem points plotted in 3D with the GTM interpolation approach; red points are 100k sampled data and blue points are 4M interpolated points [3]
[3] PubChem project.
Twister-MDS Demo
– Input: 30k metagenomics data; MDS reads the pairwise distance matrix of all sequences
– Output: 3D coordinates visualized in PlotViz
– Architecture: master node (Twister driver, Twister-MDS), ActiveMQ brokers, client node (MDS monitor, PlotViz)
– Flow: I. send a message to start the job; II. send intermediate results to the client
Twister-MDS Demo Movie
Questions?
Twister Hands-on Session
How does K-means work? (illustration from Wikipedia)
Parallelization of K-means Clustering (figure)
– The current cluster centers C1 ... Ck are broadcast to every data partition
– Each partition computes, for every center, the partial sums and counts (x, y, count) of the points assigned to it
K-means Clustering Algorithm for MapReduce
Do
  Broadcast Cn
  [Perform in parallel] – the map() operation (E-step)
    for each Vi
      for each Cn,j
        Dij <= Euclidean(Vi, Cn,j)
      Assign point Vi to Cn,j with minimum Dij
    for each Cn,j
      newCn,j <= Sum(Vi in j'th cluster)
      newCountj <= Sum(1 if Vi in j'th cluster)
  [Perform sequentially] – the reduce() operation (M-step, global reduction)
    Collect all Cn
    Calculate new cluster centers Cn+1
    Diff <= Euclidean(Cn, Cn+1)
while (Diff > THRESHOLD)
Notation: Vi is the i'th vector; Cn,j is the j'th cluster center in the n'th iteration; Dij is the Euclidean distance between the i'th vector and the j'th cluster center; K is the number of cluster centers.
Twister K-means Execution (figure): the driver broadcasts the current centers C1 ... Ck to all map tasks; each map task produces partial new centers, which are merged through reduce and combine back into C1 ... Ck for the next iteration
K-means Clustering Code Walk-through
– Main program: job configuration, execution of iterations
– KMeansMapTask: configuration, map
– KMeansReduceTask: reduce
– KMeansCombiner: combine
Code Walkthrough – Main Program
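The original slide shows the sample's source; below is a condensed sketch of the main program's structure (job configuration plus the execution iterations). Class names follow the walk-through outline, while the JobConf setters, the helper methods (loadCentroids, encode, collectNewCentroids, difference), and the convergence threshold are assumptions, not the sample code verbatim.

import cgl.imr.base.TwisterMonitor;
import cgl.imr.client.TwisterDriver;
import cgl.imr.types.JobConf;
import cgl.imr.types.MemCacheAddress;

public class KMeansClustering {
    private static final double THRESHOLD = 0.0001;            // assumed convergence threshold

    // usage mirrors run_kmeans.sh: <init_clusters_file> <num_map_tasks> <partition_file>
    public static void main(String[] args) throws Exception {
        String initCentroidsFile = args[0];
        int numMapTasks = Integer.parseInt(args[1]);
        String partitionFile = args[2];

        // Job Configuration
        JobConf job = new JobConf("kmeans-clustering");
        job.setMapperClass(KMeansMapTask.class);
        job.setReducerClass(KMeansReduceTask.class);
        job.setCombinerClass(KMeansCombiner.class);
        job.setNumMapTasks(numMapTasks);
        job.setNumReduceTasks(1);

        TwisterDriver driver = new TwisterDriver(job);
        driver.configureMaps(partitionFile);                    // cache the data vectors on the workers

        // Execution iterations
        double[][] centroids = loadCentroids(initCentroidsFile);     // assumed helper
        double diff = Double.MAX_VALUE;
        while (diff > THRESHOLD) {
            MemCacheAddress key = driver.addToMemCache(encode(centroids));  // broadcast current centers
            TwisterMonitor monitor = driver.runMapReduceBCast(key);
            monitor.monitorTillCompletion();

            double[][] newCentroids = collectNewCentroids(driver);   // read from KMeansCombiner (assumed helper)
            diff = difference(centroids, newCentroids);              // Euclidean distance between Cn and Cn+1
            centroids = newCentroids;
            driver.cleanMemCache();                                  // API item 7
        }
        driver.close();
    }
}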
Code Walkthrough – Map
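A sketch of KMeansMapTask in the same spirit: configure() loads this task's vector partition once, and map() performs the E-step plus local M-step accumulation from the pseudocode slide. The interface names and the encode/decode/loadPartition/euclidean helpers are assumptions.

public class KMeansMapTask implements MapTask {
    private double[][] vectors;                             // static data, cached once per task

    public void configure(JobConf job, MapperConf conf) {
        vectors = loadPartition(conf);                      // assumed helper: read the cached km_*.bin file
    }

    public void map(MapOutputCollector collector, Key key, Value val) {
        double[][] centroids = decodeCentroids(val);        // broadcast loop-variant data
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];               // per-center partial sums
        int[] counts = new int[k];                          // per-center point counts

        for (double[] v : vectors) {
            int nearest = 0;
            double minDist = Double.MAX_VALUE;
            for (int j = 0; j < k; j++) {                   // E-step: assign v to the nearest center
                double d = euclidean(v, centroids[j]);      // assumed helper
                if (d < minDist) { minDist = d; nearest = j; }
            }
            for (int d = 0; d < dim; d++) sums[nearest][d] += v[d];
            counts[nearest]++;
        }
        // one intermediate record per map task: partial sums and counts for all k centers
        collector.collect(key, encodePartialSums(sums, counts));   // assumed helper
    }
}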
Code Walkthrough – Reduce/Combine
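A matching sketch of the reduce and combine steps: the single reducer merges the partial sums from every map task and divides by the counts to obtain the new centers, and the combiner makes the result available to the main program. Again, interface names and the encode/decode/addInto helpers are assumptions.

import java.util.List;
import java.util.Map;

// Each class would live in its own .java file.
public class KMeansReduceTask implements ReduceTask {
    public void reduce(ReduceOutputCollector collector, Key key, List<Value> values) {
        double[][] sums = null;
        int[] counts = null;
        for (Value v : values) {                            // one value per map task
            PartialSums p = decodePartialSums(v);           // assumed helper/type
            if (sums == null) { sums = p.sums; counts = p.counts; }
            else { addInto(sums, p.sums); addInto(counts, p.counts); }  // assumed helpers
        }
        for (int j = 0; j < sums.length; j++)               // new center = sum / count
            for (int d = 0; d < sums[j].length; d++)
                if (counts[j] > 0) sums[j][d] /= counts[j];
        collector.collect(key, encodeCentroids(sums));      // assumed helper
    }
}

public class KMeansCombiner implements Combiner {
    private double[][] newCentroids;

    public void combine(Map<Key, Value> keyValues) {        // single reduce output in this job
        newCentroids = decodeCentroids(keyValues.values().iterator().next());
    }
    public double[][] getNewCentroids() { return newCentroids; }  // read by the main program
}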
File Structure
/root/software/
  twister/
    bin/
      start_twister.sh
      twister.sh
      create_partition_file.sh
    samples/kmeans/bin/
      run_kmeans.sh
      twisterKmeans_init_clusters.txt
      data/            (input data directory)
    data/              (Twister common data directory)
  apache-activemq-5.4.2/bin/
    activemq
Download Resources: VirtualBox and Appliance
– Step 1: Download VirtualBox
– Step 2: Download the appliance (albox/chef_ubuntu.ova)
– Step 3: Import the appliance
Prepare the Environment
– Step 4: Start the VM
– Step 5: Open 3 terminals (PuTTY or equivalent) and connect to the VM (IP: , username: root, password: school2012)
Start ActiveMQ in Terminal 1 (Step 6)
cd $ACTIVEMQ_HOME/bin
./activemq console
Start Twister in Terminal 2 (Step 7)
cd $TWISTER_HOME/bin
./start_twister.sh
Data Contents (figure: data files and their contents)
Distribute Data in Terminal 2
– Step 8: Create a directory to hold the data
  cd $TWISTER_HOME/bin
  ./twister.sh mkdir kmeans
– Step 9: Distribute the data
  ./twister.sh put ~/software/twister-0.9/samples/kmeans/bin/data/ kmeans km 1 1
Create Partition File in Terminal 2 (Step 9)
cd $TWISTER_HOME/bin
./create_partition_file.sh kmeans km_ ../samples/kmeans/bin/kmeans.pf
Partition file content (figure)
Run K-means Clustering in Terminal 3 (Step 10)
cd $TWISTER_HOME/samples/kmeans/bin
./run_kmeans.sh twisterKmeans_init_clusters.txt 80 kmeans.pf
Matrix Multiplication
Matrix Multiplication Flow in Twister (figure): Configure, then Iterations; see the sketch below.
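The same driver pattern can sketch this flow. The decomposition below (row blocks of A cached as static data, one column block of B broadcast per iteration, the combiner returning one block of C each round) is an illustrative choice and not necessarily the sample's exact layout; all class and helper names are assumptions.

import cgl.imr.base.TwisterMonitor;
import cgl.imr.base.Value;
import cgl.imr.client.TwisterDriver;
import cgl.imr.types.JobConf;

public class MatrixMultiplyDriver {
    public static void main(String[] args) throws Exception {
        // Configure (once): each map task caches one row block of A via the partition file
        JobConf job = new JobConf("matrix-multiply");
        job.setMapperClass(MatMulMapTask.class);
        job.setReducerClass(MatMulReduceTask.class);
        job.setCombinerClass(MatMulCombiner.class);

        TwisterDriver driver = new TwisterDriver(job);
        driver.configureMaps(args[0]);                       // partition file for A's row blocks

        int numColumnBlocks = Integer.parseInt(args[1]);
        // Iterations: broadcast one column block of B per round, collect that block of C
        for (int i = 0; i < numColumnBlocks; i++) {
            Value bBlock = loadColumnBlockOfB(i);            // assumed helper
            TwisterMonitor monitor = driver.runMapReduceBCast(bBlock);
            monitor.monitorTillCompletion();
            storeResultBlock(i, collectCombinedBlock(driver));   // assumed helpers
        }
        driver.close();
    }
}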
SALSA HPC Group School of Informatics and Computing Indiana University