Published by Octavia Morris. Modified over 9 years ago.
1
Venkatram Ramanathan
2
- Motivation
- Evolution of Multi-Core Machines and the Challenges
- Summary of Contributions
- Background: MapReduce and FREERIDE
- Wavelet Transform on FREERIDE
- Co-clustering on FREERIDE
- Conclusion
3
- Performance increase: more cores at lower clock frequencies
- Cost effective
- Scalability of performance
- HPC environments: clusters of multi-cores
4
Multi-Level Parallelism
- Within the cores of a node: shared memory parallelism (Pthreads, OpenMP)
- Across nodes: distributed memory parallelism (MPI)
- Achieving both programmability and performance is a major challenge
5
Possible solution
- Use higher-level/restricted APIs
  - Reduction-based APIs
  - Map-Reduce
- Higher-level API: program a cluster of multi-cores with 1 API
- Expressive power is considered limited
- Expressing computations using reduction-based APIs
6
Two algorithms
- Wavelet Transform
- Co-clustering
Expressed as reduction structures and parallelized on FREERIDE
- Speedup of 42 on 64 cores for Wavelet Transform
- Speedup of 21 on 32 cores for Co-clustering
7
MapReduce
- Map(in_key, in_value) -> list(out_key, intermediate_value)
- Reduce(out_key, list(intermediate_value)) -> list(out_value)
FREERIDE
- Users explicitly declare a Reduction Object and update it
- Map and Reduce steps are combined
- Each data element is processed and reduced before the next element is processed
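The contrast between the two APIs can be sketched in a few lines of Python. This is only an illustration of the two processing styles, not the FREERIDE interface itself: the function names `map_reduce`, `generalized_reduction` and `accumulate`, and the parity-sum example, are all hypothetical.

```python
from collections import defaultdict


def map_reduce(data, map_fn, reduce_fn):
    """MapReduce style: collect every intermediate pair, then reduce per key."""
    intermediate = defaultdict(list)
    for in_key, in_value in data:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def generalized_reduction(data, process_fn, reduction_object):
    """Reduction-object style: each element is folded in before the next is read."""
    for in_key, in_value in data:
        process_fn(in_key, in_value, reduction_object)
    return reduction_object


# Hypothetical workload: sum the values separately for even and odd numbers.
data = list(enumerate([3, 8, 5, 2, 7]))

mr_result = map_reduce(
    data,
    map_fn=lambda key, value: [(value % 2, value)],
    reduce_fn=lambda key, values: sum(values),
)


def accumulate(key, value, robj):
    # The reduction object is declared by the user and updated in place.
    robj[value % 2] += value


gr_result = generalized_reduction(data, accumulate, [0, 0])
```

Both calls compute the same sums; the second never stores the list of intermediate values.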
10
Wavelet Transform: an important tool in medical imaging
- fMRI: a probing mechanism for brain activation
- Seeks to study behavior across spatio-temporal data
11
Discrete Wavelet Transform
- Defined for input having 2^n numbers
- Convolution along the time domain results in 2^n output values
- Steps:
  - Pair up input values
  - Store the difference
  - Pass the sum
  - Repeat until there are 2^n - 1 differences and 1 sum
12
Serial Wavelet Transform Algorithm
Input: a1, a2, a3, a4, a5, a6, a7, a8
Output:
  a1-a2, a3-a4, a5-a6, a7-a8
  a1+a2-a3-a4, a5+a6-a7-a8
  a1+a2+a3+a4-a5-a6-a7-a8
  a1+a2+a3+a4+a5+a6+a7+a8
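A minimal Python sketch of the serial algorithm, following the pair / store difference / pass sum steps exactly as listed (unnormalized sums and differences, as in the example output). The function name `wavelet_transform` is an assumption made for illustration.

```python
def wavelet_transform(values):
    """Unnormalized sum/difference transform of a sequence of length 2^n."""
    n = len(values)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("input length must be a power of 2")

    output = []
    current = list(values)
    while len(current) > 1:
        # Pair up neighbouring values.
        pairs = list(zip(current[0::2], current[1::2]))
        # Store the differences as final output values.
        output.extend(a - b for a, b in pairs)
        # Pass the sums on to the next iteration.
        current = [a + b for a, b in pairs]
    # One sum remains after the last iteration.
    output.extend(current)
    return output


result = wavelet_transform([1, 2, 3, 4, 5, 6, 7, 8])
```

For the input 1..8 the four first-level differences are all -1, the two second-level differences are 3-7 and 11-15, the third-level difference is 10-26, and the final sum is 36.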
13
- Time series length = T; number of nodes = P
- Time series per node = T/P
- If P is a power of 2, T/P - 1 final output values are calculated locally on each node
  - T - P final values are produced without communication
  - The remaining P values require inter-process communication
- Allocate a reduction object of size P on each node
- Each node updates the reduction object with its contribution
- After the global reduction, the last P values can be calculated
- Since the output is produced out of order, the output index where each final value needs to go can be calculated
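The scheme above can be simulated in a single process, with a loop standing in for the P nodes. This is a sketch under stated assumptions: each node holds a contiguous chunk of T/P values, T/P is itself a power of 2, and the output index rule in the code is derived here from the serial output order rather than taken from the slides. All function names are hypothetical.

```python
def haar_levels(values):
    """Return (differences per level, final sum) for a power-of-2 length list."""
    levels = []
    current = list(values)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        levels.append([a - b for a, b in pairs])
        current = [a + b for a, b in pairs]
    return levels, current[0]


def serial_wavelet(values):
    levels, total = haar_levels(values)
    return [d for diffs in levels for d in diffs] + [total]


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def distributed_wavelet(series, num_nodes):
    T = len(series)
    if not (is_power_of_two(T) and is_power_of_two(num_nodes) and num_nodes <= T):
        raise ValueError("T and P must be powers of 2 with P <= T")
    chunk = T // num_nodes
    output = [None] * T
    reduction_object = [0] * num_nodes  # one slot per node

    for node in range(num_nodes):  # each iteration stands in for one node
        local = series[node * chunk:(node + 1) * chunk]
        levels, local_sum = haar_levels(local)  # T/P - 1 local final values
        for level, diffs in enumerate(levels):
            # Assumed index rule: values of one level are contiguous in the
            # serial output, and each node owns a contiguous slice of them.
            level_offset = T - T // (2 ** level)
            for j, d in enumerate(diffs):
                output[level_offset + node * len(diffs) + j] = d
        reduction_object[node] += local_sum  # the node's contribution

    # After the global reduction, the last P values come from the node sums.
    levels, total = haar_levels(reduction_object)
    output[T - num_nodes:] = [d for diffs in levels for d in diffs] + [total]
    return output


series = [1, 2, 3, 4, 5, 6, 7, 8]
two_nodes = distributed_wavelet(series, 2)
```

With T = 8 and P = 2, six values are written without any communication and only the last two depend on the reduction object.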
15
- Input data is distributed among nodes; threads within a node share data
- Size of reduction object: #Threads x #Nodes
- Each thread computes local final values
- Each thread updates the reduction object at ThreadID + (#Threads x NodeID)
- Global combination
- Calculate the last #Threads x #Nodes values from the data in the reduction object
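The slot formula ThreadID + (#Threads x NodeID) gives every thread on every node its own entry in the reduction object. The sketch below fills such an object in a single process; the assumption that chunk number k is the k-th contiguous slice of the series, and the names `chunk_id` and `fill_reduction_object`, are illustrative only.

```python
def chunk_id(thread_id, node_id, num_threads):
    """Slot in the reduction object owned by one thread of one node."""
    return thread_id + num_threads * node_id


def fill_reduction_object(series, num_nodes, num_threads):
    slots = num_nodes * num_threads
    chunk = len(series) // slots
    reduction_object = [0] * slots
    for node_id in range(num_nodes):
        for thread_id in range(num_threads):
            cid = chunk_id(thread_id, node_id, num_threads)
            # Assumption: chunk `cid` is the cid-th contiguous slice.
            local = series[cid * chunk:(cid + 1) * chunk]
            reduction_object[cid] += sum(local)
    return reduction_object


robj = fill_reduction_object([1, 2, 3, 4, 5, 6, 7, 8], num_nodes=2, num_threads=2)
all_slots = sorted(chunk_id(t, n, 4) for n in range(3) for t in range(4))
```

Because no two (node, thread) pairs map to the same slot, the updates need no locking in this layout.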
17
Computation of the last #Threads x #Nodes values is parallelized
- Local reduction step
- Global reduction step: global array
Size of reduction object
- Local reduction step: #Threads
- Global reduction step: #Nodes
19
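Output index of each computed value
- Index if iteration I = 0: [equation not preserved]
- Index if iteration I > 0: [equation not preserved]
- term is the local index of the value calculated in the current iteration
- Chunkid is ThreadID + (NodeID x #Threads)
- I is the current iteration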
20
Experimental setup
- Cluster of multi-core machines
- Intel Xeon CPU E5345, quad core
- Clock frequency: 2.33 GHz
- Main memory: 6 GB
Datasets: varying p (dimension of the spatial cube) and s (time-steps in the time series)
  Dataset  p   s
  DS1      10  262144
  DS2      32  2048
  DS3      32  4096
  DS4      32  8192
  DS5      39  8192
24
- Clustering: grouping together of “similar” objects
- Hard clustering: each object belongs to a single cluster
- Soft clustering: each object is probabilistically assigned to clusters
25
Co-clustering clusters both words and documents simultaneously.
26
- Involves simultaneous clustering of rows into row clusters and columns into column clusters
- Maximizes mutual information
- Uses Kullback-Leibler divergence
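For reference, the two quantities named above can be computed as follows. This is a small sketch using the standard definitions (base-2 logarithms), not code from the presentation; the function names are hypothetical, and `kl_divergence` assumes q is non-zero wherever p is non-zero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information (bits) between rows and columns of a count matrix."""
    total = sum(sum(row) for row in joint)
    p_row = [sum(row) / total for row in joint]
    p_col = [sum(col) / total for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, count in enumerate(row):
            p = count / total
            if p > 0:
                mi += p * math.log2(p / (p_row[i] * p_col[j]))
    return mi


mi_dependent = mutual_information([[1, 0], [0, 1]])    # rows determine columns
mi_independent = mutual_information([[1, 1], [1, 1]])  # rows tell nothing
```

A perfectly diagonal matrix carries one full bit of mutual information, while a uniform matrix carries none.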
29
- Input matrix and its transpose are pre-computed
- Input matrix and transpose are
  - divided into files
  - distributed among nodes
- Each node holds the same amount of row and column data
- rowCL and colCL are replicated on all nodes
- Initial clustering is assigned in round-robin fashion, for consistency across nodes
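A round-robin initial assignment is deterministic, so every node can build the same rowCL and colCL without communicating. A minimal sketch, with a hypothetical function name and example sizes:

```python
def round_robin_assignment(num_items, num_clusters):
    """Assign item i to cluster i mod num_clusters."""
    return [i % num_clusters for i in range(num_items)]


# Every node runs the same call and obtains identical replicas.
row_cl = round_robin_assignment(8, 4)
col_cl = round_robin_assignment(6, 4)
```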
30
- In preprocessing, pX and pY are normalized by the total sum
  - Normalization must wait until all nodes have processed their data
- Each node calculates pX and pY with local data
- Reduction object is updated with the partial sum and the pX and pY values
- Accumulated partial sums give the total sum
- pX and pY are normalized
- xnorm and ynorm are calculated in the second iteration, as they need the total sum
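The preprocessing step can be sketched as a reduction over per-node contributions. As a simplification, the sketch assumes pX and pY are the row and column marginals and accumulates column sums from row blocks instead of reading the transpose; the names `local_contribution` and `preprocess` are hypothetical.

```python
def local_contribution(block):
    """What one node computes from its own rows."""
    row_sums = [sum(row) for row in block]
    col_sums = [sum(col) for col in zip(*block)]
    return sum(row_sums), row_sums, col_sums


def preprocess(row_blocks):
    """row_blocks: one list of rows per node, together forming the matrix."""
    total = 0.0
    p_x = []
    p_y = [0.0] * len(row_blocks[0][0])
    for block in row_blocks:  # each iteration stands in for one node
        partial_total, row_sums, col_sums = local_contribution(block)
        total += partial_total  # accumulate partial sums
        p_x.extend(row_sums)
        p_y = [a + b for a, b in zip(p_y, col_sums)]
    # Normalization is only possible once every contribution is accumulated.
    return [v / total for v in p_x], [v / total for v in p_y]


p_x, p_y = preprocess([[[1, 1], [2, 0]], [[0, 4]]])
```

The division by `total` is the step that has to wait for all nodes, which is why it follows the accumulation loop.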
31
- Compressed matrix of size #rowclusters x #colclusters, calculated with local data
  - Sum of values of each row cluster across each column cluster
- Final compressed matrix: sum of local compressed matrices
- Local compressed matrices are updated in the reduction object
  - Accumulation produces the final compressed matrix
- Cluster centroids are calculated
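A sketch of the compressed-matrix computation as a sum of local matrices. The loop over blocks stands in for the nodes, and all names and the 4 x 4 example are assumptions made for illustration.

```python
def local_compressed(block, row_labels, col_cl, n_row_cl, n_col_cl):
    """Compressed matrix computed by one node from its own rows."""
    local = [[0.0] * n_col_cl for _ in range(n_row_cl)]
    for row, r in zip(block, row_labels):
        for j, value in enumerate(row):
            local[r][col_cl[j]] += value
    return local


def compressed_matrix(row_blocks, row_cl, col_cl, n_row_cl, n_col_cl):
    reduction_object = [[0.0] * n_col_cl for _ in range(n_row_cl)]
    start = 0
    for block in row_blocks:  # each iteration stands in for one node
        labels = row_cl[start:start + len(block)]
        local = local_compressed(block, labels, col_cl, n_row_cl, n_col_cl)
        # Accumulate the local matrix into the reduction object.
        for r in range(n_row_cl):
            for c in range(n_col_cl):
                reduction_object[r][c] += local[r][c]
        start += len(block)
    return reduction_object


matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
compressed = compressed_matrix(
    [matrix[:2], matrix[2:]], row_cl=[0, 1, 0, 1], col_cl=[0, 1, 0, 1],
    n_row_cl=2, n_col_cl=2,
)
```

The entries of the result add up to the sum of the whole input matrix, since each input value lands in exactly one cell.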
32
- Reassign row clustering
  - Determined by Kullback-Leibler divergence
  - Reduction object updated
- Compute compressed matrix
  - Update reduction object
- Column clustering: similar
- Objective function: finalize
- Next iteration
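The reassignment step can be illustrated with a simplified sketch: each row is normalized to a distribution and moved to the cluster whose prototype distribution is closest in Kullback-Leibler divergence. The prototypes are passed in as given values here, which is an assumption; how they are derived from the compressed matrix is not shown on the slide. Rows are assumed to have a positive sum and prototypes to be strictly positive.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reassign_rows(rows, prototypes):
    """Return, for each row, the index of the closest prototype in KL terms."""
    labels = []
    for row in rows:
        row_sum = sum(row)
        p = [value / row_sum for value in row]
        distances = [kl_divergence(p, q) for q in prototypes]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical prototypes for two row clusters over two columns.
prototypes = [[0.9, 0.1], [0.1, 0.9]]
labels = reassign_rows([[8, 2], [1, 9], [30, 1]], prototypes)
```

Each row's decision depends only on that row and the replicated prototypes, so rows held on different nodes can be reassigned independently.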
35
- The algorithm is the same for shared memory, distributed memory and hybrid parallelization
- Experiments were conducted on 2 clusters
  env1
  - Intel Xeon E5345, quad core
  - Clock frequency: 2.33 GHz
  - Main memory: 6 GB
  - 8 nodes
  env2
  - AMD Opteron 8350 CPU, 8 cores
  - Main memory: 16 GB
  - 4 nodes
36
2 datasets
- 1 GB dataset: matrix dimensions 16k x 16k
- 4 GB dataset: matrix dimensions 32k x 32k
- Datasets and their transposes
  - Split into 32 files each (row partitioning)
  - Distributed among nodes
- Number of row and column clusters: 4
40
- Preprocessing stage is a bottleneck for the smaller dataset, as it is not compute intensive
  - Speedup with preprocessing: 12.17
  - Speedup without preprocessing: 18.75
- Preprocessing stage scales well for the larger dataset, which involves more computation
  - Speedup is the same with and without preprocessing
  - Speedup for the larger dataset: 20.7
41
- Parallelized two data-intensive applications: Wavelet Transform and Co-clustering
  - Represented the algorithms as generalized reduction structures
  - Implemented them on FREERIDE
- Wavelet Transform: speedup of 42 on 64 cores
- Co-clustering: speedup of 21 on 32 cores
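One way to compare the two headline results is parallel efficiency, defined in the usual way as speedup divided by the number of cores. The helper name is hypothetical; the inputs are the figures quoted above.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / cores


wavelet_efficiency = parallel_efficiency(42, 64)
cocluster_efficiency = parallel_efficiency(21, 32)
```

Both results work out to the same efficiency, roughly 66% of linear scaling.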