CO-CLUSTERING USING CUDA
Co-Clustering Explained
- Problem: a large binary matrix of samples (rows) and features (columns)
- Which samples should be grouped together? Why? What features do they share?
- Co-clustering provides the "why" explicitly, as correlated sample/feature pairs
- Example: row cluster: s1 and s3 are in a group; column cluster: the distinguishing features are 2, 3, and 5
Co-Clustering - Details
- Uses information-theoretic co-clustering, as parallelized for the Hadoop architecture in "DisCo: Distributed Co-clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining", Papadimitriou et al., ICDM 2008
- Partition the entire matrix into row groups and column groups
- Minimize the length of the encoding of the resulting partitioned matrix
- Competing code-length factors: the number of row groups and column groups vs. the homogeneity of the clusters
- Iterate over rows, rearranging and sub-partitioning to find a better encoding using a heuristic
- Repeat for columns, then rows again, until a local optimum is found
- Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
- Credit: Chakrabarti et al., KDD 2004
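A hedged sketch of the inner alternating loop described above (not the paper's or the project's actual code): row and column regroupings alternate until the total code length stops decreasing. The three std::function parameters are stand-ins for the real statistics/cost/reassignment routines.

```cuda
#include <functional>

// Alternate row and column regroupings until the encoding cost stops improving.
void alternatingMinimization(std::function<float()> computeCost,
                             std::function<void()> reassignRows,
                             std::function<void()> reassignCols) {
    float cost = computeCost();
    while (true) {
        reassignRows();                 // move each row to its best row group, recompute stats
        reassignCols();                 // move each column to its best column group, recompute stats
        float newCost = computeCost();
        if (newCost >= cost) break;     // local optimum reached
        cost = newCost;
    }
}
```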
Implementation - Basics
- Initial matrix generation: CPU
- Initial random row/column group assignment: CPU
- Memory structures are very simple: arrays of ints
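A minimal sketch of the host-side layout this implies (struct and field names are illustrative assumptions, not the project's actual code): a flattened int matrix plus one int array per dimension holding the group assignments, initialized randomly on the CPU.

```cuda
#include <cstdlib>
#include <vector>

struct CoClusterData {
    int rows, cols;
    std::vector<int> matrix;    // flattened binary matrix, row-major: matrix[r * cols + c]
    std::vector<int> rowGroup;  // rowGroup[r] = row-group index of row r
    std::vector<int> colGroup;  // colGroup[c] = column-group index of column c
};

// Generate a random binary matrix and random initial group assignments on the CPU.
CoClusterData init(int rows, int cols, int numRowGroups, int numColGroups) {
    CoClusterData d;
    d.rows = rows;
    d.cols = cols;
    d.matrix.resize(rows * cols);
    d.rowGroup.resize(rows);
    d.colGroup.resize(cols);
    for (int i = 0; i < rows * cols; ++i) d.matrix[i] = rand() % 2;
    for (int r = 0; r < rows; ++r) d.rowGroup[r] = rand() % numRowGroups;
    for (int c = 0; c < cols; ++c) d.colGroup[c] = rand() % numColGroups;
    return d;
}
```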
Implementation – Stats step 1
- Statistics calculation: computes a statistic for each row in each column group
- The statistic is the number of 1's the row has in that column group
- Straightforward parallelization (each thread works on one row at a time), using global memory
- [Diagram: matrix partitioned into column groups and row groups; example: Stat(Row 3, ColumnGroup 3) = 1]
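A minimal sketch of this step as described: one thread per row (with a grid-stride loop so any launch covers all rows), counting 1's per column group directly in global memory. Names and array layout are assumptions, not the project's actual code.

```cuda
__global__ void rowStatsKernel(const int *matrix, const int *colGroup,
                               int rows, int cols, int numColGroups,
                               int *rowStats /* rows x numColGroups, pre-zeroed */) {
    // Grid-stride loop: each thread owns whole rows, so no atomics are needed.
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < rows;
         r += gridDim.x * blockDim.x) {
        for (int c = 0; c < cols; ++c) {
            if (matrix[r * cols + c]) {
                // Stat(row r, column group g) = number of 1's row r has in group g.
                rowStats[r * numColGroups + colGroup[c]] += 1;
            }
        }
    }
}
```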
Room For Improvement
- Calculate row statistics with the shared-memory histogram algorithm from the textbook
- Block the columns and assign one thread block to each column block
- Compute shared-memory histograms within each block
- Merge back to global memory when finished
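A hedged sketch of that improvement: each thread block owns one tile of columns, builds a per-row histogram of column-group counts in shared memory, and merges it into the global statistics with atomics. The tile assignment, the MAX_COL_GROUPS bound, and all names are illustrative assumptions.

```cuda
#define MAX_COL_GROUPS 32   // assumed upper bound so the histogram fits in shared memory

__global__ void rowStatsSharedKernel(const int *matrix, const int *colGroup,
                                     int rows, int cols, int numColGroups,
                                     int *rowStats /* rows x numColGroups, pre-zeroed */) {
    __shared__ int hist[MAX_COL_GROUPS];
    // Each thread block owns one tile ("block") of columns.
    int colsPerBlock = (cols + gridDim.x - 1) / gridDim.x;
    int colBegin = blockIdx.x * colsPerBlock;
    int colEnd   = min(colBegin + colsPerBlock, cols);

    for (int r = 0; r < rows; ++r) {
        // Zero the shared histogram for this row.
        for (int g = threadIdx.x; g < numColGroups; g += blockDim.x) hist[g] = 0;
        __syncthreads();

        // Threads cooperatively histogram this block's columns into shared memory.
        for (int c = colBegin + threadIdx.x; c < colEnd; c += blockDim.x) {
            if (matrix[r * cols + c]) atomicAdd(&hist[colGroup[c]], 1);
        }
        __syncthreads();

        // Merge the partial histogram into the global per-row statistics.
        for (int g = threadIdx.x; g < numColGroups; g += blockDim.x) {
            if (hist[g]) atomicAdd(&rowStats[r * numColGroups + g], hist[g]);
        }
        __syncthreads();
    }
}
```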
Implementation – Stats step 2
- Calculates the cost for each row group of each column group
- Essentially a reduce over the per-row data from step 1
- Block the rows and assign each block of rows to a thread block
- Use shared memory and atomics to build a histogram of all rows in a given row group
- Merge the shared histogram with the global histogram for that row group
- Iterate over all row groups
- [Diagram: matrix partitioned into column groups and row groups; example: Stat(RowGroup 1, ColumnGroup 3) = 2]
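A hedged sketch of this group-level reduction: each thread block takes a strided tile of rows, accumulates the step-1 statistics of rows belonging to the current row group into a shared histogram, then merges that histogram into the global per-row-group statistics with atomics. The host is assumed to launch it once per row group; names and bounds are assumptions.

```cuda
#define MAX_COL_GROUPS 32

__global__ void groupStatsKernel(const int *rowStats,   // rows x numColGroups (from step 1)
                                 const int *rowGroup,   // rowGroup[r] = group of row r
                                 int rows, int numColGroups, int targetRowGroup,
                                 int *groupStats /* numRowGroups x numColGroups, pre-zeroed */) {
    __shared__ int hist[MAX_COL_GROUPS];
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x) hist[g] = 0;
    __syncthreads();

    // Each block reduces a strided tile of rows that belong to targetRowGroup.
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < rows;
         r += gridDim.x * blockDim.x) {
        if (rowGroup[r] == targetRowGroup) {
            for (int g = 0; g < numColGroups; ++g) {
                int v = rowStats[r * numColGroups + g];
                if (v) atomicAdd(&hist[g], v);
            }
        }
    }
    __syncthreads();

    // Merge this block's shared histogram into the global row-group statistics.
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x) {
        if (hist[g]) atomicAdd(&groupStats[targetRowGroup * numColGroups + g], hist[g]);
    }
}
```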
Implementation – Row/Col Group Optimization
- For each row, find the optimal group it could belong to
- Parallelized straightforwardly: one row per thread, with a loop and stride to cover all rows
- Each row's calculation goes through all row groups and determines the global cost of moving to that row group
- Move all rows to their optimal groups
- Recompute statistics
- Repeat for column groups
- Continue alternating row/column groupings until convergence
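A minimal sketch of the row-reassignment kernel as described: one row per thread with a grid-stride loop, trying every row group and keeping the cheapest. The cost below is a deliberately simplified placeholder (squared difference between the row's statistics and the group's mean statistics), not the actual code-length computation from the paper; all names are assumptions.

```cuda
__global__ void reassignRowsKernel(const int *rowStats,   // rows x numColGroups
                                   const int *groupStats, // numRowGroups x numColGroups
                                   const int *groupSizes, // number of rows per row group
                                   int rows, int numRowGroups, int numColGroups,
                                   int *newRowGroup) {
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < rows;
         r += gridDim.x * blockDim.x) {
        float bestCost = 1e30f;
        int bestGroup = 0;
        for (int rg = 0; rg < numRowGroups; ++rg) {
            float cost = 0.0f;
            for (int cg = 0; cg < numColGroups; ++cg) {
                float groupMean = groupStats[rg * numColGroups + cg] /
                                  (float)max(groupSizes[rg], 1);
                float diff = rowStats[r * numColGroups + cg] - groupMean;
                cost += diff * diff;   // placeholder for the real code-length delta
            }
            if (cost < bestCost) { bestCost = cost; bestGroup = rg; }
        }
        newRowGroup[r] = bestGroup;    // rows are moved in bulk after the kernel
    }
}
```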
Room For Improvement
- The parallelization could be more sophisticated
- Could block the rows and compute the cost of a row joining each row group in parallel
- Use shared-memory atomics to identify the minimum cost
- In practice, this algorithm heavily favors a small number of row and column groups, so the additional parallelism would be small
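A hedged sketch of that more parallel variant: one thread block per row, with the block's threads evaluating candidate row groups in parallel. Instead of shared-memory atomics (atomicMin is not directly available for floats), this sketch finds the minimum-cost group with a shared-memory tree reduction; it assumes blockDim.x is a power of two no larger than MAX_THREADS, and the per-group cost is the same simplified placeholder as above.

```cuda
#define MAX_THREADS 256

__global__ void reassignRowsBlockedKernel(const int *rowStats, const int *groupStats,
                                          const int *groupSizes,
                                          int rows, int numRowGroups, int numColGroups,
                                          int *newRowGroup) {
    __shared__ float sCost[MAX_THREADS];
    __shared__ int   sGroup[MAX_THREADS];

    for (int r = blockIdx.x; r < rows; r += gridDim.x) {
        float bestCost = 1e30f;
        int bestGroup = 0;
        // Each thread evaluates a strided subset of the candidate row groups.
        for (int rg = threadIdx.x; rg < numRowGroups; rg += blockDim.x) {
            float cost = 0.0f;
            for (int cg = 0; cg < numColGroups; ++cg) {
                float groupMean = groupStats[rg * numColGroups + cg] /
                                  (float)max(groupSizes[rg], 1);
                float diff = rowStats[r * numColGroups + cg] - groupMean;
                cost += diff * diff;   // placeholder cost
            }
            if (cost < bestCost) { bestCost = cost; bestGroup = rg; }
        }
        sCost[threadIdx.x]  = bestCost;
        sGroup[threadIdx.x] = bestGroup;
        __syncthreads();

        // Tree reduction in shared memory to find the minimum-cost group.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride && sCost[threadIdx.x + stride] < sCost[threadIdx.x]) {
                sCost[threadIdx.x]  = sCost[threadIdx.x + stride];
                sGroup[threadIdx.x] = sGroup[threadIdx.x + stride];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) newRowGroup[r] = sGroup[0];
        __syncthreads();
    }
}
```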
Implementation – Outer Loop
- After a local minimum is found, change the initial number of row and column groups and retry
- Change the number of row groups or the number of column groups, up or down
- Continue changing the number of row or column groups in that direction until the cost fails to decrease
- Try both directions in both dimensions before stopping
- The outer loop is performed on the CPU
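A hedged host-side sketch of this outer search over group counts. runCoclustering is a stand-in for the full GPU inner loop (a std::function here so the sketch is self-contained); the exact step policy and names are assumptions.

```cuda
#include <functional>

void outerSearch(int &numRowGroups, int &numColGroups,
                 std::function<float(int, int)> runCoclustering) {
    float bestCost = runCoclustering(numRowGroups, numColGroups);
    bool improved = true;
    while (improved) {
        improved = false;
        // Try growing/shrinking each dimension; keep stepping while the cost decreases.
        int deltas[2] = { +1, -1 };
        for (int dim = 0; dim < 2; ++dim) {         // 0 = row groups, 1 = column groups
            for (int i = 0; i < 2; ++i) {
                int d = deltas[i];
                while (true) {
                    int rg = numRowGroups + (dim == 0 ? d : 0);
                    int cg = numColGroups + (dim == 1 ? d : 0);
                    if (rg < 1 || cg < 1) break;
                    float cost = runCoclustering(rg, cg);
                    if (cost >= bestCost) break;    // stop when the cost fails to decrease
                    bestCost = cost;
                    numRowGroups = rg;
                    numColGroups = cg;
                    improved = true;
                }
            }
        }
    }
}
```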
Room for Improvement
- The outer loop could parallelize the inner-loop work across different GPUs
- Each GPU could explore a different dimension and direction in parallel
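A minimal sketch of that multi-GPU idea: one host thread per candidate, each binding to a GPU with cudaSetDevice and exploring a different (dimension, direction) step. exploreCandidate is a hypothetical stand-in for the full inner loop, and the work division is an assumption.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void exploreCandidate(int device, int numRowGroups, int numColGroups) {
    cudaSetDevice(device);   // all subsequent CUDA calls in this host thread use this GPU
    // ... allocate device memory, run the inner co-clustering loop, record the cost ...
}

void exploreInParallel(int numRowGroups, int numColGroups) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    // Candidate (row-group delta, col-group delta) pairs, spread across the available GPUs.
    int candidates[4][2] = { {+1, 0}, {-1, 0}, {0, +1}, {0, -1} };
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        int device = deviceCount > 0 ? i % deviceCount : 0;
        workers.emplace_back(exploreCandidate, device,
                             numRowGroups + candidates[i][0],
                             numColGroups + candidates[i][1]);
    }
    for (size_t i = 0; i < workers.size(); ++i) workers[i].join();
}
```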
Implementation – CPU + Validation
- The CPU implementation performed all of the steps described earlier, but sequentially
- Validation: used the CPU implementation of the statistics calculations to validate the GPU statistics calculations
- The CPU and GPU log implementations differ, so cost calculations were validated by allowing a 5% tolerance between results
- Did not have time to validate the overall algorithm or visualize its outputs to see whether the co-clusters produced were reasonable
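A small sketch of the tolerance check described above: because the CPU and GPU log implementations differ, the costs are compared with a 5% relative tolerance rather than exact equality. The helper name is illustrative.

```cuda
#include <cmath>

// Returns true if the GPU cost is within the given relative tolerance of the CPU cost.
bool costsMatch(float cpuCost, float gpuCost, float tolerance = 0.05f) {
    if (cpuCost == 0.0f) return fabsf(gpuCost) <= tolerance;
    return fabsf(cpuCost - gpuCost) / fabsf(cpuCost) <= tolerance;
}
```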
Timing Measurements
- CPU implementation: time measured with clock_t / CLOCKS_PER_SEC
- GPU implementation: time measured with CUDA events
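A sketch of the two timing methods mentioned above (clock() for the CPU path, CUDA events for the GPU path). The commented-out runCpuVersion/someKernel calls are hypothetical placeholders for the real work.

```cuda
#include <ctime>
#include <cstdio>
#include <cuda_runtime.h>

void timeCpu() {
    clock_t start = clock();
    // runCpuVersion();                           // placeholder for the CPU implementation
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("CPU time: %f s\n", seconds);
}

void timeGpu() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // someKernel<<<blocks, threads>>>(...);      // placeholder for the GPU implementation
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                   // wait for the recorded work to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time: %f ms\n", ms);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```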
Development Lessons Learned
- CUDA and structured data is a bad idea; even structs of arrays are impossible to deal with
- Host-side pointer math on device pointers does not work
- The CUDA API has REALLY unfriendly error messages; take care to do very, very little through that API
- __device__ variables declared globally must be passed to kernels; runtime errors result otherwise
- You can malloc and free memory (from the device heap) in device code as of CUDA 3.2
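Two of the lessons above, illustrated with hypothetical names: the device pointer is passed to the kernel explicitly as an argument (no host-side pointer arithmetic on it), and device code performs its own malloc/free, which draws from the device heap (supported since CUDA 3.2 on compute capability 2.0 hardware).

```cuda
#include <cuda_runtime.h>

__global__ void useBuffer(int *data, int n) {
    // Per-thread scratch space allocated from the device heap (CUDA >= 3.2).
    int *scratch = (int *)malloc(4 * sizeof(int));
    if (scratch == NULL) return;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
    free(scratch);
}

int main() {
    int n = 1024;
    int *dData = NULL;
    cudaMalloc(&dData, n * sizeof(int));
    useBuffer<<<4, 256>>>(dData, n);   // device pointer passed explicitly as a kernel argument
    cudaDeviceSynchronize();
    cudaFree(dData);
    return 0;
}
```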
Development Lessons Learned Cont.
- Visual Studio CUDA integration leaves a lot to be desired
- Even with all optimizations removed, you still can't set breakpoints everywhere
- Many variables show as freed
- No in-IDE, real-time, in-editor compile errors
- But Visual Studio does give nice auto-complete and go-to-definition navigation
- No CUDA linker => separate files must be directly #include'd
Experiment - Environment
- Float.cs.drexel.edu
- CPU: 4 quad-core Intel Xeon L5360
- GPU: 2 Nvidia GeForce GTX 580
Experiment - Description
- Sequential (CPU) and parallel (GPU) implementations tested on square matrices of order 100, 1000, and larger; still larger matrices caused memory problems
- GPU tested with varying block and thread counts
- Number of blocks: 10, 100, 5000
- Number of threads: 10, 100, 1024 (max)
- Resulting co-clusters usually stayed in a small row/column group range, regardless of matrix order
- Row and column groupings are important in the calculation of matrix statistics, since rows and columns are blocked by these groupings
Experiment Results [chart omitted; x-axis: Num Threads]
Experiment Results
- For a small number of blocks, 100-thread performance peaks when num_blocks * num_threads = matrix_order
- I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
- Slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially
Experiment Results [two further charts omitted; x-axis: Num Threads]
Experiment Results
- Interestingly, the maximum speedup was the same for all block counts
- Roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70x is achieved
- The 10-thread configuration never got there, perhaps due to block-scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case
- Maxing out the thread count is counter-productive on smaller matrices
- Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly
Experiment Results [three further charts omitted; x-axis: Num Threads]
Experiment Results
- Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
- Hypothesis: the overhead of starting blocks and threads is high enough to yield diminishing returns as blocks and threads are added