1
CO-CLUSTERING USING CUDA
2
Co-Clustering Explained
Problem: a large binary matrix of samples (rows) and features (columns)
Which samples should be grouped together? Why? What are their shared features?
Co-clustering provides the "why" explicitly: correlated sample/feature pairs
Row cluster: s1 and s3 are in a group
Column cluster: the distinguishing features are 2, 3, and 5
3
Co-Clustering - Details
Uses Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in: "Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining," Papadimitriou et al., Data Mining 2008
Partition the entire matrix into row groups and column groups
Minimize the length of the encoding of the resulting partitioned matrix
Competing code-length factors: the number of row and column groups vs. the homogeneity of the clusters
Iterate over rows, rearranging and sub-partitioning to find a better encoding using a heuristic
Repeat for columns, then rows again, until a local optimum is found
Complexity: O(n * fp * (row_groups + col_groups)² * iters)
Credit: Chakrabarti et al., KDD 2004
4
Implementation - Basics
Initial matrix generation: CPU
Initial random row/column group assignment: CPU
Memory structures are very simple: arrays of ints
5
Implementation – Stats Step 1
Statistics calculation: computes a statistic for each row within each column group
The statistic is the number of 1s the row has in that column group
Straightforward parallelization (each thread works on one row at a time), using global memory
[Figure: example matrix partitioned into row and column groups; Stat(Row 3, ColumnGroup 3) = 1]
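A minimal sketch of this step, assuming the matrix is stored row-major as 0/1 ints and the per-row statistics live in a numRows x numColGroups global-memory array; the kernel and variable names (computeRowStats, colGroup, stats) are illustrative, not the actual project code:

```cuda
// One thread per row, with a grid-stride loop so any launch configuration
// covers all rows. Only this thread touches stats[row][*], so no atomics
// are needed. stats is assumed to be zeroed (e.g. cudaMemset) before launch.
__global__ void computeRowStats(const int *matrix, const int *colGroup,
                                int *stats, int numRows, int numCols,
                                int numColGroups)
{
    for (int row = blockIdx.x * blockDim.x + threadIdx.x;
         row < numRows;
         row += gridDim.x * blockDim.x)
    {
        for (int col = 0; col < numCols; ++col) {
            if (matrix[row * numCols + col] == 1) {
                // Count the 1s this row contributes to each column group.
                stats[row * numColGroups + colGroup[col]] += 1;
            }
        }
    }
}
```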
6
Room for Improvement
Calculate the row statistics with the shared-memory histogram algorithm from the textbook:
Block the columns and assign one thread block to each column block
Compute shared-memory histograms within each block
Merge back to global memory when finished
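A rough sketch of that improvement, following the usual shared-memory histogram pattern: each thread block covers one row and one tile of columns, accumulates per-column-group counts in shared memory, and merges them into the global stats array with atomics. All names are hypothetical, and the launch geometry (grid.x = column tiles, grid.y = rows) is an assumption:

```cuda
__global__ void computeRowStatsShared(const int *matrix, const int *colGroup,
                                      int *stats, int numRows, int numCols,
                                      int numColGroups)
{
    extern __shared__ int localHist[];      // numColGroups counters per block

    int row = blockIdx.y;                   // one matrix row per grid row
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperatively zero the shared histogram.
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        localHist[g] = 0;
    __syncthreads();

    // Each thread inspects one matrix entry in this block's column tile.
    if (row < numRows && col < numCols && matrix[row * numCols + col] == 1)
        atomicAdd(&localHist[colGroup[col]], 1);
    __syncthreads();

    // Merge the block's partial histogram into global memory.
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        if (localHist[g] > 0)
            atomicAdd(&stats[row * numColGroups + g], localHist[g]);
}
```

Launched with a 2D grid of (ceil(numCols / blockDim.x), numRows) blocks and numColGroups * sizeof(int) bytes of dynamic shared memory.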
7
Implementation – Stats Step 2
Calculates the cost statistic for each row group within each column group
Essentially a reduce over the per-row data
Block the rows and assign each row block to a thread block
Use shared memory and atomics to build a histogram over all rows in a given row group
Merge the shared histogram with the global histogram for that row group
Iterate over all row groups
[Figure: example matrix partitioned into row and column groups; Stat(RowGroup 1, ColumnGroup 3) = 2]
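A hedged sketch of this reduction, again with illustrative names: each block handles a tile of rows, and for each row group in turn it builds a shared-memory histogram of column-group counts from the per-row statistics, then merges it into the global per-group histogram with atomics. groupStats is assumed to be zeroed before launch:

```cuda
__global__ void reduceGroupStats(const int *rowStats, const int *rowGroup,
                                 int *groupStats, int numRows,
                                 int numRowGroups, int numColGroups)
{
    extern __shared__ int localHist[];      // numColGroups counters per block

    int row = blockIdx.x * blockDim.x + threadIdx.x;

    // Iterate over all row groups; one shared histogram per group at a time.
    for (int rg = 0; rg < numRowGroups; ++rg) {
        for (int cg = threadIdx.x; cg < numColGroups; cg += blockDim.x)
            localHist[cg] = 0;
        __syncthreads();

        // Rows in this block that belong to group rg add their per-row stats.
        if (row < numRows && rowGroup[row] == rg)
            for (int cg = 0; cg < numColGroups; ++cg)
                atomicAdd(&localHist[cg], rowStats[row * numColGroups + cg]);
        __syncthreads();

        // Merge the block's partial histogram into the global histogram.
        for (int cg = threadIdx.x; cg < numColGroups; cg += blockDim.x)
            if (localHist[cg] > 0)
                atomicAdd(&groupStats[rg * numColGroups + cg], localHist[cg]);
        __syncthreads();    // so the next iteration can safely re-zero localHist
    }
}
```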
8
Implementation – Row/Column Group Optimization
For each row, find the optimal group it could belong to
Parallelized straightforwardly: one row per thread, with a loop-and-stride pattern to cover all rows
Each row's calculation goes through every row group and determines the global cost of moving the row to that group
Move all rows to their optimal groups
Recompute statistics
Repeat for column groups
Continue alternating row and column groupings until convergence
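A minimal sketch of the reassignment kernel under the "one row per thread, loop and stride" scheme. For brevity it assumes the cost of moving each row into each row group has already been written to a cost array (numRows x numRowGroups); in the scheme described above each thread would compute that cost from the group statistics instead. Names are illustrative:

```cuda
__global__ void findBestRowGroup(const float *cost, int *bestGroup,
                                 int numRows, int numRowGroups)
{
    for (int row = blockIdx.x * blockDim.x + threadIdx.x;
         row < numRows;
         row += gridDim.x * blockDim.x)
    {
        // Scan all row groups and remember the cheapest assignment;
        // rows are actually moved (and stats recomputed) in a later pass.
        int bestG = 0;
        float best = cost[row * numRowGroups];
        for (int g = 1; g < numRowGroups; ++g) {
            float c = cost[row * numRowGroups + g];
            if (c < best) { best = c; bestG = g; }
        }
        bestGroup[row] = bestG;
    }
}
```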
9
Room for Improvement
The parallelization could be more sophisticated
Could block the rows and compute the cost of a row joining each row group in parallel, using shared-memory atomics to identify the minimum cost
In practice, this algorithm heavily favors a small number of row and column groups, so the gain from that parallelization would be small
10
Implementation – Outer Loop
After a local minimum is found, change the initial number of row and column groups and retry
Change the number of row groups or the number of column groups, up or down
Continue changing the count in that direction until the cost fails to decrease
Try both directions in both dimensions before stopping
The outer loop is performed on the CPU
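A host-side sketch of this outer loop, assuming a runInnerLoop(rowGroups, colGroups) callback that runs the GPU inner loop to convergence and returns the final encoding cost; the callback and names are hypothetical stand-ins, not the project's actual interface:

```cuda
// Greedy search over the number of row/column groups, performed on the CPU.
typedef double (*InnerLoopFn)(int rowGroups, int colGroups);

void outerLoop(int &rowGroups, int &colGroups, InnerLoopFn runInnerLoop)
{
    double bestCost = runInnerLoop(rowGroups, colGroups);
    bool improved = true;
    while (improved) {
        improved = false;
        // Try both directions (+1 / -1) in both dimensions (rows / columns).
        const int deltas[4][2] = { {+1, 0}, {-1, 0}, {0, +1}, {0, -1} };
        for (int d = 0; d < 4; ++d) {
            int rg = rowGroups + deltas[d][0];
            int cg = colGroups + deltas[d][1];
            // Keep moving in this direction until the cost fails to decrease.
            while (rg >= 1 && cg >= 1) {
                double cost = runInnerLoop(rg, cg);
                if (cost >= bestCost) break;
                bestCost = cost;
                rowGroups = rg;
                colGroups = cg;
                improved = true;
                rg += deltas[d][0];
                cg += deltas[d][1];
            }
        }
    }
}
```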
11
Room for Improvement
The outer loop could parallelize the inner-loop work across different GPUs
Each GPU could explore a different dimension and direction in parallel
12
Implementation – CPU + Validation
The CPU implementation performs all of the steps described earlier, but sequentially
Validation:
Used the CPU implementation of the statistics calculations to validate the GPU statistics calculations
The CPU and GPU log implementations differ, so cost calculations were validated by allowing a 5% tolerance between results
Did not have time to validate the overall algorithm or visualize its outputs to see whether the co-clusters produced were reasonable
13
Timing Measurements
CPU implementation: time measured with clock_t / CLOCKS_PER_SEC
GPU implementation: time measured with CUDA events
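For reference, a minimal sketch of both timing approaches; the co-clustering work itself is elided, and everything else uses only standard C and CUDA runtime calls:

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

int main()
{
    // CPU timing, as used for the sequential implementation.
    clock_t cpuStart = clock();
    // ... run the sequential co-clustering here ...
    double cpuSeconds = (double)(clock() - cpuStart) / CLOCKS_PER_SEC;

    // GPU timing with CUDA events, as used for the parallel implementation.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // ... launch the co-clustering kernels here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // wait until the recorded work has finished
    float gpuMilliseconds = 0.0f;
    cudaEventElapsedTime(&gpuMilliseconds, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("CPU: %.3f s, GPU: %.3f ms\n", cpuSeconds, gpuMilliseconds);
    return 0;
}
```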
14
Development Lessons Learned
CUDA and structured data is a bad idea; even structs of arrays are impossible to deal with
Host-side pointer math on device pointers does not work
The CUDA API has REALLY unfriendly error messages, so take care to do very, very little through that API
__device__ variables declared globally must be passed to kernels; runtime errors otherwise
You can malloc and free device memory in device code as of CUDA 3.2
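For instance, a per-thread scratch allocation (a toy illustration, not project code); device-side malloc and free draw from a heap in global device memory:

```cuda
// Device-side malloc/free requires CUDA 3.2+ and a compute capability 2.0+ GPU.
__global__ void scratchExample(int n)
{
    int *scratch = (int *)malloc(n * sizeof(int));  // per-thread heap allocation
    if (scratch != NULL) {
        for (int i = 0; i < n; ++i)
            scratch[i] = i;
        free(scratch);
    }
}
```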
15
Development Lessons Learned (cont.)
Visual Studio CUDA integration leaves a lot to be desired
Even with all optimizations removed, breakpoints still can't be set everywhere
Many variables show up as freed
No real-time, in-editor compile errors
But Visual Studio does give nice auto-complete and go-to-definition navigation
No CUDA linker, so separate files must be directly #include'd
16
Experiment - Environment
Float.cs.drexel.edu
CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz
17
Experiment - Description
Sequential (CPU) and parallel (GPU) implementations tested on square matrices of order 100, 1000, and 10000; larger matrices caused memory problems
GPU tested with varying block and thread counts
Number of blocks: 10, 100, 5000
Number of threads: 10, 100, 1024 (the maximum)
The resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
The row and column groupings matter in the calculation of the matrix statistics, since rows and columns are blocked by these groupings
18
Experiment Results
[Chart omitted; x-axis: Num Threads]
19
Experiment Results
For a small number of blocks, 100-thread performance peaks when num_blocks * num_threads = matrix_order
I would expect this to be the optimal configuration, where num_blocks ~= num_row_groups ~= num_col_groups
Slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially
20
Experiment - Results
[Chart omitted; x-axis: Num Threads]
21
Experiment Results
[Chart omitted; x-axis: Num Threads]
22
Experiment Results
Interestingly, the maximum speedup was the same for all block counts
Roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70x is achieved
10 threads never got there; due to block scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case
Maxing out the thread count is counterproductive on smaller matrices
Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly
23
Experiment Results
[Chart omitted; x-axis: Num Threads]
24
Experiment Results
[Chart omitted; x-axis: Num Threads]
25
Experiment Results
[Chart omitted; x-axis: Num Threads]
26
Experiment Results
Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
Hypothesis: the overhead of starting blocks and threads is high enough that adding blocks and threads yields diminishing returns