Parallel C3M. Aylin Tokuç, Erkan Okuyan, Özlem Gür
Outline: Basics of parallel computing; Sequential C3M; Parallel C3M
Parallel Computation: Decomposition is the process of dividing a computation into smaller parts. A task is a programmer-defined unit of computation into which the main computation is subdivided by means of decomposition.
Parallel Computation, Primary Considerations: Load balancing; minimizing communication; task dependency optimization.
Parallel Computation: Load Balancing
Parallel Computation: Minimizing Communication
Parallel Computation: Task Dependency Optimization
C3M Algorithm:
1. Determine the cluster seeds of the database.
2. If d_i is not a cluster seed, find the cluster seed (if any) that maximally covers d_i.
3. If unclustered documents remain, group them into a ragbag cluster.
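The three steps above can be sketched in plain Python. This is a minimal illustration, not the presenters' implementation. It assumes the binary-matrix formulas commonly given for C3M by Can and Ozkarahan: α_i and β_k are reciprocal row and column sums, c_ij = α_i Σ_k d_ik β_k d_jk, δ_i = c_ii, n_c = Σ δ_i, and seed power P_i = δ_i (1 − δ_i) Σ_k d_ik. The function name `c3m`, the matrix `D_EXAMPLE`, and the simple tie handling are assumptions made for illustration; every row and column is assumed to contain at least one non-zero.

```python
def c3m(D):
    """Cluster the rows (documents) of a document-by-term matrix D."""
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                            # reciprocal row sums
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]  # reciprocal column sums

    def cover(i, j):
        # c_ij: extent to which document i is covered by document j
        return alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))

    delta = [cover(i, i) for i in range(m)]      # decoupling coefficients
    n_c = max(1, round(sum(delta)))              # number of clusters
    power = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(m)]

    # Step 1: the n_c documents with the highest seed power become seeds.
    seeds = sorted(range(m), key=lambda i: power[i], reverse=True)[:n_c]
    clusters = {s: [s] for s in seeds}
    ragbag = []

    for i in range(m):
        if i in clusters:
            continue
        # Step 2: pick the seed that covers document i the most.
        best = max(seeds, key=lambda s: cover(i, s))
        if cover(i, best) > 0:
            clusters[best].append(i)
        else:
            # Step 3: no seed covers this document at all.
            ragbag.append(i)
    return clusters, ragbag


# Made-up 4-document, 5-term binary matrix used only for illustration.
D_EXAMPLE = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]

if __name__ == "__main__":
    print(c3m(D_EXAMPLE))
```

The sketch recomputes cover coefficients on demand for brevity; a real implementation would only compute c_ij between non-seed documents and seeds that share at least one term.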
C3M Formulas
C3M: Sample Matrices
Parallel C3M, Distribution: Distribute rows among processors. Balance the load with a cyclic block distribution.
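A cyclic block distribution deals fixed-size blocks of consecutive rows to processors in round-robin order, so long and short documents tend to be spread evenly. The sketch below shows one common way to express the mapping; the function names and the `block_size` parameter are illustrative assumptions, since the slides do not state the block size used.

```python
def owner(row, block_size, num_procs):
    """Rank of the processor that owns a given row index."""
    return (row // block_size) % num_procs


def local_rows(rank, num_rows, block_size, num_procs):
    """All row indices assigned to one processor, in increasing order."""
    return [r for r in range(num_rows) if owner(r, block_size, num_procs) == rank]


if __name__ == "__main__":
    # 10 rows, blocks of 2 rows, 3 processors
    for rank in range(3):
        print(rank, local_rows(rank, 10, 2, 3))
```

With a block size of 1 this reduces to a plain cyclic distribution, and with a block size of ceil(num_rows / num_procs) it reduces to a plain block distribution.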
Local Calculations: All processors calculate α, partial β, and P_i. The current method for the weighted matrix is too costly: it needs column vectors, but the matrix is partitioned row-wise.
Seed Powers P_i: The seed power P_i should be small for a document whose terms appear in too many or too few documents, and larger for a document whose terms appear in a moderate number of documents.
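The behaviour described above follows from the product δ_i (1 − δ_i) in the seed power. The snippet below is a small numeric illustration, assuming the binary-matrix form P_i = δ_i (1 − δ_i) x_i, where x_i is the number of terms in document d_i; this form and the function name `seed_power` are assumptions, since the formula slide did not survive. A δ_i near 1 corresponds to terms shared with almost no other document, and a δ_i near 0 to terms shared with very many documents.

```python
def seed_power(delta, n_terms):
    """Assumed binary-matrix seed power: delta * (1 - delta) * number of terms."""
    return delta * (1.0 - delta) * n_terms


if __name__ == "__main__":
    # Same document length, three different decoupling coefficients.
    for delta in (0.05, 0.5, 0.95):
        print(f"delta={delta:.2f}  P={seed_power(delta, 10):.3f}")
```

The product peaks at δ_i = 0.5 and falls off symmetrically toward both extremes, which is why only documents with moderately shared terms make strong seed candidates.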
Minimize Communication, Proposed Heuristic: # of non-zeros. All processors calculate α, partial β, and β′.
Effectiveness of Heuristic: A MATLAB script was written to evaluate the effectiveness of the proposed heuristic. Correlation coefficient = 0.95.
Communication between Processors: Partial β and β′ vectors are exchanged between processors to calculate the final β and β′ vectors. Then all processors calculate c_ii = δ_i.
Number of Clusters: Processors exchange local δ values. All processors calculate n_c.
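The two exchange steps (partial β, then local δ) can be simulated on a single machine. In this sketch each "processor" is just a list of rows, and `all_reduce_sum` is a hypothetical stand-in for a collective sum such as an MPI all-reduce. It assumes β_k is the reciprocal of the global column sum, δ_i = α_i Σ_k d_ik β_k d_ik, and n_c = Σ δ_i. The β′ vector of the proposed heuristic is not modelled, because the slides do not define it.

```python
def partial_column_sums(rows, n_terms):
    """Column sums over the rows owned by one processor."""
    sums = [0.0] * n_terms
    for row in rows:
        for k, weight in enumerate(row):
            sums[k] += weight
    return sums


def all_reduce_sum(vectors):
    """Stand-in for a collective sum: element-wise total of all vectors."""
    return [sum(column) for column in zip(*vectors)]


def local_deltas(rows, beta):
    """delta_i = c_ii for each locally owned row."""
    return [sum(w * b * w for w, b in zip(row, beta)) / sum(row) for row in rows]


def number_of_clusters(rows_per_proc, n_terms):
    # Exchange 1: combine partial column sums into the final beta vector.
    partials = [partial_column_sums(rows, n_terms) for rows in rows_per_proc]
    beta = [1.0 / s for s in all_reduce_sum(partials)]
    # Local work: every processor computes delta for its own rows.
    deltas = [local_deltas(rows, beta) for rows in rows_per_proc]
    # Exchange 2: combine the local delta sums into n_c.
    n_c = all_reduce_sum([[sum(d)] for d in deltas])[0]
    return beta, deltas, n_c


if __name__ == "__main__":
    # Made-up matrix split over two simulated processors.
    rows_per_proc = [
        [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]],
        [[0, 0, 0, 1, 1], [0, 0, 1, 1, 1]],
    ]
    print(number_of_clusters(rows_per_proc, 5))
```

Only one vector of length n (the number of terms) and one scalar per processor cross the network in this scheme; the rows themselves stay where they are.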
Cluster-head Selection: Calculate the seed powers of local documents. Exchange the largest n_c seed powers. Find the largest n_c seed powers among all P_i to determine the cluster heads.
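Exchanging only the largest n_c local seed powers is sufficient because every globally top-n_c document must also be in the top n_c of the processor that owns it. The sketch below simulates this selection with `heapq.nlargest`. The (document id, seed power) pair layout, the function names, and the rule of breaking ties in favour of the lower document id are assumptions made so that every processor arrives at the same answer.

```python
import heapq


def _rank_key(candidate):
    doc_id, power = candidate
    # Higher power first; on ties, prefer the lower document id.
    return (power, -doc_id)


def local_candidates(local_powers, n_c):
    """Largest n_c (doc_id, seed_power) pairs owned by one processor."""
    return heapq.nlargest(n_c, local_powers, key=_rank_key)


def select_cluster_heads(all_candidates, n_c):
    """Merge every processor's candidates and keep the global top n_c."""
    merged = [c for candidates in all_candidates for c in candidates]
    top = heapq.nlargest(n_c, merged, key=_rank_key)
    return sorted(doc_id for doc_id, _ in top)


if __name__ == "__main__":
    # Made-up seed powers on two simulated processors.
    proc0 = [(0, 0.75), (1, 0.5)]
    proc1 = [(2, 0.5), (3, 0.75)]
    exchanged = [local_candidates(p, 2) for p in (proc0, proc1)]
    print(select_cluster_heads(exchanged, 2))
```

Each processor sends at most n_c pairs, so the exchanged volume depends on the number of clusters and processors rather than on the number of documents.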
Clustering Non-seed Documents: Exchange seed documents. Each processor then clusters its non-seed documents as in sequential C3M.
Future Work: Term-based clustering; overlapping clusters.
C3M Summary: Load balancing with cyclic block distribution. Communication minimization by a new heuristic. Task dependency minimized with block distribution and the heuristic.
References:
Concepts and the effectiveness of the cover coefficient-based clustering methodology, F. Can, E. A. Ozkarahan
Parallelizing the Buckshot Algorithm for Efficient Document Clustering, Eric C. Jensen, Steven M. Beitzel, Angelo J. Pilotto, Nazli Goharian, Ophir Frieder
Clustering and Classification of Large Document Bases in a Parallel Environment, Anthony S. Ruocco, Ophir Frieder
Efficient Clustering of Very Large Document Collections, I.S. Dhillon, J. Fan, Y. Guan
Questions?
The End. Thank you for your patience.