DisCo: Distributed Co-clustering with Map-Reduce
S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center
Speakers: 吳宏君, 陳威遠, 洪浩哲
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Introduction: Background, Goal, Map-Reduce
Background
Huge datasets are becoming prevalent
– Real-world applications produce huge volumes of messy data (terabytes or more)
– Pre-processing the raw data is important
Map-Reduce tools
– A simple but powerful execution engine
– Unconcerned about data models and storage schemes
Goal
Focus on co-clustering (bi-clustering) of pairwise relationships from raw data
– Co-clustering searches for groups of rows and columns that are inter-related
Propose a comprehensive Distributed Co-clustering (DisCo) solution, from raw data to the end clusters
– Involves data gathering, pre-processing, analysis, and presentation
– Adopts Map-Reduce (Hadoop) both as the programming model and as the implementation testbed
Map-Reduce
Distributed, scalable, fault-tolerant data storage, management, and processing tools
– A distributed execution engine for select-project via sequential scan, followed by hashed partitioning and sort-merge group-by
– Suited for data already stored on a distributed file system
– Map-Reduce can transparently use any number of machines
Map-Reduce [figure]
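To make the execution model concrete, here is a minimal word-count-style sketch against the Hadoop Mapper/Reducer API (an illustration, not code from the paper): the map phase performs the sequential scan and projection, and the framework supplies the hashed partitioning and sort-merge group-by before the reduce phase.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: sequential scan over input splits; emits (word, 1) pairs.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);  // framework hash-partitions by key
    }
  }
}

// Reduce: receives each key with all its values grouped together
// (the framework's sort-merge group-by).
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum));
  }
}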
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Distributed Mining Process
Distributed Mining Process
Data pre-processing
– Building the graph from raw data: extract (SrcIP, DstIP) pairs and build the adjacency matrix
– Pre-computing the transpose
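A minimal sketch of this extraction step, assuming whitespace-separated raw records in which the SrcIP and DstIP field positions (0 and 1 below) are hypothetical; the reducer collects each source's destinations into one adjacency-list row. This illustrates the idea, not the paper's actual code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse one raw log record, project out (SrcIP, DstIP).
public class EdgeExtractMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length < 2) return;  // skip malformed records
    ctx.write(new Text(fields[0]), new Text(fields[1]));  // (SrcIP, DstIP)
  }
}

// Reduce: group by SrcIP, emit one adjacency-list row of the matrix.
class AdjacencyListReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text src, Iterable<Text> dsts, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder row = new StringBuilder();
    for (Text dst : dsts) {
      if (row.length() > 0) row.append(',');
      row.append(dst.toString());
    }
    ctx.write(src, new Text(row.toString()));  // "src \t d1,d2,..."
  }
}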
Distributed Mining Process
Data pre-processing
– Building the graph from raw data
– Pre-computing the transpose: during co-clustering optimization we need to iterate over both rows and columns, so we pre-compute the adjacency lists for both the original graph and its transpose
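Under the same assumptions, pre-computing the transpose is a one-line change to the hypothetical mapper above: emit each edge reversed, and reuse the same AdjacencyListReducer to obtain the adjacency lists of the transposed graph.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: same parsing as the hypothetical EdgeExtractMapper, but the edge
// is emitted reversed as (DstIP, SrcIP); paired with AdjacencyListReducer
// this produces the adjacency lists of the transposed graph.
public class TransposeMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length < 2) return;  // skip malformed records
    ctx.write(new Text(fields[1]), new Text(fields[0]));  // reversed edge
  }
}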
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Co-clustering
Definitions and overview
– Co-clustering allows simultaneous clustering of the rows and columns of a matrix
– Input format: a matrix with m rows and n columns
– The co-clustering algorithm employs a checkerboard partition: the original adjacency matrix is divided into a grid of sub-matrices
Co-clustering [figure]
Co-clustering
Goal
– Find row and column group assignment vectors such that the error function is minimized
Co-clustering [figure]
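The formula on the original slide did not survive extraction. One common checkerboard objective, assuming a sum-squared-error measure (the co-clustering framework admits other error measures as well), is:

\min_{r,\,c} \sum_{i=1}^{m} \sum_{j=1}^{n}
    \bigl( A_{ij} - \hat{A}_{r(i),\,c(j)} \bigr)^2,
\qquad
\hat{A}_{pq} =
    \frac{\sum_{i:\,r(i)=p} \sum_{j:\,c(j)=q} A_{ij}}
         {\lvert \{ i : r(i)=p \} \rvert \, \lvert \{ j : c(j)=q \} \rvert}

where r and c are the row and column group assignment vectors and \hat{A}_{pq} is the mean of block (p, q).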
Co-clustering
Example: Co-clustering(A, k, l) with k = 2, l = 2
A = (reconstructed from the adjacency lists on the following slides)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Co-clustering
Initial assignments: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2; r(1)=1, r(2)=1, r(3)=1, r(4)=2
Co-clustering
Row iteration: r(2) is reassigned from 1 to 2; now r(1)=1, r(2)=2, r(3)=1, r(4)=2; columns unchanged: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2
Co-clustering
Column iteration: c(2) is reassigned from 1 to 2; now c(1)=1, c(2)=2, c(3)=1, c(4)=2, c(5)=2
Co-clustering with Map-Reduce
One iteration over rows runs as a single Map-Reduce job (see the mapper sketch after the walkthrough below)
Co-clustering with Map-Reduce [figure]
Co-clustering with Map-Reduce
Map input (adjacency lists):
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
r, c, G: random initialization based on parameters k, l
Co-clustering with Map-Reduce
Map input (adjacency lists):
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
k = 2, l = 2; r = {1,1,1,2}; c = {1,1,1,2,2}
Co-clustering with Map-Reduce
Fix columns; row iteration
Record 1 -> 2,4,5 becomes (key, value) = (1, {2,4,5})
Co-clustering with Map-Reduce
Record 2 -> 1,3 becomes (key, value) = (2, {1,3})
Co-clustering with Map-Reduce [figure]
Co-clustering with Map-Reduce
p = 1: the intermediate (key, value) pairs are keyed by the assigned row group
Co-clustering with Map-Reduce
Grouped intermediate pairs: (1, {(2,4),(1,3)}) and (2, {(4,0),(2,4)}), i.e., row group 1 has per-column-group counts (2,4) and member rows {1,3}; row group 2 has counts (4,0) and member rows {2,4}
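A sketch of the row-iteration map step just traced. Assumptions not in the slides: the column assignment vector and group parameters arrive via hypothetical job-configuration keys (disco.*), and blockCost is a placeholder for the paper's error measure. The mapper builds each row's per-column-group counts (e.g., row 1's {2,4,5} becomes (1,2) under c = {1,1,1,2,2}), picks the cheapest row group, and emits the row and its counts keyed by that group; a reducer can then sum counts and collect member rows, yielding pairs like (1, {(2,4),(1,3)}) above.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One row iteration with columns fixed: each row is (re)assigned to the
// row group minimizing a cost over its per-column-group counts.
public class RowIterationMapper extends Mapper<Text, Text, Text, Text> {
  private int[] c;   // column group assignments (hypothetically broadcast via config)
  private int k, l;  // number of row / column groups

  @Override
  protected void setup(Context ctx) {
    // Hypothetical configuration keys; defaults match the running example.
    String[] parts = ctx.getConfiguration()
        .get("disco.col.assign", "1,1,1,2,2").split(",");
    c = new int[parts.length];
    for (int j = 0; j < parts.length; j++) c[j] = Integer.parseInt(parts[j]);
    k = ctx.getConfiguration().getInt("disco.k", 2);
    l = ctx.getConfiguration().getInt("disco.l", 2);
  }

  @Override
  protected void map(Text rowId, Text adjList, Context ctx)
      throws IOException, InterruptedException {
    // Per-column-group nonzero counts, e.g. row 1's {2,4,5} -> (1,2).
    long[] counts = new long[l];
    for (String col : adjList.toString().split(","))
      counts[c[Integer.parseInt(col.trim()) - 1] - 1]++;  // 1-based ids

    // Choose the row group with minimum cost.
    int best = 1;
    double bestCost = Double.MAX_VALUE;
    for (int p = 1; p <= k; p++) {
      double cost = blockCost(p, counts);
      if (cost < bestCost) { bestCost = cost; best = p; }
    }
    // Emit (row group, row id + counts); a reducer sums counts per group
    // and collects the member rows.
    ctx.write(new Text(Integer.toString(best)),
              new Text(rowId + ":" + Arrays.toString(counts)));
  }

  // Placeholder for the paper's error measure: would score putting a row
  // with these counts into group p, given the global group statistics G.
  private double blockCost(int p, long[] counts) {
    return 0.0;  // stand-in only
  }
}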
Co-clustering [figure]
Co-clustering
Performance tuning
– The relevant parameters have to do with thread pool sizes
– Parameters: number of map tasks, number of reduce tasks, input split size (detailed in the Experiments section)
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments (Setup; Scalability and performance), Related Work, Conclusions, Discussion
Experiments
Setup
– 39 nodes in the cluster, located in 4 blade servers
– Hadoop Distributed File System (HDFS) capacity: 2.4TB
– Sun JDK 1.6.0_03
– Per-node configuration: CPU: 2 × Intel Xeon 2.66GHz (two dual-core); Memory: 8GB; OS: Red Hat Enterprise Linux
– Datasets: ISS network data and TREC text data (see the following slides)
Experiments (cont'd)
Scalability and performance
Performance: the effect of three parameters (see the configuration sketch below):
1) maximum number of concurrent map tasks per node
2) number of reduce tasks
3) minimum input split size
Scalability: wall-clock time vs. number of nodes
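For reference, a sketch of where these three knobs live in the classic (0.x-era) Hadoop configuration. The values mirror the optimal settings reported on the next slide, and the per-node map-task maximum is really a TaskTracker setting normally configured cluster-wide in hadoop-site.xml rather than per job.

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TuningExample.class);

    // 1) Max concurrent map tasks per node (TaskTracker-side setting).
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 6);

    // 2) Number of reduce tasks for the job.
    conf.setNumReduceTasks(5);

    // 3) Minimum input split size in bytes (here 256MB).
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
  }
}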
Experiments (cont'd)
Preprocessing ISS data
Optimal Map-Reduce parameter values (Figure 8):
– 6 concurrent map tasks per node
– 5 reduce tasks
– 256MB input split size
Experiments (cont'd)
Co-clustering TREC data
As job size decreases, framework overheads increase
Two observations:
1) 20±2 sec/iteration, better than a single machine with 48GB RAM
2) As the dataset size increases, the implementation achieves linear scale-up
Experiments (cont'd)
Behavior of the co-clustering iterations with respect to the number of concurrent maps, the number of reduce tasks, and the input split size is almost identical to Figure 8
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work (Map-Reduce framework; Co-clustering), Conclusions, Discussion
Related Work
Map-Reduce framework
– Simple but powerful
– Uses a distributed file system (GFS, HDFS, …)
– Block-addressable storage & centralized metadata server
Related Work (cont'd)
Co-clustering methods differ in:
– Cluster shapes (e.g., checkerboard partitions)
– Properties of the input data
– Optimization objective
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Conclusions
Designed a holistic approach to data mining
– Distributed infrastructure: Map-Reduce
– Co-clustering: the Distributed Co-clustering (DisCo) framework
Built from relatively low-cost components
Performance scales almost linearly as machines/disks are added
Results demonstrated on real-world datasets
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Discussion
– In a distributed file system, how should the system deal with tasks that fail?
– As hardware improves, will performance keep increasing linearly? The paper lacks experimental evidence on this point.
Discussion
– Increasing the input split size to several multiples of the HDFS block size makes it harder to place map tasks on local copies of the data. Why?
Q & A
Thanks for your attention!