Published by Lorraine Atkinson. Modified over 9 years ago.
1
Spiros Papadimitriou, Jimeng Sun. IBM T.J. Watson Research Center, Hawthorne, NY, USA. Reporter: Nai-Hui, Ku
2
Introduction; Related Work; Distributed Mining Process; Co-clustering Huge Datasets; Experiments; Conclusions
3
Problems: huge datasets; natural sources of data are in an impure form. Proposed method: a comprehensive Distributed Co-clustering (DisCo) solution using Hadoop. DisCo is a scalable framework under which various co-clustering algorithms can be implemented.
4
The Map-Reduce framework employs a distributed storage cluster, which provides: block-addressable storage; a centralized metadata server; a convenient data access storage API for Map-Reduce tasks.
5
Co-clustering algorithms. Cluster shapes: checkerboard partitions; single bi-cluster; exclusive row and column partitions; overlapping partitions. Optimization criteria: code length.
6
Identify the source and obtain the data. Transform the raw data into the appropriate format for data analysis. Results are visualized, or turned into the input for other applications.
7
Data pre-processing: processing a 350 GB raw network event log needs over 5 hours to extract source/destination IP pairs. Much better performance is achieved on a few commodity nodes running Hadoop. Setting up Hadoop required minimal effort.
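The extraction step can be pictured as a map function that parses each log line and a reduce function that counts each pair. The sketch below simulates that flow in memory. The whitespace-separated log layout (timestamp, source IP, destination IP) and the names map_line, reduce_pair and run_job are assumptions made for illustration; they are not the log format or the code used in the talk.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit ((src, dst), 1) for one log line; skip malformed lines."""
    fields = line.split()
    if len(fields) < 3:
        return []
    _timestamp, src, dst = fields[:3]  # assumed column order
    return [((src, dst), 1)]

def reduce_pair(key, values):
    """Reduce: total number of events seen for one (src, dst) pair."""
    return key, sum(values)

def run_job(lines):
    """In-memory stand-in for the shuffle phase: group map output by key."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_pair(k, v) for k, v in groups.items())

log = [
    "1200000000 10.0.0.1 10.0.0.9",
    "1200000001 10.0.0.1 10.0.0.9",
    "1200000002 10.0.0.2 10.0.0.9",
    "garbage",
]
pair_counts = run_job(log)
```

On a real cluster the grouping in run_job is done by the framework's shuffle, so only the two small functions would need to be written.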
8
Specifically for co-clustering, there are two main preprocessing tasks: building the graph from raw data, and pre-computing the transpose. During co-clustering optimization, we need to iterate over both rows and columns, so we need to pre-compute the adjacency lists for both the original graph and its transpose.
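As a rough illustration of the two preprocessing tasks, the sketch below builds adjacency lists from (row, column) pairs and then derives the transpose by swapping each pair and regrouping. It runs in memory on a toy edge list; the function names build_adjacency and transpose are hypothetical, and a distributed version would do the same grouping in the reduce phase.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Group (row, col) pairs into sorted adjacency lists keyed by row."""
    adj = defaultdict(set)  # a set drops duplicate edges
    for i, j in edges:
        adj[i].add(j)
    return {i: sorted(cols) for i, cols in adj.items()}

def transpose(adj):
    """Swap the roles of rows and columns: emit (col, row) and regroup."""
    flipped = ((j, i) for i, cols in adj.items() for j in cols)
    return build_adjacency(flipped)

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (0, 1)]  # duplicate edge on purpose
rows = build_adjacency(edges)
cols = transpose(rows)
```

Keeping both rows and cols on disk means a row pass and a column pass can each read their input sequentially, one adjacency list per record.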
9
Definitions and overview. Matrices are denoted by boldface capital letters; vectors are denoted by boldface lowercase letters. a_ij: the (i, j)-th element of matrix A. Co-clustering algorithms employ a checkerboard partition of the original adjacency matrix into a grid of sub-matrices. For an m × n matrix, a co-clustering is a pair of row and column labeling vectors. r(i): the group label of the i-th row of the matrix. G: the k × ℓ group matrix.
10
g_pq gives the sufficient statistics for the (p, q) sub-matrix.
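To make the group matrix concrete, the sketch below computes G for a small binary matrix stored as adjacency lists. It assumes the sufficient statistic is simply the number of nonzeros in each (p, q) sub-matrix; other optimization criteria may need different statistics. The names group_matrix, r and c are illustrative, with r and c holding the row and column labels.

```python
def group_matrix(adj, r, c, k, l):
    """Return the k x l matrix G where G[p][q] counts the nonzeros
    whose row has label p and whose column has label q."""
    G = [[0] * l for _ in range(k)]
    for i, cols in adj.items():
        for j in cols:
            G[r[i]][c[j]] += 1
    return G

# 4 x 4 binary matrix as adjacency lists: row index -> nonzero columns.
adj = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [3]}
r = [0, 0, 1, 1]  # row labels
c = [0, 0, 1, 1]  # column labels
G = group_matrix(adj, r, c, k=2, l=2)
```

Here the two diagonal blocks hold all the nonzeros, so G comes out as [[4, 0], [0, 3]].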
11
Map function
12
Reduce function
13
Global sync
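The three steps named on these slides (map, reduce, global sync) can be sketched for one pass over the rows as follows. This is a minimal in-memory sketch, not the presenters' implementation: the cost function (squared distance to a group's mean row profile) is a hypothetical stand-in for whatever optimization criterion is plugged in, and all function names are illustrative.

```python
from collections import defaultdict

def row_stats(cols, c, l):
    """g_i: number of nonzeros of one row falling in each column group."""
    g = [0] * l
    for j in cols:
        g[c[j]] += 1
    return g

def cost(g_i, group_row, group_size):
    """Hypothetical cost: squared distance to the group's mean row profile."""
    if group_size == 0:
        return float("inf")
    return sum((x - y / group_size) ** 2 for x, y in zip(g_i, group_row))

def map_row(i, cols, c, G, sizes):
    """Map: pick the best row group p for row i, emit (p, (g_i, [i]))."""
    g_i = row_stats(cols, c, len(G[0]))
    p = min(range(len(G)), key=lambda q: cost(g_i, G[q], sizes[q]))
    return p, (g_i, [i])

def reduce_group(p, values):
    """Reduce: sum the statistics and collect the rows assigned to group p."""
    total = [sum(col) for col in zip(*(g for g, _ in values))]
    members = sorted(i for _, ids in values for i in ids)
    return p, (total, members)

def row_iteration(adj, r, c, G):
    """Global sync: run map and reduce, then rebuild G and r."""
    sizes = [r.count(p) for p in range(len(G))]
    shuffled = defaultdict(list)
    for i, cols in adj.items():
        p, value = map_row(i, cols, c, G, sizes)
        shuffled[p].append(value)
    new_G = [[0] * len(G[0]) for _ in G]
    new_r = list(r)
    for p, values in shuffled.items():
        _, (total, members) = reduce_group(p, values)
        new_G[p] = total
        for i in members:
            new_r[i] = p
    return new_r, new_G

adj = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [0, 1]}
c = [0, 0, 1, 1]
r = [0, 0, 1, 1]       # row 3 starts in the wrong group
G = [[4, 0], [2, 2]]   # nonzero counts under the starting labels
new_r, new_G = row_iteration(adj, r, c, G)
```

In this toy run row 3 moves to group 0, giving new_r = [0, 0, 1, 0] and new_G = [[6, 0], [0, 2]]. A column pass would follow the same pattern over the transposed adjacency lists with the roles of r and c exchanged.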
14
Setup: 39 nodes; two dual-core processors; 8 GB RAM; Linux RHEL4; 4 Gbps Ethernet; SATA, 65 MB/sec or roughly 500 Mbps. The total capacity of our HDFS cluster was just 2.4 terabytes. HDFS block size was set to 64 MB (default value). Java: Sun JDK version 1.6.0_03.
15
The pre-processing step on the ISS data. Default values: 39 nodes; 6 concurrent maps per node; 5 reduce tasks; 256 MB input split size.
17
Using relatively low-cost components, we achieve I/O rates that exceed those of high-performance storage systems. Performance scales almost linearly with the number of machines/disks.