Clustering Very Large Multi- dimensional Datasets with MapReduce 蔡跳
INTRODUCTION large dataset of moderate-to-high dimensional elements serial subspace clustering algorithms TB 、 PB e.g.,Twitter crawl: > 12TB Yahoo! operational data: 5PB 方法: combine a fast, scalable serial algorithm and makes it run efficiently in parallel
INTRODUCTION bottleneck: I/O, network Best of both Worlds -- BoW automatically spots the bottleneck and picks a good strategy serial clustering methods as a plugged-in clustering subroutine
RELATED WORK MapReduce-- 简化的分布式编程模式,用于大规模数据集 的并行运算 mapper, reducer map stage : input file and outputs(key, value)pairs shuffle stage : transfers the mappers'output to the reducers based on the key reduce stage: processes the received pairs and outputs thefinal result
BoW ParC :数据划分,合并结果 SnI :先抽样,牺牲 I/O 减少 network cost trade-off
ParC--Parallel Clustering 划分数据、分配数据到不同的机器 每台机器在分配到的数据中聚类,得到簇称为 β-clusters 合并 β-clusters 得到最终的类
SnI--Sample and Ignore 抽样,聚类得到 clusters 排除属于 clusters 空间内的数据 ParC
COST-BASED OPTIMIZATION ParC Cost : Map Cost : Shuffle Cost: Reduce Cost:
SnI Cost :
Bow compute ParC Cost->costC compute SnI Cost->costCs if costC > costCs then clusters = result of SnI else clusters = result of ParC
EXPERIMENTAL RESULTS 采用 Hadoop M45 : 1.5PB storage , 1TB memory , DISC/Cloud : 512 cores , 64 machines , 1TB RAM , 256TB disk storage ,
Quality of results 聚类的平均准确率、召回率 模拟数据
Scale-up results 增加 reducer
Scale-up results 增加数据, r=128 , m=700
Accuracy of our cost equations
感谢聆听 ! Thanks for your time