Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳

INTRODUCTION Goal: cluster a very large dataset of moderate-to-high dimensional elements. Serial subspace clustering algorithms are impractical at TB to PB scale, e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB. Approach: take a fast, scalable serial algorithm and make it run efficiently in parallel.

INTRODUCTION Bottlenecks: I/O and network. Best of both Worlds (BoW) automatically spots the bottleneck and picks a good strategy. It uses a serial clustering method as a plugged-in clustering subroutine.

RELATED WORK MapReduce: a simplified distributed programming model for parallel computation over large datasets, built from mappers and reducers. Map stage: reads the input file and outputs (key, value) pairs. Shuffle stage: transfers the mappers' output to the reducers based on the key. Reduce stage: processes the received pairs and outputs the final result.
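The three stages can be imitated in a single process to make the data flow concrete. The following is a minimal sketch, not Hadoop code: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for illustration, and word counting is used only because it is the smallest familiar example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce stages."""
    # Map stage: every input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]

    # Shuffle stage: route each value to the group that owns its key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce stage: each key's values are processed into one final result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    # Sum the counts received for one word.
    return sum(values)


if __name__ == "__main__":
    print(run_mapreduce(["a b a", "B c"], word_mapper, count_reducer))
```

In a real cluster the shuffle stage moves data across the network, which is why its cost matters in the slides that follow.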

BoW ParC: partitions the data and merges the results. SnI: samples first, sacrificing I/O to reduce network cost. The choice between them is a trade-off.

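ParC (Parallel Clustering): partition the data and assign the partitions to different machines; each machine clusters the data assigned to it, and the resulting clusters are called β-clusters; merge the β-clusters to obtain the final clusters.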
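The ParC steps can be sketched in a single process as follows. Everything here is an assumption made for illustration: the round-robin partitioning, the toy `gap_clusterer` that stands in for the plugged-in serial method, the representation of a β-cluster as an axis-aligned bounding box, and the rule that overlapping boxes are merged. The slides do not specify any of these details.

```python
def bounding_box(points):
    """Axis-aligned box (lower corner, upper corner) around the points."""
    dims = range(len(points[0]))
    lower = tuple(min(p[d] for p in points) for d in dims)
    upper = tuple(max(p[d] for p in points) for d in dims)
    return lower, upper


def gap_clusterer(points, gap=1.0):
    """Toy plug-in: sort by first coordinate, cut where neighbours are > gap apart."""
    if not points:
        return []
    ordered = sorted(points)
    groups = [[ordered[0]]]
    for p in ordered[1:]:
        if p[0] - groups[-1][-1][0] > gap:
            groups.append([p])
        else:
            groups[-1].append(p)
    return [bounding_box(g) for g in groups]


def boxes_overlap(a, b):
    """True if the two boxes intersect in every dimension."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(len(a[0])))


def merge_boxes(boxes):
    """Repeatedly fuse overlapping boxes until none overlap."""
    merged = []
    for box in boxes:
        current, changed = box, True
        while changed:
            changed = False
            for other in merged:
                if boxes_overlap(current, other):
                    merged.remove(other)
                    current = (tuple(map(min, current[0], other[0])),
                               tuple(map(max, current[1], other[1])))
                    changed = True
                    break  # restart the scan with the enlarged box
        merged.append(current)
    return merged


def parc(points, r, clusterer=gap_clusterer):
    """Partition into r parts, cluster each part, merge the beta-clusters."""
    partitions = [points[i::r] for i in range(r)]  # assumed round-robin split
    beta_clusters = [box for part in partitions for box in clusterer(part)]
    return merge_boxes(beta_clusters)
```

Calling `parc(points, 2)` on two well-separated groups of points yields one merged box per group, even though each partition saw only part of each group.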

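SnI (Sample and Ignore): draw a sample and cluster it to obtain clusters; exclude the data that fall inside the space of those clusters; run ParC on the remaining data.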
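A minimal sketch of the SnI idea on one-dimensional data, under stated assumptions: `interval_clusterer` is a hypothetical toy clusterer, clusters are represented as closed intervals, sampling is an independent coin flip per element, and the second phase calls the same toy clusterer as a stand-in for ParC. The point it illustrates is that only the elements in `remaining` would need to be shuffled in the second phase.

```python
import random


def interval_clusterer(values, gap=1.0):
    """Toy serial clusterer: 1-D intervals split at gaps wider than `gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    intervals = [[ordered[0], ordered[0]]]
    for v in ordered[1:]:
        if v - intervals[-1][1] > gap:
            intervals.append([v, v])
        else:
            intervals[-1][1] = v
    return [tuple(interval) for interval in intervals]


def sample_and_ignore(values, sample_rate, seed=0, clusterer=interval_clusterer):
    """Cluster a sample, drop covered elements, then cluster what is left."""
    rng = random.Random(seed)

    # Phase 1: sample the data and cluster only the sample.
    sample = [v for v in values if rng.random() < sample_rate]
    sample_clusters = clusterer(sample)

    # Ignore every element that already falls inside a sampled cluster.
    remaining = [v for v in values
                 if not any(lo <= v <= hi for lo, hi in sample_clusters)]

    # Phase 2: cluster the remaining elements (stand-in for ParC).
    remaining_clusters = clusterer(remaining)
    return sample_clusters, remaining, remaining_clusters
```

The input is read twice (once to sample, once to filter), which is the extra I/O traded for a smaller shuffle.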

COST-BASED OPTIMIZATION ParC cost, broken down into map cost, shuffle cost, and reduce cost.

SnI cost:

BoW: compute the ParC cost -> costC; compute the SnI cost -> costCs; if costC > costCs then clusters = result of SnI, else clusters = result of ParC.
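The selection rule above can be written as a small function. This is a sketch only: the cost equations themselves are not reproduced in this transcript, so the two cost estimators and the two strategy runners are passed in as hypothetical callables (`cost_parc`, `cost_sni`, `run_parc`, `run_sni`) rather than implemented.

```python
def bow(cost_parc, cost_sni, run_parc, run_sni):
    """Run whichever strategy has the lower estimated cost.

    All four arguments are caller-supplied callables (assumed interface):
    the two cost estimators return a number, the two runners return clusters.
    """
    cost_c = cost_parc()   # costC in the slide
    cost_cs = cost_sni()   # costCs in the slide
    if cost_c > cost_cs:
        return "SnI", run_sni()
    # Ties go to ParC, matching the "else" branch of the slide.
    return "ParC", run_parc()


if __name__ == "__main__":
    # Made-up cost values, used only to exercise both branches.
    print(bow(lambda: 10.0, lambda: 4.0, lambda: ["parc"], lambda: ["sni"]))
    print(bow(lambda: 3.0, lambda: 4.0, lambda: ["parc"], lambda: ["sni"]))
```

Only the chosen strategy is executed, so the cost estimates must be cheap relative to the clustering itself.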

EXPERIMENTAL RESULTS Run on Hadoop. M45: 1.5 PB storage, 1 TB memory. DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage.

Quality of results Average clustering precision and recall on synthetic data.

Scale-up results Increasing the number of reducers.

Scale-up results Increasing the data size, with r = 128 and m = 700.

Accuracy of our cost equations

Thank you for listening!
