DisCo: Distributed Co-clustering with Map-Reduce
S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center
Speakers: 吳宏君, 陳威遠, 洪浩哲
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Introduction: Background, Goal, Map-Reduce
Background
Huge datasets are becoming prevalent
– Real-world applications produce huge volumes of messy data (terabytes or more)
– Pre-processing the raw data is important
Map-Reduce tools
– A simple but powerful execution engine
– Unconcerned about data models and storage schemes
Goal
Focus on co-clustering (bi-clustering) of pairwise relationships from raw data
– Co-clustering searches for groups of rows and columns that are inter-related
Propose a comprehensive Distributed Co-clustering (DisCo) solution, from raw data to the end clusters
– Involves data gathering, pre-processing, analysis, and presentation
– Adopts Map-Reduce (Hadoop) both as the programming model and as the implementation testbed
Map-Reduce
Distributed, scalable, fault-tolerant data storage, management, and processing tools
– A distributed execution engine for select-project via sequential scan, followed by hashed partitioning and sort-merge group-by
– Suited for data already stored on a distributed file system
– Map-Reduce can transparently use any number of machines
Map-Reduce [figure]
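To make the execution model concrete, here is a minimal word-count-style sketch against the Hadoop Mapper/Reducer API (an illustration, not code from the paper): the map phase performs the sequential scan and projection, and the framework supplies the hashed partitioning and sort-merge group-by before the reduce phase.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: sequential scan over input splits; emits (word, 1) pairs.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);  // framework hash-partitions by key
    }
  }
}

// Reduce: receives each key with all its values grouped together
// (the framework's sort-merge group-by).
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum));
  }
}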
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Distributed Mining Process
Distributed Mining Process
Data pre-processing
– Building the graph from raw data: extract (SrcIP, DstIP) pairs and build the adjacency matrix
– Pre-computing the transpose
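A minimal sketch of this extraction step, assuming whitespace-separated raw records in which the SrcIP and DstIP field positions (0 and 1 below) are hypothetical; the reducer collects each source's destinations into one adjacency-list row. This illustrates the idea, not the paper's actual code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse one raw log record, project out (SrcIP, DstIP).
public class EdgeExtractMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length < 2) return;  // skip malformed records
    ctx.write(new Text(fields[0]), new Text(fields[1]));  // (SrcIP, DstIP)
  }
}

// Reduce: group by SrcIP, emit one adjacency-list row of the matrix.
class AdjacencyListReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text src, Iterable<Text> dsts, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder row = new StringBuilder();
    for (Text dst : dsts) {
      if (row.length() > 0) row.append(',');
      row.append(dst.toString());
    }
    ctx.write(src, new Text(row.toString()));  // "src \t d1,d2,..."
  }
}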
Distributed Mining Process
Data pre-processing
– Building the graph from raw data
– Pre-computing the transpose: during co-clustering optimization we need to iterate over both rows and columns, so we pre-compute the adjacency lists for both the original graph and its transpose
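Under the same assumptions, pre-computing the transpose is a one-line change to the hypothetical mapper above: emit each edge reversed, and reuse the same AdjacencyListReducer to obtain the adjacency lists of the transposed graph.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: same parsing as the hypothetical EdgeExtractMapper, but the edge
// is emitted reversed as (DstIP, SrcIP); paired with AdjacencyListReducer
// this produces the adjacency lists of the transposed graph.
public class TransposeMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length < 2) return;  // skip malformed records
    ctx.write(new Text(fields[1]), new Text(fields[0]));  // reversed edge
  }
}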
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Co-clustering
Definitions and overview
– Co-clustering allows simultaneous clustering of the rows and columns of a matrix
– Input format: a matrix with m rows and n columns
– The co-clustering algorithm employs a checkerboard partition: the original adjacency matrix is divided into a grid of sub-matrices
Co-clustering [figure]
Co-clustering
Goal
– Find row and column group assignment vectors such that the error function is minimized
Co-clustering [figure]
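The formula on the original slide did not survive extraction. One common checkerboard objective, assuming a sum-squared-error measure (the co-clustering framework admits other error measures as well), is:

\min_{r,\,c} \sum_{i=1}^{m} \sum_{j=1}^{n}
    \bigl( A_{ij} - \hat{A}_{r(i),\,c(j)} \bigr)^2,
\qquad
\hat{A}_{pq} =
    \frac{\sum_{i:\,r(i)=p} \sum_{j:\,c(j)=q} A_{ij}}
         {\lvert \{ i : r(i)=p \} \rvert \, \lvert \{ j : c(j)=q \} \rvert}

where r and c are the row and column group assignment vectors and \hat{A}_{pq} is the mean of block (p, q).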
Co-clustering
Example: Co-clustering(A, k, l) with k = 2, l = 2
A = (reconstructed from the adjacency lists on the following slides)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Co-clustering
Initial assignments: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2; r(1)=1, r(2)=1, r(3)=1, r(4)=2
Co-clustering
Row iteration: r(2) is reassigned from 1 to 2; now r(1)=1, r(2)=2, r(3)=1, r(4)=2; columns unchanged: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2
Co-clustering
Column iteration: c(2) is reassigned from 1 to 2; now c(1)=1, c(2)=2, c(3)=1, c(4)=2, c(5)=2
Co-clustering with Map-Reduce
One iteration over rows runs as a single Map-Reduce job (see the mapper sketch after the walkthrough below)
Co-clustering with Map-Reduce [figure]
Co-clustering with Map-Reduce
Map input (adjacency lists):
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
r, c, G: random initialization based on parameters k, l
Co-clustering with Map-Reduce
Map input (adjacency lists):
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
k = 2, l = 2; r = {1,1,1,2}; c = {1,1,1,2,2}
Co-clustering with Map-Reduce
Fix columns; row iteration
Record 1 -> 2,4,5 becomes (key, value) = (1, {2,4,5})
Co-clustering with Map-Reduce
Record 2 -> 1,3 becomes (key, value) = (2, {1,3})
Co-clustering with Map-Reduce [figure]
Co-clustering with Map-Reduce
p = 1: the intermediate (key, value) pairs are keyed by the assigned row group
Co-clustering with Map-Reduce
Grouped intermediate pairs: (1, {(2,4),(1,3)}) and (2, {(4,0),(2,4)}), i.e., row group 1 has per-column-group counts (2,4) and member rows {1,3}; row group 2 has counts (4,0) and member rows {2,4}
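A sketch of the row-iteration map step just traced. Assumptions not in the slides: the column assignment vector and group parameters arrive via hypothetical job-configuration keys (disco.*), and blockCost is a placeholder for the paper's error measure. The mapper builds each row's per-column-group counts (e.g., row 1's {2,4,5} becomes (1,2) under c = {1,1,1,2,2}), picks the cheapest row group, and emits the row and its counts keyed by that group; a reducer can then sum counts and collect member rows, yielding pairs like (1, {(2,4),(1,3)}) above.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One row iteration with columns fixed: each row is (re)assigned to the
// row group minimizing a cost over its per-column-group counts.
public class RowIterationMapper extends Mapper<Text, Text, Text, Text> {
  private int[] c;   // column group assignments (hypothetically broadcast via config)
  private int k, l;  // number of row / column groups

  @Override
  protected void setup(Context ctx) {
    // Hypothetical configuration keys; defaults match the running example.
    String[] parts = ctx.getConfiguration()
        .get("disco.col.assign", "1,1,1,2,2").split(",");
    c = new int[parts.length];
    for (int j = 0; j < parts.length; j++) c[j] = Integer.parseInt(parts[j]);
    k = ctx.getConfiguration().getInt("disco.k", 2);
    l = ctx.getConfiguration().getInt("disco.l", 2);
  }

  @Override
  protected void map(Text rowId, Text adjList, Context ctx)
      throws IOException, InterruptedException {
    // Per-column-group nonzero counts, e.g. row 1's {2,4,5} -> (1,2).
    long[] counts = new long[l];
    for (String col : adjList.toString().split(","))
      counts[c[Integer.parseInt(col.trim()) - 1] - 1]++;  // 1-based ids

    // Choose the row group with minimum cost.
    int best = 1;
    double bestCost = Double.MAX_VALUE;
    for (int p = 1; p <= k; p++) {
      double cost = blockCost(p, counts);
      if (cost < bestCost) { bestCost = cost; best = p; }
    }
    // Emit (row group, row id + counts); a reducer sums counts per group
    // and collects the member rows.
    ctx.write(new Text(Integer.toString(best)),
              new Text(rowId + ":" + Arrays.toString(counts)));
  }

  // Placeholder for the paper's error measure: would score putting a row
  // with these counts into group p, given the global group statistics G.
  private double blockCost(int p, long[] counts) {
    return 0.0;  // stand-in only
  }
}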
Co-clustering [figure]
Co-clustering
Performance tuning
– The relevant parameters have to do with thread pool sizes
– Parameters: number of map tasks, number of reduce tasks, input split size (detailed in the Experiments section)
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments (Setup; Scalability and performance), Related Work, Conclusions, Discussion
Experiments
Setup
– 39 nodes in the cluster, located in 4 blade servers
– Hadoop Distributed File System (HDFS) capacity: 2.4TB
– Sun JDK 1.6.0_03
– Per-node configuration: CPU: 2 × Intel Xeon 2.66GHz (two dual-core); Memory: 8GB; OS: Red Hat Enterprise Linux
– Datasets: ISS network data and TREC text data (see the following slides)
Experiments (cont'd)
Scalability and performance
Performance: the effect of three parameters (see the configuration sketch below):
1) maximum number of concurrent map tasks per node
2) number of reduce tasks
3) minimum input split size
Scalability: wall-clock time vs. number of nodes
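For reference, a sketch of where these three knobs live in the classic (0.x-era) Hadoop configuration. The values mirror the optimal settings reported on the next slide, and the per-node map-task maximum is really a TaskTracker setting normally configured cluster-wide in hadoop-site.xml rather than per job.

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TuningExample.class);

    // 1) Max concurrent map tasks per node (TaskTracker-side setting).
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 6);

    // 2) Number of reduce tasks for the job.
    conf.setNumReduceTasks(5);

    // 3) Minimum input split size in bytes (here 256MB).
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
  }
}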
Experiments (cont'd)
Preprocessing ISS data
Optimal Map-Reduce parameter values (Figure 8):
– 6 concurrent map tasks per node
– 5 reduce tasks
– 256MB input split size
Experiments (cont'd)
Co-clustering TREC data
As job size decreases, framework overheads increase
Two observations:
1) 20±2 sec/iteration, better than a single machine with 48GB RAM
2) As the dataset size increases, the implementation achieves linear scale-up
Experiments (cont'd)
Behavior of the co-clustering iterations with respect to the number of concurrent maps, the number of reduce tasks, and the input split size is almost identical to Figure 8
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work (Map-Reduce framework; Co-clustering), Conclusions, Discussion
Related Work
Map-Reduce framework
– Simple but powerful
– Uses a distributed file system (GFS, HDFS, …)
– Block-addressable storage & centralized metadata server
Related Work (cont'd)
Co-clustering methods differ in:
– Cluster shapes (e.g., checkerboard partitions)
– Properties of the input data
– Optimization objective
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Conclusions
Designed a holistic approach to data mining
– Distributed infrastructure: Map-Reduce
– Co-clustering: the Distributed Co-clustering (DisCo) framework
Built from relatively low-cost components
Performance scales almost linearly as machines/disks are added
Results demonstrated on real-world datasets
Outline: Introduction, Distributed Mining Process, Co-clustering, Experiments, Related Work, Conclusions, Discussion
Discussion
– In a distributed file system, how should the system deal with tasks that fail?
– As hardware improves, will performance keep increasing linearly? The paper lacks experimental evidence on this point.
Discussion
– Increasing the input split size to several multiples of the HDFS block size makes it harder to place map tasks on local copies of the data. Why?
Q & A
Thanks for your attention!