DisCo: Distributed Co-clustering with Map-Reduce
Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku

- Introduction
- Related Work
- Distributed Mining Process
- Co-clustering Huge Datasets
- Experiments
- Conclusions

- Problems
  - Huge datasets
  - Data from natural sources comes in an impure form
- Proposed method
  - A comprehensive Distributed Co-clustering (DisCo) solution, using Hadoop
  - DisCo is a scalable framework under which various co-clustering algorithms can be implemented

- Map-Reduce framework
  - Employs a distributed storage cluster
    - Block-addressable storage
    - A centralized metadata server
  - Provides convenient data access and a storage API for Map-Reduce tasks

- Co-clustering
  - Cluster shapes
    - Checkerboard partitions
    - Single bi-cluster
    - Exclusive row and column partitions
    - Overlapping partitions
  - Optimization criteria, e.g. code length

- Identify the source and obtain the data
- Transform the raw data into the appropriate format for data analysis
- Present the results visually, or turn them into input for other applications

- Data pre-processing
  - Extracting source/destination IP pairs from a 350 GB raw network event log takes over 5 hours
  - Much better performance is achieved on a few commodity nodes running Hadoop
  - Setting up Hadoop required minimal effort

- Specifically for co-clustering, there are two main pre-processing tasks:
  - Building the graph from raw data
  - Pre-computing the transpose
- During co-clustering optimization, we need to iterate over both rows and columns
  - So we pre-compute the adjacency lists for both the original graph and its transpose
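The two pre-processing tasks above can be combined in a single pass: each edge contributes to the adjacency list of its source (original graph) and of its destination (transpose). A minimal Python sketch of that idea, using an in-memory stand-in for the MapReduce shuffle (the input format and function names are illustrative, not the paper's actual code):

```python
from collections import defaultdict

def map_edges(line):
    """Map: parse one raw log line 'src dst' into pairs for both the
    graph and its transpose (hypothetical input format)."""
    src, dst = line.split()
    # Emit under two key spaces so one job yields both adjacency lists.
    yield ("row", src), dst
    yield ("col", dst), src

def reduce_adjacency(key, values):
    """Reduce: collect all neighbors of a node into its adjacency list."""
    return key, sorted(values)

def run_job(lines):
    """Simulate the shuffle: group map outputs by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for k, v in map_edges(line):
            groups[k].append(v)
    return dict(reduce_adjacency(k, vs) for k, vs in groups.items())

logs = ["a x", "a y", "b x"]
adj = run_job(logs)
# adj[("row", "a")] -> ["x", "y"]   (out-neighbors of a)
# adj[("col", "x")] -> ["a", "b"]   (in-neighbors of x, i.e. the transpose)
```

In Hadoop the grouping is done by the framework's shuffle phase; the tagged keys here mimic emitting both graph directions from one map pass.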

- Definitions and overview
  - Matrices are denoted by boldface capital letters; vectors by boldface lowercase letters
  - a_ij: the (i, j)-th element of matrix A
  - Co-clustering algorithms employ a checkerboard partition: the original adjacency matrix is divided into a grid of sub-matrices
  - For an m x n matrix, a co-clustering is a pair of row and column labeling vectors
  - r(i): the row group label of the i-th row
  - G: the k x ℓ group matrix

- g_pq gives the sufficient statistics for the (p, q) sub-matrix
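As a concrete illustration, for a binary adjacency matrix the statistic g_pq can simply be the number of nonzeros falling in row group p and column group q. A small sketch (plain Python lists; the function name is illustrative):

```python
def group_matrix(A, r, c, k, l):
    """Accumulate the sufficient statistic g_pq: here, the sum of the
    entries of A whose row is in row group p and whose column is in
    column group q (for a binary matrix, the sub-matrix edge count)."""
    G = [[0] * l for _ in range(k)]
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if a_ij:
                G[r[i]][c[j]] += a_ij
    return G

A = [[1, 0, 1],
     [0, 1, 0]]
r = [0, 1]        # row i belongs to row group r[i]
c = [0, 0, 1]     # column j belongs to column group c[j]
# group_matrix(A, r, c, 2, 2) -> [[1, 1], [1, 0]]
```

Other criteria (e.g. code length) are computed from exactly these per-block counts, which is why G suffices for the optimization.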

Map function
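The original slide showed the map function as a figure. A minimal Python sketch of the idea: each map call processes one row's adjacency list and, holding the column labels fixed, reassigns the row to its best row group (the names and the `score` criterion below are hypothetical stand-ins for the paper's cost function, e.g. code length):

```python
def map_row(i, adj, c, k, l, score):
    """Map step for one row i (sketch). Given the current column labels
    c, build the row's counts per column group, then assign the row to
    the row group p minimizing score(p, counts)."""
    counts = [0] * l
    for j in adj:               # adjacency list of row i
        counts[c[j]] += 1
    p = min(range(k), key=lambda g: score(g, counts))
    # Emit: key = chosen row group, value = (row id, per-group counts)
    return p, (i, counts)

# Hypothetical criterion: L1 distance to a fixed per-group profile.
profiles = [[2, 0], [0, 2]]
score = lambda g, counts: sum(abs(a - b) for a, b in zip(profiles[g], counts))
c = [0, 0, 1, 1]                # 4 columns in 2 column groups
# map_row(0, [0, 1], c, 2, 2, score) -> (0, (0, [2, 0]))
```

Keying the output by the chosen group lets the reducer rebuild each row of G without another pass over the data.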

Reduce function
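The reduce side, also shown as a figure in the original slide, can be sketched as follows: for each row group, merge the per-row counts of all rows assigned to it into one row of the group statistics matrix, and record the membership (a sketch with illustrative names, not the paper's Java implementation):

```python
def reduce_group(p, values):
    """Reduce step (sketch): for row group p, sum the per-row counts
    of all member rows into one row of G, and collect the members."""
    l = len(values[0][1])
    g_row = [0] * l
    members = []
    for i, counts in values:
        members.append(i)
        g_row = [a + b for a, b in zip(g_row, counts)]
    return p, (sorted(members), g_row)

# Two rows assigned to group 0, with per-column-group counts:
# reduce_group(0, [(2, [1, 0]), (5, [0, 3])]) -> (0, ([2, 5], [1, 3]))
```

Because addition is associative, the same function can serve as a combiner to shrink shuffle traffic.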

Global sync
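A sketch of the global synchronization step: after each MapReduce pass the driver gathers the new labels and group statistics, then alternates row and column passes until the cost stops improving. The helper names and convergence test below are assumptions for illustration; in DisCo the passes would be Hadoop job launches:

```python
def global_sync(run_rows, run_cols, cost, max_iters=20, tol=1e-9):
    """Driver loop (sketch): alternate a row pass and a column pass
    (over the transpose), checking the global cost after each round.
    run_rows/run_cols stand in for launching the MapReduce jobs."""
    prev = float("inf")
    for it in range(max_iters):
        run_rows()              # one MapReduce pass relabeling rows
        run_cols()              # one pass relabeling columns
        cur = cost()            # global criterion, e.g. code length
        if prev - cur < tol:    # no improvement: labels have stabilized
            break
        prev = cur
    return it
```

The driver is the only sequential point; everything inside a pass remains data-parallel.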

- Setup
  - 39 nodes, each with two dual-core processors and 8 GB RAM
  - Linux RHEL4
  - 4 Gbps Ethernet
  - SATA disks: 65 MB/sec, or roughly 500 Mbps
  - Total HDFS cluster capacity: just 2.4 terabytes
  - HDFS block size: 64 MB (the default value)
  - Java: Sun JDK version 1.6.0_03

- The pre-processing step on the ISS data
- Default values:
  - 39 nodes
  - 6 concurrent maps per node
  - 5 reduce tasks
  - 256 MB input split size

- Using relatively low-cost components, DisCo achieves I/O rates that exceed those of high-performance storage systems
- Performance scales almost linearly with the number of machines/disks