HAMA: An Efficient Matrix Computation with the MapReduce Framework Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, Seungryoul Maeng.

Presentation transcript:

HAMA: An Efficient Matrix Computation with the MapReduce Framework Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, Seungryoul Maeng IEEE CloudCom 2010 Dec 3, 2014 Kyung-Bin Lim

2 / 35 Outline • Introduction • Methodology • Experiments • Conclusion

3 / 35 Apache HAMA • Easy-to-use tool for data-intensive scientific computation • Massive matrix/graph computations are often the primary workloads of such applications • The fundamental design has since changed from MapReduce-based matrix computation to BSP-based graph processing • Mimics Pregel, running on HDFS – Uses ZooKeeper as a synchronization barrier
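The BSP model mentioned here can be illustrated with a minimal, framework-agnostic sketch in plain Python; this is not the HAMA BSP API, and the names (Peer, compute, inbox, NUM_PEERS) are illustrative placeholders. Each superstep performs local computation, exchanges messages, and then waits at a global barrier, which HAMA coordinates through ZooKeeper.

```python
# Minimal, framework-agnostic sketch of the BSP execution model
# (illustrative only; not the actual HAMA API).

class Peer:
    def __init__(self, pid):
        self.pid, self.inbox = pid, []

    def compute(self, step):
        # Local computation; returns messages as (destination_pid, payload).
        return [((self.pid + 1) % NUM_PEERS, f"step{step}-from-{self.pid}")]

NUM_PEERS = 4
peers = [Peer(i) for i in range(NUM_PEERS)]

for step in range(3):                              # three supersteps
    outgoing = [m for p in peers for m in p.compute(step)]
    # --- global synchronization barrier ---
    # HAMA coordinates this barrier through ZooKeeper; here it is implicit
    # because the loop body runs sequentially.
    for p in peers:
        p.inbox = [msg for dst, msg in outgoing if dst == p.pid]
```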

4 / 35 Our Focus • This paper describes the earlier version 0.1 of HAMA – Latest version: 0.7.0, released in March • Focuses only on matrix computation with MapReduce • Presents simple case studies

5 / 35 The HAMA Architecture • We propose a distributed scientific framework called HAMA (based on HPMR) – Provides transparent matrix/graph primitives

6 / 35 The HAMA Architecture • HAMA API: easy-to-use interface • HAMA Core: provides matrix/graph primitives • HAMA Shell: interactive user console

7 / 35 Contributions of HAMA • Compatibility – Takes advantage of all Hadoop features • Scalability – Scalable due to compatibility • Flexibility – Multiple compute engines are configurable • Applicability – HAMA's primitives can be applied to various applications

8 / 35 Outline • Introduction • Methodology • Experiments • Conclusion

9 / 35 Case Study • Using a case-study approach, we introduce two basic primitives implemented with the MapReduce model on HAMA – Matrix multiplication and finding a linear solution • We compare them with MPI versions of these primitives

10 / 35 Case Study • Representing matrices – By default, HAMA uses HBase (a NoSQL database) • HBase is modeled after Google's Bigtable • Column-oriented, semi-structured distributed database with high scalability
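As a rough illustration of how a matrix could be laid out in such a column-oriented store, here is a minimal sketch using plain Python dictionaries: each matrix row becomes one table row keyed by its row index, and each non-zero entry becomes a cell whose column qualifier is the column index. The "column:<j>" qualifier naming is an assumption for illustration, not necessarily HAMA's exact HBase schema.

```python
# Hedged sketch: a matrix laid out as an HBase-like table.
# Row key = row index, column qualifier = column index, cell value = entry.

def matrix_to_table(matrix):
    table = {}
    for i, row in enumerate(matrix):
        cells = {f"column:{j}": v for j, v in enumerate(row) if v != 0}
        table[str(i)] = cells              # one table row per matrix row
    return table

A = [[1, 0, 2],
     [0, 3, 0]]
print(matrix_to_table(A))
# {'0': {'column:0': 1, 'column:2': 2}, '1': {'column:1': 3}}
```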

11 / 35 Case Study – Multiplication: Iterative Way • Iterative approach (algorithm)

12 / 35 Case Study – Multiplication: Iterative Way • Simple, naïve strategy • Works well with sparse matrices • Sparse matrix: most entries are 0
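To make the idea concrete, here is a minimal sketch of a standard MapReduce-style formulation of sparse matrix multiplication, simulated in plain Python: a first phase joins A and B on their shared index j, and a second phase sums the partial products per output cell. This illustrates the general approach and is not necessarily the exact iterative scheme shown on the following slides.

```python
# Hedged sketch: two-phase MapReduce-style multiplication C = A x B
# for sparse matrices, simulated sequentially in plain Python.
from collections import defaultdict

A = {(0, 0): 1, (0, 2): 2, (1, 1): 3}      # sparse A as {(i, j): value}
B = {(0, 0): 4, (1, 0): 5, (2, 1): 6}      # sparse B as {(j, k): value}

# Phase 1 ("map" + "shuffle"): group A and B entries by the shared index j.
joined = defaultdict(lambda: ([], []))
for (i, j), v in A.items():
    joined[j][0].append((i, v))
for (j, k), v in B.items():
    joined[j][1].append((k, v))

# Phase 2 ("reduce"): emit partial products keyed by (i, k) and sum them.
C = defaultdict(float)
for a_entries, b_entries in joined.values():
    for i, a in a_entries:
        for k, b in b_entries:
            C[(i, k)] += a * b

print(dict(C))   # {(0, 0): 4.0, (0, 1): 12.0, (1, 0): 15.0}
```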

13-18 / 35 Multiplication: Iterative Way (step-by-step illustration of the iterative algorithm; the figures are not included in this transcript)

19 / 35 Case Study – Multiplication: Block Way • Multiplication can be done using sub-matrices (blocks) • Works well with dense matrices

20 / 35 Case Study – Multiplication: Block Way • Block approach – Minimizes data movement (network cost)

21 / 35 Case Study – Multiplication: Block Way • Block approach (algorithm)
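A minimal sketch of the block idea, assuming square blocks of size bs and using NumPy purely as a stand-in for the per-block multiplications that the framework would distribute as parallel tasks:

```python
# Hedged sketch of blocked matrix multiplication in plain Python/NumPy.
# Each block-pair product is accumulated into the corresponding output block;
# in HAMA/MapReduce these per-block multiplications would be parallel tasks.
import numpy as np

def block_multiply(A, B, bs):
    n, m = A.shape[0], B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for k in range(0, m, bs):
            for j in range(0, A.shape[1], bs):
                # one "task": multiply a block of A with a block of B
                C[i:i+bs, k:k+bs] += A[i:i+bs, j:j+bs] @ B[j:j+bs, k:k+bs]
    return C

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(block_multiply(A, B, 2), A @ B)
```

Moving whole blocks (rather than individual entries) between tasks is what keeps the network cost low for dense matrices.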

22 / 35 Case Study – Finding Linear Solution • Ax = b – x = ? • A: known square, symmetric, positive-definite matrix • b: known vector • Use the Conjugate Gradient approach

23 / 35 Case Study – Finding Linear Solution • Finding a linear solution – Cramer's rule – Conjugate Gradient method

24 / 35 Case Study – Finding Linear Solution • Cramer's rule
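For reference, Cramer's rule expresses each component of the solution in closed form as x_i = det(A_i) / det(A), where A_i is A with its i-th column replaced by b. Computing n + 1 determinants of an n x n matrix is prohibitively expensive for large systems, which is why the iterative Conjugate Gradient method is used instead.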

25 / 35 Case Study – Finding Linear Solution • Conjugate Gradient method – Find a direction (conjugate direction) – Find a step size (line search)

26 / 35 Case Study – Finding Linear Solution • Conjugate Gradient method (algorithm)
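The textbook Conjugate Gradient iteration can be sketched in a few lines of Python; NumPy stands in here for the distributed matrix-vector product and dot products that HAMA would run as MapReduce jobs, so this is a generic sketch rather than the paper's exact algorithm.

```python
# Hedged sketch of the textbook Conjugate Gradient iteration.
# In HAMA, A @ p and the dot products would be distributed MapReduce jobs.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                          # residual
    p = r.copy()                           # search (conjugate) direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)          # step size from the line search
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p      # next conjugate direction
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])     # symmetric positive-definite
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))            # approx [0.0909, 0.6364]
```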

27 / 35 Outline • Introduction • Methodology • Experiments • Conclusion

28 / 35 Evaluations • TUSCI (TU Berlin SCI) cluster – 16 nodes, each with two Intel P4 Xeon processors and 1 GB of memory – Connected by an SCI (Scalable Coherent Interface) network in a 2D torus topology – Running OpenCCS (an environment similar to HOD) • Test sets

29 / 35 HPMR's Enhancements • Prefetching – Increases data locality • Pre-shuffling – Reduces the amount of intermediate output to shuffle

30 / 35 Evaluations • Comparison of average execution time and scale-up for matrix multiplication

31 / 35 Evaluations • Comparison of average execution time and scale-up for CG

32 / 35 Evaluations • Comparison of average execution time for CG when a single node is overloaded

33 / 35 Outline • Introduction • Methodology • Experiments • Conclusion

34 / 35 Conclusion • HAMA provides an easy-to-use tool for data-intensive computations – Matrix computation with MapReduce – Graph computation with BSP