Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010

Introduction  Twister MapReduce Framework  In our previous paper, Twister: A Runtime for Iterative MapReduce (J. Ekanayake et al., 2010), we showed that many scientific problems, such as K-means clustering and PageRank, can be solved efficiently by Twister.  What’s next…

The Focus of This Paper  The demand for new parallel algorithms and distributed computing technologies that can efficiently solve data-intensive and compute-intensive scientific problems keeps growing.  Here we focus on extending the applicability of Twister to a wider range of scientific problems.

Outline  Twister MapReduce Framework  Iterative workflow  Architecture  Twister applications  Twister non-iterative applications  Twister iterative applications

Twister Iterative Workflow  Configure once and run many times  Static data  Fault tolerance

Twister Architecture  The client  Publish/subscribe message broker  Daemons  Twister File Tool

Twister Non-iterative Applications  “Map and then Reduce”  Word Count  Smith-Waterman-Gotoh (SW-G) distance  “Map only”  Binary-invoking method: many modern stand-alone scientific programs are complex and updated frequently with new features. Rewriting such a stand-alone program as a parallel application may require too much effort to keep pace with those features, so each map task invokes the existing binary directly (see the sketch below).  Twister BLAST  Twister MDS Interpolation  Twister GTM Interpolation
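To illustrate the binary-invoking pattern, here is a minimal sketch of a map task that shells out to a stand-alone executable via Java's standard ProcessBuilder. The class and method names are hypothetical, not Twister's actual API, and the BLAST command-line flags are shown only as an example invocation.

import java.io.File;
import java.io.IOException;

// Hypothetical map task illustrating the binary-invoking pattern:
// each map task runs the unmodified stand-alone program on its
// assigned input partition instead of reimplementing the algorithm.
public class BinaryInvokingMapTask {

    // Runs e.g. "blastx -query part_42.fa -db nr -out part_42.out"
    // on the local node and waits for it to finish.
    public void map(String binaryPath, String inputPartition, String outputPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                binaryPath, "-query", inputPartition, "-db", "nr",
                "-out", outputPath);
        pb.redirectErrorStream(true);   // merge stderr into stdout
        pb.directory(new File("."));    // run in the task's working directory
        Process p = pb.start();
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("BLAST exited with code " + exitCode);
        }
    }
}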

Twister BLAST  BLAST software  Other parallel BLAST applications  Twister BLAST

BLAST Software  BLAST+ vs. BLAST  BLAST+ is recommended by NCBI  Compute-intensive  100% CPU usage  Memory usage depends on the size of the input query and the number of hits  Multi-thread mode vs. multi-process mode  Performance occasionally drops  Uses less memory

Parallel BLAST Applications  MPI BLAST  Rewrites the original BLAST, linking against the NCBI library and an MPI library  Query partitioning and database segmentation  Uses less memory  No fault-tolerance solution  Cloud BLAST  Based on Hadoop  Binary invoking  Query partitioning only; the database is encapsulated into images  Azure BLAST  Built directly on the Azure platform  Query partitioning only

Twister BLAST  A parallel BLAST application supported by Twister  Uses state-of-the-art binary-invoking parallelism  Utilizes the stand-alone BLAST software, since it is highly optimized  Brings scalability and simplicity to program and database maintenance

Database Management  On a normal cluster  The database is replicated to all nodes in order to support BLAST binary execution  Compressed before replication  Transported via the Twister File Tool  In the cloud  Encapsulated into the image

Twister-BLAST Architecture

Twister MDS Interpolation

Parallelization  The implementation is vector-based  The raw dataset is read instead of the Euclidean distance matrix, to save disk space  The sample size is 100k points, pre-processed by the original MDS program  “Map only”  Each map task processes its data with methods that encapsulate the MDS interpolation algorithm  The placement of each out-of-sample data point depends only on the sampled data (see the formulation below)
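For context, the out-of-sample placement can be written as a small optimization over each new point alone; this is the standard majorizing-interpolation formulation from the MDS literature, not a formula quoted from the slides. Given the already-mapped sample points $p_1,\dots,p_n$ in the target dimension $d$ and the original proximities $\delta_{x,i}$ between a new point $x$ and the sample, the new point's position is

\[
\hat{x} \;=\; \arg\min_{x \in \mathbb{R}^{d}} \; \sum_{i=1}^{n} \bigl( \lVert x - p_i \rVert - \delta_{x,i} \bigr)^{2} .
\]

Because each new point is solved independently against the fixed sample, the tasks are embarrassingly parallel, which is why a "Map only" job suffices.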

Twister GTM Interpolation  GTM (Generative Topographic Mapping)  GTM Interpolation first performs normal GTM on a sample  The normal GTM captures the features of the whole dataset  The out-of-sample data points are then processed based on the sampled points’ result (see the projection formula below)  Since the out-of-sample data are not involved in the compute-intensive learning process, GTM Interpolation can be very fast
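The per-point projection step applied to each out-of-sample point is the standard GTM posterior-mean projection; the formulas below follow the GTM literature rather than the slides. With trained mixture centers $y_k$ (one per latent grid point $z_k$) and inverse variance $\beta$ from the sample-trained model, each new point $x$ gets responsibilities and a latent-space position

\[
r_k(x) \;=\; \frac{\exp\!\bigl(-\tfrac{\beta}{2}\lVert x - y_k \rVert^{2}\bigr)}
                  {\sum_{k'} \exp\!\bigl(-\tfrac{\beta}{2}\lVert x - y_{k'} \rVert^{2}\bigr)},
\qquad
\hat{z}(x) \;=\; \sum_{k} r_k(x)\, z_k .
\]

Since this uses only the fixed trained model, no iterative learning is needed for the out-of-sample points.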

Parallelization  “Map only”  The out-of-sample data file is split into partitions, separate from the sampled raw data  A mapper is created for each partition to process that chunk  Twister invokes GTM Interpolation on each chunk to process the data

Twister Iterative Application  Twister MDS  MDS (Multidimensional Scaling)  The pairwise Euclidean distance of each pair in the target dimension is fitted to the corresponding original proximity value through a non-linear optimization algorithm  Stress function  The objective function to be minimized (defined below)  SMACOF (Scaling by MAjorizing a COmplicated Function)  Iterative majorization to minimize the stress function
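For reference, the stress function and the SMACOF update are standard in the MDS literature (the slide names but does not spell them out). With original proximities $\delta_{ij}$, target-dimension distances $d_{ij}(X) = \lVert x_i - x_j \rVert$, and weights $w_{ij}$, the objective is

\[
\sigma(X) \;=\; \sum_{i<j\le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2},
\]

and in the unweighted case ($w_{ij} = 1$) each SMACOF iteration applies the Guttman transform

\[
X^{(k+1)} \;=\; \frac{1}{N}\, B\!\bigl(X^{(k)}\bigr)\, X^{(k)},
\qquad
b_{ij} = -\,\frac{\delta_{ij}}{d_{ij}(X^{(k)})} \;\; (i \ne j,\ d_{ij} \ne 0),
\qquad
b_{ii} = -\sum_{j \ne i} b_{ij},
\]

which monotonically decreases the stress from iteration to iteration.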

Parallelization through Twister  Data distribution  The original distance matrix is partitioned into chunks by rows and distributed to all compute nodes  Configure once and run many times  Data chunks are assigned to Map tasks in a one-to-one mapping, held in memory, and reused throughout the subsequent iterations  Three MapReduce jobs are configured once and reused in each iteration  Data transmission  runMapReduceBCast method (broadcasts data to the Map tasks)  Combine method (collects the result back at the client)  A sketch of this driver loop follows.
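The configure-once, run-many-times pattern can be sketched as the driver loop below. Only runMapReduceBCast and the combine step are named on the slide; the interface and class names are hypothetical placeholders, not Twister's actual API.

// Hypothetical interface sketching Twister's iterative pattern:
// partitioned chunks are bound to map tasks once and cached in memory,
// so each iteration only broadcasts the small, changing mapping.
interface IterativeJob {
    void runMapReduceBCast(double[][] currentMapping); // broadcast to cached map tasks
    double[][] combine();                              // collect results at the client
}

public class IterativeDriver {
    public double[][] run(IterativeJob job, double[][] initialMapping,
                          int maxIterations) {
        double[][] mapping = initialMapping;
        for (int iter = 0; iter < maxIterations; iter++) {
            job.runMapReduceBCast(mapping); // send current mapping to all map tasks
            mapping = job.combine();        // gather the updated mapping
        }
        return mapping;
    }
}

The design point is that the large, static distance-matrix chunks never move after the initial configuration; only the mapping, which changes every iteration, is broadcast.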

Workflow Chart

Performance  Parallel efficiency drops greatly once the number of cores exceeds a certain value  The cost of data broadcasting increases as the number of cores grows  Though using more than one broker can ease this problem, it shows that Twister is limited by the capability of the messaging middleware

Improvement  In the latest code…  Twister can automatically use direct TCP channels to transfer large intermediate data items (a broker sends only a unique id; the receiving end then downloads the data via TCP)  For smaller (<1KB) data items it still uses the message brokers  With this improvement, better performance was achieved for matrix multiplication  This work is still in progress…
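The hybrid transfer scheme described above might be organized along the following lines. The 1 KB threshold comes from the slide, while the class and method names are invented for illustration and do not reflect Twister's real implementation.

// Sketch of the size-based dispatch described on the slide: small
// messages go through the broker; large ones are staged locally,
// announced by a unique id, and pulled over a direct TCP channel.
public class HybridTransfer {
    interface Broker { void publish(byte[] message); }
    interface TcpStage { String stage(byte[] payload); } // returns a unique id

    private static final int BROKER_LIMIT_BYTES = 1024; // "<1KB" per the slide

    public void send(byte[] payload, Broker broker, TcpStage tcp) {
        if (payload.length < BROKER_LIMIT_BYTES) {
            broker.publish(payload);          // small item: broker carries the data
        } else {
            String id = tcp.stage(payload);   // large item: stage for direct TCP pull
            broker.publish(id.getBytes());    // broker carries only the unique id
        }
    }
}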

Conclusion  These new implementations extend the scope of applications that Twister supports.  With iterative MapReduce functions, data partitioning, caching, and reusable configuration, Twister can solve problems in a flexible and efficient fashion.

Conclusion cont’d  Twister’s performance largely depends on the performance of the specific messaging middleware adopted.  Balancing the requirements of an iterative algorithm against the capability of the messaging middleware is an interesting research issue.  Twister scripts can simulate some functions of a distributed file system but need further optimization.  In future work, we will integrate Twister with a customized messaging middleware and a distributed file system.

BLAST Performance on IU PolarGrid Node

Twister-BLAST vs. Hadoop-BLAST Performance Test

Twister MDS Interpolation Performance Test

Twister GTM Interpolation Performance Test

Twister MDS Performance Test  A metagenomics dataset with nearly 1 billion pairwise distances was mapped to 3-D.  In the performance test, a Twister environment with one ActiveMQ message broker was established, and Twister MDS was run for 100 iterations.