Presentation transcript:

Motivation
Contemporary big data tools such as MapReduce and graph processing tools have fixed data abstractions and support only a limited set of communication operations.
MPI offers abundant, highly optimized collective communication operations but is limited in its data abstractions.
To improve expressiveness and performance in big data processing, we introduce the Harp library, which provides data abstractions and related communication abstractions and transforms the map-reduce programming model into a map-collective model.

Features
Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
Hierarchical data abstraction on arrays, key-values, and graphs for easy programming expressiveness
Collective communication model to support various communication operations on the data abstractions
Caching with buffer management for the memory allocation required by computation and communication
BSP-style parallelism
Fault tolerance with checkpointing
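The BSP-style parallelism listed above can be illustrated with a small in-process sketch. It is not Harp code: the function name `run_bsp`, the use of Python threads as stand-ins for workers, and the averaging rule are all assumptions chosen for illustration. Each superstep has a local computation phase, a barrier, and a communication phase in which every worker reads what all workers published.

```python
import threading


def run_bsp(local_values, supersteps):
    """Run a fixed number of BSP supersteps over in-process worker threads.

    Hypothetical illustration only; threads stand in for distributed workers.
    """
    n = len(local_values)
    state = list(local_values)   # one slot per worker
    published = [0.0] * n        # values made visible during the current superstep
    barrier = threading.Barrier(n)

    def worker(rank):
        for _ in range(supersteps):
            # Local computation phase: publish this worker's current value.
            published[rank] = state[rank]
            barrier.wait()       # wait until every worker has published
            # Communication phase: read all published values (allreduce-style mean).
            mean = sum(published) / n
            barrier.wait()       # wait until every worker has read
            # Move this worker's value halfway toward the global mean.
            state[rank] = (state[rank] + mean) / 2

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    print(run_bsp([0.0, 4.0], supersteps=2))
```

The two barriers per superstep keep a fast worker from overwriting its published value before slower workers have read it, which is the synchronization guarantee the BSP model relies on.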

Architecture
Application: MapReduce Applications, Map-Collective Applications
Framework: MapReduce V2, Harp
Resource Manager: YARN

Parallelism Model
MapReduce Model: map tasks (M) send their output through a shuffle to reduce tasks (R).
Map-Collective Model: map tasks (M) exchange data with each other through collective communication.
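The difference between the two models can be sketched with a word count simulated in a single process. This is a minimal illustration, not Harp or Hadoop code: the function names, the byte-sum partitioning rule, and the choice of allreduce as the collective operation are assumptions.

```python
from collections import defaultdict


def map_task(records):
    """Map phase: count the words in one input split."""
    counts = defaultdict(int)
    for word in records:
        counts[word] += 1
    return dict(counts)


def mapreduce_model(splits, num_reducers=2):
    """MapReduce: map outputs are shuffled by key to separate reduce tasks."""
    map_outputs = [map_task(split) for split in splits]
    # Shuffle: route every key to exactly one reducer (deterministic byte-sum rule).
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output.items():
            target = sum(key.encode("utf-8")) % num_reducers
            reducer_inputs[target][key].append(value)
    # Reduce: each reducer sums the values of the keys it owns.
    return [{k: sum(v) for k, v in inputs.items()} for inputs in reducer_inputs]


def map_collective_model(splits):
    """Map-collective: map tasks combine results among themselves (allreduce)."""
    map_outputs = [map_task(split) for split in splits]
    combined = defaultdict(int)
    for output in map_outputs:
        for key, value in output.items():
            combined[key] += value
    # After the allreduce, every map task holds its own copy of the full result.
    return [dict(combined) for _ in map_outputs]


if __name__ == "__main__":
    splits = [["a", "b", "a"], ["b", "c"]]
    print(mapreduce_model(splits))
    print(map_collective_model(splits))
```

In the MapReduce version the result ends up partitioned across reducers, while in the map-collective version every map task keeps the combined result locally, which is convenient for iterative algorithms that need it in the next step.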

Hierarchical Data Abstraction and Collective Communication
Table level: Array Table <Array Type>, Key-Value Table, Vertex Table, Edge Table, Message Table
Partition level: Array Partition <Array Type>, Key-Value Partition, Vertex Partition, Edge Partition, Message Partition
Basic Types level: Byte Array, Int Array, Long Array, Double Array, Struct Object, Commutable Key-Values, Vertices, Edges, Messages
Collective communication operations shown alongside the hierarchy:
Broadcast, Send, Gather
Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex
Broadcast, Send
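One way to picture the hierarchy is a table that holds partitions, each of which wraps a basic array, with collective operations defined over whole tables. The sketch below is a single-process illustration under stated assumptions: `ArrayPartition`, `ArrayTable`, `allreduce`, and `regroup` are hypothetical names that mirror the slide's vocabulary, not Harp's actual API, and element-wise addition is assumed as the combine rule.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ArrayPartition:
    """Partition level: an ID plus a basic array (here a list of floats)."""
    partition_id: int
    values: list


@dataclass
class ArrayTable:
    """Table level: partitions indexed by ID; equal IDs are combined on add."""
    partitions: dict = field(default_factory=dict)

    def add(self, partition):
        existing = self.partitions.get(partition.partition_id)
        if existing is None:
            self.partitions[partition.partition_id] = ArrayPartition(
                partition.partition_id, list(partition.values))
        else:
            # Assumed combine rule: element-wise sum of the two arrays.
            existing.values = [a + b for a, b in zip(existing.values, partition.values)]


def allreduce(tables):
    """Combine all workers' tables; every worker receives the full result."""
    merged = ArrayTable()
    for table in tables:
        for partition in table.partitions.values():
            merged.add(partition)
    return [copy.deepcopy(merged) for _ in tables]


def regroup(tables):
    """Redistribute partitions so each ID lives on one worker (ID mod workers)."""
    result = [ArrayTable() for _ in tables]
    for table in tables:
        for partition in table.partitions.values():
            result[partition.partition_id % len(tables)].add(partition)
    return result


if __name__ == "__main__":
    w0, w1 = ArrayTable(), ArrayTable()
    w0.add(ArrayPartition(0, [1.0, 2.0]))
    w0.add(ArrayPartition(1, [3.0, 4.0]))
    w1.add(ArrayPartition(0, [10.0, 20.0]))
    print(allreduce([w0, w1])[1].partitions)
    print(regroup([w0, w1])[0].partitions)
```

Keying partitions by ID is what lets the same combine logic serve both operations: allreduce leaves the full table on every worker, while regroup leaves each combined partition on exactly one worker.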

Performance on Madrid Cluster (8 nodes)