Presentation transcript:

Motivation: Contemporary big data tools such as MapReduce and graph processing tools have fixed data abstractions and support a limited set of communication operations. MPI contains abundant and highly optimized collective communication operations but is limited in its data abstractions. To improve expressiveness and performance in big data processing, we introduce the Harp library, which provides data abstractions and related communication abstractions and transforms the map-reduce programming model into a map-collective model.

Features: Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0). Hierarchical data abstraction on arrays, key-values, and graphs for easy programming expressiveness. Collective communication model to support various communication operations on the data abstractions. Caching with buffer management for the memory allocation required by computation and communication. BSP-style parallelism. Fault tolerance with checkpointing.

Architecture: Application layer: MapReduce Applications, Map-Collective Applications. Framework layer: MapReduce V2, Harp. Resource Manager layer: YARN.

Parallelism Model: MapReduce Model: map tasks (M), Shuffle, reduce tasks (R). Map-Collective Model: map tasks (M) connected by Collective Communication.
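The contrast between the two models can be illustrated with a small in-process simulation. This is only a sketch: the function names (`mapreduce_sum`, `map_collective_sum`) are hypothetical, workers are plain Python dictionaries rather than Hadoop tasks, and nothing here reflects Harp's actual API.

```python
from collections import defaultdict


def mapreduce_sum(worker_inputs, num_reducers=2):
    """MapReduce style: map tasks emit (key, value) pairs, a shuffle routes
    each key to one reducer, and the reducers produce the final values."""
    # Map phase: each map task emits its local (key, value) pairs.
    emitted = [list(local.items()) for local in worker_inputs]

    # Shuffle phase: route every pair to the reducer that owns its key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in emitted:
        for key, value in pairs:
            reducer_inputs[hash(key) % num_reducers][key].append(value)

    # Reduce phase: each reducer sums the values for the keys it owns.
    result = {}
    for grouped in reducer_inputs:
        for key, values in grouped.items():
            result[key] = sum(values)
    return result


def map_collective_sum(worker_inputs):
    """Map-collective style: there is no separate reduce stage. The map
    tasks combine their local data through a collective operation, here a
    simulated allreduce, and every task ends up with the full result."""
    combined = defaultdict(int)
    for local in worker_inputs:
        for key, value in local.items():
            combined[key] += value
    # After the allreduce, each worker holds its own copy of the result.
    return [dict(combined) for _ in worker_inputs]


if __name__ == "__main__":
    inputs = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
    print(mapreduce_sum(inputs))
    print(map_collective_sum(inputs))
```

In the map-collective version each worker keeps the combined result locally, which is what lets an iterative algorithm start its next step without writing to and rereading from reducers.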

Hierarchical Data Abstraction and Collective Communication: Table level: Array Table <Array Type>, Edge Table, Message Table, Vertex Table, Key-Value Table; operations: Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex. Partition level: Array Partition <Array Type>, Edge Partition, Message Partition, Vertex Partition, Key-Value Partition; operations: Broadcast, Send. Basic Types level: Array (Byte Array, Int Array, Long Array, Double Array), Struct Object, Vertices, Edges, Messages, Commutable Key-Values; operations: Broadcast, Send, Gather.
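A minimal sketch of how partitions, tables, and table-level collectives could fit together, assuming a single process that simulates all workers. The class and function names (`ArrayPartition`, `ArrayTable`, `allreduce`, `regroup`, `elementwise_sum`) are hypothetical illustrations chosen to echo the slide, not Harp's real classes or signatures, and the modulo rule used to assign partitions to workers is likewise an assumption.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ArrayPartition:
    """Partition level: an array of basic-type values tagged with an ID."""
    partition_id: int
    values: List[float]


@dataclass
class ArrayTable:
    """Table level: partitions keyed by ID, plus a combiner that merges
    two partitions sharing the same ID."""
    combine: Callable[[List[float], List[float]], List[float]]
    partitions: Dict[int, ArrayPartition] = field(default_factory=dict)

    def add(self, partition: ArrayPartition) -> None:
        existing = self.partitions.get(partition.partition_id)
        if existing is None:
            # Store a copy so tables never share mutable lists.
            self.partitions[partition.partition_id] = ArrayPartition(
                partition.partition_id, list(partition.values))
        else:
            existing.values = self.combine(existing.values, partition.values)


def elementwise_sum(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


def allreduce(tables: List[ArrayTable]) -> List[ArrayTable]:
    """Simulated allreduce: merge every worker's partitions by ID, then
    give each worker its own copy of the merged table."""
    merged = ArrayTable(tables[0].combine)
    for table in tables:
        for partition in table.partitions.values():
            merged.add(partition)
    return [copy.deepcopy(merged) for _ in tables]


def regroup(tables: List[ArrayTable]) -> List[ArrayTable]:
    """Simulated regroup-combine: each partition is sent to the worker
    chosen by partition_id modulo the worker count and combined there."""
    n = len(tables)
    out = [ArrayTable(tables[0].combine) for _ in range(n)]
    for table in tables:
        for partition in table.partitions.values():
            out[partition.partition_id % n].add(partition)
    return out


def make_worker_tables() -> List[ArrayTable]:
    """Two example workers with one overlapping partition ID (0)."""
    w0 = ArrayTable(elementwise_sum)
    w0.add(ArrayPartition(0, [1.0, 2.0]))
    w0.add(ArrayPartition(1, [3.0, 4.0]))
    w1 = ArrayTable(elementwise_sum)
    w1.add(ArrayPartition(0, [10.0, 20.0]))
    w1.add(ArrayPartition(2, [5.0, 6.0]))
    return [w0, w1]
```

Attaching the combiner to the table rather than to each operation means allreduce and regroup differ only in where the merged partitions end up: everywhere, or on one owning worker.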

Performance on Madrid Cluster (8 nodes)