Twister:Net - Communication Library for Big Data Processing in HPC and Cloud Environment
Supun Kamburugamuve, Pulasthi Wickramasinghe, Kannan Govindarajan, Ahmet Uyar, Gurhan Gunduz, Vibhatha Abeykoon and Geoffrey C. Fox
Indiana University, Bloomington
July 6th, 2018

Overview
- No library implements optimized dataflow communication operations
- Need both BSP and dataflow style communications (MPI / RDMA / TCP)
- Every major big data project implements communications on its own, with different semantics for the same operation
- Communication is implemented at a higher level; only point-to-point operations are considered at the communication level
- Data-driven higher-level abstractions
- Handling both streaming and batch

Twister2 – HPC and Big Data

Requirements
- Modeling the dataflow graph
- Preserving data locality
- Streaming computation requirements: the complete graph is executing at any given time, with all tasks of the execution graph running
- Data locality: tasks run in different nodes; two tasks read data and distribute it for processing
- Communication operations

Dataflow Communications
- Non-blocking
- Dynamic data sizes
- Streaming model; the batch case is modeled as a finite stream
- Communications are between a set of tasks in an arbitrary task graph
- Key-based communications
- Communications spilling to disk
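A minimal sketch of the model these bullets describe: sends are non-blocking, a progress() call drives the operation, back-pressure is handled by retrying, and a batch is simply a finite stream terminated by an end-of-stream flag. The interface and class names below are illustrative only, not the actual Twister:Net API.

```java
import java.util.Iterator;

// Hypothetical dataflow operation: non-blocking sends plus explicit progress.
interface DataFlowOperation {
  /** Non-blocking send; returns false when buffers are full (back-pressure). */
  boolean send(int sourceTask, Object message, boolean endOfStream);
  /** Drives the underlying communication; must be called repeatedly. */
  void progress();
  boolean isComplete();
}

final class BatchProducer {
  /** Streams a finite batch of records through the operation. */
  static void run(DataFlowOperation op, int taskId, Iterator<Object> batch) {
    while (batch.hasNext()) {
      Object record = batch.next();
      while (!op.send(taskId, record, false)) {
        op.progress();                       // back-pressure: retry after progressing
      }
    }
    while (!op.send(taskId, null, true)) {   // mark the end of the finite stream
      op.progress();
    }
    while (!op.isComplete()) {               // drive the operation to completion
      op.progress();
    }
  }
}
```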

Twister:Net
- Standalone communication library capturing dataflow communications
- Big data computation model
- Communication can be invoked by multiple tasks in a process
- AllReduce in an arbitrary task graph (right) compared to MPI (left)

Twister:Net
- Communication operators are stateful: they buffer data, handle imbalanced communications, and act as combiners
- Thread safe
- Initialization: MPI or TCP / ZooKeeper
- Buffer management: messages are serialized by the library
- Back-pressure: uses flow control provided by the underlying channel
- Architecture: Reduce, Gather, Partition, Broadcast, AllReduce, AllGather, Keyed-Partition, Keyed-Reduce, Keyed-Gather, each with batch and streaming versions
- Optimized operation vs. serial
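To make the "stateful operator acting as a combiner" point concrete, here is a minimal sketch of a keyed-reduce operator that buffers values per key and combines them locally before anything is forwarded; the class is an illustration of the concept, not the Twister:Net implementation. For a word count, reduceFn would be Integer::sum, so repeated keys are folded before the shuffle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

final class KeyedReduceCombiner<K, V> {
  private final Map<K, V> partials = new HashMap<>();   // operator state
  private final BinaryOperator<V> reduceFn;

  KeyedReduceCombiner(BinaryOperator<V> reduceFn) {
    this.reduceFn = reduceFn;
  }

  /** Buffer the value and combine it with any earlier value for the same key;
      nothing is sent over the network yet. */
  void accept(K key, V value) {
    partials.merge(key, value, reduceFn);
  }

  /** Flush the combined partial results, e.g. when buffers fill up or the
      stream ends; the returned map is what would be forwarded downstream. */
  Map<K, V> flush() {
    Map<K, V> out = new HashMap<>(partials);
    partials.clear();
    return out;
  }
}
```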

Twister:Net Model (diagram: Messages, Communication Operation, Callback Receiver)
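As a hedged illustration of this model, tasks hand messages to a communication operation and completed data is delivered back through a callback receiver registered by the task. The interface below is illustrative only, not the actual Twister:Net class.

```java
// Hypothetical callback receiver mirroring the diagram components above.
interface MessageReceiver {
  /** Invoked by the communication operation when data for a target task is ready. */
  void onMessage(int sourceTask, int targetTask, Object message);
}

final class PrintingReceiver implements MessageReceiver {
  @Override
  public void onMessage(int sourceTask, int targetTask, Object message) {
    System.out.printf("task %d received %s from task %d%n",
        targetTask, message, sourceTask);
  }
}
```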

Bandwidth & Latency
- Latency and bandwidth between two tasks running in two nodes
- Bandwidth utilization of Flink, Twister2 and OpenMPI over 1Gbps, 10Gbps and InfiniBand, with Flink on IPoIB
- Latency of MPI and Twister:Net with different message sizes on a two-node setup

Flink, BSP and DFW Performance
- Total time for Flink and Twister:Net for Reduce and Partition operations on 32 nodes with 640-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size.
- Latency for Reduce and Gather operations on 32 nodes with 256-way parallelism. The time is for 1 million messages in each parallel unit, with the given message size.
- For the BSP-Object case we make two MPI calls: an MPI_Allreduce / MPI_Allgather to exchange the message lengths first, followed by the actual call. The InfiniBand network is used.
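The BSP-Object measurement relies on a two-phase exchange because each rank's serialized object has a different length. Below is a minimal sketch of that pattern, assuming the Open MPI Java bindings (package mpi); exact method names and signatures may differ between binding versions, so treat this as an approximation rather than the benchmark code.

```java
import mpi.MPI;
import mpi.MPIException;

public class TwoPhaseAllGather {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();
    int size = MPI.COMM_WORLD.getSize();

    // Stand-in for a serialized object whose length differs per rank.
    byte[] payload = new byte[(rank + 1) * 8];

    // Call 1: exchange the message lengths.
    int[] lengths = new int[size];
    MPI.COMM_WORLD.allGather(new int[]{payload.length}, 1, MPI.INT,
                             lengths, 1, MPI.INT);

    // Call 2: the actual variable-length exchange.
    int[] displs = new int[size];
    int total = 0;
    for (int i = 0; i < size; i++) { displs[i] = total; total += lengths[i]; }
    byte[] recv = new byte[total];
    MPI.COMM_WORLD.allGatherv(payload, payload.length, MPI.BYTE,
                              recv, lengths, displs, MPI.BYTE);

    MPI.Finalize();
  }
}
```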

Latency vs Heron
- Speedup of about 100x
- Latency of Heron and Twister:Net DFW for Reduce, Broadcast and Partition operations on 16 nodes with 256-way parallelism

Direct Heron Communication Improvements
(diagram: containers of sub-tasks, each container communicating through its Stream Manager)
- This architecture does not work for parallel communications; the Stream Manager is a bottleneck
- We removed the Stream Managers and allowed tasks to communicate directly

Heron Direct Implementation
- Broadcast: latency of broadcast with 16 nodes and 64, 128 and 256 parallel tasks; message size varying from 1K to 32K
- Reduce: latency of reduce with 16 nodes and 64, 128 and 256 parallel tasks; message size varying from 1K to 32K

K-Means Algorithm Performance
- AllReduce communication
- Left: K-Means job execution time on 16 nodes with varying numbers of centers; 2 million points with 320-way parallelism
- Right: K-Means with 4, 8 and 16 nodes, each node running 20 tasks; 2 million points with 16,000 centers
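The AllReduce pattern behind these numbers: each parallel task assigns its local points to the nearest centers and accumulates per-center sums and counts, the accumulators are all-reduced (element-wise sum) across tasks, and every task then divides to obtain identical new centers. A minimal sketch, with the collective abstracted as a function argument; this is illustrative, not the actual benchmark or Twister:Net code.

```java
import java.util.function.UnaryOperator;

final class KMeansStep {
  /** One K-Means iteration: local partial sums/counts, then a global AllReduce. */
  static double[][] update(double[][] points, double[][] centroids,
                           UnaryOperator<double[]> allReduceSum) {
    int k = centroids.length, d = centroids[0].length;
    double[] partial = new double[k * (d + 1)];   // per-center sums plus a count

    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double dist = 0;
        for (int j = 0; j < d; j++) {
          double diff = p[j] - centroids[c][j];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      for (int j = 0; j < d; j++) partial[best * (d + 1) + j] += p[j];
      partial[best * (d + 1) + d] += 1;           // count of points for this center
    }

    // Element-wise sum across all parallel tasks (Twister:Net AllReduce, MPI, ...).
    double[] global = allReduceSum.apply(partial);

    double[][] next = new double[k][d];
    for (int c = 0; c < k; c++) {
      double count = Math.max(global[c * (d + 1) + d], 1.0);
      for (int j = 0; j < d; j++) next[c][j] = global[c * (d + 1) + j] / count;
    }
    return next;
  }
}
```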

Sorting Records
- Dataflow Partition operation

Sorting Records
- Partition the data using a sample, then regroup
- For the DFW case, a single node can get congested if many processes send messages simultaneously
- The BSP algorithm waits for others to send messages in a ring topology and can be inefficient compared to the DFW case, where processes do not wait
- Left: Terasort time on a 16-node cluster with 384-way parallelism; BSP and DFW show the communication time
- Right: Terasort on 32 nodes with 0.5 TB and 1 TB datasets, with a parallelism of 320
- The 16-node cluster is Victor; the 32-node cluster is Juliet, with InfiniBand
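The "partition using a sample and regroup" step can be made concrete with a small sketch: splitter keys are picked from a gathered sample, and each record is routed to the partition whose key range contains it. Gathering the sample across workers and moving the data are left to the communication layer; the class and method names here are illustrative, not the benchmark code.

```java
import java.util.Arrays;

final class SamplePartitioner {
  // Ascending boundary keys; one fewer than the number of partitions.
  private final long[] splitters;

  SamplePartitioner(long[] gatheredSample, int numPartitions) {
    long[] sorted = gatheredSample.clone();
    Arrays.sort(sorted);
    splitters = new long[numPartitions - 1];
    for (int i = 0; i < splitters.length; i++) {
      // Pick evenly spaced keys from the sorted sample as partition boundaries.
      splitters[i] = sorted[(i + 1) * sorted.length / numPartitions];
    }
  }

  /** Destination partition (worker) for a record with this key. */
  int partitionOf(long key) {
    int pos = Arrays.binarySearch(splitters, key);
    return pos >= 0 ? pos + 1 : -pos - 1;
  }
}
```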

Conclusions
- We have created a model for dataflow communications
- We have implemented optimized versions of these communications for common patterns
- The results show that our implementation can perform significantly better than existing big data systems

Questions