Twister2: Design of a Big Data Toolkit

Presentation transcript:

Twister2: Design of a Big Data Toolkit. Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox. Digital Science Center, Indiana University Bloomington. skamburu@indiana.edu. ExaMPI 2017

Motivation
Use of public clouds is increasing rapidly, and edge computing is adding another dimension.
Clouds are becoming diverse, with subsystems containing GPUs, FPGAs, high-performance networks, storage, and memory.
Rich software stacks: HPC (High Performance Computing) for parallel computing, and the Apache Big Data Software Stack (ABDS), which is much more popular than HPC, for big data.
Big data systems are characterized by low performance and high usability.
The event-driven computing model is becoming mainstream: HPC asynchronous many-task (AMT) systems, all major big data frameworks, and services in the form of Function as a Service (FaaS).

Big Data Landscape

Comparing Spark, Flink, and MPI on Global Machine Learning (GML). Note that Spark and Flink are successful on Local Machine Learning (LML), not GML, and currently LML is more common than GML.

Multidimensional Scaling: 3 Nested Parallel Sections
MPI is a factor of 20-200 faster than Spark and Flink.
[Figures: MDS execution time on 16 nodes with 20 processes per node and a varying number of points; MDS execution time with 32000 points on a varying number of nodes, each node running 20 parallel tasks.]

Terasort: sorting 1TB of data records. Partition the data using a sample, regroup, and transfer the data using MPI.
[Figures: Terasort execution time on 64 and 32 nodes. Only MPI separates the sorting time and communication time, as the other two frameworks do not provide a viable method to measure them accurately. Sorting time includes data save time. MPI-IB is MPI over InfiniBand.]
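To make the sample-and-regroup step concrete, here is a minimal sketch of sample-based partitioning with MPI, assuming mpi4py and small integer keys standing in for the records; the sample sizes and routine are illustrative assumptions, not the benchmark code.

```python
# Sketch of Terasort-style sample-and-regroup partitioning with MPI (mpi4py).
# All names and sizes are illustrative; this is not the benchmark implementation.
from bisect import bisect_right
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank holds a local block of integer keys (stand-ins for the records).
local_keys = np.random.randint(0, 1_000_000, size=10_000)

# 1. Sample: every rank contributes a small random sample of its keys.
sample = np.random.choice(local_keys, 100, replace=False)
sorted_samples = np.sort(np.concatenate(comm.allgather(sample)))

# 2. Choose size-1 splitters from the sorted global sample.
step = len(sorted_samples) // size
splitters = sorted_samples[step::step][: size - 1]

# 3. Regroup: route each key to the rank owning its splitter range (MPI alltoall).
buckets = [[] for _ in range(size)]
for k in local_keys:
    buckets[bisect_right(splitters, k)].append(k)
received = comm.alltoall(buckets)

# 4. A local sort of the regrouped keys gives a globally sorted order across ranks.
my_part = np.sort(np.concatenate([np.array(b, dtype=local_keys.dtype) for b in received]))
```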

Heron High Performance Interconnects
InfiniBand and Intel Omni-Path integrations using Libfabric as a library, integrated natively into Heron through the Stream Manager without needing to go through JNI.
[Figures: Latency of Topology A, a long topology with 8 stages, with 1 spout and 7 bolt instances arranged in a chain, with varying parallelism and message sizes; panels (c) and (d) use 128 KB and 128-byte messages. Results are on the KNL cluster. The Yahoo Streaming Benchmark topology runs on the Haswell cluster.]

Layers of Parallel Applications: what we need to write a parallel application
Three main abstractions:
Data Management: data partitioning and placement; managing distributed data.
Communication: internode and intracore communication; the network layer.
Task System: the computation graph; execution (threads/processes); scheduling of executions.

K-means Computation Graph
[Diagrams: the K-Means dataflow graph as in Spark: Data Set <Points> and Data Set <Initial Centroids> feed Map (nearest centroid calculation) and Reduce (update centroids), producing Data Set <Updated Centroids>, which is broadcast back for the next iteration. K-Means in MPI: a compute-iterate loop using AllReduce, with workflow nodes, internal execution nodes, and dataflow communication.]
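As a concrete illustration of the "K-Means in MPI" pattern, here is a minimal sketch assuming mpi4py and NumPy: the centroids are broadcast, each rank computes nearest-centroid assignments on its local points, and the partial sums are combined with AllReduce so every rank holds the updated centroids. Sizes and names are illustrative assumptions.

```python
# Sketch of the "K-Means in MPI" pattern: broadcast centroids, compute
# nearest-centroid assignments locally, AllReduce the partial sums.
# mpi4py/NumPy are assumptions for illustration only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k, dim, iters = 4, 2, 10

# Each rank owns a partition of the points (Data Set <Points>).
points = np.random.rand(1000, dim)

# Initial centroids broadcast from rank 0 (Data Set <Initial Centroids>).
centroids = comm.bcast(np.random.rand(k, dim) if comm.rank == 0 else None, root=0)

for _ in range(iters):
    # Map: nearest-centroid calculation on the local partition.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)

    # Local partial sums and counts per centroid.
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for c in range(k):
        mask = assign == c
        sums[c] = points[mask].sum(axis=0)
        counts[c] = mask.sum()

    # Reduce (update centroids) via AllReduce, so every rank gets the new model.
    global_sums = comm.allreduce(sums, op=MPI.SUM)
    global_counts = comm.allreduce(counts, op=MPI.SUM)
    centroids = global_sums / np.maximum(global_counts, 1)[:, None]
```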

Apache Big Data Systems
Top-down designs targeting one type of application, but users want to use them for every type of application.
Monolithic designs with fixed choices are harder to change and give low performance.
Software engineering does not target advanced hardware.
Only high-level APIs/abstractions are available, making it harder for an advanced user to optimize an application.

Requirements
Large-scale simulation requirements are well understood. We identify 4 types of applications: data pipelines, streaming, machine learning, and Function as a Service.
Big Data requirements are not as clear, but there are a few key use types (a sketch of the first follows this list):
Pleasingly parallel processing (including local machine learning, LML), as in processing different tweets from different users, with perhaps MapReduce-style statistics and visualizations; possibly streaming.
The database model with queries, again supported by MapReduce for horizontal scaling.
Global Machine Learning (GML), with a single job using multiple nodes as in classic parallel computing.
Deep learning, which certainly needs HPC.
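A minimal sketch of the pleasingly parallel use type, assuming Python multiprocessing and made-up per-user tweet partitions: each partition is processed independently (the "map"), and the per-partition statistics are merged at the end (the "reduce").

```python
# Toy sketch of pleasingly parallel, MapReduce-style statistics over
# independent partitions (e.g. tweets grouped by user).
# The data and the use of multiprocessing are illustrative assumptions.
from collections import Counter
from multiprocessing import Pool

partitions = [
    ["spark is fast", "mpi is faster"],      # user A's tweets
    ["flink streams", "heron streams too"],  # user B's tweets
]

def map_partition(tweets):
    # Local (per-partition) statistics: word counts for one user's tweets.
    return Counter(word for t in tweets for word in t.split())

if __name__ == "__main__":
    with Pool() as pool:
        partials = pool.map(map_partition, partitions)
    # Reduce: merge the independent partial results.
    totals = sum(partials, Counter())
    print(totals.most_common(3))
```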

Twister2 Approach
Clearly define functional layers.
Develop base layers as independent components.
Use interoperable common abstractions but multiple polymorphic implementations, allowing users to pick and choose according to requirements, e.g. Communication + Data Management or Communication + Static Graph.
Use HPC features when possible.

Twister2 Components

Twister2 Components

Different applications at different layers

Communication Models
MPI characteristics: tightly synchronized applications; efficient communication (ns latency) using advanced hardware; in-place communication and computation (process scope for state).
Dataflow: model a communication as part of a graph, with nodes as computation tasks and edges as asynchronous communications; a computation is activated when its input data dependencies are satisfied.
Streaming dataflow: pub-sub with data partitioned into streams; streams are unbounded, ordered data tuples; the order of events is important and data are grouped into time windows (a toy windowing sketch follows below).
Machine learning dataflow: iterative computations that keep track of state; there is both a model and data, but only the model is communicated; collective communication operations such as AllReduce and AllGather; can use in-place MPI-style communication.
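A toy sketch of the time-window grouping mentioned under streaming dataflow, assuming a simple generator of (timestamp, value) tuples and a fixed-length (tumbling) window; real engines such as Heron or Flink provide their own windowing APIs.

```python
# Toy sketch of streaming dataflow windowing: an ordered stream of
# (timestamp, value) tuples grouped into fixed-size time windows.
# The window length and event source are illustrative assumptions.
from itertools import count
import random

def event_stream():
    # Unbounded, ordered stream of (event_time_seconds, value) tuples.
    for t in count():
        yield (t * 0.5, random.random())

def tumbling_windows(stream, window_seconds=2.0):
    window_end, bucket = window_seconds, []
    for event_time, value in stream:
        if event_time >= window_end:
            yield window_end, bucket      # emit the completed window
            window_end, bucket = window_end + window_seconds, []
        bucket.append((event_time, value))

# Print an aggregate (count and mean) per window for the first few windows.
for i, (end, events) in enumerate(tumbling_windows(event_stream())):
    values = [v for _, v in events]
    print(f"window ending {end:4.1f}s: {len(values)} events, mean={sum(values)/len(values):.3f}")
    if i == 3:
        break
```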

HPC Runtime versus ABDS Distributed Computing Model on Data Analytics
Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does an in-place combined reduce/broadcast and is fastest (a sketch of the Spark-style emulation follows below).
We need a polymorphic reduction capability that chooses the best implementation.
Use the HPC architecture with a mutable model and immutable data.
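Since Spark has no direct AllReduce, the usual pattern is the reduce-to-driver-then-broadcast loop sketched below; this is an illustrative PySpark sketch with a toy update rule, not code from the benchmarks above.

```python
# Sketch of emulating AllReduce in Spark: reduce partial results to the
# driver, then broadcast the combined model back to the executors.
# The dataset, update rule, and sizes are illustrative assumptions.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("allreduce-emulation").getOrCreate()
sc = spark.sparkContext

points = sc.parallelize([np.random.rand(2) for _ in range(10_000)], numSlices=8)
model = np.zeros(2)

for _ in range(5):
    b_model = sc.broadcast(model)                     # driver -> executors
    grads = points.map(lambda p: p - b_model.value)   # local computation
    total = grads.reduce(lambda a, b: a + b)          # executors -> driver
    model = model + 0.1 * total / points.count()      # toy update pulling the model toward the data mean

spark.stop()
```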

Communication Requirements
We need data-driven, higher-level abstractions and both BSP- and dataflow-style communications over MPI / RDMA / TCP.
MPI requirements: MPI needs to work with Yarn/Mesos (using MPI only as a communication library); MPI needs to work in dynamic environments where processes are added or removed while an application is running; we do not need fault tolerance at the MPI level.

Harp Plugin for Hadoop: an important part of Twister2 (work of Judy Qiu)

Task System
Generating the computation graph dynamically: dynamic scheduling of tasks; allows fine-grained control of the graph.
Generating the computation graph statically: dynamic or static scheduling; suitable for streaming and data query applications; hard to express complex computations, especially with loops.
Hybrid approach: combine both static and dynamic graphs (a sketch of a static graph with dependency-driven scheduling follows below).
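A minimal sketch of a statically declared task graph executed with dependency-driven scheduling, as contrasted above with fully dynamic graphs; the class and task names are illustrative assumptions and not the Twister2 task API.

```python
# Sketch of a static task graph: tasks and edges are declared up front,
# and a simple scheduler runs each task once its dependencies are satisfied.
from collections import defaultdict, deque

class TaskGraph:
    def __init__(self):
        self.deps = defaultdict(set)      # task -> tasks it depends on
        self.children = defaultdict(set)  # task -> tasks depending on it
        self.funcs = {}

    def add_task(self, name, func, depends_on=()):
        self.funcs[name] = func
        for d in depends_on:
            self.deps[name].add(d)
            self.children[d].add(name)

    def run(self):
        # Schedule any task whose dependencies are all done (topological order).
        done, results = set(), {}
        ready = deque(t for t in self.funcs if not self.deps[t])
        while ready:
            t = ready.popleft()
            results[t] = self.funcs[t](*[results[d] for d in sorted(self.deps[t])])
            done.add(t)
            for c in self.children[t]:
                if self.deps[c] <= done:
                    ready.append(c)
        return results

g = TaskGraph()
g.add_task("source", lambda: list(range(10)))
g.add_task("square", lambda xs: [x * x for x in xs], depends_on=["source"])
g.add_task("sum", lambda xs: sum(xs), depends_on=["square"])
print(g.run()["sum"])   # 285
```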

Summary of Twister2: Next-Generation HPC Cloud + Edge + Grid
We suggest an event-driven computing model built around Cloud and HPC, spanning batch, streaming, and edge applications: highly parallel on the cloud, possibly sequential at the edge.
We have built a high-performance data analysis library, SPIDAL.
We have integrated HPC into many Apache systems with HPC-ABDS.
We have done a preliminary analysis of the different runtimes of Hadoop, Spark, Flink, Storm, Heron, Naiad, and DARMA (HPC Asynchronous Many Task). There are different technologies for different circumstances, but they can be unified by high-level abstractions such as communication collectives.
Obviously MPI is best for parallel computing (by definition); Apache systems use dataflow communication, which is natural for them. There is no standard dataflow library (why?). Add dataflow primitives in MPI-4?
MPI could adopt some of the tools of Big Data, as in coordination points (dataflow nodes) and state management with RDD (datasets).