Twister2 for BDEC2
https://twister2.gitbook.io/twister2/
Poznan, Poland, May 15, 2019
Geoffrey Fox, gcf@indiana.edu
http://www.dsc.soic.indiana.edu/ | http://spidal.org/

Twister2 Highlights I
- A "Big Data programming environment" in the spirit of Hadoop, Spark, Flink, Storm and Heron, but one that uses HPC techniques wherever appropriate and outperforms the Apache systems, often by large factors
- Runs preferably under Kubernetes, Mesos or Nomad, but Slurm is also supported
- The highlight is high-performance dataflow supporting iteration; fine-grain, coarse-grain, dynamic, synchronized and asynchronous execution; and both batch and streaming
- Three distinct communication environments:
  - DFW dataflow with distinct source and target tasks; operates at the data level, not the message level, spilling to disk as needed
  - BSP for parallel programming, with MPI as the default; inappropriate for dataflow
  - Storm API for streaming events with pub-sub systems such as Kafka
- Rich state model (API) for objects supporting in-place, distributed, cached and RDD (Spark) style persistence with TSets (compare PCollections in Beam, DataSets in Flink, Streamlets in Storm/Heron); a Spark-style analogue is sketched below
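To make the RDD comparison concrete, the sketch below shows the cached-dataset pattern in Spark's Java RDD API, which is the slide's point of reference; it is a minimal illustrative example using Spark class names, not Twister2's TSet classes, and the dataset is a stand-in.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

// Minimal Spark example of RDD-style persistence, the pattern that
// Twister2 TSets (and Beam PCollections, Flink DataSets) also expose.
public class CachedReduce {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CachedReduce").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Create a distributed dataset and cache it so repeated actions
    // (e.g. iterations of an algorithm) reuse the in-memory partitions.
    JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache();

    int sumOfSquares = data.map(x -> x * x).reduce(Integer::sum); // 55
    long count = data.count();                                    // reuses the cached partitions

    System.out.println("sum of squares = " + sumOfSquares + ", count = " + count);
    sc.close();
  }
}
```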

Software model supported by Twister2: continually interacting in the Intelligent Aether (figure)

Twister2 Highlights II
- Can be a pure batch engine: not built on top of a streaming engine
- Can be a pure streaming engine supporting the Storm/Heron API: not built on top of a batch engine
- Fault tolerance (June 2019) as in Spark or MPI today; dataflow nodes define natural synchronization points
- Many APIs: Data, Communication, Task; high level hiding communication and decomposition (as in Spark) and low level (as in MPI)
- DFW supports MPI and MapReduce primitives: (All)Reduce, Broadcast, (All)Gather, Partition, Join, with and without keys (an MPI-style AllReduce is sketched below)
- Component-based architecture; it is a toolkit that defines the important layers of a distributed processing engine and implements them cleanly, aiming at high-performance data analytics
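As a point of reference for the BSP side, the sketch below shows a plain AllReduce with the Open MPI Java bindings, the same collective pattern that DFW offers at the dataflow level; this is a generic MPI example under an assumed Open MPI installation, not DFW code.

```java
import mpi.MPI;
import mpi.MPIException;

// Minimal BSP-style AllReduce with the Open MPI Java bindings:
// every rank contributes a partial sum and every rank receives the total.
public class AllReduceExample {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();
    int size = MPI.COMM_WORLD.getSize();

    double[] partial = new double[]{rank + 1.0};  // each worker's local contribution
    double[] total = new double[1];

    // Sum the partial values across all ranks; the result is available on every rank.
    MPI.COMM_WORLD.allReduce(partial, total, 1, MPI.DOUBLE, MPI.SUM);

    System.out.println("rank " + rank + " of " + size + ": global sum = " + total[0]);
    MPI.Finalize();
  }
}
```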

Parallel SVM using SGD: execution time for 320K data points with 2000 features and 500 iterations, on 16 nodes with varying parallelism. Execution times: Spark RDD > Twister2 TSet > Twister2 Task > MPI.
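For context on what the benchmark computes, the sketch below is a minimal single-process version of the hinge-loss SGD inner loop; the hyperparameters and data layout are illustrative assumptions, and the parallel runs additionally combine models or gradients across workers (e.g. with an AllReduce in the MPI and Twister2 runs, or an RDD reduce in Spark).

```java
import java.util.Random;

// Minimal hinge-loss SGD for a linear SVM: the per-iteration work each
// worker performs before models are combined across the cluster.
// Labels y are expected to be +1 or -1.
public class SvmSgd {
  public static double[] train(double[][] x, int[] y, int iterations, double lr, double lambda) {
    int features = x[0].length;
    double[] w = new double[features];
    Random rnd = new Random(42);
    for (int it = 0; it < iterations; it++) {
      int i = rnd.nextInt(x.length);           // pick one sample per step
      double margin = y[i] * dot(w, x[i]);
      for (int j = 0; j < features; j++) {
        double grad = lambda * w[j];           // regularization gradient
        if (margin < 1.0) {
          grad -= y[i] * x[i][j];              // hinge-loss gradient when the margin is violated
        }
        w[j] -= lr * grad;
      }
    }
    return w;
  }

  private static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int j = 0; j < a.length; j++) s += a[j] * b[j];
    return s;
  }
}
```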

Twister2 Status
- 100,000 lines of new open source code: mainly Java, but significant Python; https://twister2.gitbook.io/twister2/tutorial
- Operational with documentation and examples
- End of June 2019: fault tolerance, Apache Beam linkage, more applications
- Fall 2019: Python API, C++ implementation (why Python is hard)
- Not scheduled: TensorFlow integration, SQL API, native MPI
- Two IU application foci: integration of machine learning with nano and bio modelling (MLforHPC), and streaming using the Storm API