Twister2 for BDEC2
https://twister2.gitbook.io/twister2/
Poznan, Poland, May 15, 2019
Geoffrey Fox, gcf@indiana.edu
http://www.dsc.soic.indiana.edu/ | http://spidal.org/

Twister2 Highlights I
- A "Big Data programming environment" in the spirit of Hadoop, Spark, Flink, Storm and Heron, but one that uses HPC techniques wherever appropriate and outperforms the Apache systems, often by large factors
- Runs preferably under Kubernetes, Mesos or Nomad, but Slurm is also supported
- The highlight is high-performance dataflow supporting iteration; fine-grain, coarse-grain, dynamic, synchronized and asynchronous execution; and both batch and streaming
- Three distinct communication environments:
  - DFW dataflow with distinct source and target tasks; operates at the data level, not the message level, spilling to disk as needed
  - BSP for parallel programming, with MPI as the default; inappropriate for dataflow
  - Storm API for streaming events with pub-sub systems such as Kafka
- Rich state model (API) for objects supporting in-place, distributed, cached and RDD (Spark) style persistence with TSets (compare PCollections in Beam, DataSets in Flink, Streamlets in Storm/Heron); a Spark-style analogue is sketched below
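To make the RDD comparison concrete, the sketch below shows the cached-dataset pattern in Spark's Java RDD API, which is the slide's point of reference; it is a minimal illustrative example using Spark class names, not Twister2's TSet classes, and the dataset is a stand-in.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

// Minimal Spark example of RDD-style persistence, the pattern that
// Twister2 TSets (and Beam PCollections, Flink DataSets) also expose.
public class CachedReduce {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CachedReduce").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Create a distributed dataset and cache it so repeated actions
    // (e.g. iterations of an algorithm) reuse the in-memory partitions.
    JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).cache();

    int sumOfSquares = data.map(x -> x * x).reduce(Integer::sum); // 55
    long count = data.count();                                    // reuses the cached partitions

    System.out.println("sum of squares = " + sumOfSquares + ", count = " + count);
    sc.close();
  }
}
```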

Software model supported by Twister2: continually interacting in the Intelligent Aether (figure)

Twister2 Highlights II
- Can be a pure batch engine: not built on top of a streaming engine
- Can be a pure streaming engine supporting the Storm/Heron API: not built on top of a batch engine
- Fault tolerance (June 2019) as in Spark or MPI today; dataflow nodes define natural synchronization points
- Many APIs: Data, Communication, Task; high level hiding communication and decomposition (as in Spark) and low level (as in MPI)
- DFW supports MPI and MapReduce primitives: (All)Reduce, Broadcast, (All)Gather, Partition, Join, with and without keys (an MPI-style AllReduce is sketched below)
- Component-based architecture; it is a toolkit that defines the important layers of a distributed processing engine and implements them cleanly, aiming at high-performance data analytics
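As a point of reference for the BSP side, the sketch below shows a plain AllReduce with the Open MPI Java bindings, the same collective pattern that DFW offers at the dataflow level; this is a generic MPI example under an assumed Open MPI installation, not DFW code.

```java
import mpi.MPI;
import mpi.MPIException;

// Minimal BSP-style AllReduce with the Open MPI Java bindings:
// every rank contributes a partial sum and every rank receives the total.
public class AllReduceExample {
  public static void main(String[] args) throws MPIException {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.getRank();
    int size = MPI.COMM_WORLD.getSize();

    double[] partial = new double[]{rank + 1.0};  // each worker's local contribution
    double[] total = new double[1];

    // Sum the partial values across all ranks; the result is available on every rank.
    MPI.COMM_WORLD.allReduce(partial, total, 1, MPI.DOUBLE, MPI.SUM);

    System.out.println("rank " + rank + " of " + size + ": global sum = " + total[0]);
    MPI.Finalize();
  }
}
```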

Parallel SVM using SGD: execution time for 320K data points with 2000 features and 500 iterations, on 16 nodes with varying parallelism. Execution times: Spark RDD > Twister2 TSet > Twister2 Task > MPI.
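For context on what the benchmark computes, the sketch below is a minimal single-process version of the hinge-loss SGD inner loop; the hyperparameters and data layout are illustrative assumptions, and the parallel runs additionally combine models or gradients across workers (e.g. with an AllReduce in the MPI and Twister2 runs, or an RDD reduce in Spark).

```java
import java.util.Random;

// Minimal hinge-loss SGD for a linear SVM: the per-iteration work each
// worker performs before models are combined across the cluster.
// Labels y are expected to be +1 or -1.
public class SvmSgd {
  public static double[] train(double[][] x, int[] y, int iterations, double lr, double lambda) {
    int features = x[0].length;
    double[] w = new double[features];
    Random rnd = new Random(42);
    for (int it = 0; it < iterations; it++) {
      int i = rnd.nextInt(x.length);           // pick one sample per step
      double margin = y[i] * dot(w, x[i]);
      for (int j = 0; j < features; j++) {
        double grad = lambda * w[j];           // regularization gradient
        if (margin < 1.0) {
          grad -= y[i] * x[i][j];              // hinge-loss gradient when the margin is violated
        }
        w[j] -= lr * grad;
      }
    }
    return w;
  }

  private static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int j = 0; j < a.length; j++) s += a[j] * b[j];
    return s;
  }
}
```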

Twister2 Status
- 100,000 lines of new open source code: mainly Java, but significant Python; https://twister2.gitbook.io/twister2/tutorial
- Operational with documentation and examples
- End of June 2019: fault tolerance, Apache Beam linkage, more applications
- Fall 2019: Python API, C++ implementation (why Python is hard)
- Not scheduled: TensorFlow integration, SQL API, native MPI
- Two IU application foci: integration of machine learning with nano and bio modelling (MLforHPC), and streaming using the Storm API