SPIDAL Analytics Performance February 2017

Presentation transcript:

SPIDAL Analytics Performance, February 2017. NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science. Software: MIDAS HPC-ABDS.

Performance Evaluation

Algorithms and implementations:
- K-Means Clustering: MPI Java and C (LRT-FJ and LRT-BSP), Flink K-Means, Spark K-Means
- DA-MDS: MPI Java (LRT-FJ and LRT-BSP)
- DA-PWC: MPI Java (LRT-FJ)
- MDSasChisq

HPC cluster: 128 Intel Haswell nodes with 2.3 GHz nominal frequency; 96 nodes with 24 cores on 2 sockets (12 cores each); 32 nodes with 36 cores on 2 sockets (18 cores each); 128 GB memory per node; 40 Gbps InfiniBand.

Software: RHEL 6.8, Java 1.8, OpenHFT JavaLang 6.7.2, Habanero Java 0.1.4, OpenMPI 1.10.1.
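LRT-FJ and LRT-BSP refer to the fork-join and bulk-synchronous-parallel long-running-thread models used in the Java implementations: in the fork-join style worker threads are forked for each parallel region and joined afterwards, while in the BSP style a fixed set of threads lives for the whole computation and synchronizes at barriers. The sketch below is a minimal illustration of the difference using only standard JDK primitives (a ForkJoinPool-backed parallel stream vs. long-lived threads with a CyclicBarrier); the class and the dummy work are hypothetical and this is not the SPIDAL implementation.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.stream.IntStream;

public class ThreadModels {
    static final int THREADS = 4, ITERATIONS = 10;

    // LRT-FJ style: a pool of workers is forked and joined for every parallel region.
    static void forkJoinStyle(Runnable work) {
        for (int iter = 0; iter < ITERATIONS; iter++) {
            IntStream.range(0, THREADS).parallel().forEach(t -> work.run());
            // implicit join at the end of each parallel region
        }
    }

    // LRT-BSP style: threads are created once and synchronize at barriers.
    static void bspStyle(Runnable work) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(THREADS);
        Thread[] threads = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            threads[t] = new Thread(() -> {
                try {
                    for (int iter = 0; iter < ITERATIONS; iter++) {
                        work.run();        // compute superstep
                        barrier.await();   // synchronize before the next superstep
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { /* one thread's share of the computation */ };
        forkJoinStyle(work);
        bspStyle(work);
    }
}
```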

K-Means Clustering

Figure: Java K-Means LRT-BSP, affinity CE vs. NE, performance for 1 million points with 1k, 10k, 50k, 100k, and 500k centers on 16 nodes over varying threads and processes.
Figure: Java vs. C K-Means LRT-BSP, affinity CE, performance for 1 million points with 1k, 10k, 50k, 100k, and 500k centers on 16 nodes over varying threads and processes.
Figure: Java K-Means performance for 1 million points with 1k, 10k, and 100k centers on 16 nodes for LRT-FJ and LRT-BSP over varying threads and processes; affinity pattern CE.
Figure: Java K-Means performance for 1 million points with 50k and 500k centers on 16 nodes for LRT-FJ and LRT-BSP over varying threads and processes; affinity pattern CE.

Note: the gap between the different data sizes follows the expected O(N) scaling.
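The O(N) note reflects the per-iteration cost of the K-Means kernel, which grows linearly with the number of points (and with the number of centers) because every point is compared against every center. A minimal Java sketch of that kernel is shown below as an illustration of the computation being timed; it is not the SPIDAL code, and the array layouts and names are hypothetical.

```java
import java.util.Arrays;

/**
 * One K-Means iteration over N points in D dimensions with K centers.
 * The cost is O(N * K * D): every point is compared against every center.
 */
public class KMeansKernel {
    static void iterate(double[][] points, double[][] centers,
                        double[][] newCenters, int[] counts) {
        int k = centers.length, d = centers[0].length;
        for (double[] row : newCenters) Arrays.fill(row, 0.0);
        Arrays.fill(counts, 0);

        for (double[] p : points) {
            // Find the nearest center for this point.
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int j = 0; j < d; j++) {
                    double diff = p[j] - centers[c][j];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            // Accumulate the point into its center's running sum.
            counts[best]++;
            for (int j = 0; j < d; j++) newCenters[best][j] += p[j];
        }
        // Average the accumulated sums to get the updated centers.
        for (int c = 0; c < k; c++) {
            if (counts[c] > 0) {
                for (int j = 0; j < d; j++) newCenters[c][j] /= counts[c];
            }
        }
    }
}
```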

K-Means Clustering: intra-node parallelism patterns (a sketch of the corresponding core pinnings follows below)

All-Procs: #processes = total parallelism; each process is pinned to a separate core and the processes are balanced across the two sockets.
Threads internal: 1 process per node, pinned to T cores, where T = #threads; threads are pinned to separate cores and balanced across the two sockets.
Hybrid: 2 processes per node, each pinned to T cores, so total parallelism = T x 2, where T is 8 or 12; threads are pinned to separate cores and balanced across the two sockets.

Figure: Java and C K-Means performance for 1 million points with 100k centers on 16 nodes for LRT-FJ and LRT-BSP over varying intra-node parallelisms; affinity pattern CE.
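To make the three patterns concrete, the sketch below computes the core IDs that ranks or threads would occupy on one 24-core node with two 12-core sockets when "balanced across the two sockets". It is only an illustrative calculation under those assumptions, not the launcher or affinity configuration actually used in the experiments.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative core-pinning layouts for one 24-core node with two 12-core sockets. */
public class PinningPatterns {
    static final int CORES = 24, SOCKETS = 2, PER_SOCKET = CORES / SOCKETS;

    /** Spread 'count' workers evenly over the two sockets, one core each. */
    static List<Integer> balancedCores(int count) {
        List<Integer> cores = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int socket = i % SOCKETS;   // alternate sockets
            int slot = i / SOCKETS;     // next free core on that socket
            cores.add(socket * PER_SOCKET + slot);
        }
        return cores;
    }

    public static void main(String[] args) {
        // All-Procs: 24 single-threaded processes, one core per process.
        System.out.println("All-Procs        : " + balancedCores(24));
        // Threads internal: 1 process per node with T threads (here T = 12),
        // each thread pinned to its own core, spread over both sockets.
        System.out.println("Threads internal : " + balancedCores(12));
        // Hybrid: 2 processes per node with T threads each (here T = 12),
        // 2 x 12 = 24 cores in total, split between the two processes.
        List<Integer> all = balancedCores(24);
        System.out.println("Hybrid, process 0: " + all.subList(0, 12));
        System.out.println("Hybrid, process 1: " + all.subList(12, 24));
    }
}
```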

K-Means Clustering: Flink and Spark

Dataflow: Data Set <Points> and Data Set <Initial Centroids> feed a Map step (nearest-centroid calculation), followed by a Reduce step (update centroids) that produces Data Set <Updated Centroids>, which is broadcast back for the next iteration. Note the differences in the communication architectures of the frameworks.

Note: times are on a log scale. Bars indicate compute-only times, which are similar across these frameworks; the overhead is dominated by communication in Flink and Spark.

Figure: K-Means total and compute times for 100k 2D points and 1k, 2k, 4k, 8k, and 16k centroids for Spark, Flink, and MPI Java LRT-BSP CE, run on 1 node as 24x1.
Figure: K-Means total and compute times for 1 million 2D points and 1k, 10k, 50k, 100k, and 500k centroids for Spark, Flink, and MPI Java LRT-BSP CE, run on 16 nodes as 24x1.
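The communication structure behind this dataflow can be outlined as one iteration of distributed K-Means: each worker computes partial centroid sums over its local points (the compute-only time in the bars), and the partial results are then reduced across workers and the updated centroids redistributed (the communication that dominates the Flink and Spark overhead). The skeleton below marks those two phases; the Communicator interface and all names are hypothetical placeholders, not the APIs of MPI, Flink, or Spark.

```java
/**
 * Skeleton of one distributed K-Means iteration, marking where the
 * frameworks differ: the reduce of partial sums and the broadcast of
 * the updated centroids.
 */
public class DistributedKMeansIteration {

    interface Communicator {
        // Placeholder for, e.g., an MPI allreduce, a Spark reduce + broadcast,
        // or a Flink bulk iteration step.
        double[][] reducePartialSums(double[][] localSums);
        long[] reduceCounts(long[] localCounts);
    }

    static double[][] iterate(double[][] localPoints, double[][] centers,
                              Communicator comm) {
        int k = centers.length, d = centers[0].length;
        double[][] sums = new double[k][d];
        long[] counts = new long[k];

        // Map: each worker assigns only its local points (compute-only time).
        for (double[] p : localPoints) {
            int best = nearestCenter(p, centers);
            counts[best]++;
            for (int j = 0; j < d; j++) sums[best][j] += p[j];
        }

        // Reduce + broadcast: the communication phase.
        double[][] globalSums = comm.reducePartialSums(sums);
        long[] globalCounts = comm.reduceCounts(counts);

        double[][] updated = new double[k][d];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < d; j++)
                updated[c][j] = globalCounts[c] > 0
                        ? globalSums[c][j] / globalCounts[c]
                        : centers[c][j];
        return updated;   // becomes the next iteration's broadcast centroids
    }

    static int nearestCenter(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```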

DA-MDS

Figure: Java DA-MDS 50k on 16 of the 24-core nodes.

DA-MDS

Highlighted factors from the plots: 15x, 74x, and 2.6x.
Figure: Java DA-MDS 200k on 16 of the 24-core nodes.
Linux perf statistics for a DA-MDS run of 18x2 on 32 nodes; affinity pattern CE.
Figure: Java DA-MDS speedup comparison for LRT-FJ and LRT-BSP.

DA-MDS: performance with and without optimizations (the bottom row shows communication performance)

Figure: Java DA-MDS 100k on 48 of the 24-core nodes.
Figure: Java DA-MDS 200k on 48 of the 24-core nodes.
Figure: Java DA-MDS 100k communication on 48 of the 24-core nodes.
Figure: Java DA-MDS 200k communication on 48 of the 24-core nodes.

Note: communication time increases linearly when going from 100k to 200k with shared-memory communication.
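The shared-memory communication referred to here exchanges data between ranks on the same node through shared memory rather than through MPI messages. One common way to do this in Java is to have every rank on a node map the same memory-mapped file; the sketch below shows that general idea with standard java.nio APIs. The file path, buffer layout, and class name are assumptions for illustration, the required node-level barrier is omitted, and this is not the actual SPIDAL communication layer.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Sketch of intra-node data exchange through a memory-mapped file.
 * Every rank on the node maps the same file; each writes its own slot
 * and can read the others' slots without sending MPI messages.
 */
public class SharedMemoryExchange {
    public static void main(String[] args) throws IOException {
        int ranksPerNode = 24;   // e.g. one rank per core
        int slotBytes = 8;       // one double per rank in this toy example
        int myRank = Integer.parseInt(args.length > 0 ? args[0] : "0");

        Path file = Paths.get("/dev/shm/node-shared-buffer");  // hypothetical path
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, ranksPerNode * slotBytes);

            // Write this rank's partial result into its own slot.
            buf.putDouble(myRank * slotBytes, 42.0 /* partial result */);

            // After a node-level barrier (not shown), read the other ranks' slots.
            double sum = 0.0;
            for (int r = 0; r < ranksPerNode; r++) {
                sum += buf.getDouble(r * slotBytes);
            }
            System.out.println("Rank " + myRank + " sees node sum " + sum);
        }
    }
}
```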

DA-MDS: performance with and without optimizations; best speedup with varying problem size and core counts on the 24-core and 36-core nodes.

Figure: Java DA-MDS 400k on 48 of the 24-core nodes.
Figure: Java DA-MDS speedup with varying data sizes.
Figure: Java DA-MDS speedup on the 36-core nodes.

Note: the 400k LRT-FJ CE run could not be completed because it reached the memory limits.

DA-MDS: speedup with different optimizations, relative to 1 process per node on 48 nodes.

Chart annotations:
- Best MPI; inter- and intra-node.
- MPI inter/intra-node; Java not optimized.
- Best FJ threads intra-node; MPI inter-node.

Observation: BSP threads are better than FJ threads and at best match Java MPI.

Unoptimized MDSasChisq and DA-PWC

Figure: Java MDSasChisq performance on 32 of the 24-core nodes.
Figure: Java MDSasChisq speedup on 32 of the 24-core nodes.
Figure: Java DA-PWC performance on 32 of the 24-core nodes.

Note: the 400k LRT-FJ CE run could not be completed because it reached the memory limits.