SPIDAL Analytics Performance February 2017

Presentation transcript:

SPIDAL Analytics Performance, February 2017. NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science. Software: MIDAS HPC-ABDS.

Performance Evaluation

Algorithms and implementations:
- K-Means Clustering: MPI Java and C (LRT-FJ and LRT-BSP), Flink K-Means, Spark K-Means
- DA-MDS: MPI Java (LRT-FJ and LRT-BSP)
- DA-PWC: MPI Java (LRT-FJ)
- MDSasChisq

HPC cluster: 128 Intel Haswell nodes with 2.3 GHz nominal frequency; 96 nodes with 24 cores on 2 sockets (12 cores each); 32 nodes with 36 cores on 2 sockets (18 cores each); 128 GB memory per node; 40 Gbps InfiniBand.

Software: RHEL 6.8, Java 1.8, OpenHFT JavaLang 6.7.2, Habanero Java 0.1.4, OpenMPI 1.10.1.
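LRT-FJ and LRT-BSP refer to the fork-join and bulk-synchronous-parallel long-running-thread models used in the Java implementations: in the fork-join style worker threads are forked for each parallel region and joined afterwards, while in the BSP style a fixed set of threads lives for the whole computation and synchronizes at barriers. The sketch below is a minimal illustration of the difference using only standard JDK primitives (a ForkJoinPool-backed parallel stream vs. long-lived threads with a CyclicBarrier); the class and the dummy work are hypothetical and this is not the SPIDAL implementation.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.stream.IntStream;

public class ThreadModels {
    static final int THREADS = 4, ITERATIONS = 10;

    // LRT-FJ style: a pool of workers is forked and joined for every parallel region.
    static void forkJoinStyle(Runnable work) {
        for (int iter = 0; iter < ITERATIONS; iter++) {
            IntStream.range(0, THREADS).parallel().forEach(t -> work.run());
            // implicit join at the end of each parallel region
        }
    }

    // LRT-BSP style: threads are created once and synchronize at barriers.
    static void bspStyle(Runnable work) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(THREADS);
        Thread[] threads = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            threads[t] = new Thread(() -> {
                try {
                    for (int iter = 0; iter < ITERATIONS; iter++) {
                        work.run();        // compute superstep
                        barrier.await();   // synchronize before the next superstep
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { /* one thread's share of the computation */ };
        forkJoinStyle(work);
        bspStyle(work);
    }
}
```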

K-Means Clustering

Figure: Java K-Means LRT-BSP, affinity CE vs. NE, performance for 1 million points with 1k, 10k, 50k, 100k, and 500k centers on 16 nodes over varying threads and processes.
Figure: Java vs. C K-Means LRT-BSP, affinity CE, performance for 1 million points with 1k, 10k, 50k, 100k, and 500k centers on 16 nodes over varying threads and processes.
Figure: Java K-Means performance for 1 million points with 1k, 10k, and 100k centers on 16 nodes for LRT-FJ and LRT-BSP over varying threads and processes; affinity pattern CE.
Figure: Java K-Means performance for 1 million points with 50k and 500k centers on 16 nodes for LRT-FJ and LRT-BSP over varying threads and processes; affinity pattern CE.

Note: the gap between the different data sizes follows the expected O(N) scaling.
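The O(N) note reflects the per-iteration cost of the K-Means kernel, which grows linearly with the number of points (and with the number of centers) because every point is compared against every center. A minimal Java sketch of that kernel is shown below as an illustration of the computation being timed; it is not the SPIDAL code, and the array layouts and names are hypothetical.

```java
import java.util.Arrays;

/**
 * One K-Means iteration over N points in D dimensions with K centers.
 * The cost is O(N * K * D): every point is compared against every center.
 */
public class KMeansKernel {
    static void iterate(double[][] points, double[][] centers,
                        double[][] newCenters, int[] counts) {
        int k = centers.length, d = centers[0].length;
        for (double[] row : newCenters) Arrays.fill(row, 0.0);
        Arrays.fill(counts, 0);

        for (double[] p : points) {
            // Find the nearest center for this point.
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int j = 0; j < d; j++) {
                    double diff = p[j] - centers[c][j];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            // Accumulate the point into its center's running sum.
            counts[best]++;
            for (int j = 0; j < d; j++) newCenters[best][j] += p[j];
        }
        // Average the accumulated sums to get the updated centers.
        for (int c = 0; c < k; c++) {
            if (counts[c] > 0) {
                for (int j = 0; j < d; j++) newCenters[c][j] /= counts[c];
            }
        }
    }
}
```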

K-Means Clustering: intra-node parallelism patterns (a sketch of the corresponding core pinnings follows below)

All-Procs: #processes = total parallelism; each process is pinned to a separate core and the processes are balanced across the two sockets.
Threads internal: 1 process per node, pinned to T cores, where T = #threads; threads are pinned to separate cores and balanced across the two sockets.
Hybrid: 2 processes per node, each pinned to T cores, so total parallelism = T x 2, where T is 8 or 12; threads are pinned to separate cores and balanced across the two sockets.

Figure: Java and C K-Means performance for 1 million points with 100k centers on 16 nodes for LRT-FJ and LRT-BSP over varying intra-node parallelisms; affinity pattern CE.
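To make the three patterns concrete, the sketch below computes the core IDs that ranks or threads would occupy on one 24-core node with two 12-core sockets when "balanced across the two sockets". It is only an illustrative calculation under those assumptions, not the launcher or affinity configuration actually used in the experiments.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative core-pinning layouts for one 24-core node with two 12-core sockets. */
public class PinningPatterns {
    static final int CORES = 24, SOCKETS = 2, PER_SOCKET = CORES / SOCKETS;

    /** Spread 'count' workers evenly over the two sockets, one core each. */
    static List<Integer> balancedCores(int count) {
        List<Integer> cores = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int socket = i % SOCKETS;   // alternate sockets
            int slot = i / SOCKETS;     // next free core on that socket
            cores.add(socket * PER_SOCKET + slot);
        }
        return cores;
    }

    public static void main(String[] args) {
        // All-Procs: 24 single-threaded processes, one core per process.
        System.out.println("All-Procs        : " + balancedCores(24));
        // Threads internal: 1 process per node with T threads (here T = 12),
        // each thread pinned to its own core, spread over both sockets.
        System.out.println("Threads internal : " + balancedCores(12));
        // Hybrid: 2 processes per node with T threads each (here T = 12),
        // 2 x 12 = 24 cores in total, split between the two processes.
        List<Integer> all = balancedCores(24);
        System.out.println("Hybrid, process 0: " + all.subList(0, 12));
        System.out.println("Hybrid, process 1: " + all.subList(12, 24));
    }
}
```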

K-Means Clustering: Flink and Spark

Dataflow: Data Set <Points> and Data Set <Initial Centroids> feed a Map step (nearest-centroid calculation), followed by a Reduce step (update centroids) that produces Data Set <Updated Centroids>, which is broadcast back for the next iteration. Note the differences in the communication architectures of the frameworks.

Note: times are on a log scale. Bars indicate compute-only times, which are similar across these frameworks; the overhead is dominated by communication in Flink and Spark.

Figure: K-Means total and compute times for 100k 2D points and 1k, 2k, 4k, 8k, and 16k centroids for Spark, Flink, and MPI Java LRT-BSP CE, run on 1 node as 24x1.
Figure: K-Means total and compute times for 1 million 2D points and 1k, 10k, 50k, 100k, and 500k centroids for Spark, Flink, and MPI Java LRT-BSP CE, run on 16 nodes as 24x1.
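The communication structure behind this dataflow can be outlined as one iteration of distributed K-Means: each worker computes partial centroid sums over its local points (the compute-only time in the bars), and the partial results are then reduced across workers and the updated centroids redistributed (the communication that dominates the Flink and Spark overhead). The skeleton below marks those two phases; the Communicator interface and all names are hypothetical placeholders, not the APIs of MPI, Flink, or Spark.

```java
/**
 * Skeleton of one distributed K-Means iteration, marking where the
 * frameworks differ: the reduce of partial sums and the broadcast of
 * the updated centroids.
 */
public class DistributedKMeansIteration {

    interface Communicator {
        // Placeholder for, e.g., an MPI allreduce, a Spark reduce + broadcast,
        // or a Flink bulk iteration step.
        double[][] reducePartialSums(double[][] localSums);
        long[] reduceCounts(long[] localCounts);
    }

    static double[][] iterate(double[][] localPoints, double[][] centers,
                              Communicator comm) {
        int k = centers.length, d = centers[0].length;
        double[][] sums = new double[k][d];
        long[] counts = new long[k];

        // Map: each worker assigns only its local points (compute-only time).
        for (double[] p : localPoints) {
            int best = nearestCenter(p, centers);
            counts[best]++;
            for (int j = 0; j < d; j++) sums[best][j] += p[j];
        }

        // Reduce + broadcast: the communication phase.
        double[][] globalSums = comm.reducePartialSums(sums);
        long[] globalCounts = comm.reduceCounts(counts);

        double[][] updated = new double[k][d];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < d; j++)
                updated[c][j] = globalCounts[c] > 0
                        ? globalSums[c][j] / globalCounts[c]
                        : centers[c][j];
        return updated;   // becomes the next iteration's broadcast centroids
    }

    static int nearestCenter(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centers[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```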

DA-MDS

Figure: Java DA-MDS 50k on 16 of the 24-core nodes.

DA-MDS

Highlighted factors from the plots: 15x, 74x, and 2.6x.
Figure: Java DA-MDS 200k on 16 of the 24-core nodes.
Linux perf statistics for a DA-MDS run of 18x2 on 32 nodes; affinity pattern CE.
Figure: Java DA-MDS speedup comparison for LRT-FJ and LRT-BSP.

DA-MDS: performance with and without optimizations (the bottom row shows communication performance)

Figure: Java DA-MDS 100k on 48 of the 24-core nodes.
Figure: Java DA-MDS 200k on 48 of the 24-core nodes.
Figure: Java DA-MDS 100k communication on 48 of the 24-core nodes.
Figure: Java DA-MDS 200k communication on 48 of the 24-core nodes.

Note: communication time increases linearly when going from 100k to 200k with shared-memory communication.
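The shared-memory communication referred to here exchanges data between ranks on the same node through shared memory rather than through MPI messages. One common way to do this in Java is to have every rank on a node map the same memory-mapped file; the sketch below shows that general idea with standard java.nio APIs. The file path, buffer layout, and class name are assumptions for illustration, the required node-level barrier is omitted, and this is not the actual SPIDAL communication layer.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Sketch of intra-node data exchange through a memory-mapped file.
 * Every rank on the node maps the same file; each writes its own slot
 * and can read the others' slots without sending MPI messages.
 */
public class SharedMemoryExchange {
    public static void main(String[] args) throws IOException {
        int ranksPerNode = 24;   // e.g. one rank per core
        int slotBytes = 8;       // one double per rank in this toy example
        int myRank = Integer.parseInt(args.length > 0 ? args[0] : "0");

        Path file = Paths.get("/dev/shm/node-shared-buffer");  // hypothetical path
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, ranksPerNode * slotBytes);

            // Write this rank's partial result into its own slot.
            buf.putDouble(myRank * slotBytes, 42.0 /* partial result */);

            // After a node-level barrier (not shown), read the other ranks' slots.
            double sum = 0.0;
            for (int r = 0; r < ranksPerNode; r++) {
                sum += buf.getDouble(r * slotBytes);
            }
            System.out.println("Rank " + myRank + " sees node sum " + sum);
        }
    }
}
```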

DA-MDS: performance with and without optimizations; best speedup with varying problem size and core counts on the 24-core and 36-core nodes.

Figure: Java DA-MDS 400k on 48 of the 24-core nodes.
Figure: Java DA-MDS speedup with varying data sizes.
Figure: Java DA-MDS speedup on the 36-core nodes.

Note: the 400k LRT-FJ CE run could not be completed because it reached the memory limits.

DA-MDS: speedup with different optimizations, relative to 1 process per node on 48 nodes.

Chart annotations:
- Best MPI; inter- and intra-node.
- MPI inter/intra-node; Java not optimized.
- Best FJ threads intra-node; MPI inter-node.

Observation: BSP threads are better than FJ threads and at best match Java MPI.

Unoptimized MDSasChisq and DA-PWC

Figure: Java MDSasChisq performance on 32 of the 24-core nodes.
Figure: Java MDSasChisq speedup on 32 of the 24-core nodes.
Figure: Java DA-PWC performance on 32 of the 24-core nodes.

Note: the 400k LRT-FJ CE run could not be completed because it reached the memory limits.