SPIDAL Java Optimized February 2017 Software: MIDAS HPC-ABDS

Presentation transcript:

SPIDAL Java Optimized, February 2017
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

SPIDAL Java
From Saliya Ekanayake, Virginia Tech

Learn more at:
  SPIDAL Java paper
  Java Thread and Process Performance paper
  SPIDAL Examples GitHub
  Machine Learning with SPIDAL cookbook
  SPIDAL Java cookbook

Outline:
  Slide 3: factors that affect parallel Java performance
  Slide 4: performance chart
  Slide 5: overview of thread models and affinity
  Slides 6-7: threads in detail
  Slides 8-9: affinity in detail
  Slides 10-13: performance charts
  Slides 14-15: improving Inter-Process Communication (IPC)
  Slides 16-17: other factors (serialization, GC, cache, I/O)

Performance Factors
  Threads: can threads "magically" speed up your application?
  Affinity: how should threads and processes be placed across cores, and why should we care?
  Communication: why is Inter-Process Communication (IPC) expensive, and how can it be improved?
  Other factors: garbage collection, serialization/deserialization, memory references and cache, data read/write.

Java MPI Performs Better than FJ Threads
  128 24-core Haswell nodes running the SPIDAL 200K DA-MDS code; speedup is measured relative to 1 process per node on 48 nodes.
  Best: MPI for both inter- and intra-node communication.
  BSP threads are better than FJ threads and at best match Java MPI.
  Other chart series: MPI inter/intra node with Java not optimized; best FJ threads intra-node with MPI inter-node.

Investigating Process and Thread Models
  Fork-Join (FJ) threads give lower performance than Bulk Synchronous Parallel (BSP) threads. LRT stands for Long Running Threads.
  Results: large effects for Java; the best affinity is process and thread binding to cores (CE); at best, LRT threads mimic the performance of "all processes".
  Diagrams: LRT-FJ (serial work and non-trivial parallel work) and LRT-BSP (adds busy thread synchronization).
  Six thread/process affinity models (process affinity across the top, thread affinity down the side):
                           Cores   Socket   None (All)
    Inherit                CI      SI       NI
    Explicit per core      CE      SE       NE

Threads in Detail
  The usual approach is to use thread pools to execute parallel tasks. This works well for multi-tasking workloads such as serving network requests: pooled threads sleep while no tasks are assigned to them.
  However, this sleep, wake, and get-scheduled cycle is expensive for compute-intensive parallel algorithms.
  Example: an implementation of the classic fork-join construct. The straightforward implementation uses a long-running thread pool for the forked tasks and joins them when they complete. We call this the LRT-FJ implementation (see the sketch below).
  Diagram: serial work alternating with non-trivial parallel work.
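A minimal Java sketch of the LRT-FJ pattern described above, assuming a fixed thread pool: each iteration forks one task per worker into a long-running pool and joins them before the next serial phase. The names PARALLELISM, ITERATIONS, and computeChunk are illustrative and not part of SPIDAL.

// Minimal sketch of the LRT-FJ pattern: for every parallel region, tasks are
// submitted to a long-running pool and joined before the next serial phase.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LrtFjSketch {
    static final int PARALLELISM = 8;
    static final int ITERATIONS = 100;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
        for (int iter = 0; iter < ITERATIONS; iter++) {
            // Serial work happens here on the main thread.

            // Fork: submit one task per worker for the parallel region.
            List<Future<?>> futures = new ArrayList<>();
            for (int rank = 0; rank < PARALLELISM; rank++) {
                final int r = rank;
                futures.add(pool.submit(() -> computeChunk(r)));
            }
            // Join: wait for all tasks; the pooled workers then go back to sleep,
            // which is the overhead the LRT-BSP model avoids.
            for (Future<?> f : futures) {
                f.get();
            }
        }
        pool.shutdown();
    }

    static void computeChunk(int rank) {
        // Placeholder for this worker's non-trivial parallel work.
    }
}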

Threads in Detail
  LRT-FJ (serial work, non-trivial parallel work) is expensive for complex algorithms, especially those with iterations over parallel loops.
  Alternatively, the same structure can be implemented using Long Running Threads in a Bulk Synchronous Parallel style (LRT-BSP), which resembles the classic BSP model of processes.
  LRT-BSP uses a long-running thread pool similar to LRT-FJ, but its threads occupy CPUs at all times ("hot" threads).
  LRT-FJ vs. LRT-BSP: FJ has high context-switch overhead, while BSP replicates the serial work in every thread but with reduced overhead; synchronization is implicit in FJ, while BSP uses explicit busy synchronization.
  Diagram: LRT-BSP with serial work, non-trivial parallel work, and busy thread synchronization (see the sketch below).
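A minimal Java sketch of the LRT-BSP pattern, assuming a simple sense-reversing spin barrier: worker threads are started once, replicate the serial work, and busy-wait at the barrier after each parallel phase. WORKERS, ITERATIONS, and computeChunk are illustrative names, not SPIDAL's API.

// Minimal sketch of the LRT-BSP pattern: "hot" worker threads live for the whole
// computation, replicate the serial work, and meet at a busy-spin barrier.
import java.util.concurrent.atomic.AtomicInteger;

public class LrtBspSketch {
    static final int WORKERS = 8;
    static final int ITERATIONS = 100;

    // Sense-reversing busy barrier shared by all workers.
    static final AtomicInteger arrived = new AtomicInteger(0);
    static volatile boolean sense = false;

    static void busyBarrier(boolean[] localSense) {
        localSense[0] = !localSense[0];
        if (arrived.incrementAndGet() == WORKERS) {
            arrived.set(0);
            sense = localSense[0];        // last thread releases everyone
        } else {
            while (sense != localSense[0]) {
                Thread.onSpinWait();      // busy wait: the thread stays on its core
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[WORKERS];
        for (int rank = 0; rank < WORKERS; rank++) {
            final int r = rank;
            workers[r] = new Thread(() -> {
                boolean[] localSense = {false};
                for (int iter = 0; iter < ITERATIONS; iter++) {
                    // Serial work is replicated in every worker (BSP style).
                    computeChunk(r, iter);    // non-trivial parallel work
                    busyBarrier(localSense);  // explicit busy synchronization
                }
            });
            workers[r].start();
        }
        for (Thread t : workers) t.join();
    }

    static void computeChunk(int rank, int iter) {
        // Placeholder for this worker's share of the parallel work.
    }
}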

Affinity in Detail
  Non-Uniform Memory Access (NUMA) and threads.
  Example: one node in the Juliet HPC cluster has 2 Intel Haswell sockets with 12 (or 18) cores each, 2 hyper-threads (HT) per core, separate L1/L2 caches, and a shared L3 cache.
  Which approach is better: all processes, all threads, 12 threads x 2 processes, or other combinations?
  Where should threads be placed: node, socket, or core?
  Diagram: two 12-core sockets connected by Intel QPI, each core with 2 hyper-threads.

Affinity in Detail
  Six affinity patterns, illustrated with a 2x4 example: two threads per process, four processes per node, on two 4-core sockets (Socket 0: C0-C3, Socket 1: C4-C7).
  The patterns follow the thread/process affinity table above: CI, SI, NI (worker threads inherit the process affinity) and CE, SE, NE (worker threads explicitly pinned per core).
  2x4 CI and CE: processes P0-P3 are bound to cores.
  2x4 SI and SE: P0,P1 are bound to Socket 0 and P2,P3 to Socket 1.
  2x4 NI: P0-P3 are unbound; worker threads are free to "roam" over cores/sockets.
  2x4 NE: worker threads are pinned to a core on each socket.
  Diagram legend: worker threads, background threads (GC and other JVM threads), and processes. Explicit per-core pinning is sketched below.
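A hypothetical sketch of explicit per-core thread pinning (the CE/SE/NE style) using the OpenHFT Java-Thread-Affinity library (net.openhft.affinity). This is an assumption, not SPIDAL's binding code, and the exact AffinityLock API may differ between library versions; the cpuIds array and worker body are illustrative. Process-level affinity is typically set by the MPI launcher's binding options instead.

// Hypothetical sketch of pinning each worker thread to its own core.
import net.openhft.affinity.AffinityLock;

public class PinnedWorkersSketch {
    public static void main(String[] args) throws InterruptedException {
        int[] cpuIds = {0, 1, 2, 3};            // one core per worker thread (assumption)
        Thread[] workers = new Thread[cpuIds.length];
        for (int i = 0; i < cpuIds.length; i++) {
            final int cpu = cpuIds[i];
            workers[i] = new Thread(() -> {
                // Pin this thread to one core (assumed AffinityLock API).
                AffinityLock lock = AffinityLock.acquireLock(cpu);
                try {
                    compute();                   // worker stays on its core for the whole run
                } finally {
                    lock.release();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }

    static void compute() {
        // Placeholder for the pinned worker's computation.
    }
}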

A Quick Peek into Performance
  Chart: K-Means 10K performance on 16 nodes, comparing four configurations: FJ without thread pinning, FJ with threads pinned to cores, BSP without thread pinning, and BSP with threads pinned to cores.

Performance Sensitivity
  Charts (Java and C): K-Means with 1 million points and 1000 centers on 16 24-core nodes, for LRT-FJ and LRT-BSP with the six affinity patterns over varying numbers of threads and processes.
  C is less sensitive than Java; "all processes" is less sensitive than "all threads".

Performance Dependence on Number of Cores
  Measured inside a 24-core node (16 nodes total).
  Chart series: all MPI internode with all processes (LRT-BSP Java); all threads internal to node; hybrid using one process per chip; LRT fork-join Java with all threads; fork-join C.
  Annotated performance gaps on the chart: 15x, 74x, and 2.6x.

Java versus C Performance
  C and Java are comparable, with Java doing better on larger problem sizes.
  All data are from a one-million-point dataset with a varying number of centers, on 16 24-core Haswell nodes.

Communication Mechanisms
  Collective communications (allgather, allreduce, broadcast) are expensive and frequently used in parallel machine learning.
  Example: with an identical message size per node, 24 MPI ranks per node are ~10 times slower than 1 MPI rank per node, which suggests the number of ranks per node should be 1 for the best performance. How can this cost be reduced?
  Chart: 3 million double values distributed uniformly over 48 nodes. A minimal allreduce of this kind is sketched below.
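A minimal sketch of the allreduce collective discussed above, assuming the Open MPI Java bindings (package mpi). The in-place allReduce signature and the array sizing are assumptions; consult your MPI distribution's Java binding documentation for the exact API.

// Sketch of an allreduce over a large double array with assumed Java MPI bindings.
import mpi.MPI;

public class AllreduceSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        // Illustrative sizing: split ~3 million doubles across the ranks.
        double[] values = new double[3_000_000 / MPI.COMM_WORLD.getSize()];
        // ... fill values with this rank's partial results ...

        // Every rank contributes its array and receives the element-wise sum in place.
        MPI.COMM_WORLD.allReduce(values, values.length, MPI.DOUBLE, MPI.SUM);

        if (rank == 0) {
            System.out.println("allreduce complete: " + values.length + " doubles per rank");
        }
        MPI.Finalize();
    }
}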

Communication Mechanisms
  Shared memory (SM) for intra-node communication: a custom Java implementation in SPIDAL that uses OpenHFT's Bytes API.
  This reduces network communications to the number of nodes.
  Heterogeneity support: machines with different core/socket counts can run that many MPI processes.
  Diagram: the Java shared-memory architecture. A simplified memory-mapped sketch of the idea follows.
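An illustrative sketch of the intra-node shared-memory idea using only java.nio memory mapping; SPIDAL's actual implementation uses OpenHFT's Bytes API, so this is not that code. The file path, slot size, and rank/offset scheme are assumptions.

// Sketch: ranks on the same node exchange doubles through a memory-mapped file.
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedMemorySketch {
    public static void main(String[] args) throws Exception {
        int intraNodeRank = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        int slotBytes = 8 * 1024;                        // bytes reserved per rank (assumption)
        Path file = Path.of("/dev/shm/spidal-sm-demo");  // tmpfs-backed file on Linux (assumption)

        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 24L * slotBytes);

            // Each intra-node rank writes into its own slot; a designated rank can then
            // read all slots and be the only one to communicate over the network.
            buffer.putDouble(intraNodeRank * slotBytes, 42.0 + intraNodeRank);

            double peerValue = buffer.getDouble(0);      // read rank 0's slot
            System.out.println("rank " + intraNodeRank + " sees rank 0 value " + peerValue);
        }
    }
}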

Other Factors: Garbage Collection (GC)
  "Stop the world" events are expensive, especially for parallel machine learning.
  Typical OOP style is allocate, use, forget; the original SPIDAL code produced frequent garbage from small arrays.
  Garbage is unavoidable, but it can be reduced by static allocation and object reuse (see the sketch below).
  Advantages: less GC (obvious) and scaling to larger problem sizes. For example, the original SPIDAL code required a 5 GB heap per process (x 24 = 120 GB per node) to handle 200K DA-MDS; the optimized code uses < 1 GB of heap to finish in the same time.
  Charts: before optimization, heap size per process reaches -Xmx (2.5 GB) early in the computation, with frequent GC; after optimization, heap size per process stays well below -Xmx (~1.1 GB of 2.5 GB), with virtually no GC activity.
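A minimal sketch of the static-allocation/object-reuse idea: the scratch array is allocated once and reused across iterations instead of being created inside the loop. Sizes and method names are illustrative, not from SPIDAL.

// Sketch: reuse one preallocated scratch array instead of producing small-array garbage.
public class ReuseSketch {
    static final int POINTS = 10_000;
    static final int DIM = 100;

    // Allocated once ("static allocation"); reused for every iteration.
    static final double[] scratch = new double[DIM];

    public static void main(String[] args) {
        double[] data = new double[POINTS * DIM];   // flat 1D layout
        double total = 0;
        for (int iter = 0; iter < 1000; iter++) {
            // The garbage-heavy style would allocate "new double[DIM]" here every time.
            total += processOnce(data, scratch);
        }
        System.out.println(total);
    }

    static double processOnce(double[] data, double[] scratch) {
        double sum = 0;
        for (int p = 0; p < POINTS; p++) {
            for (int d = 0; d < DIM; d++) {
                scratch[d] = data[p * DIM + d] * 0.5;  // reuse the same small array
                sum += scratch[d];
            }
        }
        return sum;
    }
}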

Other Factors
  Serialization/deserialization: default implementations are verbose, especially in Java; Kryo is by far the best in compactness, and off-heap buffers are another option.
  Memory references and cache: nested structures are expensive, and even 1D arrays are preferred over 2D when possible; adopt HPC techniques such as loop ordering and blocked arrays (see the sketch below).
  Data read/write: stream I/O is expensive for large data, while memory mapping is much faster and JNI friendly in Java. Native calls require extra copies because objects move during GC; memory maps live in off-GC space, so no extra copying is necessary.
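A minimal sketch of the "1D over 2D" layout advice: a logically ROWS x COLS matrix stored as a single flat array and indexed row-major, keeping row traversals contiguous and cache-friendly. The sizes and the rowSum helper are illustrative.

// Sketch: flat 1D storage for a logically 2D matrix, indexed as row * COLS + col.
public class FlatArraySketch {
    static final int ROWS = 2_000;
    static final int COLS = 2_000;

    public static void main(String[] args) {
        // One contiguous allocation instead of ROWS separate row arrays (double[ROWS][COLS]).
        double[] matrix = new double[ROWS * COLS];
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                matrix[r * COLS + c] = r + c;
            }
        }
        System.out.println(rowSum(matrix, 5));
    }

    static double rowSum(double[] matrix, int row) {
        double sum = 0;
        int base = row * COLS;                 // contiguous, cache-friendly row access
        for (int c = 0; c < COLS; c++) {
            sum += matrix[base + c];
        }
        return sum;
    }
}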