Slide 1: Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1,4), James C. Browne (1)
(1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA

Slide 2: Trends in Supercomputers

Slide 3: Is Multicore an Issue?

Slide 4: The Problem: Multicore Scalability


Slide 6: Optimizations Differ in Multicore
- Base code vs. multicore-optimized code

Slide 7: Paper Contributions
- Studies multicore-related bottlenecks
- Identifies performance measurement challenges unique to multicore systems
- Presents a systematic approach to multicore performance analysis
- Demonstrates principles of optimization

Slide 8: Talk Outline
- Introduction
- Approach: An HPC Case Study
- Multicore Measurement Issues
- Optimization Example
- Conclusion

Slide 9: Approach: An HPC Case Study
- Examine a real HPC application
  - Major functions add variety
- What is a typical HPC application?
  - Many exhibit low arithmetic intensity (a worked definition follows this slide)
  - Typical of explicit/iterative solvers, stencils, finite volume/element/difference methods, molecular dynamics, particle simulations, graph search, sparse MM, etc.
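As a worked aside (this is the standard roofline-style definition, not taken from the deck), "arithmetic intensity" here means floating-point operations per byte of memory traffic:

    % Standard definition (not from the slides):
    \[ \mathrm{AI} \;=\; \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}} \]
    % Example: a triad-style update a(k) = a(k) + alpha * b(k) performs
    % 2 flops while moving 24 bytes of doubles (read a, read b, write a):
    \[ \mathrm{AI} \;=\; \tfrac{2}{24} \;\approx\; 0.083\ \text{flops/byte} \]

At roughly a twelfth of a flop per byte, such a kernel is limited by memory bandwidth long before it saturates the floating-point units.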

Slide 10: Application: HOMME
- High Order Method Modeling Environment
  - 3-D atmospheric simulation from NCAR
  - Required for NSF acceptance testing
  - Excellent scaling, highly optimized
  - Arithmetic intensity typical of stencil codes
- Supercomputers:
  - Ranger: 62,976 cores, 579 teraflops; 2.3 GHz quad-core AMD Barcelona chips
  - Longhorn: 2,048 cores plus GPUs; 2.5 GHz quad-core Intel Nehalem-EP chips

Slide 11: Talk Outline
- Introduction
- Approach: An HPC Case Study
- Multicore Measurement Issues
- Optimization Example
- Conclusion

Slide 12: Multicore Performance Bottlenecks
[Diagram: a single chip attached to a single DIMM. Each core has private L1/L2 caches; all cores contend for the shared L3 cache, shared off-chip bandwidth, shared DRAM page caches, and node-local DRAM.]

Slide 13: Disturbances Persist Longer

Slide 14: Measurement Implications

Slide 15: Measurements Must Be Lightweight

Cost of common measurement actions:

    Action                      Cycles
    Read counter                     9
    Read four counters              30
    Call function                   40
    PAPI read                      400
    System call                  5,000
    TLB page initialization     25,000

Duration of major HOMME functions:

    Function duration         Calls per second    % exec time
    2,000 cycles or less               100,000            20%
    2,000 to 10,000 cycles              20,000            10%
    10K to 200K cycles                   1,600            15%
    200K to 1M cycles                      200            15%
    1M to 10M cycles                         -             0%
    10M or more cycles                       4            35%
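The deck does not show its instrumentation code. As a hedged illustration of the cheap end of the first table, here is a minimal C sketch, assuming an x86 machine and GCC/Clang, that brackets a region with the timestamp counter (__rdtsc is a real compiler intrinsic; the surrounding names are illustrative):

    /* Minimal low-overhead timing probe: a raw timestamp-counter read
     * costs on the order of the ~9-cycle "Read counter" row above,
     * versus ~5,000 cycles for a system-call-based clock. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc() */

    static inline uint64_t read_cycles(void)
    {
        return __rdtsc();               /* single instruction, no syscall */
    }

    int main(void)
    {
        uint64_t start = read_cycles();
        /* ... short region of interest, e.g. one small HOMME function ... */
        uint64_t elapsed = read_cycles() - start;
        printf("region took %llu cycles\n", (unsigned long long)elapsed);
        return 0;
    }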

Slide 16: Multicore Measurement Issues
- Performance issues in a shared-memory system are:
  - Context sensitive
  - Nondeterministic
  - Highly non-local
- Measurement disturbance is significant:
  - Accessing memory or delaying a core
  - Hard to "bracket" measurement effects
  - Disturbances can last billions of cycles
  - Bottlenecks can be "bursty"
- Conclusion: multiple tools are needed

Slide 17: Talk Outline
- Introduction
- Approach: An HPC Case Study
- Multicore Measurement Issues
- Optimization Example
- Conclusion

Slide 18: Multicore Performance Bottlenecks
[Diagram repeated: single chip, single DIMM; the shared resources (L3 cache, off-chip bandwidth, DRAM page caches, node-local DRAM) are shown, without the private L1/L2 caches.]

Slide 19: Measurement Approach
- Find the important functions
- Compare performance counters at minimum and maximum core density
- Identify the key multicore bottleneck (a sketch of these rules follows this slide):
  - L3 capacity: L3 miss rates increase with density
  - Off-chip BW: bandwidth usage at minimum density exceeds the core's fair share
  - DRAM contention: DRAM page miss rates increase with density
- For small and medium functions, follow up with lightweight/temporal measurements
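As a hedged sketch (the paper's tooling is not shown in this transcript), the slide's three decision rules can be expressed directly in C; the struct fields and the fair-share parameter below are illustrative assumptions:

    #include <stdio.h>

    /* Counter readings for one function, taken at a given core density. */
    typedef struct {
        double l3_miss_rate;        /* L3 misses per L3 access          */
        double offchip_bw_gbs;      /* off-chip bandwidth used, GB/s    */
        double dram_page_miss_rate; /* DRAM page misses per DRAM access */
    } CounterSample;

    /* Compare min-density (1 core/chip) vs. max-density (all cores/chip)
     * readings and report which shared resource looks like the bottleneck. */
    void classify(CounterSample min_d, CounterSample max_d,
                  double fair_bw_share_gbs)
    {
        if (max_d.l3_miss_rate > min_d.l3_miss_rate)
            printf("L3 capacity: miss rate rises with density\n");
        if (min_d.offchip_bw_gbs > fair_bw_share_gbs)
            printf("Off-chip BW: one core already exceeds its share\n");
        if (max_d.dram_page_miss_rate > min_d.dram_page_miss_rate)
            printf("DRAM contention: page miss rate rises with density\n");
    }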

Slide 20: Typical HOMME Loop
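The transcript does not preserve the Fortran listing on this slide. As a stand-in, here is a hedged C analogue of the pattern the next two slides transform: one fused loop that streams through several independent arrays, so every array is live in every iteration:

    /* Hypothetical stand-in for the (unpreserved) HOMME loop:
     * a single fused loop whose body touches six arrays at once. */
    void fused_loop(int n, double alpha,
                    double *a, const double *b,
                    double *c, const double *d,
                    double *e, const double *f)
    {
        for (int k = 0; k < n; k++) {
            a[k] += alpha * b[k];   /* all six arrays stay live together,  */
            c[k] += alpha * d[k];   /* so their DRAM pages compete and the */
            e[k] += alpha * f[k];   /* working set is the sum of all six   */
        }
    }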

Slide 21: Apply "Microfission" (First Line)

Slide 22: "Loop Microfission"
- A local, context-free optimization (a before/after sketch follows below)
- Each array is processed independently
  - Add high-level blocking to fit in cache
- Reduces the number of DRAM banks in use at once
  - Statistically reduces the DRAM page miss rate
- Reduces the instantaneous working set size
  - Helps with L3 capacity and off-chip BW
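A hedged counterpart to the stand-in shown after slide 20 (the block size and loop structure are illustrative, not the paper's actual code). Each fission fragment now streams one array pair at a time, and the outer blocking loop bounds the instantaneous working set:

    enum { BLOCK = 4096 };  /* illustrative tile size, chosen to fit in cache */

    void microfissioned_loop(int n, double alpha,
                             double *a, const double *b,
                             double *c, const double *d,
                             double *e, const double *f)
    {
        for (int base = 0; base < n; base += BLOCK) {
            int end = (base + BLOCK < n) ? (base + BLOCK) : n;
            /* Each fragment touches only two arrays, so far fewer DRAM
             * banks/pages are active at once, which statistically lowers
             * the DRAM page miss rate under sharing. */
            for (int k = base; k < end; k++) a[k] += alpha * b[k];
            for (int k = base; k < end; k++) c[k] += alpha * d[k];
            for (int k = base; k < end; k++) e[k] += alpha * f[k];
        }
    }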

Slide 23: Microfission Results

Slide 24: Talk Outline
- Introduction
- Approach: An HPC Case Study
- Multicore Measurement Issues
- Optimization Example
- Conclusion

Slide 25: Summary and Conclusions
- HPC scalability analysis must include multicore effects
  - These effects are not yet well understood
  - They require new analysis and measurement techniques
  - The resulting optimizations differ from single-core optimizations
- Microfission is just one example
  - A multicore locality optimization for shared caches
  - Improves performance by 35%

Slide 26: Future Work
- Expect these multicore observations to apply to other HPC applications with low arithmetic intensity
  - Irregular parallel applications: adaptive meshes, heterogeneous workloads
  - Irregular blocking applications: graph traversal
- Explore a wider range of multicore (memory-focused) optimizations:
  - Recomputation
  - Relocating data
  - Temporary storage reduction
  - Structural changes

Slide 27: Thank You. Any Questions?

Slide 28: Backup Slides

Slide 29: Less DRAM Contention

Slide 30: Multicore Optimized, Low Density

Slide 31: Most Important Functions

Slide 32: L1 & L2 Miss Rates Less Relevant


Slide 34: HPC Applications Have Low Intensity

Slide 35: Loads per Cycle vs. Intra-chip Scaling



Slide 38: Oscillations Affect L2 Miss Rate

Slide 39: Oscillations Affect L2 Miss Rate (continued)