Characterization of Parallel Scientific Simulations

Slides:

Advertisements

Similar presentations

Optimizing Expression Selection for Lookup Table Program Transformation Chris Wilcox, Michelle Mills Strout, James M. Bieman Computer Science Department.

Advertisements

Computer Science and Engineering Laboratory, Transport-triggered processors Jani Boutellier Computer Science and Engineering Laboratory This.

P3- Represent how data flows around a computer system

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

CS 345 Computer System Overview

Chapter 4 M. Keshtgary Spring 91 Type of Workloads.

Introduction CS 524 – High-Performance Computing.

1 ASU MAT 591: Opportunities in Industry Performance Modeling Bo Faser Lockheed Martin Management & Data Systems Intelligence, Surveillance, and Reconnaissance.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Sep 5, 2005 Lecture 2.

Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.

Locality-Aware Request Distribution in Cluster-based Network Servers 1. Introduction and Motivation --- Why have this idea? 2. Strategies --- How to implement?

Introduction What is Parallel Algorithms? Why Parallel Algorithms? Evolution and Convergence of Parallel Algorithms Fundamental Design Issues.

1 Improving Hash Join Performance through Prefetching _________________________________________________By SHIMIN CHEN Intel Research Pittsburgh ANASTASSIA.

Managing Agent Platforms with SNMP Brian Remick Research Proposal Defense June 27, 2015.

1 Software Testing and Quality Assurance Lecture 40 – Software Quality Assurance.

Performance Engineering and Debugging HPC Applications David Skinner

1 Parallel Simulations of Underground Flow in Porous and Fractured Media H. Mustapha 1,2, A. Beaudoin 1, J. Erhel 1 and J.R. De Dreuzy IRISA – INRIA.

1 1 Profiling & Optimization David Geldreich (DREAM)

Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing Taecheol Oh, Kiyeon Lee, and Sangyeun Cho Computer Science.

Defining Anomalous Behavior for Phase Change Memory

Waleed Alkohlani 1, Jeanine Cook 2, Nafiul Siddique 1 1 New Mexico Sate University 2 Sandia National Laboratories Insight into Application Performance.

Department of Computer Science Mining Performance Data from Sampled Event Traces Bret Olszewski IBM Corporation – Austin, TX Ricardo Portillo, Diana Villa,

University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.

Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.

1 Performance Evaluation of Computer Systems and Networks Introduction, Outlines, Class Policy Instructor: A. Ghasemi Many thanks to Dr. Behzad Akbari.

Of Rostock University DuDE: A D istributed Computing System u sing a D ecentralized P2P E nvironment The 4th International Workshop on Architectures, Services.

CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.

Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.

Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,

The Cosmic Cube Charles L. Seitz Presented By: Jason D. Robey 2 APR 03.

CS 127 Introduction to Computer Science. What is a computer?  “A machine that stores and manipulates information under the control of a changeable program”

Data Management for Decision Support Session-4 Prof. Bharat Bhasker.

E X C E E D I N G E X P E C T A T I O N S VLIW-RISC CSIS Parallel Architectures and Algorithms Dr. Hoganson Kennesaw State University Instruction.

CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.

A Measurement Based Memory Performance Evaluation of Streaming Media Servers Garba Isa Yau and Abdul Waheed Department of Computer Engineering King Fahd.

Outline Why this subject? What is High Performance Computing?

Memory Coherence in Shared Virtual Memory System ACM Transactions on Computer Science(TOCS), 1989 KAI LI Princeton University PAUL HUDAK Yale University.

3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.

Ensieea Rizwani An energy-efficient management mechanism for large-scale server clusters By: Zhenghua Xue, Dong, Ma, Fan, Mei 1.

Sunpyo Hong, Hyesoon Kim

ICC Module 3 Lesson 1 – Computer Architecture 1 / 11 © 2015 Ph. Janson Information, Computing & Communication Module 3 : Systems.

Software Engineering Prof. Dr. Bertrand Meyer March 2007 – June 2007 Chair of Software Engineering Lecture #20: Profiling NetBeans Profiler 6.0.

1 Adapted from UC Berkeley CS252 S01 Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time Hardware prefetching and stream buffer, software.

1 How will execution time grow with SIZE? int array[SIZE]; int sum = 0; for (int i = 0 ; i < ; ++ i) { for (int j = 0 ; j < SIZE ; ++ j) { sum +=

Cluster computing. 1.What is cluster computing? 2.Need of cluster computing. 3.Architecture 4.Applications of cluster computing 5.Advantages of cluster.

GPGPU Performance and Power Estimation Using Machine Learning Gene Wu – UT Austin Joseph Greathouse – AMD Research Alexander Lyashevsky – AMD Research.

LIOProf: Exposing Lustre File System Behavior for I/O Middleware

Configuring SQL Server for a successful SharePoint Server Deployment Haaron Gonzalez Solution Architect & Consultant Microsoft MVP SharePoint Server

Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,

Taeho Kgil, Trevor Mudge Advanced Computer Architecture Laboratory The University of Michigan Ann Arbor, USA CASES’06.

1© Copyright 2015 EMC Corporation. All rights reserved. NUMA(YEY) BY JACOB KUGLER.

Spark on Entropy : A Reliable & Efficient Scheduler for Low-latency Parallel Jobs in Heterogeneous Cloud Huankai Chen PhD Student at University of Kent.

Defining the Competencies for Leadership- Class Computing Education and Training Steven I. Gordon and Judith D. Gardiner August 3, 2010.

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

Overview of Earth Simulator.

Decoupled Access-Execute Pioneering Compilation for Energy Efficiency

Multi-Processing in High Performance Computer Architecture:

Scaling for the Future Katherine Yelick U.C. Berkeley, EECS

Types of Concurrent Events

CPSC 531: System Modeling and Simulation

Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

CSC Classes Required for TCC CS Degree

Computer Evolution and Performance

Types of Concurrent Events

Lecture 24: Virtual Memory, Multiprocessors

Lecture 23: Virtual Memory, Multiprocessors

Database System Architectures

What Are Performance Counters?

CSC Multiprocessor Programming, Spring, 2011

Presentation transcript:

Characterization of Parallel Scientific Simulations January 2017 Borja Miñano Maldonado Masther Project Thesis in Computer Engineering High Performance Computing

Presentation outline Introduction Objectives Scientific simulations Metric and tool overview Result from Characterization Modelization Conclusions

Introduction High Performance Computing (HPC) for Performance through Engineering Scientific research Business analytics Weather forecast Performance through Hardware improvement Programming skills

Objectives Analysis and characterization of parallel scientific simulations Bottleneck detection Improvement proposals

Scientific simulations Single Program, Multiple Data (SPMD) Different resource demands CPU Memory I/O requests 3 phases Initial setup Initialization – Memory allocation Evolution – High computing and I/O demand

Memory-bound - Advection Advection equation Fluid transport Simulation objective Numerical discretization test Simple algorithm + Huge size

CPU-bound - Euler Euler equations Simulation objective Movement of incompressible flow Simulation objective Numerical discretization test Complex algorithm + Regular size

I/O-bound – Cash and Goods Cash and goods Agent Based Model Complex system on trading Simulation objective Find unexpected behaviour Extremely simple algorithm + High output frequency

Summary and temporal statistics Metric overview CPU metrics Usage detail Arithmetic operations Function calls I/O metrics Read Write Memory metrics Read Write Page & cache faults Network metrics Sending Receiving Summary and temporal statistics

Profiler tool selection Sar (temporal prof.) Perf (summary prof.) CPU usage FLOPs Paging and cache Cache rates I/O usage Memory access Network usage Gprof (summary prof.) Valgrind Massif (temporal prof.) Function calls Elapsed time Memory consuption

Performance characterization Advection

Performance modeling - Advection

Conclusions - Advection Memory-bound -> I/O-bound Output size was too demanding Bottleneck: unique hard disk drive Suggestions: Higher rate HDD (hardware) Distributed architecture, multiple HDDs Less frequent or lighter output

Performance characterization Euler Event Count Events/ins Events/s cache-references 3.522.270.172 0,36 % 5,38 M cache-misses 1.237.187.524 0,13 % 1,89 M mem-stores 110.211.636.380 11,22 % 168,34 M L1-dcache-loads 308.795.201.413 31,43 % 471,65 M L1-dcache-load-misses 25.636.359.337 2,61 % 39,16 M FLOPS 409.563.438.375 41,69 % 625,56 M instructions 982.516.219.892 1.500,6 M

Performance modeling - Euler

Conclusions - Euler CPU-bound Good scalability in physical processors Ideal simulation Bottleneck: CPU Suggestions: Quicker CPUs More CPUs (not always recommended)

Performance characterization Cash and goods

Performance modeling Cash and goods - Memory

Performance modeling Cash and goods – I/O

Conclusions – Cash and goods I/O-bound -> Memory-bound Unbalanced behaviour for one process Bottleneck: Memory access Suggestions: Quicker caches/memory access (hardware) Distributed memory architecture Balance output algorithm

Project conclusions – Metric researh Metrics to cover all resources Multiple tools: Summary metrics Temporal metrics Consolidate conclusions Metric granularity for parallel processes

Project conclusions – Simulations Unexpected bottlenecks Modelization -> Scalability -> Speedup

Thank you for listening Questions? / Comments?