Workshop on Operating System Interference in High Performance Applications 2005: Performance Degradation in the Presence of Subnormal Floating-Point Values


Performance Degradation in the Presence of Subnormal Floating-Point Values. Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, Laxmikant Kale. Department of Computer Science, University of Illinois at Urbana-Champaign.

Overview: Subnormal values make computations slow. Both serial and parallel programs are affected, and parallel programs may exhibit amplified slowdowns due to load imbalance between processors. How do I detect and fix the problem in serial code? How do I fix the problem in parallel?

Denormalized or Subnormal Floating-Point Values: The IEEE 754 standard specifies a class of values with magnitudes smaller than the smallest normal value. Subnormals use the smallest possible exponent and a mantissa with a leading zero. A loss of precision may occur when subnormals are used in operations, so programs should be notified when these operations occur. The worst-case implementation involves expensive software traps to the OS on each operation. It has been claimed that occurrences of subnormals are rare and thus of no concern.
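
As a concrete illustration (not from the slides), the following C snippet produces a subnormal double from DBL_MIN and classifies it with the standard C99 fpclassify macro; on a processor running in flush-to-zero mode the division would instead yield 0.

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void) {
        double smallest_normal = DBL_MIN;      /* smallest positive normal double */
        double tiny = smallest_normal / 2.0;   /* below the normal range: subnormal */

        printf("DBL_MIN     = %g, subnormal? %s\n", smallest_normal,
               fpclassify(smallest_normal) == FP_SUBNORMAL ? "yes" : "no");
        printf("DBL_MIN / 2 = %g, subnormal? %s\n", tiny,
               fpclassify(tiny) == FP_SUBNORMAL ? "yes" : "no");
        return 0;
    }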

Subnormals are Bad: “Underflows occur infrequently, but when they do occur, they often come in convoys. Whatever causes one underflow will usually cause a lot more. So occasionally a program will encounter a large batch of underflows, which makes it slow.” (W. Kahan, IEEE 754R minutes, September 19, 2002)

Serial Example*
*Code available at

    // initialize array
    for (i = 0; i < SIZE; i++) a[i] = 0.0;
    a[0] = 1.0;

    // perform iterative averaging
    for (j = 0; j < ITER; j++)
        for (i = 2; i < SIZE-1; i++)
            a[i] = (a[i] + a[i-1] + a[i-2]) * (1.0/3.0);

Jacobi (stencil), relaxation, and Gauss-Seidel type methods can give rise to many subnormals.

Serial Example* (same kernel as the previous slide, but the array is initialized to 10E-50 instead of 0.0)
*Code available at

    // initialize array
    for (i = 0; i < SIZE; i++) a[i] = 10E-50;
    a[0] = 1.0;

    // perform iterative averaging
    for (j = 0; j < ITER; j++)
        for (i = 2; i < SIZE-1; i++)
            a[i] = (a[i] + a[i-1] + a[i-2]) * (1.0/3.0);

Jacobi (stencil), relaxation, and Gauss-Seidel type methods can give rise to many subnormals.
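
For readers who want to reproduce the serial measurement, here is a complete sketch that wraps the slide's kernel in a simple timing driver. SIZE and ITER are placeholder values chosen here (the slides do not show them), and the size of any timing difference between the two fills depends on how the processor handles subnormal values.

    #include <stdio.h>
    #include <time.h>

    /* Placeholder sizes; the original slide did not show SIZE or ITER. */
    #define SIZE 10000
    #define ITER 1000

    static double a[SIZE];

    /* Fill the array with `init`, set a[0] = 1.0, run the averaging kernel
       from the slide, and return the elapsed wall-clock time in seconds. */
    static double run_kernel(double init) {
        int i, j;
        struct timespec t0, t1;

        for (i = 0; i < SIZE; i++) a[i] = init;
        a[0] = 1.0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (j = 0; j < ITER; j++)
            for (i = 2; i < SIZE-1; i++)
                a[i] = (a[i] + a[i-1] + a[i-2]) * (1.0/3.0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        /* Time both initial fills shown on the slides; any difference depends
           on how the processor handles subnormal values. */
        printf("init = 0.0    : %.3f s\n", run_kernel(0.0));
        printf("init = 10E-50 : %.3f s\n", run_kernel(10E-50));
        return 0;
    }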

Serial Test: Worst Case Slowdown (series of chart slides; the timing results are shown only graphically)

Parallel Program Where This Phenomenon Was Observed: Simulating a 1D wave propagating through a finite 3D bar in parallel. (Figures: the wavefront, left, and the mesh partitioning, right.)

Parallel Program Timeline Overview: 32 Alpha processors (PSC Lemieux cluster) with linear partitioning. (Timeline chart: time on the horizontal axis, one row per processor, showing busy and idle periods.)

Parallel Program: 32 Alpha processors with linear partitioning and processor mapping. (Chart: average CPU utilization.)

Parallel Program: 32 Alpha processors with the normal METIS partitioner. (Chart: average CPU utilization.)

Flushing Subnormals to Zero: Some processors have a mode that flushes any subnormal to zero, so no subnormal is ever produced by a floating-point operation. This mode is not available on all processors. It can be enabled by compiler flags or by explicit code, as sketched below.
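
As a sketch of the "explicit code" option, the snippet below sets the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits of the SSE control register on x86 processors, using the standard intrinsics headers. This is one common mechanism, not the only one; other architectures and compilers (for example, the DEC "-fpe" options mentioned on the following slides) expose their own controls.

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    /* Put the x86 SSE unit in flush-to-zero / denormals-are-zero mode.
       Results that would be subnormal are replaced by zero, and subnormal
       inputs are treated as zero, trading strict IEEE behavior for speed. */
    static void enable_flush_to_zero(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }

Because the control register is per-thread state, the call should be made once in each thread that performs floating-point work.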

Parallel Program: 32 Alpha processors with the normal METIS partitioner, flushing subnormals to zero. (Chart: average CPU utilization.)

Summary of Execution Times on the Alpha Cluster

    Processor mode                                                       | Execution time
    METIS partitioning with flush to zero (subnormals disabled by default) | 50.3 s
    METIS partitioning (subnormals enabled, "-fpe1")                     | 392.4 s
    Linear partitioning (subnormals enabled, "-fpe1")                    | 522.6 s

Parallel Program: 32 Xeon processors (NCSA Tungsten cluster). (Chart: average CPU utilization.)

Parallel Program: 32 Xeon processors, flushing subnormals to zero. (Chart: average CPU utilization.)

The Anatomy of the Parallel Problem: Each processor can proceed only one step ahead of its neighbors, so if one processor is slow it interferes with the progress of the rest; essentially a load imbalance arises between the processors. If the program uses any global synchronization, the slowdown is significantly worse than in the asynchronous case. Systems built with message-driven execution, virtualization / migratable objects, or load balancing may fix this parallel interference. A sketch of the neighbor dependency follows.
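
To show why a single slow processor stalls its neighbors, here is a minimal, generic MPI halo-exchange loop (an illustrative sketch, not the application's actual code): each rank must receive its neighbors' boundary values from the previous step before it can compute the next one, so a rank slowed down by subnormals delays every rank adjacent to it, and with a global barrier it delays all of them.

    #include <mpi.h>

    #define N     1024    /* local mesh size (illustrative) */
    #define STEPS 100

    void timestep_loop(double u[N+2], int rank, int size) {
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int step = 0; step < STEPS; step++) {
            /* Exchange ghost cells: this blocks until both neighbors have
               finished their previous step, so each rank can run at most
               one step ahead of its neighbors. */
            MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  0,
                         &u[N+1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N],   1, MPI_DOUBLE, right, 1,
                         &u[0],   1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Local update: if u holds many subnormals on this rank, this
               loop runs far slower and the delay propagates to neighbors. */
            for (int i = 1; i <= N; i++)
                u[i] = (u[i-1] + u[i] + u[i+1]) * (1.0/3.0);
        }
    }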

ParFUM: A Parallel Framework for Unstructured Mesh Applications. Makes parallelizing a serial code faster and easier:
    - Handles mesh partitioning and communication
    - Load balancing (via Charm++)
    - Parallel incremental mesh modification: adaptivity and refinement
    - Visualization support
    - Collision detection library
    - High processor utilization, even for dynamic physics codes

Instrument-Based Load Balancing on the Xeon Cluster. (Chart: average CPU utilization.)

Parallel Program on the Xeon Cluster

    Processor mode      | Execution time (speedup)
    Default             | 56.8 s
    Flush-to-zero mode  | 33.1 s (42%)
    Load balanced       | 45.6 s (20%)

How Do I Detect Subnormals? Some compilers offer an option that generates code to count all occurrences of subnormals, e.g. DEC's "-fpe1" option. Many compilers have options to generate IEEE-compliant code; try such a flag and compare the results against a non-IEEE-compliant build. Explicitly scan through your data and count the number of subnormals (a sketch follows); this method may yield false positives. It may also be possible to monitor OS traps to determine whether floating-point exception handling code is executing.
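
As one way to implement the explicit-scan approach, here is a small sketch (not from the slides) that counts subnormal entries in an array using the standard C99 fpclassify macro; the array name and size are placeholders.

    #include <math.h>
    #include <stddef.h>

    /* Count how many entries of a double array are subnormal.
       fpclassify is a C99 macro from <math.h>; FP_SUBNORMAL identifies
       denormalized values. */
    size_t count_subnormals(const double *a, size_t n) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (fpclassify(a[i]) == FP_SUBNORMAL)
                count++;
        return count;
    }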

How Do I Eliminate These Slowdowns? Use a computer system where operations are less impacted by subnormals. Scale or translate your data to a different range if possible. Use a compiler option or additional code to put the processor in a mode where subnormals are flushed to zero (FTZ). If your numerical code cannot tolerate FTZ, then in parallel use some type of load balancing to distribute the subnormals across all processors.

Conclusion: Operations on subnormals are increasingly slow, relative to normal operations, on modern processors. Parallel applications can suffer greatly amplified slowdowns because one slow processor interferes with the progress of the rest. These effects may no longer be as negligible as previously thought, and we believe there are probably many other applications with similar slowdowns.

Thank You. Parallel Programming Lab at the University of Illinois. Additional resources and results: