GrADS Program Preparation System: Issues and Ideas Keith Cooper, Ken Kennedy, John Mellor-Crummey, Linda Torczon Center for High-Performance Software Rice.

Slides:



Advertisements
Similar presentations
Technology Drivers Traditional HPC application drivers – OS noise, resource monitoring and management, memory footprint – Complexity of resources to be.
Advertisements

1 Optimizing compilers Managing Cache Bercovici Sivan.
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
Program Slicing Mark Weiser and Precise Dynamic Slicing Algorithms Xiangyu Zhang, Rajiv Gupta & Youtao Zhang Presented by Harini Ramaprasad.
CSIE30300 Computer Architecture Unit 10: Virtual Memory Hsin-Chou Chi [Adapted from material by and
Virtual Memory Hardware Support
Performance Visualizations using XML Representations Presented by Kristof Beyls Yijun Yu Erik H. D’Hollander.
Code Shape III Booleans, Relationals, & Control flow Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled.
S.1 Review: The Memory Hierarchy Increasing distance from the processor in access time L1$ L2$ Main Memory Secondary Memory Processor (Relative) size of.
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
The Procedure Abstraction Part I: Basics Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412.
Informationsteknologi Friday, November 16, 2007Computer Architecture I - Class 121 Today’s class Operating System Machine Level.
1 Dr. Frederica Darema Senior Science and Technology Advisor NSF Future Parallel Computing Systems – what to remember from the past RAMP Workshop FCRC.
Register Allocation (via graph coloring)
Register Allocation (via graph coloring). Lecture Outline Memory Hierarchy Management Register Allocation –Register interference graph –Graph coloring.
Distributed Systems Management What is management? Strategic factors (planning, control) Tactical factors (how to do support the strategy practically).
Instrumentation and Measurement CSci 599 Class Presentation Shreyans Mehta.
1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.
Architectural Design.
The Procedure Abstraction Part I: Basics Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412.
CMSC 345 Fall 2000 Unit Testing. The testing process.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Computer Science Open Research Questions Adversary models –Define/Formalize adversary models Need to incorporate characteristics of new technologies and.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
CSE431 L22 TLBs.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 22. Virtual Memory Hardware Support Mary Jane Irwin (
MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes.
Performance Model & Tools Summary Hung-Hsun Su UPC Group, HCS lab 2/5/2004.
Is Out-Of-Order Out Of Date ? IA-64’s parallel architecture will improve processor performance William S. Worley Jr., HP Labs Jerry Huck, IA-64 Architecture.
John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit.
The Procedure Abstraction, Part V: Support for OOLs Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
Development Timelines Ken Kennedy Andrew Chien Keith Cooper Ian Foster John Mellor-Curmmey Dan Reed.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
Using Prediction to Accelerate Coherence Protocols Authors : Shubendu S. Mukherjee and Mark D. Hill Proceedings. The 25th Annual International Symposium.
Super computers Parallel Processing By Lecturer: Aisha Dawood.
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
6. A PPLICATION MAPPING 6.3 HW/SW partitioning 6.4 Mapping to heterogeneous multi-processors 1 6. Application mapping (part 2)
Replicating Memory Behavior for Performance Skeletons Aditya Toomula PC-Doctor Inc. Reno, NV Jaspal Subhlok University of Houston Houston, TX By.
System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.
Memory Hierarchy Adaptivity An Architectural Perspective Alex Veidenbaum AMRM Project sponsored by DARPA/ITO.
Enabling Self-management of Component-based High-performance Scientific Applications Hua (Maria) Liu and Manish Parashar The Applied Software Systems Laboratory.
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.
Testing OO software. State Based Testing State machine: implementation-independent specification (model) of the dynamic behaviour of the system State:
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
CSCI1600: Embedded and Real Time Software Lecture 33: Worst Case Execution Time Steven Reiss, Fall 2015.
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
CS.305 Computer Architecture Memory: Virtual Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.
Boolean & Relational Values Control-flow Constructs Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.
Computer Organization Instruction Set Architecture (ISA) Instruction Set Architecture (ISA), or simply Architecture, of a computer is the.
Profile-Guided Code Positioning See paper of the same name by Karl Pettis & Robert C. Hansen in PLDI 90, SIGPLAN Notices 25(6), pages 16–27 Copyright 2011,
1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections )
Memory-Aware Compilation Philip Sweany 10/20/2011.
Profile Guided Code Positioning C OMP 512 Rice University Houston, Texas Fall 2003 Copyright 2003, Keith D. Cooper & Linda Torczon, all rights reserved.
Software Metrics 1.
Session 3 Memory Management
Multiscalar Processors
5.2 Eleven Advanced Optimizations of Cache Performance
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Many-core Software Development Platforms
Introduction to cosynthesis Rabi Mahapatra CSCE617
The University of Texas at Austin
CSCI1600: Embedded and Real Time Software
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
The Procedure Abstraction Part I: Basics
Code Shape III Booleans, Relationals, & Control flow
COMP755 Advanced Operating Systems
Introduction to Computer Systems Engineering
Procedure Linkages Standard procedure linkage Procedure has
CSCI1600: Embedded and Real Time Software
Presentation transcript:

GrADS Program Preparation System: Issues and Ideas Keith Cooper, Ken Kennedy, John Mellor-Crummey, Linda Torczon Center for High-Performance Software Rice University GrADS Site Visit, April 2001

2 Challenges Estimating resource requirements for program execution —express these requirements in a form usable by all the tools Resource selection —selecting a set of available resources that meet requirements Mapping data and computation to resources selected —coping with heterogenity ( ISA, system architecture, clock rate) —tailoring program to both available resources and problem instance Performance contracts —developing contracts that reflect expected behavior —detecting contract violations —diagnosing contract violations —responding to contract violations

GrADS Site Visit, April Challenges Estimating resource requirements for program execution —express these requirements in a suitable form Resource selection —selecting a set of available resources that meet requirements Mapping data and computation to resources selected —coping with heterogenity (ISA, system architecture, clock rate) —tailoring program for available resources Performance contracts —developing contracts that reflect expected behavior —detecting contract violations —diagnosing contract violations —responding to contract violations

GrADS Site Visit, April Estimating Resource Requirements What requirements? Data volume —RAM and disk Computation volume and characteristics Resource topology —characteristics –required vs. desired –relative importance of desired capabilities –directional requirements for communication bandwidth constraints latency constraints

GrADS Site Visit, April Estimating Resource Requirements Two-pronged approach Automated —help construct resource requirements from sample executions User-directed —override any misconceived notions that might be derived automatically Rationale: full automation is impossible for general programs —data-driven execution characteristics: e.g. adaptive mesh

GrADS Site Visit, April Characterizing Application Performance Use hardware performance counters to sample events —FLOPS, memory accesses, cache misses, integer operations,... Map events to loops: CFG from program binary + symbol table Measure multiple runs with different inputs Develop loop-nest performance model as function of inputs —polynomial function of measured characteristics –instruction balance –memory hierarchy: data reuse distance [stack histograms] –number of trips Compose program-level model from loop-nest models

GrADS Site Visit, April Characterizing Application Performance Synthesizing models Recover CFG from binary Interpret loop nesting structure –blue: outermost level –red: loop level 1 –green loop level 2 Aggregate performance statistics for loop nests

GrADS Site Visit, April Modeling Memory Hierarchy on a Node Reuse Distance L2 Size L2 Hits L1 Size L1 Hits *

GrADS Site Visit, April Modeling Memory Hierarchy on a Node Problem Size m Problem Size n *

GrADS Site Visit, April Predicting Application Performance Combine information to make predictions Parameterized program-level models (based on measurements) Program input parameters Hardware characteristics Milestones (3a-b) Developing model of program performance useful for resource selection establishing performance contracts This work builds on several ongoing projects, with funding from LACSI & Texas ATP

GrADS Site Visit, April Predicting Parallel Performance Parallel performance determined by critical path Critical path through a directed acyclic graph Actual critical path won’t be known until run-time —depends upon communication delays & node performance Possible Approaches Automatic synthesis of DAG models of parallel performance with program slicing —slice out computation —execute sliced program with input parameters to elaborate DAG The “Dongarra oracle”: omnipotent library writer Modeling assistant —power steering to combine automatic & oracular models

GrADS Site Visit, April Mapping Data and Computation Two Alternatives User-provided mapping Initial guess + dynamic adaptation based on observed behavior

GrADS Site Visit, April Dynamic Optimizer Two roles Creating final executable program —mapping COP to run-time resources —inserting probes, actuators, & trip wires —final optimization Responding to contract violations —recognizing local violations & raising an alarm —using accumulated knowledge to improve performance –limited set of things that can be done –high cost of relocation opens a window for re-optimization

GrADS Site Visit, April Dynamic Optimizer Correct data mis-alignments Monitored behavior points to bad alignments Relocate data to avoid cache conflicts Move infrequently executed code out of the way Relocate unused procedures to edge of address space Relocate “fluff” code inside procedure to edge of address space —increase spatial reuse in the I-cache by exiling unused code —keep hot paths in contiguous memory, using fall-through branches Respond to increases in code page fault rate by changing program layout

GrADS Site Visit, April Performance Contracts Role of the program preparation system Develop performance models for applications —combination of automatic techniques & user input Insert the sensors, probes, & actuators to monitor performance Give the performance monitor better information —some violations might not matter –off-critical path computation running too slowly —some violations might be unavoidable –node running flat out and still falling behind model –model may have mispredicted the code’s speed on that node

GrADS Site Visit, April Performance Contracts Many open questions about performance contracts To what do we compare actual performance? —user specifications? —dynamically-augmented models? How do we recognize under-performance? —local criteria? —global criteria? —measure differential progress –amount of idle time at communication points –cycles - FLOPS –measured time lost versus predicted progress –measured mismatch between resource availability and needs

GrADS Site Visit, April Performance Contracts One strategy: Time above expectations Lost Time Expected Time Contract violation occurs when ratio of measured time to expected time exceeds a threshhold Interval *

GrADS Site Visit, April Ongoing Work Refinement of GrADSoft architecture —interface between library writers, program preparation system, and execution environment ( Abstract Application Resource Topology ) —performance contracts and interface to execution environment Performance analysis and modeling —binary analysis for reconstructing CFG and loop nesting structure —CFG reconstruction when delay slots contain branches —developing architecture independent performance models Evaluating compilers for GrADS compilation —rejected GCC for code quality —rejected LCC for performance of compiled code —considering SGI Pro64 compiler with generic back end t Milestone (4b) Milestones (3a-b)

GrADS Site Visit, April Extra slides begin here

GrADS Site Visit, April Representing Program Requirements GrADSoft Application Manager Input parameterization Abstract Application Resource Topology (AART) Model —dimensionality —structure —constraints —external representation in XML Performance model Mapper Performance contract GrADSoft Architecture Holly Dail, Mark Mazina, Graziano Obertelli, Otto Sievert John Mellor-Crummey Milestone (4b) Interfaces that enable information sharing across Grid compilers, run-time systems and libraries *

GrADS Site Visit, April Resource Selection Execution environment performs resource selection Uses information provided by —program preparation system —library writer —user’s input parameters —grid conditions –network weather service forecasts –Globus/grid information systems: MDS, GRIS

GrADS Site Visit, April Performance Diagnosis Local problem —sensors and tests —measure progress against milestones –expected time vs. measured time —notify if outside envelope Global problem —appropriate assessment of progress via differential measures –wait-time vs. computation time —comparison against original large-scale model