System Software for Parallel Computing

Two System Software Components
- Both are areas where innovation is hard to do
- A replacement for traditional optimizing compilers
- A replacement for the conventional large monolithic OS

Quick View of Optimizing Compiler

Autotuners vs Traditional Compilers
- Quality of generated code
- Which optimizations to perform
- Choosing parameters for the optimizations
- Selecting from among alternative implementations
- Resulting optimization space
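To give a feel for how quickly the resulting optimization space grows, here is a minimal sketch that enumerates every combination of a few tuning decisions. The dimension names (`unroll`, `tile`, `vectorize`, `implementation`) and their values are hypothetical placeholders chosen for illustration; they are not taken from the slides.

```python
from itertools import product

# Hypothetical tuning dimensions (illustrative assumptions, not from the slides):
# which optimizations to apply, their parameters, and alternative implementations.
SPACE = {
    "unroll": [1, 2, 4, 8],                      # parameter of an optimization
    "tile": [16, 32, 64],                        # parameter of an optimization
    "vectorize": [False, True],                  # whether to perform an optimization
    "implementation": ["row_major", "blocked"],  # alternative implementations
}


def enumerate_space(space):
    """Return every configuration as a dict, one per point in the space."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configs = enumerate_space(SPACE)
print(len(configs), "configurations")
print(configs[0])
```

Even these four small dimensions multiply out to 4 x 3 x 2 x 2 = 48 configurations, which suggests why an exhaustive search becomes expensive as more decisions are added.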

Difficulty of Enhancing Modern Compilers
- Constraints of modern compilers
- On the order of a million lines of code
- New optimizations are difficult to add
- Large investment
- Functional correctness is more important than output code quality
- Hence peak performance may still require handcrafting of the program

Promise of Search-Based Autotuners
- Search-based techniques are used in several areas of code generation
- Generates many variants of a given kernel
- Benchmarks each variant by running it on the target platform
- Measures time to complete on the target platform (tries many or all optimization switches)
- Often finds non-intuitive loop unrolling or register blocking factors that lead to better performance
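The generate, benchmark, and select loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the kernel is a simple sum, the variants differ only in a manual unroll factor, and the helper names (`make_variant`, `benchmark`, `autotune`) are hypothetical. A real autotuner would generate compiled code and search a much larger space.

```python
import time


def make_variant(unroll):
    """Generate one variant of a sum kernel, unrolled `unroll` times (hypothetical helper)."""
    terms = " + ".join(f"data[i + {k}]" for k in range(unroll))
    src = (
        "def kernel(data):\n"
        "    total = 0.0\n"
        "    n = len(data)\n"
        f"    main = n - n % {unroll}\n"
        f"    for i in range(0, main, {unroll}):\n"
        f"        total += {terms}\n"
        "    for i in range(main, n):\n"  # remainder loop for leftover elements
        "        total += data[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)  # build the variant from its generated source text
    return namespace["kernel"]


def benchmark(kernel, data, repeats=5):
    """Run the kernel on this machine and return the best observed time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(data)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(data, candidates=(1, 2, 4, 8)):
    """Try every candidate unroll factor and keep the fastest correct variant."""
    timings = {}
    for unroll in candidates:
        kernel = make_variant(unroll)
        # Correctness comes first: reject any variant that changes the result.
        assert kernel(data) == sum(data)
        timings[unroll] = benchmark(kernel, data)
    best = min(timings, key=timings.get)
    return best, timings


data = [float(i) for i in range(1003)]
best, timings = autotune(data)
print("fastest unroll factor on this machine:", best)
```

The winning factor depends on the machine and the run, which is the point of the approach: the choice is measured on the target platform instead of being predicted by a fixed compiler heuristic.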

Recent Autotuners
- Earlier autotuners concentrated on finding non-intuitive loop unrolling factors
- Recent autotuners are applicable to general-purpose parallel programs
- Auto-tuning cycle
- Autotuners as libraries
- Autotuners as stand-alone applications
- Integrating autotuners as part of the operating system
- Compiler extensions for auto-tuning
Note: Taken from the more recent paper "Auto-Tuning Support for Manycore Applications - Perspectives for Operating Systems and Compilers"

References
- Michael Wolfe. High-Performance Compilers for Parallel Computing.
- Randy Allen and Ken Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach.
- C. A. Schaefer, V. Pankratius, and W. F. Tichy. Atune-IL: An Instrumentation Language for Autotuning Parallel Applications. Technical Report, University of Karlsruhe, 2009.