ISCA Panel, June 7, 2005
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign


Slide 1: Future mass apps reflect a concurrent world
- Exciting applications in the future mass computing market represent and model the physical world.
- These were traditionally considered “supercomputing apps” or super-apps.
  - Physiological simulation, molecular dynamics simulation, video and audio manipulation, medical imaging, consumer game and virtual reality products
- Attempts to grow current architectures “out” or domain-specific architectures “in” lack success; a broader approach that covers more domains is promising.

Slide 2: MPEG Encoding Parallelism
- Independent IPPP sequences
- Frames: independent 16x16 pel macroblocks
- Localized dependence of P-frame macroblocks on previous frame
- Steps of macroblock processing exhibit finer-grained parallelism; each block spans function boundaries
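The macroblock-level independence described above can be sketched in C. This is a minimal illustration, not code from the talk: the names `mb_sad` and `frame_sad` are hypothetical, and the sum of absolute differences against the co-located macroblock stands in for a real motion search, which would examine a small window of the previous frame.

```c
#include <assert.h>
#include <stdlib.h>

enum { MB = 16 };  /* macroblock edge in pels, as on the slide */

/* Sum of absolute differences between one macroblock of the current frame
   and the co-located macroblock of the previous frame. Only a localized
   16x16 region of each frame is read. */
static unsigned mb_sad(const unsigned char *cur, const unsigned char *prev,
                       int width, int mbx, int mby)
{
    unsigned sad = 0;
    for (int y = 0; y < MB; y++) {
        for (int x = 0; x < MB; x++) {
            int idx = (mby * MB + y) * width + (mbx * MB + x);
            sad += (unsigned)abs((int)cur[idx] - (int)prev[idx]);
        }
    }
    return sad;
}

/* Store one SAD per macroblock in out[]. Each iteration writes a distinct
   out[] element and only reads frame data, so the iterations carry no
   dependence on one another and could run concurrently. */
void frame_sad(const unsigned char *cur, const unsigned char *prev,
               int width, int height, unsigned *out)
{
    assert(width % MB == 0 && height % MB == 0);
    int mbw = width / MB, mbh = height / MB;
    for (int i = 0; i < mbw * mbh; i++)
        out[i] = mb_sad(cur, prev, width, i % mbw, i / mbw);
}
```

The per-macroblock work is split across two functions on purpose: it mirrors the slide's remark that each block of work spans function boundaries, which is what a parallelizing compiler has to see through.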

Slide 3: Alternative Forms of MPEG-4 Threading

Slide 4: Building on HPF Compilation: what’s new?
- Applicability to the mass software base requires pointer analysis, control flow analysis, and data structure and object analysis, beyond traditional dependence analysis
- Domain-specific, application model languages
  - More intuitive than C for inherently parallel problems
    - Increased productivity, increased portability
    - Will still likely have C as implementation language
  - There is room for a new app language or a family of languages
- Role for the compiler in model language environments
  - Model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
  - Compiler can magnify the usefulness of model information with its low-level analysis

Slide 5: Pointer analysis: sensitivity, stability and safety
- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools
- Fulcra in OpenIMPACT [SAS2004, PASTE2004] and others

Slide 6: Thoughts from the VLIW/EPIC Experience
- Any significant compiler work for a new computing platform takes years to mature
  - Initial academic results from IMPACT
  - Technology collaboration with Intel/HP
  - SPEC 2000, Itanium 1 and 2, open source apps
  - This was built on significant work from the Multiflow, Cydrome, RISC, and HPC teams
- Real work in compiler development begins when hardware arrives
  - IMPACT output code performance improved by more than 20% since the arrival of Itanium hardware, and much more stable
  - Most apps were brought up with IMPACT after Itanium systems arrived: debugging!
  - Real performance effects can only be measured on hardware
  - Early access to hardware for academic compiler teams is crucial and must be a priority for industry development teams
- Quantitative methodology driven by large apps is key
  - Innovations evaluated in whole system context

Slide 7: How the next-generation compiler will do it (1)
To-do list:
- Identify acceleration opportunities
- Localize memory
- Stream data and overlap computation
Acceleration opportunities:
- Heavyweight loops identified for acceleration
- However, they are isolated in separate functions called through pointers
Figure label: Heavyweight loops
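The obstacle named above, heavyweight loops hidden behind function pointers, can be sketched in C as follows. All names here (`kernel_fn`, `stage`, `scale_by_two`, `add_bias`) are hypothetical and chosen only for illustration; the slide does not show the code it analyzed.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by all candidate kernels. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Two heavyweight loops, each isolated in its own function. */
static void scale_by_two(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2;
}

static void add_bias(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + 128;
}

/* A processing stage that stores its kernel as a pointer chosen at setup. */
struct stage {
    kernel_fn run;
};

void stage_init(struct stage *s, int use_bias)
{
    s->run = use_bias ? add_bias : scale_by_two;
}

void stage_process(const struct stage *s, const int *in, int *out, size_t n)
{
    assert(s->run != NULL);
    /* Indirect call: the call site alone does not say which loop runs.
       A compiler must work out that s->run can only be add_bias or
       scale_by_two before it can treat the loop as an acceleration
       candidate. */
    s->run(in, out, n);
}
```

From the call site in `stage_process`, nothing identifies the callee; resolving the set of possible targets is the job the slides assign to pointer analysis.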

Slide 8: How the next-generation compiler will do it (2)
To-do list:
- Identify acceleration opportunities (done)
- Localize memory
- Stream data and overlap computation
Localize memory:
- Pointer analysis identifies indirect callees
- Pointer analysis identifies localizable memory objects
- Private tables inside accelerator initialized once, saving traffic
Figure labels: Large constant lookup tables identified; Initialization code identified
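A small C sketch of the pattern the compiler is looking for here: a table filled by initialization code and only read afterwards. The names `square_table`, `init_table` and `apply_table` are hypothetical, and file-scope `static` storage merely stands in for accelerator-private memory.

```c
#include <assert.h>
#include <stddef.h>

enum { TABLE_SIZE = 256 };

/* Lookup table visible only within this file. After init_table() returns
   it is never written again, which is the property that would let a
   compiler keep a private copy close to the loop that uses it. */
static unsigned char square_table[TABLE_SIZE];
static int table_ready = 0;

/* Initialization code: the only writer of the table. */
void init_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        square_table[i] = (unsigned char)((i * i) / 255);
    table_ready = 1;
}

/* Heavyweight loop: reads the table and streams through in and out. */
void apply_table(const unsigned char *in, unsigned char *out, size_t n)
{
    assert(table_ready);
    for (size_t i = 0; i < n; i++)
        out[i] = square_table[in[i]];
}
```

Because the table is written in one place and read-only elsewhere, copying it into local storage once removes a memory access to shared storage from every loop iteration.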

Slide 9: How the next-generation compiler will do it (3)
To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation
Streaming and computation overlap:
- Memory dataflow summarizes array/pointer access patterns
- Opportunities for streaming are automatically identified
- Unnecessary memory operations replaced with streaming
Figure labels: Summarize input access pattern; Summarize output access pattern; Constant table privatized

Slide 10: How the next-generation compiler will do it (4)
To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation (done)
Achieve macropipelining of parallelizable accelerators:
- Upsampling and color conversion can stream to each other
- Optimizations can have substantial effect on both efficiency and performance
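To make "upsampling and color conversion can stream to each other" concrete, here is a hedged C sketch in which each upsampled chroma sample is consumed by the conversion step immediately, without an intermediate full-row buffer. The function name `upsample_convert_row`, the 2x horizontal replication, and the JFIF-style YCbCr-to-RGB coefficients are assumptions for illustration; the slide does not specify them. Sequential C can only show the dataflow, not the actual overlap between two accelerators.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp to the 0..255 sample range and round to nearest. */
static unsigned char clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (unsigned char)(v + 0.5);
}

/* Convert one row of width pels. cb and cr hold width/2 samples each
   (chroma subsampled 2:1 horizontally); rgb receives 3*width bytes. */
void upsample_convert_row(const unsigned char *y, const unsigned char *cb,
                          const unsigned char *cr, size_t width,
                          unsigned char *rgb)
{
    assert(width % 2 == 0);
    for (size_t x = 0; x < width; x++) {
        /* Upsampling stage: replicate each chroma sample across two pels. */
        double u = cb[x / 2] - 128.0;
        double v = cr[x / 2] - 128.0;
        /* Color conversion stage: consumes the upsampled sample at once,
           so no upsampled chroma row is ever stored to memory. */
        rgb[3 * x + 0] = clamp_u8(y[x] + 1.402 * v);
        rgb[3 * x + 1] = clamp_u8(y[x] - 0.344136 * u - 0.714136 * v);
        rgb[3 * x + 2] = clamp_u8(y[x] + 1.772 * u);
    }
}
```

An unfused version would write an upsampled chroma row to memory and read it back in a second loop; the streamed form removes those stores and loads, which is the kind of unnecessary memory operation the preceding slides aim to replace.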

Slide 11: Memory dataflow in the pointer world
- Arrays are not true 3D arrays (unlike in Fortran)
- Actual implementation: array of pointers to array of samples
- New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
Figure labels: Array of constant pointers; Row arrays never overlap
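The "array of pointers to array of samples" layout can be sketched in C as below. For brevity the sketch shows a single two-dimensional plane rather than the three-dimensional structure on the slide, and the names `sample_t`, `alloc_rows`, `free_rows` and `fill_rows` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned char sample_t;

/* Allocate rows x cols samples as an array of row pointers, each row in
   its own allocation. Returns NULL on allocation failure. */
sample_t **alloc_rows(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    sample_t **row = malloc(rows * sizeof *row);
    if (row == NULL)
        return NULL;
    for (size_t r = 0; r < rows; r++) {
        row[r] = calloc(cols, sizeof **row);
        if (row[r] == NULL) {
            while (r > 0)
                free(row[--r]);
            free(row);
            return NULL;
        }
    }
    return row;
}

void free_rows(sample_t **row, size_t rows)
{
    for (size_t r = 0; r < rows; r++)
        free(row[r]);
    free(row);
}

/* row[r][c] reads like a 2D array access but is two dependent loads:
   first the row pointer, then the sample. The const on the pointer level
   records that the row pointers themselves are not modified here. */
void fill_rows(sample_t *const *row, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            row[r][c] = (sample_t)(r + c);
}
```

Nothing in the types tells a compiler that the rows are distinct or that the row pointers stay fixed during the loop. Those two facts, labeled on the slide as constant pointers and non-overlapping rows, have to be established by analysis before the accesses can be treated like true array accesses.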