Rethinking Computer Architecture Wen-mei Hwu University of Illinois, Urbana-Champaign Celebrating September 19, 2014.



What Yale and I have debated in Samos and other places.

The levels of transformation (Patt and Patel, Introduction to Computing Systems: From Bits and Gates to C and Beyond):

Problem → Algorithm → Program → ISA (Instruction Set Architecture) → Microarchitecture → Circuits → Electrons

Application developers should not have to deal with variations in HW.

The HPS Vision

One static program (algorithm), many execution resource configurations:
– Types of function units
– Number of function units
– I-fetch bandwidth
– Memory latencies

Key enablers:
– Branch prediction
– Resource mapping
– Restricted data-flow execution
– Sequential retirement

Patt, Hwu, and Shebanow, "HPS, A New Microarchitecture: Rationale and Introduction"
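The interplay of the last two enablers can be sketched with a toy model: instructions execute as soon as their operands are available (restricted data flow), but retire strictly in program order. The three-address format, register names, and single-assignment restriction here are illustrative simplifications, not details from the HPS papers.

```python
# Toy model of restricted data-flow execution with sequential
# (in-order) retirement. Assumes each destination register is
# written at most once, so out-of-order execution is safe.
def run(program, regs):
    values = dict(regs)             # register -> computed value
    done = [False] * len(program)
    retired = 0
    while retired < len(program):
        progress = False
        # Issue: execute any instruction whose source operands are ready,
        # regardless of program order.
        for i, (dst, op, a, b) in enumerate(program):
            if not done[i] and a in values and b in values:
                values[dst] = op(values[a], values[b])
                done[i] = True
                progress = True
        # Retire: advance the in-order retirement pointer past
        # every completed instruction at the head of program order.
        while retired < len(program) and done[retired]:
            retired += 1
            progress = True
        if not progress:
            raise RuntimeError("deadlock: unsatisfiable dependence")
    return values

# The second instruction depends on the first through r3.
prog = [
    ("r3", lambda a, b: a + b, "r1", "r2"),
    ("r4", lambda a, b: a * b, "r3", "r2"),
]
print(run(prog, {"r1": 1, "r2": 2}))  # r3 = 3, r4 = 6
```

The separation between the issue loop and the retirement pointer is the point: execution order is driven by data dependences, while the retirement pointer preserves the sequential semantics the programmer sees.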

Some Lessons Learned

Parallelism and communication costs motivate algorithm changes:
– Locality vs. parallelism tradeoffs in libraries

Performance and efficiency pressure breaks abstraction:
– Java is great for abstraction portability but insufficient for performance and efficiency
– MPI and OpenMP apps often explicitly handle hardware-centric details

Productivity and Performance

Library functions factor out data decomposition, parallelism, and communication:

ys = [sum(x * cos(r*k) for (x, k) in zip(xs, ks)) for r in par(rs)]

[Chart: setup, compute, and cleanup time for Triolet vs. C with MPI+OpenMP, at a 128-way speedup (16 cores × 8 nodes)]
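The Triolet one-liner above can be approximated in plain Python. Since `par` is Triolet's construct, not Python's, this sketch stands in a `multiprocessing` pool for it, assuming the outer comprehension over `rs` is what gets distributed.

```python
import math
from multiprocessing import Pool

def one_result(args):
    # Body of the comprehension for a single r: the inner sum stays
    # sequential; only the outer loop over rs is parallelized.
    r, xs, ks = args
    return sum(x * math.cos(r * k) for x, k in zip(xs, ks))

def compute(rs, xs, ks):
    # Stand-in for par(rs): farm the outer iterations out to workers.
    with Pool() as pool:
        return pool.map(one_result, [(r, xs, ks) for r in rs])

if __name__ == "__main__":
    xs, ks = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3]
    print(compute([0.0, 1.0], xs, ks))  # r = 0 gives sum(xs) = 6.0
```

Unlike Triolet, this version cannot fuse or tune the library code into the application; it only mimics the work distribution, which is why the library-level approach matters for performance.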

Trends in System Design – 2014

CPUs/GPUs/accelerators, or entire nodes, are the new function units. Compute functions are the new instructions. Functions execute in a distributed fashion to avoid data movement:
– Accelerators in/near network I/O, disk I/O, DRAM
– Some come with their own DRAM/SRAM for bandwidth

Example – Desirable Data Transfer and Compute Behavior

The runtime/OS should map buffers and compute functions:
– I/O buffers to any major DRAM/SRAM
– Compute functions (e.g., decompression) to any CPU/GPU/accelerator

Example – Today's Data Transfer and Compute Behavior

[Diagram: CPU, main memory (DRAM), device memory on a GPU or other accelerator card, network I/O, disk I/O, and a decompression accelerator, with data staged between memories via DMA on each hop]

Each additional copy diminishes application-perceived bandwidth.
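The bandwidth loss from chained copies can be quantified with a back-of-the-envelope model. The link speeds below are made-up numbers, and the model assumes the copies happen one after another rather than pipelined.

```python
def perceived_bandwidth(size_gb, hop_bandwidths_gbps):
    # Each hop copies the full buffer, so the per-hop transfer times
    # add up and the application sees the harmonic combination of the
    # individual link bandwidths.
    total_time_s = sum(size_gb / b for b in hop_bandwidths_gbps)
    return size_gb / total_time_s

# One direct 10 GB/s copy vs. the same data staged through an
# intermediate buffer over two 10 GB/s hops.
print(perceived_bandwidth(1.0, [10.0]))        # 10.0 GB/s
print(perceived_bandwidth(1.0, [10.0, 10.0]))  # 5.0 GB/s
```

Even with every link equally fast, staging through one extra buffer halves the bandwidth the application observes, which is why mapping compute near the data (previous slide) pays off.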

A Call to Action

Redefine system architecture:
– HSA and CUDA 6.0 are steps in the right direction

Redefine the ISA binary standard:
– SPIR/HSAIL/PTX with finalizers are steps in the right direction

Redesign the OS/runtime for data and compute mapping:
– UNIX/Linux is overdue for redesign

Provide performance-portable domain libraries to sustain abstraction:
– High-level mechanisms such as Triolet and Tangram fuse and tune library code into apps

Congratulations, Yale!