Computer Architecture — Lecture 24: Parallel Processing. Ralph Grishman, November 2015, NYU.


Faster: Final Chapter

Strategies for faster processors:
– Instruction-level parallelism: general applicability, but limited gain
– SIMD: Single Instruction, Multiple Data
– MIMD: Multiple Instruction, Multiple Data

(11/30/15, Computer Architecture Lecture 24)

SIMD

– Multimedia extensions for the x86 architecture: 4- or 8-way parallel arithmetic
– Vector arithmetic: multiple fast pipelines; specialized processors
– GPUs (Graphics Processing Units): co-processor for the CPU; hundreds to thousands of arithmetic units
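The SIMD idea can be sketched in a few lines. The model below shows the lane-wise semantics of a 4-wide SIMD add (like one x86 SSE `addps` instruction); in hardware all four additions happen in a single instruction, while this Python loop only models what the lanes compute:

```python
def simd_add(a, b):
    """Model of a 4-wide SIMD add: one instruction in hardware adds
    all four lanes at once; the list comprehension here only models
    the lane-wise result, not the parallel execution."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]

print(simd_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
# → [11.0, 22.0, 33.0, 44.0]
```

The 8-way case mentioned on the slide (e.g. AVX) is the same picture with eight lanes instead of four.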

GPU

High-quality rendering is very compute intensive:
– an image may be realized by 1M triangles
– computing the pixels may require 4×10⁹ cycles
– this led to specialized graphics 'cards' for rendering, with a hardwired sequence of stages

Transition in the early 2000s to a more general design:
– large arrays of processors
– fostered experiments with wider use of GPUs

GPU

Several levels of parallelism:
– the basic unit is a streaming processor (SP), also called a CUDA core: scalar integer and floating-point arithmetic; large register file
– 128 streaming processors form a streaming multiprocessor (SM), which acts as a SIMD processor through hardware multithreading (SIMT = Single Instruction, Multiple Thread)
– 16 SMs form a GPU, which acts as a MIMD processor composed of SIMD processors
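Multiplying out the hierarchy on the slide gives the total number of scalar arithmetic units. A quick worked example with the slide's (illustrative) sizes:

```python
# Hierarchy sizes taken from the slide (illustrative, not a specific chip):
SP_PER_SM = 128    # streaming processors (CUDA cores) per SM
SM_PER_GPU = 16    # streaming multiprocessors per GPU

cuda_cores = SP_PER_SM * SM_PER_GPU
print(cuda_cores)  # → 2048 scalar arithmetic units in one GPU
```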

GPU structure
[figure: block diagram of GPU structure — not captured in the transcript]

CUDA

NVIDIA developed software to execute GPU programs written from C (CUDA = Compute Unified Device Architecture).
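In CUDA, a kernel is launched over a grid of thread blocks, and each thread computes its own global index from its block and thread IDs. The sketch below is a minimal Python model of that launch discipline (not real CUDA code): the `launch` helper loops where the GPU would run the threads in parallel:

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Model of a CUDA launch kernel<<<grid_dim, block_dim>>>(args...):
    the GPU runs one thread per (block, thread) pair in parallel;
    here we simply loop over them."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

def vec_add(block, thread, block_dim, a, b, out):
    # Each thread computes its global index, as a CUDA kernel does
    # with blockIdx.x * blockDim.x + threadIdx.x.
    i = block * block_dim + thread
    if i < len(out):           # guard against threads past the array end
        out[i] = a[i] + b[i]

a = list(range(8))
b = [10] * 8
out = [0] * 8
launch(vec_add, 2, 4, a, b, out)   # 2 blocks of 4 threads = 8 threads
print(out)  # → [10, 11, 12, 13, 14, 15, 16, 17]
```

The one-thread-per-element style is what makes the SIMT hardware effective: every thread runs the same instruction stream on its own data.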

GPU

Aimed at high-throughput, latency-tolerant tasks:
– multithreading hides the latency of main memory, reducing the need for a large, multilevel cache

Very high throughput for suitable tasks:
– multi-teraflops possible
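A back-of-envelope calculation shows why "multi-teraflops" is plausible. The core count comes from the earlier slide; the clock rate and the FLOPs-per-cycle figure are assumptions for illustration (a fused multiply-add is conventionally counted as 2 FLOPs):

```python
cores = 128 * 16        # 128 SPs per SM x 16 SMs (slide's numbers)
clock_hz = 1.0e9        # assumed 1 GHz core clock
flops_per_cycle = 2     # assumed one fused multiply-add per cycle = 2 FLOPs

peak_flops = cores * clock_hz * flops_per_cycle
print(peak_flops / 1e12)  # → 4.096 TFLOP/s — "multi-teraflops"
```

Peak figures like this assume every arithmetic unit is busy every cycle; sustained throughput depends on keeping the units fed, which is exactly what the multithreading above is for.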

MIMD

Provided through (any combination of):
– multithreading
– multicore chips
– clusters

Processes communicate through:
– message passing
– shared memory: UMA (uniform memory access) or NUMA (non-uniform memory access)
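The message-passing style can be sketched with two threads that communicate only through a queue, never through shared mutable state — the same discipline processes on a cluster must follow, with the queue standing in for the network:

```python
import threading
import queue

def producer(q):
    for i in range(3):
        q.put(i)        # "send" a message
    q.put(None)         # end-of-stream marker

def consumer(q, results):
    while True:
        msg = q.get()   # "receive" (blocks until a message arrives)
        if msg is None:
            break
        results.append(msg * 10)

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # → [0, 10, 20]
```

In a shared-memory (UMA/NUMA) design the threads would instead read and write common data structures directly, trading the explicit sends and receives for locks or other synchronization.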