Compilation Targets
Ian Buck, Francois Labonte
February 04, 2003


Slide 1: Compilation Targets. Ian Buck, Francois Labonte. February 04, 2003.

Slide 2: GPU: Architectural Differences
– No SRF (stream register file)
– Pipelined MADD units
– Multiplexed register file
[Figure: a MUL/ADD pipeline fed by a register FIFO]

Slide 3: GPU: Architectural Differences (cont.)
– No SRF
– Pipelined MADD units
– Multiplexed register file
[Figure: four MUL/ADD pipelines, each with its own register FIFO]

Slide 4: GPU: Architectural Differences (cont.)
– No SRF
– Multiplexed register file
– Data parallelism
– Arithmetic intensity
– Gather inside kernels
[Figure: MUL/ADD pipeline with register FIFO]

Slide 5: GPU: Programming Model

Slide 6: GPU: Programming Model
Positives:
– 4-vector fp32 SIMD instruction set
– Gathers allowed inside kernels
– High-level compilers (Cg & HLSL)

Slide 7: GPU: Programming Model
Negatives:
– No exposed SRF
– Limited scatter capabilities
– No branching
– No retained state between stream elements
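Because the fragment pipeline exposes no branch instructions, a compiler has to lower conditionals by evaluating both arms and blending the results with a mask. A minimal C sketch of that select-based lowering (the function name is made up for illustration; this is not Brook or Cg output):

#include <stdio.h>

/* Lowering of "r = cond ? a : b" for a target with no branches:
 * compute both arms unconditionally, then blend by a 0/1 mask. */
static float select_no_branch(int cond, float a, float b)
{
    float mask = (float)(cond != 0);     /* 1.0f if taken, else 0.0f */
    return mask * a + (1.0f - mask) * b;
}

int main(void)
{
    /* Both arms are paid for on every element, mirroring the cost
     * model of branch-free fragment programs. */
    printf("%f\n", select_no_branch(1, 2.0f, 3.0f));   /* 2.0 */
    printf("%f\n", select_no_branch(0, 2.0f, 3.0f));   /* 3.0 */
    return 0;
}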

Slide 8: GPU: Compilation Target
– Compile Brook kernels to Cg
– Streams = textures
– Roll operators into gathers (stencil, group)
– Compile the stream graph into large kernels
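A minimal C sketch of the execution model behind this mapping, assuming only what the slide states: each stream is a 2D array standing in for a texture, the kernel body runs once per output element (on the GPU the rasterizer supplies that loop implicitly), and a gather is an indexed read of a second texture. Names are illustrative, not actual Brook-generated code.

#include <stdio.h>

#define W 64
#define H 64

/* Streams = textures: model each stream as a 2D array. */
static float src[H][W], table[H][W], dst[H][W];

/* The compiled Cg fragment program is conceptually this per-element
 * body; the dependent read of table is a gather. */
static float kernel_body(int x, int y)
{
    float v = src[y][x];            /* this element of the input stream */
    int gx = ((int)v) % W;          /* address computed inside the kernel */
    return table[y][gx];            /* gather from a second stream/texture */
}

int main(void)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            src[y][x] = (float)x;
            table[y][x] = (float)(x + y);
        }

    /* On the GPU the rasterizer runs this loop, one fragment per
     * output element. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[y][x] = kernel_body(x, y);

    printf("dst[1][2] = %f\n", dst[1][2]);   /* table[1][2] = 3.0 */
    return 0;
}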

Slide 9: GPU Compilation Target
Challenges:
– Reductions require lg(N) passes
– Scatter requires host assist (may be fixed soon)
– Limited resources: registers, inputs/outputs, instruction counts
– Needs a generalized RDS (recursive dominator split)
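The lg(N) bound comes from the pass structure: without scatter, each rendering pass can only combine a fixed number of neighbors into a half-size output, so N elements take ceil(lg N) passes (8 elements -> 3 passes). A small C model of that structure, with plain arrays standing in for the render targets:

#include <stdio.h>

/* Multipass reduction: each pass pairwise-sums neighbors into a
 * half-size output, the way a GPU without scatter must do it. */
static void reduce_pass(const float *in, float *out, int n_out)
{
    for (int i = 0; i < n_out; i++)
        out[i] = in[2 * i] + in[2 * i + 1];
}

int main(void)
{
    float buf[2][8] = { {1, 2, 3, 4, 5, 6, 7, 8}, {0} };
    int cur = 0, n = 8, passes = 0;

    while (n > 1) {                  /* 8 -> 4 -> 2 -> 1 */
        reduce_pass(buf[cur], buf[cur ^ 1], n / 2);
        cur ^= 1;
        n /= 2;
        passes++;
    }
    printf("sum = %f after %d passes\n", buf[cur][0], passes);  /* 36, 3 */
    return 0;
}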

Slide 10: GPU Compilation Target
Questions:
– How does a GPU fit into the SVM (stream virtual machine)? Is texture memory ~ the SRF?
– Do we allow gather operations inside kernels?
– Multinode issues? A GPU cluster is not a shared-memory machine.

Slide 11: Smart Memories
Original Smart Memories:
– 4 CPUs in a quad could be configured as a 4-cluster machine working in SIMD
– The control node was one processor node
– Memory tiles could be configured as SRF banks, kernel instruction memory, or stream buffers
[Figure: quad of tiles labeled QI, KM/SRF, and EX]

Slide 12: Smart Memories Implementation Status
– Instead of creating the whole processor core itself, Smart Memories is looking at using a processor core from Tensilica.
– Tensilica provides extensible (add instructions), synthesizable processor cores.
– The status of streaming support is uncertain; until this is resolved, it is not worthwhile discussing it as a compilation target.

Slide 13: x86 Workstation Cluster: Differences
– No SRF per se; could try to exploit the cache as an SRF (similar to Sandia's Sierra)
– Indexing in kernels is possible, though performance degrades once accesses fall outside the cache
– Conditionals: branches are possible, predication is not (single cluster)
– SIMD instructions: SSE/MMX provide extra ILP
– Simultaneous multithreading: a chance to overlap memory access with kernel execution
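On the SIMD point, SSE gives four fp32 lanes per instruction, roughly matching the GPU's 4-vector units. A minimal sketch, assuming a saxpy-style kernel body and n divisible by 4; this is illustrative, not actual backend output:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* 4-wide kernel body: y[i] = a * x[i] + y[i]. */
static void saxpy_sse(float a, const float *x, float *y, int n)
{
    __m128 va = _mm_set1_ps(a);
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8] = {0};

    saxpy_sse(2.0f, x, y, 8);
    printf("y[7] = %f\n", y[7]);   /* 2*8 + 0 = 16.0 */
    return 0;
}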

Slide 14: Multinode Issues
– Not a shared-memory environment
– Do we need software address translation?
– Would be simpler to implement on an SGI Origin or Flash
– Scatter ops across multiple nodes need to go through the CPU that owns the target memory location
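A sketch of the owner-mediated scatter just described: the writer never touches remote memory directly, it routes each (index, value) pair to the node owning that element, and that node's CPU performs the store. The block distribution, owner_of, and send_to_owner are hypothetical stand-ins; only the routing structure is the point.

#include <stdio.h>

#define ELEMS_PER_NODE 256   /* assumed block distribution */

typedef struct { int index; float value; } scatter_msg;

/* Hypothetical: which node owns a given global element. */
static int owner_of(int global_index)
{
    return global_index / ELEMS_PER_NODE;
}

/* Hypothetical transport; a real system would send a message over
 * the interconnect for the owning node's CPU to apply. */
static void send_to_owner(int node, scatter_msg m)
{
    printf("node %d applies a[%d] = %f\n", node, m.index, m.value);
}

/* Writer side: route every scattered element through its owner. */
static void scatter(const int *idx, const float *val, int n)
{
    for (int i = 0; i < n; i++) {
        scatter_msg m = { idx[i], val[i] };
        send_to_owner(owner_of(idx[i]), m);
    }
}

int main(void)
{
    int   idx[3] = { 3, 300, 900 };
    float val[3] = { 1.0f, 2.0f, 3.0f };
    scatter(idx, val, 3);   /* routed to nodes 0, 1, 3 */
    return 0;
}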

Slide 15: Compilation Paths
– Brook -> Mattan/Jayanth compiler -> SVM -> pthreads
– Brook on multiple threads (Christos)
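A minimal sketch of the pthreads end of the first path: split the stream into contiguous chunks and run the kernel body over each chunk in its own thread. Illustrative only; this is not the actual SVM runtime.

#include <pthread.h>
#include <stdio.h>

#define N 1024
#define NTHREADS 4

static float in_stream[N], out_stream[N];

typedef struct { int begin, end; } chunk;

/* Per-element kernel body. */
static float kernel_body(float x) { return 2.0f * x + 1.0f; }

static void *worker(void *arg)
{
    chunk *c = (chunk *)arg;
    for (int i = c->begin; i < c->end; i++)
        out_stream[i] = kernel_body(in_stream[i]);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    chunk chunks[NTHREADS];

    for (int i = 0; i < N; i++)
        in_stream[i] = (float)i;

    /* One contiguous block of stream elements per thread. */
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].begin = t * (N / NTHREADS);
        chunks[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("out_stream[%d] = %f\n", N - 1, out_stream[N - 1]);  /* 2047.0 */
    return 0;
}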