Intel Redefines GPU: Larrabee
Tianhao Tong, Liang Wang, Runjie Zhang, Yuchen Zhou

Vision, Ambition, and Design Goals
- Intel: software is the new hardware!
- Intel: the x86 ISA makes parallel programming easier
- Better flexibility and programmability
- Supports subroutine calls and page faults
- A mostly software rendering pipeline; only texture filtering is fixed-function

Architecture Overview
- Lots of x86 cores (8 to 64?)
- Fully coherent L2 cache
- Fully programmable rendering pipeline
- Shared L2 cache, divided into per-core subsets
- Cache control instructions
- Ring network
- 4-way multithreading

Comparison with Other GPUs
- Larrabee will use the x86 instruction set with Larrabee-specific extensions.
- Larrabee will feature cache coherency across all its cores.
- Larrabee will include very little specialized graphics hardware, instead performing tasks like z-buffering, clipping, and blending in software, using a tile-based rendering approach.
- Overall, more flexible than current GPUs.

Comparison with Other CPUs
- Based on the much simpler Pentium P54C design.
- Each Larrabee core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time.
- Larrabee includes one major fixed-function graphics hardware feature: texture sampling units.
- Larrabee has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory.
- Larrabee includes explicit cache control instructions.
- Each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.

New ISA
- 512-bit vector types (8 × 64-bit double, 16 × 32-bit float, or 16 × 32-bit integer)
- Lots of 3-operand instructions, like a = a*b + c
- Most combinations of +, -, *, / are provided; that is, you can choose between a*b - c, c - a*b, and so forth
- Some instructions have built-in constants: 1 - a*b
- Many instructions take a predicate mask, which allows work on just a selected part of the 16-wide vector
- 32-bit integer multiplications which return the lower 32 bits of the result
- Bit helper functions (scan for bit, etc.)
- Explicit cache control functions (load a line, evict a line, etc.)
- Horizontal reduce functions: add all elements inside a vector, multiply all, get the minimum, and logical reductions (or, and, etc.)
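The LRBni instruction set itself was never released publicly, but AVX-512, its direct descendant, exposes very similar operations (512-bit vectors, fused multiply-add, predicate masks, horizontal reductions). A minimal sketch of the style described above, using AVX-512 intrinsics as a stand-in:

```cpp
#include <immintrin.h>

// Sketch only: AVX-512 intrinsics standing in for LRBni, which used
// different mnemonics and never shipped publicly.
float masked_fma_reduce(const float *a, const float *b, const float *c,
                        __mmask16 active) {
    __m512 va = _mm512_loadu_ps(a);   // 16 x 32-bit float lanes
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);

    // 3-operand fused multiply-add: va*vb + vc in one instruction
    __m512 fma = _mm512_fmadd_ps(va, vb, vc);

    // Predicate mask: keep the FMA result only in active lanes,
    // pass vc through unchanged in the rest.
    __m512 sel = _mm512_mask_mov_ps(vc, active, fma);

    // Horizontal reduction: sum all 16 lanes into one scalar.
    return _mm512_reduce_add_ps(sel);
}
```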

Overview
- Scalar processing unit
- Vector processing unit
- Separate register files
- Communication between the two via memory

Scalar Processing Unit
Derived from the original Pentium (1990s):
- 2-issue superscalar
- In-order
- Short, inexpensive execution pipeline
But more than that:
- 64-bit
- Enhanced multithreading: 4 threads per core
- Aggressive prefetching
- ISA extensions and new cache features

Vector Processing Unit (VPU)
- 16-wide SIMD, with 32-bit lanes
- Loads/stores with gather/scatter support
- Mask registers enable selective reads and stores within a packed vector
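Again hedged as a sketch, with AVX-512 standing in for LRBni: a gather that honors a write mask, so only the active lanes touch memory.

```cpp
#include <immintrin.h>

// Sketch: masked gather in AVX-512 style. Loads table[idx[i]] for each
// of the 16 lanes whose bit in `mask` is set; masked-off lanes keep
// the corresponding lane of `fallback`.
__m512 masked_gather(const float *table, const int *idx,
                     __mmask16 mask, __m512 fallback) {
    __m512i vidx = _mm512_loadu_si512(idx);   // 16 x 32-bit indices
    // scale = 4 because each table entry is a 4-byte float
    return _mm512_mask_i32gather_ps(fallback, mask, vidx, table, 4);
}
```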

Cache Organization
- L1
  - 8K I-cache and 8K D-cache per thread
  - 32K I/D cache per core
  - 2-way
  - Treated as extended registers
- L2
  - Coherent
  - Shared but divided
  - A 256K subset for each core
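Larrabee's own cache-control instructions were never documented outside Intel; the closest mainstream analogues are the x86 prefetch and flush intrinsics, used here purely to illustrate the idea of software pulling a line in early and evicting it once done.

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative only: mainstream x86 analogues of the "load a line" /
// "evict a line" style of software cache control.
void stream_through(const float *src, float *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0); // pull ahead
        dst[i] = src[i] * 2.0f;
        if (i >= 16 && (i % 16) == 0)
            _mm_clflush(src + i - 16);   // evict data we are done with
    }
}
```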

On-Chip Network: Ring
Reasons for choosing a ring:
- Simplify the design
- Cut costs
- Get to market faster

Multithreading at Multiple Levels
- Threads (Hyper-Threading)
  - Hardware-managed
  - As heavyweight as an application program or OS thread
  - Up to 4 threads per core
- Fibers
  - Software-managed
  - Chunks decomposed by compilers
  - Typically up to 8 fibers per thread

Multithreading at Multiple Levels
- Strands
  - Lowest level
  - Individual operations in the SIMD engines
  - One strand corresponds to a thread on GPUs
  - 16 strands, because the VPU is 16 lanes wide

“Braided parallelism”: the interleaving of task parallelism (threads and fibers) with data parallelism (strands) within one program.
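A purely illustrative sketch of the three levels, not Intel's actual runtime: one hardware thread cooperatively steps several fibers, and each fiber step advances 16 strands at once in the 16-lane VPU (array tails are omitted for brevity).

```cpp
#include <immintrin.h>
#include <vector>
#include <cstddef>

struct Fiber {
    const float *in;
    float *out;
    std::size_t pos, len;   // progress through this fiber's data
};

void hardware_thread_loop(std::vector<Fiber> &fibers) {
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (Fiber &f : fibers) {            // software fiber switch
            if (f.pos + 16 <= f.len) {
                // 16 strands = 16 SIMD lanes advanced together
                __m512 v = _mm512_loadu_ps(f.in + f.pos);
                _mm512_storeu_ps(f.out + f.pos,
                                 _mm512_mul_ps(v, _mm512_set1_ps(2.0f)));
                f.pos += 16;
                work_left = true;
            }
        }
    }
}
```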

Larrabee Programming Model
- "Fully" programmable
  - Legacy code is easy to migrate and deploy
  - Runs both DirectX/OpenGL and C/C++ code
  - Much C/C++ source code can be recompiled without modification, since the cores are x86
  - Crucial for large x86 legacy programs
- Limitations
  - System calls (proxied to the host; see Communication)
  - Requires application recompilation

Software Threading
- Architecture-level threading:
  - P-threads
  - P-threads extended to let developers specify thread affinity
  - Better task scheduling (work stealing; Blumofe 1996, Reinders 2007), with lower costs for thread creation and switching
- Also supports OpenMP
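Larrabee's affinity extension to P-threads was never published; the same idea on Linux, pinning the calling thread to a chosen core with the standard GNU extension, looks like this:

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Sketch of thread affinity with standard GNU/Linux pthreads, standing
// in for Larrabee's unpublished extended P-threads API.
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Restrict the calling thread to the given core; returns 0 on success.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```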

Communication
- Larrabee native binaries are tightly bound to host binaries.
- The Larrabee library handles all memory message/data passing.
- System calls such as I/O functions are proxied from the Larrabee app back to the host OS service.

High-Level Programming
- Larrabee Native is also designed as a target for higher-level programming models.
- Examples: Ct, the Intel Math Kernel Library, and physics APIs.

Irregular Data Structure Support
- Complex pointer trees, spatial data structures, and large sparse n-dimensional matrices.
- Developer-friendly: Larrabee allows, but does not require, direct software management to load data into memory.

Memory Hierarchy: Differences from Nvidia GPUs
- Nvidia GeForce
  - Memory sharing is supported by per-block shared memories (PBSM), 16 KB on the GeForce 8.
  - Each PBSM is shared by 8 scalar processors.
  - Programmers MUST explicitly load data into PBSM.
  - Not directly sharable by different SIMD groups.
- Larrabee
  - All memory is shared by all processors.
  - Local data structure sharing is provided transparently by the coherent cached memory hierarchy.
  - No explicit data-loading step is required.
  - Scatter/gather mechanism
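An illustrative sketch of what the Larrabee side of this contrast buys you: because all cores share one coherent address space, a thread simply dereferences a shared, pointer-based structure and the cache hierarchy keeps copies consistent. (On a GeForce 8 of the same era, the equivalent code would first have to stage the working set into the 16 KB per-block shared memory by hand.)

```cpp
// Sketch only: pointer-chasing over a shared sparse structure, which
// "just works" when every core sees the same coherent address space.
struct SparseNode {
    float value;
    SparseNode *next;   // raw pointers are valid on every core
};

float sum_row(const SparseNode *head) {
    float sum = 0.0f;
    for (const SparseNode *n = head; n != nullptr; n = n->next)
        sum += n->value;   // hardware coherence, no explicit staging
    return sum;
}
```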

Larrabee: Prospects and Expectations
- Designed as a GPGPU rather than a GPU.
- Expected to achieve performance similar to, or lower than, high-end Nvidia/ATI GPUs.
- 3D/game performance may suffer.
- Bad news: last Friday, Intel announced that Larrabee has been delayed.

Words from Intel
- "Larrabee silicon and software development are behind where we hoped to be at this point in the project," Intel spokesman Nick Knupffer said Friday. "As a result, our first Larrabee product will not be launched as a standalone discrete graphics product," he said.
- "Rather, it will be used as a software development platform for internal and external use."

References
- Tom R. Halfhill, "Intel's Larrabee Redefines GPUs," Microprocessor Report, September 29, 2008.
- Larry Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing," ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, 2008.

Thank you! Questions?