Introduction
Håkon Kvale Stensland
August 26th, 2011
INF5063: Programming heterogeneous multi-core processors

INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland, University of Oslo

Overview
• Course topic and scope
• Background for the use of, and parallel processing with, heterogeneous multi-core processors
• Examples of heterogeneous architectures

INF5063: The Course

People
• Håvard Espeland ifi
• Håkon Kvale Stensland ifi
• Carsten Griwodz ifi
• Pål Halvorsen ifi

Time and place
• Lectures: Fridays – Perl (Room 2453)
• The web page states that we will have group exercises: Thursdays, Java (Room 2423)
  However, there will NOT be any weekly exercises; see the detailed plan for each week!

About INF5063: Topic & Scope
• Content: The course gives …
  - … an overview of heterogeneous multi-core processors in general, and of three variants and a modern general-purpose core in particular (architectures and use)
  - … an introduction to working with heterogeneous multi-core processors
    - Intel IXP 2400 network processor card
    - SSEx for x86
    - The Cell Broadband Engine Architecture
    - nVIDIA's family of GPUs and the CUDA/OpenCL programming framework
  - … some ideas of how to use/program heterogeneous multi-core processors

About INF5063: Topic & Scope
• Tasks: The important part of the course is the lab assignments, where you program each of the three examples of heterogeneous multi-core processors
• 1 exercise (not graded) on the Intel IXP
  - packet inspection: download, run and extend wwpingbump
• 3 graded home exams (counting 33% each):
  - Deliver code
  - Make a demonstration and explain your design and code to the class
  1. On the x86: Video encoding. Improve the performance of video compression by using SSE instructions.
  2. On the Cell processor: Video encoding. Improve the performance of video compression by using the Cell processor's SPEs.
  3. On the nVIDIA graphics cards: Video encoding. The same as above, but exploit the parallelism by using the GPU architecture.

Available Resources
• Resources will be placed at
  - Login: inf5063
  - Password: ixp
  - Manuals, papers, code examples, …

Background and Motivation: Moore’s Law

Motivation: Intel View
• >billion transistors integrated
  - 1971: 2,300 (Intel 4004)
  - 2011: 2.3 billion (Intel 8-core Xeon Nehalem-EX)
  - 2011: 3.0 billion (nVIDIA GF100, Fermi)
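The two Intel data points on this slide are enough for a rough check of Moore's law. The sketch below only uses the slide's own numbers (Intel 4004 in 1971, Nehalem-EX in 2011); the variable names are illustrative, and two data points give an average doubling period, not a fitted trend.

```python
import math

# Transistor counts quoted on the slide.
transistors_1971 = 2_300            # Intel 4004
transistors_2011 = 2_300_000_000    # Intel 8-core Xeon Nehalem-EX

years = 2011 - 1971
growth = transistors_2011 / transistors_1971   # overall growth factor
doublings = math.log2(growth)                  # number of doublings in that span
years_per_doubling = years / doublings         # average doubling period

print(f"growth factor:      {growth:,.0f}x")
print(f"doublings:          {doublings:.1f}")
print(f"years per doubling: {years_per_doubling:.2f}")
```

A million-fold increase over 40 years corresponds to about 20 doublings, that is, roughly one doubling every two years.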

Motivation: Intel View
• >billion transistors integrated
• Clock frequency can still increase
  - 2011 (still): 5 GHz – IBM Power6

Motivation: Intel View
• >billion transistors integrated
• Clock frequency can still increase
• Future applications will demand TIPS (tera-instructions per second)
  - 2011: 159, GHz – Intel Core i7 Extreme Edition 990x (6 cores)

Motivation: Intel View
• >billion transistors integrated
• Clock frequency can still increase
• Future applications will demand TIPS
• Power? Heat?

Motivation: Intel View
• Soon >billion transistors integrated
• Clock frequency can still increase
• Future applications will demand TIPS
• Power? Heat?

Motivation
• “Future applications will demand TIPS”
• “Think platform beyond a single processor”
• “Exploit concurrency at multiple levels”
• “Power will be the limiter due to complexity and leakage”
Distribute workload on multiple cores

Symmetric Multi-Core Processors
• Intel Sandy Bridge

Symmetric Multi-Core Processors
• AMD “Bulldozer”

Intel Multi-Core Processors
• Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and lower power consumption than single-core processors

Symmetric Multi-Core Processors
• Good
  - Growing computational power
• Problematic
  - Growing die sizes
  - Unused resources
    - Some cores used much more than others
    - Many core parts frequently unused
• Why not spread the load better?
  - Functions exist only once per core
  - Parallel programming is hard
→ Asymmetric multi-core processors

Asymmetric Multi-Core Processors
• Asymmetric multi-processors consume power and provide increased computational power only on demand
• This is now done in most modern CPUs!
[Figure labels: Highly parallel, Moderately parallel, Sequential]

Motivation
• “Future applications will demand TIPS”
• “Think platform beyond a single processor”
• “Exploit concurrency at multiple levels”
• “Power will be the limiter due to complexity and leakage”
Distributed workload on multiple cores
+ simple processors are “easier” to program
+ consume less energy
→ heterogeneous multi-core processors

What now – are today's cores really “Symmetric”?
• Nehalem

Co-Processors
• The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU)
  - 50-fold speed-up of floating point operations
• Intel kept the co-processor up to the i486
  - 486DX contained an optimized i487 block
  - Still a separate pipeline (pipeline flush when starting and ending use)
  - Communication over an internal bus
• The Commodore Amiga was one of the earlier machines that used multiple processors
  - Motorola 680x0 main processor
  - Blitter (block image transferrer: moving data, fill operations, line drawing, performing boolean operations)
  - Copper (co-processor: changes the address for video RAM on the fly)

IXA: Internet Exchange Architecture
• IXP2400 basic features:
  - 1 embedded 600 MHz Intel XScale (ARMv5)
  - 8 packet 600 MHz µengines
  - onboard memory
  - 3 x 1 Gbps Ethernet ports
  - multiple, independent buses
  - low-speed serial interface
  - interfaces for external memory and I/O buses
  - …

IXP2400: IXP2400 Architecture
[Figure: block diagram of the IXP2400. Labels: embedded ARM CPU (XScale); microengines 1 to 8; SRAM coprocessor; SCRATCH memory; FLASH; DRAM; SRAM access, SDRAM access, PCI access, MSF access and slowport access units; PCI bus, receive bus, transmit bus, DRAM bus and SRAM bus; multiple independent internal buses]

nVIDIA Tegra 2 ARM SoC
• One of many multi-core processors for handheld devices
• Two ARM Cortex-A9 processors
  - Out-of-order design
  - 32-bit ARMv7 instruction set
  - Up to 4 cache-coherent cores
  - GHz
• Several “dedicated” processors:
  - HD Video Decoder
  - HD Video Encoder
  - Audio Processor
  - Image Processor
  - Graphics Processor
• Next generation will bring more cores and programmable GPUs

Intel SCC (Single Chip Cloud Computer)
• Prototype processor from Intel's Tera-scale project
• 24 «tiles» connected in a mesh
• 1.3 billion transistors
• Intel SCC Tile:
  - Two Intel P54C (Pentium) cores
  - In-order-execution design
  - IA32 architecture
  - NO SIMD units(!!)
• No cache coherency
• Only a limited number of chips were made, for research

Intel «Larrabee» / «Knights Ferry»
• Launched in 2009 as a consumer GPU from Intel
• Canceled in May 2010 due to “disappointing performance”
• Will launch in 2012 as a dedicated “accelerator” card
• 32 cores connected with a ring bus
• Intel “Larrabee” core:
  - Based on the Intel P54C Pentium core
  - In-order-execution design
  - Multi-threaded (4-way)
  - 64-bit architecture
  - 512-bit SIMD unit
  - 256 kB L2 cache
• Cache coherency
• Software rendering

Designing GPUs this big is “fucking hard”! (Jonah Alben, NVIDIA's VP of GPU Engineering)

Graphics Processing Units (GPUs)
• First GPUs, 80s: for early 2D operations
  - Amiga and Atari used a blitter; the Amiga also had the copper
• 90s: 3D hardware for game consoles like PS and N64
  - 3dfx Voodoo 3D add-on cards for PCs
• New powerful GPUs, e.g.:
  - Nvidia GeForce GTX
    - MHz core
    - 3 GB memory
    - memory BW: 192.4 GB/sec
    - PCI Express 2.0
  - similar to other manufacturers …
• GPU: a dedicated graphics rendering device

General Purpose Computing on GPU
• The
  - high arithmetic precision
  - extreme parallel nature
  - optimized, special-purpose instructions
  - available resources
  - …
  … of the GPU allow general, non-graphics related operations to be performed on the GPU
• Generic computing workload is off-loaded from the CPU to the GPU
• More generically: heterogeneous multi-core processing

nVIDIA Fermi Architecture - GF110
  - 3.0 billion transistors
  - 512 cores
  - 384-bit memory bus (GDDR5)
  - 192.4 GB/sec memory bandwidth
  - 1536 Gflops single precision
  - PCI Express 2.0

nVIDIA GF100 / GF110
• Stream Multiprocessors (SMs)
  - Fundamental thread block unit
  - 32 stream processors (SPs) (scalar ALU for threads)
  - 4 special function units (SFUs) (cos, sin, log, ...)
  - 128 kB local register files (RFs)
  - 16 / 48 kB level 1 cache
  - 16 / 48 kB shared memory
  - 768 kB global level 2 cache
• Number of stream multiprocessors
  - 1: Quadro NVS 130M
  - 16: GeForce 8800 GTX
  - 30: GeForce GTX 285
  - 16: GeForce GTX 580
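The GF110 figures on the two slides above can be cross-checked with simple arithmetic. The sketch below uses only numbers from the slides plus one labeled assumption: that the peak Gflops figure counts one fused multiply-add (2 floating-point operations) per core per clock. Under that assumption it derives the shader clock and the memory transfer rate implied by the quoted peak numbers; neither derived value appears on the slides.

```python
# Figures quoted on the slides for the GF110 (GeForce GTX 580).
sms = 16                   # stream multiprocessors
sps_per_sm = 32            # stream processors (cores) per SM
peak_gflops = 1536.0       # single precision
bus_width_bits = 384       # GDDR5 memory bus
bandwidth_gb_s = 192.4     # memory bandwidth

# ASSUMPTION (not on the slides): each core completes one fused
# multiply-add, counted as 2 floating-point operations, per clock.
FLOPS_PER_CORE_PER_CLOCK = 2

cores = sms * sps_per_sm
implied_clock_ghz = peak_gflops / (cores * FLOPS_PER_CORE_PER_CLOCK)

# Bandwidth = bytes moved per transfer * transfers per second.
bytes_per_transfer = bus_width_bits / 8
implied_gtransfers_s = bandwidth_gb_s / bytes_per_transfer

print(f"cores:                 {cores}")
print(f"implied shader clock:  {implied_clock_ghz:.2f} GHz")
print(f"implied transfer rate: {implied_gtransfers_s:.3f} GT/s")
```

The core count reproduces the 512 cores listed for the GF110, which is a useful sanity check that the per-SM and per-chip figures on the two slides agree.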

STI (Sony, Toshiba, IBM) Cell
• Motivation for the Cell
  - Cheap processor
  - Energy efficient
  - For games and media processing
  - Short time-to-market
• Conclusion
  - Use a multi-core chip
  - Design around an existing, power-efficient design
  - Add simple cores specific for game and media processing requirements

STI (Sony, Toshiba, IBM) Cell
• Cell is a 9-core processor
  - combining a light-weight general-purpose processor with multiple co-processors into a coordinated whole
  - Power Processing Element (PPE)
    - conventional Power processor
    - not supposed to perform all operations itself, acting like a controller
    - running conventional OSes
    - 16 KB instruction/data level 1 cache
    - 512 KB level 2 cache

STI (Sony, Toshiba, IBM) Cell
  - Synergistic Processing Elements (SPE)
    - specialized co-processors for specific types of code, i.e., very high performance vector processors
    - local stores
    - can do general purpose operations
    - the PPE can start, stop, interrupt and schedule processes running on an SPE
  - Element Interconnect Bus (EIB)
    - internal communication bus
    - connects on-chip system elements:
      • PPE & SPEs
      • the memory controller (MIC)
      • two off-chip I/O interfaces
    - 25.6 GBps each way

STI (Sony, Toshiba, IBM) Cell
  - memory controller
    - Rambus XDRAM interface to Rambus XDR memory
    - dual channels at 12.8 GBps → 25.6 GBps
  - I/O controller
    - Rambus FlexIO interface which can be clocked independently
    - dual configurable channels
    - maximum ~76.8 GBps
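The bandwidth figures on this slide combine as follows. This is a small sketch using the slide's numbers, with the GF110 memory bandwidth from the Fermi slide added for scale; the comparison is only between quoted peak figures for different memory technologies and says nothing about achieved throughput.

```python
import math

# Cell figures quoted on the slide (GB/s).
channels = 2
per_channel_gb_s = 12.8
io_max_gb_s = 76.8          # FlexIO maximum, approximate per the slide

# GF110 memory bandwidth quoted on the Fermi slide (GB/s).
gf110_gb_s = 192.4

memory_gb_s = channels * per_channel_gb_s   # aggregate XDR bandwidth
io_to_memory = io_max_gb_s / memory_gb_s    # I/O peak relative to memory
gpu_to_cell = gf110_gb_s / memory_gb_s      # GPU memory relative to Cell memory

print(f"Cell XDR memory bandwidth: {memory_gb_s:.1f} GB/s")
print(f"FlexIO peak / memory:      {io_to_memory:.1f}x")
print(f"GF110 memory / Cell:       {gpu_to_cell:.1f}x")
```

Numbers like these relate to the summary's point that memory bandwidth is one of the capabilities a programmer needs to know for each system.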

STI (Sony, Toshiba, IBM) Cell
  - Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed
  - used for example in
    - Sony PlayStation 3:
      • 3.2 GHz clock
      • 6 SPEs for general operations
      • 1 SPE for security for the OS
    - Toshiba home cinema:
      • decoding of 48 HDTV MPEG streams
      • dozens of thumbnail videos simultaneously on screen
    - IBM blade centers:
      • 3.2 GHz clock
      • Linux ≥

The End: Summary
• Heterogeneous multi-core processors are already everywhere
• Challenge: programming
  - Need to know the capabilities of the system
  - Different abilities in different cores
  - Memory bandwidth
  - Memory sharing efficiency
  - Need new methods to program the different components
• Next up: How to start using the Intel IXP for assignment 1