© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign ECE 498AL Lecture 6: GPU as part of the PC Architecture.

Presentation transcript:

Objective
To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU
– The feeds and speeds of the traditional CPU world
– The feeds and speeds when employing a GPU
– To form a solid knowledge base for performance programming in modern GPUs
Knowing yesterday, today, and tomorrow
– The PC world is becoming flatter
– Outsourcing of computation is becoming easier…

Stretching from Both Ends for the Meat
New GPUs cover massively parallel parts of applications better than CPUs
Attempts to grow current CPU architectures “out” or domain-specific architectures “in” lack success
– Using a strong combination on apps is a compelling idea
– CUDA

Bandwidth – Gravity of Modern Computer Systems
The bandwidth between key components ultimately dictates system performance
– Especially true for massively parallel systems processing massive amounts of data
– Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
– Ultimately, the performance falls back to what the “speeds and feeds” dictate
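The claim that the "speeds and feeds" set the ceiling can be sketched with a small model. The function name, the workload, and the 100 GFLOPS device below are hypothetical, chosen only for illustration. The 4 GB/s link rate matches the per-direction PCIe x16 figure quoted in these slides. The model assumes transfer and compute overlap perfectly, which is optimistic.

```python
def offload_time_s(bytes_moved, flops, link_gb_per_s, device_gflops):
    """Lower bound on run time when data must cross a link to reach the device.

    Assumes transfer and compute overlap perfectly, so the slower of the
    two sets the bound.
    """
    transfer_s = bytes_moved / (link_gb_per_s * 1e9)
    compute_s = flops / (device_gflops * 1e9)
    return max(transfer_s, compute_s)


# Hypothetical workload: 1 GB of data, 2 floating-point operations per byte.
data_bytes = 1e9
work_flops = 2 * data_bytes

# Hypothetical 100 GFLOPS accelerator behind a 4 GB/s link.
t = offload_time_s(data_bytes, work_flops, link_gb_per_s=4.0, device_gflops=100.0)
print(f"bound: {t:.3f} s")  # transfer (0.25 s) dominates compute (0.02 s)
```

In this sketch the link, not the arithmetic rate, determines the run time, so a faster device would not help until more work is done per byte moved.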

Classic PC architecture
Northbridge connects 3 components that must communicate at high speed
– CPU, DRAM, video
– Video also needs to have first-class access to DRAM
– Previous NVIDIA cards are connected to AGP, up to 2 GB/s transfers
Southbridge serves as a concentrator for slower I/O devices
Figure labels: CPU, Core Logic Chipset

PCI Bus Specification
Connected to the southbridge
– Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate
– More recently 66 MHz, 64-bit, 528 MB/second peak
– Upstream bandwidth remains slow for devices (256 MB/s peak)
– Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to CPU or DRAM through the southbridge and northbridge
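The peak figures for a parallel bus follow from clock rate times bus width. A minimal sketch, assuming one bus-wide transfer per clock and the nominal 33 MHz and 66 MHz clocks (the function name is hypothetical):

```python
def parallel_bus_peak_mb_s(clock_mhz, width_bits):
    """Peak rate in MB/s of a bus that moves one full-width word per clock."""
    return clock_mhz * width_bits / 8


pci_33 = parallel_bus_peak_mb_s(33, 32)  # 132.0 MB/s
pci_66 = parallel_bus_peak_mb_s(66, 64)  # 528.0 MB/s
print(pci_33, pci_66)
```

These are peak numbers for a shared bus. Arbitration and the other devices on the bus reduce what any single device actually sees.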

PCI as Memory Mapped I/O
PCI device registers are mapped into the CPU’s physical address space
– Accessed through loads/stores (kernel mode)
Addresses assigned to the PCI devices at boot time
– All devices listen for their addresses
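The address-decoding idea can be illustrated with a toy model. This is not how a real PCI driver is written. The `Device` and `Bus` classes, the 4-byte register size, and the base address are all hypothetical, used only to show boot-time address assignment and devices claiming loads and stores that fall in their range.

```python
class Device:
    """Toy PCI device with a small file of 4-byte registers."""

    def __init__(self, name, num_regs):
        self.name = name
        self.regs = [0] * num_regs
        self.base = None  # physical base address, assigned at "boot"

    def claims(self, addr):
        """True if addr falls inside this device's assigned range."""
        size = 4 * len(self.regs)
        return self.base is not None and self.base <= addr < self.base + size


class Bus:
    def __init__(self, mmio_base):
        self.next_base = mmio_base
        self.devices = []

    def enumerate(self, device):
        """Boot-time step: assign the device a range of physical addresses."""
        device.base = self.next_base
        self.next_base += 4 * len(device.regs)
        self.devices.append(device)

    def _target(self, addr):
        # Every device "listens"; the one whose range matches responds.
        for dev in self.devices:
            if dev.claims(addr):
                return dev
        raise ValueError(f"no device at {addr:#x}")

    def store(self, addr, value):
        dev = self._target(addr)
        dev.regs[(addr - dev.base) // 4] = value

    def load(self, addr):
        dev = self._target(addr)
        return dev.regs[(addr - dev.base) // 4]


bus = Bus(mmio_base=0xF000_0000)
gpu, nic = Device("gpu", 4), Device("nic", 2)
bus.enumerate(gpu)
bus.enumerate(nic)

bus.store(nic.base + 4, 0xABCD)   # a store to a device register
value = bus.load(nic.base + 4)    # a load from the same register
print(hex(nic.base), hex(value))
```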

PCI Express (PCIe)
Switched, point-to-point connection
– Each card has a dedicated “link” to the central switch, no bus arbitration
– Packet-switched messages form virtual channels
– Prioritized packets for QoS, e.g., real-time video streaming

PCIe Links and Lanes
Each link consists of one or more lanes
– Each lane is 1-bit wide (4 wires, each 2-wire pair can transmit 2.5 Gb/s in one direction); upstream and downstream are now simultaneous and symmetric
– Each link can combine 1, 2, 4, 8, 12, or 16 lanes (x1, x2, etc.)
– Each byte of data is 8b/10b encoded into 10 bits with a balanced mix of 1s and 0s; net data rate 2 Gb/s per lane each way
– Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), 4 GB/s (x16), each way
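The per-link rates follow from the raw lane rate and the 8b/10b overhead. A short sketch of that arithmetic (the constant and function names are hypothetical; integer arithmetic is used to keep the results exact):

```python
RAW_MBIT_PER_LANE = 2500  # 2.5 Gb/s raw signalling rate per lane, per direction


def pcie_net_mb_s(lanes):
    """Net payload rate in MB/s, per direction, after 8b/10b encoding."""
    net_mbit_per_lane = RAW_MBIT_PER_LANE * 8 // 10  # 8 payload bits per 10 line bits
    return net_mbit_per_lane * lanes // 8            # bits -> bytes


rates = {f"x{n}": pcie_net_mb_s(n) for n in (1, 2, 4, 8, 16)}
print(rates)  # {'x1': 250, 'x2': 500, 'x4': 1000, 'x8': 2000, 'x16': 4000}
```

Because each direction has its own wire pair, an x16 link carries 4 GB/s each way at the same time, or 8 GB/s counting both directions.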

PCIe PC Architecture
PCIe forms the interconnect backbone
– Northbridge/Southbridge are both PCIe switches
– Some Southbridge designs have built-in PCI-PCIe bridge to allow old PCI cards
– Some PCIe cards are PCI cards with a PCI-PCIe bridge
Source: Jon Stokes, PCI Express: An Overview – s/paedia/hardware/pcie.ars

Today’s Intel PC Architecture: Single Core System
FSB connection between processor and Northbridge (82925X)
– Memory Control Hub
Northbridge handles “primary” PCIe to video/GPU and DRAM
– PCIe x16 bandwidth at 8 GB/s (4 GB/s each direction)
Southbridge (ICH6RW) handles other peripherals

Today’s Intel PC Architecture: Dual Core System
Bensley platform
– Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the Northbridge and Southbridge (NB/SB)
– FBD (Fully Buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/s per DIMM
– PCIe links form backbone
– PCIe device upstream bandwidth now equal to downstream
– Workstation version has x16 GPU link via the Greencreek MCH
Two CPU sockets
– Dual Independent Bus to CPUs, each is basically an FSB
– CPU feeds at 8.5–10.5 GB/s per socket
– Compared to current Front-Side Bus CPU feeds of 6.4 GB/s
PCIe bridges to legacy I/O devices
Source:

Today’s AMD PC Architecture
AMD HyperTransport™ Technology bus replaces the Front-Side Bus architecture
HyperTransport™ similarities to PCIe:
– Packet based, switching network
– Dedicated links for both directions
Shown in 4-socket configuration, 8 GB/s per link
Northbridge/HyperTransport™ is on die
Glueless logic
– to DDR, DDR2 memory
– PCI-X/PCIe bridges (usually implemented in Southbridge)

Today’s AMD PC Architecture
“Torrenza” technology
– Allows licensing of coherent HyperTransport™ to third-party manufacturers to make socket-compatible accelerators/co-processors
– Allows third-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
– Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory

HyperTransport™ Feeds and Speeds
Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnect such as PCIe
HyperTransport™ 1.0 Specification
– 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)
HyperTransport™ 2.0 Specification
– Added PCIe mapping
– GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)
HyperTransport™ 3.0 Specification
– GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
– Added AC coupling to extend HyperTransport™ over long distances, up to system-to-system interconnect
Courtesy HyperTransport™ Consortium
Source: “White Paper: AMD HyperTransport Technology-Based System Architecture
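The HyperTransport 1.0 figures can be reproduced from the clock rate. This sketch assumes a 32-bit link in each direction and two transfers per clock (double data rate). Neither assumption is stated on the slide, and the function name is hypothetical.

```python
def ht_bandwidth_gb_s(clock_mhz, link_width_bits=32, transfers_per_clock=2):
    """Return (each-way, aggregate) bandwidth in GB/s.

    Assumes the same link width in both directions.
    """
    bytes_per_transfer = link_width_bits / 8
    each_way = clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9
    return each_way, 2 * each_way


each_way, aggregate = ht_bandwidth_gb_s(800)
print(each_way, aggregate)  # 6.4 GB/s each way, 12.8 GB/s aggregate
```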

A Recent AMD CPU
AMD Opteron 250 at 2.4 GHz
4.8 GFLOPS double-precision with SSE2
L1 cache
– 64 KB I and 64 KB D
– 38.4 GB/s (2 * 8 * 2.4 GHz) L1 D cache feed to data path
Not clear what the L2 to L1 feed is, probably around 38.4 GB/s
19.2 GB/s (8 * 2.4 GHz) DRAM (DDR1) to L2 feed
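The Opteron numbers are simple products of the clock rate. The sketch below reads "2 * 8 * 2.4 GHz" as two 8-byte accesses per cycle and takes 2 double-precision operations per cycle from 4.8 GFLOPS at 2.4 GHz. Both readings are assumptions about what the slide means, and the variable names are hypothetical.

```python
clock_ghz = 2.4

# Assumed: 2 double-precision floating-point operations per cycle with SSE2.
flops_per_cycle = 2
peak_gflops = clock_ghz * flops_per_cycle            # 4.8 GFLOPS

# Assumed: two 8-byte L1 data-cache accesses per cycle.
accesses_per_cycle = 2
bytes_per_access = 8
l1_feed_gb_s = accesses_per_cycle * bytes_per_access * clock_ghz  # 38.4 GB/s

# Bytes the L1 cache can supply for each floating-point operation at peak.
l1_bytes_per_flop = l1_feed_gb_s / peak_gflops       # 8.0

print(peak_gflops, l1_feed_gb_s, l1_bytes_per_flop)
```

At 8 bytes per operation the L1 feed can supply one fresh double-precision operand for every operation at peak rate. Feeds further from the core supply fewer bytes per operation.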

A Recent Intel CPU
Intel Pentium 4 at 1.50 GHz
400 MHz FSB with 3.2 GB/s bandwidth
2 MB L2 cache with >?48 GB/s to L1 with SSE2
8 KB L1 d-cache with >?36 GB/s to datapath
3 GFLOPS double-precision with SSE2
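The front-side bus and peak arithmetic figures can be checked the same way. This sketch assumes a 64-bit (8-byte) FSB data path, treats 400 MHz as the effective transfer rate, and assumes 2 double-precision operations per cycle. None of these is stated on the slide, and the variable names are hypothetical.

```python
fsb_mtransfers_per_s = 400      # effective FSB transfer rate, in millions/s
fsb_width_bytes = 8             # assumed 64-bit data path

fsb_gb_s = fsb_mtransfers_per_s * fsb_width_bytes / 1000   # 3.2 GB/s

clock_ghz = 1.5
flops_per_cycle = 2             # assumed for SSE2 double precision
peak_gflops = clock_ghz * flops_per_cycle                  # 3.0 GFLOPS

# Bytes the FSB can deliver per floating-point operation at peak rate.
fsb_bytes_per_flop = fsb_gb_s / peak_gflops

print(fsb_gb_s, peak_gflops, round(fsb_bytes_per_flop, 2))
```

At roughly one byte per operation from the FSB, a computation that streams 8-byte operands from memory without reuse cannot approach the peak rate. That is why the on-chip cache feeds matter so much.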

GeForce 7800 GTX Board Details
– 256 MB/256-bit DDR3, 600 MHz, 8 pieces of 8Mx32
– 16x PCI-Express
– SLI connector
– DVI x 2
– sVideo TV out
– Single-slot cooling
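The memory figures on this board are consistent with each other, as a short calculation shows. The sketch reads "8Mx32" as 8M words of 32 bits per chip and assumes two transfers per memory clock (double data rate). The variable names are hypothetical.

```python
chips = 8
words_per_chip = 8 * 2**20      # "8M" words per chip
bits_per_word = 32

# Total capacity: 8 chips x 8M words x 4 bytes.
capacity_mb = chips * words_per_chip * (bits_per_word // 8) // 2**20   # 256

# Each chip contributes 32 data pins to the memory interface.
bus_width_bits = chips * bits_per_word                                 # 256

mem_clock_mhz = 600
transfers_per_clock = 2         # assumed double data rate
peak_gb_s = mem_clock_mhz * 1e6 * transfers_per_clock * bus_width_bits / 8 / 1e9

print(capacity_mb, bus_width_bits, peak_gb_s)  # 256 MB, 256-bit, 38.4 GB/s
```

Under these assumptions the on-board memory feed is several times the 4 GB/s each-way rate of the PCIe x16 link that connects the board to the rest of the system.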