AMD’s ATI GPU Radeon R700 (HD 4xxx) series Elizabeth Soechting David Chang Jessica Vasconcellos 1 CS 433 Advanced Computer Architecture May 7, 2008.

Slides:



Advertisements
Similar presentations
Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan.
Advertisements

3D Graphics Content Over OCP Martti Venell Sr. Verification Engineer Bitboys.
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Computer Organization and Architecture
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Lecture 6: Multicore Systems
AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.
COEN 180 DRAM. Dynamic Random Access Memory Dynamic: Periodically refresh information in a bit cell. Else it is lost. Small footprint: transistor + capacitor.
SECTION 4a Transforming Data into Information.
GRAPHICS AND COMPUTING GPUS Jehan-François Pâris
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
High Dynamic Range Emeka Ezekwe M11 Christopher Thayer M12 Shabnam Aggarwal M13 Charles Fan M14 Manager: Matthew Russo 6/26/
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Evolutions of GPU Architectures Andrew Coile CMPE220 3/2007.
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
Basic Computer Organization CH-4 Richard Gomez 6/14/01 Computer Science Quote: John Von Neumann If people do not believe that mathematics is simple, it.
PlayStation 2 Architecture Irin Jose Farid Momin Quy Ngo Olivia Wong.
5.1 Chaper 4 Central Processing Unit Foundations of Computer Science  Cengage Learning.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
* Definition of -RAM (random access memory) :- -RAM is the place in a computer where the operating system, application programs & data in current use.
Emotion Engine A look at the microprocessor at the center of the PlayStation2 gaming console Charles Aldrich.
Interconnection Structures
1 Copyright © 2011, Elsevier Inc. All rights Reserved. Appendix E Authors: John Hennessy & David Patterson.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Lecture#14. Last Lecture Summary Memory Address, size What memory stores OS, Application programs, Data, Instructions Types of Memory Non Volatile and.
Computer Processing of Data
Practical PC, 7th Edition Chapter 17: Looking Under the Hood
Computer Graphics Graphics Hardware
CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
Chapter 2 The CPU and the Main Board  2.1 Components of the CPU 2.1 Components of the CPU 2.1 Components of the CPU  2.2Performance and Instruction Sets.
History of Microprocessor MPIntroductionData BusAddress Bus
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
Hardware. Make sure you have paper and pen to hand as you will need to take notes and write down answers and thoughts that you can refer to later on.
Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.
+ CS 325: CS Hardware and Software Organization and Architecture Memory Organization.
A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.
Computer Architecture Lecture 2 System Buses. Program Concept Hardwired systems are inflexible General purpose hardware can do different tasks, given.
EEE440 Computer Architecture
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
ATI Stream Computing ATI Radeon™ HD 2900 Series GPU Hardware Overview Micah Villmow May 30, 2008.
Lecture 25 PC System Architecture PCIe Interconnect
Stored Programs In today’s lesson, we will look at: what we mean by a stored program computer how computers store and run programs what we mean by the.
Motherboard A motherboard allows all the parts of your computer to receive power and communicate with one another.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
CPU/BIOS/BUS CES Industries, Inc. Lesson 8.  Brain of the computer  It is a “Logical Child, that is brain dead”  It can only run programs, and follow.
Lecture on Central Process Unit (CPU)
Overview von Neumann Architecture Computer component Computer function
بسم الله الرحمن الرحيم MEMORY AND I/O.
Computer Hardware & Processing Inside the Box CSC September 16, 2010.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
CS 1410 Intro to Computer Tecnology Computer Hardware1.
Information Technology INT1001 Lecture 2 1. Computers Are Your Future Tenth Edition Chapter 6: Inside the System Unit Copyright © 2009 Pearson Education,
1 Memory Hierarchy Design Chapter 5. 2 Cache Systems CPUCache Main Memory Data object transfer Block transfer CPU 400MHz Main Memory 10MHz Bus 66MHz CPU.
Computer Graphics Graphics Hardware
GPU Architecture and Its Application
CSC 4250 Computer Architectures
What happens inside a CPU?
Cache Memory Presentation I
NVIDIA Fermi Architecture
Computer Graphics Graphics Hardware
RADEON™ 9700 Architecture and 3D Performance
Presentation transcript:

AMD’s ATI GPU Radeon R700 (HD 4xxx) series Elizabeth Soechting David Chang Jessica Vasconcellos 1 CS 433 Advanced Computer Architecture May 7, 2008

General 2 Interconnect SIMD Cores Memory Pipeline FetchMultithread Radeon R700 series GPUs have three levels: Value Mainstream Performance Marketed for: Blu-ray movies and HD content play Improved video game visuals AMD concentrated on improving efficiency of components in this GPU

General 3 Interconnect SIMD Cores Memory Pipeline FetchMultithread Value GPUs are ATI Radeon HD 4550 ATI Radeon HD 4350 Price range is $40-$90

General 4 Interconnect SIMD Cores Memory Pipeline FetchMultithread Mainstream GPUs are ATI Radeon HD 4600 Price range is $70-$100 Has 5 models 4 have single-slot fansink cooling solution 1 has passive-cooling solution

General 5 Interconnect SIMD Cores Memory Pipeline FetchMultithread Performance GPUs are ATI Radeon HD 4890 ATI Radeon HD 4870 ATI Radeon HD 4850 ATI Radeon HD 4830 ATI Radeon HD 4770 Price range is $100-$250

General 6 Interconnect SIMD Cores Memory Pipeline FetchMultithread All Radeon R series GPUs are equip with DirectX 10.1 HD resolution HDMI with 7.1 digital surround sound support PCI Express nm based GPU

Instruction cache 7 General Interconnect SIMD Cores Memory Pipeline FetchMultithread ATI RV770 GPU block diagram tech.net/hardware/graphics/2008/09/02/ati- radeon architecture- review/comments

Fetch 8 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Instruction Cache Part of the setup processes Next to Constants cache Through to the SIMD cores via the Ultra- Threaded Processor

Pipeline 9 General Interconnect SIMD Cores Memory Pipeline FetchMultithread An RV770 shader processor

Pipeline 10 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Design is based on “very long instruction word” idea Started in Radeon HD 2900 XT

Pipeline 11 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Although the general ideas are known, the details of the pipeline are Company Secrets Because most bottlenecks occurs during the pipeline A effective pipeline is extremely valuable

Pipeline 12 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Execution units are similar to past models (R600) other than 16KB data stores Register overflow Realignment of texturing hardware Dedicated branch execution units and texture address processors

Pipeline 13 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Changes to the actual pipeline are: Clocking architecture quarter data-rate command clock Better error detection New data mask algorithm DRAM support interface improvements

Pipeline 14 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Each shader processor is made up of 5 stream processors each with different functionality 4 units handle the instructions Floating point MAD Floating point Multiply Floating point add Integer add Dot product Each of these can be handled in one clock cycle

Pipeline 15 General Interconnect SIMD Cores Memory Pipeline FetchMultithread 5th unit in each shader processor Cannot compute dot products Cannot compute integer ADD Cannot compute double precision

Pipeline 16 General Interconnect SIMD Cores Memory Pipeline FetchMultithread 5 unit in each shader processor Can handle a larger variety of instructions such as integer multiply Integer division bit shifting transcendental commands like SIN, COS, LOG All of the shader operations are done with 32-bit precision.

Pipeline 17 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Most operations are simple so they can be handled by the 4 units more complicated calculations can be handled with the more complicated units

Pipeline 18 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD discussed the chip's ability to handle fast integer bit shift operations important for video processing encoding compression algorithms this is implemented into all 800 of Radeon’s stream processing units

Pipeline 19 General Interconnect SIMD Cores Memory Pipeline FetchMultithread The branch execution unit is also present inside the Radeon chip Has the same flow control and conditional operations

20 General Interconnect SIMD Cores Memory Pipeline FetchMultithread ATI RV770 GPU shader processor tech.net/hardware/graphics/2008/09/02/ati- radeon architecture- review/comments

Pipeline 21 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Processing units are up to twice as fast as R670’s units Are multi-sample anti-aliasing enabled Have 64-bit color modes when anti- aliasing is disabled

Pipeline 22 General Interconnect SIMD Cores Memory Pipeline FetchMultithread 64 ops per clock for each SIMD core 160 texture fetches per clock cycle 128 – bit precision for all operations

Pipeline 23 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Collectively, the actual performance is actually four times the base clock. For the Radeon HD 4870 the base clock is 900MHz With the speedup, effective clock rate is 3.6GHz

Pipeline 24 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Each of the ten texture units can handle four addresses 16 Floating Point 32 samples four Floating Point 32 filters per clock AMD says that it has increased The Floating point 32 filter rate by 2.5x the Floating point 64 filter rate by 1.25x the Texture cache bandwidth by 2x

SIMD Organization 25 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD has changed its texturing architecture Moved away from completely decoupled texture samples Now they are aligned with the SIMD cores Each texture unit has its own L1 texture cache so each SIMD core can store data that is unique to that core

SIMD Organization 26 General Interconnect SIMD Cores Memory Pipeline FetchMultithread This results in more effective memory space Not sure of the specific size of the L1 texture cache’s size AMD says it has doubled the storage space (per L1) which equates to a five- fold increase chip-wide.

SIMD Organization 27 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Improved efficiency of die space Improved capacitor layout and supplementary power connector placement

SIMD Organization 28 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Shader Model 4.1 Has a 'unified' architecture. same shader units can handle both vertex and pixel processing, as well as geometry shaders Organized so there is no situations where a shader unit is unused (if there's no bandwidth bottleneck and the CPU can keep up)

SIMD Organization 29 General Interconnect SIMD Cores Memory Pipeline FetchMultithread each SIMD core is either in single or double precision mode Double precision complies with IEEE 754 precision requirements Double precision workloads are handled on all 5 processors in each shader unit Switching between single and double precision is difficult

SIMD Organization 30 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD in their own tests increased texturing performance per mm² by 70% SIMD core’s performance per mm² by around 40% The changes aren’t obvious from looking at the GPU

Number of Cores 31 General Interconnect SIMD Cores Memory Pipeline FetchMultithread The shader processors are split down into ten SIMD cores, each with 16 shader processors per SIMD. Each SIMD array also has four texture units with an L1 cache

Number of Cores 32 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Each SIMD array has 80 cores and a 16KB cache (local data store). 16KB cache allows each shader processing unit inside each SIMD to communicate with each other without having to use texture cache space Improves communication between SIMD cores because SIMD cores can access the high-speed data request bus.

Number of Cores 33 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Total of 800 stream processors AMD redesigned the stream processing units and increased from 320 to 800 units.

Memory Hierarchy 34 General Interconnect SIMD Cores Memory Pipeline FetchMultithread The memory controller was redesigned from previous models Shader processors have increased 2.5X so total register space has also increased by 2.5x AMD has improved bandwidth and transistor efficiency Something they have failed at in the past

Memory Hierarchy 35 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Redesigned L2 texture cache design Radeon’s R700 texture caches are assigned to memory channels There are four memory channels each with its own L2 texture cache

Memory Hierarchy 36 General Interconnect SIMD Cores Memory Pipeline FetchMultithread As mentioned before, each texture unit has its own L1 texture cache had doubled the storage space for each L1 which is a 5x increase chip-wide

Memory Hierarchy 37 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD has also improved the chip's vectored I/O performance (random reading and writing) by a factor of 2 GPUs can compute these at a rate of bit or eight 128-bit exports per clock. These operations take a lot of bandwidth so speeding these up, can leave bandwidth for other ops.

Memory Hierarchy 38 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Fully associative multi-level texture cache Each of these texture caches is connected to a memory crossbar data is allocated dynamically across the four memory controllers to maintain good locality of reference. Fully associative texture Z/stencil cache Double-sided hierarchical Z/stencil buffer Memory read/write cache for improved stream output performance

39 General Interconnect SIMD Cores Memory Pipeline FetchMultithread tech.net/content_images/2008 /09/ati-radeon architecture-review/tex.jpg

40 General Interconnect SIMD Cores Memory Pipeline FetchMultithread tech.net/content_images/20 08/09/ati-radeon architecture- review/cache.jpg

Memory Hierarchy 41 General Interconnect SIMD Cores Memory Pipeline FetchMultithread One than one shared cache is needed for such a large GPU. there are global data stores for larger tasks than cannot be handled in one SIMD There is enough bandwidth available inside the chip to move data between caches on other chips

Memory Hierarchy 42 General Interconnect SIMD Cores Memory Pipeline FetchMultithread The four memory controllers are each associated with a L2 cache hub is connected directly to lower-bandwidth interfaces like PCI-Express the display controllers UVD video engine CrossFireX interconnect Total bandwidth improvement for new GPUs is about 10%

Memory Hierarchy 43 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD increased texture fetch bandwidth up to 480GB/sec if only using L1 texture caches up to 384GB/sec if using L1 and L2 caches. AMD has added a register overflow prevent shader processors from stalling often when they’re waiting for data return from another part of the GPU

Memory Hierarchy 44 General Interconnect SIMD Cores Memory Pipeline FetchMultithread ATI has changed the memory layout no longer a horseshoe now laid out in an L-shape GPU uses eight 64MB Qimonda 1.0ns GDDR3 memory chips total of 512MB of on-board memory Chips have been rotated by 90 degrees on new GPUs more space in between the DRAMs and the power phases

Memory Hierarchy 45 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD stayed with 256-bit memory interface on Radeon R700. Adding bus width increases the required number of pins the size of the chip AMD wanted to avoid the increases

On-Chip Interconnect 46 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Data transfer at115GB/sec On a 256-bit memory bus Data request bus can access vertex texture cache CI Express 2.0 x16 bus interface

47 General Interconnect SIMD Cores Memory Pipeline FetchMultithread

48 General Interconnect SIMD Cores Memory Pipeline FetchMultithread

On-Chip Interconnect 49 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Built-In CrossfireX Technology enables two or more discrete graphics processors to work together to improve system performance Can scale up performance using up to 4 GPUs

50 On-Chip Interconnect General Interconnect SIMD Cores Memory Pipeline FetchMultithread CrossfireX Sideports The new ATI Radeon HD 4870 X2 (most advanced model) has sideports allows direct data-exchange between GPUs by-passes the bridge and motherboard

On-Chip Interconnect 51 General Interconnect SIMD Cores Memory Pipeline FetchMultithread AMD hasn’t changed the way power is delivered to the card 75W of the board’s power is supplied via the PCI-Express interconnect retain backwards compatibility

Multithreading Organization 52 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Unified Superscalar Shader Architecture Dynamic load balancing and resource allocation for vertex, geometry, and pixel shaders Common instruction set and texture unit access supported for all types of shaders Dedicated branch execution units and texture address processors

Multithreading Organization 53 General Interconnect SIMD Cores Memory Pipeline FetchMultithread Dynamic Geometry Acceleration High performance vertex cache Programmable tessellation unit Accelerated geometry shader path for geometry amplification Memory read/write cache for improved stream output performance