AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.

Presentation transcript:

AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used

Outline
- Features
- Block diagram
- Microarchitecture
- Pipeline
- Cache
- Memory controller
- HyperTransport
- InterCPU connections

Features
- 64-bit x86-based microprocessor
- On-chip double-data-rate (DDR) memory controller, for low memory latency
- Three HyperTransport links, which connect to other devices without support chips
- Out-of-order, superscalar processor
- Adds 64-bit addressing (48-bit virtual, 40-bit physical) and expands the number of registers
- Runs legacy 32-bit applications without modification or recompilation
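
To make the addressing figures on this slide concrete, the sketch below computes the address-space sizes implied by 48-bit virtual and 40-bit physical addresses. The helper names are hypothetical. The canonical-address check is not on the slide; it reflects the AMD64 rule that the unused upper bits of a 64-bit virtual address must all equal bit 47.

```python
# Address widths taken from the slide.
VIRTUAL_BITS = 48
PHYSICAL_BITS = 40


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


def is_canonical(addr: int, virtual_bits: int = VIRTUAL_BITS) -> bool:
    """True if bits 63..(virtual_bits - 1) of a 64-bit address are all 0 or all 1.

    This is the AMD64 canonical-form rule (not stated on the slide).
    """
    upper = addr >> (virtual_bits - 1)                # bits 63..47 for 48-bit
    all_ones = (1 << (64 - virtual_bits + 1)) - 1     # 17 one-bits for 48-bit
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    tib = 1 << 40
    print("virtual :", address_space_bytes(VIRTUAL_BITS) // tib, "TiB")
    print("physical:", address_space_bytes(PHYSICAL_BITS) // tib, "TiB")
    print(is_canonical(0x0000_7FFF_FFFF_FFFF), is_canonical(0x0000_8000_0000_0000))
```

So 48 virtual bits give 256 TiB of virtual address space per process, while 40 physical bits cap installed memory at 1 TiB.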

Features (continued)
- Doubles the number of registers:
  - Integer general-purpose registers (GPRs): 16
  - Streaming SIMD Extensions (SSE) registers: 16
- Sixteen registers satisfy the register-allocation needs of more than 80% of the functions in a typical program
- Connected to memory through an integrated memory controller
- High-performance I/O subsystem via the HyperTransport bus

Block diagram

Microarchitecture
- Works with fixed-length micro-ops and dispatches them into two independent schedulers: one for integer, and one for floating point and multimedia (MMX, 3DNow!, SSE, and SSE2)
- Load and store micro-ops go to the load/store unit
- The schedulers can issue up to 11 micro-ops each cycle (3 + 3 + 3 + 2) to the following execution resources:
  - Three integer execution units
  - Three address generation units
  - Three floating-point and multimedia units
  - Two loads/stores to the data cache

Microarchitecture

Pipeline
- Long enough for high frequency, yet short enough for good IPC (instructions per cycle)
- Fully integrated from instruction fetch through DRAM access
- The execute pipeline is typically 12 stages for integer and 17 stages for floating point
- Data cache access occurs in stage 11
- On an L1 cache miss, the pipeline accesses the L2 cache and, in parallel, sends the request to the system request queue
- The pipeline in the DRAM controller runs at the same frequency as the core

Pipeline

Memory, Cache, and HyperTransport

Cache
- Separate L1 instruction and data caches: each is 64 Kbytes, 2-way set-associative, with 64-byte cache lines
- L2 cache (data and instructions): 1 Mbyte, 16-way set-associative, with a pseudo-least-recently-used (pseudo-LRU) replacement policy
- Independent L1 and L2 translation look-aside buffers (TLBs):
  - The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations
  - The L2 TLB is four-way set-associative with Kbyte entries
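
The cache parameters above determine how an address is split into tag, set index, and line offset. The sketch below works that out; the `Cache` class and its method names are hypothetical, and the 64-byte line size for the L2 is an assumption, since the slide gives a line size only for the L1 caches.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cache:
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # Total lines divided by associativity.
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()

    def split(self, addr: int) -> tuple[int, int, int]:
        """Return (tag, set index, byte offset) for an address."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


L1 = Cache(size_bytes=64 * 1024, ways=2, line_bytes=64)
L2 = Cache(size_bytes=1024 * 1024, ways=16, line_bytes=64)  # line size assumed

# L1 TLB reach: how much memory the listed translations cover.
L1_TLB_REACH_4K = 32 * 4 * 1024            # thirty-two 4-Kbyte pages
L1_TLB_REACH_2M = 8 * 2 * 1024 * 1024      # eight 2-Mbyte pages

if __name__ == "__main__":
    print("L1 sets:", L1.sets, "index bits:", L1.index_bits)
    print("L2 sets:", L2.sets, "index bits:", L2.index_bits)
    print("split(0x12345) in L1:", L1.split(0x12345))
```

Under these assumptions each L1 cache has 512 sets (9 index bits, 6 offset bits) and the L2 has 1024 sets. The thirty-two 4-Kbyte L1 TLB entries cover only 128 Kbytes, which is why the large-page entries and the L2 TLB matter for big working sets.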

Onboard Memory Control
- 128-bit memory bus
- Reduces latency and doubles bandwidth
- Multiprocessor systems: each processor has its own memory interface and its own memory
- Available memory scales with the number of processors
- DDR SDRAM only
- Up to 8 registered DDR DIMMs per processor
- Memory bandwidth of up to 5.3 Gbytes/s per processor
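
The 5.3 Gbytes/s figure can be reproduced as bus width times transfer rate. The sketch below assumes DDR333 memory (about 333 million transfers per second), which the slide does not state, and takes a 64-bit interface as the baseline for "bandwidth doubled", which is also an assumption. The function name is hypothetical.

```python
def peak_bandwidth_gbytes_per_s(bus_width_bits: int, megatransfers_per_s: float) -> float:
    """Peak bandwidth = bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * megatransfers_per_s * 1e6 / 1e9


DDR333_MT_PER_S = 333  # assumption: DDR333 DIMMs, not stated on the slide

wide = peak_bandwidth_gbytes_per_s(128, DDR333_MT_PER_S)    # 128-bit bus from the slide
narrow = peak_bandwidth_gbytes_per_s(64, DDR333_MT_PER_S)   # assumed 64-bit baseline

if __name__ == "__main__":
    print(f"128-bit: {wide:.1f} GB/s, 64-bit: {narrow:.1f} GB/s")
```

Because each processor brings its own controller, aggregate peak bandwidth in a multiprocessor system grows with the processor count under this model.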

HyperTransport
- Bidirectional, serial/parallel, scalable, high-bandwidth, low-latency bus
- Packet-based: packets consist of 32-bit words regardless of the link's physical width
- Facilitates power management and low latencies

HyperTransport in the Opteron
- 16-bit-wide CAD HyperTransport links (CAD = Command, Address, Data) for processor-to-processor and processor-to-chipset connections
- Bandwidth of up to 6.4 GB/s per HT port
- 8-bit-wide HyperTransport for components such as standard I/O hubs
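
The per-port figure can be checked with the same width-times-rate arithmetic. The sketch below assumes an 800 MHz link clock with data transferred on both clock edges, and it reads the 6.4 GB/s as the sum of both directions; neither assumption is stated on the slide. The function name is hypothetical.

```python
def ht_bandwidth_gbytes_per_s(link_width_bits: int, clock_mhz: float = 800.0):
    """Return (per-direction, both-directions) peak bandwidth in GB/s.

    Assumes double-data-rate signalling: two transfers per clock cycle.
    """
    transfers_per_s = clock_mhz * 1e6 * 2
    per_direction = link_width_bits / 8 * transfers_per_s / 1e9
    return per_direction, 2 * per_direction


if __name__ == "__main__":
    for width in (16, 8):
        one_way, both = ht_bandwidth_gbytes_per_s(width)
        print(f"{width}-bit link: {one_way:.1f} GB/s each way, {both:.1f} GB/s total")
```

With these assumptions a 16-bit link carries 3.2 GB/s in each direction, 6.4 GB/s in total, and an 8-bit link at the same clock carries half of that.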

InterCPU Connections
- Multiple CPUs are connected through a proprietary extension running on additional HyperTransport interfaces
- This supports a cache-coherent, Non-Uniform Memory Access (ccNUMA) multi-CPU memory access protocol
- Non-Uniform Memory Access:
  - Separate cache memory for each processor
  - Memory access time depends on memory location (local is faster than non-local)
- Cache coherence:
  - Maintains the integrity of data held in the local caches of a shared resource
  - Each CPU can access the main memory of another processor, transparently to the programmer
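
A toy model can illustrate why local memory is faster than non-local memory in this arrangement. Everything specific in the sketch is hypothetical and chosen only for illustration: the four-socket square topology, the contiguous 4 GiB of memory per node, and the latency numbers. It models only the hop count over HyperTransport links, not the coherence protocol.

```python
from collections import deque

# Hypothetical 4-socket system: CPUs at the corners of a square,
# each linked to two neighbours by a HyperTransport link.
LINKS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

GIB = 1 << 30


def hops(src: int, dst: int, links=LINKS) -> int:
    """Fewest HyperTransport links between two CPUs (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("destination CPU is unreachable")


def home_node(phys_addr: int, bytes_per_node: int = 4 * GIB) -> int:
    """CPU whose memory controller owns the address (assumed contiguous map)."""
    return phys_addr // bytes_per_node


def access_latency_ns(cpu: int, phys_addr: int,
                      local_ns: float = 100.0, per_hop_ns: float = 50.0) -> float:
    """Illustrative latency: local DRAM cost plus a fixed cost per hop."""
    return local_ns + per_hop_ns * hops(cpu, home_node(phys_addr))


if __name__ == "__main__":
    for addr in (1 * GIB, 5 * GIB, 13 * GIB):
        print(f"CPU 0 -> node {home_node(addr)}: {access_latency_ns(0, addr):.0f} ns")
```

The program issues the same load in every case. Only the latency differs with the owning node, which is what "transparent to the programmer" means here.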