1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

Presentation transcript:

1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache. Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara, CA), ISSCC 2006. Instructor: Dr. S. M. Fakhraie. Provided by: Nayere Ghobadi. Fall 2006, Advanced VLSI Class Presentation.

2 Outline: Multi-core processors; Cache; Xeon processors; Dual-Core Multi-Threaded Xeon processor features; 16MB L3 cache; Clock generation and distribution; Voltage supplies; Processor package; Front-side bus (FSB); Protection; Temperature sensing; Summary and conclusion.

3 Multi-core Processors. A multi-core processor combines two or more independent processor cores in a single package, often on a single IC. Such processors exhibit some form of thread-level parallelism (TLP). Figure: diagram of an Intel Core 2 dual-core processor (from [6]).

4 Multi-core Processors Cont. Advantages: 1. Signals between cores do not have to travel off-chip, so cache-coherency circuitry can operate at a much higher clock rate. 2. A multi-core design requires much less space than a multi-chip design. 3. It consumes slightly less power than two coupled single-core processors.

5 Multi-core Processors Cont. Disadvantages: 1. In addition to OS support, existing software must be adapted to make full use of the computing resources that multi-core processors provide. 2. Multi-core dies drive production yields down and are more difficult to manage thermally.

6 Cache. A cache is a temporary storage area where frequently accessed data is kept for rapid access. If the processor finds the desired memory location in the cache, the access is a cache hit; otherwise it is a cache miss. The proportion of accesses that result in a cache hit is the hit rate. Figure: diagram of a CPU memory cache (from [6]).
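
The hit-rate definition on this slide can be illustrated with a short Python sketch. It assumes a fully associative cache with least-recently-used (LRU) replacement, which the slide does not specify; the function name `hit_rate` and the address trace are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Return the fraction of accesses that hit in a small LRU cache.

    Assumption: fully associative cache with LRU replacement; the slide
    only defines hit, miss and hit rate, not a replacement policy.
    """
    cache = OrderedDict()  # keys are cached addresses, oldest first
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(addresses) if addresses else 0.0


# Hypothetical access trace: 4 of the 8 accesses hit with 3 entries.
trace = [0, 1, 2, 0, 1, 3, 0, 1]
print(hit_rate(trace, capacity=3))  # 0.5
```

With a capacity of one entry the same trace never hits, which is the hit-rate side of the size trade-off discussed on the next slide.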

7 Cache Cont. Multi-level caches: There is a tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. Many computers therefore use multiple levels of cache, with small, fast caches backed up by larger, slower caches.
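
One common way to quantify this trade-off is the average memory access time (AMAT), which the slides do not mention by name. The sketch below is a minimal illustration; the latencies and hit rates are hypothetical numbers, not measurements of this processor.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_latency, hit_rate) tuples, ordered from the
            cache closest to the core outward.
    memory_latency: latency paid by accesses that miss in every level.
    All values are in the same unit (for example, core clock cycles).
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, rate in levels:
        total += reach * latency
        reach *= 1.0 - rate
    return total + reach * memory_latency


# Hypothetical numbers: a small fast cache alone vs. the same cache
# backed by a larger, slower second level.
print(amat([(2, 0.9)], memory_latency=200))
print(amat([(2, 0.9), (20, 0.8)], memory_latency=200))
```

In this made-up example the second level cuts the average access time from about 22 to about 8 cycles, even though each of its hits is ten times slower than a first-level hit.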

8 Xeon Processors. Xeon is Intel's brand name for its server-class PC microprocessors intended for multiple-processor machines. Xeon processors generally have more cache and support larger multiprocessor configurations than their desktop counterparts. Figure: Xeon processor and logo (from [6]).

9 Dual-Core Multi-Threaded Xeon Processor Features. Two 64b cores, each with: 1. Two threads. 2. A unified 1MB L2 cache. A 16MB unified L3 cache. A simple direct interface between the cores and the front-side bus (FSB) minimizes: 1. L3 cache latency. 2. External bus latency. Figure: block diagram (from [1]).

10 Dual-Core Multi-Threaded Xeon Processor Features Cont. A caching FSB controller handles: 1. Core arbitration. 2. L3 cache accesses. 3. External bus requests. The processor die is 435mm² with 1.328B transistors. It operates at more than 3.0GHz from a 1.25V core supply. Figure: die micrograph (from [1]).

11 Dual-Core Multi-Threaded Xeon Processor Features Cont. The worst-case power dissipation is 165W; power dissipation on a typical server workload is 110W. 65nm process technology. Eight copper interconnect layers. Low-k carbon-doped oxide (k=2.9) inter-level dielectric. Figure: 65nm process technology summary (from [1]).

12 16MB L3 Cache. 6T SRAM cell. Read: 1. Precharge both bitlines high. 2. Raise the wordline. 3. One of the two bitlines is pulled down by the cell. Write: 1. Drive one bitline high and the other low. 2. Raise the wordline. 3. The bitlines overpower the cell with the new value. Figure: 6T memory cell (from [5]).

13 16MB L3 Cache Cont. 256 data sub-arrays (64kB each); each data sub-array stores 32 bits. 32 redundancy sub-arrays (68kB each); each redundancy sub-array stores 34 bits. The cache is composed of 6T memory cells with a size of 0.624µm². The physical address is 40b wide. To reduce active power, only 0.8% of all array blocks are powered up for each cache access. Figure: L3 cache block (from [1]).
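
The figures on this slide are mutually consistent, which a few lines of Python can confirm. The sketch assumes that "kB" and "MB" are binary units (1024-based) and that the 40b physical address is byte-granular; neither is stated on the slide, and the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def l3_data_capacity_bytes(n_subarrays=256, subarray_kib=64):
    """Total data capacity from the sub-array count and size."""
    return n_subarrays * subarray_kib * KIB


def scaled_subarray_kib(base_kib=64, base_bits=32, wide_bits=34):
    """Size of a sub-array that is wide_bits wide instead of base_bits,
    assuming the same number of rows."""
    return base_kib * wide_bits / base_bits


def addressable_bytes(address_bits=40):
    """Address space covered by a byte-granular physical address."""
    return 2 ** address_bits


print(l3_data_capacity_bytes() // MIB)      # 256 x 64 kB, in MB
print(scaled_subarray_kib())                # 34-bit-wide sub-array, in kB
print(addressable_bytes() // (1024 ** 4))   # 40-bit address space, in TB
```

Under these assumptions the 256 data sub-arrays add up to exactly 16MB, and widening a 64kB sub-array from 32 to 34 bits gives the 68kB quoted for the redundancy sub-arrays.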

14 16MB L3 Cache Cont. Sleep circuit. Active mode: virtual V_SS = V_SS, giving full voltage swing. Sleep mode: virtual V_SS = 250mV, reducing leakage by 2X. Shut-off mode: the NMOS shut-off device is turned off and virtual V_SS = V_CC/2, reducing leakage by 4X. Figure: L3 cache sleep circuit and shut-off mode (from [1]).

15 Clock Generation and Distribution. The critical clocking features of this processor are: 1. Multiple clock domains with different frequencies. 2. Dedicated core and uncore voltage domains. Each core and its associated L2 cache has a separate PLL and clock distribution tree. A third PLL generates the uncore half-frequency clock. De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps.

16 Clock Generation and Distribution Cont. System clock (BCLK) = 200MHz. Core clock (MCLK) = BCLK×N. MCLK can be more than 3.0GHz at a 1.25V core supply voltage (V_core). Uncore clock (SCLK) = MCLK/2, running from a separate uncore voltage supply (V_cache). FSB clock (ZCLK) = BCLK×4 (quad pumping). Figure: clock distribution map (from [2]).
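
The clock relationships on this slide can be written out as a small Python sketch. The slide does not give the multiplier N, so `core_ratio=16` below is a hypothetical value picked only because it yields a core clock above 3.0GHz; the function name `clock_plan` is likewise an assumption.

```python
def clock_plan(bclk_mhz=200.0, core_ratio=16):
    """Derive the clock domains (in MHz) from the system clock.

    core_ratio is the multiplier N from MCLK = BCLK x N. The value 16
    is a hypothetical example; the slide does not state N.
    """
    mclk = bclk_mhz * core_ratio  # core clock
    return {
        "BCLK": bclk_mhz,         # system clock
        "MCLK": mclk,
        "SCLK": mclk / 2,         # uncore runs at half the core clock
        "ZCLK": bclk_mhz * 4,     # quad-pumped FSB clock
    }


for name, mhz in clock_plan().items():
    print(f"{name}: {mhz:.0f} MHz")
```

Note that ZCLK depends only on BCLK: 200MHz × 4 = 800MHz, which matches the 800MT/s FSB rate quoted on the front-side bus slide.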

17 Voltage Supplies. Three voltage supplies are used, for: 1. The two cores. 2. The L3 cache together with the associated control logic. 3. The FSB I/O circuits. Level shifters are used between voltage domains. A custom tool checks for the presence and correct connectivity of level shifters on all signals that cross voltage domain boundaries.

18 Voltage Supplies Cont. Figure: voltage domains and power breakdown (from [1]).

19 Processor Package. The processor uses flip-chip, or Controlled Collapse Chip Connection (C4), packaging. The processor die has C4 solder bumps and is attached to a 12-layer (4-4-4) organic package with an integrated heat spreader. The package has 604 pins: 238 are signal pins and the remaining 366 are power and ground. The chip-level power distribution consists of a uniform M8-M7 grid synchronized with the C4 power and ground bump array.

20 Front-Side Bus (FSB). The FSB operates at 800MT/s. A symmetric pre-driver design controls the edge rate to meet timing and signal integrity requirements: 1. The FSB output swing (V_OL to V_OH) is divided into six voltage levels. 2. Each level is driven by an output driver segment with a different R_ON value. 3. When a segment is enabled, it forms a parallel resistance with the previously enabled segments. 4. A new voltage level is generated, creating a staircase-like waveform in every transition.
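
The staircase behaviour in steps 3 and 4 can be sketched numerically. The model below is a deliberate simplification and not the circuit from [1]: it assumes a pull-down driver working against a single termination resistor tied to a termination voltage, and all component values (`vtt`, `r_term`, the six R_ON values) are hypothetical.

```python
def parallel(resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def staircase(vtt, r_term, segment_r_on):
    """Output voltage after each driver segment is switched on.

    Simplified model (an assumption, not the published circuit): the
    enabled pull-down segments form one parallel resistance to ground,
    in series with a termination resistor r_term tied to vtt.
    """
    levels = []
    enabled = []
    for r_on in segment_r_on:
        enabled.append(r_on)  # a new segment joins those already on
        r_pd = parallel(enabled)
        levels.append(vtt * r_pd / (r_pd + r_term))
    return levels


# Hypothetical values: six segments, each with a different R_ON (ohms).
steps = staircase(vtt=1.2, r_term=50.0,
                  segment_r_on=[200.0, 180.0, 160.0, 140.0, 120.0, 100.0])
for i, v in enumerate(steps, start=1):
    print(f"segments enabled: {i}  output: {v:.3f} V")
```

Each newly enabled segment lowers the combined resistance, so the output moves in discrete steps rather than in one sharp edge, which is how the segment timing sets the edge rate.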

21 Front-Side Bus (FSB) Cont. Figure: symmetric I/O pre-driver circuit (from [1]).

22 Protection. Bit interleaving of adjacent cache lines prevents a single upset event from causing multiple bit errors in the same cache line. The L3 data and tag arrays and the L2 data array have error-correction code (ECC) protection; the L2 tag has parity checking. A dynamic 32-entry cache line disable mechanism protects the L3 cache from erratic bits and infant mortality failures.
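
The effect of bit interleaving can be shown with a small Python sketch. The layout below (physical column c holds bit c // n of line c % n) is one common interleaving scheme and is an assumption; the slides do not give the actual column mapping, interleave degree or line width, so the four 4-bit "lines" are purely illustrative.

```python
def interleave(lines, width):
    """Lay out n equal-width lines in one physical row, bit-interleaved.

    Physical column c holds bit (c // n) of line (c % n), so physically
    adjacent cells always belong to different lines.
    """
    n = len(lines)
    row = []
    for bit in range(width):
        for i in range(n):
            row.append((lines[i] >> bit) & 1)
    return row


def deinterleave(row, n):
    """Recover the n lines from an interleaved physical row."""
    lines = [0] * n
    for col, value in enumerate(row):
        lines[col % n] |= value << (col // n)
    return lines


def upset(row, start, length):
    """Flip `length` physically adjacent cells, modelling one upset event."""
    damaged = list(row)
    for col in range(start, min(start + length, len(row))):
        damaged[col] ^= 1
    return damaged


# Hypothetical example: four 4-bit lines, one event flips 3 adjacent cells.
original = [0b1010, 0b0110, 0b1111, 0b0001]
damaged = deinterleave(upset(interleave(original, 4), start=5, length=3), 4)
flips = [bin(a ^ b).count("1") for a, b in zip(original, damaged)]
print(flips)  # [0, 1, 1, 1]
```

As long as the upset spans no more cells than the interleave degree, every line sees at most one flipped bit, which is the case a single-error-correcting ECC can repair.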

23 Temperature Sensing. Three diodes are used for temperature sensing. One diode in each core is routed to an on-package temperature-monitor chip and provides temperature data to the system for fan speed control. One diode between the two cores is routed to pins for system use. In addition, a temperature sensor near the hot spot in each core provides a digital temperature readout that is used in conjunction with operating-system power-state requests to make informed throttle and boost decisions.

24 Summary and Conclusion. Dual-core multi-threaded Xeon processor in 65nm process technology. The processor is flip-chip (C4). It has two 64b cores; each core has two threads and a unified 1MB L2 cache. It has a 16MB unified L3 cache. It operates at more than 3.0GHz from a 1.25V core supply. It uses three voltage supplies. The processor FSB operates at 800MT/s.

25 References [1] S. Rusu, S. Tam, “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 118, [2] S. Tam, J. Leung, “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 382, [3] “Dual-Core Intel® Xeon® Processor 7100 Series Datasheet”, Intel Corporation, September [4] S. Tam, et al., “Clock Generation and Distribution for the Third Generation Itanium® Processor”, Symp. VLSI Circuits, pp. 9-12, Jun., [5] N. H. E. Weste, D. Harris, “CMOS VLSI Design”, Pearson Education Inc., [6] Wikipedia, The free encyclopedia. Available: