Published by Howard Farmer. Modified over 9 years ago.
1
Computer Science and Engineering Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11
2
Computer Science and Engineering Contents 1. Multi-Core: Why now? A Paradigm Shift. Multi-Core Architecture. 2. Case Studies: IBM Cell, Intel Core 2 Duo, AMD.
3
Computer Science and Engineering The Path to Multi-Core
4
Computer Science and Engineering Background
Wafer: Thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed.
Die Size: The die size of the processor refers to its physical surface area on the wafer. It is typically measured in square millimeters (mm^2). In essence a "die" is really a chip. The smaller the chip, the more of them can be made from a single wafer.
Circuit Size: The level of miniaturization of the processor. In order to pack more transistors into the same space, they must be continually made smaller and smaller. Measured in microns (µm) or nanometers (nm).
5
Computer Science and Engineering Examples
386C: die size 42 mm^2, 1.0 µm technology, 275,000 transistors
486C: die size 90 mm^2, 0.7 µm technology, 1.2 million transistors
Pentium: die size 148 mm^2, 0.5 µm technology, 3.2 million transistors
Pentium III: die size 106 mm^2, 0.18 µm technology, 28 million transistors
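To make the effect of miniaturization concrete, the example figures can be turned into an average transistor density. This is only a rough sketch: the numbers are copied from the slide, and the helper name `density_per_mm2` is made up for illustration.

```python
# Figures copied from the Examples slide: (die area in mm^2, transistor count).
CHIPS = {
    "386C": (42, 275_000),
    "486C": (90, 1_200_000),
    "Pentium": (148, 3_200_000),
    "Pentium III": (106, 28_000_000),
}


def density_per_mm2(area_mm2, transistors):
    """Average number of transistors per square millimeter of die."""
    return transistors / area_mm2


if __name__ == "__main__":
    for name, (area, count) in CHIPS.items():
        # Density rises sharply as the process technology shrinks.
        print(f"{name:12s} {density_per_mm2(area, count):>12,.0f} transistors/mm^2")
```

The Pentium III is smaller than the Pentium in area yet holds far more transistors, which is the point of shrinking the circuit size.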
6
Computer Science and Engineering Pentium III (0.18 µm process technology) Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
7
Computer Science and Engineering
8
nm Process Technology
Technology (nm): 90, 65, 45, 32, 22
Integration Capacity (BT, billions of transistors): 2, 4, 8, 16, 32
9
Computer Science and Engineering Increasing Die Size Using the same technology, increasing the die size 2-3X yields only 1.5-1.7X in performance. Power is proportional to die area * frequency. We cannot produce microprocessors with ever-increasing die size: the constraint is POWER.
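The die-size numbers are roughly what a square-root relationship between area and performance would predict (often called Pollack's rule). The sketch below assumes that square-root model, which the slide does not state explicitly, and uses the slide's own "power is proportional to die area * frequency" for the power side. Function names are illustrative.

```python
import math


def pollack_speedup(area_ratio):
    """Estimated performance gain for a given die-area ratio.

    ASSUMPTION: performance scales with the square root of die area.
    """
    return math.sqrt(area_ratio)


def relative_power(area_ratio, freq_ratio=1.0):
    """Relative power, using the slide's 'power ~ die area * frequency'."""
    return area_ratio * freq_ratio


if __name__ == "__main__":
    for ratio in (2.0, 3.0):
        print(f"{ratio:.0f}x area: ~{pollack_speedup(ratio):.2f}x performance, "
              f"{relative_power(ratio):.0f}x power at the same frequency")
```

Under this model, power grows linearly with area while performance grows much more slowly, which is why die size cannot keep increasing.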
10
Computer Science and Engineering Reducing Circuit Size Reducing circuit size in particular is key to reducing the size of the chip. The first-generation Pentium used a 0.8 micron circuit size and required 296 square millimeters per chip. The second-generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
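As a sanity check on the Pentium example, die area under a pure geometric shrink scales with the square of the feature-size ratio. The sketch below assumes nothing but that idealized scaling; `ideal_shrunk_area` is a hypothetical helper.

```python
def ideal_shrunk_area(area_mm2, old_feature_um, new_feature_um):
    """Die area after a pure geometric shrink (area ~ feature size squared)."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2


if __name__ == "__main__":
    estimate = ideal_shrunk_area(296, 0.8, 0.6)
    print(f"ideal area at 0.6 micron: {estimate:.1f} mm^2 (slide reports 148 mm^2)")
```

The idealized estimate is somewhat larger than the 148 mm^2 quoted, so the real chip presumably benefited from more than the linear shrink alone.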
11
Computer Science and Engineering Shrinking transistors by 30% every generation: transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases. Gate thickness cannot keep on shrinking, so frequency increases slow down and threshold voltage reductions get smaller.
12
Computer Science and Engineering Processor Evolution: Generation i (0.5 µm, for example) to Generation i+1 (0.35 µm, for example). Gate delay reduces by 1/√2 (frequency up by √2). Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches). Additional transistors enable an additional increase in performance. Result: 2x performance at roughly equal cost.
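The generation-to-generation numbers all follow from a 0.7x linear shrink (0.5 µm to 0.35 µm). The sketch below assumes idealized scaling, and assumes that the factor lost from the slide text is √2; the names are illustrative.

```python
import math

SHRINK = 0.7  # linear feature-size factor per generation (a 30% shrink)


def scale_generation(feature_um):
    """Return idealized figures for the next process generation."""
    return {
        "feature_um": feature_um * SHRINK,
        "density_gain": 1 / SHRINK ** 2,   # about 2x transistors per area
        "gate_delay_factor": SHRINK,       # about 1/sqrt(2)
        "frequency_gain": 1 / SHRINK,      # about sqrt(2)
    }


if __name__ == "__main__":
    nxt = scale_generation(0.5)
    print(f"feature size: {nxt['feature_um']:.2f} um")
    print(f"density gain: {nxt['density_gain']:.2f}x")
    print(f"frequency gain: {nxt['frequency_gain']:.2f}x "
          f"(sqrt(2) = {math.sqrt(2):.2f})")
```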
13
Computer Science and Engineering What happens to power if we hold die size constant at each generation? Allows ~ 100% growth in transistors each generation Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
14
Computer Science and Engineering What happens to die size if we hold power constant at each generation? Die size has to reduce ~25% in area each generation. That leaves only 50% growth in transistors, which limits PERFORMANCE. Power density is still a problem. Source: Fred Pollack, Intel. New Micro-architecture Challenges in the coming Generations of CMOS Process Technologies. Micro32
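The 100% and 50% growth figures can be reproduced with simple arithmetic, assuming transistor density doubles each generation. The function names below are hypothetical.

```python
def transistor_growth(area_factor, density_gain=2.0):
    """Fractional growth in transistor count for a given die-area factor."""
    return area_factor * density_gain - 1.0


def power_density_factor(area_factor, power_factor=1.0):
    """Relative power density when power and die area both change."""
    return power_factor / area_factor


if __name__ == "__main__":
    print(f"constant die size: {transistor_growth(1.0):.0%} more transistors")
    print(f"25% smaller die:   {transistor_growth(0.75):.0%} more transistors")
    print(f"power density at constant power: {power_density_factor(0.75):.2f}x")
```

Even with power held constant, squeezing it into 75% of the area raises power density by about a third, which is why the slide says power density remains a problem.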
15
Computer Science and Engineering Power Density Continues to Soar [Figure: power density chart; annotation: Pentium at 90 W] Source: Intel Developer Forum, Spring 2004, Pat Gelsinger
16
Computer Science and Engineering Business as Usual Won’t Work: Power is a Major Barrier. As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges. Higher costs: thermal packaging, fans, electricity, air conditioning.
17
Computer Science and Engineering A New Paradigm Shift. Old paradigm: performance == improved frequency, unconstrained power, voltage scaling. New paradigm: performance == improved IPC, multi-core, power-efficient microarchitecture advancement.
18
Computer Science and Engineering Multiple CPUs on a Single Chip An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die
19
Computer Science and Engineering Multi-core Gate delay does not reduce much. The frequency and performance of each core is the same or a little less than in the previous generation. [Figure: Generation i cores shown in Technology Generation i and in Technology Generation i+1]
20
Computer Science and Engineering From HT to Many-Core [Figure: increasing HW threads (log scale, 1 to 100) versus year (2003 to 2013): HT, then the Multi-core Era (scalar and parallel applications), then the Many-core Era (massively parallel applications)] Intel predicts 100’s of cores on a chip in 2015.
21
Computer Science and Engineering Multi-cores are Reality [Figure: # of cores] Source: Saman Amarasinghe, MIT (6.189 2007, lecture-1)
22
Computer Science and Engineering Multi-Core Architecture
23
Computer Science and Engineering Multi-core Architecture Multiple cores are being integrated on a single chip and made available for general-purpose computing. Higher levels of integration (multiple processing cores, caches, memory controllers, some I/O processing). Network on Chip (NoC).
24
Computer Science and Engineering Interconnection Networks [Figure: processors (P) and memories (M) in shared-memory and distributed-memory arrangements]
Shared memory: one copy of data shared among multiple cores; synchronization via locking (Intel)
Distributed memory: cores access local data; cores exchange data
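The two styles can be contrasted in a few lines of code. This is a minimal sketch in which Python threads stand in for cores (an assumption for illustration only); the function names are made up. The first version keeps one shared copy of the result and protects it with a lock, the second keeps data local to each worker and exchanges results as messages.

```python
import queue
import threading


def shared_memory_sum(values, workers=4):
    """Shared memory: every worker updates one shared total under a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:               # synchronization via locking
            total += partial

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(values, workers=4):
    """Distributed memory: workers keep local data and send their results."""
    mailbox = queue.Queue()

    def work(chunk):
        mailbox.put(sum(chunk))  # explicit data exchange, no shared total

    threads = [threading.Thread(target=work, args=(values[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(workers))


if __name__ == "__main__":
    data = list(range(1000))
    print(shared_memory_sum(data), message_passing_sum(data))
```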
25
Computer Science and Engineering Memory Access Alternatives
Symmetric Multiprocessors (SMP): shared address space, global memory
Distributed Shared Memory (DSM): shared address space, distributed memory
Message Passing (MP): distributed address space, distributed memory
26
Computer Science and Engineering Network on Chip (NoC) [Figure: a traditional bus (control, data, I/O) compared with a switch network]
27
Computer Science and Engineering Shared Memory [Figure: three organizations of processors (P) with primary caches (PC) and secondary caches (SC): shared global memory, shared secondary cache, and shared primary cache]
28
Computer Science and Engineering General Architecture
Conventional microprocessor: one CPU core (registers, L1 I$, L1 D$, L2 cache) connected to main memory and I/O
Multiple cores: CPU cores 1 to N, each with its own registers, L1 I$, L1 D$, and L2 cache, connected to main memory and I/O
29
Computer Science and Engineering General Architecture (cont)
Shared cache: CPU cores 1 to N, each with its own registers, L1 I$, and L1 D$, sharing one L2 cache connected to main memory and I/O
Multithreaded shared cache: CPU cores 1 to N, each with multiple register sets (regs), L1 I$, and L1 D$, sharing one L2 cache connected to main memory and I/O
30
Computer Science and Engineering “Case Studies”
31
Computer Science and Engineering Case Study 1: “IBM’s Cell Processor”
32
Computer Science and Engineering Cell Highlights Supercomputer on a chip. Multi-core microprocessor (9 cores). >4 GHz clock frequency. 10X performance for many applications.
33
Computer Science and Engineering Key Attributes
Cell is Multi-core
- Contains 64-bit Power Architecture
- Contains 8 synergistic processor elements
Cell is a Broadband Architecture
- SPE is RISC architecture with SIMD organization and local store
- 128+ concurrent transactions to memory per processor
Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caching (via replacement management table)
Cell is a Security Enabled Architecture
- Isolate SPE for flexible security programming
34
Computer Science and Engineering Cell Processor Components
35
Computer Science and Engineering Cell BE Processor Block Diagram
36
Computer Science and Engineering POWER Processing Element (PPE) POWER Processing Unit (PPU) connected to a 512KB L2 cache. Responsible for running the OS and coordinating the SPEs. Key design goals: maximize the performance/power ratio as well as the performance/area ratio. Dual-issue, in-order processor with dual-thread support. Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions.
37
Computer Science and Engineering Synergistic Processing Elements (SPE) Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). Compute engine with SIMD support and 256KB of dedicated local storage. The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synch operations with other SPUs and the PPU.
38
Computer Science and Engineering SPE (cont.) Each SPE operates directly on instructions and data from its dedicated local store. It relies on a channel interface to access main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with program execution. SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision.
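The peak figures quoted for one SPU can be reproduced from the 128-bit register width and the clock rate. The sketch below is an illustration with hypothetical helper names; it assumes the single-precision number counts a fused multiply-add as two floating-point operations per lane per cycle, which the slide does not state.

```python
REGISTER_BITS = 128   # SPE register width, from the SPE slide
CLOCK_HZ = 3.2e9      # clock rate used on the slide


def simd_lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of elements of a given width that fit in one SIMD register."""
    return register_bits // element_bits


def peak_ops_per_second(lanes, ops_per_lane_per_cycle=1, clock_hz=CLOCK_HZ):
    """Peak operations per second for one SPU."""
    return clock_hz * lanes * ops_per_lane_per_cycle


if __name__ == "__main__":
    int8_ops = peak_ops_per_second(simd_lanes(8))
    # ASSUMPTION: a multiply-add counts as 2 floating-point operations.
    sp_flops = peak_ops_per_second(simd_lanes(32), ops_per_lane_per_cycle=2)
    print(f"8-bit integer: {int8_ops / 1e9:.1f} billion ops/s")
    print(f"single precision: {sp_flops / 1e9:.1f} GFLOPS")
```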
39
Computer Science and Engineering Four Levels of Parallelism
Blade level: 2 Cell processors per blade
Chip level: 9 cores
Instruction level: dual-issue pipelines on each SPE
Register level: native SIMD on SPE and PPE VMX
40
Computer Science and Engineering Cell Chip Floor plan
41
Computer Science and Engineering Element Interconnect Bus (EIB)
Implemented as a ring
Interconnects 12 elements:
- 1 PPE with 51.2 GB/s aggregate bandwidth
- 8 SPEs, each with 51.2 GB/s aggregate bandwidth
- MIC: 25.6 GB/s of memory bandwidth
- 2 IOIF: 35 GB/s (out), 25 GB/s (in) of I/O bandwidth
Supports two transfer modes:
- DMA between SPEs
- MMIO/DMA between PPE and system memory
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
42
Computer Science and Engineering Element Interconnect Bus (EIB)
An EIB consists of the following:
1. Four 16-byte-wide rings (two in each direction)
1.1 Each ring is capable of handling up to 3 concurrent non-overlapping transfers
1.2 Supports up to 12 data transfers at a time
2. A shared command bus
2.1 Distributes commands
2.2 Sets up end-to-end transactions
2.3 Handles coherency
3. A central data arbiter to connect the 12 Cell elements
3.1 Implemented in a star-like structure
3.2 It controls access to the EIB data rings on a per-transaction basis
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007
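The EIB figures can be cross-checked with a little arithmetic. The ring count, transfers per ring, and 16-byte width come from the slides; the sketch additionally assumes that the bus runs at half of a 3.2 GHz core clock and that each element has one 16-byte port per direction, neither of which is stated here. Names are illustrative.

```python
RINGS = 4                 # from the slide
TRANSFERS_PER_RING = 3    # from the slide
PORT_WIDTH_BYTES = 16     # ring width, from the slide
CORE_CLOCK_HZ = 3.2e9
BUS_CLOCK_HZ = CORE_CLOCK_HZ / 2   # ASSUMPTION: EIB at half the core clock


def max_concurrent_transfers():
    """Upper bound on simultaneous data transfers across all rings."""
    return RINGS * TRANSFERS_PER_RING


def port_bandwidth_gb_s(directions=2):
    """Aggregate bandwidth of one element's port, in GB/s.

    ASSUMPTION: one 16-byte transfer per bus cycle in each direction.
    """
    return directions * PORT_WIDTH_BYTES * BUS_CLOCK_HZ / 1e9


if __name__ == "__main__":
    print(f"concurrent transfers: {max_concurrent_transfers()}")
    print(f"per-element aggregate bandwidth: {port_bandwidth_gb_s():.1f} GB/s")
```

Under these assumptions the per-element result matches the 51.2 GB/s aggregate figure quoted for the PPE and each SPE.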
43
Computer Science and Engineering Element Interconnect Bus (EIB)
44
Computer Science and Engineering Cell Manufacturing Parameters About 234 million transistors (compared with 125 million for Pentium 4), running at more than 4.0 GHz. Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters. The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process. In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production.
45
Computer Science and Engineering Cell Power Consumption Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz. Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz.
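These data points show power rising faster than frequency. As a rough illustration, the sketch below fits the exponent k in a model of the form P ~ f^k to the end points and computes the share of chip power taken by the eight SPEs. The power-law model and the helper names are assumptions for illustration only.

```python
import math

# Power figures from the slide, keyed by clock frequency in GHz.
SPE_POWER_W = {2.0: 1.0, 3.0: 2.0, 4.0: 4.0}
CHIP_POWER_W = {2.0: 15.0, 3.0: 30.0, 4.0: 60.0}


def scaling_exponent(f1, p1, f2, p2):
    """Exponent k in P ~ f**k implied by two (frequency, power) points."""
    return math.log(p2 / p1) / math.log(f2 / f1)


def spe_share(freq_ghz, spe_count=8):
    """Fraction of total chip power consumed by the SPEs."""
    return spe_count * SPE_POWER_W[freq_ghz] / CHIP_POWER_W[freq_ghz]


if __name__ == "__main__":
    k = scaling_exponent(2.0, SPE_POWER_W[2.0], 4.0, SPE_POWER_W[4.0])
    print(f"doubling frequency quadruples SPE power (k = {k:.1f})")
    for f in sorted(CHIP_POWER_W):
        print(f"{f:.0f} GHz: SPEs use {spe_share(f):.0%} of chip power")
```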
46
Computer Science and Engineering Cell Power Management Dynamic Power Management (DPM) Five Power Management States One linear sensor Ten digital thermal sensors
47
Computer Science and Engineering Case Study 2: “Intel’s Core 2 Duo ”
48
Computer Science and Engineering Intel Core 2 Duo Highlights Multi-core microprocessor (2 cores). Clock frequency ranges from 1.5 to 3 GHz. 2X performance for many applications. Dedicated level 1 cache and shared level 2 cache. The shared L2 cache comes in two flavors, 2MB and 4MB, depending on the model. It supports a 64-bit architecture.
49
Computer Science and Engineering Intel Core 2 Duo Block Diagram Dedicated L1$ Shared L2$ The two cores exchange data implicitly through the shared level 2 cache
50
Computer Science and Engineering Intel Core 2 Duo Architecture Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to system memory. Without a shared cache, Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory. With the shared cache, only one copy needs to be retrieved.
51
Computer Science and Engineering Intel’s Core 2 Duo Manufacturing Parameters About 291 million transistors. Compared to Cell’s 221 square millimeters, Core 2 Duo has a smaller die size, between 107 and 143 square millimeters depending on the model. The current Intel process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors.
52
Computer Science and Engineering Intel Core 2 Duo Power Consumption Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model. Assuming a 75 W processor model (Conroe is 65 W), it will cost about $4 to keep the computer up for a whole month.
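The monthly cost estimate is easy to reproduce. The slide does not give an electricity price, so the rate below is a hypothetical value chosen to land near the quoted $4; the function names are also illustrative.

```python
PRICE_PER_KWH = 0.075  # HYPOTHETICAL electricity price in dollars per kWh


def monthly_energy_kwh(power_w, hours_per_day=24, days=30):
    """Energy used by a device running continuously for a month, in kWh."""
    return power_w * hours_per_day * days / 1000.0


def monthly_cost(power_w, price_per_kwh=PRICE_PER_KWH):
    """Electricity cost in dollars for one month of continuous operation."""
    return monthly_energy_kwh(power_w) * price_per_kwh


if __name__ == "__main__":
    for watts in (65, 75, 130):
        print(f"{watts:3d} W: {monthly_energy_kwh(watts):5.1f} kWh, "
              f"${monthly_cost(watts):.2f} per month")
```

Only the processor's own draw is counted here; the rest of the system would add to the bill.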
53
Computer Science and Engineering Intel Core 2 Duo Power Management It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements). Aggressive clock gating. Enhanced SpeedStep. Low-VCC arrays. Blocks controlled via sleep transistors. Low-leakage transistors.
54
Computer Science and Engineering Case Study 3: “AMD’s Quad-Core Processor (Barcelona) ”
55
Computer Science and Engineering AMD Quad-Core Highlights
Designed to enable simultaneous 32- and 64-bit computing
- Minimizes the cost of transition and maximizes current investments
Integrated DDR2 Memory Controller
- Increases application performance by dramatically reducing memory latency
- Scales memory bandwidth and performance to match compute needs
HyperTransport Technology
- Provides up to 24.0 GB/s peak bandwidth per processor, reducing I/O bottlenecks
56
Computer Science and Engineering AMD Quad-Core Block Diagram Dedicated L1$ and L2$ Shared L3$
57
Computer Science and Engineering AMD Quad-Core Architecture It has a crossbar switch instead of the usual bus used in dual-core processors, which lowers the probability of memory access collisions. The L3$ alleviates memory access latency, since the higher number of cores makes memory accesses more frequent.
58
Computer Science and Engineering AMD Quad-Core Architecture (cont)
Replacement policies:
- L1, L2: pseudo-LRU
- L3: sharing-aware pseudo-LRU
Cache hierarchy:
- Dedicated L1 cache: 2-way associative, 8 banks (each 16B wide)
- Dedicated L2 cache: 16-way associative, victim cache, exclusive w.r.t. L1
- Shared L3 cache: 32-way associative, victim cache, partially exclusive w.r.t. L2; fills from L3 leave likely shared lines in L3; sharing-aware replacement policy
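The slides do not say how the pseudo-LRU policy is implemented. As a generic illustration only, and not AMD's actual design, the sketch below shows the common tree-based pseudo-LRU scheme for a single 4-way set, which needs 3 bits instead of full LRU ordering. The class name `TreePLRU4` is made up.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way cache set (3 bits)."""

    def __init__(self):
        # bits[0]: 0 -> next victim is in ways 0-1, 1 -> in ways 2-3
        # bits[1]: which of ways 0/1 to evict next
        # bits[2]: which of ways 2/3 to evict next (as an offset from way 2)
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        if way < 2:
            self.bits[0] = 1
            self.bits[1] = 1 - way
        else:
            self.bits[0] = 0
            self.bits[2] = 1 - (way - 2)

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


if __name__ == "__main__":
    plru = TreePLRU4()
    for way in (0, 1, 2, 3):
        plru.touch(way)
    print("victim after touching 0,1,2,3:", plru.victim())
    plru.touch(0)
    print("victim after touching 0 again:", plru.victim())
```

After the second step, true LRU would evict way 1, while the tree picks way 2. That difference is the approximation that makes the policy "pseudo" LRU, in exchange for much less state per set.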
59
Computer Science and Engineering AMD Quad-Core Manufacturing Parameters The current AMD process technology for Quad-Core is 65 nm. It comprises approximately 463M transistors (about 119M fewer than Intel’s quad-core Kentsfield). It has a die size of 285 square millimeters (compared to Cell’s 221 square millimeters).
60
Computer Science and Engineering AMD Quad-Core Power Consumption Power consumption in AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W to 130 W for Intel’s Core 2 Duo), depending on the model. AMD CoolCore Technology reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power. Power can be switched on or off within a single clock cycle, saving energy with no impact on performance.
61
Computer Science and Engineering AMD Quad-Core Power Management Native quad-core technology enables enhanced power management across all four cores