Computer Science and Engineering Advanced Computer Architecture CSE 8383 April 17, 2008 Session 11.



Contents
1. Multi-Core
   - Why now?
   - A Paradigm Shift
   - Multi-Core Architecture
2. Case Studies
   - IBM Cell
   - Intel Core 2 Duo
   - AMD

The Path to Multi-Core

Background
Wafer: Thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed.
Die size: The physical surface area of the processor on the wafer, typically measured in square millimeters (mm^2). In essence, a "die" is a chip; the smaller the chip, the more of them can be made from a single wafer.
Circuit size: The level of miniaturization of the processor. To pack more transistors into the same space, they must be made continually smaller. Measured in microns (µm) or nanometers (nm).

Examples
- 386C: die size 42 mm^2, 1.0 µm technology, 275,000 transistors
- 486C: die size 90 mm^2, 0.7 µm technology, 1.2 million transistors
- Pentium: die size 148 mm^2, 0.5 µm technology, 3.2 million transistors
- Pentium III: die size 106 mm^2, 0.18 µm technology, 28 million transistors
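As a rough illustration of the examples above, the following Python sketch computes the average transistor density (transistors per mm^2) from the listed die sizes and transistor counts. The dictionary layout and the function name are illustrative choices, not part of the original slides.

```python
# Die size (mm^2) and transistor count, as listed on the slide.
CHIPS = {
    "386C": (42, 275_000),
    "486C": (90, 1_200_000),
    "Pentium": (148, 3_200_000),
    "Pentium III": (106, 28_000_000),
}


def density_per_mm2(die_mm2, transistors):
    """Return the average number of transistors per square millimeter."""
    return transistors / die_mm2


for name, (die_mm2, transistors) in CHIPS.items():
    density = density_per_mm2(die_mm2, transistors)
    print(f"{name}: {density:,.0f} transistors/mm^2")
```

The Pentium III packs far more transistors into a smaller die than the Pentium, which is the effect of the smaller circuit size.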

Pentium III (0.18 µm process technology)
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies. Micro32

[Figure: Integration Capacity (BT) versus Process Technology (nm)]

Increasing Die Size
- Using the same technology, increasing the die size 2-3X → X in performance
- Power is proportional to die area * frequency
- We cannot produce microprocessors with ever-increasing die size. The constraint is POWER

Reducing Circuit Size
Reducing circuit size in particular is key to reducing the size of the chip. The first-generation Pentium used a 0.8 micron circuit size and required 296 square millimeters per chip. The second-generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
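A quick check of these figures, assuming ideal scaling in which die area shrinks with the square of the circuit size. That assumption is for illustration only; real layouts do not scale perfectly, and the function name is hypothetical.

```python
def scaled_die_area(area_mm2, old_feature_um, new_feature_um):
    """Estimate die area after a shrink, assuming area scales with feature size squared."""
    return area_mm2 * (new_feature_um / old_feature_um) ** 2


ideal = scaled_die_area(296, 0.8, 0.6)
reported = 148  # second-generation Pentium die size quoted on the slide

print(f"ideal scaling estimate: {ideal:.1f} mm^2")
print(f"reported die size:      {reported} mm^2 ({reported / 296:.0%} of the original)")
```

Under this assumption the estimate is about 166.5 mm^2, so the quoted 50% reduction is somewhat better than pure geometric scaling alone would give.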

- Shrink transistors by 30% every generation → transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases.
- Gate thickness cannot keep on shrinking → slowing frequency increase, less threshold voltage reduction.

Processor Evolution
Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example)
- Gate delay reduces by 1/√2 (frequency up by √2)
- Number of transistors in a constant area goes up by 2 (deeper pipelines, ILP, more caches)
- Additional transistors enable an additional increase in performance
- Result: 2x performance at roughly equal cost
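The per-generation numbers follow from a linear shrink of about 0.7x. The sketch below works through that arithmetic for the 0.5 µm to 0.35 µm example. It assumes ideal scaling, where gate delay tracks feature size and density tracks the inverse of area; the function name is illustrative.

```python
import math


def generation_scaling(old_feature_um, new_feature_um):
    """Return (linear shrink, density gain, frequency gain) under ideal scaling."""
    shrink = new_feature_um / old_feature_um   # linear dimension ratio
    density_gain = 1 / shrink ** 2             # transistors per unit area
    frequency_gain = 1 / shrink                # gate delay assumed proportional to feature size
    return shrink, density_gain, frequency_gain


shrink, density_gain, frequency_gain = generation_scaling(0.5, 0.35)
print(f"linear shrink:  {shrink:.2f}x")
print(f"density gain:   {density_gain:.2f}x")
print(f"frequency gain: {frequency_gain:.2f}x (sqrt(2) is about {math.sqrt(2):.2f})")
```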

What happens to power if we hold die size constant at each generation?
Allows ~100% growth in transistors each generation
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies. Micro32

What happens to die size if we hold power constant at each generation?
Die size has to reduce ~25% in area each generation → 50% growth in transistors, which limits PERFORMANCE. Power density is still a problem.
Source: Fred Pollack, Intel. New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies. Micro32
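The two scenarios (constant die size versus constant power) can be compared with a short calculation. It assumes transistor density doubles each generation, as stated earlier in the slides; the function name is illustrative.

```python
def transistor_growth(area_factor, density_gain=2.0):
    """Return the fractional growth in transistor count for one generation.

    area_factor: new die area divided by old die area.
    density_gain: transistors per unit area, new over old (assumed 2x per generation).
    """
    return area_factor * density_gain - 1.0


print(f"constant die size: {transistor_growth(1.00):.0%} growth")
print(f"constant power:    {transistor_growth(0.75):.0%} growth")
```

Shrinking the die by 25% while density doubles leaves 1.5x the transistors, which is the 50% growth quoted on the slide.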

Power Density Continues to Soar (Pentium at 90 W)
Source: Pat Gelsinger, Intel Developer Forum, Spring 2004

Business as Usual Won’t Work: Power is a Major Barrier
- As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges
- Higher costs:
  - Thermal packaging
  - Fans
  - Electricity
  - Air conditioning

A New Paradigm Shift
Old paradigm: Performance == improved frequency, unconstrained power, voltage scaling
New paradigm: Performance == improved IPC, multi-core, power-efficient microarchitecture advancement

Multiple CPUs on a Single Chip
An attractive option for chip designers because of the availability of cores from earlier processor generations, which, when shrunk down to present-day process technology, are small enough for aggregation into a single die.

Multi-core
- Gate delay does not reduce much
- The frequency and performance of each core are the same as, or a little less than, those of the previous generation
[Diagram: Generation i cores on Technology Generation i and on Technology Generation i+1]

From HT to Many-Core
Increasing HW threads: HT → Multi-core Era (scalar and parallel applications) → Many-core Era (massively parallel applications)
Intel predicts 100s of cores on a chip in 2015

Multi-cores are Reality
[Figure: # of cores]
Source: Saman Amarasinghe, MIT ( , lecture-1)

Multi-Core Architecture

Multi-core Architecture
- Multiple cores are being integrated on a single chip and made available for general-purpose computing
- Higher levels of integration:
  - Multiple processing cores
  - Caches
  - Memory controllers
  - Some I/O processing
- Network on Chip (NoC)

Interconnection Networks
[Diagram: processors (P) and memories (M) connected through interconnection networks]
Shared memory (e.g., Intel)
- One copy of data shared among multiple cores
- Synchronization via locking
Distributed memory
- Cores access local data
- Cores exchange data
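The difference between the two models can be sketched in Python. This is only an illustration of the programming styles: Python threads stand in for cores, and the constants and function names are hypothetical. The first function keeps one shared copy of the data guarded by a lock; the second keeps data local to each worker and exchanges results through a queue.

```python
import queue
import threading

N_WORKERS = 4
INCREMENTS = 1000


def shared_memory_sum():
    """Shared memory: all workers update one copy of the data under a lock."""
    total = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(INCREMENTS):
            with lock:  # synchronization via locking
                total["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum():
    """Distributed memory: each worker keeps local data and sends its result."""
    mailbox = queue.Queue()

    def worker():
        local = 0  # private to this worker
        for _ in range(INCREMENTS):
            local += 1
        mailbox.put(local)  # exchange data explicitly

    threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(N_WORKERS))


print(shared_memory_sum(), message_passing_sum())
```

Both functions produce the same total; what differs is where the data lives and how the workers coordinate.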

Memory Access Alternatives
- Symmetric Multiprocessors (SMP)
- Message Passing (MP)
- Distributed Shared Memory (DSM)
Classification:
- Global memory, shared address space: SMP (Symmetric Multiprocessors)
- Distributed memory, shared address space: DSM (Distributed Shared Memory)
- Distributed memory, distributed address space: MP (Message Passing)

Network on Chip (NoC)
[Diagram: traditional bus (control, data, I/O) versus switch network]

Shared Memory
[Diagram: three organizations of processors (P), primary caches (PC), and secondary caches (SC): shared global memory, shared secondary cache, and shared primary cache]

General Architecture
[Diagram: conventional microprocessor (one CPU core with registers, L1 I$, L1 D$, L2 cache, main memory, I/O) versus multiple cores (CPU core 1 to CPU core N, each with registers, L1 I$, L1 D$, and L2 cache, in front of main memory and I/O)]

General Architecture (cont)
[Diagram: shared cache (CPU core 1 to CPU core N, each with registers, L1 I$, and L1 D$, sharing an L2 cache, main memory, and I/O) versus multithreaded shared cache (cores with multiple register sets sharing an L2 cache, main memory, and I/O)]

“Case Studies”

Case Study 1: “IBM’s Cell Processor”

Cell Highlights
- Supercomputer on a chip
- Multi-core microprocessor (9 cores)
- >4 GHz clock frequency
- 10X performance for many applications

Key Attributes
Cell is Multi-core
- Contains 64-bit Power Architecture
- Contains 8 synergistic processor elements
Cell is a Broadband Architecture
- SPE is a RISC architecture with SIMD organization and local store
- Concurrent transactions to memory per processor
Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Cache locking (via replacement management table)
Cell is a Security-Enabled Architecture
- Isolate SPE for flexible security programming

Cell Processor Components

Cell BE Processor Block Diagram

POWER Processing Element (PPE)
- POWER Processing Unit (PPU) connected to a 512KB L2 cache
- Responsible for running the OS and coordinating the SPEs
- Key design goals: maximize the performance/power ratio as well as the performance/area ratio
- Dual-issue, in-order processor with dual-thread support
- Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions

Synergistic Processing Elements (SPE)
- Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations
- Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
- Compute engine with SIMD support and 256KB of dedicated local storage
- The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synchronization operations with other SPUs and the PPU

SPE (cont.)
- Each SPE operates directly on instructions and data from its dedicated local store
- SPEs rely on a channel interface to access main memory and other local stores
- The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with program execution
- SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle
- At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations per second, or 25.6 GFLOPS in single precision
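The peak figures can be reproduced from the 128-bit register width and the 3.2GHz clock. The sketch below is a back-of-the-envelope calculation, not a performance model. It assumes one SIMD instruction per cycle, and it assumes that a fused multiply-add counts as two floating-point operations; that counting convention is not stated in the slides, and the names are illustrative.

```python
REGISTER_BITS = 128   # SPE register width from the slides
CLOCK_GHZ = 3.2       # clock frequency used on the slide


def simd_lanes(element_bits):
    """Number of elements that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def peak_gops(element_bits, ops_per_lane=1):
    """Peak billions of operations per second for one SPU at CLOCK_GHZ."""
    return simd_lanes(element_bits) * ops_per_lane * CLOCK_GHZ


print(f"8-bit integer ops: {peak_gops(8):.1f} billion/s")
# Assumption: a fused multiply-add is counted as 2 floating-point operations.
print(f"single precision:  {peak_gops(32, ops_per_lane=2):.1f} GFLOPS")
```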

Four Levels of Parallelism
- Blade level: 2 Cell processors per blade
- Chip level: 9 cores
- Instruction level: dual-issue pipelines on each SPE
- Register level: native SIMD on SPE and PPE VMX

Cell Chip Floor Plan

Element Interconnect Bus (EIB)
- Implemented as a ring
- Interconnects 12 elements:
  - 1 PPE with 51.2GB/s aggregate bandwidth
  - 8 SPEs, each with 51.2GB/s aggregate bandwidth
  - MIC: 25.6GB/s of memory bandwidth
  - 2 IOIF: 35GB/s (out), 25GB/s (in) of I/O bandwidth
- Supports two transfer modes:
  - DMA between SPEs
  - MMIO/DMA between PPE and system memory
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007

Element Interconnect Bus (EIB)
An EIB consists of the following:
1. Four 16-byte-wide rings (two in each direction)
   1.1 Each ring is capable of handling up to 3 concurrent non-overlapping transfers
   1.2 Supports up to 12 data transfers at a time
2. A shared command bus
   2.1 Distributes commands
   2.2 Sets up end-to-end transactions
   2.3 Handles coherency
3. A central data arbiter to connect the 12 Cell elements
   3.1 Implemented in a star-like structure
   3.2 Controls access to the EIB data rings on a per-transaction basis
Source: Ainsworth & Pinkston, On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus, 1st International Symp. on NOCS 2007

Element Interconnect Bus (EIB)

Cell Manufacturing Parameters
- About 234 million transistors (compared with 125 million for Pentium 4), running at more than 4.0 GHz
- Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters
- The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process
- In March 2007 IBM announced that the 65 nm version of Cell BE (Broadband Engine) is in production

Cell Power Consumption
- Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz
- Including the eight SPEs, the PPE, and other logic, the Cell processor will dissipate close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
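From these figures one can estimate how much of the chip's power goes to the PPE and other logic at each clock speed. The sketch below simply subtracts the eight SPEs from the approximate chip total; the table layout and function name are illustrative, and the results are only as precise as the rounded numbers on the slide.

```python
# Figures from the slide: clock (GHz) -> (watts per SPE, approximate watts for the whole chip)
POWER = {2: (1, 15), 3: (2, 30), 4: (4, 60)}
N_SPES = 8


def non_spe_power(clock_ghz):
    """Power left for the PPE and other logic after subtracting the eight SPEs."""
    per_spe, chip_total = POWER[clock_ghz]
    return chip_total - N_SPES * per_spe


for ghz in sorted(POWER):
    per_spe, chip_total = POWER[ghz]
    print(f"{ghz} GHz: SPEs {N_SPES * per_spe} W, rest {non_spe_power(ghz)} W, total {chip_total} W")
```

In each case the eight SPEs account for roughly half of the quoted total.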

Cell Power Management
- Dynamic Power Management (DPM)
- Five power management states
- One linear sensor
- Ten digital thermal sensors

Case Study 2: “Intel’s Core 2 Duo”

Intel Core 2 Duo Highlights
- Multi-core microprocessor (2 cores)
- Clock frequency ranges from 1.5 to 3 GHz
- 2X performance for many applications
- Dedicated level 1 cache and shared level 2 cache
- The shared L2 cache comes in two sizes, 2MB and 4MB, depending on the model
- Supports 64-bit architecture

Intel Core 2 Duo Block Diagram
- Dedicated L1$
- Shared L2$
The two cores exchange data implicitly through the shared level 2 cache

Intel Core 2 Duo Architecture
Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory.
- Previously, Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory
- With the shared cache, only one copy needs to be retrieved

Intel’s Core 2 Duo Manufacturing Parameters
- About 291 million transistors
- Compared to Cell’s 221 square millimeters, Core 2 Duo has a smaller die size, between 107 and 143 square millimeters depending on the model
- The current Intel process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors

Intel Core 2 Duo Power Consumption
- Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model
- Assuming you have a 75 W processor model (Conroe is 65 W), it will cost you $4 to keep your computer up for the whole month
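The $4 estimate can be sanity-checked with a short calculation. The slides do not state an electricity rate, so the sketch below derives the rate implied by the quoted cost. It assumes a 30-day month and a processor drawing its full rated power around the clock; both are assumptions for illustration, as are the function names.

```python
def monthly_energy_kwh(watts, hours_per_day=24, days=30):
    """Energy used in one month, in kilowatt-hours, at a constant power draw."""
    return watts * hours_per_day * days / 1000


def implied_price_per_kwh(monthly_cost, watts):
    """Electricity price (per kWh) implied by a monthly cost at a given power draw."""
    return monthly_cost / monthly_energy_kwh(watts)


energy = monthly_energy_kwh(75)
price = implied_price_per_kwh(4.0, 75)
print(f"75 W for a month: {energy:.0f} kWh")
print(f"implied rate:     ${price:.3f} per kWh")
```

A 75 W processor uses 54 kWh in a month, so the $4 figure corresponds to a rate of roughly 7 cents per kWh.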

Intel Core 2 Duo Power Management
- Uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)
- Aggressive clock gating
- Enhanced SpeedStep
- Low VCC arrays
- Blocks controlled via sleep transistors
- Low-leakage transistors

Case Study 3: “AMD’s Quad-Core Processor (Barcelona)”

AMD Quad-Core Highlights
- Designed to enable simultaneous 32- and 64-bit computing
  - Minimizes the cost of transition and maximizes current investments
- Integrated DDR2 memory controller
  - Increases application performance by dramatically reducing memory latency
  - Scales memory bandwidth and performance to match compute needs
- HyperTransport Technology
  - Provides up to 24.0GB/s peak bandwidth per processor, reducing I/O bottlenecks

AMD Quad-Core Block Diagram
- Dedicated L1$ and L2$
- Shared L3$

AMD Quad-Core Architecture
- It has a crossbar switch instead of the usual bus used in dual-core processors
  - This lowers the probability of memory access collisions
- An L3$ alleviates memory access latency, since the higher number of cores means memory is accessed more often

AMD Quad-Core Architecture (cont)
Replacement policies:
- L1, L2: pseudo-LRU
- L3: sharing-aware pseudo-LRU
Cache hierarchy:
- Dedicated L1 cache
  - 2-way associative
  - 8 banks (each 16B wide)
- Dedicated L2 cache
  - 16-way associative
  - Victim cache, exclusive w.r.t. L1
- Shared L3 cache
  - 32-way associative
  - Fills from L3 leave likely shared lines in L3
  - Victim cache, partially exclusive w.r.t. L2
  - Sharing-aware replacement policy
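To make the term pseudo-LRU concrete, here is a minimal sketch of one common variant, tree-based pseudo-LRU, for a single cache set. The slides do not say which pseudo-LRU variant AMD uses, so this is a generic textbook illustration, not AMD's implementation; the class and method names are hypothetical, and the sharing-aware L3 policy is not modeled.

```python
class TreePLRU:
    """Tree-based pseudo-LRU state for one cache set (ways must be a power of two)."""

    def __init__(self, ways):
        if ways < 2 or ways & (ways - 1):
            raise ValueError("ways must be a power of two >= 2")
        self.ways = ways
        # One bit per internal node of a complete binary tree, stored heap-style.
        # Bit 0: the replacement candidate is in the left subtree; bit 1: in the right.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access by pointing every node on the path away from `way`."""
        node, low, high = 0, 0, self.ways
        while high - low > 1:
            mid = (low + high) // 2
            if way < mid:
                self.bits[node] = 1          # accessed left, so the victim side is right
                node, high = 2 * node + 1, mid
            else:
                self.bits[node] = 0          # accessed right, so the victim side is left
                node, low = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way chosen for replacement."""
        node, low, high = 0, 0, self.ways
        while high - low > 1:
            mid = (low + high) // 2
            if self.bits[node] == 0:
                node, high = 2 * node + 1, mid
            else:
                node, low = 2 * node + 2, mid
        return low


plru = TreePLRU(4)
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())  # way 0, the same choice true LRU would make
plru.touch(0)
print(plru.victim())  # way 2, whereas true LRU would pick way 1
```

The scheme needs only ways - 1 bits per set instead of a full ordering, at the cost of occasionally evicting a line that is not the least recently used.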

AMD Quad-Core Manufacturing Parameters
- The current AMD process technology for Quad-Core is 65 nm
- It comprises approximately 463M transistors (about 119M fewer than Intel’s quad-core Kentsfield)
- It has a die size of 285 square millimeters (compared to Cell’s 221 square millimeters)

AMD Quad-Core Power Consumption
- Power consumption in AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W to 130 W for Intel’s Core 2 Duo), depending on the model
- AMD CoolCore Technology
  - Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power
  - Power can be switched on or off within a single clock cycle, saving energy with no impact on performance

AMD Quad-Core Power Management
Native quad-core technology enables enhanced power management across all four cores