William Stallings Computer Organization and Architecture 8th Edition Chapter 18 Multicore Computers

Hardware Performance Issues
- Microprocessors have seen an exponential increase in performance, driven by:
  - Improved organization
  - Increased clock frequency
  - Increased parallelism: pipelining, superscalar (multi-issue), simultaneous multithreading (SMT)
- Diminishing returns:
  - More complexity requires more logic
  - Increasing chip area goes to coordination and signal-transfer logic
  - Harder to design, manufacture, and debug
- Note: SMT is superscalar execution with parallel threads filling the issue slots

Alternative Chip Organizations
http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg

Intel Hardware Trends
- Exponential speedup trend
- ILP has come and gone
http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg
http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg

Increased Complexity
- Power requirements grow exponentially with chip density and clock frequency
- Can use more chip area for cache instead
  - Cache is smaller and has roughly an order of magnitude lower power requirements than logic
- Projection: by 2015, 100 billion transistors on a 300 mm² die
  - Cache of 100MB
  - 1 billion transistors for logic
http://techreport.com/r.x/core-i7/die-callout.jpg
http://www.tomshardware.com/reviews/core-duo-notebooks-trade-battery-life-quicker-response,1206-4.html

Power and Memory Considerations
- Memory now accounts for more than 50% of the chip's transistors: is this a RAM or a processor?
[Figure: chip power/area trends, contrasting logic ("more action") with memory ("less action")]

Increased Complexity
- Pollack's rule: performance is roughly proportional to the square root of the increase in complexity
  - Doubling complexity gives about 40% more performance (see the worked figure below)
- Multicore has the potential for near-linear improvement (needs some programming effort and won't work for all problems)
- It is unlikely that one core can use all of a huge cache effectively, so add processing elements (PEs) to make an MPSoC
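
Where the 40% figure comes from, assuming Pollack's rule holds exactly (performance proportional to the square root of complexity):

\frac{\mathrm{perf}(2C)}{\mathrm{perf}(C)} = \frac{\sqrt{2C}}{\sqrt{C}} = \sqrt{2} \approx 1.41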

Chip Utilization of Transistors
[Figure: transistor budget over time, split between CPU logic and cache]

Software Performance Issues
- Performance benefits depend on effective exploitation of parallel resources (obviously)
- Even small amounts of serial code hurt performance (not so obvious)
  - 10% inherently serial code on an 8-processor system gives only 4.7x speedup (Amdahl's law; see the sketch below)
- Many overheads in an MPSoC: communication, distribution of work, cache coherence
- Some applications do effectively exploit multicore processors
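
The 4.7x figure is Amdahl's law evaluated for this case. A minimal C check (the function name is ours; the formula is the standard one):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (serial + (1 - serial) / n) */
static double amdahl(double serial_fraction, int n_processors) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors);
}

int main(void) {
    /* 10% inherently serial code on an 8-processor system */
    printf("speedup = %.1fx\n", amdahl(0.10, 8));  /* prints: speedup = 4.7x */
    return 0;
}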

Effective Applications for Multicore Processors
- Databases (e.g. SELECT *)
- Servers handling independent transactions
- Multi-threaded native applications: Lotus Domino, Siebel CRM
- Multi-process applications: Oracle, SAP, PeopleSoft
- Java applications
  - The Java VM is multi-threaded, with scheduling and memory management (though not so good at SSE, Streaming SIMD Extensions)
  - Sun's Java Application Server, BEA's WebLogic, IBM WebSphere, Tomcat
- Multi-instance applications: one application running multiple times

Multicore Organization
Main design variables:
- Number of cores on the chip (dual, quad, ...)
- Number of levels of cache on the chip (L1, L2, L3, ...)
- Amount of cache that is shared versus dedicated (1MB, 4MB, ...)
The following slide has examples of each organization: ARM11 MPCore, AMD Opteron, Intel Core Duo, Intel Core i7.

Multicore Organization Alternatives
- No shared cache: ARM11 MPCore, AMD Opteron
- Shared cache: Intel Core Duo (shared L2), Intel Core i7 (shared L3)

Advantages of Shared L2 Cache
- Constructive interference reduces the overall miss rate (A fetches X, then B wants X: a hit! See the sketch below)
- Data shared by multiple cores is not replicated at the shared cache level (one copy of X serves both A and B)
- With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
  - Threads with less locality can have more cache
- Easy inter-process communication through shared memory
- Cache coherency is confined to the small L1 caches
- A dedicated L2 cache, by contrast, gives each core more rapid access
  - Good for threads with strong locality
- A shared L3 cache may also improve performance
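
How constructive interference plays out, as a hedged C sketch (a hypothetical workload; actual hit rates depend on the machine): two threads scan the same read-only table, and on a shared-L2 design the lines one thread fetches become hits for the other, while dedicated L2s would each take their own misses.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

#define N (1 << 19)              /* 2 MB of ints: too big for L1, fits a shared L2 */
static int table[N];

/* Both threads read the same table. With a shared L2, the second
 * reader largely hits on lines the first reader brought in
 * (constructive interference); with dedicated L2s, each core
 * misses and fills its own copy of every line. */
static void *scan(void *arg) {
    (void)arg;
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += table[i];
    return (void *)(intptr_t)sum;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, scan, NULL);
    pthread_create(&b, NULL, scan, NULL);
    void *ra, *rb;
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    printf("sums: %ld %ld\n", (long)(intptr_t)ra, (long)(intptr_t)rb);
    return 0;
}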

Core i7 and Core Duo
Let us review these two Intel architectures…

Individual Core Architecture
- Intel Core Duo uses superscalar cores
- Intel Core i7 uses superscalar cores with simultaneous multithreading (SMT)
  - SMT scales up the number of threads supported: 4 SMT cores, each supporting 4 threads, appear to software as 16 cores
  - (This machine's Core i7 exposes 2 threads per core; see the sketch below)
[Figures: Core i7 and Core 2 Duo]
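
A quick way to see the logical processor count the OS reports (a POSIX-style sketch; _SC_NPROCESSORS_ONLN is a common glibc/BSD extension rather than strict POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical processors = physical cores x hardware threads per core.
     * A 4-core, 2-thread-per-core Core i7 reports 8 here. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", n);
    return 0;
}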

Intel x86 Multicore Organization - Core Duo (1)
- Introduced 2006
- Two x86 superscalar cores with a shared L2 cache
- Dedicated L1 cache per core: 32KB instruction and 32KB data
- Thermal control unit per core
  - Manages chip heat dissipation with sensors; clock speed is throttled as needed
  - Maximizes performance within thermal constraints
  - Improved ergonomics (quieter fan)
- Advanced Programmable Interrupt Controller (APIC)
  - Inter-processor interrupts between cores
  - Routes interrupts to the appropriate core
  - Includes a timer, so the OS can interrupt a core on schedule

Intel x86 Multicore Organization - Core Duo (2)
- Power management logic
  - Monitors thermal conditions and CPU activity
  - Adjusts voltage (and thus power consumption)
  - Can switch individual logic subsystems on/off to save power
  - Split-bus transactions can sleep on one end
- 2MB shared L2 cache
  - Dynamic allocation between cores
  - MESI support for the L1 caches
  - Extended to support multiple Core Duo chips in SMP (not SMT)
  - L2 data is shared between local cores (fast) or external
- Bus interface is the front-side bus (FSB)

Intel Core Duo Block Diagram

Intel x86 Multicore Organization - Core i7
- Introduced November 2008
- Four x86 SMT cores
- Dedicated L2 per core, shared L3 cache
- Speculative pre-fetch for caches
- On-chip DDR3 memory controller
  - Three 8-byte channels (192 bits) giving 32GB/s
  - No front-side bus (just like labs 1 & 2 with the SDRAM controller)
- QuickPath Interconnect (QPI; video if time allows)
  - Cache-coherent point-to-point link
  - High-speed communication between processor chips
  - 6.4G transfers per second, 16 bits per transfer
  - Dedicated bidirectional pairs
  - Total bandwidth 25.6GB/s (see the arithmetic below)
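
The bandwidth numbers fall straight out of the link parameters. A small C check (the DDR3 line assumes DDR3-1333, which is what makes the 32GB/s total come out):

#include <stdio.h>

int main(void) {
    /* QPI: 6.4 GT/s, 16 bits (2 bytes) per transfer, per direction */
    double qpi_one_way = 6.4e9 * 2.0;       /* 12.8 GB/s each direction */
    double qpi_total   = qpi_one_way * 2.0; /* bidirectional pair: 25.6 GB/s */

    /* DDR3: three 8-byte channels at 1333 MT/s (assumed DDR3-1333) */
    double ddr3_total  = 3.0 * 8.0 * 1.333e9;  /* ~32 GB/s */

    printf("QPI total:  %.1f GB/s\n", qpi_total / 1e9);
    printf("DDR3 total: %.1f GB/s\n", ddr3_total / 1e9);
    return 0;
}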

Intel Core i7 Block Diagram

ARM11 MPCore
"ARM vs. x86 and Microsoft: Intel started this fight by challenging ARM with its Atom processor, which is moving downmarket and towards smartphones. Apparently, the major ARM vendors are feeling the threat, are now moving upmarket and are beginning to make their run at low-end PCs and storage appliances to put the pressure back on Intel."
http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

ARM11 MPCore
- Up to 4 processors, each with its own L1 instruction and data caches
- Distributed Interrupt Controller (DIC)
  - Recall the APIC from Intel's core architecture
- Timer per CPU
- Watchdog per CPU (feed it or it barks! see the sketch below)
  - Warning alerts for software failures
  - Counts down from a predetermined value; issues a warning at zero
- CPU interface
  - Interrupt acknowledgement, masking, and completion acknowledgement
- CPU: a single ARM11 core, called MP11
- Vector floating-point (VFP) unit: an FP co-processor
- L1 cache
- Snoop Control Unit (SCU): L1 cache coherency
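
What "feeding" the watchdog looks like in code, as a minimal sketch (the addresses, register names, and bit layout below are hypothetical placeholders, not the documented ARM11 MPCore map):

#include <stdint.h>

/* HYPOTHETICAL memory-mapped watchdog registers, for illustration only */
#define WDOG_LOAD  (*(volatile uint32_t *)0x1F000620u)  /* countdown reload value */
#define WDOG_CTRL  (*(volatile uint32_t *)0x1F000628u)  /* bit 0: enable */

void watchdog_start(uint32_t countdown) {
    WDOG_LOAD = countdown;  /* counter starts here and counts down */
    WDOG_CTRL |= 1u;        /* on reaching zero, a warning interrupt fires */
}

/* Call regularly from a healthy main loop. If the software hangs,
 * the feeding stops, the counter reaches zero, and the dog barks. */
void watchdog_feed(uint32_t countdown) {
    WDOG_LOAD = countdown;
}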

ARM11 MPCore Block Diagram

ARM11 MPCore Interrupt Handling
- The Distributed Interrupt Controller (DIC) collates interrupts from many sources (ironically, it is a centralized controller)
- It provides:
  - Masking (who can ignore an interrupt)
  - Prioritization (CPU A is more important than CPU B)
  - Distribution to the target MP11 CPUs
  - Status tracking (of interrupts)
  - Software interrupt generation
- The number of interrupts is independent of the MP11 CPU design
- Memory-mapped DIC control registers, accessed by the CPUs via a private interface through the SCU
- The DIC can:
  - Route interrupts to a single CPU or to multiple CPUs
  - Provide inter-processor communication: a thread on one CPU can cause activity by a thread on another CPU

DIC Routing
- An interrupt can be routed:
  - Direct to a specific CPU
  - To a defined group of CPUs
  - To all CPUs
- The OS can generate an interrupt to:
  - All but self
  - Self
  - Another specific CPU
- Typically combined with shared memory for inter-processor communication
- 16 interrupt IDs are available for inter-processor communication (per CPU; see the sketch below)
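
Raising one of those 16 software interrupts is a single write to a memory-mapped DIC register. A sketch (GIC-style; the offset and field layout are illustrative assumptions, not copied from the ARM11 MPCore manual):

#include <stdint.h>

/* ASSUMED distributor base and software-interrupt register offset */
#define DIC_SWI  (*(volatile uint32_t *)(0x1F001000u + 0xF00u))

/* Send software interrupt 'id' (0-15) to the CPUs in 'cpu_mask'.
 * Assumed encoding: bits [23:16] = target CPU mask, [3:0] = interrupt ID. */
void send_ipi(uint32_t id, uint32_t cpu_mask) {
    DIC_SWI = ((cpu_mask & 0xFFu) << 16) | (id & 0xFu);
}

/* Example: wake CPU 2 with IPI 5, meaning "new work in the shared queue" */
/* send_ipi(5, 1u << 2); */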

Interrupt States
- Inactive
  - Non-asserted, or completed by that CPU but still pending or active in others (e.g. an allgather)
- Pending
  - Asserted, but processing has not started on that CPU
- Active
  - Processing has started on that CPU but is not complete
  - Can be pre-empted by a higher-priority interrupt (state machine sketched below)
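
The same state machine as a tiny C model (illustrative only; the hardware tracks this per interrupt, per CPU):

/* Per-CPU view of one interrupt ID */
enum irq_state { IRQ_INACTIVE, IRQ_PENDING, IRQ_ACTIVE };

enum irq_state on_assert(enum irq_state s)   { return s == IRQ_INACTIVE ? IRQ_PENDING  : s; }
enum irq_state on_dispatch(enum irq_state s) { return s == IRQ_PENDING  ? IRQ_ACTIVE   : s; }
enum irq_state on_complete(enum irq_state s) { return s == IRQ_ACTIVE   ? IRQ_INACTIVE : s; }
/* "Inactive" on this CPU does not preclude the same interrupt being
 * pending or active on other CPUs (the allgather case above). */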

Interrupt Sources
- Inter-processor interrupts (IPI)
  - Private to each CPU, IDs 0-15 (the 16 IPIs per CPU mentioned earlier)
  - Software-triggered
  - Priority depends on the receiving CPU, not the source
- Private timer and/or watchdog interrupts: ID29 and ID30
- Legacy FIQ line
  - Legacy FIQ pin, per CPU, bypasses the interrupt distributor and directly drives interrupts to the CPU
- Hardware interrupts
  - Triggered by programmable events on the associated interrupt lines
  - Up to 224 lines, starting at ID32 (the full ID map is collected below)
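
The ID map from this slide, collected as constants with a small classifier (a convenience sketch; only the numbering comes from the slide):

/* ARM11 MPCore interrupt ID ranges */
#define IPI_ID_FIRST      0    /* 16 software IPIs, private per CPU */
#define IPI_ID_LAST      15
#define TIMER_IRQ_ID     29    /* private timer */
#define WDOG_IRQ_ID      30    /* private watchdog */
#define HW_IRQ_ID_FIRST  32    /* up to 224 hardware lines: IDs 32..255 */
#define HW_IRQ_ID_LAST  255

const char *irq_kind(int id) {
    if (id >= IPI_ID_FIRST && id <= IPI_ID_LAST)        return "software IPI";
    if (id == TIMER_IRQ_ID)                             return "private timer";
    if (id == WDOG_IRQ_ID)                              return "private watchdog";
    if (id >= HW_IRQ_ID_FIRST && id <= HW_IRQ_ID_LAST)  return "hardware";
    return "reserved/other";
}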

ARM11 MPCore Interrupt Distributor

Cache Coherency
- The Snoop Control Unit (SCU) resolves most shared-data bottleneck issues
- Note: L1 cache coherency is based on MESI, similar to Intel's core architecture (see the sketch below)
- Three types of SCU shared-data resolution:
  1. Direct data intervention
     - Copies clean entries between L1 caches without accessing external memory or L2
     - Can resolve a local L1 miss from a remote L1 rather than from L2
     - Reduces read-after-write traffic from L1 to L2
  2. Duplicated tag RAMs
     - Cache tags are implemented as a separate block of RAM, and a copy is held in the SCU, so the SCU knows when two CPUs hold the same cache line
     - The tag RAM has the same length as the number of lines in the cache
     - The SCU uses the duplicate tags to check data availability before sending coherency commands, sending them only to the CPUs that must update coherent data in their caches
     - Less bus locking, due to less communication during the coherency step
  3. Migratory lines
     - Allow moving dirty data between CPUs without writing to L2 and reading back from external memory (see Stallings Ch. 18.5, p. 703)
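
MESI as a compact C model, plus the read-miss case that direct data intervention accelerates (a conceptual sketch of the protocol, not the SCU hardware):

/* MESI states for one L1 cache line */
enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* Local read miss while a remote L1 may hold the line. Direct data
 * intervention lets the SCU copy a clean line from the remote L1
 * instead of going to L2 or external memory. */
void read_miss(enum mesi *local, enum mesi *remote) {
    if (*remote == EXCLUSIVE || *remote == SHARED) {
        *local  = SHARED;      /* filled core-to-core: no L2 access needed */
        *remote = SHARED;
    } else if (*remote == MODIFIED) {
        /* a migratory line moves the dirty data between CPUs without
         * the write-to-L2 / read-back round trip (simplified here) */
        *local  = SHARED;
        *remote = SHARED;
    } else {
        *local  = EXCLUSIVE;   /* no other sharer: load from L2 / memory */
    }
}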

Performance Effect of Multiple Cores

Recommended Reading
- Stallings, Chapter 18
- Multicore Association web site
- ARM web site (if we have time)
- http://www.intel.com/technology/quickpath/index.htm
- http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html
- http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143