Dr Mohamed Menacer College of Computer Science and Engineering Taibah University CE-321: Computer Architecture Chapter 14: Multicore Computers William Stallings, Computer Organization and Architecture, 8th Edition

Hardware Performance Issues Microprocessors have seen an exponential increase in performance —Improved organization —Increased clock frequency Increase in Parallelism —Pipelining —Superscalar —Simultaneous multithreading (SMT) Diminishing returns —More complexity requires more logic —Increasing chip area for coordinating and signal transfer logic –Harder to design, make and debug

Alternative Chip Organizations

Intel Hardware Trends

Increased Complexity Power requirements grow exponentially with chip density and clock frequency —Can use more chip area for cache –Smaller –Order of magnitude lower power requirements By 2015 —100 billion transistors on 300 mm² die –Cache of 100MB –1 billion transistors for logic Pollack’s rule: —Performance is roughly proportional to square root of increase in complexity –Double complexity gives 40% more performance Multicore has potential for near-linear improvement Unlikely that one core can use all cache effectively
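Pollack's rule as stated above can be checked with a quick calculation (a minimal sketch; the rule itself is only a rough empirical observation):

```python
import math

def pollack_speedup(complexity_factor):
    """Pollack's rule: performance scales roughly with the
    square root of the increase in single-core complexity."""
    return math.sqrt(complexity_factor)

# Doubling single-core complexity yields only ~40% more performance,
# which is why spending the same transistors on a second core
# (potential near-linear improvement) is attractive.
print(round(pollack_speedup(2.0), 2))  # 1.41
```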

Power and Memory Considerations

Chip Utilization of Transistors

Software Performance Issues Performance benefits dependent on effective exploitation of parallel resources Even small amounts of serial code impact performance —10% inherently serial code on an 8-processor system gives only 4.7 times speedup Communication, distribution of work and cache coherence overheads Some applications effectively exploit multicore processors
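The 4.7× figure follows from Amdahl's law; a minimal sketch of the calculation:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's law: overall speedup is limited by the fraction of
    the code that cannot be parallelized, regardless of core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# 10% inherently serial code on an 8-core system:
print(round(amdahl_speedup(0.10, 8), 1))  # 4.7
```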

Effective Applications for Multicore Processors Database Servers handling independent transactions Multi-threaded native applications —Lotus Domino, Siebel CRM Multi-process applications —Oracle, SAP, PeopleSoft Java applications —Java VM is multi-threaded with scheduling and memory management —Sun’s Java Application Server, BEA’s Weblogic, IBM Websphere, Tomcat Multi-instance applications —One application running multiple times E.g. Valve game software

Performance Effect of Multiple Cores

Multicore Organization Number of core processors on chip Number of levels of cache on chip Amount of shared cache Next slide examples of each organization: (a) ARM11 MPCore (b) AMD Opteron (c) Intel Core Duo (d) Intel Core i7

Multicore Organization Alternatives

Advantages of shared L2 Cache Constructive interference reduces overall miss rate Data shared by multiple cores not replicated at cache level With proper frame replacement algorithms mean amount of shared cache dedicated to each core is dynamic —Threads with less locality can have more cache Easy inter-process communication through shared memory Cache coherency confined to L1 Dedicated L2 cache gives each core more rapid access —Good for threads with strong locality Shared L3 cache may also improve performance

Individual Core Architecture Intel Core Duo uses superscalar cores Intel Core i7 uses simultaneous multi-threading (SMT) —Scales up number of threads supported –4 SMT cores, each supporting 4 threads, appears as a 16-core system

Intel x86 Multicore Organization - Core Duo (1) 2006 Two x86 superscalar cores, shared L2 cache Dedicated L1 cache per core —32KB instruction and 32KB data Thermal control unit per core —Manages chip heat dissipation —Maximize performance within thermal constraints —Improved ergonomics Advanced Programmable Interrupt Controller (APIC) —Inter-processor interrupts between cores —Routes interrupts to appropriate core —Includes timer so OS can interrupt core

Intel Core Duo Block Diagram

Intel x86 Multicore Organization - Core Duo (2) Power Management Logic —Monitors thermal conditions and CPU activity —Adjusts voltage and power consumption —Can switch individual logic subsystems 2MB shared L2 cache —Dynamic allocation —MESI support for L1 caches —Extended to support multiple Core Duo in SMP –L2 data shared between local cores or external Bus interface

Intel x86 Multicore Organization - Core i7 November 2008 Four x86 SMT processors Dedicated L2, shared L3 cache Speculative pre-fetch for caches On chip DDR3 memory controller —Three 8 byte channels (192 bits) giving 32GB/s —No front side bus QuickPath Interconnection —Cache coherent point-to-point link —High speed communications between processor chips —6.4G transfers per second, 16 bits per transfer —Dedicated bi-directional pairs —Total bandwidth 25.6GB/s
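The two bandwidth figures on this slide can be reproduced from the numbers it quotes. The per-channel DDR3 transfer rate below is an inference (consistent with DDR3-1333), not stated on the slide:

```python
# DDR3 controller: three 8-byte channels (192 bits total).
channels, bytes_per_channel = 3, 8
# Assumed ~1.333 GT/s per channel (DDR3-1333) to match the quoted 32 GB/s.
ddr3_bw = channels * bytes_per_channel * 1.333e9
print(round(ddr3_bw / 1e9))  # 32 (GB/s)

# QuickPath Interconnect: 6.4 GT/s, 16 bits (2 bytes) per transfer,
# counted over both directions of the dedicated bidirectional pairs.
qpi_bw = 6.4e9 * 2 * 2
print(qpi_bw / 1e9)  # 25.6 (GB/s)
```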

Intel Core i7 Block Diagram

ARM11 MPCore Up to 4 processors each with own L1 instruction and data cache Distributed interrupt controller Timer per CPU Watchdog —Warning alerts for software failures —Counts down from predetermined values —Issues warning at zero CPU interface —Interrupt acknowledgement, masking and completion acknowledgement CPU —Single ARM11 called MP11 Vector floating-point unit —FP co-processor L1 cache Snoop control unit —L1 cache coherency

ARM11 MPCore Block Diagram

ARM11 MPCore Interrupt Handling Distributed Interrupt Controller (DIC) collates interrupts from many sources Masking Prioritization Distribution to target MP11 CPUs Status tracking Software interrupt generation Number of interrupts independent of MP11 CPU design Memory mapped Accessed by CPUs via private interface through SCU Can route interrupts to single or multiple CPUs Provides inter-processor communication —Thread on one CPU can cause activity by thread on another CPU

DIC Routing Direct to specific CPU To defined group of CPUs To all CPUs OS can generate interrupt to: —All but self —Self —Other specific CPU Typically combined with shared memory for inter-processor communication 16 interrupt IDs available for inter-processor communication

Interrupt States Inactive —Non-asserted —Completed by that CPU but pending or active in others Pending —Asserted —Processing not started on that CPU Active —Started on that CPU but not complete —Can be pre-empted by higher priority interrupt
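The three states above form a small state machine. A sketch of the transitions, assuming a pre-empted interrupt returns to pending; state and event names are illustrative, not the real DIC programming interface:

```python
# Per-interrupt, per-CPU state machine for the DIC model on this slide.
TRANSITIONS = {
    ("inactive", "assert"):   "pending",   # line asserted, not yet serviced
    ("pending",  "start"):    "active",    # CPU begins processing
    ("active",   "complete"): "inactive",  # CPU acknowledges completion
    ("active",   "preempt"):  "pending",   # higher-priority interrupt wins
}

def step(state, event):
    # Events that are invalid in the current state leave it unchanged.
    return TRANSITIONS.get((state, event), state)

s = "inactive"
for ev in ("assert", "start", "complete"):
    s = step(s, ev)
print(s)  # inactive
```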

Interrupt Sources Inter-processor Interrupts (IPI) —Private to CPU —ID0-ID15 —Software triggered —Priority depends on target CPU not source Private timer and/or watchdog interrupt —ID29 and ID30 Legacy FIQ line —Legacy FIQ pin, per CPU, bypasses interrupt distributor —Directly drives interrupts to CPU Hardware —Triggered by programmable events on associated interrupt lines —Up to 224 lines —Start at ID32

ARM11 MPCore Interrupt Distributor

Cache Coherency Snoop Control Unit (SCU) resolves most shared data bottleneck issues L1 cache coherency based on MESI Direct data intervention —Copying clean entries between L1 caches without accessing external memory —Reduces read after write from L1 to L2 —Can resolve local L1 miss from remote L1 rather than L2 Duplicated tag RAMs —Cache tags implemented as separate block of RAM —Same length as number of lines in cache —Duplicates used by SCU to check data availability before sending coherency commands —Only send to CPUs that must update coherent data cache Migratory lines —Allows moving dirty data between CPUs without writing to L2 and reading back from external memory
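MESI tracks each cache line as Modified, Exclusive, Shared or Invalid. A simplified sketch of how one L1 cache's line state reacts to another core's access (ignoring the SCU optimizations listed above, such as direct data intervention):

```python
# MESI states: "M" Modified, "E" Exclusive, "S" Shared, "I" Invalid.
def on_remote_read(state):
    """Another core reads the line: M/E copies downgrade to Shared
    (a Modified line must first supply its dirty data)."""
    return "S" if state in ("M", "E") else state

def on_remote_write(state):
    """Another core writes the line: all other copies are invalidated."""
    return "I"

print(on_remote_read("M"))   # S
print(on_remote_write("S"))  # I
```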

Internet Resources - Web site for book William Stallings, 8th Edition (2009) —Chapter 18 —Links to sites of interest —Links to sites for courses that use the book —Information on other books by W. Stallings —Math —How-to —Research resources —Misc

Internet Resources - Web sites to look for WWW Computer Architecture Home Page CPU Info Center Processor Emporium ACM Special Interest Group on Computer Architecture IEEE Technical Committee on Computer Architecture Intel Technology Journal Manufacturer’s sites —Intel, IBM, etc.