1 Pipelining for Multi-Core Architectures

2 Multi-Core Technology [Diagram: evolution from a single core with its cache, to a dual core (2 or more cores, each with its own cache), to a multi-core package with 4 or more cores, each with its own cache; each step packs 2X more cores.]

3 Why multi-core? It is difficult to push single-core clock frequencies even higher. Deeply pipelined circuits bring:
– heat problems
– clock problems
– efficiency (stall) problems
Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, is extremely difficult; it would require:
– issuing 3 or 4 data memory accesses per cycle,
– renaming and accessing more than 20 registers per cycle, and
– fetching 12 to 24 instructions per cycle.
Many new applications are multithreaded. The general trend in computer architecture is a shift toward more parallelism.

4 Instruction-level parallelism (ILP) Parallelism at the machine-instruction level: the processor can reorder and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, and so on. Instruction-level parallelism enabled the rapid increases in processor speeds over the last 15 years.
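As an illustration (my example, not from the slides), ILP is visible even at the source level: breaking a long dependence chain into independent operations gives a superscalar, out-of-order core work it can issue in the same cycle. A minimal sketch:

```cpp
#include <cassert>

// sum_two_acc uses two independent accumulators: the adds to s0 and
// s1 have no data dependence on each other, so an out-of-order
// superscalar core can issue both in parallel, roughly halving the
// length of the loop's dependence chain.
long sum_two_acc(const long* a, int n) {
    long s0 = 0, s1 = 0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];      // independent of the next statement...
        s1 += a[i + 1];  // ...so both adds can execute in the same cycle
    }
    if (n % 2) s0 += a[n - 1];  // pick up a leftover odd element
    return s0 + s1;
}
```

The transformation does not change the result, only how much parallelism the hardware can find.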

5 Thread-level parallelism (TLP) This is parallelism on a coarser scale. A server can serve each client in a separate thread (Web server, database server). A computer game can do AI, graphics, and sound in three separate threads. Single-core superscalar processors cannot fully exploit TLP. Multi-core architectures are the next step in processor evolution: they explicitly exploit TLP.
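A minimal sketch of TLP (illustrative, not from the slides): three independent tasks run in three threads, the way a game might run AI, graphics, and sound concurrently. Here each "task" simply sums one third of a vector; on a multi-core CPU the three threads can run on different cores at the same time.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Split the work into three independent tasks and run each in its
// own thread; on a multi-core machine the OS can schedule the three
// threads on three different cores simultaneously.
long parallel_sum3(const std::vector<long>& data) {
    long part[3] = {0, 0, 0};
    const std::size_t n = data.size() / 3;
    std::thread t1([&] { part[0] = std::accumulate(data.begin(), data.begin() + n, 0L); });
    std::thread t2([&] { part[1] = std::accumulate(data.begin() + n, data.begin() + 2 * n, 0L); });
    std::thread t3([&] { part[2] = std::accumulate(data.begin() + 2 * n, data.end(), 0L); });
    t1.join(); t2.join(); t3.join();   // wait for all three tasks
    return part[0] + part[1] + part[2];
}
```

The three tasks never touch each other's data, which is exactly the property that lets them scale across cores.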

6 What applications benefit from multi-core? Database servers. Web servers (Web commerce). Multimedia applications. Scientific applications, CAD/CAM. In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core.

7 More examples Editing a photo while recording a TV show through a digital video recorder. Downloading software while running an anti-virus program. "Anything that can be threaded today will map efficiently to multi-core." BUT: some applications are difficult to parallelize.

8 Core 2 Duo Microarchitecture

9 Without SMT, only a single thread can run at any given time. [Pipeline diagram: BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, BTB, L2 cache and control, bus. Thread 1 (floating point) occupies the pipeline.]

10 Without SMT, only a single thread can run at any given time. [Same pipeline diagram, now occupied by Thread 2 (integer operation).]

11 SMT processor: both threads can run concurrently. [Same pipeline diagram, shared by Thread 1 (floating point) and Thread 2 (integer operation).]

12 But: the threads can't simultaneously use the same functional unit. [Same pipeline diagram; Thread 1 and Thread 2 both contending for the integer unit is marked IMPOSSIBLE.] This scenario is impossible with SMT on a single core (assuming a single integer unit).

13 Multi-core: threads can run on separate cores. [Two copies of the pipeline diagram, one per core: Thread 1 runs on the first core, Thread 2 on the second.]

14 Multi-core: threads can run on separate cores. [Two copies of the pipeline diagram: Thread 3 runs on the first core, Thread 4 on the second.]

15 Combining Multi-core and SMT Cores can be SMT-enabled (or not). The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
The number of SMT threads per core: 2, 4, or sometimes 8 simultaneous threads. Intel calls them "hyper-threads".
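As a practical aside (my addition, not slide content), software can query the total number of hardware threads, which on an SMT-enabled multi-core is cores per package times SMT threads per core:

```cpp
#include <thread>

// std::thread::hardware_concurrency() reports the number of logical
// processors the OS schedules on: e.g. a dual-core chip with 2-way
// SMT ("hyper-threading") reports 4. A return value of 0 means the
// count could not be determined.
unsigned logical_threads() {
    return std::thread::hardware_concurrency();
}
```

This is the count a thread pool would typically size itself to; note it counts hardware threads, not physical cores.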

16 SMT dual-core: all four threads can run concurrently. [Two copies of the pipeline diagram, each core running two threads: Threads 1 and 3 on one core, Threads 2 and 4 on the other.]

17 Multi-core and cache coherence [Two diagrams: (a) both L1 and L2 are private to each core, and the cores share main memory; examples: AMD Opteron, AMD Athlon, Intel Pentium D. (b) a design that adds a shared L3 cache below the private L1/L2; example: Intel Itanium 2.]

18 The cache coherence problem Since each core has private caches, how do we keep the data consistent across them? Each core should perceive memory as a monolithic array, shared by all the cores.

19 The cache coherence problem Suppose variable x initially contains 15213. [Diagram: a multi-core chip with four cores (Core 1-4), one or more levels of cache per core, and main memory holding x=15213.]

20 The cache coherence problem Core 1 reads x. [Diagram: Core 1's cache now holds a copy x=15213; main memory still holds x=15213.]

21 The cache coherence problem Core 2 reads x. [Diagram: Core 1's and Core 2's caches each hold a copy x=15213.]

22 The cache coherence problem Core 1 writes to x, setting it to 21660 (assuming write-through caches). [Diagram: Core 1's cache and main memory now hold x=21660, but Core 2's cache still holds the old value x=15213.]

23 The cache coherence problem Core 2 attempts to read x… and gets a stale copy. [Diagram: Core 2's cache still holds x=15213 while Core 1's cache and main memory hold x=21660.]
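A hardware coherence protocol prevents the stale read above by invalidating or updating Core 2's copy when Core 1 writes. Software still has to make read-modify-write sequences indivisible, e.g. with atomics. A minimal C++ sketch (my example, reusing the slide's value 15213):

```cpp
#include <atomic>
#include <thread>

// Two threads (standing in for two cores) each increment the shared
// variable x 100000 times. Cache coherence makes each core's writes
// visible to the other; std::atomic::fetch_add additionally makes
// each read-modify-write indivisible, so no increment is lost.
int shared_counter_demo() {
    std::atomic<int> x{15213};                 // initial value from the slides
    auto core = [&x] {
        for (int i = 0; i < 100000; ++i) x.fetch_add(1);
    };
    std::thread c1(core), c2(core);
    c1.join(); c2.join();
    return x.load();                           // 15213 + 200000 if nothing was lost
}
```

With a plain (non-atomic) int and unsynchronized `x = x + 1`, increments could be lost even on coherent hardware, because coherence orders individual accesses, not whole read-modify-write sequences.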

24 The Memory Wall Problem

25 Memory Wall [Graph: processor performance ("Moore's Law") improves ~60%/yr. (2X/1.5 yr) while DRAM improves only ~9%/yr. (2X/10 yrs); the processor-memory performance gap grows ~50% per year, starting in the early 1980s.]

26 Latency in a Single PC [Graph: memory access time vs. CPU time; their steadily growing ratio is "THE WALL".]

27 Pentium 4 cache hierarchy [Diagram with access latencies: L1 I (12Ki) and L1 D (8 KiB): 2 cycles; L2 cache (512 KiB): 19 cycles; L3 cache (2 MiB): 43 cycles; main memory: 206 cycles.]
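These latencies can be folded into a single average memory access time, applying AMAT = hit time + miss rate × miss penalty level by level. The cycle counts below are from the slide; the hit rates are illustrative assumptions, not measured values:

```cpp
// AMAT for the Pentium 4 hierarchy above. The latencies (2, 19, 43,
// 206 cycles) come from the slide; the hit rates (95% L1, 80% L2,
// 50% L3) are assumed purely for illustration.
double pentium4_amat() {
    const double mem = 206.0;                // main memory latency, cycles
    const double l3  = 43.0 + 0.50 * mem;    // 50% of L3 accesses miss to memory
    const double l2  = 19.0 + 0.20 * l3;     // 20% of L2 accesses miss to L3
    const double l1  =  2.0 + 0.05 * l2;     // 5% of L1 accesses miss to L2
    return l1;                               // ~4.4 cycles on average
}
```

Even with a 206-cycle memory, high hit rates in the upper levels keep the average close to the L1 latency, which is the whole point of the hierarchy.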

28 Technology Trends
Capacity / Speed (latency): Logic: 2x in 3 years / 2x in 3 years. DRAM: 4x in 3 years / 2x in 10 years. Disk: 4x in 3 years / 2x in 10 years.
DRAM generations (year, size, cycle time):
1980: 64 Kb, 250 ns
1983: 256 Kb, 220 ns
1986: 1 Mb, 190 ns
1989: 4 Mb, 165 ns
1992: 16 Mb, 120 ns
1996: 64 Mb, 110 ns
1998: 128 Mb, 100 ns
2000: 256 Mb, 90 ns
2002: 512 Mb, 80 ns
2006: 1024 Mb, 60 ns
Over these generations: 16000:1 growth in capacity, but only 4:1 improvement in latency.

29 Processor-DRAM Performance Gap Impact: Example
To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. The minimum cost of a full memory access, in wasted CPU cycles (or instructions), is (memory access time / CPU cycle time) − 1:
Year: CPU speed, CPU cycle, memory access, minimum CPU cycles wasted
1986: 8 MHz, 125 ns, 190 ns, 190/125 − 1 = 0.5
1989: 33 MHz, 30 ns, 165 ns, 165/30 − 1 = 4.5
1992: 60 MHz, 16.6 ns, 120 ns, 120/16.6 − 1 = 6.2
1996: 200 MHz, 5 ns, 110 ns, 110/5 − 1 = 21
1998: 300 MHz, 3.33 ns, 100 ns, 100/3.33 − 1 = 29
2000: 1000 MHz, 1 ns, 90 ns, 90/1 − 1 = 89
2002: 2000 MHz, 0.5 ns, 80 ns, 80/0.5 − 1 = 159

30 Main Memory
Main memory generally uses Dynamic RAM (DRAM), which stores a bit with a single transistor but requires a periodic refresh (~every 8 ms).
Cache uses SRAM (Static Random Access Memory): no refresh, but 6 transistors/bit vs. 1 transistor/bit for DRAM.
Size: DRAM/SRAM ≈ 4-8; cost and cycle time: SRAM/DRAM ≈ 8-16.
Main memory performance:
– Memory latency: Access time is the time between a memory access request and the moment the requested data is available to the cache/CPU. Cycle time is the minimum time between requests to memory (greater than the access time in DRAM, to allow the address lines to stabilize).
– Memory bandwidth: the maximum sustained data-transfer rate between main memory and the cache/CPU.
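A worked example of the bandwidth figure (all numbers assumed for illustration, not from the slide): a DRAM bank with a 60 ns cycle time behind an 8-byte-wide bus can complete at most one 8-byte transfer per cycle.

```cpp
// Peak sustained bandwidth = bus width / cycle time. With an assumed
// 8-byte bus and a 60 ns DRAM cycle time this gives about 133 MB/s
// per bank, which is why real memory systems interleave banks and
// burst-transfer whole cache lines rather than one word at a time.
double peak_bandwidth_mb_s(double bus_bytes, double cycle_ns) {
    const double bytes_per_sec = bus_bytes / (cycle_ns * 1e-9);
    return bytes_per_sec / 1e6;              // convert to MB/s
}
```

The same formula shows why widening the bus or interleaving banks raises bandwidth even when the latency of a single access does not improve.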

31 Architects Use Transistors to Tolerate Slow Memory
Cache:
– small, fast memory
– holds information (expected) to be used soon
– mostly successful
Apply recursively:
– level-one cache(s)
– level-two cache
Most of a microprocessor's die area is cache!
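Why caches "mostly succeed": programs have locality. A classic illustration (my example, not from the slides): two functions that sum the same matrix, where the row-major version walks memory sequentially (good spatial locality, each fetched cache line fully used) while the column-major version strides through memory and misses far more often. The results are identical; only the cache behaviour differs.

```cpp
#include <vector>

// Row-major traversal: within each row the elements sit at
// consecutive addresses, so every cache line brought in is fully used.
long sum_row_major(const std::vector<std::vector<long>>& m) {
    long s = 0;
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            s += m[i][j];                   // cache-friendly access order
    return s;
}

// Column-major traversal of the same data: each access jumps to a
// different row, so cache lines are evicted before being reused.
long sum_col_major(const std::vector<std::vector<long>>& m) {
    long s = 0;
    for (std::size_t j = 0; j < m[0].size(); ++j)
        for (std::size_t i = 0; i < m.size(); ++i)
            s += m[i][j];                   // cache-hostile access order
    return s;
}
```

On matrices large enough to exceed the caches, the row-major version typically runs several times faster, purely because of the memory hierarchy discussed in the preceding slides.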