Microprocessor Microarchitecture: The Future of CPU Architecture
Lynn Choi, School of Electrical Engineering

Future CPU Microarchitecture - MANYCORE

Idea
- Double the number of cores on a chip with each silicon generation (a rough projection of this trend follows below)
- 1000 cores will be possible with 30-nm technology

[Figure: number of cores (1 to 1024, log scale) versus year (2002-2011) for representative processors: Intel Pentium 4 (1), Intel Pentium D (2), Intel Core 2 Duo (2), IBM Power4 (2), Intel Core2 Quad (4), Intel Dunnington (6), Sun UltraSPARC T1 (8), Intel Core i7 (8), IBM Cell (9), Sun Victoria Falls (16), Intel Teraflops (80)]
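The doubling-per-generation claim can be turned into a back-of-envelope projection. The sketch below is purely illustrative: the 2005 dual-core starting point and the ~2-year generation cadence are my assumptions, not figures from the slides.

    /* Rough projection of "double the cores every silicon generation".
     * Starting point and generation cadence are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        int year  = 2005;     /* assumed dual-core starting point */
        int cores = 2;
        while (cores < 1000) {
            cores *= 2;       /* one doubling per generation      */
            year  += 2;       /* assumed ~2 years per generation  */
        }
        printf("~%d cores reached around %d\n", cores, year);
        return 0;
    }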

Future CPU Microarchitecture - MANYCORE

Core architecture
- Should be the most efficient in MIPS/watt and MIPS/silicon area
- Modestly pipelined (5~9 stages) in-order pipeline

System architecture
- Heterogeneous vs. homogeneous MP
  - Heterogeneous in terms of functionality (e.g., CPU, DSP, and GPU tiles on one chip)
  - Heterogeneous in terms of performance: Amdahl's Law argues for a few fast cores to handle the serial fraction alongside many simple cores (see the sketch after this slide)
- Shared vs. distributed memory MP
  - Shared-memory multicore: most existing multicores; preserves the programming paradigm via binary compatibility and cache coherence
  - Distributed-memory multicore: more scalable hardware, better suited to manycore architectures

[Figure: a heterogeneous chip built from CPU/DSP/GPU tiles vs. a homogeneous chip built from identical CPU tiles]
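The Amdahl's Law point is the key argument for keeping at least one fast core: the serial fraction bounds the achievable speedup no matter how many simple cores are added. A minimal sketch in C; the 95% parallel fraction is an illustrative assumption, not a number from the slides.

    /* Minimal sketch of Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
     * where p is the parallel fraction and n is the core count.
     * p = 0.95 is an illustrative assumption. */
    #include <stdio.h>

    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        double p = 0.95;
        int cores[] = {4, 16, 64, 256, 1024};
        for (int i = 0; i < 5; i++)
            printf("%4d cores -> speedup %5.1fx\n", cores[i], amdahl(p, cores[i]));
        /* The speedup saturates near 1/(1-p) = 20x no matter how many cores
         * are added -- hence the case for one fast core per chip. */
        return 0;
    }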

Future CPU Microarchitecture I - MANYCORE

Issues
- On-chip interconnects
  - Buses and crossbars will not scale to 1000 cores!
  - Packet-switched point-to-point interconnects: ring (IBM Cell), 2D/3D mesh/torus (RAW) networks
  - These provide scalable bandwidth, but what about latency? (See the sketch after this list.)
- Cache coherence
  - Bus-based snooping protocols cannot be used!
  - Directory-based protocols work for up to ~100 cores
  - Simpler, more flexible coherence protocols will be needed to leverage the improved bandwidth and low latency
  - Caches can be adapted between private and shared configurations, giving more direct control over the memory hierarchy; alternatively, software-managed caches
- Off-chip pin bandwidth
  - Manycores will unleash a much higher number of MIPS in a single chip, putting more demand on I/O pin bandwidth
  - Need to achieve 100 GB/s ~ 1 TB/s memory bandwidth
  - More demand on DRAM as a share of total system silicon
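The latency question can be made concrete with a brute-force estimate of the average hop count between two random tiles of a k x k 2D mesh under uniform traffic. This sketch is my own illustration (the mesh sizes are arbitrary), not part of the original slides; it shows that, unlike a bus, mesh latency grows with the core count.

    /* Back-of-envelope: average hop count (Manhattan distance) between two
     * uniformly random tiles on a k x k 2D mesh. Mesh sizes are arbitrary
     * illustrative choices. */
    #include <stdio.h>
    #include <stdlib.h>

    static double avg_hops(int k) {
        long long total = 0, pairs = 0;
        for (int sx = 0; sx < k; sx++)
            for (int sy = 0; sy < k; sy++)
                for (int dx = 0; dx < k; dx++)
                    for (int dy = 0; dy < k; dy++) {
                        total += abs(sx - dx) + abs(sy - dy);
                        pairs++;
                    }
        return (double)total / (double)pairs;
    }

    int main(void) {
        int sizes[] = {4, 8, 16, 32};   /* 16, 64, 256, 1024 cores */
        for (int i = 0; i < 4; i++)
            printf("%2dx%-2d mesh (%4d cores): avg %.1f hops\n",
                   sizes[i], sizes[i], sizes[i] * sizes[i], avg_hops(sizes[i]));
        return 0;
    }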

Future CPU Microarchitecture I - MANYCORE

Projection
- Pin I/O bandwidth cannot sustain the memory demands of manycores
- Multicores may work well with 2 to 8 processors on a chip
- Diminishing returns as 16 or 32 processors are reached, just as returns fell for ILP beyond the 4~6 issue widths available today
- But for applications with high TLP, manycore will be a good design choice: network processors, Intel's RMS (Recognition, Mining, Synthesis) workloads

Future CPU Architecture II – Multiple SoC

Idea – System on Chip!
- Integrate main memory on chip: much higher memory bandwidth and reduced memory access latency
- Memory hierarchy issues
  - For memory expansion, off-chip DRAMs may still need to be provided, which implies multiple levels of DRAM in the memory hierarchy
  - On-chip DRAM can be used as a cache for the off-chip DRAM (see the sketch below)
  - On-chip memory is divided into SRAMs and DRAMs: should we use the SRAMs for caches?
- Multiple systems on chip
  - A single monolithic on-chip DRAM shared by multiple cores, or
  - Distributed DRAM blocks across the cores

[Figure: one chip with a shared DRAM block serving several CPUs vs. a chip with a DRAM block attached to each CPU]
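To get a rough feel for the on-chip-DRAM-as-cache idea, the sketch below computes the average memory access time (AMAT) for a two-level DRAM hierarchy. All latencies and hit rates are illustrative assumptions, not figures from the slides.

    /* AMAT with on-chip DRAM acting as a cache in front of off-chip DRAM.
     * Latencies and hit rates are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double on_chip_ns  = 10.0;   /* assumed on-chip DRAM latency        */
        double off_chip_ns = 60.0;   /* assumed extra off-chip DRAM penalty */
        double hit_rates[] = {0.50, 0.75, 0.90, 0.95};
        for (int i = 0; i < 4; i++) {
            double amat = on_chip_ns + (1.0 - hit_rates[i]) * off_chip_ns;
            printf("on-chip hit rate %.2f -> AMAT %.1f ns\n", hit_rates[i], amat);
        }
        return 0;
    }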

Intel Terascale Processor

Features
- 80 processor cores at 3.13 GHz: 1.01 TFLOPS at 1.0 V, 62 W, 100M transistors
- 3D stacked memory
- Mesh interconnect providing 80 GB/s bandwidth

Challenges
- On-die power dissipation
- Off-chip memory bandwidth
- Cache hierarchy design and coherence
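The quoted 1.01 TFLOPS can be sanity-checked from the core count and clock: 80 cores x 3.13 GHz x 4 FLOPs per core per cycle is roughly 1.0 TFLOPS. The 4 FLOPs/cycle per tile (two FP multiply-accumulate units) is my assumption for this check, not a number stated on the slide.

    /* Sanity check of the quoted peak: cores x clock x FLOPs per cycle.
     * 4 FLOPs/cycle per core is an assumption made for this check. */
    #include <stdio.h>

    int main(void) {
        double cores = 80.0, clock_ghz = 3.13, flops_per_cycle = 4.0;
        double tflops = cores * clock_ghz * 1e9 * flops_per_cycle / 1e12;
        printf("peak ~= %.2f TFLOPS\n", tflops);   /* ~1.00 TFLOPS */
        return 0;
    }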

Trend - Change of Wisdoms (old wisdom → new wisdom)

1. Power is free, but transistors are expensive.
   → "Power wall": Power is expensive, but transistors are "free".
2. Regarding power, the only concern is dynamic power.
   → For desktops/servers, static power due to leakage can be 40% of total power.
3. We can reveal more ILP via compiler/architecture innovation.
   → "ILP wall": There are diminishing returns on finding more ILP.
4. Multiply is slow, but load and store is fast.
   → "Memory wall": Load and store is slow, but multiply is fast. It takes ~200 clocks to access DRAM, while an FP multiply may take only 4 clock cycles. (See the sketch after this list.)
5. Uniprocessor performance doubles every 18 months.
   → Power wall + memory wall + ILP wall: the doubling of uniprocessor performance may now take 5 years.
6. Don't bother parallelizing your application; you can just wait and run it on a faster sequential computer.
   → It will be a very long wait for a faster sequential computer.
7. Increasing clock frequency is the primary method of improving processor performance.
   → Increasing parallelism is the primary method of improving processor performance.
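To make the memory wall (item 4) concrete, the sketch below estimates the effective CPI when cache misses go to a ~200-cycle DRAM, using the slide's latency figure. The base CPI, memory-reference rate, and miss rates are illustrative assumptions.

    /* "Memory wall" illustration using the slide's ~200-cycle DRAM access.
     * Base CPI, memory references per instruction, and miss rates are
     * illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double base_cpi     = 1.0;    /* assumed CPI with a perfect cache     */
        double mem_refs     = 0.3;    /* assumed memory references per instr. */
        double dram_penalty = 200.0;  /* DRAM access latency in cycles        */
        double miss_rates[] = {0.01, 0.02, 0.05};
        for (int i = 0; i < 3; i++) {
            double cpi = base_cpi + mem_refs * miss_rates[i] * dram_penalty;
            printf("miss rate %.0f%% -> effective CPI %.2f\n",
                   100.0 * miss_rates[i], cpi);
        }
        return 0;
    }

Even a 2% miss rate more than doubles the effective CPI in this sketch, which is why the later slides emphasize memory bandwidth and on-chip DRAM rather than faster arithmetic.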