18-742 Parallel Computer Architecture
Lecture 4: Multi-Core
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2011

Research Project
Project proposal due: Jan 31
Project topics:
Does everyone have a topic?
Does everyone have a partner?
Does anyone have too many partners?

Last Lecture
Programming model vs. architecture
Message passing vs. shared memory
Data parallel
Dataflow
Generic parallel machine

Readings for Today
Required:
Hill and Marty, "Amdahl's Law in the Multi-Core Era," IEEE Computer 2008.
Annavaram et al., "Mitigating Amdahl's Law Through EPI Throttling," ISCA 2005.
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," ISCA 2007.
Recommended:
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," ISCA 2000.
Kongetira et al., "Niagara: A 32-Way Multithreaded SPARC Processor," IEEE Micro 2005.
Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS 1967.
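
The first two required readings revolve around Amdahl's law; as a quick reference (the standard formula, not reproduced from the slides), the speedup of a program whose fraction f is parallelizable across n cores is:

    \text{Speedup}(f, n) = \frac{1}{(1 - f) + f/n}

For example, f = 0.9 and n = 16 give a speedup of 1/(0.1 + 0.05625) ≈ 6.4, far below 16: the serial fraction dominates, which is exactly the diminishing-returns effect these papers address.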

Reviews
Due Today (Jan 21)
Seitz, "The Cosmic Cube," CACM 1985.
Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009.
Due Next Tuesday (Jan 25)
Papamarcos and Patel, "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories," ISCA 1984.
Kelm et al., "Cohesion: A Hybrid Memory Model for Accelerators," ISCA 2010.
Due Next Friday (Jan 28)
Suleman et al., "Data Marshaling for Multi-Core Architectures," ISCA 2010.

Recap of Last Lecture

Review: Shared Memory vs. Message Passing
Loosely coupled multiprocessors
No shared global memory address space
Multicomputer network; network-based multiprocessors
Usually programmed via message passing: explicit calls (send, receive) for communication
Tightly coupled multiprocessors
Shared global memory address space
Traditional multiprocessing (symmetric multiprocessing, SMP); existing multi-core processors, multithreaded processors
Programming model similar to that of a multitasking uniprocessor, except that operations on shared data require synchronization
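
To make the contrast concrete, here is a minimal sketch (illustrative, not from the slides) of both styles in C. The message-passing fragment uses real MPI calls (MPI_Send, MPI_Recv); the shared-memory fragment uses pthreads, where communication is an ordinary store guarded by a lock. The surrounding program structure is assumed for illustration.

    /* Message passing: no shared address space; communication is explicit. */
    #include <mpi.h>

    void mp_example(int *argc, char ***argv) {
        int rank, value = 0;
        MPI_Init(argc, argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* explicit receive */
        }
        MPI_Finalize();
    }

    /* Shared memory: communication is an ordinary store, but shared data
       must be synchronized. */
    #include <pthread.h>

    int shared_value;                                  /* visible to all threads */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *producer(void *arg) {
        pthread_mutex_lock(&lock);
        shared_value = 42;                             /* a plain store communicates */
        pthread_mutex_unlock(&lock);
        return NULL;
    }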

Programming Models vs. Architectures
Five major models:
Shared memory
Message passing
Data parallel (SIMD)
Dataflow
Systolic
Hybrid models?

Scalability, Convergence, and Some Terminology

Scaling Shared Memory Architectures

Interconnection Schemes for Shared Memory
Scalability is dependent on the interconnect

UMA/UCA: Uniform Memory or Cache Access
All processors have the same uncontended latency to memory
Latencies get worse as the system grows
Symmetric multiprocessing (SMP) ~ UMA with a bus interconnect

Uniform Memory/Cache Access
+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases latencies
- Contention could restrict bandwidth and increase latency

Example SMP
Quad-pack Intel Pentium Pro

How to Scale Shared Memory Machines?
Two general approaches:
Maintain UMA
Provide a scalable interconnect to memory
Downside: every memory access incurs the round-trip network latency
Interconnect complete processors with local memory: NUMA (non-uniform memory access)
Local memory is faster than remote memory
Still needs a scalable interconnect for accessing remote memory, but that interconnect is not on the critical path of local memory accesses

NUMA/NUCA: Non-Uniform Memory/Cache Access
Shared memory is split into local versus remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement
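
To see why placement matters, the sketch below (illustrative, not from the slides) allocates the same array on the local node and on a remote node of an assumed two-node Linux machine using libnuma; numa_available, numa_run_on_node, numa_alloc_onnode, and numa_free are real libnuma calls. Timing the two loops separately would expose the local/remote latency gap.

    #include <numa.h>        /* Linux libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        size_t n = 64UL * 1024 * 1024;
        numa_run_on_node(0);                                     /* run on node 0  */
        long *local  = numa_alloc_onnode(n * sizeof(long), 0);   /* local data     */
        long *remote = numa_alloc_onnode(n * sizeof(long), 1);   /* remote data    */
        long sum = 0;
        for (size_t i = 0; i < n; i++) sum += local[i];          /* local accesses  */
        for (size_t i = 0; i < n; i++) sum += remote[i];         /* remote accesses */
        printf("%ld\n", sum);                                    /* pages are zero-filled */
        numa_free(local,  n * sizeof(long));
        numa_free(remote, n * sizeof(long));
        return 0;
    }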

Example NUMA Machines
Sun Enterprise Server
Cray T3E

Convergence of Parallel Architectures
A scalable shared memory architecture looks similar to a scalable message passing architecture
Main difference: is remote memory accessible with loads/stores?

Historical Evolution: 1960s & 70s
Early MPs
Mainframes
Small number of processors
Crossbar interconnect
UMA

Historical Evolution: 1980s
Bus-based MPs
Enabler: processor-on-a-board
Economical scaling
Precursor of today's SMPs
UMA

Historical Evolution: Late 80s to Mid 90s
Large-scale MPs (massively parallel processors)
Multi-dimensional interconnects
Each node a computer (processor + cache + memory)
Both shared memory and message passing versions
NUMA
Still used for "supercomputing"

Historical Evolution: Current
Chip multiprocessors (multi-core)
Small- to mid-scale multi-socket CMPs
One module type: processor + caches + memory
Clusters/datacenters
Use high-performance LANs to connect SMP blades, racks
Driven by economics and cost
Smaller systems => higher volumes
Off-the-shelf components
Driven by applications
Many more throughput applications (web servers) than parallel applications (weather prediction)
Cloud computing

Historical Evolution: Future
Cluster/datacenter on a chip?
Heterogeneous multi-core?
Bounce back to small-scale multi-core?
???

Multi-Core Processors

Moore's Law
Moore, "Cramming more components onto integrated circuits," Electronics, 1965.

Multi-Core
Idea: Put multiple processors on the same die.
Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
What else could you do with the die area you dedicate to multiple processors?
Have a bigger, more powerful core
Have larger caches in the memory hierarchy
Add simultaneous multithreading
Integrate platform components on chip (e.g., network interface, memory controllers)

Why Multi-Core?
Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to programmer, compiler
- Very difficult to design (scalable algorithms for improving single-thread performance are elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled up. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)

Large Superscalar vs. Multi-Core
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.

Multi-Core vs. Large Superscalar
Multi-core advantages
+ Simpler cores => more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads => reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the sketch after this list)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
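
A minimal sketch of what "requires parallel programming" means in practice (illustrative C/pthreads code, not from the slides; all names are hypothetical): the programmer must partition the work across threads explicitly and synchronize updates to shared data. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static long data[N];
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread sums its own chunk, then merges under a lock. */
    static void *worker(void *arg) {
        long id    = (long)arg;
        long chunk = N / NTHREADS;
        long begin = id * chunk;
        long end   = (id == NTHREADS - 1) ? N : begin + chunk;
        long partial = 0;
        for (long i = begin; i < end; i++)
            partial += data[i];
        pthread_mutex_lock(&lock);      /* shared data needs synchronization */
        total += partial;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++)
            data[i] = i;
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("sum = %ld\n", total);   /* equals N*(N-1)/2 */
        return 0;
    }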

Large Superscalar vs. Multi-Core
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
Technology push
Instruction issue queue size limits the cycle time of the superscalar, OoO processor => diminishing performance
Quadratic increase in complexity with issue width
Large, multi-ported register files to support large instruction windows and issue widths => reduced frequency or longer RF access, diminishing performance
Application pull
Integer applications: little parallelism?
FP applications: abundant loop-level parallelism
Others (transaction processing, multiprogramming): CMP is a better fit

Why Multi-Core?
Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate the memory hierarchy

Cache vs. Core

Why Multi-Core?
Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files and larger issue width (and associated costs) to support many threads => complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces both single-thread and parallel application performance
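
As a small hands-on aside (assumed, not from the slides): on Linux, the logical CPUs that SMT multiplexes onto one physical core can be read from sysfs, which is handy when studying the resource-sharing effects above. The sysfs path is a standard Linux interface; the rest is an illustrative sketch.

    #include <stdio.h>

    /* Print which logical CPUs are SMT siblings of cpu0, i.e., which  */
    /* hardware threads share one physical core's pipeline and caches. */
    int main(void) {
        const char *path =
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list";
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        char buf[64];
        if (fgets(buf, sizeof buf, f))
            printf("SMT siblings of cpu0: %s", buf);  /* e.g., "0,4" */
        fclose(f);
        return 0;
    }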