Multi-core systems. COMP25212 System Architecture. Daniel Goodman, Advanced Processor Technologies Group.


Applications on Multi-cores
 Processes – operating system level processes, e.g. separate applications
 – in many cases do not share any data
 – separate virtual memory spaces
 Threads – parallel parts of the same application sharing the same memory
 – this is where the problems lie
 – from here on, assume we are talking about threads

An Example
// initially a = 0 and b = 0
Core 0:                  Core 4:
a = 1;                   b = 1;
while (b == 0) { }       while (a == 0) { }
 Will this always terminate? Why?
 Consider: registers, reorder buffers, out-of-order execution

Memory Coherency/Consistency
 Coherency: hardware ensuring that all memories remain the same
 – this can become very expensive
 – it is not sufficient to address all of the problems in the last example
 Consistency: the model presented to the programmer of when changes are written

Sequential Consistency
 L. Lamport: “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”
 Informally:
 – memory operations appear to execute one at a time
 – operations of a single core appear to execute in the order described by the program

Memory Consistency
 Sequential Consistency is not the most stringent memory model
 It provides the behaviour that most software developers expect
 Computer architectures and Java use relaxed consistency models
 The compiler has to insert special instructions (fence, membar) in order to maintain the program semantics

Synchronization
 How do we implement a lock? Using regular read and write operations:
Label: read flag
       if (flag == 0) {  // lock is free
           write flag 1
       } else {          // wait until lock is free
           goto Label
       }
 Does it work? Do we have everything that we need?

ISA support for Synchronization
 Atomic compare-and-swap instruction
 – parameters: x, old_value, new_value
 – if [x] == old_value then [x] = new_value
 – return [x]
 Load-linked and store-conditional instructions
 – LL x: hardware locks the cache line corresponding to x and returns its contents
 – SC x, new_value: hardware checks whether any instruction has modified x since the LL; if intact, the store succeeds, otherwise the contents of x are left unmodified
 Transactional Memory

The Need for Networks
 Any multi-core system must clearly contain the means for cores to communicate
 – with memory
 – with each other (probably)
 We have briefly discussed buses as the standard multi-core network
 Others are possible
 – but have different characteristics
 – may provide different functionality

Evaluating Networks
 Bandwidth
 – amount of data that can be moved per second
 – highest-bandwidth device: a container ship full of hard disks
 Latency
 – how long it takes a given piece of the message to traverse the network
 – container ship: several weeks; ADSL: microseconds
 Congestion
 – the effect on bandwidth and latency of the utilisation of the network by other processors

Bandwidth vs. Latency
Definitely not the same thing:
 A truck carrying one million 16-Gbyte flash memory cards to London
 – Latency = 4 hours (14,400 secs)
 – Bandwidth ≈ 8 Tbit/sec (8 × 10^12 bits/sec)
 A broadband internet connection
 – Latency = 100 microsec (10^-4 sec)
 – Bandwidth = 10 Mbit/sec (10 × 10^6 bits/sec)

Bus
 Common wire interconnection
 Usually parallel wires: address + data
 Only a single usage at any point in time
 Controlled by clock – divided into time slots
 A sender must ‘grab’ a slot (via arbitration)
 Then transmit (address + data)
 Often ‘split transaction’:
 – e.g. send memory address in one slot
 – data returned by memory in a later slot
 – intervening slots free for use by others

Crossbar
 E.g. to connect N cores to N memories
 Can achieve ‘any to any’ (disjoint) connections in parallel

Ring
 Simple, but:
 – low bandwidth
 – variable latency

Tree
 Variable bandwidth (switches vs. hubs; depth of the tree)
 Variable latency
 Reliability?

Mesh / Grid
 Switched
 Reasonable bandwidth
 Variable latency
 Convenient physical layout for very large systems
 With ‘wrap around’ it becomes a toroid (ring doughnut)

Connecting on-chip
[Diagram: two cores, each with L1 instruction and data caches; L2 cache; shared L3 cache; memory controller; off-chip QPI or HT links]

AMD Opteron (Istanbul)

AMD Magny-cours

Non-Uniform Memory Access (NUMA)

Amdahl’s Law
Speedup = (S + P) / (S + P/#Processors)
 S = fraction of the code which is serial
 P = fraction of the code which can be parallel
