CS 213 Lecture 7: Multiprocessor 3: Synchronization, Prefetching

Synchronization

Why synchronize? We need to know when it is safe for different processes to use shared data.

Issues for synchronization:
- An uninterruptible instruction to fetch and update memory (an atomic operation)
- User-level synchronization operations built from this primitive
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization

Uninterruptible Instructions to Fetch and Update Memory

Atomic exchange: interchange a value in a register with a value in memory.
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable
- Set the register to 1 and swap; the new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first), 1 if another processor had already claimed access.
- The key is that the exchange operation is indivisible.

Test-and-set: tests a value and sets it if the value passes the test.

Fetch-and-increment: returns the value of a memory location and atomically increments it.
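These three primitives map directly onto C11's <stdatomic.h>. A minimal sketch of all of them, assuming the slide's convention that 0 means free and 1 means locked (the function and variable names are illustrative, not from the lecture):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int lock_var = 0;            /* 0 => free, 1 => locked */

    /* Atomic exchange: returns 0 if we set the lock (we were first),
     * 1 if another processor had already claimed it. */
    int try_acquire_exchange(void) {
        return atomic_exchange(&lock_var, 1);
    }

    /* Test-and-set: C11 exposes it via atomic_flag.  Returns true if
     * we acquired the lock (the flag was previously clear). */
    atomic_flag flag = ATOMIC_FLAG_INIT;
    bool try_acquire_tas(void) {
        return !atomic_flag_test_and_set(&flag);
    }

    /* Fetch-and-increment: atomically increments the location and
     * returns its old value. */
    atomic_int counter = 0;
    int fetch_and_increment(void) {
        return atomic_fetch_add(&counter, 1);
    }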

Uninterruptible Instructions to Fetch and Update Memory (cont.)

It is hard to have a read and a write in one instruction, so use two instead: load linked (or load locked) plus store conditional.
- Load linked returns the initial value.
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise.

Example: atomic swap with LL & SC:

    try:  mov   R3,R4      ; move exchange value
          ll    R2,0(R1)   ; load linked
          sc    R3,0(R1)   ; store conditional
          beqz  R3,try     ; branch if store fails (R3 = 0)
          mov   R4,R2      ; put loaded value in R4

Example: fetch-and-increment with LL & SC:

    try:  ll    R2,0(R1)   ; load linked
          addi  R2,R2,#1   ; increment (OK if reg-reg)
          sc    R2,0(R1)   ; store conditional
          beqz  R2,try     ; branch if store fails (R2 = 0)
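Portable C has no direct LL/SC, but the same retry structure is expressed with a compare-exchange loop, which compilers typically lower to an ll/sc sequence on MIPS-style machines. A sketch of fetch-and-increment in that style (illustrative, not the lecture's code):

    #include <stdatomic.h>

    int fetch_and_increment_cas(atomic_int *p) {
        int old = atomic_load(p);
        /* Like sc, compare_exchange_weak may fail (even spuriously);
         * on failure 'old' is reloaded with the current value, so we
         * simply retry, mirroring the beqz-to-try loop above. */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;
        return old;
    }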

User-Level Synchronization: Operations Using the Primitive

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

            li    R2,#1
    lockit: exch  R2,0(R1)    ; atomic exchange
            bnez  R2,lockit   ; already locked?

What about an MP with cache coherence? We want to spin on a cached copy to avoid the full memory latency, and we are likely to get cache hits for such variables.
- Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.
- Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

    try:    li    R2,#1
    lockit: lw    R3,0(R1)    ; load the variable
            bnez  R3,lockit   ; not free => spin
            exch  R2,0(R1)    ; atomic exchange
            bnez  R2,try      ; already locked?
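A C11 rendering of the test-and-test&set loop, spinning on plain loads (which hit in the local cache) and attempting the write-generating exchange only when the lock looks free (a sketch; names are illustrative):

    #include <stdatomic.h>

    atomic_int lock_var = 0;            /* 0 => free, 1 => locked */

    void spin_lock(void) {
        for (;;) {
            /* Test: read-only spin; hits in the local cache and
             * generates no invalidations. */
            while (atomic_load(&lock_var) != 0)
                ;
            /* Test&set: the exchange writes, so try it only now. */
            if (atomic_exchange(&lock_var, 1) == 0)
                return;                 /* lock acquired */
        }
    }

    void spin_unlock(void) {
        atomic_store(&lock_var, 0);
    }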

Another MP Issue: Memory Consistency Models

What is consistency? When must a processor see a new value? For example:

    P1:  A = 0;                P2:  B = 0;
         .....                      .....
         A = 1;                     B = 1;
    L1:  if (B == 0) ...       L2:  if (A == 0) ...

Is it impossible for both if statements L1 and L2 to be true? What if a write invalidate is delayed and the processor continues?

Memory consistency models define the rules for such cases.

Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses from different processors were interleaved => the assignments above complete before the ifs. To implement SC: delay each memory access until all invalidates are done.
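This two-processor example is the classic litmus test, and it can be expressed with C11 atomics, whose default ordering is sequentially consistent. A sketch (thread creation omitted; assume p1 and p2 run concurrently):

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;
    int r1, r2;

    void p1(void) {
        atomic_store(&A, 1);            /* A = 1      */
        r1 = atomic_load(&B);           /* L1: test B */
    }

    void p2(void) {
        atomic_store(&B, 1);            /* B = 1      */
        r2 = atomic_load(&A);           /* L2: test A */
    }

    /* With the default memory_order_seq_cst, r1 == 0 && r2 == 0 is
     * impossible: some interleaving of the four accesses must place
     * one store before the other processor's load.  With relaxed
     * ordering (or delayed invalidates in hardware), both can be 0. */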

Memory Consistency Model

Several schemes offer faster execution than sequential consistency. This is not really an issue for most programs, because they are synchronized: a program is synchronized if all accesses to shared data are ordered by synchronization operations:

    write (x)
    ...
    release (s) {unlock}
    ...
    acquire (s) {lock}
    ...
    read (x)

Only programs willing to be nondeterministic are not synchronized: in a "data race" the outcome is a function of processor speed.

Since most programs are synchronized, several relaxed models for memory consistency exist; they are characterized by their attitude towards RAR, WAR, RAW, and WAW to different addresses.
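The write/release then acquire/read discipline above maps directly onto C11 release and acquire orderings; a minimal sketch with one producer and one consumer (names illustrative):

    #include <stdatomic.h>

    int x;                              /* shared data             */
    atomic_int s = 0;                   /* synchronization variable */

    void producer(void) {
        x = 42;                                             /* write(x)   */
        atomic_store_explicit(&s, 1, memory_order_release); /* release(s) */
    }

    void consumer(void) {
        while (atomic_load_explicit(&s, memory_order_acquire) == 0)
            ;                                               /* acquire(s) */
        int v = x;                      /* read(x): guaranteed to see 42 */
        (void)v;
    }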

Problems in Hardware Prefetching

- Unnecessarily prefetched data increases bus and memory traffic, degrading performance both for data that is never used and for data that arrives late.
- Prefetched data may displace data in the processor's working set: the cache pollution problem.
- Prefetched data may be invalidated by other processors or by DMA.

Summary: prefetching is necessary, but how to prefetch, which data to prefetch, and when to prefetch are questions that must be answered.

Problems (cont.)

Not all data is accessed sequentially, so how do we avoid prefetching unnecessary data?
(1) Stride access in some scientific computations
(2) Linked-list data: how can it be detected and prefetched?
(3) Predicting data from program behavior, e.g. Mowry's software data prefetching via compiler analysis and prediction, the hardware Reference Prediction Table (RPT) of Chen and Baer, and Markov-model prediction

How do we limit cache pollution? Jouppi's stream buffer technique is extremely helpful. What is a stream buffer compared to a victim buffer?
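For the stride case, a compiler (or programmer) can issue software prefetches in the spirit of Mowry's compiler-inserted prefetching. A sketch using GCC/Clang's __builtin_prefetch (the lookahead distance is an illustrative tuning parameter, and the low-locality hint limits cache pollution):

    #define PREFETCH_AHEAD 16           /* iterations of lookahead;
                                         * tune to cover memory latency */

    double strided_sum(const double *a, long n, long stride) {
        double sum = 0.0;
        for (long i = 0; i < n; i += stride) {
            if (i + PREFETCH_AHEAD * stride < n)
                /* rw = 0 (read), locality = 1 (low temporal locality,
                 * so the line is less likely to pollute the cache) */
                __builtin_prefetch(&a[i + PREFETCH_AHEAD * stride], 0, 1);
            sum += a[i];
        }
        return sum;
    }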

Prefetching in Multiprocessors

- Memory access latency is large, particularly in CC-NUMA machines, so prefetching is more useful.
- Prefetches increase memory and interconnection network (IN) traffic.
- Prefetching shared data causes additional coherence traffic.
- Invalidation misses are not predictable at compile time.
- Dynamic task scheduling and migration may create further problems for prefetching.

Architectural Comparisons

High-level organizations:
- Aggressive superscalar (SS)
- Fine-grained multithreaded (FGMT)
- Chip multiprocessor (CMP)
- Simultaneous multithreaded (SMT)

Superscalar describes a microprocessor design in which more than one instruction can be executed during a single clock cycle.

FGMT: instructions may be fetched and issued from a different thread each cycle. This architecture attempts to find all available ILP within a thread of execution, and it can mitigate the effects of a stalled thread by quickly switching to another thread, ideally increasing overall throughput.

CMP: the chip resources are rigidly partitioned among multiple processors, each with its own execution pipeline, register files, etc. The benefit is that multiple threads can execute completely in parallel; the drawback is that each thread can use only its local resources.

SMT: instructions from different threads can be processed in each cycle, so the overall instruction throughput can be increased.

Ref: [NPRD]

Architectural Comparisons (cont.)

[Figure: issue slots over time (processor cycles) for the superscalar, fine-grained, coarse-grained, simultaneous multithreading, and multiprocessing organizations; slots are shaded by Threads 1-5, with white denoting idle slots.]

Coarse-grained: when the pipeline is stalled, the running thread is switched out and a new thread is put into execution.

Embedded Multiprocessors

EmpowerTel MXP, for Voice over IP:
- 4 MIPS processors, each with 12 to 24 KB of cache
- 13.5 million transistors, 133 MHz
- PCI master/slave + 100 Mbit Ethernet pipe

Embedded multiprocessing will become more popular as applications demand more performance:
- No binary compatibility; software is written from scratch
- Applications often have natural parallelism: a set-top box, a network switch, or a game system
- Greater sensitivity to die cost (and hence to efficient use of silicon)

Why Network Processors?

Current situation:
- Data rates are increasing
- Protocols are becoming more dynamic and sophisticated
- Protocols are being introduced more rapidly

Processing elements:
- GP (general-purpose processor): programmable, but not optimized for networking applications
- ASIC (application-specific integrated circuit): high processing capacity, but long development time and a lack of flexibility
- NP (network processor): achieves high processing performance with programming flexibility, and is cheaper than a GP

IXP1200 Block Diagram

- StrongARM processing core
- Microengines introduce a new ISA
- I/O: PCI, SDRAM, SRAM, and the IX bus (a PCI-like packet bus)
- On-chip FIFOs: 16 entries, 64 B each

Ref: [NPT]

IXP2400 Block Diagram

- XScale core replaces the StrongARM
- Microengines: faster, and more of them (2 clusters of 4 microengines each)
- Local memory
- Next-neighbor routes added between microengines
- Hardware to accelerate CRC operations and random-number generation
- 16-entry CAM

[Block diagram: XScale core; microengines ME0-ME7; Scratch/Hash/CSR; MSF unit; DDR DRAM controller; QDR SRAM; PCI.]