1
Platform Design: Multi-Processor Systems-on-Chip (MPSoC)
TU/e 5kk70, Henk Corporaal and Bart Mesman
2
Overview
- What is a platform, and why platform-based design?
- Why parallel platforms?
- A first classification of parallel systems
- Design choices for parallel systems
- Shared memory systems: memory coherency, consistency, synchronization, mutual exclusion
- Message passing systems
- Further decisions
3
Design & Product requirements?
- Short time-to-market: reuse / standards, short design time
- Flexible solution: reduces design time, extends product lifetime (remote inspect and debug, ...)
- Scalability
- High performance and low power: memory bottleneck, wiring bottleneck
- Low cost
- High quality, reliability, dependability
- RTOS and libraries
- Good programming environment
4
Solution? Platforms:
- Programmable (one or more processor cores) and/or reconfigurable
- Scalable and flexible
- Memory hierarchy: exploit locality, separate local and global wiring
- HW and SW IP reuse: standardization (of SW and HW interfaces), raising the design abstraction level
- Reliable, cheaper
- Plus an advanced design flow for platforms
5
What is a platform? Definition:
A platform is a generic, but domain-specific, information-processing (sub)system.
- Generic means that it is flexible, containing programmable component(s).
- Platforms are meant to quickly realize your next system (in a certain domain).
- Single chip?
6
Example Platform: Sanyo Camera
7
Platform example: TI OMAP
Memory annotations from the block diagram:
- up to 192 Mbyte off-chip memory
- 192 Kbyte shared SRAM
- 8 Kbyte data cache (2-way, 512 lines of 16 bytes) with write buffer (17 elements)
- 16 Kbyte (2-way) cache
- 64 Kbyte dual-port RAM (8x 4K x 16b) and 96 Kbyte single-port RAM (12x 4K x 16b)
- 32 Kbyte ROM
- 16 Kbyte (2-way) cache
- 8 Kbyte memory (2x 4K)
8
Platform and platform design
[Figure: layered stack. Applications are mapped onto the platform by system design technology (SDT); the platform itself is built on enabling technologies by platform design technology (PDT).]
9
Why parallel processing
- Performance drive: diminishing returns for exploiting ILP and OLP
- Multiple processors fit easily on a chip
- Cost effective: just connect existing processors or processor cores
- Low power: parallelism may allow lowering Vdd
- However: parallel programming is hard
10
Low power through parallelism
Sequential processor:
- switching capacitance C, frequency f, voltage V
- P = f C V^2
Parallel processor (two times the number of units):
- switching capacitance 2C, frequency f/2, voltage V' < V
- P = (f/2) * 2C * V'^2 = f C V'^2
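A worked instance (the scaled supply V' = 0.7 V is an illustrative assumption, not a number from the slide): at equal throughput,

\[
\frac{P_{par}}{P_{seq}} = \frac{(f/2)\,2C\,V'^2}{f\,C\,V^2} = \left(\frac{V'}{V}\right)^2 = 0.7^2 \approx 0.5,
\]

so the two-unit machine does the same work at roughly half the power.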
11
Power efficiency: compare 2 examples
Intel Pentium-4 (Northwood), 0.13 micron technology:
- 3.0 GHz, 20 pipeline stages
- aggressive buffering to boost clock frequency
- 13 nanojoule / instruction
Philips TriMedia "Lite", 0.13 micron technology:
- 250 MHz, 8 pipeline stages
- relaxed buffering, focus on instruction parallelism
- 0.2 nanojoule / instruction
TriMedia is doing 65x better than the Pentium.
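The 65x is just the ratio of the two energy-per-instruction figures:

\[
\frac{13\ \text{nJ/instruction}}{0.2\ \text{nJ/instruction}} = 65 .
\]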
12
Parallel Architecture
A parallel architecture extends traditional computer architecture with a communication network:
- abstractions (HW/SW interface)
- an organizational structure to realize the abstraction efficiently
[Figure: processing nodes connected by a communication network.]
13
Platform characteristics
- System level
- Processor level
- Communication network
- Memory system
- Tooling
14
System level characteristics
- Homogeneous or heterogeneous?
- Granularity of the processing elements
- Type of supported parallelism: TLP, DLP
- Runtime mapping support?
15
Homogeneous or Heterogeneous
Homogeneous:
- replication effect
- memory dominated anyway
- solve realization issues once and for all
- but: less flexible
Typically: data-level parallelism, shared memory, dynamic task mapping
16
Example: Philips Wasabi
- Homogeneous multiprocessor for media applications
- Two-level communication hierarchy:
  - top: scalable message-passing network connecting the tiles (TM and ARM cores, pixel SIMD, video scaler, picture improvement, memory)
  - tile: shared memory plus processors and accelerators
- Fully cache coherent, to support data parallelism
17
Homogeneous or Heterogeneous
Heterogeneous:
- better fit to the application domain
- smaller increments
Typically: task-level parallelism, message passing, static task mapping
18
Example: Viper2
[Die diagram blocks: TM3260, MIPS PR4450, QVCP2L, QVCP5L, VIP, MSP, TDCS, MDCS, MBS, VMPG, ...]
- Heterogeneous, platform based; >60 different cores (NB: the IP count includes memories)
- Task parallelism; synchronization with interrupts
- Streaming communication
- Semi-static application graph
- 50 M transistors in 120 nm technology
- Powerful, efficient
19
Homogeneous or Heterogeneous
Middle-of-the-road approach:
- flexible tiles
- fixed tile structure at the top level
20
Types of parallelism (by granularity):
- TLP (program/thread level): heterogeneous; multi-threaded / MIMD
- DLP (module/kernel level): homogeneous; SIMD / vector
- ILP (instruction level): heterogeneous; VLIW / superscalar / dataflow architectures
21
Processor level characteristics
A processor consists of:
- an instruction engine (control processor, I-fetch unit)
- processing element(s) (PE): register file, function unit(s), L1 DMem
Design choices:
- single PE or multiple PEs (as in SIMD)
- single FU/PE or multiple FUs/PE (as in VLIW)
- granularity of the PEs and FUs: specialized or generic
- interruptable, pre-emption support
- multithreading support (fast context switches)
- clustering of PEs; clustering of FUs
- type of inter-PE and inter-FU communication network
- others: MMU / virtual memory, ...
22
Generic or Specialized? Intrinsic computational efficiency
23
General processor organization
[Figure: classic five-stage pipelined datapath (PC, fetch, decode, EX with register file and ALU, MEM, WB) with forwarding multiplexers.]
- PE: processing engine
- Instruction fetch / control
- FU: function unit
24
(Linear) SIMD Architecture
[Figure: linear SIMD array. A control processor with instruction memory (IMem) drives PE1..PEn; each PE contains a function unit (FU), register file (RF), and local data memory (DMem).]
To be added:
- inter-PE communication
- communication from the PEs to the control processor
- input and output
25
Communication network
- Bus (a single all-to-all connection)
- Crossbar
- NoC with point-to-point connections:
  - topology, router degree
  - routing: path, path control, collision resolution, network support, deadlock handling, livelock handling
  - virtual layer support
  - flow control and buffering
  - error handling
  - inter-chip network support
  - guarantees: TDMA, GT (guaranteed throughput) and BE (best effort) traffic
  - etc., etc.
26
Comm. Network: Performance metrics
Network bandwidth:
- need high bandwidth in communication
- how does it scale with the number of nodes?
Communication latency:
- affects performance, since the processor may have to wait
- affects ease of programming, since it requires more thought to overlap communication and computation
Latency hiding:
- global memory access can take hundreds of cycles
- how can a mechanism help hide latency? Examples: overlap message send with computation, prefetch data, switch to other tasks
27
How good is your network?
Topology determines:
- degree = number of links from a node
- diameter = maximum number of links crossed between any two nodes
- average distance = average number of links to a random destination
- bisection = minimum number of links that, when cut, separate the network into two halves
- bisection bandwidth = link bandwidth x bisection
28
Metrics for common topologies
Type        Degree   Diameter       Ave Dist      Bisection
1D mesh     2        N-1            N/3           1
2D mesh     4        2(N^1/2 - 1)   2N^1/2 / 3    N^1/2
3D mesh     6        3(N^1/3 - 1)   3N^1/3 / 3    N^2/3
nD mesh     2n       n(N^1/n - 1)   nN^1/n / 3    N^(n-1)/n
Ring        2        N/2            N/4           2
2D torus    4        N^1/2          N^1/2 / 2     2N^1/2
Hypercube   log2 N   n = log2 N     n/2           N/2
2D tree     3        2 log2 N       ~2 log2 N     1
Crossbar    N        1              1             N^2/2

N = number of nodes, n = dimension
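The mesh and hypercube rows can be checked mechanically. A minimal sketch (function names are mine; it assumes N = k^n nodes, k per dimension) that evaluates the table's formulas:

#include <stdio.h>
#include <math.h>

/* n-dimensional mesh with N = k^n nodes (k nodes per dimension) */
static void mesh_metrics(int n, int N) {
    double k = pow(N, 1.0 / n);               /* nodes per dimension, N^(1/n) */
    double degree    = 2.0 * n;               /* 2 links per dimension */
    double diameter  = n * (k - 1.0);         /* worst case: corner to corner */
    double avg_dist  = n * k / 3.0;           /* ~k/3 per dimension, n dimensions */
    double bisection = pow(N, (n - 1.0) / n); /* an (n-1)-dimensional cut plane */
    printf("%dD mesh (N=%d): degree %.0f, diameter %.0f, avg %.1f, bisection %.0f\n",
           n, N, degree, diameter, avg_dist, bisection);
}

int main(void) {
    mesh_metrics(1, 64);
    mesh_metrics(2, 64);   /* 8x8 grid */
    mesh_metrics(3, 64);   /* 4x4x4 */
    /* hypercube with N nodes: degree = diameter = log2(N), bisection = N/2 */
    int N = 64, dim = (int)log2(N);
    printf("hypercube (N=%d): degree %d, diameter %d, bisection %d\n",
           N, dim, dim, N / 2);
    return 0;
}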
29
More topology metrics
[Figure: hypercube, grid/mesh, and torus topologies.]

Assume 64 nodes:

Criteria                          Bus   Ring   Mesh   2D torus   6-cube   Fully connected
Performance: bisection bandwidth  1     2      8      16         32       1024
Cost: ports per switch            -     3      5      5          7        64
Cost: total # links               1     128    176    192        256      2080
30
Multi-stage network: Butterfly or Omega
- All paths have equal length
- Unique path from any input to any output
- Try to avoid conflicts!
[Figure: 8x8 butterfly built from 2x2 switches; a larger butterfly combines two N/2 butterflies with an extra switch stage.]
How do you make a bigger butterfly network?
31
Multistage Fat Tree
- A multistage fat tree (as in the CM-5) avoids congestion at the root node
- Randomly assign packets to different paths on the way up, to spread the load
- Increasing the degree near the root decreases congestion
32
What did architects design in the '90s? Old (off-chip) MP networks
Name          Topology    Link BW   Bisection BW
nCube/ten     hypercube   -         -
iPSC          hypercube   -         -
MP-1216       2D grid     3         1,300
Delta (540)   2D grid     -         -
CM-5          fat tree    20        10,240
CS-2          fat tree    50        50,000
Paragon       2D grid     200       6,400
T3D           3D torus    -         -
(bandwidths in MBytes/s)

No standard topology! However, for on-chip networks, mesh and torus are in favor!
33
Memory hierarchy
- Number of memory levels: 1, 2, 3, 4
- HW- or SW-controlled level 1: cache or scratchpad memory (L1)
- Central vs. distributed memory
- Shared vs. distributed memory address space
- Intelligent DMA support: communication assist
- For shared memory: coherency, consistency, synchronization
34
Intermezzo: What's the problem with memory?
[Figure: processor vs. DRAM performance over time, 1980-2000, log scale. Processor performance (the "Moore's Law" curve) grows ~55% per year, DRAM only ~7% per year; the processor-memory performance gap grows ~50% per year (Patterson).]
Memories can also be big power consumers!
35
Multiple levels of memory
Architecture concept:
[Figure: a hierarchy of communication networks. At level 0, reconfigurable HW blocks, CPUs, and accelerators connect to local memories and I/O; level-1 up to level-N networks connect these clusters to further memories and I/O.]
36
Communication models: Shared Memory
[Figure: processes P1 and P2 both read and write a single shared memory.]
- Coherence problem
- Memory consistency issue
- Synchronization problem
37
Communication models: Shared memory
- Shared address space
- Communication primitives: load, store, atomic swap
- Two varieties:
  - physically shared => symmetric multi-processors (SMP), usually combined with local caching
  - physically distributed => distributed shared memory (DSM)
- The first (SMP) is easy and still useful, e.g., workstations within a building
38
SMP: Symmetric Multi-Processor
- Memory: centralized, with uniform memory access time (UMA); bus interconnect; shared I/O
- Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: processors, each with one or more cache levels, sharing main memory and an I/O system over a bus.]
39
DSM: Distributed Shared Memory
- Non-uniform memory access time (NUMA); scalable interconnect; physically distributed memory
[Figure: nodes of processor + cache + memory joined by an interconnection network, with main memory and an I/O system.]
Example: Cray T3E
- 480 MB/sec per link, 3 links per node
- memory on node; switch based; up to 2048 nodes
- $30M to $50M
40
Shared Address Model Summary
- Each processor can name every physical location in the machine
- Each process can name all data it shares with other processes
- Data transfer via load and store
- Data size: byte, word, ... or cache blocks
- Memory hierarchy model applies: communication moves data into the local processor cache
41
Communication models: Message Passing
- Communication primitives: e.g., send and receive library calls
- Note that MP can be built on top of SM, and vice versa
[Figure: process P1 sends into a FIFO; process P2 receives from it.]
42
Message Passing Model
- Explicit message send and receive operations
- Send specifies a local buffer plus the receiving process on the remote computer
- Receive specifies the sending process on the remote computer plus a local buffer to place the data
- Typically blocking communication, but may use DMA
- Message structure: header, data, trailer
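As a concrete rendering of the send/receive pair (MPI is my choice of library here; the slide only names generic primitives):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 42;
    if (rank == 0) {
        /* send: local buffer + receiving process (rank 1) */
        MPI_Send(&data, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive: sending process (rank 0) + local buffer for the data */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}

Both calls block, matching the "typically blocking communication" point above.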
43
Message passing communication
[Figure: each node couples a processor + cache + memory with a DMA engine and a network interface; the nodes communicate over the interconnection network.]
44
Communication Models: Comparison
Shared memory:
- compatibility with well-understood (language) mechanisms
- ease of programming for complex or dynamic communication patterns
- suits applications that share large data structures
- efficient for small items
- supports hardware caching
Message passing:
- simpler hardware
- explicit communication
- improved synchronization
45
Challenges of parallel processing
Q1: Can we get linear speedup?
- Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation may be sequential (i.e., non-parallel)?
Q2: How important is communication latency?
- Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
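Worked answers (mine, not spelled out on the slide; Q1 is Amdahl's law, Q2 a simple CPI model assuming one memory access per instruction):

\[
\text{Q1:}\quad \frac{1}{s + \frac{1-s}{100}} = 80
\;\Rightarrow\; 99s + 1 = \frac{100}{80} = 1.25
\;\Rightarrow\; s = \frac{0.25}{99} \approx 0.25\%
\]

so at most about 0.25% of the computation may be sequential.

\[
\text{Q2:}\quad \text{CPI} = 0.5 + 0.002 \times 100 = 0.7
\;\Rightarrow\; \frac{0.7}{0.5} = 1.4\times \text{ slower}
\]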
46
Three fundamental issues for shared memory multiprocessors
- Coherence: do I see the most recent data?
- Consistency: when do I see a written value? E.g., do different processors see writes at the same time (with respect to other memory accesses)?
- Synchronization: how do we synchronize processes? How do we protect access to shared data?
47
Coherence problem in a single-CPU system
[Figure: CPU with a write-back cache, memory, and I/O. Initially the cache holds a'=100, b'=200 and memory holds a=100, b=200. After the CPU writes 550 to a, the cache has a'=550 while memory still has a=100; after I/O writes 440 to b in memory, the cache still holds the stale b'=200.]
48
Coherence problem in a multi-processor system
[Figure: two CPUs with private caches above a shared memory. CPU-1 has written a'=550 in its cache; CPU-2's cache still holds a''=100, and memory holds a=100. Both caches hold b=200, consistent with memory.]
49
What Does Coherency Mean?
Informally: "any read must return the most recent write"
- too strict, and too difficult to implement
Better: "any write must eventually be seen by a read"
- all writes are seen in proper order ("serialization")
50
Two rules to ensure coherency
1. "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
2. Writes to a single location are serialized: they are seen in one order
- the latest write will be seen
- otherwise one could see writes in an illogical order (an older value after a newer value)
51
Potential HW Coherency Solutions
Snooping solution (snoopy bus):
- send all requests for data to all processors (or their local caches)
- processors snoop to see if they have a copy, and respond accordingly
- requires broadcast, since the caching information is at the processors
- works well with a bus (natural broadcast medium)
- dominates for small-scale machines (most of the market)
Directory-based schemes:
- keep track of what is being shared in one centralized place
- distributed memory => distributed directory, for scalability (avoids bottlenecks)
- send point-to-point requests to processors via the network
- scales better than snooping
- actually existed BEFORE snooping-based schemes
52
Example Snooping protocol
- 3 states for each cache line: invalid, shared, modified (exclusive)
- FSM per cache; it receives requests from both the processor and the bus
[Figure: processors with snooping caches on a shared bus, with main memory and an I/O system.]
53
Cache coherence protocol
Write-invalidate protocol for a write-back cache. [Figure: the state transitions for each block in the cache.]
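A minimal sketch of such a 3-state (MSI) write-invalidate controller, with function names of my choosing; a real controller also arbitrates for the bus and writes dirty blocks back:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Processor-side events */
line_state on_proc_read(line_state s) {
    switch (s) {
    case INVALID:  return SHARED;    /* read miss: fetch block, others may share */
    case SHARED:   return SHARED;    /* read hit */
    case MODIFIED: return MODIFIED;  /* read hit on a dirty block */
    }
    return s;
}

line_state on_proc_write(line_state s) {
    (void)s;          /* from INVALID/SHARED: broadcast an invalidate first */
    return MODIFIED;
}

/* Bus-side (snooped) events from other processors */
line_state on_bus_invalidate(line_state s) {
    (void)s;
    return INVALID;                  /* another cache is about to write */
}

line_state on_bus_read(line_state s) {
    if (s == MODIFIED)               /* supply the dirty data, then demote */
        return SHARED;               /* (block is written back to memory) */
    return s;
}

int main(void) {
    line_state s = INVALID;
    s = on_proc_read(s);      /* INVALID  -> SHARED                          */
    s = on_proc_write(s);     /* SHARED   -> MODIFIED (invalidate others)    */
    s = on_bus_read(s);       /* MODIFIED -> SHARED   (snooped read)         */
    s = on_bus_invalidate(s); /* SHARED   -> INVALID  (other cache writes)   */
    printf("final state: %d\n", s);
    return 0;
}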
54
Synchronization problem
The computer system of a bank has a credit process (P_c) and a debit process (P_d):

/* Process P_c */              /* Process P_d */
shared  int balance            shared  int balance
private int amount             private int amount

balance += amount              balance -= amount

lw  $t0,balance                lw  $t2,balance
lw  $t1,amount                 lw  $t3,amount
add $t0,$t0,$t1                sub $t2,$t2,$t3
sw  $t0,balance                sw  $t2,balance
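The updates are not atomic, so the loads and stores of the two processes can interleave. One possible schedule (my illustration, starting from balance = 100 and amount = 10 for both) loses the credit:

P_c: lw  $t0,balance    ($t0 = 100)
P_d: lw  $t2,balance    ($t2 = 100)
P_c: lw  $t1,amount     ($t1 = 10)
P_d: lw  $t3,amount     ($t3 = 10)
P_c: add $t0,$t0,$t1    ($t0 = 110)
P_d: sub $t2,$t2,$t3    ($t2 = 90)
P_c: sw  $t0,balance    (balance = 110)
P_d: sw  $t2,balance    (balance = 90; the credit is lost, the result should be 100)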
55
Critical Section Problem
- n processes all competing to use some shared data
- Each process has a code segment, called the critical section, in which the shared data is accessed
- Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section

Structure of a process:

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}
56
Attempt 1 – Strict Alternation
shared int turn;

/* Process P0 */               /* Process P1 */
while (TRUE) {                 while (TRUE) {
    while (turn != 0) ;            while (turn != 1) ;
    critical_section();            critical_section();
    turn = 1;                      turn = 0;
    remainder_section();           remainder_section();
}                              }

Two problems:
- satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
- busy waiting
57
Attempt 2 – Warning Flags
shared int flag[2];

/* Process P0 */               /* Process P1 */
while (TRUE) {                 while (TRUE) {
    flag[0] = TRUE;                flag[1] = TRUE;
    while (flag[1]) ;              while (flag[0]) ;
    critical_section();            critical_section();
    flag[0] = FALSE;               flag[1] = FALSE;
    remainder_section();           remainder_section();
}                              }

Satisfies mutual exclusion:
- P0 in its critical section: flag[0] && !flag[1]
- P1 in its critical section: !flag[0] && flag[1]
However, it contains a deadlock (both flags may be set to TRUE!)
58
Software solution: Peterson’s Algorithm
(combining warning flags and alternation)

shared int flag[2];
shared int turn;

/* Process P0 */                   /* Process P1 */
while (TRUE) {                     while (TRUE) {
    flag[0] = TRUE;                    flag[1] = TRUE;
    turn = 0;                          turn = 1;
    while (turn==0 && flag[1]) ;       while (turn==1 && flag[0]) ;
    critical_section();                critical_section();
    flag[0] = FALSE;                   flag[1] = FALSE;
    remainder_section();               remainder_section();
}                                  }

A software solution is slow!
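For reference, a runnable C11 rendering of the slide's variant (each process sets turn to its own id and yields while it still holds that value and the other is interested). The atomics are my addition; their default seq_cst ordering matters, because Peterson's algorithm breaks under the store/load reordering discussed on the memory-consistency slide later:

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];
atomic_int turn;

void lock(int self) {                /* self is 0 or 1 */
    int other = 1 - self;
    atomic_store(&flag[self], true); /* announce interest */
    atomic_store(&turn, self);
    while (atomic_load(&turn) == self && atomic_load(&flag[other]))
        ;                            /* busy wait */
}

void unlock(int self) {
    atomic_store(&flag[self], false);
}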
59
Issues for Synchronization
Hardware support:
- an un-interruptable instruction to fetch-and-update memory (atomic operation)
- user-level synchronization operation(s) built from this primitive
- for large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization
60
Uninterruptable Instructions to Fetch and Update Memory
- Atomic exchange: interchange a value in a register with a value in memory
  - 0 => synchronization variable is free
  - 1 => synchronization variable is locked and unavailable
- Test-and-set: test a value and set it if the value passes the test (also compare-and-swap)
- Fetch-and-increment: return the value of a memory location and atomically increment it
61
User-Level Synchronization Operations
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

        LI   R2,#1      ;load immediate
lockit: EXCH R2,0(R1)   ;atomic exchange
        BNEZ R2,lockit  ;already locked?

What about an MP with cache coherency?
- want to spin on a cached copy, to avoid full memory latency
- likely to get cache hits for such variables
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.
Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

try:    LI   R2,#1      ;load immediate
lockit: LW   R3,0(R1)   ;load var
        BNEZ R3,lockit  ;not free => spin
        EXCH R2,0(R1)   ;atomic exchange
        BNEZ R2,try     ;already locked?
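The same test-and-test&set idea in portable C11 (my rendering; the slide's version is MIPS-style assembly, with atomic_exchange playing the role of EXCH):

#include <stdatomic.h>

typedef atomic_int spinlock_t;     /* 0 = free, 1 = locked */

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* spin on a (cacheable) read first: no bus traffic while locked */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;
        /* lock looks free: now try the atomic exchange */
        if (atomic_exchange(l, 1) == 0)
            return;                /* got it: the old value was 0 (free) */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(l, 0);
}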
62
Fetch and Update (cont'd)
Hard to have read & write in one instruction: use two instead
- Load Linked (or load locked) + Store Conditional
  - load linked returns the initial value
  - store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

Example: atomic swap with LL & SC:

try:  OR    R3,R4,R0   ; R3 = R4 (move exchange value)
      LL    R2,0(R1)   ; load linked
      SC    R3,0(R1)   ; store conditional
      BEQZ  R3,try     ; branch if store fails (R3 = 0)
      MOV   R4,R2      ; put loaded value in R4

Example: fetch & increment with LL & SC:

try:  LL    R2,0(R1)   ; load linked
      ADDUI R3,R2,#1   ; increment
      SC    R3,0(R1)   ; store conditional
      BEQZ  R3,try     ; branch if store fails (R3 = 0)
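C11 has no LL/SC directly, but compare-and-swap gives the same retry pattern (my sketch; on LL/SC machines the compiler typically implements the CAS with LL + SC):

#include <stdatomic.h>

int fetch_and_increment(atomic_int *p) {
    int old = atomic_load(p);
    /* retry until no other store intervened -- like a failing SC */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;                          /* 'old' is refreshed on each failure */
    return old;
}

In practice one would simply call atomic_fetch_add(p, 1); the explicit loop is shown to mirror the LL/SC retry.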
63
Another MP Issue: Memory Consistency
What is consistency? When must a processor see a new memory value?

Example:
  P1: A = 0;             P2: B = 0;
      .....                  .....
      A = 1;                 B = 1;
  L1: if (B == 0) ...    L2: if (A == 0) ...

It seems impossible for both if-statements L1 & L2 to be true. But what if the write invalidate is delayed and the processor continues?
Memory consistency models define what the rules are for such cases.
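Under sequential consistency at most one of the two branches can be taken; under weaker models both can. In C11 terms (my illustration), the guarantee returns when the stores and loads are seq_cst:

/* Dekker-style consistency test -- sketch. With the default
   memory_order_seq_cst, at least one thread must observe the other's
   write, so r1 and r2 cannot both end up 0. With relaxed ordering
   (or plain variables on a weakly ordered machine), both may be 0. */
#include <stdatomic.h>

atomic_int A, B;
int r1, r2;

void p1(void) {
    atomic_store(&A, 1);           /* seq_cst by default */
    r1 = atomic_load(&B);          /* L1: did we see B == 0? */
}

void p2(void) {
    atomic_store(&B, 1);
    r2 = atomic_load(&A);          /* L2: did we see A == 0? */
}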
64
Tooling, OS, and Mapping
- Which mapping steps are performed in HW?
- Pre-emption support
- Programming model: streaming or vector support (like KernelC and StreamC for Imagine, StreamIt for RAW)
- Process communication: shared memory or message passing
- Process synchronization
65
A few platform examples
66
Massively Parallel Processors Targeting Digital Signal Processing Applications
67
Field Programmable Object Array (MathStar)
68
PACT XPP-III Processor array
69
RAW processor from MIT
70
RAW: Switch Detail
Raw exposes wire delay at the ISA level. This allows the compiler to explicitly manage the static network: routes are compiled into the static router, and messages arrive in a known order.
Latency: 2 + #hops. Throughput: 1 word/cycle per direction, per network.
71
Philips AETHEREAL
The router provides both guaranteed-throughput (GT) and best-effort (BE) services for communication between IPs. The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Figure: a network of routers (R); IP blocks attach through network interfaces.]