Platform based design 5KK70: MPSoC Platforms, with special emphasis on the Cell. Bart Mesman and Henk Corporaal.

1 Platform based design 5KK70: MPSoC Platforms, with special emphasis on the Cell. Bart Mesman and Henk Corporaal

2 6/25/2015 ECA - 5KK73. H. Corporaal and B. Mesman. The Software Crisis

3 The first SW crisis
Time frame: ’60s and ’70s
Problem: assembly language programming
– Computers could handle larger, more complex programs
Needed to get abstraction and portability without losing performance
Solution:
– High-level languages for von Neumann machines: FORTRAN and C

4 The second SW crisis
Time frame: ’80s and ’90s
Problem: inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
– Computers could handle larger, more complex programs
Needed to get composability and maintainability
– High performance was not an issue: left to Moore’s Law

5 Solution
Object-oriented programming
– C++, C# and Java
Also…
– Better tools: component libraries, Purify
– Better software engineering methodology: design patterns, specification, testing, code reviews

6 Today: Programmers are Oblivious to Processors
Solid boundary between hardware and software
Programmers don’t have to know anything about the processor
– High-level languages abstract away the processors (e.g. Java bytecode is machine independent)
– Thanks to Moore’s law, programmers get good speedups without knowing anything about the processors
Programs are oblivious of the processor, so they work on all processors
– A program written in the ’70s using C still works and is much faster today
This abstraction provides a lot of freedom for the programmers

7 The third crisis: Powered by PlayStation

8 Contents
Hammer your head against 4 walls
– Or: why multi-processor
Cell architecture
Programming and porting
– plus case study

9 Moore’s Law

10 Single Processor SPECint Performance

11 What’s stopping them?
General-purpose uni-cores have stopped historic performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism

12 Power density

13 Power Efficiency (Watts/Spec)

14 1 clock cycle wire range

15 Global wiring delay becomes dominant over gate delay

16 Memory

17 Now what?
Latest research drained
Tried every trick in the book
So: we’re fresh out of ideas
Multi-processor is all that’s left!

18 Low power through parallelism
Sequential processor
– Switching capacitance $C$
– Frequency $f$
– Voltage $V$
– $P = \alpha f C V^2$
Parallel processor (two times the number of units)
– Switching capacitance $2C$
– Frequency $f/2$
– Voltage $V' < V$
– $P = \alpha \cdot \frac{f}{2} \cdot 2C \cdot V'^2 = \alpha f C V'^2$
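As a worked instance of the power comparison on this slide, the block below plugs in one number. It assumes that the garbled leading factor in the slide's formula is the switching activity factor $\alpha$, and that halving the frequency lets the supply voltage drop to $V' = 0.7V$. The 0.7 figure is purely illustrative; the slide gives no voltage values.

```latex
\begin{align*}
P_{\text{seq}} &= \alpha f C V^2 \\
P_{\text{par}} &= \alpha \cdot \frac{f}{2} \cdot 2C \cdot V'^2 = \alpha f C V'^2 \\
\frac{P_{\text{par}}}{P_{\text{seq}}} &= \left(\frac{V'}{V}\right)^2 = 0.7^2 = 0.49
\end{align*}
```

Under these assumptions the two-unit design delivers the same throughput (two units at half the frequency) for roughly half the power, and the saving comes entirely from the lower supply voltage.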

19 Architecture methods: Powerful Instructions (1)
MD-technique
– Multiple data operands per operation
– SIMD: Single Instruction Multiple Data
Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];
or, in vector notation: c = a + 5*b
Assembly:
    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
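To make the mapping from the loop to the vector instructions concrete, here is a minimal C sketch. It shows a scalar reference loop and a strip-mined version in which each inner loop stands in for one vector instruction from the slide. The function names are hypothetical, and the register-to-array mapping (r1 for a, r2 for b, r3 for c) is inferred from the assembly rather than stated in the deck.

```c
#include <stddef.h>

#define VL 64  /* vector length, matching "set vl,64" on the slide */

/* Scalar reference: one element per loop iteration. */
void axpb_scalar(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + 5 * b[i];
}

/* Strip-mined form: the data is processed in strips of at most VL
 * elements, and each inner loop models one vector instruction. */
void axpb_stripmined(const int *a, const int *b, int *c, size_t n)
{
    int v1[VL], v2[VL], v3[VL];  /* models of vector registers v1..v3 */

    for (size_t base = 0; base < n; base += VL) {
        size_t len = (n - base < VL) ? n - base : VL;

        for (size_t i = 0; i < len; i++) v1[i] = b[base + i];    /* ldv   v1,0(r2)  */
        for (size_t i = 0; i < len; i++) v2[i] = v1[i] * 5;      /* mulvi v2,v1,5   */
        for (size_t i = 0; i < len; i++) v1[i] = a[base + i];    /* ldv   v1,0(r1)  */
        for (size_t i = 0; i < len; i++) v3[i] = v1[i] + v2[i];  /* addv  v3,v1,v2  */
        for (size_t i = 0; i < len; i++) c[base + i] = v3[i];    /* stv   v3,0(r3)  */
    }
}
```

Both functions compute the same result; the second only makes visible that six vector instructions replace 64 iterations of the scalar loop body.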

20 Architecture methods: Powerful Instructions (1)
Sub-word parallelism
– SIMD on a restricted scale
– Used for multimedia instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
Examples
– MMX, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
– Example: $\sum_{i=1}^{4} |a_i - b_i|$
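The idea of using one 64-bit ALU as four 16-bit ALUs can be sketched in portable C. The code below is an illustration only and is not the instruction set of any of the listed processors: `add4x16` performs four independent 16-bit additions with a single 64-bit add by keeping carries from crossing lane boundaries, and `sad4x16` computes the sum of absolute differences from the slide's example lane by lane, which multimedia extensions typically provide as one instruction. Both function names are hypothetical.

```c
#include <stdint.h>

/* Mask selecting the most significant bit of each 16-bit lane. */
#define LANE_MSB UINT64_C(0x8000800080008000)

/* Add four unsigned 16-bit lanes packed in 64-bit words.
 * Each lane wraps modulo 2^16; no carry leaks into the next lane. */
uint64_t add4x16(uint64_t x, uint64_t y)
{
    /* Add the low 15 bits of every lane; the sum fits in 16 bits. */
    uint64_t low = (x & ~LANE_MSB) + (y & ~LANE_MSB);
    /* Fold in the top bit of each lane with XOR, discarding carry-out. */
    return low ^ ((x ^ y) & LANE_MSB);
}

/* Sum of absolute differences over the four 16-bit lanes. */
uint32_t sad4x16(uint64_t x, uint64_t y)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (uint32_t)((x >> (16 * lane)) & 0xFFFF);
        uint32_t b = (uint32_t)((y >> (16 * lane)) & 0xFFFF);
        sum += (a > b) ? a - b : b - a;
    }
    return sum;
}
```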

21 MPSoC Issues
Homogeneous vs heterogeneous
Shared memory vs local memory
Topology
Communication (bus vs network)
Granularity (many small vs few large)
Mapping
– Automatic vs manual parallelization
– TLP vs DLP
– Parallel vs pipelined

22 Multi-core

23 Communication models: Shared Memory
[Diagram: processes P1 and P2 read and write a shared memory]
Coherence problem
Memory consistency issue
Synchronization problem

24 SMP: Symmetric Multi-Processor
Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: four processors, each with one or more cache levels, connected to main memory and the I/O system]

25 DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Diagram: four nodes, each with a processor, cache and memory, attached to an interconnection network; main memory and I/O system are also labeled]

26 Communication models: Message Passing
Communication primitives
– e.g., send, receive library calls
[Diagram: processes P1 and P2 connected through a FIFO, labeled send and receive]
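A minimal sketch of the send/receive model over a FIFO, written in C. All names (`fifo_t`, `fifo_send`, `fifo_receive`) and the capacity are hypothetical and chosen for illustration; this is not the API of any particular message-passing library. The sketch is single-threaded and non-blocking, so a real channel between two processors would additionally need synchronization or a hardware FIFO.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 4  /* illustrative size */

typedef struct {
    int    slots[FIFO_CAPACITY];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of queued messages */
} fifo_t;

void fifo_init(fifo_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Non-blocking send: returns false if the FIFO is full. */
bool fifo_send(fifo_t *q, int msg)
{
    if (q->count == FIFO_CAPACITY)
        return false;
    q->slots[(q->head + q->count) % FIFO_CAPACITY] = msg;
    q->count++;
    return true;
}

/* Non-blocking receive: returns false if the FIFO is empty,
 * otherwise stores the oldest message in *msg. */
bool fifo_receive(fifo_t *q, int *msg)
{
    if (q->count == 0)
        return false;
    *msg = q->slots[q->head];
    q->head = (q->head + 1) % FIFO_CAPACITY;
    q->count--;
    return true;
}
```

The point of the model is visible in the interface: the two processes share no variables, and all communication is explicit in the send and receive calls.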

27 Message passing communication
[Diagram: four nodes, each with a processor, cache, memory and DMA, attached through a network interface to an interconnection network]

28 Communication Models: Comparison
Shared memory
– Compatibility with well-understood (language) mechanisms
– Ease of programming for complex or dynamic communication patterns
– Shared-memory applications; sharing of large data structures
– Efficient for small items
– Supports hardware caching
Message passing
– Simpler hardware
– Explicit communication
– Scalable!

29 Three fundamental issues for shared memory multiprocessors
Coherence: do I see the most recent data?
Consistency: when do I see a written value?
– e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
Synchronization: how to synchronize processes?
– how to protect access to shared data?

30 Coherence problem in a multi-processor system
[Diagram: memory holds a and b; the cache of CPU-1 holds copies a' and b', and the cache of CPU-2 holds copies a'' and b''. Values shown in the figure: 550, 100, 200, 100, 200]

31 Potential HW Coherency Solutions
Snooping solution (snoopy bus):
– Send all requests for data to all processors (or local caches)
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at the processors
– Works well with a bus (natural broadcast medium)
– Dominates for small-scale machines (most of the market)
Directory-based schemes
– Keep track of what is being shared in one centralized place
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via the network
– Scales better than snooping
– Actually existed BEFORE snooping-based schemes

32 Example Snooping protocol
3 states for each cache line:
– invalid, shared, modified (exclusive)
FSM per cache, receives requests from both processor and bus
[Diagram: four processors, each with a cache, connected to main memory and the I/O system]

33 Cache coherence protocol
Write-invalidate protocol for a write-back cache
Showing state transitions for each block in the cache
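The state-transition figure for this slide did not survive in the transcript. As a stand-in, the following C sketch encodes the textbook three-state (invalid, shared, modified) write-invalidate protocol for a write-back cache as a per-line state machine. It is an assumption that the deck's figure matches this textbook form, and the event names and the `msi_next` function are hypothetical.

```c
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;

typedef enum {
    CPU_READ,        /* local processor reads the line           */
    CPU_WRITE,       /* local processor writes the line          */
    BUS_READ_MISS,   /* another cache's read miss seen on bus    */
    BUS_WRITE_MISS   /* another cache's write/invalidate on bus  */
} line_event;

/* Returns the next state of one cache line. *write_back is set when
 * the line's dirty data must be written back before the transition. */
line_state msi_next(line_state s, line_event e, bool *write_back)
{
    *write_back = false;
    switch (e) {
    case CPU_READ:
        /* A read miss fetches the line in shared state; hits keep state. */
        return (s == INVALID) ? SHARED : s;
    case CPU_WRITE:
        /* Any local write ends in modified; from invalid or shared this
         * places a write miss / invalidate on the bus. */
        return MODIFIED;
    case BUS_READ_MISS:
        /* A dirty line is written back and downgraded to shared. */
        if (s == MODIFIED) {
            *write_back = true;
            return SHARED;
        }
        return s;
    case BUS_WRITE_MISS:
        /* Another cache wants exclusive access: give up the copy. */
        if (s == MODIFIED)
            *write_back = true;
        return INVALID;
    }
    return s;
}
```

Each cache runs this machine independently for every line, driven both by its own processor and by the requests it snoops on the bus.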

34 Synchronization problem
A bank's computer system has a credit process (P_c) and a debit process (P_d)
    shared int balance
    private int amount

    /* Process P_c */        /* Process P_d */
    balance += amount        balance -= amount

    lw  $t0,balance          lw  $t2,balance
    lw  $t1,amount           lw  $t3,amount
    add $t0,$t0,$t1          sub $t2,$t2,$t3
    sw  $t0,balance          sw  $t2,balance
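Why the two instruction sequences are a problem can be shown by replaying one unlucky interleaving by hand. The C sketch below is a deterministic, single-threaded simulation: the variables t0..t3 model the registers from the slide, and the statement order is one possible schedule chosen for illustration. The function names are hypothetical.

```c
/* One possible interleaving of P_c and P_d: both processes load the
 * balance before either one stores its result. */
int balance_after_interleaving(int balance, int credit, int debit)
{
    int t0, t1, t2, t3;

    t0 = balance;   /* P_c: lw  $t0,balance                          */
    t2 = balance;   /* P_d: lw  $t2,balance  (same old value)        */
    t1 = credit;    /* P_c: lw  $t1,amount                           */
    t3 = debit;     /* P_d: lw  $t3,amount                           */
    t0 = t0 + t1;   /* P_c: add $t0,$t0,$t1                          */
    t2 = t2 - t3;   /* P_d: sub $t2,$t2,$t3                          */
    balance = t0;   /* P_c: sw  $t0,balance                          */
    balance = t2;   /* P_d: sw  $t2,balance  (overwrites the credit) */
    return balance;
}

/* The intended result: the two updates applied one after the other. */
int balance_after_serial(int balance, int credit, int debit)
{
    balance += credit;
    balance -= debit;
    return balance;
}
```

Starting from a balance of 100 with a credit of 50 and a debit of 30, the serial order yields 120, while the interleaving above yields 70: the credit is lost.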

35 Issues for Synchronization
Hardware support:
– Uninterruptible instruction to fetch and update memory (atomic operation)
User-level synchronization operation(s) using this primitive
For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
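A user-level lock built on an atomic hardware primitive can be sketched with C11's `<stdatomic.h>`. Here `atomic_flag_test_and_set_explicit` plays the role of the uninterruptible fetch-and-update instruction, and a simple spinlock built on it protects a shared balance. The bank-account functions are hypothetical names for illustration, and a plain spinlock like this is exactly the kind of construct that becomes a contention bottleneck on large machines.

```c
#include <stdatomic.h>

static atomic_flag balance_lock = ATOMIC_FLAG_INIT;
static int balance = 0;  /* shared data protected by balance_lock */

/* Spin until the atomic test-and-set observes the flag clear. */
static void lock(void)
{
    while (atomic_flag_test_and_set_explicit(&balance_lock,
                                             memory_order_acquire)) {
        /* busy-wait: another process holds the lock */
    }
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&balance_lock, memory_order_release);
}

/* Each update is now a critical section, so no update can be lost. */
void credit(int amount)
{
    lock();
    balance += amount;
    unlock();
}

void debit(int amount)
{
    lock();
    balance -= amount;
    unlock();
}

int read_balance(void)
{
    lock();
    int b = balance;
    unlock();
    return b;
}
```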

36 Cell

37 What can it do?

38 Cell/B.E. - the history
Sony/Toshiba/IBM consortium
– Austin, TX – March 2001
– Initial investment: $400,000,000
Official name: STI Cell Broadband Engine
– Also goes by Cell BE, STI Cell, Cell
In production for:
– PlayStation 3 from Sony
– Mercury’s blades

39 Cell blade

40 Cell/B.E. – the architecture
1 x PPE: 64-bit PowerPC
– L1: 32 KB I$ + 32 KB D$
– L2: 512 KB
8 x SPE cores:
– Local store: 256 KB
– 128 x 128-bit vector registers
Hybrid memory model:
– PPE: Rd/Wr
– SPEs: asynchronous DMA
EIB: 205 GB/s sustained aggregate bandwidth
Processor-to-memory bandwidth: 25.6 GB/s
Processor-to-processor: 20 GB/s in each direction

41 Cell chip

42 SPE

43 SPE

44 SPE pipeline

45 Communication

46 8 parallel transactions

47 C++ on Cell
1. Send the code of the function to be run on the SPE
2. Send the address to fetch the data
3. DMA data into the LS from main memory
4. Run the code on the SPE
5. DMA data out of the LS to main memory
6. Signal the PPE that the SPE has finished the function
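The six steps can be mimicked in plain C to show the shape of the data flow. This is a simulation under stated assumptions and not Cell SDK code: `fake_spe`, `fake_spe_run` and `scale_by_two` are hypothetical names, `memcpy` stands in for the DMA transfers, a function pointer stands in for shipping the kernel code, and the chunk size is illustrative (a small part of the 256 KB local store).

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_FLOATS 1024  /* illustrative: 4 KB of local store per transfer */

typedef void (*kernel_fn)(float *data, size_t n);

/* Hypothetical stand-in for one SPE: a private local store (LS)
 * and a completion flag the PPE side can inspect. */
typedef struct {
    float local_store[CHUNK_FLOATS];
    int   done;
} fake_spe;

/* Steps 1 and 2 are modeled by the arguments: the kernel to run
 * and the main-memory addresses of the input and output data. */
void fake_spe_run(fake_spe *spe, kernel_fn kernel,
                  const float *src, float *dst, size_t n)
{
    spe->done = 0;
    for (size_t off = 0; off < n; off += CHUNK_FLOATS) {
        size_t len = (n - off < CHUNK_FLOATS) ? n - off : CHUNK_FLOATS;

        /* Step 3: "DMA" a chunk from main memory into the local store. */
        memcpy(spe->local_store, src + off, len * sizeof(float));
        /* Step 4: run the kernel on local-store data only. */
        kernel(spe->local_store, len);
        /* Step 5: "DMA" the results back out to main memory. */
        memcpy(dst + off, spe->local_store, len * sizeof(float));
    }
    /* Step 6: signal completion. */
    spe->done = 1;
}

/* Example kernel: touches only the buffer it is handed. */
void scale_by_two(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}
```

The design constraint this illustrates is that the kernel never dereferences main-memory addresses directly; everything it touches must first be moved into the local store, which is why data sets larger than the local store are processed in chunks.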

48 Cell/B.E. – the future (multi-tile?)

49 Porting C++
1. Detect and isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels to the SPEs
4. Iteratively optimize SPE code

50 Performance estimation
Based on Amdahl’s law … where
– $K_i^{fr}$ = the fraction of the execution time for kernel $K_i$
– $K_i^{speed-up}$ = the speed-up of kernel $K_i$ compared with the sequential version
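The formula itself is missing from the transcript. The standard form of Amdahl's law, written with the two quantities defined on this slide, would be as follows; it is an assumption that the slide used exactly this form. The numbers in the worked line are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
S_{\text{app}} &= \frac{1}{\left(1 - K_i^{fr}\right) + \dfrac{K_i^{fr}}{K_i^{speed\text{-}up}}} \\[6pt]
K_i^{fr} = 0.5,\; K_i^{speed\text{-}up} = 10
\;&\Rightarrow\;
S_{\text{app}} = \frac{1}{0.5 + 0.05} \approx 1.82
\end{align*}
```

Even a tenfold kernel speed-up yields less than a twofold application speed-up when the kernel accounts for only half of the execution time.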

51 Performance estimation
Based on Amdahl’s law:
– Sequential use of kernels:
– Parallel use of kernels: ?
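For the sequential case, a natural reading is that the kernels run one after another, so the accelerated kernel times simply add up. The C sketch below evaluates that estimate. The function name is hypothetical, and the underlying formula is the usual multi-kernel generalization of Amdahl's law, assumed here because the slide's own formula is not preserved. The parallel case, which the slide leaves as an open question, is deliberately not modeled.

```c
#include <stddef.h>

/* Estimated application speed-up when n kernels are accelerated and
 * executed one after another.
 *   fraction[i]: share of the original execution time spent in kernel i
 *   speedup[i]:  speed-up of kernel i over its sequential version
 * The fractions are assumed to sum to at most 1. */
double estimate_speedup_sequential(const double *fraction,
                                   const double *speedup, size_t n)
{
    double covered = 0.0;      /* original time share spent in kernels */
    double accelerated = 0.0;  /* that share after acceleration        */

    for (size_t i = 0; i < n; i++) {
        covered += fraction[i];
        accelerated += fraction[i] / speedup[i];
    }
    /* Unported code keeps its original share (1 - covered). */
    return 1.0 / ((1.0 - covered) + accelerated);
}
```

For example, two kernels covering 50 % and 25 % of the run time with speed-ups of 10 and 5 give 1 / (0.25 + 0.05 + 0.05), or about 2.86.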

52 MARVEL case study
Multimedia content retrieval and analysis
For each picture, we extract the values for the features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram
The image features are then compared with the model features to generate an overall confidence score
http://www.research.ibm.com/marvel

53 MarCell = MARVEL on Cell
Identified 5 kernels to port to the SPEs:
– 4 feature extraction algorithms
  ColorHistogram (CHExtract)
  ColorCorrelogram (CCExtract)
  Texture (TXExtract)
  EdgeHistogram (EHExtract)
– 1 common concept detection, repeated for each feature

54 MarCell – kernels speed-up
Kernel      SPE [ms]   Speed-up vs. PPE   Speed-up vs. Desktop   Speed-up vs. Laptop   Overall contribution
AppStart    7.17       0.95               0.67                   0.83                  8 %
CHExtract   0.82       52.22              21.00                  30.17                 8 %
CCExtract   5.87       55.44              21.26                  22.45                 54 %
TXExtract   2.01       15.56              7.08                   8.04                  6 %
EHExtract   2.48       91.05              18.79                  30.85                 28 %
CDetect     0.41       7.15               3.75                   4.88                  2 %

55 MarCell – kernels execution times

56 Task parallelism – setup

57 Task parallelism – results*
*reported on PS3

58 Data parallelism – setup
Data parallelism requires all SPEs to execute the same kernel in SPMD fashion
Requires SPE reconfiguration:
– Thread re-creation
– Overlays
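In SPMD style every worker runs the same kernel on its own slice of the data. A minimal C sketch of the slicing step follows. The `slice` type and `spmd_slice` function are hypothetical, and the balanced block partition shown here is one common choice, not necessarily the one used in MarCell.

```c
#include <stddef.h>

/* Half-open index range [begin, end) assigned to one worker. */
typedef struct {
    size_t begin;
    size_t end;
} slice;

/* Split n_items as evenly as possible over n_workers (n_workers > 0).
 * The first (n_items % n_workers) workers get one extra item each. */
slice spmd_slice(size_t n_items, unsigned n_workers, unsigned worker_id)
{
    size_t base  = n_items / n_workers;
    size_t extra = n_items % n_workers;
    slice s;

    s.begin = worker_id * base + (worker_id < extra ? worker_id : extra);
    s.end   = s.begin + base + (worker_id < extra ? 1 : 0);
    return s;
}
```

With 8 workers (matching the 8 SPEs of the Cell) and 10 items, workers 0 and 1 receive two items each and the remaining six receive one, so the slices cover the data exactly once.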

59 Data parallelism – results* [1/2]
*reported on PS3
► Kernels do scale when run alone

60 Conclusions
Multi-processors are inevitable
Huge performance increase, but…
Hell to program
– Got to be an architecture expert
– Portability?

