Platform based design 5KK70 MPSoC Platforms Overview and Cell platform Bart Mesman and Henk Corporaal.

1 Platform based design 5KK70 MPSoC Platforms Overview and Cell platform Bart Mesman and Henk Corporaal

2 5/16/2015 Platform Design. H. Corporaal and B. Mesman. The Software Crisis

3 The first SW crisis
Time frame: ’60s and ’70s
Problem: assembly language programming
  – Computers could handle larger, more complex programs
Needed: abstraction and portability without losing performance
Solution: high-level languages for von Neumann machines
  – FORTRAN and C

4 The second SW crisis
Time frame: ’80s and ’90s
Problem: inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  – Computers could handle larger, more complex programs
Needed: composability and maintainability
  – High performance was not an issue: left to Moore’s Law

5 Solution
Object-oriented programming
  – C++, C# and Java
Also…
  – Better tools: component libraries, Purify
  – Better software engineering methodology: design patterns, specification, testing, code reviews

6 Today: Programmers are Oblivious to Processors
Solid boundary between hardware and software
Programmers don’t have to know anything about the processor
  – High-level languages abstract away the processors (e.g. Java bytecode is machine independent)
  – Moore’s law does not require the programmers to know anything about the processors to get good speedups
Programs are oblivious of the processor, so they work on all processors
  – A program written in the ’70s using C still works and is much faster today
This abstraction provides a lot of freedom for the programmers

7 The third crisis: Powered by PlayStation

8 Contents
Hammer your head against 4 walls
  – Or: why multi-processor
Cell architecture
Programming and porting
  – plus case study

9 Moore’s Law

10 Single Processor SPECint Performance

11 What’s stopping them?
General-purpose uni-cores have stopped historic performance scaling
  – Power consumption
  – Wire delays
  – DRAM access latency
  – Diminishing returns of more instruction-level parallelism

12 Power density

13 Power Efficiency (Watts/Spec)

14 1-clock-cycle wire range

15 Global wiring delay becomes dominant over gate delay

16 Memory

17 Now what?
Latest research drained
Tried every trick in the book
So: we’re fresh out of ideas
Multi-processor is all that’s left!

18 MPSoC Issues
Homogeneous vs heterogeneous
Shared memory vs local memory
Topology
Communication (bus vs network)
Granularity (many small vs few large)
Mapping
  – Automatic vs manual parallelization
  – TLP vs DLP
  – Parallel vs pipelined

19 Multi-core

20 Communication models: Shared Memory
[Figure: processes P1 and P2 both read and write a shared memory]
Coherence problem
Memory consistency issue
Synchronization problem

21 SMP: Symmetric Multi-Processor
Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Figure: four processors, each with one or more cache levels, connected to main memory and the I/O system]

22 DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Figure: four nodes, each with processor, cache and memory, connected by an interconnection network; labels: Main memory, I/O System]

23 Communication models: Message Passing
Communication primitives
  – e.g., send, receive library calls
[Figure: process P1 sends to process P2 through a FIFO; P2 receives]
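The send/receive primitives on this slide can be sketched in C++ as a bounded FIFO between two threads standing in for processes P1 and P2. This is a minimal sketch, not a real message-passing library: `FifoChannel`, its capacity of 4, and `sum_through_channel` are hypothetical names chosen for illustration, and a mutex plus condition variables stand in for whatever the hardware or runtime provides.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical bounded FIFO channel: send() blocks while the FIFO is full,
// receive() blocks while it is empty.
template <typename T>
class FifoChannel {
public:
    explicit FifoChannel(std::size_t capacity) : capacity_(capacity) {}

    void send(const T& msg) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push_back(msg);
        not_empty_.notify_one();
    }

    T receive() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T msg = q_.front();
        q_.pop_front();
        not_full_.notify_one();
        return msg;
    }

private:
    std::size_t capacity_;
    std::deque<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

// P1 (the calling thread) sends 1..n; P2 (a second thread) receives and sums.
int sum_through_channel(int n) {
    FifoChannel<int> ch(4);  // capacity is an arbitrary illustrative choice
    int total = 0;
    std::thread p2([&] {
        for (int i = 0; i < n; ++i) total += ch.receive();
    });
    for (int i = 1; i <= n; ++i) ch.send(i);
    p2.join();  // total is only read after P2 has finished
    return total;
}
```

All communication is explicit here: P2 never touches P1's data directly, only the messages that pass through the FIFO.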

24 Message passing communication
[Figure: four nodes, each with processor, cache, memory and DMA, attached through a network interface to an interconnection network]

25 Communication Models: Comparison
Shared Memory
  – Compatibility with well-understood (language) mechanisms
  – Ease of programming for complex or dynamic communication patterns
  – Shared-memory applications; sharing of large data structures
  – Efficient for small items
  – Supports hardware caching
Message Passing
  – Simpler hardware
  – Explicit communication
  – Scalable!

26 Three fundamental issues for shared memory multiprocessors
Coherence: do I see the most recent data?
Consistency: when do I see a written value?
  – e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
Synchronization: how to synchronize processes?
  – how to protect access to shared data?

27 Coherence problem, in Multi-Proc system
[Figure: CPU-1 and CPU-2 each hold cached copies (a', b' and a'', b'') of memory locations a and b; values shown: 550, 100, 200]
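The coherence problem can be reproduced in a few lines with two private write-back caches that never talk to each other. This is a sketch under assumptions: the `Cache` type and `stale_read_demo` are hypothetical, and reading the figure as "CPU-1 writes 550 to a while memory and CPU-2 still hold 100" is an interpretation of the values on the slide, not something the transcript states.

```cpp
#include <cassert>
#include <map>
#include <string>

using Memory = std::map<std::string, int>;

// Hypothetical private write-back cache with NO coherence mechanism.
struct Cache {
    std::map<std::string, int> lines;

    int read(const Memory& mem, const std::string& addr) {
        auto it = lines.find(addr);
        if (it == lines.end())                        // miss: fetch from memory
            it = lines.emplace(addr, mem.at(addr)).first;
        return it->second;                            // hit: use the local copy
    }

    void write(const std::string& addr, int value) {
        lines[addr] = value;                          // write-back: memory is not updated yet
    }
};

// Both CPUs cache a, then CPU-1 writes a. Returns what CPU-2 reads afterwards.
int stale_read_demo() {
    Memory mem{{"a", 100}, {"b", 200}};
    Cache cpu1, cpu2;
    cpu1.read(mem, "a");
    cpu2.read(mem, "a");         // both caches now hold a = 100
    cpu1.write("a", 550);        // only CPU-1's copy changes
    return cpu2.read(mem, "a");  // CPU-2 hits on its old copy
}
```

`stale_read_demo()` returns 100: CPU-2 does not see the most recent value, which is exactly what a coherence protocol has to prevent.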

28 Potential HW Coherency Solutions
Snooping solution (snoopy bus):
  – Send all requests for data to all processors (or local caches)
  – Processors snoop to see if they have a copy and respond accordingly
  – Requires broadcast, since caching information is at processors
  – Works well with bus (natural broadcast medium)
  – Dominates for small-scale machines (most of the market)
Directory-based schemes
  – Keep track of what is being shared in one centralized place
  – Distributed memory => distributed directory for scalability (avoids bottlenecks)
  – Send point-to-point requests to processors via network
  – Scales better than snooping
  – Actually existed BEFORE snooping-based schemes

29 Example Snooping protocol
3 states for each cache line:
  – invalid, shared, modified (exclusive)
FSM per cache, receives requests from both processor and bus
[Figure: four processors, each with a cache, sharing main memory and the I/O system]

30 Cache coherence protocol
Write-invalidate protocol for write-back cache
Showing state transitions for each block in the cache
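The state diagram itself did not survive in this transcript. As a hedged illustration, the per-block FSM of a textbook three-state (invalid, shared, modified) write-invalidate protocol can be written as a transition function. The event names below are assumptions, and side effects such as write-backs and bus broadcasts are reduced to comments, so this is a sketch of the general technique rather than the exact diagram from the slide.

```cpp
#include <cassert>

enum class State { Invalid, Shared, Modified };

// Events seen by one cache: requests from its own processor, and requests
// from other caches observed (snooped) on the bus. Names are illustrative.
enum class Event { CpuRead, CpuWrite, BusRead, BusWrite };

// Next state of a single cache block.
State next_state(State s, Event e) {
    switch (e) {
        case Event::CpuRead:
            // Read miss on an invalid block fetches a shared copy; hits keep the state.
            return s == State::Invalid ? State::Shared : s;
        case Event::CpuWrite:
            // Writer gains exclusive ownership (other copies get invalidated via the bus).
            return State::Modified;
        case Event::BusRead:
            // Another cache reads: a modified block is written back and becomes shared.
            return s == State::Modified ? State::Shared : s;
        case Event::BusWrite:
            // Another cache writes: the local copy is no longer valid.
            return State::Invalid;
    }
    return s;  // unreachable for valid events
}
```

For example, a block that is Modified in one cache drops to Shared as soon as another processor's read appears on the bus, and any copy drops to Invalid when another processor writes.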

31 Synchronization problem
Computer system of bank has credit process (P_c) and debit process (P_d)
shared int balance; private int amount

  /* Process P_c */        /* Process P_d */
  balance += amount        balance -= amount

  lw  $t0,balance          lw  $t2,balance
  lw  $t1,amount           lw  $t3,amount
  add $t0,$t0,$t1          sub $t2,$t2,$t3
  sw  $t0,balance          sw  $t2,balance
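Why this is a problem becomes visible when the two instruction sequences interleave. The sketch below replays one bad interleaving deterministically, with plain variables standing in for the registers; the function names and the example amounts are hypothetical.

```cpp
#include <cassert>

// Registers of one process, mirroring its lw / lw / add-or-sub / sw sequence.
struct Regs {
    int t_balance = 0;
    int t_amount = 0;
};

// One unlucky interleaving: both processes load balance before either stores.
int interleaved(int balance, int credit, int debit) {
    Regs pc, pd;
    pc.t_balance = balance;  pc.t_amount = credit;  // P_c: lw, lw
    pd.t_balance = balance;  pd.t_amount = debit;   // P_d: lw, lw
    pc.t_balance += pc.t_amount;                    // P_c: add
    pd.t_balance -= pd.t_amount;                    // P_d: sub
    balance = pc.t_balance;                         // P_c: sw
    balance = pd.t_balance;                         // P_d: sw overwrites P_c's result
    return balance;
}

// Reference result when the two updates run one after the other.
int serial(int balance, int credit, int debit) {
    balance += credit;
    balance -= debit;
    return balance;
}
```

Starting from a balance of 100 with a credit of 50 and a debit of 30, `serial` yields 120 while `interleaved` yields 70: the credit is lost because P_d stored a value computed from the old balance.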

32 Issues for Synchronization
Hardware support:
  – Uninterruptible instruction to fetch-and-update memory (atomic operation)
User-level synchronization operation(s) using this primitive
For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization
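A user-level lock built on an atomic fetch-and-update primitive might look like the sketch below. It uses `std::atomic_flag::test_and_set` from the C++ standard library as the atomic operation; `SpinLock` and `run_credit_debit` are hypothetical names, and a busy-waiting lock is only one of several possible user-level operations, chosen here for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spinlock: test_and_set atomically reads the old flag value and sets the flag,
// so exactly one thread observes "was clear" and enters the critical section.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait until the holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Credit and debit threads update a shared balance under the lock.
int run_credit_debit(int initial, int amount, int iterations) {
    int balance = initial;
    SpinLock lock;
    auto worker = [&](int delta) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            balance += delta;  // load, add, store can no longer interleave
            lock.unlock();
        }
    };
    std::thread pc(worker, amount);   // credit process
    std::thread pd(worker, -amount);  // debit process
    pc.join();
    pd.join();
    return balance;
}
```

With equal numbers of credits and debits the balance always returns to its initial value. The spinning in `lock()` is also where contention shows up on large machines, which is the bottleneck the slide refers to.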

33 Cell

34 What can it do?

35 Cell/B.E. – the history
Sony/Toshiba/IBM consortium
  – Austin, TX – March 2001
  – Initial investment: $400,000,000
Official name: STI Cell Broadband Engine
  – Also goes by Cell BE, STI Cell, Cell
In production for:
  – PlayStation 3 from Sony
  – Mercury’s blades

36 Cell blade

37 Cell/B.E. – the architecture
1 x PPE: 64-bit PowerPC
  – L1: 32 KB I$ + 32 KB D$
  – L2: 512 KB
8 x SPE cores:
  – Local store: 256 KB
  – 128 x 128-bit vector registers
Hybrid memory model:
  – PPE: Rd/Wr
  – SPEs: asynchronous DMA
EIB: 205 GB/s sustained aggregate bandwidth
Processor-to-memory bandwidth: 25.6 GB/s
Processor-to-processor: 20 GB/s in each direction

38 Cell chip

39 SPE

40 SPE

41 SPE pipeline

42 Communication

43 8 parallel transactions

44 C++ on Cell
1. Send the code of the function to be run on SPE
2. Send address to fetch the data
3. DMA data in LS from the main memory
4. Run the code on the SPE
5. DMA data out of LS to the main memory
6. Signal the PPE that the SPE has finished the function
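The six steps can be traced in a host-only simulation. The sketch below does not use the Cell SDK: `FakeSpe`, `offload`, and the buffer size are hypothetical, `std::memcpy` stands in for the DMA transfers, and an ordinary function object stands in for the SPE program. It only shows the order of operations and the need to work in chunks that fit the local store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

using KernelFn = std::function<void(float*, std::size_t)>;

// Hypothetical data buffer inside the local store (LS); on real hardware the
// 256 KB LS also has to hold the SPE code, so only part of it is usable for data.
constexpr std::size_t kBufferElems = 4096;

// Stand-in for one SPE: a private buffer, the loaded kernel, and a completion flag.
struct FakeSpe {
    std::vector<float> local_store = std::vector<float>(kBufferElems);
    KernelFn kernel;
    bool finished = false;
};

// Runs `kernel` over `data` chunk by chunk, following steps 1 to 6.
void offload(FakeSpe& spe, std::vector<float>& data, KernelFn kernel) {
    spe.kernel = std::move(kernel);                          // 1. send the code
    float* main_mem = data.data();                           // 2. send the data address
    float* ls = spe.local_store.data();
    for (std::size_t off = 0; off < data.size(); off += kBufferElems) {
        const std::size_t n = std::min(kBufferElems, data.size() - off);
        std::memcpy(ls, main_mem + off, n * sizeof(float));  // 3. "DMA" into LS
        spe.kernel(ls, n);                                   // 4. run on the SPE
        std::memcpy(main_mem + off, ls, n * sizeof(float));  // 5. "DMA" out of LS
    }
    spe.finished = true;                                     // 6. signal the PPE
}
```

A caller would pass a kernel such as a lambda that doubles every element; the kernel only ever sees pointers into the local buffer, never into main memory, which mirrors the SPE's view of the world.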

45 Cell/B.E. – the future (multi-tile?)

46 Porting C++
1. Detect & isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels on SPEs
4. Iteratively optimize SPE code

47 Performance estimation
Based on Amdahl’s law … where
  – K_i^fr = the fraction of the execution time for kernel K_i
  – K_i^speed-up = the speed-up of kernel K_i compared with the sequential version

48 Performance estimation
Based on Amdahl’s law:
  – Sequential use of kernels:
  – Parallel use of kernels: ?
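The formulas on these slides were images and are not recoverable from the transcript. For the sequential case, the textbook multi-kernel form of Amdahl's law is S = 1 / ((1 - Σ f_i) + Σ f_i / s_i), where f_i is the time fraction of kernel K_i and s_i its speed-up. The sketch below evaluates that form; treating it as the slide's formula is an assumption, and the numbers in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One ported kernel: its share of the original run time and its own speed-up.
struct Kernel {
    double fraction;  // f_i, between 0 and 1
    double speedup;   // s_i, relative to the sequential version
};

// Overall speed-up when the kernels run one after another (sequential use):
// the unported remainder keeps its time, each kernel's time shrinks by s_i.
double estimate_speedup(const std::vector<Kernel>& kernels) {
    double remaining = 1.0;    // fraction of time not covered by any kernel
    double accelerated = 0.0;  // total time of the kernels after porting
    for (const Kernel& k : kernels) {
        remaining -= k.fraction;
        accelerated += k.fraction / k.speedup;
    }
    return 1.0 / (remaining + accelerated);
}
```

With two hypothetical kernels covering 50 % and 30 % of the run time and sped up 10x and 5x, the estimate is 1 / (0.2 + 0.05 + 0.06), roughly 3.2x: the 20 % that stays on the PPE dominates the result.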

49 MARVEL case-study
Multimedia content retrieval and analysis
For each picture, we extract the values for the features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram
We then compare the image features with the model features and generate an overall confidence score
http://www.research.ibm.com/marvel

50 MarCell = MARVEL on Cell
Identified 5 kernels to port to the SPEs:
  – 4 feature extraction algorithms: ColorHistogram (CHExtract), ColorCorrelogram (CCExtract), Texture (TXExtract), EdgeHistogram (EHExtract)
  – 1 common concept detection, repeated for each feature

51 MarCell – kernels speed-up
Kernel      SPE [ms]   Speed-up vs. PPE   Speed-up vs. Desktop   Speed-up vs. Laptop   Overall contribution
AppStart    7.17       0.95               0.67                   0.83                  8 %
CHExtract   0.82       52.22              21.00                  30.17                 8 %
CCExtract   5.87       55.44              21.26                  22.45                 54 %
TXExtract   2.01       15.56              7.08                   8.04                  6 %
EHExtract   2.48       91.05              18.79                  30.85                 28 %
CDetect     0.41       7.15               3.75                   4.88                  2 %

52 MarCell – kernels execution times

53 Task parallelism – setup

54 Task parallelism – results (reported on PS3)

55 Data parallelism – setup
Data parallel requires all SPEs to execute the same kernel in SPMD fashion
Requires SPE reconfiguration:
  – Thread re-creation
  – Overlays
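In SPMD fashion every SPE runs the same kernel on its own slice of the data. One common way to hand out the slices, shown here as a sketch with hypothetical names (`Slice`, `partition_rows`), is to split the image rows into contiguous blocks that differ in size by at most one row. Whether MarCell partitions by rows is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open row range [begin, end) assigned to one SPE.
struct Slice {
    std::size_t begin;
    std::size_t end;
};

// Splits `rows` into `num_spes` contiguous slices; the first (rows % num_spes)
// slices receive one extra row so the load stays balanced.
std::vector<Slice> partition_rows(std::size_t rows, std::size_t num_spes) {
    std::vector<Slice> slices;
    const std::size_t base = rows / num_spes;
    const std::size_t extra = rows % num_spes;
    std::size_t start = 0;
    for (std::size_t id = 0; id < num_spes; ++id) {
        const std::size_t len = base + (id < extra ? 1 : 0);
        slices.push_back({start, start + len});
        start += len;
    }
    return slices;
}
```

For 100 rows on 8 SPEs this gives four slices of 13 rows followed by four slices of 12 rows, covering rows 0 to 99 without gaps or overlap.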

56 Data parallelism – results [1/2] (reported on PS3)
► Kernels do scale when run alone

57 Conclusions
Multi-processors inevitable
Huge performance increase, but…
Hell to program
  – Got to be an architecture expert
  – Portability?
Material (suggested for assignment): http://www.blachford.info/computer/Cell/Cell0_v2.html

