Platform based design 5KK70 MPSoC Platforms

1 Platform based design 5KK70 MPSoC Platforms
Part1: Cell

2 Platform Design. H.Corporaal and B. Mesman
The Software Crisis 11/22/2018 Platform Design H.Corporaal and B. Mesman

3 The first SW crisis
Time Frame: ’60s and ’70s
Problem: Assembly Language Programming
  Computers could handle larger, more complex programs
  Needed to get Abstraction and Portability without losing Performance
Solution: High-level languages for von Neumann machines
  FORTRAN and C

4 The second SW crisis
Time Frame: ’80s and ’90s
Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
  Computers could handle larger, more complex programs
  Needed to get Composability and Maintainability
  High performance was not an issue: left for Moore’s Law

5 Solution
Object Oriented Programming
  C++, C# and Java
Also…
  Better tools
    Component libraries, Purify
  Better software engineering methodology
    Design patterns, specification, testing, code reviews

6 Today: Programmers are Oblivious to Processors
Solid boundary between Hardware and Software
Programmers don’t have to know anything about the processor
  High level languages abstract away the processors
    Ex: Java bytecode is machine independent
  Moore’s law does not require the programmers to know anything about the processors to get good speedups
Programs are oblivious of the processor -> work on all processors
  A program written in the ’70s using C still works and is much faster today
This abstraction provides a lot of freedom for the programmers

7 The third crisis: Powered by PlayStation

8 Contents
Hammer your head against 4 walls
  Or: Why Multi-Processor
Cell Architecture
Programming and porting
  plus case-study

9 Moore’s Law

10 Single Processor SPECint Performance

11 What’s stopping them?
General-purpose unicores have stopped historic performance scaling
  Power consumption
  Wire delays
  DRAM access latency
  Diminishing returns of more instruction-level parallelism

12 Power density

13 Power Efficiency (Watts/Spec)

14 1 clock cycle wire range

15 Global wiring delay becomes dominant over gate delay

16 Memory
[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) versus time (1980 to 2005). µProc (CPU): 55%/year, “Moore’s Law”; DRAM: 7%/year; the gap grows 50% / year. Source: Patterson]
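
The two growth rates in the figure imply the size of the gap directly. As a rough check, assuming both rates compound annually and stay constant (a simplification, not something the slide states):

```latex
\frac{1 + 0.55}{1 + 0.07} = \frac{1.55}{1.07} \approx 1.45,
\qquad
1.45^{10} \approx 41
```

Under that assumption the gap widens by about 45% per year, close to the 50% / year quoted in the figure, and by roughly a factor of 40 per decade.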

17 Now what?
Latest research drained
Tried every trick in the book
So: We’re fresh out of ideas
Multi-processor is all that’s left!

18 Multi-core

19 Cell

20 What can it do?

21 Cell/B.E. - the history
Sony/Toshiba/IBM consortium
  Austin, TX – March 2001
  Initial investment: $400,000,000
Official name: STI Cell Broadband Engine
  Also goes by Cell BE, STI Cell, Cell
In production for:
  PlayStation 3 from Sony
  Mercury’s blades

22 Cell blade

23 Cell/B.E. – the architecture
1 x PPE: 64-bit PowerPC
  L1: 32 KB I$ + 32 KB D$
  L2: 512 KB
8 x SPE cores:
  Local store: 256 KB
  128 x 128 bit vector registers
Hybrid memory model:
  PPE: Rd/Wr
  SPEs: Asynchronous DMA
EIB: 205 GB/s sustained aggregate bandwidth
Processor-to-memory bandwidth: 25.6 GB/s
Processor-to-processor: 20 GB/s in each direction

24 Cell chip

25 SPE

26 SPE

27 Why is it efficient?

28 SPE pipeline

29 Communication

30 8 parallel transactions

31 C++ on Cell
1. Send the code of the function to be run on SPE
2. Send address to fetch the data
3. DMA data in LS from the main memory
4. Run the code on the SPE
5. DMA data out of LS to the main memory
6. Signal the PPE that the SPE has finished the function
The code and data from the main C++ MARVEL application (running on the PPE) are replaced by a stub that “communicates” with a dispatcher implemented on the SPE, which in turn executes all the data-in, processing and data-out operations. The stub on the PPE sends commands to the SPE dispatcher using mailboxes.
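
The six steps can be mimicked on an ordinary host with two threads. This is only a sketch under loud assumptions: it does not use the Cell SDK, the `Mailbox` class is a hypothetical stand-in for a hardware mailbox, "DMA" is a plain memory copy, and step 1 sends a command id instead of real SPE code. All names (`spe_dispatcher`, `scale_on_spe`, `run_stub_demo`, `kRunScale`) are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a hardware mailbox: a blocking queue of 32-bit words.
class Mailbox {
public:
    void write(std::uint32_t word) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            words_.push(word);
        }
        ready_.notify_one();
    }
    std::uint32_t read() {  // blocks until a word is available
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !words_.empty(); });
        std::uint32_t word = words_.front();
        words_.pop();
        return word;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::uint32_t> words_;
};

enum Command : std::uint32_t { kRunScale = 1, kExit = 2 };
constexpr std::uint32_t kDone = 0xD0;

// "SPE side": the dispatcher performs data-in, processing and data-out.
void spe_dispatcher(Mailbox& inbox, Mailbox& outbox, std::vector<float>& main_memory) {
    for (;;) {
        std::uint32_t command = inbox.read();  // step 1 (a command id, not real code)
        if (command == kExit) return;
        std::uint32_t offset = inbox.read();   // step 2: where the data lives
        std::uint32_t count = inbox.read();
        assert(offset + count <= main_memory.size());
        // step 3: "DMA in" modelled as a copy into a local buffer
        std::vector<float> local_store(main_memory.begin() + offset,
                                       main_memory.begin() + offset + count);
        for (float& x : local_store) x *= 2.0f;  // step 4: run the kernel
        // step 5: "DMA out" modelled as a copy back to main memory
        std::copy(local_store.begin(), local_store.end(), main_memory.begin() + offset);
        outbox.write(kDone);                     // step 6: signal the PPE
    }
}

// "PPE side": the stub that replaces the original kernel call.
void scale_on_spe(Mailbox& to_spe, Mailbox& from_spe,
                  std::uint32_t offset, std::uint32_t count) {
    to_spe.write(kRunScale);
    to_spe.write(offset);
    to_spe.write(count);
    std::uint32_t reply = from_spe.read();  // wait for completion
    assert(reply == kDone);
    (void)reply;
}

// Runs one stub call against one dispatcher thread and returns "main memory".
std::vector<float> run_stub_demo() {
    std::vector<float> main_memory{1.0f, 2.0f, 3.0f, 4.0f};
    Mailbox to_spe, from_spe;
    std::thread spe(spe_dispatcher, std::ref(to_spe), std::ref(from_spe),
                    std::ref(main_memory));
    scale_on_spe(to_spe, from_spe, 1, 2);  // double elements 1 and 2
    to_spe.write(kExit);
    spe.join();
    return main_memory;
}
```

Calling `run_stub_demo()` doubles the two middle elements and leaves the others untouched. The point of the design is that the caller only sees `scale_on_spe`, so the rest of the application does not change when a kernel moves to another core.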

32 Porting C++
1. Detect & isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels on SPEs
4. Iteratively optimize SPE code

33 Performance estimation
Based on Amdahl’s law
…
where
  K_i^fr = the fraction of the execution time for kernel K_i
  K_i^speed-up = the speed-up of kernel K_i compared with the sequential version
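
The formula itself did not survive in this transcript. A reconstruction that is consistent with the two definitions, assuming the kernels run one after another and the remaining (non-kernel) part of the application is not accelerated, is the usual multi-kernel form of Amdahl's law. It is offered as a plausible reading, not as the slide's confirmed content:

```latex
S_{\text{est}} =
  \frac{1}{\left(1 - \sum_i K_i^{fr}\right) + \sum_i \dfrac{K_i^{fr}}{K_i^{speed\text{-}up}}}
```

For a single kernel with fraction 0.5 and speed-up 10, this gives 1 / (0.5 + 0.05), or about 1.8, which shows how quickly the unported part comes to dominate.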

34 Performance estimation
Based on Amdahl’s law:
Sequential use of kernels:
Parallel use of kernels: ?
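
A small sketch of how such an estimate could be computed. Both formulas are assumptions, since the slide's own equations are missing: the sequential case adds up the shortened kernel times, and the parallel case assumes every kernel runs concurrently on its own SPE, so only the slowest one counts. The sample inputs are three rows of the MarCell kernel speed-up table (CCExtract, TXExtract, EHExtract), reading "Overall contribution" as the time fraction and using the speed-up vs. PPE; the other rows are left out, so the resulting numbers are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One ported kernel: its share of the original run time and its speed-up.
struct Kernel {
    double fraction;  // K_i^fr, between 0 and 1
    double speedup;   // K_i^speed-up, greater than 0
};

// Share of the run time not covered by any kernel, hence not accelerated.
double unported_fraction(const std::vector<Kernel>& kernels) {
    double covered = 0.0;
    for (const Kernel& k : kernels) covered += k.fraction;
    assert(covered <= 1.0);
    return 1.0 - covered;
}

// Kernels run one after another: their shortened times add up.
double estimate_sequential(const std::vector<Kernel>& kernels) {
    double time = unported_fraction(kernels);
    for (const Kernel& k : kernels) {
        assert(k.speedup > 0.0);
        time += k.fraction / k.speedup;
    }
    return 1.0 / time;
}

// ASSUMPTION: all kernels run concurrently, so only the slowest one counts.
double estimate_parallel(const std::vector<Kernel>& kernels) {
    double slowest = 0.0;
    for (const Kernel& k : kernels) {
        assert(k.speedup > 0.0);
        slowest = std::max(slowest, k.fraction / k.speedup);
    }
    return 1.0 / (unported_fraction(kernels) + slowest);
}

// Illustrative inputs: {fraction, speed-up vs. PPE} for CCExtract, TXExtract,
// EHExtract, taken from the kernel table; the remaining rows are omitted.
std::vector<Kernel> sample_kernels() {
    return {{0.54, 55.44}, {0.06, 15.56}, {0.28, 91.05}};
}
```

With these partial inputs both estimates land between 7 and 8, and the parallel one is only slightly higher: the 12% of the run time that is not covered by the three kernels dominates, which is exactly the effect Amdahl's law predicts.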

35 MARVEL case-study
Multimedia content retrieval and analysis
Compares the image features with the model features and generates an overall confidence score
For each picture, we extract the values for the features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram
We focus on the multimedia analysis and retrieval part of MARVEL (a simplified version of the entire application). Basically, the application takes a picture as input and runs a number of feature extraction algorithms on the image. Having all the interesting features from the image as vectors, the “concept detection” compares them against the features stored in the model files, fusing all the results into an overall score of the image against the model. So, the output for a tuple (picture, model) is a confidence score that answers the question “does this photograph picture this model?”

36 MarCell = MARVEL on Cell
Identified 5 kernels to port on the SPEs:
  4 feature extraction algorithms
    ColorHistogram (CHExtract)
    ColorCorrelogram (CCExtract)
    Texture (TXExtract)
    EdgeHistogram (EHExtract)
  1 common concept detection, repeated for each feature

37 MarCell – kernels speed-up
Kernel     SPE [ms]  Speed-up vs. PPE  Speed-up vs. Desktop  Speed-up vs. Laptop  Overall contribution
AppStart   7.17      0.95              0.67                  0.83                 8 %
CHExtract  0.82      52.22             21.00                 30.17
CCExtract  5.87      55.44             21.26                 22.45                54 %
TXExtract  2.01      15.56             7.08                  8.04                 6 %
EHExtract  2.48      91.05             18.79                 30.85                28 %
CDetect    0.41      7.15              3.75                  4.88                 2 %

38 MarCell – kernels execution times

39 Task parallelism – setup
The initial C++ application is a sequential graph: CH_Ex, CH_Det, CC_Ex, CC_Det, TX_Ex, TX_Det, EH_Ex, EH_Det
The first porting preserves the sequential application (no two kernels run in parallel), but the kernels are run on the SPEs, with the speed-ups seen in the previous slide – this is the SPUSeq case (SingleSPE)
The parallel case makes the independent processes run in parallel => all the extractions run in parallel, all the detections can run in parallel – this is the SPUPar case (MultipleSPE)
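
The difference between the two cases can be sketched on a host machine with `std::async`. This is an illustration under assumptions, not the MarCell code: the `FeaturePipeline` struct, the toy extraction and detection functions, and the names `run_sequential` and `run_task_parallel` are all hypothetical, and host threads stand in for SPEs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <string>
#include <vector>

using Image = std::vector<int>;

// Hypothetical stand-in for one extraction kernel plus its detection step.
struct FeaturePipeline {
    std::string name;
    std::function<double(const Image&)> extract;  // plays the role of *_Ex
    std::function<double(double)> detect;         // plays the role of *_Det
};

// Sequential graph: Ex, Det, Ex, Det, ... with no two kernels overlapping.
std::vector<double> run_sequential(const std::vector<FeaturePipeline>& pipelines,
                                   const Image& image) {
    std::vector<double> scores;
    for (const FeaturePipeline& p : pipelines)
        scores.push_back(p.detect(p.extract(image)));
    return scores;
}

// Task-parallel graph: all extractions together, then all detections together.
std::vector<double> run_task_parallel(const std::vector<FeaturePipeline>& pipelines,
                                      const Image& image) {
    std::vector<std::future<double>> extracting;
    for (const FeaturePipeline& p : pipelines)
        extracting.push_back(std::async(std::launch::async,
                                        [&p, &image] { return p.extract(image); }));
    std::vector<double> features;
    for (std::future<double>& f : extracting) features.push_back(f.get());

    std::vector<std::future<double>> detecting;
    for (std::size_t i = 0; i < pipelines.size(); ++i) {
        const FeaturePipeline& p = pipelines[i];
        const double feature = features[i];
        detecting.push_back(std::async(std::launch::async,
                                       [&p, feature] { return p.detect(feature); }));
    }
    std::vector<double> scores;
    for (std::future<double>& f : detecting) scores.push_back(f.get());
    assert(scores.size() == pipelines.size());
    return scores;
}

// Toy pipelines named after the four features; the arithmetic is made up.
std::vector<FeaturePipeline> sample_pipelines() {
    std::vector<FeaturePipeline> pipelines;
    const char* names[] = {"CH", "CC", "TX", "EH"};
    for (int i = 0; i < 4; ++i) {
        const double weight = i + 1.0;
        pipelines.push_back({names[i],
                             [weight](const Image& im) {
                                 return weight * std::accumulate(im.begin(), im.end(), 0);
                             },
                             [](double feature) { return feature / 10.0; }});
    }
    return pipelines;
}
```

Both functions return the same scores in the same order; only the schedule differs. The parallel version is bounded by the slowest extraction rather than by the sum of all of them.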

40 Discussion
TLP parallel or pipelined?

41 Task parallelism – results*
Task sequential and task-parallel application – results
*reported on PS3

42 Data parallelism – setup
Data parallel requires all SPEs to execute the same kernel in SPMD fashion
Requires SPE reconfiguration:
  Thread re-creation
  Overlays
Add the data parallel graph and the kernels numbers.
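
SPMD execution means every worker runs the same kernel on its own slice of the data, and the partial results are merged afterwards. A minimal host-side sketch, assuming a simple 256-bin histogram as the kernel and `std::thread` workers in place of SPEs (the names `histogram_kernel` and `histogram_spmd` are hypothetical, and SPE reconfiguration is not modelled):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

using Histogram = std::array<std::size_t, 256>;

// The single kernel that every worker runs on its own slice of the image.
void histogram_kernel(const std::uint8_t* pixels, std::size_t count, Histogram& out) {
    out.fill(0);
    for (std::size_t i = 0; i < count; ++i) ++out[pixels[i]];
}

// Splits the pixels into contiguous slices, one per worker, then merges.
Histogram histogram_spmd(const std::vector<std::uint8_t>& pixels, std::size_t workers) {
    assert(workers > 0);
    std::vector<Histogram> partial(workers);  // sized up front, never reallocated
    std::vector<std::thread> threads;
    const std::size_t chunk = (pixels.size() + workers - 1) / workers;  // ceiling
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, pixels.size());
        const std::size_t end = std::min(begin + chunk, pixels.size());
        threads.emplace_back(histogram_kernel, pixels.data() + begin, end - begin,
                             std::ref(partial[w]));
    }
    for (std::thread& t : threads) t.join();

    Histogram total{};  // zero-initialised
    for (const Histogram& h : partial)
        for (std::size_t bin = 0; bin < total.size(); ++bin) total[bin] += h[bin];
    return total;
}
```

Each worker writes only to its own partial histogram, so no locking is needed during the kernel; the merge is the only sequential step. The result is independent of the number of workers.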

43 Data parallelism – results* [1/2]
Kernels do scale when run alone
Note that the kernel-only data parallelism [no reconfiguration, no other threads running] scales with the number of SPEs. The superlinear speed-up is due to the use of more DMA data in one chunk, which becomes available because the required amount of local data to be processed is decreasing.
*reported on PS3

44 Conclusions
Multi-processors inevitable
Huge performance increase, but…
  Hell to program
    Got to be an architecture expert
  Portability?
Material (mandatory):
