Platform based design 5KK70 MPSoC Platforms With special emphasis on the Cell Bart Mesman and Henk Corporaal.

6/25/2015 ECA - 5KK73. H. Corporaal and B. Mesman

The Software Crisis

The first SW crisis
Time frame: ’60s and ’70s
Problem: assembly language programming
– Computers could handle larger, more complex programs
– Needed abstraction and portability without losing performance
Solution:
– High-level languages for von Neumann machines: FORTRAN and C

The second SW crisis
Time frame: ’80s and ’90s
Problem: inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers
– Computers could handle larger, more complex programs
– Needed composability and maintainability
– High performance was not an issue: left to Moore’s Law

Solution
Object-oriented programming
– C++, C# and Java
Also…
– Better tools: component libraries, Purify
– Better software engineering methodology: design patterns, specification, testing, code reviews

Today: Programmers are Oblivious to Processors
Solid boundary between hardware and software
Programmers don’t have to know anything about the processor
– High-level languages abstract away the processors (e.g., Java bytecode is machine independent)
– Moore’s law does not require the programmers to know anything about the processors to get good speedups
Programs are oblivious to the processor -> work on all processors
– A program written in the ’70s using C still works and is much faster today
This abstraction provides a lot of freedom for the programmers

The third crisis: Powered by PlayStation

Contents
Hammer your head against 4 walls
– Or: why multi-processor
Cell architecture
Programming and porting
– plus case study

Moore’s Law

Single Processor SPECint Performance

What’s stopping them?
General-purpose uni-cores have stopped historic performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism

Power density

Power Efficiency (Watts/Spec)

1 clock cycle wire range

Global wiring delay becomes dominant over gate delay

Memory

Now what?
Latest research drained
Tried every trick in the book
So: we’re fresh out of ideas
Multi-processor is all that’s left!

Low power through parallelism
Sequential processor
– Switching capacitance $C$
– Frequency $f$
– Voltage $V$
– $P = \alpha f C V^2$
Parallel processor (two times the number of units)
– Switching capacitance $2C$
– Frequency $f/2$
– Voltage $V' < V$
– $P = \alpha \cdot (f/2) \cdot 2C \cdot V'^2 = \alpha f C V'^2$
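A quick numeric check of this scaling argument, as a minimal sketch in C. It assumes the symbol in front of the power formula is the switching-activity factor, called `alpha` here; the function names and the example voltages are illustrative and do not come from the course material.

```c
/* Dynamic power P = alpha * f * C * V^2.
 * alpha (activity factor) is an assumed name for the leading symbol
 * of the slide's formula. */
double dynamic_power(double alpha, double f, double c, double v)
{
    return alpha * f * c * v * v;
}

/* Ratio P_parallel / P_sequential when the number of units doubles
 * (capacitance 2C), the clock halves (f/2) and the supply drops
 * from v_seq to v_par. Normalized units are used for alpha, f and C. */
double parallel_power_ratio(double v_seq, double v_par)
{
    double p_seq = dynamic_power(1.0, 1.0, 1.0, v_seq);
    double p_par = dynamic_power(1.0, 0.5, 2.0, v_par);
    return p_par / p_seq;
}
```

With a hypothetical supply reduction from 1.0 V to 0.8 V, `parallel_power_ratio(1.0, 0.8)` gives 0.64: the same nominal throughput at about two thirds of the power. The factors 2 and 1/2 cancel, so the whole saving comes from the lower voltage.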

Architecture methods: Powerful Instructions (1)
MD-technique
– Multiple data operands per operation
– SIMD: Single Instruction Multiple Data
Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];

    c = a + 5*b
Assembly:
    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)

Architecture methods: Powerful Instructions (1)
Sub-word parallelism
– SIMD on restricted scale
– Used for multimedia instructions
– Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
Examples
– MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II
– Example: $\sum_{i=1}^{4} |a_i - b_i|$
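The sum-of-absolute-differences example can be modelled in plain C. This is only a sketch of the result that a sub-word instruction would produce on four 16-bit lanes packed into one 64-bit word; the function name `sad4_u16` is hypothetical and the lanes are assumed to be unsigned.

```c
#include <stdint.h>

/* Sum of absolute differences over four unsigned 16-bit lanes packed
 * into one 64-bit word: sum over i = 1..4 of |a_i - b_i|.
 * A sub-word SIMD unit would process the four lanes at once;
 * this loop only models the result lane by lane. */
uint32_t sad4_u16(uint64_t a, uint64_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t ai = (uint16_t)(a >> (16 * lane));
        uint16_t bi = (uint16_t)(b >> (16 * lane));
        sum += (ai > bi) ? (uint32_t)(ai - bi) : (uint32_t)(bi - ai);
    }
    return sum;
}
```

For example, lanes (1, 2, 3, 4) against lanes (5, 1, 1, 1), lowest lane first, give 4 + 1 + 2 + 3 = 10.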

MPSoC Issues
Homogeneous vs heterogeneous
Shared memory vs local memory
Topology
Communication (bus vs. network)
Granularity (many small vs few large)
Mapping
– Automatic vs manual parallelization
– TLP vs DLP
– Parallel vs pipelined

Multi-core

Communication models: Shared Memory
[Diagram: Process P1 and Process P2 both read and write a shared memory]
Coherence problem
Memory consistency issue
Synchronization problem

SMP: Symmetric Multi-Processor
Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: four processors, each with one or more cache levels, sharing main memory and the I/O system]

DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
[Diagram: four nodes (processor, cache, memory) on an interconnection network, plus main memory and I/O system]

Communication models: Message Passing
Communication primitives
– e.g., send, receive library calls
[Diagram: Process P1 and Process P2 connected by a FIFO, with send and receive]
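A minimal sketch of the send/receive primitives over a FIFO channel, in C. It is a single-threaded model only: a real channel between two processes would also need blocking or synchronization. The type `fifo_t`, the functions `fifo_send` and `fifo_receive`, and the depth of 4 are all hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 4  /* hypothetical channel depth */

typedef struct {
    int    slots[FIFO_CAPACITY];
    size_t head;   /* index of the oldest queued message */
    size_t count;  /* number of messages currently queued */
} fifo_t;

/* send: enqueue one message; returns false if the channel is full. */
bool fifo_send(fifo_t *q, int msg)
{
    if (q->count == FIFO_CAPACITY)
        return false;
    q->slots[(q->head + q->count) % FIFO_CAPACITY] = msg;
    q->count++;
    return true;
}

/* receive: dequeue the oldest message; returns false if the channel is empty. */
bool fifo_receive(fifo_t *q, int *msg)
{
    if (q->count == 0)
        return false;
    *msg = q->slots[q->head];
    q->head = (q->head + 1) % FIFO_CAPACITY;
    q->count--;
    return true;
}
```

A channel starts empty with `fifo_t q = {0};`. The sender never touches the receiver's data directly, which is the "explicit communication" property of message passing.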

Message passing communication
[Diagram: four nodes, each with processor, cache, memory, and DMA, attached through network interfaces to an interconnection network]

Communication Models: Comparison
Shared-Memory
– Compatibility with well-understood (language) mechanisms
– Ease of programming for complex or dynamic communication patterns
– Shared-memory applications; sharing of large data structures
– Efficient for small items
– Supports hardware caching
Message Passing
– Simpler hardware
– Explicit communication
– Scalable!

Three fundamental issues for shared memory multiprocessors
Coherence: do I see the most recent data?
Consistency: when do I see a written value?
– e.g., do different processors see writes at the same time (w.r.t. other memory accesses)?
Synchronization: how to synchronize processes?
– how to protect access to shared data?

Coherence problem, in Multi-Proc system
[Diagram: CPU-1 and CPU-2 each hold cached copies (a', b' and a'', b'') of memory locations a and b; values shown: 550, 100, 200 and 100, 200]

Potential HW Coherency Solutions
Snooping solution (snoopy bus):
– Send all requests for data to all processors (or local caches)
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small-scale machines (most of the market)
Directory-based schemes
– Keep track of what is being shared in one centralized place
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than snooping
– Actually existed BEFORE snooping-based schemes

Example Snooping protocol
3 states for each cache line:
– invalid, shared, modified (exclusive)
FSM per cache, receives requests from both processor and bus
[Diagram: four processors, each with a cache, sharing main memory and the I/O system]

Cache coherence protocol
Write-invalidate protocol for write-back cache
Showing state transitions for each block in the cache
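The state diagram itself is not part of this transcript. As a stand-in, here is a sketch of the generic textbook three-state (invalid/shared/modified) write-invalidate protocol for one cache block, written as a next-state function in C. The enum and function names are hypothetical, and the bus actions (placing a miss on the bus, writing a modified block back) are only noted in comments.

```c
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef enum {
    CPU_READ,        /* read by the local processor                   */
    CPU_WRITE,       /* write by the local processor                  */
    BUS_READ_MISS,   /* another cache reads this block (snooped)      */
    BUS_WRITE_MISS   /* another cache wants to write it (snooped)     */
} event_t;

/* Next state of one cache block under a generic MSI write-invalidate
 * protocol; this is an assumed textbook form, not the slide's figure. */
line_state_t msi_next_state(line_state_t s, event_t e)
{
    switch (e) {
    case CPU_READ:
        /* A read miss fetches the block and enters SHARED; hits stay put. */
        return (s == INVALID) ? SHARED : s;
    case CPU_WRITE:
        /* A write from INVALID or SHARED first places a write miss on the
         * bus, so that all other copies are invalidated. */
        return MODIFIED;
    case BUS_READ_MISS:
        /* A modified block is written back before it is shared. */
        return (s == MODIFIED) ? SHARED : s;
    case BUS_WRITE_MISS:
        /* Another writer invalidates this copy (write back first if modified). */
        return INVALID;
    }
    return s;
}
```

Each cache runs one such FSM per block and feeds it both its own processor's requests and the requests it snoops on the bus.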

Synchronization problem
Computer system of bank has credit process (P_c) and debit process (P_d)

    shared int balance
    private int amount

    /* Process P_c */          /* Process P_d */
    balance += amount          balance -= amount

    lw  $t0,balance            lw  $t2,balance
    lw  $t1,amount             lw  $t3,amount
    add $t0,$t0,$t1            sub $t2,$t2,$t3
    sw  $t0,balance            sw  $t2,balance
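Why these two instruction sequences are a problem can be shown with a deterministic model in C. The sketch below does not use real threads; it simply replays two possible orderings of the loads and stores. The function names and the example amounts are illustrative assumptions.

```c
/* P_c runs to completion before P_d starts: both updates are applied. */
int run_serial(int balance, int credit, int debit)
{
    int t0 = balance;        /* P_c: lw       */
    balance = t0 + credit;   /* P_c: add, sw  */
    int t2 = balance;        /* P_d: lw       */
    balance = t2 - debit;    /* P_d: sub, sw  */
    return balance;
}

/* Both processes load the balance before either one stores:
 * the second store overwrites the first (lost update). */
int run_interleaved(int balance, int credit, int debit)
{
    int t0 = balance;        /* P_c: lw                          */
    int t2 = balance;        /* P_d: lw (still the old value)    */
    balance = t0 + credit;   /* P_c: add, sw                     */
    balance = t2 - debit;    /* P_d: sub, sw, P_c's update lost  */
    return balance;
}
```

Starting from a balance of 100 with a credit of 50 and a debit of 30, the serial order ends at 120, while the interleaved order ends at 70: the credit has vanished.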

Issues for Synchronization
Hardware support:
– Uninterruptible instruction to fetch and update memory (atomic operation)
User-level synchronization operation(s) using this primitive
For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
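A user-level lock built on an atomic fetch-and-update primitive might look like the sketch below. It uses the standard C11 `atomic_flag` test-and-set operation on an ordinary host, which is an assumption made for illustration and not a Cell-specific mechanism; the names `spinlock_t`, `spin_trylock`, `spin_lock` and `spin_unlock` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } spinlock_t;

/* One atomic test-and-set: returns true if this caller acquired the lock. */
bool spin_trylock(spinlock_t *l)
{
    /* test_and_set returns the previous value: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
void spin_lock(spinlock_t *l)
{
    while (!spin_trylock(l))
        ;  /* spin; every retry is another atomic access to shared memory */
}

/* Release the lock so that another test-and-set can succeed. */
void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock is initialized with `spinlock_t l = { ATOMIC_FLAG_INIT };`. The busy-wait loop is also where the contention problem on large machines comes from: all waiting processors keep hitting the same memory location.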

Cell

What can it do?

Cell/B.E. - the history
Sony/Toshiba/IBM consortium
– Austin, TX – March 2001
– Initial investment: $400,000,000
Official name: STI Cell Broadband Engine
– Also goes by Cell BE, STI Cell, Cell
In production for:
– PlayStation 3 from Sony
– Mercury’s blades

Cell blade

Cell/B.E. – the architecture
1 x PPE: 64-bit PowerPC
– L1: 32 KB I$ + 32 KB D$
– L2: 512 KB
8 x SPE cores:
– Local store: 256 KB
– 128 x 128-bit vector registers
Hybrid memory model:
– PPE: Rd/Wr
– SPEs: asynchronous DMA
EIB: 205 GB/s sustained aggregate bandwidth
Processor-to-memory bandwidth: 25.6 GB/s
Processor-to-processor: 20 GB/s in each direction

Cell chip

SPE

SPE

SPE pipeline

Communication

8 parallel transactions

C++ on Cell
1. Send the code of the function to be run on SPE
2. Send address to fetch the data
3. DMA data into LS from the main memory
4. Run the code on the SPE
5. DMA data out of LS to the main memory
6. Signal the PPE that the SPE has finished the function

Cell/B.E. – the future (multi-tile?)

Porting C++
1. Detect & isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels to SPEs
4. Iteratively optimize SPE code

Performance estimation
Based on Amdahl’s law … where
– $K_i^{fr}$ = the fraction of the execution time for kernel $K_i$
– $K_i^{speed-up}$ = the speed-up of kernel $K_i$ compared with the sequential version

Performance estimation
Based on Amdahl’s law:
– Sequential use of kernels:
– Parallel use of kernels: ?
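The formulas from these slides are not in the transcript. For the sequential case, the usual multi-kernel form of Amdahl's law is S = 1 / ((1 - Σ fr_i) + Σ fr_i / speedup_i), and the C sketch below evaluates it. Treat this as an assumption about what the slide showed rather than a reconstruction of it; the function name `amdahl_sequential` and the sample numbers are illustrative. The parallel case, left open on the slide, is not attempted here.

```c
#include <stddef.h>

/* Estimated application speed-up when n accelerated kernels are used one
 * after another (assumed generalized Amdahl form):
 *   S = 1 / ( (1 - sum_i fr[i]) + sum_i fr[i] / speedup[i] )
 * fr[i]      : fraction of the original execution time spent in kernel i
 * speedup[i] : speed-up of kernel i compared with the sequential version */
double amdahl_sequential(const double *fr, const double *speedup, size_t n)
{
    double rest   = 1.0;  /* fraction of time not covered by any kernel */
    double scaled = 0.0;  /* remaining kernel time after acceleration   */
    for (size_t i = 0; i < n; i++) {
        rest   -= fr[i];
        scaled += fr[i] / speedup[i];
    }
    return 1.0 / (rest + scaled);
}
```

As an example with made-up numbers, two kernels that each take 40 % of the run time and speed up by 4x and 8x give 1 / (0.2 + 0.1 + 0.05), about 2.86x overall: the unaccelerated 20 % already dominates.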

MARVEL case-study
Multimedia content retrieval and analysis
For each picture, we extract the values for the features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram
The image features are compared with the model features to generate an overall confidence score
http://www.research.ibm.com/marvel

MarCell = MARVEL on Cell
Identified 5 kernels to port to the SPEs:
– 4 feature extraction algorithms
    ColorHistogram (CHExtract)
    ColorCorrelogram (CCExtract)
    Texture (TXExtract)
    EdgeHistogram (EHExtract)
– 1 common concept detection, repeated for each feature

MarCell – kernels speed-up

Kernel      SPE [ms]   Speed-up   Speed-up      Speed-up     Overall
                       vs. PPE    vs. Desktop   vs. Laptop   contribution
AppStart    7.17        0.95       0.67          0.83          8 %
CHExtract   0.82       52.22      21.00         30.17          8 %
CCExtract   5.87       55.44      21.26         22.45         54 %
TXExtract   2.01       15.56       7.08          8.04          6 %
EHExtract   2.48       91.05      18.79         30.85         28 %
CDetect     0.41        7.15       3.75          4.88          2 %

MarCell – kernels execution times

Task parallelism – setup

Task parallelism – results*
*reported on PS3

Data parallelism – setup
Data parallelism requires all SPEs to execute the same kernel in SPMD fashion
Requires SPE reconfiguration:
– Thread re-creation
– Overlays
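In SPMD execution every SPE runs the same kernel on its own slice of the data. One common way to pick the slices is a contiguous block partition, sketched below in C. The partitioning scheme, the type `range_t` and the function `spmd_range` are assumptions for illustration; the slides do not say how MarCell divides an image.

```c
#include <stddef.h>

typedef struct { size_t begin, end; } range_t;  /* half-open: [begin, end) */

/* Contiguous block partition of n items over num_workers workers.
 * Worker `id` (0-based) gets n / num_workers items; the first
 * n % num_workers workers get one extra item each. */
range_t spmd_range(size_t n, size_t num_workers, size_t id)
{
    size_t base  = n / num_workers;
    size_t extra = n % num_workers;
    range_t r;
    r.begin = id * base + (id < extra ? id : extra);
    r.end   = r.begin + base + (id < extra ? 1 : 0);
    return r;
}
```

With 10 rows and 8 workers, workers 0 and 1 get two rows each and the remaining six get one row each, so the ranges cover all rows exactly once.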

Data parallelism – results* [1/2]
*reported on PS3
► Kernels do scale when run alone

Conclusions
Multi-processors inevitable
Huge performance increase, but…
Hell to program
– Got to be an architecture expert
– Portability?
