Platform-based Design 5KK70: MPSoC Platforms, Overview and the Cell Platform. Bart Mesman and Henk Corporaal.


Slide 1: Platform-based Design 5KK70: MPSoC Platforms, Overview and the Cell Platform. Bart Mesman and Henk Corporaal

Platform Design, H. Corporaal and B. Mesman, 5/16/2015. Slide 2: The Software Crisis

Slide 3: The first SW crisis
Time frame: the ’60s and ’70s.
Problem: assembly-language programming. Computers could handle larger, more complex programs; abstraction and portability were needed without losing performance.
Solution: high-level languages for von Neumann machines, notably FORTRAN and C.

Slide 4: The second SW crisis
Time frame: the ’80s and ’90s.
Problem: the inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers. Computers could handle larger, more complex programs; composability and maintainability were needed. High performance was not an issue: that was left to Moore’s Law.

Slide 5: Solution
Object-oriented programming: C++, C# and Java. Also:
–Better tools: component libraries, Purify
–Better software-engineering methodology: design patterns, specification, testing, code reviews

Slide 6: Today, programmers are oblivious to processors
There is a solid boundary between hardware and software: programmers don’t have to know anything about the processor.
–High-level languages abstract away the processor (e.g., Java bytecode is machine independent).
–Moore’s law does not require programmers to know anything about the processor to get good speedups.
Programs are oblivious of the processor, so they work on all processors: a program written in C in the ’70s still works, and is much faster, today. This abstraction gives programmers a lot of freedom.

Slide 7: The third crisis: powered by PlayStation

Slide 8: Contents
–Hammer your head against 4 walls (or: why multi-processor)
–Cell architecture
–Programming and porting, plus a case study

Slide 9: Moore’s Law

Slide 10: Single-processor SPECint performance

Slide 11: What’s stopping them?
General-purpose uni-cores have stopped historic performance scaling, because of:
–Power consumption
–Wire delays
–DRAM access latency
–Diminishing returns from more instruction-level parallelism

Slide 12: Power density

Slide 13: Power efficiency (Watts/SPEC)

Slide 14: One-clock-cycle wire range

Slide 15: Global wiring delay becomes dominant over gate delay

Slide 16: Memory

Slide 17: Now what?
The latest research is drained; every trick in the book has been tried. So: we’re fresh out of ideas, and multi-processor is all that’s left!

Slide 18: MPSoC issues
–Homogeneous vs. heterogeneous
–Shared memory vs. local memory
–Topology
–Communication (bus vs. network)
–Granularity (many small vs. few large processors)
–Mapping: automatic vs. manual parallelization, TLP vs. DLP, parallel vs. pipelined

Slide 19: Multi-core

Slide 20: Communication models: shared memory
[Diagram: processes P1 and P2 read and write a shared memory.]
Issues: the coherence problem, memory consistency, and the synchronization problem.

Slide 21: SMP: Symmetric Multi-Processor
Memory is centralized, with uniform access time (UMA), a bus interconnect, and shared I/O. Examples: Sun Enterprise 6000, SGI Challenge, Intel.
[Diagram: several processors, each with one or more cache levels, on a bus to main memory and the I/O system.]

Slide 22: DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and a scalable interconnect (distributed memory).
[Diagram: processor + cache + memory nodes connected by an interconnection network, plus main memory and the I/O system.]

Slide 23: Communication models: message passing
Communication primitives: e.g., send and receive library calls.
[Diagram: process P1 sends to process P2 through a FIFO.]

Slide 24: Message-passing communication
[Diagram: processor + cache + memory + DMA nodes, each with a network interface, connected by an interconnection network.]

Slide 25: Communication models: comparison
Shared memory:
–Compatible with well-understood (language) mechanisms
–Ease of programming for complex or dynamic communication patterns, and for sharing large data structures
–Efficient for small items
–Supports hardware caching
Message passing:
–Simpler hardware
–Explicit communication
–Scalable!

Slide 26: Three fundamental issues for shared-memory multiprocessors
–Coherence: do I see the most recent data?
–Consistency: when do I see a written value? E.g., do different processors see writes at the same time (with respect to other memory accesses)?
–Synchronization: how do we synchronize processes, and how do we protect access to shared data?

Slide 27: The coherence problem in a multi-processor system
[Diagram: CPU-1’s cache holds copies a' and b' of memory locations a and b; CPU-2’s cache holds its own copies a'' and b''.]

Slide 28: Potential HW coherency solutions
Snooping solution (snoopy bus):
–Send all requests for data to all processors (or local caches)
–Processors snoop to see whether they have a copy, and respond accordingly
–Requires broadcast, since caching information is at the processors
–Works well with a bus (a natural broadcast medium)
–Dominates for small-scale machines (most of the market)
Directory-based schemes:
–Keep track of what is being shared in one centralized place
–Distributed memory => a distributed directory, for scalability (avoids bottlenecks)
–Send point-to-point requests to processors via the network
–Scales better than snooping
–Actually existed BEFORE snooping-based schemes

Slide 29: Example snooping protocol
Three states for each cache line: invalid, shared, and modified (exclusive). An FSM per cache receives requests from both the processor and the bus.
[Diagram: processors with caches on a bus to main memory and the I/O system.]

Slide 30: Cache-coherence protocol
A write-invalidate protocol for a write-back cache, showing the state transitions for each block in the cache.

Slide 31: Synchronization problem
A bank’s computer system has a credit process (P_c) and a debit process (P_d), sharing int balance, each with a private int amount:

  /* Process P_c */          /* Process P_d */
  balance += amount;         balance -= amount;

  lw  $t0,balance            lw  $t2,balance
  lw  $t1,amount             lw  $t3,amount
  add $t0,$t0,$t1            sub $t2,$t2,$t3
  sw  $t0,balance            sw  $t2,balance

Slide 32: Issues for synchronization
–Hardware support: an uninterruptible instruction to fetch and update memory (an atomic operation)
–User-level synchronization operations built on this primitive
–For large-scale MPs, synchronization can be a bottleneck; techniques exist to reduce the contention and latency of synchronization

Slide 33: Cell

Slide 34: What can it do?

Slide 35: Cell/B.E.: the history
Sony/Toshiba/IBM consortium, Austin, TX, March 2001; initial investment: $400,000,000. Official name: STI Cell Broadband Engine (also goes by Cell BE and STI Cell). In production for Sony’s PlayStation 3 and Mercury’s blades.

Slide 36: Cell blade

Slide 37: Cell/B.E.: the architecture
–1 PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB
–8 SPE cores: 256 KB local store each; 128 x 128-bit vector registers
–Hybrid memory model: the PPE does ordinary reads/writes; the SPEs use asynchronous DMA
–EIB: 205 GB/s sustained aggregate bandwidth
–Processor-to-memory bandwidth: 25.6 GB/s; processor-to-processor: 20 GB/s in each direction

Slide 38: Cell chip

Slide 39: SPE

Slide 40: SPE

Slide 41: SPE pipeline

Slide 42: Communication

Slide 43: 8 parallel transactions

Slide 44: C++ on Cell
1. Send the code of the function to be run on the SPE
2. Send the address from which to fetch the data
3. DMA the data into the LS from main memory
4. Run the code on the SPE
5. DMA the data out of the LS to main memory
6. Signal the PPE that the SPE has finished the function

Slide 45: Cell/B.E.: the future (multi-tile?)

Slide 46: Porting C
–Detect and isolate the kernels to be ported
–Replace the kernels with C++ stubs
–Implement the data transfers and move the kernels onto the SPEs
–Iteratively optimize the SPE code

Slide 47: Performance estimation
Based on Amdahl’s law … where
–K_i^fr = the fraction of the execution time spent in kernel K_i
–K_i^speed-up = the speed-up of kernel K_i compared with the sequential version

Slide 48: Performance estimation
Based on Amdahl’s law, for two cases: sequential use of the kernels, and parallel use of the kernels (left as a question).
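The formula elided on slide 47 is presumably the multi-kernel form of Amdahl's law; a reconstruction consistent with the definitions given there, for sequential use of the kernels:

```latex
% Sequential use of the kernels: the non-kernel fraction is untouched,
% and each kernel fraction K_i^{fr} shrinks by its own speed-up.
S_{\mathrm{overall}} =
  \frac{1}
       {\left(1 - \sum_i K_i^{fr}\right)
        + \sum_i \dfrac{K_i^{fr}}{K_i^{speedup}}}
```

For parallel use of the kernels (running concurrently on different SPEs) the kernel terms overlap in time, so the summed kernel term would be replaced by the largest individual term; the slide itself left this case as a question.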

Slide 49: MARVEL case study
Multimedia content retrieval and analysis. For each picture we extract the values of the features of interest (ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram), then compare the image features with the model features and generate an overall confidence score.

Slide 50: MarCell = MARVEL on Cell
Identified 5 kernels to port to the SPEs:
–4 feature-extraction algorithms: ColorHistogram (CHExtract), ColorCorrelogram (CCExtract), Texture (TXExtract), EdgeHistogram (EHExtract)
–1 common concept detection, repeated for each feature

Slide 51: MarCell: kernel speed-ups
[Table: for each kernel (AppStart, CHExtract, CCExtract, TXExtract, EHExtract, CDetect), the SPE time in ms, the speed-up vs. the PPE, vs. a desktop, and vs. a laptop, and the overall contribution in %. The numeric values did not survive extraction.]

Slide 52: MarCell: kernel execution times

Slide 53: Task parallelism: setup

Slide 54: Task parallelism: results (reported on PS3)

Slide 55: Data parallelism: setup
Data parallelism requires all SPEs to execute the same kernel in SPMD fashion. This requires SPE reconfiguration: thread re-creation and overlays.

Slide 56: Data parallelism: results [1/2] (reported on PS3)
Kernels do scale when run alone.

Slide 57: Conclusions
Multi-processors are inevitable and bring a huge performance increase, but they are hell to program:
–You’ve got to be an architecture expert
–Portability?
Material (suggested for the assignment): ll0_v2.html