Platform based design 5KK70: MPSoC Platforms. With special emphasis on the Cell. Bart Mesman and Henk Corporaal.

6/25/2015, ECA - 5KK73, H. Corporaal and B. Mesman

The Software Crisis

The first SW crisis
Time frame: the '60s and '70s.
Problem: assembly-language programming. Computers could handle larger, more complex programs, and abstraction and portability were needed without losing performance.
Solution: high-level languages for von Neumann machines (FORTRAN and C).

The second SW crisis
Time frame: the '80s and '90s.
Problem: the inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers. Composability and maintainability were needed; high performance was not an issue, as it was left to Moore's Law.

Solution
Object-oriented programming: C++, C#, and Java.
Also: better tools (component libraries, Purify) and better software-engineering methodology (design patterns, specification, testing, code reviews).

Today: programmers are oblivious to processors
There is a solid boundary between hardware and software, and programmers don't have to know anything about the processor: high-level languages abstract the processor away (e.g., Java bytecode is machine independent), and Moore's Law delivers good speedups without the programmer knowing anything about the processor. Programs are oblivious of the processor, so they work on all processors: a C program written in the '70s still works, and is much faster, today. This abstraction gives programmers a lot of freedom.

The third crisis: powered by PlayStation

Contents
– Hammer your head against four walls, or: why multi-processor
– Cell architecture
– Programming and porting, plus a case study

Moore's Law

Single-processor SPECint performance

What's stopping them?
General-purpose uni-cores have stopped their historic performance scaling, due to:
– power consumption
– wire delays
– DRAM access latency
– diminishing returns from more instruction-level parallelism

Power density

Power efficiency (Watts/SPEC)

One-clock-cycle wire range

Global wiring delay becomes dominant over gate delay

Memory

Now what?
The latest research is drained; we have tried every trick in the book. So: we're fresh out of ideas, and multi-processor is all that's left!

Low power through parallelism
Sequential processor: switching capacitance C, frequency f, voltage V, so P = α·f·C·V².
Parallel processor (twice the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P = α·(f/2)·2C·V'² = α·f·C·V'².
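Written out, the slide's power argument is:

```latex
P_{\text{seq}} = \alpha f C V^2
\qquad
P_{\text{par}} = \alpha \cdot \tfrac{f}{2} \cdot 2C \cdot V'^2 = \alpha f C V'^2
\qquad
\frac{P_{\text{par}}}{P_{\text{seq}}} = \left(\frac{V'}{V}\right)^2 < 1
```

Doubling the number of units halves the required frequency, which in turn permits a lower supply voltage V' < V; since dynamic power is quadratic in voltage, the parallel version wins by the factor (V'/V)².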

Architecture methods: powerful instructions (1)
The MD technique: multiple data operands per operation. SIMD: Single Instruction, Multiple Data. Vector instruction:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];

becomes the single vector statement c = a + 5*b. Assembly:

    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)

Architecture methods: powerful instructions (1)
Sub-word parallelism: SIMD on a restricted scale, used for multimedia instructions. Motivation: use one powerful 64-bit ALU as four 16-bit ALUs. Examples: MMX, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II. Example operation: Σ_{i=1..4} |a_i − b_i|.

MPSoC issues
– Homogeneous vs. heterogeneous
– Shared memory vs. local memory
– Topology
– Communication: bus vs. network
– Granularity: many small vs. few large
– Mapping: automatic vs. manual parallelization; TLP vs. DLP; parallel vs. pipelined

Multi-core

Communication models: shared memory
Processes P1 and P2 communicate by reading and writing a shared memory. This raises the coherence problem, the memory-consistency issue, and the synchronization problem.

SMP: symmetric multiprocessor
Memory: centralized, with uniform memory access time (UMA), a bus interconnect, and I/O. Each processor has one or more cache levels; all share main memory and the I/O system. Examples: Sun Enterprise 6000, SGI Challenge, Intel.

DSM: distributed shared memory
Non-uniform memory access time (NUMA) and a scalable interconnect (distributed memory): each node holds a processor, cache, and memory, connected through an interconnection network.

Communication models: message passing
Communication primitives, e.g., send and receive library calls: process P1 sends and process P2 receives through a FIFO.

Message-passing communication
Each node holds a processor, cache, memory, and a DMA engine, and is attached to the interconnection network through a network interface.

Communication models: comparison
Shared memory: compatibility with well-understood (language) mechanisms; ease of programming for complex or dynamic communication patterns; suits shared-memory applications and the sharing of large data structures; efficient for small items; supports hardware caching.
Message passing: simpler hardware; explicit communication; scalable!

Three fundamental issues for shared-memory multiprocessors
– Coherence: do I see the most recent data?
– Consistency: when do I see a written value? E.g., do different processors see writes at the same time (w.r.t. other memory accesses)?
– Synchronization: how do we synchronize processes, and how do we protect access to shared data?

The coherence problem in a multiprocessor system
CPU-1 holds cached copies a' and b', CPU-2 holds cached copies a'' and b'', while memory holds a and b; the copies can disagree.

Potential HW coherency solutions
Snooping solution (snoopy bus): send all requests for data to all processors (or their local caches); processors snoop to see if they have a copy and respond accordingly. This requires broadcast, since the caching information is at the processors; it works well with a bus (a natural broadcast medium) and dominates for small-scale machines (most of the market).
Directory-based schemes: keep track of what is being shared in one centralized place; with distributed memory, distribute the directory as well for scalability (avoiding bottlenecks); send point-to-point requests to processors via the network. Directories scale better than snooping, and actually existed before snooping-based schemes.

Example snooping protocol
Three states for each cache line: invalid, shared, and modified (exclusive). An FSM per cache receives requests from both the processor and the bus.

Cache coherence protocol
A write-invalidate protocol for a write-back cache, showing the state transitions for each block in the cache.

Synchronization problem
A bank's computer system has a credit process (P_c) and a debit process (P_d):

    shared  int balance;
    private int amount;

    /* Process P_c */        /* Process P_d */
    balance += amount;       balance -= amount;

    lw  $t0,balance          lw  $t2,balance
    lw  $t1,amount           lw  $t3,amount
    add $t0,$t0,$t1          sub $t2,$t2,$t3
    sw  $t0,balance          sw  $t2,balance

Issues for synchronization
Hardware support: an un-interruptable instruction to fetch and update memory (an atomic operation). User-level synchronization operations are built using this primitive. For large-scale MPs, synchronization can be a bottleneck, so techniques are needed to reduce the contention and latency of synchronization.

Cell

What can it do?

Cell/B.E.: the history
A Sony/Toshiba/IBM consortium, Austin, TX, March 2001; initial investment: $400,000,000. Official name: STI Cell Broadband Engine (also goes by Cell BE, STI Cell, and Cell). In production for: the PlayStation 3 from Sony, and Mercury's blades.

Cell blade

Cell/B.E.: the architecture
– 1 PPE: 64-bit PowerPC; L1: 32 KB I$ + 32 KB D$; L2: 512 KB
– 8 SPE cores: 256 KB local store; 128 x 128-bit vector registers
– Hybrid memory model: the PPE reads/writes memory directly, the SPEs use asynchronous DMA
– EIB: 205 GB/s sustained aggregate bandwidth
– Processor-to-memory bandwidth: 25.6 GB/s
– Processor-to-processor: 20 GB/s in each direction

Cell chip

SPE

SPE

SPE pipeline

Communication

8 parallel transactions

C++ on Cell
1. Send the code of the function to be run on the SPE.
2. Send the address from which to fetch the data.
3. DMA the data from main memory into the LS.
4. Run the code on the SPE.
5. DMA the data out of the LS back to main memory.
6. Signal the PPE that the SPE has finished the function.

Cell/B.E.: the future (multi-tile?)

Porting C
– Detect and isolate the kernels to be ported.
– Replace the kernels with C++ stubs.
– Implement the data transfers and move the kernels onto the SPEs.
– Iteratively optimize the SPE code.

Performance estimation
Based on Amdahl's law …, where:
– K_i^fr = the fraction of the execution time for kernel K_i
– K_i^speed-up = the speed-up of kernel K_i compared with the sequential version
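The formula itself is elided on the slide; given the definitions above, it is presumably the multi-kernel form of Amdahl's law, reconstructed here under that assumption:

```latex
\text{speed-up} \;=\;
\frac{1}{\Bigl(1 - \sum_i K_i^{fr}\Bigr) \;+\; \sum_i \dfrac{K_i^{fr}}{K_i^{speed\text{-}up}}}
```

The first term in the denominator is the unaccelerated remainder of the program; each accelerated kernel contributes its fraction shrunk by its own speed-up.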

Performance estimation
Based on Amdahl's law, for sequential use of kernels and for parallel use of kernels.

MARVEL case study
Multimedia content retrieval and analysis. For each picture, the values of the features of interest are extracted: ColorHistogram, ColorCorrelogram, Texture, and EdgeHistogram. The image features are then compared with the model features to generate an overall confidence score.

MarCell = MARVEL on Cell
Five kernels were identified to port to the SPEs: four feature-extraction algorithms, namely ColorHistogram (CHExtract), ColorCorrelogram (CCExtract), Texture (TXExtract), and EdgeHistogram (EHExtract), plus one common concept detection, repeated for each feature.

MarCell: kernel speed-ups
Columns: kernel, SPE time [ms], speed-up vs. PPE, speed-up vs. desktop, speed-up vs. laptop, overall contribution [%]. Rows: AppStart, CHExtract, CCExtract, TXExtract, EHExtract, CDetect.

MarCell: kernel execution times

Task parallelism: setup

Task parallelism: results (reported on PS3)

Data parallelism: setup
Data-parallel execution requires all SPEs to execute the same kernel in SPMD fashion. This requires SPE reconfiguration: thread re-creation, or overlays.

Data parallelism: results [1/2] (reported on PS3)
Kernels do scale when run alone.

Conclusions
Multiprocessors are inevitable. They bring a huge performance increase, but they are hell to program: you have to be an architecture expert, and portability is an open question.