Slide-1 Multicore Theory. MIT Lincoln Laboratory. Theory of Multicore Algorithms. Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory, HPEC 2008. This work is sponsored by the Department of Defense under Air Force Contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory Slide-2 Multicore Theory. Outline: Parallel Design (Programming Challenge, Example Issues, Theoretical Approach, Integrated Process); Distributed Arrays; Kuck Diagrams; Hierarchical Arrays; Tasks and Conduits; Summary.

MIT Lincoln Laboratory Slide-3 Multicore Theory. Multicore Programming Challenge. The Moore's Law era was a great success: a simple model (load, op, store) with many transistors devoted to delivering that model. Moore's Law is ending, and transistors are now needed for performance. Past programming model: Von Neumann. Future programming model: ??? Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor? The processor topology includes registers, cache, local memory, remote memory, and disk; Cell has multiple programming models.

MIT Lincoln Laboratory Slide-4 Multicore Theory. Example Issues. Y = X + 1, X,Y : N x N. Where is the data? How is it distributed? Where is it running? A serial algorithm can run on a serial processor with relatively little specification. A hierarchical heterogeneous multicore algorithm requires much more information: Which binary to run? What is the initialization policy? How does the data flow? What are the allowed message sizes? Can computations and communications overlap?

MIT Lincoln Laboratory Slide-5 Multicore Theory. Theoretical Approach. [Diagram: Task1^S1() with A : R^(N x P(N)) on processors 0-3 and Task2^S2() with B : R^(N x P(N)) on processors 4-7, connected by a Topic12 conduit (A -> Topic12 -> B), with two replicas (Replica 0, Replica 1).] Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified.

MIT Lincoln Laboratory Slide-6 Multicore Theory. Integrated Development Process: 1. Develop serial code (desktop); 2. Parallelize code (cluster); 3. Deploy code (embedded computer); 4. Automatically parallelize code. The same program Y = X + 1 is refined only through its array map: X,Y : N x N (serial), X,Y : P(N) x N (parallel), X,Y : P(P(N)) x N (hierarchical). The notation should naturally support standard parallel embedded software development practices.

MIT Lincoln Laboratory Slide-7 Multicore Theory. Outline: Parallel Design; Distributed Arrays (Serial Program, Parallel Execution, Distributed Arrays, Redistribution); Kuck Diagrams; Hierarchical Arrays; Tasks and Conduits; Summary.

MIT Lincoln Laboratory Slide-8 Multicore Theory. Serial Program. Math is the language of algorithms: it allows mathematical expressions to be written concisely, and multi-dimensional arrays are fundamental to mathematics. Y = X + 1, X,Y : N x N.
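As an illustration only (the talk uses pMatlab-style notation, not Python), the serial baseline Y = X + 1 on an N x N array might look like the following sketch; the value of N is an arbitrary assumption.

```python
import numpy as np

N = 8                                            # assumed problem size for illustration
X = np.arange(N * N, dtype=float).reshape(N, N)  # X,Y : N x N
Y = X + 1                                        # the entire algorithm: one array expression
```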

MIT Lincoln Laboratory Slide-9 Multicore Theory. Parallel Execution. Run N_P copies of the same program (Single Program Multiple Data, SPMD). Each copy has a unique P_ID. Every array is replicated on each copy of the program. Y = X + 1, X,Y : N x N, for P_ID = 0, 1, ..., N_P-1.
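A minimal SPMD sketch (using mpi4py as an assumed stand-in for the pMatlab runtime; launch with mpirun -n N_P): every rank executes the same program, holds a full replica of the arrays, and differs only in its P_ID.

```python
import numpy as np
from mpi4py import MPI      # assumed stand-in for the SPMD runtime

comm = MPI.COMM_WORLD
P_ID = comm.Get_rank()      # unique ID of this program copy
N_P  = comm.Get_size()      # number of copies launched

N = 8
X = np.zeros((N, N))        # replicated: every copy holds the full N x N array
Y = X + 1                   # every copy performs the identical computation
```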

MIT Lincoln Laboratory Slide-10 Multicore Theory. Distributed Array Program. Use the P() notation to make a distributed array; it tells the program which dimension to distribute the data over. Each program copy implicitly operates only on its own data (owner-computes rule). Y = X + 1, X,Y : P(N) x N, for P_ID = 0, 1, ..., N_P-1.
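A sketch of what X,Y : P(N) x N means in practice (illustrative Python, not the pMatlab implementation): the first dimension is block-distributed over N_P ranks, each rank allocates only the rows it owns, and the owner-computes rule means each rank updates only that slab.

```python
import numpy as np
from mpi4py import MPI      # assumed stand-in runtime

comm = MPI.COMM_WORLD
P_ID, N_P = comm.Get_rank(), comm.Get_size()

N = 8
# Block map for P(N): rank P_ID owns global rows [lo, hi)
lo = (N * P_ID) // N_P
hi = (N * (P_ID + 1)) // N_P

X_loc = np.zeros((hi - lo, N))   # only the locally owned rows are stored
Y_loc = X_loc + 1                # owner computes: each rank updates its own rows
```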

MIT Lincoln Laboratory Slide-11 Multicore Theory. Explicitly Local Program. Use the .loc notation to explicitly retrieve the local part of a distributed array. The operation is the same as in the serial program, but with different data on each processor (recommended approach). Y.loc = X.loc + 1, X,Y : P(N) x N.
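One way to picture the .loc notation (a hypothetical Python wrapper, not the pMatlab API): a distributed array keeps its global shape and map but exposes only the locally owned slab through .loc.

```python
import numpy as np

class DistributedArray:
    """Hypothetical sketch of an array mapped as P(N) x N."""
    def __init__(self, N, P_ID, N_P):
        self.shape = (N, N)                          # global shape
        self.lo = (N * P_ID) // N_P                  # first locally owned row
        self.hi = (N * (P_ID + 1)) // N_P            # one past the last owned row
        self.loc = np.zeros((self.hi - self.lo, N))  # local part only

# P_ID and N_P would come from the SPMD runtime; fixed here for illustration.
X = DistributedArray(N=8, P_ID=0, N_P=4)
Y = DistributedArray(N=8, P_ID=0, N_P=4)
Y.loc[:] = X.loc + 1          # same expression as the serial code, local data only
```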

MIT Lincoln Laboratory Slide-12 Multicore Theory. Parallel Data Maps. A map is a mapping of array indices to processors; it can be block, cyclic, block-cyclic, or block with overlap. Use the P() notation to set which dimension to split among processors: P(N) x N (split rows), N x P(N) (split columns), or P(N) x P(N) (split both).
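A sketch of the map concept (illustrative Python; the function and parameter names are assumptions): given a global index and the number of processors, each map style assigns an owning P_ID.

```python
def block_owner(i, N, N_P):
    """Block map: contiguous chunks of roughly N/N_P indices per processor."""
    return (i * N_P) // N

def cyclic_owner(i, N_P):
    """Cyclic map: indices dealt out round-robin."""
    return i % N_P

def block_cyclic_owner(i, b, N_P):
    """Block-cyclic map: blocks of size b dealt out round-robin."""
    return (i // b) % N_P

# Example: owners of 8 row indices on 4 processors
print([block_owner(i, 8, 4) for i in range(8)])         # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, 4) for i in range(8)])           # [0, 1, 2, 3, 0, 1, 2, 3]
print([block_cyclic_owner(i, 2, 4) for i in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```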

MIT Lincoln Laboratory Slide-13 Multicore Theory. Redistribution of Data. Different distributed arrays can have different maps. Assignment between arrays with the "=" operator causes the data to be redistributed, and the underlying library determines all the messages to send. Y = X + 1, with X : P(N) x N (row-distributed over P0-P3) and Y : N x P(N) (column-distributed over P0-P3); the diagram shows the data each processor must send.
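A sketch of what the library has to work out (illustrative Python with a hypothetical 4-processor block map): for an assignment from a row-distributed X to a column-distributed Y, every (sender, receiver) pair exchanges the intersection of the sender's rows with the receiver's columns.

```python
def block_range(p, N, N_P):
    """Rows or columns owned by processor p under a block map."""
    return (N * p) // N_P, (N * (p + 1)) // N_P

N, N_P = 8, 4
# X : P(N) x N  (sender p owns rows block_range(p)),
# Y : N x P(N)  (receiver q owns columns block_range(q)).
# The message from p to q is the sub-block rows(p) x cols(q).
for p in range(N_P):
    r0, r1 = block_range(p, N, N_P)
    for q in range(N_P):
        c0, c1 = block_range(q, N, N_P)
        print(f"P{p} -> P{q}: rows {r0}:{r1}, cols {c0}:{c1}")
```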

MIT Lincoln Laboratory Slide-14 Multicore Theory. Outline: Parallel Design; Distributed Arrays; Kuck Diagrams (Serial, Parallel, Hierarchical, Cell); Hierarchical Arrays; Tasks and Conduits; Summary.

MIT Lincoln Laboratory Slide-15 Multicore Theory. Single Processor Kuck Diagram (P0, M0, A : R^(N x N)). Processors are denoted by boxes and memories by ovals; lines connect associated processors and memories. The subscript denotes the level in the memory hierarchy.

MIT Lincoln Laboratory Slide-16 Multicore Theory. Parallel Kuck Diagram (Net_0.5 connecting the P0/M0 pairs, A : R^(N x P(N))). Replicates the serial processors. Net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5). The distributed array has a local piece in each memory.

MIT Lincoln Laboratory Slide-17 Multicore Theory. Hierarchical Kuck Diagram. The Kuck notation* provides a clear way of describing a hardware architecture along with its memory and communication hierarchy. [Diagram: a 2-level hierarchy in which groups of P0/M0 pairs share SM_1 via Net_0.5 and SMNet_1, and the groups are joined by Net_1.5, SM_2, and SMNet_2.] Legend: P = processor; Net = inter-processor network; M = memory; SM = shared memory; SMNet = shared memory network. The subscript indicates the hierarchy level; an x.5 subscript on Net indicates indirect memory access. *High Performance Computing: Challenges for Future Systems, David Kuck, 1996.

MIT Lincoln Laboratory Slide-18 Multicore Theory. Cell Example: Kuck diagram for the Sony/Toshiba/IBM Cell processor. The PPE and the SPEs each appear as a P0/M0 pair, connected by Net_0.5 and, through MNet_1, to the main memory M1. Parameters: P_PPE = PPE speed (GFLOPS); M_0,PPE = size of the PPE cache (bytes); P_PPE-M_0,PPE = PPE-to-cache bandwidth (GB/sec); P_SPE = SPE speed (GFLOPS); M_0,SPE = size of the SPE local store (bytes); P_SPE-M_0,SPE = SPE-to-local-store bandwidth (GB/sec); Net_0.5 = SPE-to-SPE bandwidth (a matrix encoding the topology, GB/sec); MNet_1 = PPE/SPE-to-main-memory bandwidth (GB/sec); M_1 = size of main memory (bytes).
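One way to carry such a machine model around in code (a hypothetical sketch; the field names mirror the slide's parameters, and no measured Cell values are assumed):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KuckMachineModel:
    """Hypothetical container for the Cell Kuck-diagram parameters."""
    p_ppe_gflops: float              # PPE speed
    m0_ppe_bytes: int                # PPE cache size
    ppe_cache_bw_gbs: float          # PPE-to-cache bandwidth
    p_spe_gflops: float              # SPE speed
    m0_spe_bytes: int                # SPE local store size
    spe_ls_bw_gbs: float             # SPE-to-local-store bandwidth
    net05_bw_gbs: List[List[float]]  # SPE-to-SPE bandwidth matrix (encodes topology)
    mnet1_bw_gbs: float              # PPE/SPE-to-main-memory bandwidth
    m1_bytes: int                    # main memory size
```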

MIT Lincoln Laboratory Slide-19 Multicore Theory. Outline: Parallel Design; Distributed Arrays; Kuck Diagrams; Hierarchical Arrays (Hierarchical Maps, Kuck Diagram, Explicitly Local Program); Tasks and Conduits; Summary.

MIT Lincoln Laboratory Slide-20 Multicore Theory. Hierarchical Arrays. Hierarchical arrays allow algorithms to conform to hierarchical multicore processors. Each top-level processor controls another set of lower-level processors, and the piece of the global array local to a top-level processor is further sub-divided into local arrays among its lower-level processors (indexed by P_ID).

MIT Lincoln Laboratory Slide-21 Multicore Theory. Hierarchical Array and Kuck Diagram (A : R^(N x P(P(N)))). The array is allocated across the SM_1 memories of the top-level processor groups; within each SM_1 the processing responsibility is divided among the local processors, which move their own portion (A.loc.loc) of the shared piece (A.loc) into their local M_0.

MIT Lincoln Laboratory Slide-22 Multicore Theory. Explicitly Local Hierarchical Program. Extend the .loc notation to explicitly retrieve the local part of a local distributed array: .loc.loc (assumes SPMD on the inner processor set). Subscripts on the two .loc's provide explicit access to a particular outer and inner piece (implicit otherwise). Y.loc_p.loc_p = X.loc_p.loc_p + 1, X,Y : P(P(N)) x N.
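A sketch of the two-level map P(P(N)) x N (illustrative Python; the node/core split and the fixed positions are assumptions): the rows are first block-distributed across nodes, each node's slab is block-distributed again across its local cores, and each core updates only its innermost piece.

```python
import numpy as np

def block_range(p, n, n_p):
    """Indices [lo, hi) owned by processor p when n items are block-split n_p ways."""
    return (n * p) // n_p, (n * (p + 1)) // n_p

N = 16
N_NODES, N_CORES = 2, 4       # assumed hierarchy: 2 nodes x 4 cores each
node, core = 1, 2             # this SPMD copy's position (would come from the runtime)

lo1, hi1 = block_range(node, N, N_NODES)          # X.loc : rows owned by this node
lo2, hi2 = block_range(core, hi1 - lo1, N_CORES)  # X.loc.loc : this core's share of them

X_loc_loc = np.zeros((hi2 - lo2, N))   # only the innermost piece lives in local memory
Y_loc_loc = X_loc_loc + 1              # the same serial expression at the innermost level
```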

MIT Lincoln Laboratory Slide-23 Multicore Theory. Block Hierarchical Arrays. Memory constraints are common at the lowest level of the hierarchy. Blocking at this level (e.g., block size b = 4, splitting the local array into in-core blocks and out-of-core data) allows control of the size of the data operated on by each processor.

MIT Lincoln Laboratory Slide-24 Multicore Theory. Block Hierarchical Program. P_b(4)() indicates that each sub-array should be broken up into blocks of size 4. .n_blk provides the number of blocks for looping over each block, which allows controlling the size of the data at the lowest level: for i = 0 to X.loc.loc.n_blk - 1, Y.loc.loc.blk_i = X.loc.loc.blk_i + 1, with X,Y : P(P_b(4)(N)) x N.
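A sketch of the blocked innermost loop (illustrative Python; the block machinery is hypothetical, not the pMatlab API): the core's local slab is processed b rows at a time so that only one block needs to be resident at the lowest memory level.

```python
import numpy as np

b = 4                                 # block size from P_b(4)
X_loc_loc = np.zeros((10, 16))        # this core's piece of X : P(P_b(4)(N)) x N
Y_loc_loc = np.empty_like(X_loc_loc)

n_blk = -(-X_loc_loc.shape[0] // b)   # number of blocks (ceiling division)
for i in range(n_blk):                # for i = 0 .. n_blk - 1
    blk = slice(i * b, min((i + 1) * b, X_loc_loc.shape[0]))
    Y_loc_loc[blk] = X_loc_loc[blk] + 1   # operate on one in-core block at a time
```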

MIT Lincoln Laboratory Slide-25 Multicore Theory. Outline: Parallel Design; Distributed Arrays; Kuck Diagrams; Hierarchical Arrays; Tasks and Conduits (Basic Pipeline, Replicated Tasks, Replicated Pipelines); Summary.

MIT Lincoln Laboratory Slide-26 Multicore Theory. Tasks and Conduits. [Diagram: Task1^S1() with A : R^(N x P(N)) on processors 0-3 and Task2^S2() with B : R^(N x P(N)) on processors 4-7, connected by the Topic12 conduit (A -> Topic12 -> B).] The S1 superscript runs the task on a set of processors, and distributed arrays are allocated relative to this scope. Publish/subscribe conduits move data between tasks.
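A minimal sketch of the publish/subscribe conduit idea (illustrative Python using threads and a queue as stand-ins for the processor sets and the real conduit mechanism):

```python
import queue
import threading

topic12 = queue.Queue()            # the conduit: Task1 publishes, Task2 subscribes

def task1(n_items):
    """Task1 scope: produces pieces of A and publishes them to Topic12."""
    for i in range(n_items):
        topic12.put(("A", i))      # stand-in for a distributed-array piece
    topic12.put(None)              # end-of-stream marker

def task2():
    """Task2 scope: subscribes to Topic12 and consumes the pieces into B."""
    while (item := topic12.get()) is not None:
        print("Task2 received", item)

t1 = threading.Thread(target=task1, args=(4,))
t2 = threading.Thread(target=task2)
t1.start(); t2.start()
t1.join(); t2.join()
```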

MIT Lincoln Laboratory Slide-27 Multicore Theory. Replicated Tasks. [Diagram: as on Slide 26, but Task2 is instantiated twice as Replica 0 and Replica 1.] The "2" subscript creates two task replicas; the conduit will round-robin data between them.
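A sketch of the round-robin behavior across replicas (illustrative Python; the per-replica queues are a stand-in for the real conduit mechanism):

```python
import itertools
import queue

replicas = [queue.Queue(), queue.Queue()]    # Replica 0 and Replica 1 of Task2
dispatch = itertools.cycle(replicas)         # the conduit round-robins between replicas

for i in range(6):                           # six pieces published by Task1
    next(dispatch).put(("A", i))

print("Replica 0 got", replicas[0].qsize(), "items")   # 3
print("Replica 1 got", replicas[1].qsize(), "items")   # 3
```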

MIT Lincoln Laboratory Slide-28 Multicore Theory. Replicated Pipelines. [Diagram: both Task1 and Task2 are instantiated as Replica 0 and Replica 1, connected by the Topic12 conduit.] The same "2" subscript on both tasks creates a replicated pipeline.

MIT Lincoln Laboratory Slide-29 Multicore Theory. Summary. Hierarchical heterogeneous multicore processors are difficult to program, and specifying how an algorithm is supposed to behave on such a processor is critical. The proposed notation provides mathematical constructs for concisely describing hierarchical heterogeneous multicore algorithms.