Slide 1: Theory of Multicore Algorithms
Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory, HPEC 2008
This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Slide 2: Outline
- Parallel Design
  - Programming Challenge
  - Example Issues
  - Theoretical Approach
  - Integrated Process
- Distributed Arrays
- Kuck Diagrams
- Hierarchical Arrays
- Tasks and Conduits
- Summary
Slide 3: Multicore Programming Challenge
- Great success of the Moore's Law era
  - Simple model: load, op, store
  - Many transistors devoted to delivering this model
- Moore's Law is ending
  - Transistors are now needed for performance
- Past programming model: Von Neumann; future programming model: ???
- Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
  - Processor topology includes registers, cache, local memory, remote memory, and disk
  - The Cell processor alone has multiple programming models
Slide 4: Example Issues
Y = X + 1, with X,Y : N x N
- A serial algorithm can run on a serial processor with relatively little specification
- A hierarchical heterogeneous multicore algorithm requires a lot more information:
  - Where is the data? How is it distributed?
  - Where is it running? Which binary should be run?
  - What is the initialization policy?
  - How does the data flow? What are the allowed message sizes?
  - Should computations and communications be overlapped?
Slide 5: Theoretical Approach
- Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified
[Diagram: Task1_S1() with A : R^(N x P(N)) on processors 0-3 feeding Task2_S2() with B : R^(N x P(N)) on processors 4-7 through a Topic12 conduit, with task replicas 0 and 1]
Slide 6: Integrated Development Process
1. Develop serial code (desktop): Y = X + 1, X,Y : N x N
2. Parallelize code (cluster): Y = X + 1, X,Y : P(N) x N
3. Deploy code (embedded computer): Y = X + 1, X,Y : P(P(N)) x N
4. Automatically parallelize code
- Should naturally support standard parallel embedded software development practices
Slide 7: Outline
- Parallel Design
- Distributed Arrays
  - Serial Program
  - Parallel Execution
  - Distributed Arrays
  - Redistribution
- Kuck Diagrams
- Hierarchical Arrays
- Tasks and Conduits
- Summary
Slide 8: Serial Program
Y = X + 1, with X,Y : N x N
- Math is the language of algorithms
- It allows mathematical expressions to be written concisely
- Multi-dimensional arrays are fundamental to mathematics
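As a concrete rendering of this serial notation, a minimal NumPy sketch (the size N = 8 is an arbitrary choice, not from the slides):

```python
import numpy as np

N = 8                      # arbitrary problem size for illustration
X = np.zeros((N, N))       # X,Y : N x N
Y = X + 1                  # Y = X + 1, element-wise over the whole array
```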
Slide 9: Parallel Execution
Y = X + 1, with X,Y : N x N, running on P_ID = 0, 1, ..., N_P - 1
- Run N_P copies of the same program: Single Program Multiple Data (SPMD)
- Each copy has a unique P_ID
- Every array is replicated on each copy of the program
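A minimal SPMD sketch of this slide, assuming MPI (via the mpi4py package) as the launch mechanism; the script name in the comment is hypothetical, and every rank holds a full replicated copy of X and Y:

```python
# Run with, e.g.: mpiexec -n 4 python spmd_replicated.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P_ID = comm.Get_rank()     # unique id of this copy of the program
N_P  = comm.Get_size()     # number of copies launched

N = 8
X = np.zeros((N, N))       # every rank holds the full N x N array (replicated)
Y = X + 1                  # every rank redundantly computes the same result
print(f"P_ID {P_ID} of {N_P}: Y[0,0] = {Y[0,0]}")
```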
Slide 10: Distributed Array Program
Y = X + 1, with X,Y : P(N) x N, running on P_ID = 0, 1, ..., N_P - 1
- Use the P() notation to make a distributed array
- P() tells the program which dimension to distribute the data over
- Each program implicitly operates on only its own data (owner-computes rule)
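A sketch of what P(N) x N could translate to under a simple block row map, again assuming mpi4py; each rank allocates and updates only its own slab of rows (owner computes), and the full N x N array never exists in any single memory:

```python
# Run with, e.g.: mpiexec -n 4 python distributed_rows.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P_ID, N_P = comm.Get_rank(), comm.Get_size()

N = 8                                            # global array is conceptually N x N
rows = np.array_split(np.arange(N), N_P)[P_ID]   # block map of the first dimension: P(N)

X_loc = np.zeros((len(rows), N))                 # this rank owns only its rows of X
Y_loc = X_loc + 1                                # owner computes: update only the local rows
print(f"P_ID {P_ID} owns global rows {rows.tolist()}")
```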
Slide 11: Explicitly Local Program
Y.loc = X.loc + 1, with X,Y : P(N) x N
- Use the .loc notation to explicitly retrieve the local part of a distributed array
- The operation is the same as the serial program, but with different data on each processor (recommended approach)
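The .loc accessor could be mimicked with a small wrapper class; the class name DistArray and its fields are invented here for illustration and are not part of the slides' notation:

```python
from mpi4py import MPI
import numpy as np

class DistArray:
    """Toy distributed array: first dimension block-distributed, P(N) x N."""
    def __init__(self, N, comm=MPI.COMM_WORLD):
        self.comm = comm
        blocks = np.array_split(np.arange(N), comm.Get_size())
        my_rows = blocks[comm.Get_rank()]
        self.loc = np.zeros((len(my_rows), N))   # only the local piece is stored

X = DistArray(8)
Y = DistArray(8)
Y.loc[:] = X.loc + 1        # Y.loc = X.loc + 1: serial-looking code on local data
```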
Slide 12: Parallel Data Maps
- A map is a mapping of array indices to processors
- Maps can be block, cyclic, block-cyclic, or block with overlap
- Use the P() notation to set which dimension to split among processors
[Figure: example maps P(N) x N, N x P(N), and P(N) x P(N), showing how the math array is laid out across computer P_IDs 0-3]
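The regular map families can be written as small index-to-processor functions; a sketch using the standard formulas (not taken verbatim from the slides):

```python
import numpy as np

def block_map(i, N, N_P):
    """Block map: contiguous chunks of ~N/N_P indices per processor."""
    b = -(-N // N_P)                 # ceil(N / N_P)
    return i // b

def cyclic_map(i, N_P):
    """Cyclic map: indices dealt out round-robin."""
    return i % N_P

def block_cyclic_map(i, blk, N_P):
    """Block-cyclic map: blocks of size blk dealt out round-robin."""
    return (i // blk) % N_P

i = np.arange(8)
print(block_map(i, 8, 4))            # [0 0 1 1 2 2 3 3]
print(cyclic_map(i, 4))              # [0 1 2 3 0 1 2 3]
print(block_cyclic_map(i, 2, 2))     # [0 0 1 1 0 0 1 1]
```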
Slide 13: Redistribution of Data
Y = X + 1, with X : P(N) x N and Y : N x P(N)
- Different distributed arrays can have different maps
- Assignment between arrays with the "=" operator causes data to be redistributed
- The underlying library determines all the messages to send
[Figure: X is distributed by rows across P0-P3 and Y by columns, so each processor sends a block of data to every other processor]
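One way a library might carry out the row-block to column-block redistribution, sketched with mpi4py's generic all-to-all; a real implementation would use packed buffers, so this only illustrates which pieces move where:

```python
# Run with, e.g.: mpiexec -n 4 python redistribute.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P_ID, N_P = comm.Get_rank(), comm.Get_size()
N = 8

rows = np.array_split(np.arange(N), N_P)[P_ID]      # X : P(N) x N (my row block)
cols = np.array_split(np.arange(N), N_P)            # Y : N x P(N) (column blocks, all ranks)

X_loc = np.full((len(rows), N), P_ID, dtype=float)  # local rows of X
Z_loc = X_loc + 1                                   # owner computes on X's map

# Cut my rows into one column slab per destination rank, exchange, then stack.
send = [Z_loc[:, cols[dest]] for dest in range(N_P)]
recv = comm.alltoall(send)                          # one piece from every source rank
Y_loc = np.vstack(recv)                             # my columns of Y, all N rows
print(f"P_ID {P_ID}: Y_loc shape {Y_loc.shape}")
```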
Slide 14: Outline
- Parallel Design
- Distributed Arrays
- Kuck Diagrams
  - Serial
  - Parallel
  - Hierarchical
  - Cell
- Hierarchical Arrays
- Tasks and Conduits
- Summary
Slide 15: Single Processor Kuck Diagram
[Diagram: processor P0 connected to memory M0 holding A : R^(N x N)]
- Processors are denoted by boxes
- Memories are denoted by ovals
- Lines connect associated processors and memories
- The subscript denotes the level in the memory hierarchy
Slide 16: Parallel Kuck Diagram
[Diagram: four P0/M0 pairs joined by Net0.5, holding the distributed array A : R^(N x P(N))]
- Replicates serial processors
- Net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5)
- The distributed array has a local piece on each memory
Slide 17: Hierarchical Kuck Diagram
- The Kuck notation* provides a clear way of describing a hardware architecture along with the memory and communication hierarchy
[Diagram: a 2-level hierarchy; two groups of P0/M0 pairs, each group joined by Net0.5 to a shared memory SM1 through SMNet1, and the two groups joined at the top by Net1.5 to SM2 through SMNet2]
- Legend: P = processor, Net = inter-processor network, M = memory, SM = shared memory, SMNet = shared memory network
- The subscript indicates the hierarchy level; an x.5 subscript on Net indicates indirect memory access
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
Slide 18: Cell Example
- Kuck diagram for the Sony/Toshiba/IBM Cell processor
[Diagram: the PPE and the SPEs, each a P0/M0 pair, joined by Net0.5 and connected to main memory M1 through MNet1]
- P_PPE = PPE speed (GFLOPS)
- M_0,PPE = size of PPE cache (bytes)
- P_PPE-M_0,PPE = PPE-to-cache bandwidth (GB/sec)
- P_SPE = SPE speed (GFLOPS)
- M_0,SPE = size of SPE local store (bytes)
- P_SPE-M_0,SPE = SPE-to-local-store bandwidth (GB/sec)
- Net_0.5 = SPE-to-SPE bandwidth (matrix encoding topology, GB/sec)
- MNet_1 = PPE/SPE-to-main-memory bandwidth (GB/sec)
- M_1 = size of main memory (bytes)
Slide 19: Outline
- Parallel Design
- Distributed Arrays
- Kuck Diagrams
- Hierarchical Arrays
  - Hierarchical Arrays
  - Hierarchical Maps
  - Kuck Diagram
  - Explicitly Local Program
- Tasks and Conduits
- Summary
Slide 20: Hierarchical Arrays
- Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
- Each processor in the outer set P controls another set of processors
- The array local to each outer processor is sub-divided among its local processors
[Figure: a global array split across P_ID 0-3; each local array is further split between local P_ID 0 and 1]
Slide 21: Hierarchical Array and Kuck Diagram
[Diagram: the 2-level Kuck diagram holding A : R^(N x P(P(N))); A.loc lives in each SM1 and A.loc.loc in each M0]
- The array is allocated across the SM1 memories of the outer processors
- Within each SM1, responsibility for processing is divided among the local processors
- The local processors move their portion to their local M0
Slide 22: Explicitly Local Hierarchical Program
Y.loc_p.loc_p = X.loc_p.loc_p + 1, with X,Y : P(P(N)) x N
- Extend the .loc notation to explicitly retrieve the local part of a local distributed array: .loc.loc (assumes SPMD on the inner P)
- The subscripts p provide explicit access to a particular piece (implicit otherwise)
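A sketch of what the two-level map P(P(N)) x N might look like with MPI, splitting the ranks into outer groups and then into inner ranks within each group; the group size of 2 is an arbitrary illustration:

```python
# Run with, e.g.: mpiexec -n 4 python hierarchical.py
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
inner_size = 2                                   # arbitrary: 2 inner ranks per outer group
outer_id, inner_id = divmod(world.Get_rank(), inner_size)
n_outer = world.Get_size() // inner_size

N = 8
outer_rows = np.array_split(np.arange(N), n_outer)[outer_id]    # rows owned by my group (X.loc)
inner_rows = np.array_split(outer_rows, inner_size)[inner_id]   # rows owned by me (X.loc.loc)

X_loc_loc = np.zeros((len(inner_rows), N))
Y_loc_loc = X_loc_loc + 1                        # Y.loc.loc = X.loc.loc + 1
print(f"outer {outer_id}, inner {inner_id}: global rows {inner_rows.tolist()}")
```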
Slide 23: Block Hierarchical Arrays
- Memory constraints are common at the lowest level of the hierarchy
- Blocking at this level allows control of the size of the data operated on by each processor
[Figure: the global array split across P_ID 0-3 and then between local P_ID 0 and 1; each local piece is further divided into core blocks (b = 4), only one of which is in-core at a time while the rest stay out-of-core]
Slide 24: Block Hierarchical Program
for i = 0 to X.loc.loc.n_blk - 1
  Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
with X,Y : P(P_b(4)(N)) x N
- P_b(4) indicates that each sub-array should be broken up into blocks of size 4
- .n_blk provides the number of blocks, for looping over each block; this allows controlling the size of the data at the lowest level
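The b(4) blocking could look like the loop below; the local array shape is arbitrary, and the point is that only one block of rows is touched (and therefore needs to be in-core) at a time:

```python
import numpy as np

blk_size = 4                                  # P_b(4): blocks of 4 rows at the lowest level
X_loc_loc = np.zeros((16, 8))                 # the piece owned by one lowest-level processor
Y_loc_loc = np.empty_like(X_loc_loc)

n_blk = -(-X_loc_loc.shape[0] // blk_size)    # ceil: X.loc.loc.n_blk
for i in range(n_blk):                        # for i = 0 .. n_blk - 1
    rows = slice(i * blk_size, (i + 1) * blk_size)
    # only this block needs to be in-core (e.g., staged into an SPE local store)
    Y_loc_loc[rows] = X_loc_loc[rows] + 1     # Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
```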
Slide 25: Outline
- Parallel Design
- Distributed Arrays
- Kuck Diagrams
- Hierarchical Arrays
- Tasks and Conduits
  - Basic Pipeline
  - Replicated Tasks
  - Replicated Pipelines
- Summary
Slide 26: Tasks and Conduits
[Diagram: Task1_S1() with A : R^(N x P(N)) on processors 0-3 feeding Task2_S2() with B : R^(N x P(N)) on processors 4-7 through a Topic12 conduit]
- The S1 superscript runs the task on a set of processors; distributed arrays are allocated relative to this scope
- Pub/sub conduits move data between tasks
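A toy rendering of a pub/sub conduit between two tasks, using Python threads and an in-process queue in place of a real messaging layer; the task functions are illustrative only (the topic name Topic12 comes from the slide):

```python
import queue
import threading
import numpy as np

conduit = {"Topic12": queue.Queue()}          # toy pub/sub layer: one queue per topic

def task1_publisher(n_msgs=4, N=8):
    """Task1: produce A and publish it on Topic12."""
    for _ in range(n_msgs):
        A = np.zeros((N, N)) + 1
        conduit["Topic12"].put(A)             # publish
    conduit["Topic12"].put(None)              # end-of-stream marker

def task2_subscriber():
    """Task2: subscribe to Topic12 and consume B."""
    while (B := conduit["Topic12"].get()) is not None:
        print("Task2 received array with sum", B.sum())

t1 = threading.Thread(target=task1_publisher)
t2 = threading.Thread(target=task2_subscriber)
t1.start(); t2.start(); t1.join(); t2.join()
```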
Slide 27: Replicated Tasks
[Diagram: Task1 on processors 0-3 feeding two replicas of Task2 (Replica 0 and Replica 1) on processors 4-7 through the Topic12 conduit]
- A subscript of 2 creates task replicas; the conduit will round-robin between them
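The round-robin behavior of a conduit feeding replicated tasks could be sketched as a minimal dispatcher; this is hypothetical, not the actual conduit implementation:

```python
import itertools

class RoundRobinConduit:
    """Toy conduit: deals published messages to task replicas in turn."""
    def __init__(self, replicas):
        self.replicas = replicas
        self._next = itertools.cycle(range(len(replicas)))

    def publish(self, msg):
        self.replicas[next(self._next)].append(msg)   # round-robin delivery

replica0, replica1 = [], []                  # stand-ins for two Task2 replicas
conduit = RoundRobinConduit([replica0, replica1])
for msg in range(6):
    conduit.publish(msg)
print(replica0, replica1)                    # [0, 2, 4] [1, 3, 5]
```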
Slide 28: Replicated Pipelines
[Diagram: both Task1 and Task2 are replicated (Replica 0 and Replica 1 of each), forming two parallel pipelines connected by the Topic12 conduit]
- The same subscript of 2 on both tasks creates a replicated pipeline
Slide 29: Summary
- Hierarchical heterogeneous multicore processors are difficult to program
- Specifying how an algorithm is supposed to behave on such a processor is critical
- The proposed notation provides mathematical constructs for concisely describing hierarchical heterogeneous multicore algorithms