Download presentation
Presentation is loading. Please wait.
Published byCecilia Gardner Modified over 9 years ago
1
The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant CCF-0937832 Joshua Slocum Xiaoxuan Meng Brian Lucas
2
Problem: I/O Performance Culprits: Large Units of Data Transfer Operating System Overhead/Noise Managed by hardware Managed by software (OS) $ Main Memory L3 $ P Disk Other File Memory Few Concurrent Transfers (OS Limits) 2 $P $P
3
Solution: Integrated Memory Hierarchy Features: Many Concurrent Transactions – High Bandwidth Global Virtual Store Managed by hardware $ Main Memory L3 $ P Disk Other File Memory Superior Basis for Security / Privacy 3 $P $P
4
The Fresh Breeze Memory Model Fan-out as large as 16 Data Chunks e.g. 128 Bytes Master Chunk Write-Once then Read Only Cycle-Free HeapArrays as Trees of Chunks 4 Arrays: Three levels yields 4096 elements (longs)
5
Fresh Breeze System Vision Many Core Processing Chips Switch Main Memory: Associative Directories and DRAM Chips Backing Store: Access Controllers and Flash Devices ADDRAM Switch ADDRAMADDRAMACFlashACFlashACFlashACFlashACFlash L2 Cache P L1 P P P L2 Cache P L1 P P P L2 Cache P L1 P P P
6
Unique Features of the Processor Cache Lines are 128-byte Chunks. Registers are tagged to indicate those holding handles of chunks. Hardware task scheduler: Active Task List and Pending Task Queue. 6
7
Spawn and Join 7 Master Task spawn (n) Join_fetch Join Worker 0Worker n-1 join_update Spawned Worker Tasks Continuation Task
8
The Dot Product A B * Sum AB 5 levels: Vector length = 16 5 = 1,048,576 * + scalar result * * Each Leaf Task: Dot Product of two 16-element vectors: 16 multiplies; 15 adds
9
Linear Algebra: Three Algorithms Dot Product Matrix Multiply Fast Fourier Transform 9 Let’s consider the special characteristics of each.
10
Dot Product 16 10 Segment A 15 Multiplies Adds 31Operations No data reuse No intermediate data No chunks written Large volume of input data Leaf Task: Dot Product of 16-element segments A and B Segment B + *
11
Matrix Multiply 64 11 48 112 Multiplies Adds Operations Each input chunk used many times Result chunks written to memory No chunks written Relatively small input data Leaf Task: Product of two 4-by-4 matrices 16 dot products of four-element vectors + *
12
Fast Fourier Transform 4 Eight Data Samples Four Twiddle Factors 6 Multiplies Adds 10Operations Log 2 (n) stages Intermediate data Chunks written and read Leaf Task: Group of Four Butterfly Computations BFLY Eight Results BFLY 16 40 One ButterflyFour Butterflies 24
13
Simulated System Model L1 P Memory Switch Processors – 16, 24, 32, 40 Up to 64 Independent Storage Units L1 P
14
Simulation Modes Non-Blocking: A task executing a read simply waits for operation to complete. Models an L2 cache with short access time. Models a main memory with long access time. Blocking: A task executing a read suspends to permit other tasks to run.
15
Estimated Performance L2 Cache Model - NonBlocking Dot Product A: 16 3 B: 16 4 C: 16 5 Matrix Multiply A: 16 x 16 B: 32 x 32 C: 48 x 48 Fourier Transform A: 1024 B: 2048 C: 4096
16
Estimated Performance L2 Cache Model - NonBlocking Dot Product A: 16 3 B: 16 4 C: 16 5 Matrix Multiply A: 16 x 16 B: 32 x 32 C: 48 x 48 Fourier Transfor A: 1024 B: 2048 C: 4096
17
Findings Data locality of chunks works well. L1 Cache is not very important; only a buffer for input and result chunks of tasks. Task switching for chunk reads is costly; percolation would make a big difference. 17 Percolation: Ensuring that input chunks of a task have been retrieved before the task is scheduled.
18
Further Work Model a Three-Level or Four-Level Storage System Develop Hierarchical Work Stealing to improve Load Distribution for Massively Parallel Systems. Compiler Development: Automatic Mapping of Objects to Trees of Chunks. Expand Test Programs to include Transaction Processing and Database Operations. 18
19
Distributed Discrete-Event Simulation Accurate timing for interconnected communicating components. Packet Communication Architecture is a model for distributed systems especially amenable to efficient distributed simulation. UDel and MIT have begun a project to develop a PCA-based Simulation Sandbox for evaluating alternate PXMs for massively parallel computing. 19
20
Relevance to FSIO The Fresh Breeze Memory Model extends to 2 64 chunks, about 10 21 chunks or 100,000 exabytes. The tree-structure is an excellent basis for advanced security and data object sharing. Can simplify check-point / restart. Seamless transition from “in-core” to “out-of-core” operation of arbitrary parallel programs. Provides ability to use any program as a component of new programs – including any parallel program. 20
21
Conclusion Making the best of many-core technology requires study of new program execution models (PXMs). The FSIO challenge can be met by integrating the file system into the system memory hierarchy as a global virtual store. The Fresh Breeze PXM demonstrates some of the benefits from departing from conventional system organization. More exploration of new PXMs is needed! 21
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.