The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant.

Slides:



Advertisements
Similar presentations
Computer Architecture
Advertisements

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Parallell Processing Systems1 Chapter 4 Vector Processors.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
Threads, SMP, and Microkernels Chapter 4. Process Resource ownership - process is allocated a virtual address space to hold the process image Scheduling/execution-
Chapter 1 Computer System Overview Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design Principles,
11/13/01CS-550 Presentation - Overview of Microsoft disk operating system. 1 An Overview of Microsoft Disk Operating System.
1 School of Computing Science Simon Fraser University CMPT 300: Operating Systems I Dr. Mohamed Hefeeda.
Memory Design Example. Selecting Memory Chip Selecting SRAM Memory Chip.
Operating System Support Focus on Architecture
OS Fall ’ 02 Introduction Operating Systems Fall 2002.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
PEMWS – April 5, 2011 Program Execution Models What we can Learn from the Past Jack Dennis MIT Computer Science and Artificial Intelligence Laboratory.
Memory Management CSCI 3753 Operating Systems Spring 2005 Prof. Rick Han.
Memory Management 2010.
Computer System Overview
Chapter 1: Introduction
Figure 1.1 Interaction between applications and the operating system.
1 Last Class: Introduction Operating system = interface between user & architecture Importance of OS OS history: Change is only constant User-level Applications.
Operating Systems CS208. What is Operating System? It is a program. It is the first piece of software to run after the system boots. It coordinates the.
1 OS & Computer Architecture Modern OS Functionality (brief review) Architecture Basics Hardware Support for OS Features.
Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.
Chapter 51 Threads Chapter 5. 2 Process Characteristics  Concept of Process has two facets.  A Process is: A Unit of resource ownership:  a virtual.
© 2011 IBM Corporation 11 April 2011 IDS Architecture.
Where do We Go from Here? Thoughts on Computer Architecture TIP For additional advice see Dale Canegie Training® Presentation Guidelines Jack Dennis MIT.
Advances in Language Design
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 1: Introduction.
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
Operating System. Architecture of Computer System Hardware Operating System (OS) Programming Language (e.g. PASCAL) Application Programs (e.g. WORD, EXCEL)
Chapter 1. Introduction What is an Operating System? Mainframe Systems
Simulation of the Fresh Breeze Storage System Xiaoxuan MengJack Dennis MIT CSAILUniversity of Delaware Funded by NSF HECURA Grant CCF
Contact Information Office: 225 Neville Hall Office Hours: Monday and Wednesday 12:00-1:00 and by appointment.
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Recall: Three I/O Methods Synchronous: Wait for I/O operation to complete. Asynchronous: Post I/O request and switch to other work. DMA (Direct Memory.
Operating Systems and Networks AE4B33OSS Introduction.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
INVITATION TO COMPUTER SCIENCE, JAVA VERSION, THIRD EDITION Chapter 6: An Introduction to System Software and Virtual Machines.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
IT253: Computer Organization
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,
Computers Operating System Essentials. Operating Systems PROGRAM HARDWARE OPERATING SYSTEM.
1 Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: –illusion of having more physical memory –program relocation.
Boot Sequence, Internal Component & Cisco 3 Layer Model 1.
System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.
Lab 2 Parallel processing using NIOS II processors
Introduction to z/OS Basics © 2006 IBM Corporation Chapter 7: Batch processing and the Job Entry Subsystem (JES) Batch processing and JES.
UNIX Unit 1- Architecture of Unix - By Pratima.
1 Lecture 1: Computer System Structures We go over the aspects of computer architecture relevant to OS design  overview  input and output (I/O) organization.
Lecture on Central Process Unit (CPU)
A N I N - MEMORY F RAMEWORK FOR E XTENDED M AP R EDUCE 2011 Third IEEE International Conference on Coud Computing Technology and Science.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
بسم الله الرحمن الرحيم MEMORY AND I/O.
Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.
Hello world !!! ASCII representation of hello.c.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Vector computers.
Operating System (Reference : OS[Silberschatz] + Norton 6e book slides)
Operating Systems A Biswas, Dept. of Information Technology.
4.4.1 The Operating System.
Applied Operating System Concepts
Chapter 2 Memory and process management
Operating System.
For Massively Parallel Computation The Chaotic State of the Art
Real-time Software Design
Chapter 15, Exploring the Digital Domain
Introduction to Operating Systems
Introduction to Operating Systems
Hardware Organization
Contact Information Office: 225 Neville Hall Office Hours: Monday and Wednesday 12:00-1:00 and by appointment. Phone:
Presentation transcript:

The Fresh Breeze Memory Model Status: Linear Algebra and Plans Guang R. Gao Jack Dennis MIT CSAIL University of Delaware Funded in part by NSF HECURA Grant CCF Joshua Slocum Xiaoxuan Meng Brian Lucas

Problem: I/O Performance Culprits: Large Units of Data Transfer Operating System Overhead/Noise Managed by hardware Managed by software (OS) $ Main Memory L3 $ P Disk Other File Memory Few Concurrent Transfers (OS Limits) 2 $P $P

Solution: Integrated Memory Hierarchy Features: Many Concurrent Transactions – High Bandwidth Global Virtual Store Managed by hardware $ Main Memory L3 $ P Disk Other File Memory Superior Basis for Security / Privacy 3 $P $P

The Fresh Breeze Memory Model Fan-out as large as 16 Data Chunks e.g. 128 Bytes Master Chunk Write-Once then Read Only Cycle-Free HeapArrays as Trees of Chunks 4 Arrays: Three levels yields 4096 elements (longs)

Fresh Breeze System Vision Many Core Processing Chips Switch Main Memory: Associative Directories and DRAM Chips Backing Store: Access Controllers and Flash Devices ADDRAM Switch ADDRAMADDRAMACFlashACFlashACFlashACFlashACFlash L2 Cache P L1 P P P L2 Cache P L1 P P P L2 Cache P L1 P P P

Unique Features of the Processor Cache Lines are 128-byte Chunks. Registers are tagged to indicate those holding handles of chunks. Hardware task scheduler: Active Task List and Pending Task Queue. 6

Spawn and Join 7 Master Task spawn (n) Join_fetch Join Worker 0Worker n-1 join_update Spawned Worker Tasks Continuation Task

The Dot Product A B * Sum AB 5 levels: Vector length = 16 5 = 1,048,576 * + scalar result * * Each Leaf Task: Dot Product of two 16-element vectors: 16 multiplies; 15 adds

Linear Algebra: Three Algorithms Dot Product Matrix Multiply Fast Fourier Transform 9 Let’s consider the special characteristics of each.

Dot Product Segment A 15 Multiplies Adds 31Operations No data reuse No intermediate data No chunks written Large volume of input data Leaf Task: Dot Product of 16-element segments A and B Segment B + *

Matrix Multiply Multiplies Adds Operations Each input chunk used many times Result chunks written to memory No chunks written Relatively small input data Leaf Task: Product of two 4-by-4 matrices 16 dot products of four-element vectors + *

Fast Fourier Transform 4 Eight Data Samples Four Twiddle Factors 6 Multiplies Adds 10Operations Log 2 (n) stages Intermediate data Chunks written and read Leaf Task: Group of Four Butterfly Computations BFLY Eight Results BFLY One ButterflyFour Butterflies 24

Simulated System Model L1 P Memory Switch Processors – 16, 24, 32, 40 Up to 64 Independent Storage Units L1 P

Simulation Modes  Non-Blocking: A task executing a read simply waits for operation to complete. Models an L2 cache with short access time. Models a main memory with long access time. Blocking: A task executing a read suspends to permit other tasks to run. 

Estimated Performance L2 Cache Model - NonBlocking Dot Product A: 16 3 B: 16 4 C: 16 5 Matrix Multiply A: 16 x 16 B: 32 x 32 C: 48 x 48 Fourier Transform A: 1024 B: 2048 C: 4096

Estimated Performance L2 Cache Model - NonBlocking Dot Product A: 16 3 B: 16 4 C: 16 5 Matrix Multiply A: 16 x 16 B: 32 x 32 C: 48 x 48 Fourier Transfor A: 1024 B: 2048 C: 4096

Findings Data locality of chunks works well. L1 Cache is not very important; only a buffer for input and result chunks of tasks. Task switching for chunk reads is costly; percolation would make a big difference. 17 Percolation: Ensuring that input chunks of a task have been retrieved before the task is scheduled.

Further Work Model a Three-Level or Four-Level Storage System Develop Hierarchical Work Stealing to improve Load Distribution for Massively Parallel Systems. Compiler Development: Automatic Mapping of Objects to Trees of Chunks. Expand Test Programs to include Transaction Processing and Database Operations. 18

Distributed Discrete-Event Simulation Accurate timing for interconnected communicating components. Packet Communication Architecture is a model for distributed systems especially amenable to efficient distributed simulation. UDel and MIT have begun a project to develop a PCA-based Simulation Sandbox for evaluating alternate PXMs for massively parallel computing. 19

Relevance to FSIO The Fresh Breeze Memory Model extends to 2 64 chunks, about chunks or 100,000 exabytes. The tree-structure is an excellent basis for advanced security and data object sharing. Can simplify check-point / restart. Seamless transition from “in-core” to “out-of-core” operation of arbitrary parallel programs. Provides ability to use any program as a component of new programs – including any parallel program. 20

Conclusion Making the best of many-core technology requires study of new program execution models (PXMs). The FSIO challenge can be met by integrating the file system into the system memory hierarchy as a global virtual store. The Fresh Breeze PXM demonstrates some of the benefits from departing from conventional system organization. More exploration of new PXMs is needed! 21