Using Multiphase Shared Arrays

Jayant DeSouza, Parallel Programming Lab, University of Illinois, Urbana

Motivation
Charm++ has no shared-memory facility.
General shared memory is expensive: page faults, fetching of data, and cache coherence.
Observation: data is accessed differently in different phases of the program.
A restricted form of shared memory could provide the shared-memory paradigm together with good performance.
3/11/2004 http://charm.cs.uiuc.edu

Phases
Read-only: page fetch, no coherence
Write-only: no page fetch, no coherence
Accumulate: no page fetch, no coherence
Other operations: Prefetch, Waitall, getKnownLocal, setKnownLocal, release

API
Create: arr1 = new MSA2D<double, ITEMS_PER_PAGE, ROW_MAJOR>(ROWS1, COLS1, NUM_WORKERS, bytes);
        new MSA2D<…>(rows, cols, arr1->getCacheGroup());
Init: arr1->enroll(numWorkers);
Access: arr1->get(i, j); arr1->set(i, j) = 1.0; arr1->accumulate(i, j, value);
Sync: arr1->sync();
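The phase rules and the get/set/accumulate/sync calls listed above can be mimicked in a few lines of ordinary C++. What follows is a minimal single-process sketch, not the MSA library: `ToyMSA2D`, the `Phase` enum, `run_demo`, and the explicit `sync(Phase)` argument are hypothetical names introduced here for illustration (the slide's `sync()` takes no argument), and there is no paging, caching, or communication. It only shows how restricting each phase to one access mode lets accumulate contributions be buffered and merged at the sync point.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy model of the three phases; not part of Charm++ or MSA.
enum class Phase { ReadOnly, WriteOnly, Accumulate };

class ToyMSA2D {
public:
    // The toy starts in the write-only phase so the array can be filled first.
    ToyMSA2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols),
          data_(rows * cols, 0.0), pending_(rows * cols, 0.0),
          phase_(Phase::WriteOnly) {}

    // Reads are legal only in the read-only phase.
    double get(std::size_t i, std::size_t j) const {
        require(Phase::ReadOnly, "get");
        return data_[index(i, j)];
    }

    // Returns a reference so that set(i, j) = v works, as on the slide.
    double& set(std::size_t i, std::size_t j) {
        require(Phase::WriteOnly, "set");
        return data_[index(i, j)];
    }

    // Contributions are buffered; they become visible only after sync().
    void accumulate(std::size_t i, std::size_t j, double value) {
        require(Phase::Accumulate, "accumulate");
        pending_[index(i, j)] += value;
    }

    // Ends the current phase: adds buffered contributions to the stored
    // values, then switches to the phase named by the caller (toy assumption).
    void sync(Phase next) {
        for (std::size_t k = 0; k < data_.size(); ++k) {
            data_[k] += pending_[k];
            pending_[k] = 0.0;
        }
        phase_ = next;
    }

private:
    std::size_t index(std::size_t i, std::size_t j) const {
        if (i >= rows_ || j >= cols_) throw std::out_of_range("index out of range");
        return i * cols_ + j;  // row-major layout
    }

    void require(Phase expected, const char* op) const {
        if (phase_ != expected)
            throw std::logic_error(std::string(op) + " is not allowed in the current phase");
    }

    std::size_t rows_, cols_;
    std::vector<double> data_, pending_;
    Phase phase_;
};

// Walks through write-only, accumulate, and read-only phases in order.
double run_demo() {
    ToyMSA2D a(2, 2);
    a.set(0, 0) = 1.0;              // write-only phase
    a.sync(Phase::Accumulate);
    a.accumulate(0, 0, 2.0);        // accumulate phase
    a.accumulate(0, 0, 3.0);
    a.sync(Phase::ReadOnly);
    assert(a.get(1, 1) == 0.0);     // untouched element keeps its initial value
    return a.get(0, 0);             // 1.0 + 2.0 + 3.0
}
```

Calling `get` before the first `sync(Phase::ReadOnly)` throws `std::logic_error` in this toy. That is one possible way to surface a phase violation, chosen for the sketch only.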

Usage Details
Build: "build LIBS", or: cd charm/src/libs/ck-libs/multiphaseSharedArrays; make
Include: #include "msa/msa.h"
Compile with: charmc -module msa
Documentation: charm/doc/libraries/
Examples: charm/pgms/charm++/multiphaseSharedArrays

Performance Issues
Every access does a page-table lookup. Related calls: getKnownLocal, setKnownLocal.
Size of the page table: 1000 CPUs with 10 MB/CPU and 1 KB/page implies 10M pages and therefore 10M pointers, i.e. a 40 MB page table per CPU (assuming 4-byte pointers).
We are investigating a 2-level table.
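The page-table estimate above can be checked with a short calculation. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes) and 4-byte pointers, which reproduces the slide's round numbers. The two-level formula is one common layout (a top-level table of pointers to second-level blocks, with blocks allocated only for regions a CPU touches), not necessarily the design the authors were investigating. All function names, the block size of 1000 entries, and the count of 10 touched blocks are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of pages in the whole shared array: total bytes / page size.
constexpr std::uint64_t total_pages(std::uint64_t cpus,
                                    std::uint64_t bytes_per_cpu,
                                    std::uint64_t page_bytes) {
    return cpus * bytes_per_cpu / page_bytes;
}

// Flat table: one pointer per page, held in full on every CPU.
constexpr std::uint64_t flat_table_bytes(std::uint64_t pages,
                                         std::uint64_t pointer_bytes) {
    return pages * pointer_bytes;
}

// Hypothetical two-level table: a top level with one pointer per block of
// `entries_per_block` pages, plus second-level blocks only for touched regions.
constexpr std::uint64_t two_level_table_bytes(std::uint64_t pages,
                                              std::uint64_t entries_per_block,
                                              std::uint64_t touched_blocks,
                                              std::uint64_t pointer_bytes) {
    return ((pages + entries_per_block - 1) / entries_per_block) * pointer_bytes
         + touched_blocks * entries_per_block * pointer_bytes;
}

// Reproduces the slide's example with the assumptions stated above.
void check_slide_example() {
    const std::uint64_t pages = total_pages(1000ULL, 10000000ULL, 1000ULL);
    assert(pages == 10000000ULL);                         // 10M pages
    assert(flat_table_bytes(pages, 4ULL) == 40000000ULL); // 40 MB per CPU
}
```

Under these assumptions, a CPU that touches only 10 blocks of 1000 pages each would need `two_level_table_bytes(10000000, 1000, 10, 4)` = 80,000 bytes, compared with 40,000,000 bytes for the flat table. The saving shrinks as a CPU touches more of the array.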

Comparison with Global Arrays
Array data is distributed blockwise across processes. Each block is local to one process.
Home can change.
No caching, no fetch.
GA has (remote) operations: get, put, float accumulate, int read-increment.
No phases.

Conclusion and Future Work
Multiphase shared arrays have been designed and implemented.
Performance studies are needed.
Add support for MSA in Jade.
Extend accumulate to set-theoretic union.
Choose a nice name: MiSa?
http://charm.cs.uiuc.edu/research/msa/