Shoal: smart allocation and replication of memory for parallel programs
Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris (USENIX ATC '15)
Presented by Hyojae Cho, March 31st, 2016

CONTENTS: 1. Introduction, 2. Motivation, 3. Array, 4. Implementation, 5. Evaluation, 6. Conclusion

1. Introduction
Memory allocation in NUMA (Non-Uniform Memory Access) multi-core machines: memory is split across nodes, and access latency depends on which node holds the data relative to the accessing core.

1. Introduction
Existing methods:
- Manual configuration by programmers: developers struggle to apply these techniques, and must repeatedly make manual changes for each new machine.
- Automatic online monitoring to decide how to migrate data: can be expensive, and covers only a small number of optimizations.

2. Motivation
"memset()" considered harmful on multi-core: under first-touch page placement, a single-threaded memset after allocation pulls the entire array onto one NUMA node (see the sketch below).
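
A minimal sketch of the first-touch problem, assuming Linux's default page-placement policy; the code is illustrative and not taken from the paper:

    #include <cstdlib>
    #include <cstring>

    int main() {
        const size_t n = 1UL << 30;  // 1 GiB of data
        char *a = static_cast<char *>(malloc(n));

        // First-touch: pages are physically allocated on the NUMA node
        // of the thread that first writes them. This single-threaded
        // memset therefore places the whole array on one node, and
        // threads on all other nodes later pay remote-access latency
        // for every element.
        memset(a, 0, n);

        // ... parallel phase accessing 'a' from many cores ...
        free(a);
        return 0;
    }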

2. Motivation
Shoal: a system that abstracts memory access and provides a rich programming interface. It automatically tunes data placement and access based on memory access patterns; programmers need not know where the data is stored.

2. Motivation
Shoal provides:
- A new interface for memory allocation, including a machine-aware "malloc" call (sketched below).
- An abstraction for data access based on arrays; all implementations can be interchanged transparently, without changing the program.
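
A minimal sketch of what a machine-aware allocation call could look like; shl__malloc, the hint flags, and the fallback stub are assumptions for illustration, not Shoal's exact API:

    #include <cstdlib>

    // Illustrative access hints: the caller describes how the data will
    // be used, and the runtime chooses placement. Names are assumptions.
    enum AccessHints {
        HINT_READ_ONLY  = 1 << 0,  // never written after initialization
        HINT_INDEXED    = 1 << 1,  // accessed via a dense index
        HINT_SEQUENTIAL = 1 << 2,  // scanned in order
    };

    // Stub so the sketch compiles; a real runtime would consult the
    // machine topology and the hints to pick single-node allocation,
    // distribution, replication, or partitioning.
    void *shl__malloc(size_t size, int hints) {
        (void)hints;
        return malloc(size);
    }

    int main() {
        // Read-only, index-accessed data: a replication-friendly pattern.
        int *in_degree = static_cast<int *>(
            shl__malloc(1000 * sizeof(int), HINT_READ_ONLY | HINT_INDEXED));
        free(in_degree);
        return 0;
    }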

3. Array
Array types:
- Single-node allocation: allocates the entire array on the local node.
- Distribution: data is split equally across NUMA nodes.
- Replication: several copies of the array are allocated.
- Partitioning: data is allocated where the work units accessing it execute, keeping accesses local.

3. Array
Selection of arrays. Goals:
- Maximize local access to minimize interconnect traffic.
- Load-balance memory across all available controllers.
Rules (see the decision sketch below):
- Partitioning: if the array is only accessed via an index.
- Replication: if the array is read-only and fits into every NUMA node.
- Otherwise: use a uniform distribution across nodes.
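
Read as a decision procedure, the rules above might look like this; the struct and its field names are illustrative, not the paper's code:

    #include <cstddef>

    enum class ArrayType { SingleNode, Distributed, Replicated, Partitioned };

    struct AccessPattern {      // as extracted by the high-level compiler
        bool indexed_only;      // accessed exclusively via an index
        bool read_only;         // never written in the parallel phase
        size_t size_bytes;      // total array size
    };

    // Mirrors the slide's selection rules.
    ArrayType select_array(const AccessPattern &p, size_t node_mem_bytes) {
        if (p.indexed_only)
            return ArrayType::Partitioned;  // keep each work unit's data local
        if (p.read_only && p.size_bytes <= node_mem_bytes)
            return ArrayType::Replicated;   // one copy per NUMA node
        return ArrayType::Distributed;      // uniform spread over all nodes
    }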

3. Array
Selection of arrays (decision flowchart shown on slide).

4. Implementation
The Shoal runtime library consists of:
- A high-level array representation based on C++ templates (sketched below).
- A low-level, OS-specific backend.
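
A minimal sketch, assuming a virtual get/set interface, of how a C++ template array abstraction can make implementations interchangeable; class and method names are illustrative:

    #include <cstddef>
    #include <vector>

    // Programs touch arrays only through get()/set(), so the runtime can
    // substitute single-node, distributed, replicated, or partitioned
    // implementations behind the same interface.
    template <typename T>
    class shl_array {
    public:
        explicit shl_array(size_t n) : n_(n) {}
        virtual ~shl_array() = default;
        virtual T get(size_t i) const = 0;
        virtual void set(size_t i, T v) = 0;
        size_t size() const { return n_; }
    protected:
        size_t n_;
    };

    // Simplest backend: one contiguous allocation on the local node.
    template <typename T>
    class shl_array_single : public shl_array<T> {
    public:
        explicit shl_array_single(size_t n) : shl_array<T>(n), data_(n) {}
        T get(size_t i) const override { return data_[i]; }
        void set(size_t i, T v) override { data_[i] = v; }
    private:
        std::vector<T> data_;
    };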

4. Implementation
An example of a high-level DSL (Domain-Specific Language):
- Foreach (t: G.Nodes) means the nodes array will be accessed sequentially and via an index.
- Sum(w: t.InNbrs) implies read-only, indexed accesses on the in-neighbors array.

4. Implementation
High-level compiler pipeline:
- High-level program: written in a high-level parallel language such as Green-Marl or OptiML.
- High-level compiler: translates the high-level code to low-level code.
- Low-level code with array abstractions: written in C++, using Shoal's abstractions to allocate and access memory. The concrete choice of array implementation is not made at compile time (a sketch of such generated code follows).
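
To make the pipeline concrete, a hedged sketch of what compiler-emitted low-level code for the Foreach/Sum example might look like, reusing the shl_array interface sketched above; all names and the CSR layout are assumptions:

    // Hypothetical output for: Foreach (t: G.Nodes) { Sum(w: t.InNbrs) ... }
    // Arrays are touched only via get()/set(), so the concrete array
    // implementation remains free to be chosen at allocation time.
    void sum_in_neighbors(shl_array<double> &val_in,
                          shl_array<double> &val_out,
                          shl_array<int> &nbr_begin,  // CSR index, n+1 entries
                          shl_array<int> &nbr_ids) {
        const size_t n = val_out.size();
        #pragma omp parallel for                      // parallel Foreach
        for (size_t t = 0; t < n; t++) {
            double sum = 0.0;                         // Sum over in-neighbors
            for (int e = nbr_begin.get(t); e < nbr_begin.get(t + 1); e++)
                sum += val_in.get(static_cast<size_t>(nbr_ids.get(e)));
            val_out.set(t, sum);
        }
    }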

4. Implementation
- Access patterns: information about load/store patterns and the read/write ratio.
- Shoal library: selects array implementations based on the extracted access patterns.
- OS-specific backends: currently run on Linux and Barrelfish.

5. Evaluation
Goals:
- Compare Shoal with a regular memory runtime.
- Evaluate Shoal's array implementations.
- Analyze Shoal's initialization cost.
- Investigate the benefits of using a DMA engine for array copies.

5. Evaluation
Machines (table of evaluation machines shown on slide).

5. Evaluation
Scalability (Green-Marl): almost 2x faster than the original implementation.

5. Evaluation
Scalability (PARSEC streamcluster): one of the arrays is replaced with a Shoal array; 4x faster than the original implementation.

5. Evaluation
Use of DMA engines (results shown on slide).

6. Conclusion
Shoal is a library that provides an array abstraction and rich memory allocation functions, allowing automatic tuning of data placement and access depending on workload and machine characteristics. It achieves a 2x improvement for a Green-Marl program without changing the Green-Marl input program.