University of Maryland Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst Mustafa M. Tikir Jeffrey K. Hollingsworth.

Presentation transcript:

2/17 Introduction
Cache-coherent SMPs are widely used
  – High performance computing
  – Large-scale applications
  – Client-server computing
cc-NUMA is the dominant architecture
  – Allows construction of large servers
  – Data locality is an important consideration
      Faster access to local memory units

Platform         Local Access Time   Remote Access Time   Ratio
Sun Fire [...]   [...]               300 ns               1:1.33
Sun Fire 15K     225 ns              400 ns               1:1.78
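As a rough illustration of why data locality matters, the sketch below estimates the average memory access time for a given fraction of remote accesses, using the Sun Fire 15K latencies from this slide. The linear weighting model and the function name `average_access_ns` are assumptions made for illustration; they do not come from the presentation.

```python
def average_access_ns(local_ns, remote_ns, remote_fraction):
    """Weighted average of local and remote latency.

    This is a simple linear model (an assumption): it ignores caches,
    contention, and everything except the two access latencies.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Sun Fire 15K latencies quoted on the slide.
LOCAL_NS, REMOTE_NS = 225, 400

for fraction in (0.0, 0.5, 1.0):
    print(fraction, average_access_ns(LOCAL_NS, REMOTE_NS, fraction))
```

Under this model, every access that moves from remote to local memory saves the difference between the two latencies, which is why reducing non-local accesses is the target of the optimization.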

3/17 Data Placement
Memory-intensive applications on cc-NUMA servers
  – May have significant non-local memory accesses
Possible optimizations to increase locality
  – First-touch placement of memory pages
      Commonly used in modern systems
      May not place pages local to the processors accessing them most
  – Dynamic page placement/migration
      Uses page access frequencies at runtime

4/17 Our Page Migration Approach
User-level dynamic page migration
  – Profiling and page migration during the same run
Application Profiling
  – Gathers data from hardware counters
      Samples the interconnect transactions
  – Transaction Type + Physical Address + Processor ID
  – Identifies preferred locations of memory pages
      Memory unit local to the processor that accesses the page most
Page Placement
  – Kernel moves memory pages to their preferred locations
  – At fixed time intervals
  – Pages are frozen for a while if recently migrated
      Eliminates ping-ponging of memory pages
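The profiling and placement steps on this slide can be sketched roughly as follows. This is a minimal model, not the authors' implementation: the page size, the freeze length, the function names, and the `"read"`/`"write"` transaction labels are all hypothetical, and the actual page moves (done by the kernel) are represented only as a returned plan.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 8192        # assumed page size in bytes
FREEZE_INTERVALS = 2    # assumed number of intervals a migrated page stays frozen


def preferred_units(samples, unit_of_processor):
    """Map each sampled page to the memory unit local to its most frequent accessor.

    samples: iterable of (transaction_type, physical_address, processor_id).
    unit_of_processor: dict from processor id to its local memory unit.
    """
    counts = defaultdict(Counter)
    for _type, address, processor in samples:
        counts[address // PAGE_SIZE][processor] += 1
    return {
        page: unit_of_processor[per_proc.most_common(1)[0][0]]
        for page, per_proc in counts.items()
    }


def plan_migrations(preferred, current_unit, frozen_until, interval):
    """Return {page: target_unit} for this interval, skipping frozen pages."""
    moves = {}
    for page, unit in preferred.items():
        if current_unit.get(page) == unit:
            continue    # already local to its main accessor
        if frozen_until.get(page, 0) > interval:
            continue    # recently migrated: avoid ping-ponging
        moves[page] = unit
        frozen_until[page] = interval + FREEZE_INTERVALS
    return moves


# Hypothetical samples: processors 0 and 1 sit on unit 0, processor 4 on unit 1.
samples = [("read", 0, 0), ("read", 100, 4), ("write", 200, 4), ("read", 8192, 1)]
units = {0: 0, 1: 0, 4: 1}
frozen = {}
preferred = preferred_units(samples, units)
print(plan_migrations(preferred, {0: 0, 1: 0}, frozen, interval=0))
```

Keeping the freeze bookkeeping separate from the counting step makes it easy to see how a page that was just moved is ignored until its freeze period expires, even if later samples point elsewhere.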

5/17 Hardware/Software Components
[Diagram. Hardware: Sun Fire 6800 with System Board 1 and System Board 2, each holding Processors 1 to 4 and a Memory Unit; Address Bus; Sun Fire Link; Hardware Counters. Software: Instrumentation Software performing Transaction Sampling; Virtual Page to Physical Page via Virtual to Physical Mapping (meminfo); Page Migration using the move-on-next-touch feature (madvise); Application Threads 1 to j with Explicit binding (processor_bind).]

6/17 Instrumentation Code Insertion
Instrumentation using Dyninst
  – Entry point of main
      Loads a shared library
      Creates two helper threads
        – One for address transaction sampling
        – Other for actual migrations of the pages
  – Exit point(s) of thr_create
      Calls processor_bind
        – Binds new threads to available processors
        – Helper threads are bound to dedicated processors
  – Entry point of exithandle
      Termination detection
      Clean-up of hardware counters

7/17 Preliminary Experiment
Impractical to record all transactions
  – Interval sampling
      Sampling at every Nth transaction
  – Continuous sampling
      Sampling at the maximum speed of the instrumentation software
Are samples representative of transactions?
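Interval sampling as described on this slide can be sketched in a few lines. The function name `interval_sample` and the stand-in transaction stream are illustrative assumptions; the real sampling is done by the hardware counters, not in software over a list.

```python
from itertools import islice


def interval_sample(transactions, n):
    """Return an iterator over every n-th transaction (the n-th, 2n-th, ...)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return islice(transactions, n - 1, None, n)


# Stand-in stream: transactions numbered 1..10, sampled at every 4th one.
print(list(interval_sample(range(1, 11), 4)))  # prints [4, 8]
```

Continuous sampling has no fixed N: the sampler takes the next transaction whenever the software is ready, so the effective interval depends on how fast the sampler runs relative to the transaction rate.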

8/17 Representative Sampling Technique
Potential sampling error
  – How much do sampled transactions deviate from all transactions?
Distance between two sets
  – S_ALL and S_SAMPLE
  – Ratio of transactions requested by a processor, P
[Figure: sets S_All and S_Sample, with the per-processor ratios P_A and P_S]
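One way to make the comparison between S_ALL and S_SAMPLE concrete is sketched below. The slide does not give the exact distance formula, so the absolute difference between a processor's share of all transactions (P_A) and its share of the sampled ones (P_S) is an assumption chosen for illustration, as are the function names.

```python
from collections import Counter


def processor_shares(processor_ids):
    """Fraction of transactions requested by each processor."""
    counts = Counter(processor_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {proc: count / total for proc, count in counts.items()}


def sampling_error(all_ids, sampled_ids):
    """Per-processor |P_A - P_S| (assumed distance, not taken from the slides)."""
    p_all = processor_shares(all_ids)
    p_sample = processor_shares(sampled_ids)
    return {proc: abs(share - p_sample.get(proc, 0.0)) for proc, share in p_all.items()}


# Processor 0 issues 3 of 4 transactions but only 1 of the 2 samples.
print(sampling_error([0, 0, 0, 1], [0, 1]))
```

A sample is representative under this measure when every processor's error is close to zero, meaning each processor appears in the sample in roughly the same proportion as in the full transaction stream.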

9/17 Sampling Error for CG
Interval sampling is more representative
  – Interval used also has an impact
Continuous sampling is less representative due to the difference between the rates at which
  – Transaction samples are taken
  – Processors request transactions
[Table: sampling error per processor (Proc) for Continuous Sampling and for Interval Sampling at every 4K, 1K, 256, and 64 transactions, with Average Error and % Rate Sampled rows; values not recoverable]

10/17 Page Migration Experiments
Applications
  – OpenMP C implementation of NAS Parallel Benchmark suite
      BT(B), CG(C), EP(C), FT(B), LU(C), MG(B), SP(C)
      Optimized to support parallelized code
Platform
  – 24-processor Sun Fire 6800
  – 24 GB main memory
Execution
  – 12 threads
      2 threads on each system board
  – Page migration every 5 seconds
  – Interval sampling at every 1K transactions

11/17 Reduction in Non-Local Memory Accesses
Reduction     38.0%   81.0%   67.0%   54.0%   19.7%   89.6%   58.8%
Trans. (M)    38,507  15,721  [remaining values garbled in source: 422,29748, ,116]

12/17 Performance Improvement
# Migrations    112,310  47,213  2,071  177,602  132,696  49,884  138,943
Original Time   [values garbled in source: ,981313,901]
% Overhead      1.2%     0.8%    0.1%   12.8%    0.7%     10.2%   0.5%

13/17 SPECjbb2001 Results
Potential improvement?
  – Migration working at object granularity

% Reduction   25.3%   26.1%   24.4%
Trans. (M)    2,017   2,626   2,621

14/17 MG.B Address Space [0-512MB)

15/17 MG.B with Page Migration

16/17 Conclusions
Our dynamic page migration approach
  – Reduced non-local memory accesses by up to 90%
  – Improved the execution times by up to 16%
Potentially more effective on larger cc-NUMA servers
  – Sun Fire 15K (latency ratio => 1:1.78)
User-level page migration approach
  – Relies on the OS kernel to provide the actual migration mechanism

17/17 Questions?