Operating System Support for Improving Data Locality on CC-NUMA Machines. CSE597A presentation by V. N. Murali.


Why CC-NUMA? It scales as the number of nodes increases. Its attractive property is transparent access to both local and remote memory, at the cost of higher latency for remote accesses. Two variations: CC-NUMA (Stanford DASH, MIT Alewife, Sequent) and CC-NOW (Sun s3.mp).

OS support: The most important issue is data locality. OS-supported page migration and replication can improve performance by as much as 30%.

Issues in migration/replication: When should pages be migrated? When should pages be replicated? Both are needed to boost performance. Knowing when not to migrate or replicate is also important. Which system parameter can be used to decide? Ideas?

Differences from software shared memory: In software DSM, migration and replication (M&R) are needed for correctness; on CC-NUMA, M&R is purely an optimization. In software DSM, M&R is triggered by page faults; on CC-NUMA, it is triggered by cache misses.

If a workload exhibits good cache locality, it benefits less from M&R, hence the need for selective criteria for moving pages. The study is based on the SimOS environment.

Solution: How do we improve data locality? There are 3 access patterns:
a) pages primarily accessed by a single process
b) pages mostly read by many processes
c) pages both read and written by many processes
Which method has to be applied for a), b), and c)?
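
One way to make the three patterns concrete is to classify a page from a trace of its accesses. The sketch below is illustrative only: the function name `classify_page` and both cutoff values are assumptions, not taken from the slides. The mapping it uses (migrate for a, replicate for b, leave in place for c) matches the algorithm summarized later in the deck.

```python
from collections import Counter


def classify_page(accesses, dominant_share=0.9, max_write_share=0.1):
    """Suggest an action for one page from (cpu_id, is_write) accesses.

    dominant_share and max_write_share are illustrative cutoffs,
    not values from the presentation.
    """
    if not accesses:
        return "no data"
    total = len(accesses)
    per_cpu = Counter(cpu for cpu, _ in accesses)
    top_cpu_count = per_cpu.most_common(1)[0][1]
    writes = sum(1 for _, is_write in accesses if is_write)
    if top_cpu_count / total >= dominant_share:
        return "migrate"         # pattern a: one processor dominates
    if writes / total <= max_write_share:
        return "replicate"       # pattern b: read-mostly sharing
    return "leave in place"      # pattern c: read-write sharing


print(classify_page([(0, False), (1, False), (2, False), (3, False)]))
```

A page with heavy read-write sharing is left alone because moving or copying it would only shift the remote misses to other processors.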

Costs to be considered:
1) Cost of determining candidate pages for M&R (cost of cache misses/TLB misses)
2) Overhead of M&R (new mappings, allocating a page, flushing the TLB)
3) Actual data transfer
4) Memory pressure!

Key parameters (parameter: semantics)
Reset interval: number of cycles after which all counters are reset
Trigger threshold: number of misses after which a page is "hot" for M/R
Sharing threshold: number of misses from another processor needed for replication
Write threshold: number of writes after which a page is not replicated
Migrate threshold: number of migrations after which a page is not migrated

Summary of the algorithm: A "hot page" is a page whose miss counter for some processor reaches the trigger threshold. If the miss counter for this page on any other processor reaches the sharing threshold, the page is considered for replication; otherwise it is considered for migration. A page is replicated only if its write counter has not exceeded the write threshold. A page is migrated only if its migrate counter has not exceeded the migrate threshold.
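
The decision rule above can be sketched as a small function. This is a minimal sketch under stated assumptions: the names `Thresholds`, `PageCounters`, and `decide` are hypothetical, and the default threshold values are placeholders for illustration rather than values given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Thresholds:
    # Placeholder values, assumed for illustration only.
    trigger: int = 128
    sharing: int = 32
    write: int = 1
    migrate: int = 1


@dataclass
class PageCounters:
    misses: dict = field(default_factory=dict)  # cpu_id -> miss count
    writes: int = 0
    migrates: int = 0


def decide(page, cpu, t):
    """Return 'replicate', 'migrate', or 'none' for a page as seen by cpu."""
    if page.misses.get(cpu, 0) < t.trigger:
        return "none"  # page is not hot for this processor
    shared = any(
        count >= t.sharing
        for other, count in page.misses.items()
        if other != cpu
    )
    if shared:
        # Replicate only if the write counter has not exceeded its threshold.
        return "replicate" if page.writes <= t.write else "none"
    # Migrate only if the migrate counter has not exceeded its threshold.
    return "migrate" if page.migrates <= t.migrate else "none"


t = Thresholds(trigger=4, sharing=2)
print(decide(PageCounters(misses={0: 4, 1: 2}), 0, t))
```

The write and migrate thresholds act as brakes: a page that is written often should not be replicated, and a page that keeps moving should stop being migrated.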

Implementation details: The directory controller maintains the miss counters and generates a low-priority interrupt. It batches several hot pages before raising the interrupt. A write to a replicated page causes the replicas to be collapsed into a single page.
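
The counting and batching behavior can be modeled in software as follows. This is only a toy model: the real mechanism lives in the directory controller hardware, and the class name `DirectoryController`, the `batch_size` parameter, and the callback interface are assumptions made for illustration.

```python
class DirectoryController:
    """Toy model: count misses per (page, cpu) and batch hot pages."""

    def __init__(self, trigger_threshold, batch_size, on_interrupt):
        self.trigger_threshold = trigger_threshold
        self.batch_size = batch_size      # assumed parameter, not in the slides
        self.on_interrupt = on_interrupt  # stands in for the low-priority interrupt
        self.miss_counts = {}             # (page, cpu) -> miss count
        self.pending = []                 # hot pages not yet reported

    def record_miss(self, page, cpu):
        key = (page, cpu)
        self.miss_counts[key] = self.miss_counts.get(key, 0) + 1
        # Report a page once, at the moment it becomes hot.
        if self.miss_counts[key] == self.trigger_threshold:
            self.pending.append(key)
            if len(self.pending) >= self.batch_size:
                batch, self.pending = self.pending, []
                self.on_interrupt(batch)

    def reset(self):
        """Clear all miss counters, as happens at each reset interval."""
        self.miss_counts.clear()


delivered = []
dc = DirectoryController(trigger_threshold=2, batch_size=2,
                         on_interrupt=delivered.append)
for page, cpu in [(10, 0), (10, 0), (11, 1), (11, 1)]:
    dc.record_miss(page, cpu)
print(delivered)
```

Batching spreads the fixed cost of taking an interrupt over several pages, which helps keep kernel overhead low.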

IRIX changes: replication support, finer-grain locking, and page table back mappings.

Workloads:
Engineering workload: large, sequential, and memory-intensive; uses a Verilog simulator and Flashlite.
Parallel application: Raytrace, a parallel graphics algorithm.
Scientific workload: Splash.
Decision support database.
Multiprogrammed software development: Pmake.

Performance analysis: 3 factors: a) user stall time, b) fraction of misses satisfied in local memory, c) kernel overhead.
Engineering: large user stall time, and therefore the best performance gain. Both migration and replication were used successfully.
Raytrace: mostly read-only accesses, so it benefits mainly from replication.

Splash: 3 parallel applications (Raytrace, Ocean, Volume rendering). For Ocean, migration is helpful. Raytrace and Volume rendering can benefit from replication.
Database: mostly read accesses, and hence it benefits from replication.

Alternative policies: static policies and dynamic policies.
Static: round robin, first touch, post facto (similar to the optimal page replacement algorithm).
Dynamic: migration only, replication only, migration and replication.
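
The static policies can be compared with a short sketch. All function names and the trace format, a list of `(cpu, page)` misses, are assumptions for illustration. The post facto function reflects one reading of that policy: place each page, with full knowledge of the trace, on the node that misses on it most.

```python
from collections import Counter, defaultdict


def round_robin_placement(pages, num_nodes):
    """Spread pages across nodes in order, ignoring who uses them."""
    return {page: i % num_nodes for i, page in enumerate(pages)}


def first_touch_placement(trace, cpu_to_node):
    """Place each page on the node of the first CPU that touches it."""
    placement = {}
    for cpu, page in trace:
        placement.setdefault(page, cpu_to_node[cpu])
    return placement


def post_facto_placement(trace, cpu_to_node):
    """Place each page on the node with the most misses over the whole trace."""
    per_page = defaultdict(Counter)
    for cpu, page in trace:
        per_page[page][cpu_to_node[cpu]] += 1
    return {page: nodes.most_common(1)[0][0]
            for page, nodes in per_page.items()}


trace = [(0, "A"), (1, "A"), (1, "A"), (1, "B")]
cpu_to_node = {0: 0, 1: 1}
print(first_touch_placement(trace, cpu_to_node))
print(post_facto_placement(trace, cpu_to_node))
```

In this trace, first touch pins page A to node 0 even though node 1 misses on it more often, which is the kind of case where a dynamic policy can help.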