Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm.


Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm By: Priya Limaye

Locality What is Locality of reference?

Locality What is Locality of reference?

sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + number[i];
}

Locality What is Locality of reference?

sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + number[i];
}

Temporal Locality Recently accessed data and instructions are likely to be accessed in the near future

Locality What is Locality of reference?

sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + number[i];
}

Spatial Locality Data and instructions close to recently accessed data and instructions are likely to be accessed in the near future

Locality What is Locality of reference? – Recently accessed data and instructions and nearby data and instructions are likely to be accessed in the near future. – Grab a larger chunk than you immediately need – Once you’ve grabbed a chunk, keep it

Locality in multiprocessors Computation should depend on data local to the processor – Each processor uses data from its own cache – Once data is brought into a cache, it stays there

Locality in multiprocessors
[Figure: two CPUs, each with its own cache, connected to shared memory holding a counter]

Counter: Shared Memory
[Figure sequence: the counter starts at 0 in shared memory; one CPU increments it to 1, then the other increments it to 2. Reading the cached value is OK, but every write invalidates the copy in the other CPU's cache, so the counter's cache line ping-pongs between the caches]

Comparing counters 1. Scales well on older architectures 2. Performs worse on a shared-memory multiprocessor

Counter: Array Sharing requires moving the counter back and forth between CPU caches Split the counter into an array Each CPU gets its own counter
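The split counter can be sketched in C. This is a minimal illustration, not Tornado code; `NCPUS`, `counter_inc`, and `counter_read` are assumed names, and a real kernel would index by the current processor id:

```c
#include <assert.h>

#define NCPUS 4                  /* assumed machine size */

static long counters[NCPUS];     /* one slot per CPU */

/* Each CPU increments only its own slot, so writes by different
 * CPUs never touch the same memory location. */
void counter_inc(int cpu) {
    counters[cpu]++;
}

/* A read gathers all the per-CPU slots. */
long counter_read(void) {
    long sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += counters[i];
    return sum;
}
```

Writes are now contention-free, but as the next slides show, adjacent slots can still share a cache line.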

Counter: Array
[Figure sequence: the array starts as (0, 0) in shared memory; each CPU increments only its own element, first (1, 0), then (1, 1). Reading the counter adds all per-CPU elements: 1 + 1 = 2]

Counter: Array This solves the sharing problem What about performance?

Comparing counter Does not perform better than ‘shared counter’.

Counter: Array This solves the sharing problem What about performance? What about false sharing?

Counter: False Sharing
[Figure sequence: both array elements (0, 0) fit on one cache line, so the line is shared by the two CPUs. When one CPU updates its element, the whole line is invalidated in the other CPU's cache; the next access shares it again, and the following update invalidates it again — the line ping-pongs even though the CPUs never touch each other's element]

Solution? Use a padded array Different elements map to different cache lines
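A minimal C sketch of the padding fix (the 64-byte cache-line size and all names are assumptions): each counter is padded out to a full cache line, so no two CPUs' slots ever share a line.

```c
#include <assert.h>

#define N_CPUS 4
#define CACHE_LINE 64            /* assumed cache-line size */

/* Padding each counter to a full cache line guarantees that two
 * CPUs' counters never share a line, eliminating false sharing. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter pcounters[N_CPUS];

void pcounter_inc(int cpu) { pcounters[cpu].value++; }

long pcounter_read(void) {
    long sum = 0;
    for (int i = 0; i < N_CPUS; i++)
        sum += pcounters[i].value;
    return sum;
}
```

The trade-off is memory: each 8-byte counter now occupies 64 bytes.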

Counter: Padded Array
[Figure: each per-CPU counter sits on its own cache line; updates are independent of each other]

Comparing counter Works better

Locality in OS Serious performance impact Difficult to retrofit Tornado – Ground up design – Object Oriented approach – Natural locality

Tornado Object Oriented Approach Clustered Objects Protected Procedure Call Semi-automatic garbage collection – Simplified locking protocol

Object Oriented Approach
[Figure build-up: a process table whose entries (Process 1, Process 2, …) are objects, each protected by its own lock instead of a single table-wide lock]

Object Oriented Approach

class ProcessTableEntry {
    data
    lock
    code
};

Object Oriented Approach Each resource is represented by a different object Requests to virtual resources are handled independently – No shared data structure accesses – No shared locks
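The per-entry locking idea can be sketched with POSIX threads (all names here are hypothetical; Tornado uses its own lock primitives, not pthreads): each table entry carries its own lock, so requests on different processes never contend.

```c
#include <assert.h>
#include <pthread.h>

/* One lock per process-table entry instead of one lock for the
 * whole table: operations on different processes proceed in
 * parallel and touch no shared lock. */
struct process_entry {
    pthread_mutex_t lock;
    int pid;
    int priority;
};

void set_priority(struct process_entry *p, int prio) {
    pthread_mutex_lock(&p->lock);   /* contends only on this entry */
    p->priority = prio;
    pthread_mutex_unlock(&p->lock);
}
```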

Object Oriented Approach
[Figure build-up: handling a page fault. The page fault exception is delivered to the Process object; the HAT (Hardware Address Translation) searches for the responsible Region; the Region forwards the request to its FCM (File Cache Manager); on a miss the FCM obtains the data through the COR (Cached Object Representative) and the DRAM memory manager]

Object Oriented Approach Multiple implementations for system objects Dynamically change the objects used for a resource Provides the foundation for other Tornado features

Clustered Objects Improve locality for widely shared objects Appear as a single object – Composed of multiple component objects Have a representative ‘rep’ per processor (or group of processors) – The number of reps defines the degree of clustering Clients use a common clustered object reference

Clustered Objects

Clustered Objects : Implementation

A translation table per processor – Located at the same virtual address on every processor – Each entry points to the local rep A clustered object reference is just a pointer into the table ‘Reps’ are created on demand, when first accessed – A special global miss-handling object installs them
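A toy C sketch of this lookup path (all names are assumed; in Tornado the table is per-processor and mapped at the same virtual address on every CPU): a reference is an index into the table, and an empty slot triggers the miss handler, which installs a rep on first access.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 16

typedef struct rep { long count; } rep_t;

/* One such table would exist per processor, mapped at the same
 * virtual address everywhere, so the same reference resolves to a
 * processor-local rep on every CPU. */
static rep_t *translation_table[MAX_OBJECTS];

/* Stand-in for the global miss-handling object: create the rep on
 * first access and install it in the table. */
static rep_t *miss_handler(int ref) {
    static rep_t reps[MAX_OBJECTS];
    translation_table[ref] = &reps[ref];
    return translation_table[ref];
}

rep_t *resolve(int ref) {
    rep_t *r = translation_table[ref];
    return r ? r : miss_handler(ref);   /* create rep on first access */
}
```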

Counter: Clustered Object
[Figure sequence: clients hold a single object reference; each CPU's increment goes to its own rep (rep 1, rep 2), so updates are independent of each other]

Clustered Objects Degree of clustering Multiple reps per object – How is consistency maintained? Coordination between reps – Shared memory – Remote PPCs

Counter: Clustered Object
[Figure sequence: to read the counter, the object contacts every rep and adds all the per-rep counts (1 + 1)]

Clustered Objects : Benefits Facilitates multiprocessor optimizations, e.g. replication and partitioning of data structures Preserves the object-oriented design Enables incremental optimization Each object can have several different implementations

Synchronization Two kinds of locking issues – Locking – Existence guarantees

Synchronization: Locking Encapsulate locking within individual objects Uses clustered objects to limit contention Uses spin-then-block locks – Highly efficient – Reduces cost of lock/unlock pair
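A spin-then-block lock can be sketched in C11 atomics (a sketch under stated assumptions: the spin budget and names are invented, and `sched_yield` stands in for actually blocking the thread): spin briefly in the hope the holder releases soon, then give up the processor.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100           /* assumed spin budget */

typedef struct { atomic_int held; } stb_lock_t;

/* Spin for a bounded number of attempts, then yield the CPU
 * instead of burning cycles; a real implementation would block. */
void stb_lock(stb_lock_t *l) {
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(&l->held, &expected, 1))
                return;          /* acquired while spinning */
        }
        sched_yield();           /* "block": give up the processor */
    }
}

void stb_unlock(stb_lock_t *l) {
    atomic_store(&l->held, 0);
}
```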

Synchronization: Existence guarantees Protecting every reference to an object with a lock – Eliminates races where one thread is accessing the object while another is deallocating it – But requires a complex global hierarchy of locks Tornado: semi-automatic garbage collection – A clustered object reference can be used at any time – Eliminates the need for existence locks

Garbage Collection Distinguish between temporary references and persistent references – Temporary: clustered object references held privately (e.g. within a single call) – Persistent: stored in shared memory; can persist beyond the lifetime of a thread

Garbage Collection Phase 1: remove all persistent references – Normal cleanup Phase 2: wait for temporary references to drain – The kernel is event driven – Maintain a count of active calls on each processor – The object is safe once every counter reaches zero Phase 3: destroy the object itself
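The temporary-reference phase can be sketched as follows (all names assumed): because the kernel is event driven, a temporary reference can only live for the duration of a call, so each processor counts its active calls and deletion is deferred until every count has drained to zero.

```c
#include <assert.h>

#define CPUS 2

static int active_calls[CPUS];   /* per-processor active-call counts */
static int pending_delete;
static int destroyed;

/* Destroy only when a delete is pending and no call is active on
 * any processor: no temporary reference can remain. */
static void try_destroy(void) {
    int busy = 0;
    for (int i = 0; i < CPUS; i++)
        busy += active_calls[i];
    if (pending_delete && busy == 0)
        destroyed = 1;
}

void call_enter(int cpu)  { active_calls[cpu]++; }

void call_exit(int cpu)   { active_calls[cpu]--; try_destroy(); }

void request_delete(void) { pending_delete = 1; try_destroy(); }
```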

Garbage Collection
[Figure sequence: Process 1 reads object 259, incrementing the processor's temporary-reference counter to 1; Process 2 requests deletion, but the GC waits while the counter is nonzero; when Process 1 finishes, the counter drops back to 0 and the GC destroys the object]

Interprocess communication Uses Protected Procedure Calls (PPCs) A PPC is a call from a client object to a server object – A clustered object call that crosses from the client’s protection domain to the server’s Advantages – Client requests are serviced on the local processor – Client and server share the processor, similar to handoff scheduling – Each client request has exactly one thread in the server

PPC: Implementation Server threads are created on demand A list of worker threads is maintained Implemented as a trap plus some queue manipulations – Dequeue a worker thread from the ready workers – Enqueue the caller thread on the worker – Return-from-trap into the server Registers are used to pass parameters
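The fast path can be sketched as the two queue operations the slide describes (names and structures are assumed; the real path is a kernel trap, not a function call): dequeue a worker from the server's ready list, park the caller on it, and "return-from-trap" into the server on the worker thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct thread {
    struct thread *next;
    const char *state;
};

static struct thread *ready_workers;   /* server's worker pool */

struct thread *ppc_call(struct thread *caller) {
    struct thread *w = ready_workers;
    if (w == NULL)
        return NULL;                   /* real PPC creates one on demand */
    ready_workers = w->next;           /* dequeue a ready worker */
    caller->state = "enqueued-on-worker"; /* caller waits on the worker */
    w->state = "running-in-server";    /* return-from-trap into the server */
    return w;
}
```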

Performance

Performance: summary Strong basic design Highly scalable Locality and locking overhead are the major sources of slowdown

Conclusion The object-oriented approach and clustered objects exploit locality and concurrency The OO design has some overhead, but it is low compared to the performance advantages Tornado scales extremely well and achieves high performance on shared-memory multiprocessors

References
0/papers/05.pdf
Presentation by Holly Grimes, CS 533, Winter