Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm.

Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4 Dec 2013.

Why Tornado? Previous shared-memory multiprocessor operating systems evolved from designs targeting the architectures that existed back then.

Locality – Traditional System. Memory latency was low: fetch, decode, execute without stalling. Memory was faster than the CPU, so it was fine to leave data in memory. [Diagram: three CPUs issuing loads and stores over a shared bus to shared memory]

Locality – Traditional System. Memory got faster, but CPUs got faster than memory, so memory latency became high relative to the CPU. A fast CPU stalled on memory is useless!

Locality – Traditional System. Reading memory quickly was very important for the fetch, decode, execute cycle.

Cache: extremely close, fast memory. Programs hit a lot of addresses within a small region. E.g. a program is a sequence of instructions, with adjacent words on the same page of memory, and even more reuse inside a loop. [Diagram: each CPU now has its own cache between it and the shared bus]

Cache. Repeated access to the same page of memory was good!

Memory latency. Making up for memory latency: with a cache, the CPU executes at the speed of the cache. But the cache is really small; how can it help us more than 2% of the time?

Locality: the same value or storage location being frequently accessed. Temporal locality: reusing some data/resource within a small duration; recently accessed data and instructions are likely to be accessed in the near future. Spatial locality: use of data elements within relatively close storage locations; data and instructions close to recently accessed data and instructions are likely to be accessed in the near future.

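To make the two kinds of locality concrete, here is a minimal C sketch (illustrative only, not from the paper): summing an array reuses the accumulator on every iteration (temporal locality) and walks adjacent words in order (spatial locality).

```c
#include <stddef.h>

/* Summing an array in index order touches adjacent words, so after the
 * first miss the rest of each cache line is already loaded (spatial
 * locality); the accumulator `sum` is reused on every iteration
 * (temporal locality). */
long sum_array(const long *a, size_t n)
{
    long sum = 0;                 /* reused every iteration: temporal */
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* sequential access: spatial */
    return sum;
}
```
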
Cache. All CPUs accessing the same page in memory. [Diagram: CPUs with private caches on a shared bus to shared memory]

Example: Shared Counter
[Diagram sequence: a counter in memory, cached by both CPU-1 and CPU-2]
A copy of the counter exists in both processors' caches, shared by CPU-1 and CPU-2.
CPU-1 increments the counter to 1, taking the cache line in exclusive mode.
CPU-2 reads the counter; the line drops to shared mode (reads are OK).
CPU-2 increments the counter to 2; the line goes from shared back to exclusive, invalidating CPU-1's copy.

Example: Shared Counter. Terrible performance!

Problem: Shared Counter. The counter bounces between CPU caches, leading to a high cache miss rate. Try an alternate approach: convert the counter to an array, so each CPU has its own counter and updates can be purely local. Because addition is commutative, the per-CPU increments can be applied in any order; to read the counter, you add up all the elements (sketched below).

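A minimal C sketch of the array-based counter just described, with invented names (NCPUS, counter_inc) and no atomicity, purely to show the shape of the idea: increments touch only the caller's slot, and a read sums every slot.

```c
#define NCPUS 16                     /* hypothetical CPU count */

/* One counter slot per CPU. Because addition is commutative, each CPU
 * can increment its own slot locally; a read just sums all slots.
 * NOTE: adjacent longs share a cache line, so this naive layout still
 * suffers false sharing, as the next slides show. */
static long counter[NCPUS];

void counter_inc(int cpu)            /* cpu: index of the calling CPU */
{
    counter[cpu]++;                  /* purely local update */
}

long counter_read(void)
{
    long total = 0;
    for (int i = 0; i < NCPUS; i++)  /* reads must visit every slot */
        total += counter[i];
    return total;
}
```
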
Example: Array-based Counter
[Diagram sequence: an array of counters in memory, one element per CPU]
CPU-1 increments its own element to 1.
CPU-2 increments its own element to 1.
To read the counter, CPU-2 adds all the elements (1 + 1).

Performance: Array-based Counter Performs no better than ‘shared counter’!

Why doesn't the array-based counter work? Data is transferred between main memory and the caches in fixed-size blocks called cache lines. If two counters are in the same cache line, only one processor at a time can hold the line for writing, so the line ultimately still has to bounce between CPU caches. [Diagram: both counters sharing a single cache line]

False sharing
[Diagram sequence: CPU-1's and CPU-2's counters share one cache line, initially (0,0)]
Both CPUs cache the line in shared mode.
CPU-1 increments its counter (1,0), invalidating CPU-2's copy.
CPU-2 re-reads the line; it is shared again.
CPU-2 increments its counter (1,1), invalidating CPU-1's copy.

Ultimate Solution: pad the array so each counter gets an independent cache line; spread the counter components out in memory.

Example: Padded Array
[Diagram: individual cache lines for each counter]
Updates are independent of each other.

Performance: Padded Array Works better

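A hedged sketch of the padded version, assuming C11 alignas and a 64-byte cache line (both assumptions; the real line size is machine-specific): each per-CPU counter now occupies its own cache line, so one CPU's update never invalidates another CPU's line.

```c
#include <stdalign.h>

#define NCPUS 16
#define CACHE_LINE 64                /* assumed cache-line size in bytes */

/* Pad each per-CPU counter out to its own cache line so that updates
 * by different CPUs never touch the same line: no false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) long value;  /* each value starts a new line */
};

static struct padded_counter counter[NCPUS];

void counter_inc(int cpu)
{
    counter[cpu].value++;            /* local line, no invalidations */
}

long counter_read(void)
{
    long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += counter[i].value;
    return total;
}
```
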
Cache. All CPUs accessing the same page in memory: write sharing is very destructive. [Diagram: CPUs with private caches on a shared bus to shared memory]

Shared Counter Example: the lessons. Minimizing read/write and write/write sharing minimizes cache-coherence traffic. Also minimize false sharing.

Now and then. Then (uniprocessor): locality is good. Now (multiprocessor): sharing is bad, so don't share! A traditional OS was implemented to have good locality, but running the same code on the new architecture causes cache interference, and adding more CPUs makes it worse. Each CPU wants to run out of its own cache without interference from the others.

Modularity. Minimize write sharing and false sharing. How do you structure a system so that CPUs don't share the same memory? Then: no objects. Now: split everything and keep it modular. The paper's approach: an object-oriented design, clustered objects, and existence guarantees.

Goal. Ideally, each CPU uses data from its own cache: the cache hit rate is good, cache contents stay in the cache rather than being invalidated by other CPUs, the CPU has good locality of reference (it frequently accesses its own cached data), and sharing is minimized.

Object-Oriented Approach. Goal: minimize access to shared data structures and minimize the use of shared locks. Operating systems are driven by the requests of applications on virtual resources. For good performance, requests to different virtual resources must be handled independently.

How do we handle requests independently? Represent each resource by a different object and try to reduce sharing. If an entire resource is locked, requests get queued up; avoid making the resource a source of contention by using fine-grain locking instead.

Coarse/fine-grain locking example. The process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this table to find the region responsible and forwards the page fault to it. With a coarse lock, the whole table gets locked even though only one region is needed, and other threads get queued up. [Diagram: Thread 1 locking a process object containing Region 1, Region 2, Region 3, …, Region n]

Coarse/fine-grain locking example. Use individual locks instead: with one lock per region, Thread 1 and Thread 2 can fault on different regions concurrently, as the sketch below shows. [Diagram: Thread 1 and Thread 2 each locking a different region]

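A rough C sketch of the fine-grain alternative using POSIX mutexes; all names here (struct region, fault_lookup) are invented for illustration and are not Tornado's actual interfaces.

```c
#include <pthread.h>
#include <stddef.h>

/* Fine-grain locking: each region carries its own lock, so faults on
 * different regions proceed in parallel; the list lock protects only
 * the list structure and is held briefly. (Mutexes are assumed to be
 * initialized with pthread_mutex_init() when the structures are built.) */
struct region {
    pthread_mutex_t lock;        /* protects only this region */
    unsigned long   start, end;  /* address range covered by the region */
    struct region  *next;
};

struct process {
    pthread_mutex_t list_lock;   /* protects only the region list itself */
    struct region  *regions;
};

/* Find the region responsible for a faulting address and return it
 * locked; other threads can still work on other regions. */
struct region *fault_lookup(struct process *p, unsigned long addr)
{
    pthread_mutex_lock(&p->list_lock);          /* brief: list walk only */
    for (struct region *r = p->regions; r; r = r->next) {
        if (addr >= r->start && addr < r->end) {
            pthread_mutex_lock(&r->lock);       /* lock just this region */
            pthread_mutex_unlock(&p->list_lock);
            return r;
        }
    }
    pthread_mutex_unlock(&p->list_lock);
    return NULL;                                /* no region: fault fails */
}
```
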
Key memory management object relationships in Tornado. A page fault is delivered to the Process object, which forwards the request to the responsible Region. The Region translates the fault address into a file offset and forwards the request to the FCM. If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation (HAT) object. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Rep to fill the page from the file.

Advantage of using the object-oriented approach: consider the performance-critical case of an in-core page fault (a TLB miss for a page resident in memory). Only the objects specific to the faulting resource are invoked, and the locks acquired and data structures accessed are internal to those objects. In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention.

Clustered Object Approach. Object orientation is good, but some resources are just too widely shared. Enter clustering. E.g. the thread dispatch queue: if a single list is used, contention is high (a bottleneck, plus more cache-coherence traffic). Solution: partition the queue and give each processor a private list (no contention!), as sketched below.

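A small illustrative sketch of per-processor dispatch queues in C (invented names; a POSIX mutex is kept only for occasional cross-CPU access such as work stealing): in the common case each CPU touches only its own queue, so there is no lock or cache-line contention between CPUs.

```c
#include <pthread.h>
#include <stddef.h>

#define NCPUS 16                     /* hypothetical CPU count */

struct thread {
    struct thread *next;             /* ready-list link */
    /* ... register state, stack pointer, etc. ... */
};

/* One private ready queue per processor instead of one global list. */
struct dispatch_queue {
    pthread_mutex_t lock;            /* uncontended except when stealing */
    struct thread  *head;
};

static struct dispatch_queue queues[NCPUS];

void queues_init(void)               /* call once at boot */
{
    for (int i = 0; i < NCPUS; i++)
        pthread_mutex_init(&queues[i].lock, NULL);
}

void enqueue_local(int cpu, struct thread *t)
{
    struct dispatch_queue *q = &queues[cpu];
    pthread_mutex_lock(&q->lock);
    t->next = q->head;               /* push onto the local list */
    q->head = t;
    pthread_mutex_unlock(&q->lock);
}

struct thread *dequeue_local(int cpu)
{
    struct dispatch_queue *q = &queues[cpu];
    pthread_mutex_lock(&q->lock);
    struct thread *t = q->head;
    if (t) q->head = t->next;        /* pop from the local list */
    pthread_mutex_unlock(&q->lock);
    return t;
}
```
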
Clustered Object Approach. All clients access a clustered object using a common clustered object reference, so it looks like a single object. It is actually made up of several component objects: each call is automatically directed to the appropriate local rep, and each rep handles calls from a subset of the processors.

Clustered Object: Shared Counter Example. It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently. Common techniques for reducing contention are replication, distribution, and partitioning of objects; the counter uses full distribution.

Degree of Clustering. One rep for the entire system: Cached Object Rep, which is read-mostly; all processors share a single rep. One rep per cluster of neighbouring processors: Region, which is read-mostly and on the critical path for all page faults. One rep per processor: FCM, which maintains the state of the pages of a file cached in memory; the hash table for the cache is split across the many reps.

Clustered Object Approach

Clustering through replication. One way to have data shared is to make it read-only: replicate it, and all CPUs can read their own copy from their cache. Writing? How do we maintain consistency among the replicas? That is fundamentally the same problem as cache coherence, but here we can take advantage of the semantics of the object being replicated: use object-semantic knowledge to build a specific implementation (an object-specific algorithm to maintain the distributed state).

Advantages of the Clustered Object Approach. Supports replication, partitioning, and locks (essential on a shared-memory multiprocessor). Built on the object-oriented design: clients need not worry about the location or organization of objects, they just use clustered object references. Allows incremental implementation: depending on the performance needed, change the degree of clustering (start with one rep serving all requests, then optimize once the object proves widely shared).

Clustered Object Implementation. Each processor has its own set of translation tables. A clustered object reference is just a pointer into the translation table; the entry points to the rep responsible for handling method invocations on the local processor. [Diagram: a translation table with entries Clustered Obj 1 Ptr, Clustered Obj 2 Ptr, …]

Clustered Object Implementation. Each per-processor copy of the table is located at the same virtual address, so the same pointer into the table yields, on each processor, the rep responsible for handling the invocation there.

Clustered Object Implementation. We do not know in advance which reps will be needed, so reps are created on first access. Each clustered object defines a miss-handling object, and all entries in the translation table are initialized to point to a global miss-handling object. On the first invocation, the global miss-handling object is called; it saves state and calls the clustered object's own miss-handling object. The object miss handler: if the rep exists, a pointer to it is installed in the translation table; otherwise it creates the rep, then installs a pointer to it. A sketch follows.

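A simplified C sketch of the translation-table mechanism, with assumed names and sizes (xlate, MAX_OBJS); as a simplification it represents a missing rep with a NULL entry rather than with a pointer to Tornado's global miss-handling object.

```c
#include <stdlib.h>

#define NCPUS    16
#define MAX_OBJS 1024

typedef struct rep {
    long data;                          /* stand-in for per-rep state */
} rep_t;

/* Per-CPU translation tables; in Tornado each per-processor copy lives
 * at the same virtual address, so one pointer works on every CPU. */
static rep_t *xlate[NCPUS][MAX_OBJS];

static rep_t *create_local_rep(void)    /* object-specific miss handler */
{
    return calloc(1, sizeof(rep_t));
}

/* Invoked on the first call from a given CPU: create (or locate) the
 * rep, patch the table so later calls hit directly. */
static rep_t *miss_handler(int cpu, int obj)
{
    rep_t *r = create_local_rep();
    xlate[cpu][obj] = r;
    return r;
}

/* A clustered object call: look up the local rep, faulting it in on
 * first access. The caller then invokes methods on the returned rep. */
rep_t *clustered_call(int cpu, int obj)
{
    rep_t *r = xlate[cpu][obj];
    if (!r)                             /* table miss: first access */
        r = miss_handler(cpu, obj);
    return r;
}
```
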
Counter: Clustered Object
[Diagram sequence: a clustered object reference directing each CPU to its own rep, Rep 1 and Rep 2]
CPU-1 increments its rep (Rep 1 becomes 1).
CPU-2 increments its rep (Rep 2 becomes 1); the updates are independent of each other.
To read the counter, CPU-1 adds all the rep counters (1 + 1).

Synchronization involves two issues: locking and existence guarantees. Locking: concurrency control with respect to modifications of data structures. Existence guarantee: ensuring that the data structure containing a variable is not deallocated during an update. Any concurrent access to a shared data structure must provide an existence guarantee.

Locking has lots of overhead: the basic instruction overhead, plus extra cache-coherence traffic from write sharing of the lock itself. Tornado encapsulates all locks within individual objects: reducing the scope of each lock limits contention, and splitting objects into multiple representatives reduces contention further!

Existence Guarantee. Example: the scheduler's ready list of thread control blocks. Thread 1 is traversing the list while Thread 2 is trying to delete element 2. If element 2 is garbage collected, its memory may be reallocated while Thread 1 still holds a pointer to it. [Diagram: a linked list of thread control blocks with element 2 being freed mid-traversal]

Existence Guarantee. How do we stop a thread from deallocating an object still being used by another? Tornado uses a semi-automatic garbage collection scheme.

Garbage Collection Implementation. References are classified as either temporary or persistent. Temporary: held privately by a single thread, destroyed when the thread terminates. Persistent: stored in shared memory, accessed by multiple threads.

Clustered Object Destruction. Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.). Phase 2: the object makes sure all temporary references to it have been removed. On a uniprocessor, the number of active operations is maintained in a per-processor counter; a count of 0 means none are active. On a multiprocessor, the clustered object knows which set of processors can access it, and the counter must become zero on all of those processors (a circulating token scheme). A sketch of the counting idea follows.

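A minimal sketch of the temporary-reference counting idea using C11 atomics; the names are invented, per-counter cache-line padding is omitted for brevity, and Tornado drains the counts lazily with a circulating token rather than polling as this toy can_free() does.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 16

/* Per-CPU count of operations currently inside the object. An
 * operation bumps its own CPU's counter on entry and drops it on exit,
 * so the common case touches only local state. */
typedef struct {
    atomic_int active[NCPUS];
} existence_t;

void op_enter(existence_t *e, int cpu)
{
    atomic_fetch_add(&e->active[cpu], 1);   /* temporary ref taken */
}

void op_exit(existence_t *e, int cpu)
{
    atomic_fetch_sub(&e->active[cpu], 1);   /* temporary ref dropped */
}

/* Phase 2 of destruction: safe to free only once no temporary
 * references remain on any processor that can access the object. */
bool can_free(const existence_t *e)
{
    for (int i = 0; i < NCPUS; i++)
        if (atomic_load(&e->active[i]) != 0)
            return false;
    return true;
}
```
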
IPC. Acts like a clustered object call that crosses from the protection domain of the client to that of the server and back. Client requests are always serviced on their local processor. Processor sharing: like handoff scheduling.

Performance. Test machine: a 16-processor NUMAchine prototype; also evaluated on the SimOS simulator.

Component Results: average number of cycles required for n threads. Tornado performs quite well for memory allocation and miss handling. There is a lot of variation in the garbage collection and PPC tests, caused by multiple data structures occasionally mapping to the same cache block on some processors.

Component Results: average cycles required on SimOS with a 4-way set-associative cache. 4-way associativity: a block from main memory can go into any one of 4 places in its cache set.

Highly scalable: Tornado compares favourably to commercial operating systems (which show up to 100x slowdown on 16 processors). First generation: Hurricane (a coarse-grain approach to scalability). Third generation: K42 (IBM and University of Toronto).

References
Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.
Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.
Shared counter illustration: CS533 slides, 2012.