
CS533 Concepts of Operating Systems Jonathan Walpole

Tornado

Why Tornado? Existing OS designs do not scale on shared memory multiprocessors!

Locality – Traditional System
Memory latency was low: fetch, decode, execute without stalling. Memory was faster than the CPU, so it was okay to leave data in memory.
(Diagram: CPU connected to shared memory over a shared bus; loads & stores cross the bus)

Locality – Traditional System
Memory got faster, but CPUs got faster still. Memory latency became relatively high, and fast CPUs became worthless, stalled waiting on memory.
(Diagram: CPU, shared bus, shared memory)

Locality – Traditional System
Memory access speed became critical.
(Diagram: CPU, shared bus, shared memory)

Cache
A cache is fast memory close to the CPU; it exploits locality of reference.
(Diagram: CPU with a cache, connected to shared memory over a shared bus)

Cache
We like repeated access to the same page of memory!
(Diagram: CPU with a cache, shared bus, shared memory)

Memory Latency
The CPU executes at the speed of the cache, not of memory! But the cache is really small; how can it help us most of the time?
(Diagram: CPU, cache, bus, memory)

Locality
Locality:
-the same memory locations are accessed frequently
Temporal locality:
-the same memory location is repeatedly accessed within a small duration
Spatial locality:
-memory locations near a recently accessed location are likely to be accessed soon
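A minimal C++ illustration (my example, not from the original slides): summing an array sequentially exhibits both kinds of locality, since the accumulator is reused on every iteration and neighbouring elements share cache lines.

    #include <cstddef>

    // Illustrative only: sequential traversal has good locality.
    long sum(const int* a, std::size_t n) {
        long total = 0;                       // 'total' is touched every
        for (std::size_t i = 0; i < n; ++i)   // iteration: temporal locality
            total += a[i];                    // a[i] and a[i+1] usually share
        return total;                         // a cache line: spatial locality
    }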

Caches
What happens when all CPUs access the same memory location?
(Diagram: each CPU has its own cache; all connect to shared memory over a shared bus)

Shared Counter Example
A counter is shared by both CPU-1 and CPU-2, so a copy of the counter exists in both processors' caches.
(Diagram: CPU-1 with cache, CPU-2 with cache, counter in shared memory)

Shared Counter Example
(Diagram: counter = 0 in memory; neither cache holds a copy yet)

Shared Counter Example
(Diagram: both CPUs have read the counter; each cache holds 0)

Shared Counter Example
(Diagram: CPU-1 increments the counter to 1, writing its cache line in exclusive mode)

Shared Counter Example
(Diagram: CPU-2 reads the counter; the read is OK and the line moves to shared mode)

Shared Counter Example
(Diagram: CPU-2 increments the counter to 2; CPU-1's copy is invalidated as the line goes from shared to exclusive)

Shared Counter Example
Terrible performance!

What Went Wrong?
The counter bounces between CPU caches, leading to a high cache miss rate; CPU speed is set by memory speed, not cache speed.
A better idea:
-Convert the counter to an array, so each CPU has its own counter
-Updates can then be local
-Reads need to add up all the counters
-This depends on the commutativity of addition
A sketch of this idea follows below.
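A minimal C++ sketch of the array-based counter (my code, not Tornado's): each CPU increments only its own slot, and a read sums every slot, relying on the commutativity of addition. Note the slots are deliberately left unpadded here, which is exactly the flaw the next slides expose.

    #include <atomic>

    constexpr int kNumCpus = 2;           // assumed CPU count

    std::atomic<long> counters[kNumCpus]; // one counter per CPU, unpadded

    void increment(int cpu) {             // update touches only the local slot
        counters[cpu].fetch_add(1, std::memory_order_relaxed);
    }

    long read() {                         // a read must add up all the slots
        long total = 0;
        for (int cpu = 0; cpu < kNumCpus; ++cpu)
            total += counters[cpu].load(std::memory_order_relaxed);
        return total;
    }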

Array-Based Counter Example
(Diagram: an array of counters in memory, one for each CPU, both initialized to 0)

Array-Based Counter Example
(Diagram: CPU-1 increments its own counter; the array holds 1, 0)

Array-Based Counter Example
(Diagram: CPU-2 increments its own counter; the array holds 1, 1)

Array-Based Counter Example
(Diagram: to read the counter, a CPU adds up all the counters: 1 + 1 = 2)

Array-Based Counter Performance
Performance is no better!

Why Didn’t it Work? Data is transferred between main memory and caches in fixed sized blocks, called cache lines If two counters are in the same cache line, they can only be used by one processor at a time for writing -It still has to bounce between CPU caches! Memory 11 Shared cache line

False Sharing
(Diagram: both counters, 0,0, share one cache line in memory)

False Sharing
(Diagram: both CPUs read; each cache holds the line with 0,0)

False Sharing
(Diagram: the line, 0,0, is held in shared mode by both caches)

False Sharing
(Diagram: CPU-1 writes its counter; the line becomes 1,0 and CPU-2's copy is invalidated)

False Sharing
(Diagram: CPU-2 re-reads the line, 1,0; back to shared mode)

False Sharing
(Diagram: CPU-2 writes its counter; the line becomes 1,1 and CPU-1's copy is invalidated)

A Better Approach
Pad the array:
-leave unused memory space between counters
-force them into independent cache lines
But doesn't this waste memory... cache space... and TLB entries... etc.? Isn't this bad for locality? A padded layout is sketched below.
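One way to express the padding in C++ (a sketch under the assumption of 64-byte cache lines): over-align each counter so no two counters ever share a line.

    #include <atomic>
    #include <cstddef>

    constexpr int kNumCpus = 2;
    constexpr std::size_t kCacheLine = 64;  // assumed cache line size

    // alignas pads each element out to a full cache line, so a write by
    // one CPU never invalidates the line holding another CPU's counter.
    struct alignas(kCacheLine) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[kNumCpus];       // one cache line per counter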

Padded Array Example
(Diagram: each counter occupies its own cache line, both 0)

Padded Array Example
(Diagram: each CPU increments its own counter; the updates are independent of each other)

Padded Array Performance
It works better!

Write Sharing With Caches
If all CPUs access the same memory location, sharing is very destructive if they write.
(Diagram: CPUs with caches, shared bus, shared memory)

Shared Counter Example
Minimize read/write and write/write sharing:
-it reduces cache coherence protocol activity!
-i.e., it cuts out unnecessary communication
Ideally we should remove all false sharing.

What Has Changed?
In the old days, locality was a good thing. Now locality seems to be a bad thing:
-if you share data, you communicate
-communication is expensive!
Traditional OSes were implemented to have good locality:
-on new architectures this causes cache interference
-more CPUs make the problem worse!
Ideally, CPUs should run from their own caches with no interference from others.

Modularity
How can we minimize write sharing and false sharing? How can we structure a system so that CPUs don't share the same memory? Split the data and encapsulate it, so each CPU accesses a separate instance.
The Tornado approach:
-Object-oriented design
-Clustered objects
-Existence guarantees

Goal
Ideally:
-The cache hit rate should be high
-Each CPU should use data from its own cache
-Each CPU's locality of reference should be good (it frequently accesses its cached data)
-Sharing is minimal
-Cache contents stay in cache and are not invalidated by other CPUs

Object-Oriented Approach
Goals:
-Minimize access to shared data structures
-Minimize the need for shared locks
Operating systems service application requests for virtual resources, and requests for different virtual resources must be handled independently:
-resources are objects
-each instance of a resource is a separate object instance
-fine-grain locking: a lock per instance (sketched below)
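A hedged C++ sketch of per-instance locking (the names are mine, not Tornado's interfaces): each resource object encapsulates its own lock, so operations on different instances never contend.

    #include <cstdint>
    #include <mutex>

    class Region {
        std::mutex lock_;   // fine grain: one lock per object instance
        // ... per-region state (base address, size, backing file, ...)
    public:
        void handle_fault(std::uintptr_t addr) {
            std::lock_guard<std::mutex> g(lock_);  // locks only this region
            // ... service the fault using this region's state
            (void)addr;
        }
    };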

Coarse vs. Fine-Grain Locking
The process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this list to find the responsible region and forwards the page fault to it.
(Diagram: process object containing Region 1 ... Region n; Thread 1 faults, and the whole list gets locked even though only one region is needed, so other threads get queued up)

Coarse vs. Fine-Grain Locking
(Diagram: use individual locks, one per region; Thread 1 and Thread 2 can fault on different regions concurrently)

Memory Management Objects
1. A page fault is delivered and the request is forwarded to the responsible Region.
2. The Region translates the fault address into a file offset and forwards the request to the FCM (File Cache Manager).
3. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Rep to fill the page from the file.
4. If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation (HAT) object.
(See the sketch below.)
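The same fault path as a hedged C++ sketch; the class and method names here are illustrative assumptions, not Tornado's real interfaces.

    #include <cstdint>
    #include <unordered_map>

    struct Frame {};                        // a physical page frame

    struct HAT {                            // Hardware Address Translation
        void map(std::uintptr_t vaddr, Frame* f) { (void)vaddr; (void)f; /* program the MMU */ }
    };

    struct CachedObjectRep {
        void fill(Frame* f, long offset) { (void)f; (void)offset; /* read page from file */ }
    };

    struct FCM {                            // File Cache Manager
        std::unordered_map<long, Frame*> cache;  // file offset -> cached frame
        CachedObjectRep cor;

        Frame* get_page(long offset) {
            auto it = cache.find(offset);
            if (it != cache.end())          // cached: return the frame
                return it->second;
            Frame* f = new Frame;           // not cached: new frame,
            cor.fill(f, offset);            // filled from the file
            return cache[offset] = f;
        }
    };

    struct Region {
        std::uintptr_t vbase; long file_base;
        FCM* fcm; HAT* hat;

        void handle_fault(std::uintptr_t addr) {
            long offset = file_base + long(addr - vbase);  // fault addr -> offset
            hat->map(addr, fcm->get_page(offset));         // install the mapping
        }
    };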

Advantages of Object Orientation
Consider the performance-critical case of an in-core page fault (a TLB miss fault for a memory-resident page):
-The objects invoked are specific to the faulting CPU
-The locks acquired and data accessed are internal to those objects
-In contrast, many operating systems use a global page cache or a single HAT layer
-These must be locked, and both the data and the locks are sources of contention

Clustered Object Approach
Object orientation is good, but some objects (resources) are still too widely shared, e.g. the thread dispatch queue (ready list):
-Use of a single list causes memory and lock contention
-Solution: partition the queue and give each processor its own private list (sketched below)
Together, these private lists represent the thread dispatch queue:
-together they form a clustered object
-each private list is a clustered object representative
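A minimal per-processor ready-list sketch in C++ (my names; a real scheduler would also balance or steal work across lists):

    #include <deque>
    #include <mutex>

    struct Thread {};
    constexpr int kNumCpus = 4;        // assumed CPU count

    // Each processor owns a private list; together the lists act as the
    // dispatch queue (one clustered object, one rep per processor).
    struct ReadyList {
        std::mutex lock;               // mostly uncontended: local access only
        std::deque<Thread*> runnable;
    };

    ReadyList ready[kNumCpus];

    Thread* pick_next(int cpu) {       // scheduling touches only the local rep
        std::lock_guard<std::mutex> g(ready[cpu].lock);
        if (ready[cpu].runnable.empty())
            return nullptr;
        Thread* t = ready[cpu].runnable.front();
        ready[cpu].runnable.pop_front();
        return t;
    }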

Clustered Objects
A clustered object looks like a single object, but it is actually made up of several component objects; each representative (rep) handles calls from a subset of the processors. All clients access a clustered object using a common clustered object reference, and each call is automatically directed to the appropriate local rep.

Clustered Objects: Shared Counter Example
It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently.

Degree of Clustering
-One rep for the entire system
-One rep per processor
-One rep for a cluster of neighbouring processors
Examples:
-Cached Object Rep: read mostly, so all processors share a single rep
-Region: read mostly, on the critical path for all page faults
-FCM: maintains the state of a file's pages cached in memory; its hash table is split across many reps

Clustered Objects

Clustering Through Replication
One way to have shared data and scale well:
-Only allow reads!
-Replicate the data
-All CPUs read their own copy from their own cache
What about writing?
-Challenge: how do we maintain consistency among replicas?
-This is fundamentally the same problem as cache coherence
-We may be able to do better by taking advantage of the semantics of the object being replicated
-Example: the shared counter

Advantages of Clustered Objects
They internally support replication, partitioning, and locks, all essential for scalable performance on a SMMP.
Built on an object-oriented design:
-No need to worry about the location or organization of objects
-Just use clustered object references
They allow incremental implementation:
-Change the degree of clustering based on performance needs
-Start with one rep serving all requests
-Optimize the object only if it turns out to be widely shared

Clustered Object Implementation
Each processor has its own translation table, and a clustered object reference is just a pointer into the local translation table. Each table entry holds a pointer to the rep responsible for handling method invocations on the local processor.
(Diagram: translation table mapping Clustered Obj 1, Clustered Obj 2, ... to rep pointers)

Clustered Object Implementation
Each per-processor copy of the translation table is located at the same virtual address, so the same pointer into the table yields the rep responsible for handling the invocation on whichever processor the call runs.

Clustered Object Implementation
We do not know in advance which reps will be needed, so reps are created on demand, at first access:
1. Each clustered object defines a miss-handling object.
2. All entries in the translation table are initialized to point to a global miss-handling object.
3. On the first invocation on a processor, the global miss-handling object is called.
4. The global miss-handling object saves state and calls the clustered object's own miss-handling object.
5. The object's miss handler: if the rep already exists, a pointer to it is installed in the translation table; otherwise the rep is created first and then installed.
A simplified sketch follows below.
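A simplified C++ sketch of the translation table and miss handling (all names are mine; the real scheme dispatches through a global miss-handling object rather than the null check used here):

    struct Rep {                        // a per-processor representative
        virtual void invoke() = 0;
        virtual ~Rep() = default;
    };

    constexpr int kNumCpus = 4;

    int my_cpu() { return 0; }          // stub: really read from a per-CPU register

    struct ClusteredObject {
        Rep* table[kNumCpus] = {};      // per-processor translation entries

        virtual Rep* create_rep(int cpu) = 0;  // object-specific miss handler
        virtual ~ClusteredObject() = default;

        Rep* local_rep() {
            int cpu = my_cpu();
            if (table[cpu] == nullptr)         // first access on this CPU:
                table[cpu] = create_rep(cpu);  // miss handler installs a rep
            return table[cpu];                 // later calls go straight through
        }
    };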

Counter as a Clustered Object
(Diagram: a CPU holds an object reference to the clustered counter, which is made up of Rep 1 and Rep 2)

Counter as a Clustered Object
(Diagram: CPU-1 increments through the object reference; only Rep 1 becomes 1)

Counter as a Clustered Object
(Diagram: CPU-2 increments; only Rep 2 becomes 1; the updates are independent of each other)

Counter as a Clustered Object
(Diagram: a CPU reads the counter through the object reference)

Counter as a Clustered Object
(Diagram: the read adds all the rep counters: 1 + 1 = 2)

Synchronization
Locking:
-concurrency control for managing accesses to shared data
Existence guarantees:
-ensure the data structure containing a variable is not deallocated during an update
-concurrent access to shared data structures requires an existence guarantee

Locking
Locks have lots of overhead:
-basic instruction overhead
-extra cache coherence traffic due to write sharing of the lock
Tornado encapsulates all locks within individual objects:
-this reduces the scope of each lock and limits contention
-objects can be split into multiple representatives to reduce contention further!

Existence Guarantees
(Diagram: the scheduler's ready list of thread control blocks; Thread 1 is traversing the list while Thread 2 is trying to delete element 2; if element 2 is garbage collected, its memory may be reallocated while Thread 1 still holds a reference)

Existence Guarantees
How do we stop one thread from deallocating an object being used by another? Tornado uses a semi-automatic garbage collection scheme.

Garbage Collection Implementation
References are classified as:
-Temporary: held privately by a single thread; destroyed when the thread terminates
-Persistent: stored in shared memory; accessed by multiple threads

Clustered Object Destruction
Phase 1:
-The object makes sure all persistent references to it have been removed (from lists, tables, etc.)
Phase 2:
-The object makes sure all temporary references to it have been removed
-Uniprocessor: the number of active operations is maintained in a per-processor counter; a count of 0 means none are active
-Multiprocessor: the clustered object knows which set of processors can access it, and the counter must become zero on all of those processors (a circulating token scheme); see the sketch below
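A hedged C++ sketch of the phase-2 temporary-reference tracking (names are mine, and the simple all-processor scan below stands in for the circulating token scheme described above):

    #include <atomic>

    constexpr int kNumCpus = 4;

    // One active-operation count per processor, padded to avoid sharing.
    struct alignas(64) OpCount {
        std::atomic<int> active{0};
    };

    OpCount ops[kNumCpus];

    void enter_op(int cpu) { ops[cpu].active.fetch_add(1, std::memory_order_relaxed); }
    void exit_op(int cpu)  { ops[cpu].active.fetch_sub(1, std::memory_order_release); }

    // The object may be destroyed only once the count has been observed at
    // zero on every processor that could hold a temporary reference.
    bool quiescent() {
        for (int cpu = 0; cpu < kNumCpus; ++cpu)
            if (ops[cpu].active.load(std::memory_order_acquire) != 0)
                return false;
        return true;
    }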

IPC
IPC acts like a clustered object call that crosses from the protection domain of the client to that of the server and back. Client requests are always serviced on their local processors. Processor sharing works like handoff scheduling.

Performance
Test environment:
-16-processor NUMAchine prototype
-SimOS simulator

Component Results
Average number of cycles required for n threads:
-Tornado performs quite well for memory allocation and miss handling
-There is a lot of variation in the garbage collection and PPC tests
-The variation is caused by multiple data structures occasionally mapping to the same cache line on some processors

Component Results
Average cycles required on SimOS with a 4-way set-associative cache. 4-way associativity means a cache line from main memory can reside in any one of the 4 slots of its cache set.

Conclusion
Tornado is highly scalable.
First generation: Hurricane (coarse-grain approach to scalability)
Second generation: Tornado (University of Toronto)
Third generation: K42 (IBM & University of Toronto)

References
-Clustered Objects: Initial Design, Implementation and Evaluation. Jonathan Appavoo, Master's thesis, University of Toronto.
-Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. Benjamin Gamsa, PhD thesis, University of Toronto.
-PowerPoint diagrams and earlier slides from Anusha Muthia.