Presented by: Sagnik Bhattacharya
Paper authors: Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, Mendel Rosenblum
Overview
Problems of current shared memory multiprocessors and our requirements
Cellular Disco as a solution
–architecture
–prototype
–hardware-fault containment
–CPU management
–memory management
–statistics
Cellular Disco and ubiquitous environments
Conclusion
Problem
Extending modern operating systems to run efficiently on shared memory multiprocessors.
Software development has not kept pace with hardware development.
Common operating systems fail beyond 12 processors.
What we need…
The system should be reliable.
It should be scalable.
It should be fault-tolerant.
It should not take too much development time or effort.
Traditional approaches
Hardware partitioning: lacks resource sharing, makes physical clusters.
Software-centric approaches (significant development time and cost):
–modify an existing OS
–develop a new OS
A scenario…
[Figure: a control unit for a smart space, backed by four processors (Proc). No rebooting necessary.]
Solution: Cellular Disco
Extension of previous work, Disco.
Uses the concept of virtual machine monitors.
Partitions the multiprocessor system into virtual clusters.
Virtual Machine Monitor
[Figure sequence: two virtual machines run on top of the virtual machine monitor, which runs on the hardware. VM1 uses µP's 1,2,3; VM2 uses µP's 1,3,5,8. Guest operating systems shown: Win NT and IRIX 6.2. Steps shown: (1) a guest OS issues an I/O request; (2) the monitor traps the I/O request; (3) the monitor performs the I/O and sends an interrupt to the virtual machine.]
Issues it addresses
Scalability
NUMA awareness
Hardware fault-containment
Resource management
Basic Cellular Disco Architecture
Prototype
Runs on a 32-processor SGI Origin 2000.
Supports shared memory systems based on the MIPS R10000 architecture.
The prototype runs piggybacked on IRIX 6.4.
The host OS is made dormant and is only used to invoke some device drivers.
Hardware Virtualization
Physical resources: the resources visible to a virtual machine.
Machine resources: the actual hardware resources, allocated by Cellular Disco.
Cellular Disco (CD) operates in the kernel mode of the MIPS processor.
CD intercepts all system calls.
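The split between physical and machine resources can be pictured as a per-VM translation table. The sketch below is only an illustration: the class name `PhysicalToMachineMap`, its methods, and the 4 KB page size are assumptions, not details taken from the slides.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class PhysicalToMachineMap:
    """Hypothetical per-VM table: physical page number -> machine page number."""

    def __init__(self):
        self._pmap = {}

    def map_page(self, physical_page, machine_page):
        # The monitor decides which machine page backs a physical page.
        self._pmap[physical_page] = machine_page

    def translate(self, physical_address):
        # Split the address the guest sees into page number and offset.
        page, offset = divmod(physical_address, PAGE_SIZE)
        if page not in self._pmap:
            raise KeyError(f"physical page {page} has no machine page yet")
        return self._pmap[page] * PAGE_SIZE + offset


vm_map = PhysicalToMachineMap()
vm_map.map_page(2, 7)                      # physical page 2 lives in machine page 7
machine_addr = vm_map.translate(2 * PAGE_SIZE + 5)
```

The guest OS only ever sees physical addresses; the monitor is free to change the backing machine page without the guest noticing.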
Resource Management
CPU management: each processor maintains its own run queue.
Memory management: memory borrowing mechanism.
Each OS instance is given only as many resources as it can handle.
Large applications are split, and communication between the parts is established through shared-memory regions.
CPU Management
VCPU migration:
–Intra node (37 µsec)
–Inter node (520 µsec)
–Inter cell (1520 µsec)
VCPU migration
[Figure sequence: Cellular Disco spans several cells; each cell contains nodes joined by an interconnect, and each node contains CPUs. A VCPU is shown migrating in three cases: intra node, inter node, and inter cell.]
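A minimal sketch of how the three migration costs quoted for intra-node, inter-node, and inter-cell moves could be looked up from the source and destination CPU. The `CpuLocation` type and the function names are hypothetical; only the three microsecond figures come from the slides.

```python
from dataclasses import dataclass

# Costs in microseconds, as quoted on the CPU Management slide.
MIGRATION_COST_US = {"intra_node": 37, "inter_node": 520, "inter_cell": 1520}


@dataclass(frozen=True)
class CpuLocation:
    """Hypothetical address of a CPU: which cell, which node, which CPU."""
    cell: int
    node: int
    cpu: int


def migration_kind(src, dst):
    # Crossing a cell boundary is the most expensive case, so check it first.
    if src.cell != dst.cell:
        return "inter_cell"
    if src.node != dst.node:
        return "inter_node"
    return "intra_node"


def migration_cost_us(src, dst):
    return MIGRATION_COST_US[migration_kind(src, dst)]


same_node = migration_cost_us(CpuLocation(0, 0, 0), CpuLocation(0, 0, 1))
other_cell = migration_cost_us(CpuLocation(0, 0, 0), CpuLocation(1, 2, 0))
```

A balancer can compare this cost with the expected benefit of a move before migrating a VCPU.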
CPU Management (contd.)
CPU balancing:
–Idle balancer
–Periodic balancer
Load balancing scenario
Idle balancer
[Figure sequence: four CPUs (CPU0 to CPU3) with VCPUs A0, A1, B0 and B1 on their run queues. An idle CPU asks whether VC B1 has enough cache affinity to CPU2. The answer is no, so VC B1 is moved to the idle CPU.]
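The idle-balancer question ("does this VCPU have enough cache affinity to its current CPU?") can be sketched as follows. The numeric affinity score, the 0.5 threshold, and the function names are assumptions made for illustration; the slides do not say how affinity is measured.

```python
def pick_vcpu_to_steal(run_queues, affinity, idle_cpu, threshold=0.5):
    """Return (cpu, vcpu) for the first waiting VCPU with weak cache affinity."""
    for cpu, queue in run_queues.items():
        if cpu == idle_cpu:
            continue
        for vcpu in queue:
            # Hypothetical score in [0, 1]; a missing entry means no affinity.
            if affinity.get((vcpu, cpu), 0.0) < threshold:
                return cpu, vcpu
    return None


def idle_balance(run_queues, affinity, idle_cpu, threshold=0.5):
    """Move one weakly attached VCPU to the idle CPU; return it, or None."""
    choice = pick_vcpu_to_steal(run_queues, affinity, idle_cpu, threshold)
    if choice is None:
        return None
    cpu, vcpu = choice
    run_queues[cpu].remove(vcpu)
    run_queues[idle_cpu].append(vcpu)
    return vcpu


queues = {0: [], 2: ["B1"], 3: ["A0"]}
scores = {("B1", 2): 0.1, ("A0", 3): 0.9}
moved = idle_balance(queues, scores, idle_cpu=0)
```

Here B1 has little cache state on CPU 2, so moving it is cheap, while A0 stays where its cache is warm.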
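Periodic Balancer
Does a depth-first traversal of the load tree.
Checks the load difference between two siblings and ignores the pair if the difference is too small (Diff=1 in the figure).
If diff>=2, does load balancing, provided benefit>cost.
[Figure sequence: the traversal visits sibling pairs in the load tree; pairs with Diff=1 are skipped and pairs with Diff=2 are balanced.]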
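The periodic balancer's traversal can be sketched as a depth-first walk over a binary load tree whose leaves are CPUs. This is a simplified illustration: the `LoadNode` type, the binary tree shape, and the `benefit_exceeds_cost` callback are assumptions; only the depth-first order, the sibling comparison, and the diff>=2 rule come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadNode:
    """Hypothetical load-tree node; a leaf stands for one CPU."""
    name: str
    load: int = 0  # for a leaf: number of runnable VCPUs on that CPU
    left: Optional["LoadNode"] = None
    right: Optional["LoadNode"] = None

    def total(self):
        if self.left is None and self.right is None:
            return self.load
        return self.left.total() + self.right.total()


def find_imbalances(node, min_diff=2, benefit_exceeds_cost=lambda heavy, light: True):
    """Depth-first walk; return (heavy, light) sibling pairs worth balancing."""
    if node is None or node.left is None or node.right is None:
        return []
    found = find_imbalances(node.left, min_diff, benefit_exceeds_cost)
    found += find_imbalances(node.right, min_diff, benefit_exceeds_cost)
    left_load, right_load = node.left.total(), node.right.total()
    if abs(left_load - right_load) >= min_diff:
        if left_load > right_load:
            heavy, light = node.left, node.right
        else:
            heavy, light = node.right, node.left
        if benefit_exceeds_cost(heavy, light):
            found.append((heavy.name, light.name))
    return found


tree = LoadNode("root",
                left=LoadNode("n0", left=LoadNode("cpu0", 3), right=LoadNode("cpu1", 1)),
                right=LoadNode("n1", left=LoadNode("cpu2", 1), right=LoadNode("cpu3", 2)))
pairs = find_imbalances(tree)
```

In this tree only cpu0 and cpu1 differ by 2, so only that pair is reported; the other sibling pairs differ by 1 and are ignored.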
Gang Scheduling
For every physical CPU, we select the VCPU that is to run on it.
The VCPU selected is the highest-priority gang-runnable VCPU.
–A VCPU is gang-runnable if all non-idle VCPUs of its VM are either running, or waiting on run queues of processors running lower-priority VMs.
Example
[Figure sequence: three processors (µP1, µP2, µP3), each with a currently executing VCPU and a wait queue, ordered by priority. VM1 has VCs 1, 3, 8 (idle); VM2 has VCs 2, 4, 6 (idle), 7; VM3 has VCs 5, 9. A gang-runnable VM is identified, and the final frame shows the new executing VCPUs, including VC5 and VC9, and the new wait queues.]
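The gang-runnable test can be sketched directly from its definition. The data layout, the numeric priorities (larger means higher), and the CPU assignment below are assumptions for illustration; they reuse the VM membership from the example but do not reproduce the exact figure.

```python
def locate(vcpu, cpus):
    """Return (cpu_name, is_running) for a VCPU, or None if it is nowhere."""
    for name, cpu in cpus.items():
        if cpu["running"] == vcpu:
            return name, True
        if vcpu in cpu["queue"]:
            return name, False
    return None


def is_gang_runnable(vm, vms, cpus):
    owner = {v: name for name, info in vms.items() for v in info["vcpus"]}
    priority = vms[vm]["priority"]
    for vcpu in vms[vm]["vcpus"]:
        if vcpu in vms[vm]["idle"]:
            continue  # idle VCPUs do not need a processor
        where = locate(vcpu, cpus)
        if where is None:
            return False
        cpu_name, running = where
        if running:
            continue
        current = cpus[cpu_name]["running"]
        # A waiting VCPU is fine only if it could preempt a lower-priority VM.
        if current is not None and vms[owner[current]]["priority"] >= priority:
            return False
    return True


def pick_gang(vms, cpus):
    runnable = [vm for vm in vms if is_gang_runnable(vm, vms, cpus)]
    return max(runnable, key=lambda vm: vms[vm]["priority"], default=None)


vms = {
    "VM1": {"priority": 1, "vcpus": [1, 3, 8], "idle": {8}},
    "VM2": {"priority": 2, "vcpus": [2, 4, 6, 7], "idle": {6}},
    "VM3": {"priority": 3, "vcpus": [5, 9], "idle": set()},
}
cpus = {
    "P1": {"running": 1, "queue": [5]},
    "P2": {"running": 2, "queue": [9, 7]},
    "P3": {"running": 3, "queue": [4]},
}
chosen = pick_gang(vms, cpus)
```

With these assumed priorities VM3 is chosen: both of its VCPUs wait behind lower-priority VMs, so the whole gang can be scheduled together.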
Memory Management
Each cell maintains its own freelist, and on request (RPC) allocates memory to the other cells in its allocation preference list.
Speed µsec for 4 MB.
A threshold is set for the minimum amount of local free memory.
Paging is avoided as far as possible.
Memory Borrowing
freelist: list of free pages in the cell.
allocation preference list: list of cells from which borrowing memory is more beneficial than paging.
Memory Borrowing
[Figure sequence: Cells 1 to 5 with their freelist sizes drawn against a borrowing threshold (16 MB) and a lending threshold (32 MB). A cell short of memory asks one cell and is refused; it cannot ask another cell; it then asks a further cell, which gives it 4 MB.]
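The borrowing steps in the figure (ask, refused, cannot ask, ask, gives 4 MB) can be sketched as below. The function name and the freelist sizes are hypothetical, and the reading of the two thresholds (borrow below 16 MB, lend only above 32 MB) is an assumption drawn from the figure labels.

```python
BORROW_THRESHOLD_MB = 16  # assumed: a cell borrows when its free memory is below this
LEND_THRESHOLD_MB = 32    # assumed: a cell lends only while above this
CHUNK_MB = 4              # amount handed over per request in the figure


def try_borrow(cell, free_mb, preference):
    """Ask the cells on `cell`'s allocation preference list, in order.

    Returns the lending cell, or None if nobody lent (or no need to borrow).
    Cells that are not on the preference list are never asked.
    """
    if free_mb[cell] >= BORROW_THRESHOLD_MB:
        return None
    for lender in preference[cell]:
        if free_mb[lender] > LEND_THRESHOLD_MB:
            free_mb[lender] -= CHUNK_MB
            free_mb[cell] += CHUNK_MB
            return lender
        # Otherwise the request is refused and the next cell is tried.
    return None


free = {1: 10, 2: 20, 3: 40, 4: 50}   # hypothetical freelist sizes in MB
prefs = {1: [2, 4]}                    # cell 3 is not on cell 1's list
lender = try_borrow(1, free, prefs)
```

Cell 2 refuses because it is below the lending threshold, cell 3 is never asked, and cell 4 hands over 4 MB.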
Memory Management (contd.)
Paging algorithm: second-chance FIFO.
Page sharing information is kept in a control data structure.
Cellular Disco traps all read and write requests made by the operating systems.
Second-chance FIFO
A reference bit is added to each page in the FIFO scheme.
Every time the page is accessed, the bit is set to 1.
If a page is selected by FIFO and its reference bit is 1, the bit is set to 0 and another page is looked for.
A page is the target page if it is selected by FIFO and its reference bit is 0.
Example
[Figure sequence: a page table with a reference bit (RB) per page. On a page fault, plain FIFO would pick the oldest page, whose RB is 1. Second-chance FIFO instead clears that bit to 0 and moves on to the second oldest page, whose RB is 0.]
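A minimal sketch of second-chance FIFO as described above. The class name and the choices to start a newly loaded page with reference bit 0 and to move a spared page to the back of the queue are assumptions (the usual textbook variant), not details confirmed by the slides.

```python
from collections import deque


class SecondChanceFIFO:
    """Toy second-chance FIFO replacement over a fixed number of frames."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.queue = deque()  # page ids, oldest on the left
        self.ref_bit = {}     # page id -> 0 or 1

    def access(self, page):
        """Touch a page; return the evicted page, or None if nothing was evicted."""
        if page in self.ref_bit:
            self.ref_bit[page] = 1  # resident page: just set its reference bit
            return None
        evicted = None
        if len(self.queue) >= self.capacity:
            evicted = self._evict()
        self.queue.append(page)
        self.ref_bit[page] = 0  # assumption: bit starts cleared on load
        return evicted

    def _evict(self):
        while True:
            victim = self.queue.popleft()
            if self.ref_bit[victim]:
                # Second chance: clear the bit and look for another page.
                self.ref_bit[victim] = 0
                self.queue.append(victim)
            else:
                del self.ref_bit[victim]
                return victim


pager = SecondChanceFIFO(capacity=3)
for p in ["A", "B", "C"]:
    pager.access(p)
pager.access("A")            # A is referenced again, so its bit becomes 1
victim = pager.access("D")   # A is spared; B is the oldest page with bit 0
```

Plain FIFO would have evicted A here; the reference bit saves it because it was used recently.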
Hardware fault-containment
The failure rate increases as the number of processors increases.
Cellular Disco is internally structured as a set of semi-independent cells.
A failure in one cell does not impact VMs running in other cells (localization of faults).
Assumption: CD is a trusted software layer.
Cellular Structure Fault in one cell does not affect others
Hardware fault-containment (contd.)
Communication modes:
–Fast inter-processor RPC
–Message
Side benefit: software fault containment, i.e., individual OS crashes do not impact the system.
Hardware-Fault recovery
liveset: set of still functioning nodes.
Failure: removal from the liveset.
Recovery: insertion back into the liveset.
Virtual machines dependent on the failed cell are terminated.
Memory dependencies are updated when a cell fails.
Example
[Figure sequence: Cellular Disco on three cells holding Nodes 1 to 6, running VM 1, VM 2 and VM 3, with liveset 1,2,3,4,5,6. A hardware fault occurs (BOOM). Nodes 1 and 2 drop out, the liveset shrinks (shown as 5,6), and only VM 2 keeps running. After an interrupt the nodes are reinserted and the liveset returns to 1,2,3,4,5,6.]
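The liveset bookkeeping can be sketched with plain sets. The function names and the VM-to-node dependencies below are hypothetical and do not reproduce the figure; the sketch only shows failure as removal, recovery as reinsertion, and termination of dependent VMs.

```python
def handle_failure(liveset, failed_nodes, vm_deps):
    """Remove failed nodes from the liveset and terminate dependent VMs.

    vm_deps maps a VM name to the set of nodes it depends on (hypothetical).
    Returns the sorted list of terminated VMs.
    """
    liveset.difference_update(failed_nodes)
    terminated = sorted(vm for vm, deps in vm_deps.items() if deps & failed_nodes)
    for vm in terminated:
        del vm_deps[vm]
    return terminated


def handle_recovery(liveset, recovered_nodes):
    """Insert repaired nodes back into the liveset."""
    liveset.update(recovered_nodes)


liveset = {1, 2, 3, 4, 5, 6}
vm_deps = {"VM 1": {1, 2}, "VM 2": {5, 6}, "VM 3": {2, 3}}  # assumed dependencies
killed = handle_failure(liveset, {1, 2}, vm_deps)
after_failure = set(liveset)
handle_recovery(liveset, {1, 2})
```

VMs that touched a failed node are terminated; a VM confined to surviving nodes keeps running, and the repaired nodes become usable again once reinserted.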
Fault-Recovery overhead
Virtualization Overheads
(The first column shows the execution time on IRIX 6.4 and the second shows the execution time on Cellular Disco.)
Cellular Disco and Ubiquitous environments
Provides raw computational power for our smart spaces.
More importantly, a hardware fault does not bring the whole system down: faults are contained and fault recovery is present.
Adaptable to new operating systems.
Grey Areas
Will the source simplicity remain if it is not piggybacked on IRIX 6.4?
Will it work on non-uniform multiprocessor systems?
–Probable solution: development of a hardware virtualization standard
In conclusion…
Cellular Disco presents a midway path between hardware-directed and software-directed techniques.
It can be used on the central control unit for our smart spaces because it is scalable and fault-tolerant.