The Kernel Abstraction
Main Points
Process concept
  A process is an OS abstraction for executing a program with limited privileges
Dual-mode operation: user vs. kernel
  Kernel-mode: execute with complete privileges
  User-mode: execute with fewer privileges
Safe control transfer
  How do we switch from one mode to the other?
Processes
Fundamental abstraction of program execution
  memory
  processor(s): each processor abstraction is a thread
  “execution context”
Unix, like many operating systems, uses the notion of a process as its fundamental abstraction of program execution. Each program runs in a separate process. Processes are protected from one another in the sense that the actions of one process cannot directly harm others. The abstraction comprises the memory of a program (known as its address space, the collection of locations that can be referenced by the process), the execution agents (processor abstractions), and other information, known collectively as the execution context, representing such things as the files the process is currently accessing and how it responds to exceptions and to external stimuli. The processor abstraction is often called a thread. In “traditional” Unix programs, processes have only one thread, so we’ll use the word process to include the single thread running inside of it. Later in this course, when we cover multithreaded programming, we’ll be more careful and use the word thread when we are discussing the processor abstraction.
Process Concept Process: an instance of a program, running with limited rights Process control block: the data structure the OS uses to keep track of a process Two parts to a process: Thread: a sequence of instructions within a process Potentially many threads per process (for now 1:1) Thread aka lightweight process Address space: set of rights of a process Memory that the process can access Other permissions the process has (e.g., which procedure calls it can make, what files it can access)
A Program
    enum { nprimes = 100 };     /* compile-time constant, so the array size is valid C */
    int prime[nprimes];

    int main() {
        int i;
        int current = 2;
        prime[0] = current;
        for (i = 1; i < nprimes; i++) {
            int j;
        NewCandidate:
            current++;
            /* trial division by the primes found so far */
            for (j = 0; prime[j]*prime[j] <= current; j++) {
                if (current % prime[j] == 0)
                    goto NewCandidate;
            }
            prime[i] = current;
        }
        return(0);
    }
The Unix Address Space stack dynamic bss data text A Unix process’s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region consisting of initialized data (simply called data), uninitialized data (BSS—a directive from an ancient assembler (for the IBM 704 series of computers), standing for Block Started by Symbol and used to reserve space for uninitialized storage), and a dynamic area; and a second read-write region containing the process’s user stack (a standard Unix process contains only one thread of control). The first area of read-write storage is often collectively called the data region. Its dynamic portion grows in response to sbrk system calls. Most programmers do not use this system call directly, but instead use the malloc and free library routines, which manage the dynamic area and allocate memory when needed by in turn executing sbrk system calls. The stack region grows implicitly: whenever an attempt is made to reference beyond the current end of stack, the stack is implicitly grown to the new reference. (There are system-wide and per-process limits on the maximum data and stack sizes of processes.)
Typical Process Layout
Libraries provide the glue between user processes and the OS
  libc linked in with all C programs
  Provides printf, malloc, and a whole slew of other routines necessary for programs
[Figure: process memory layout]
  Stack: activation records
  Heap: OBJECT1, OBJECT2
  Data: HELLO WORLD, GO BIG RED CS!
  Library text: printf(char * fmt, …) { create the string to be printed; SYSCALL 80 } malloc() { … } strcmp() { … }
  Program text: main() { printf(“HELLO WORLD”); printf(“GO BIG RED CS!”); }
Full System Layout
The OS is omnipresent and steps in where necessary to aid application execution
  Typically resides in high memory
When an application needs to perform a privileged operation, it needs to invoke the OS
[Figure: full system memory layout]
  Kernel: OS stack (kernel activation records), OS heap (OBJECT1, OBJECT2), OS data (LINUX), OS text (syscall_entry_point() { … })
  User: stack (activation records), heap (OBJECT1, OBJECT2), data (HELLO WORLD, GO BIG RED CS!), library text (printf(char * fmt, …)), program text (main() { … })
Process Concept OK, so you compile your program into an executable image with instructions and data. What’s to keep the program from overwriting the OS kernel? Or some other program running at the same time? What’s to keep it from overwriting the disk? From reading someone else’s files that are stored on disk?
Multiple Processes
[Figure labels: other stuff and kernel stack (one pair per process), kernel text]
Each process has its own user address space, but there’s a single kernel address space. It contains context information for each user process, including the stacks used by each process when executing system calls.
Memory Protection
Towards Virtual Addresses Problems with base and bounds? Expandable heap? Expandable stack? Memory sharing between processes? Non-relative addresses – hard to move memory around Memory fragmentation
Virtual Addresses Translation done in hardware, using a table Table set up by operating system kernel On every instruction!
Privileged Instructions
Privileged instructions
Examples?
  Change the processor’s privilege level (on x86, the current privilege level held in the CS register)!
  Change which memory locations a user program can access
  Send commands to I/O devices
  Read data from/write data to I/O devices
  Jump into kernel code
  …
What should happen if a user program attempts to execute a privileged instruction?
Thought Experiment
How can we implement execution with limited privilege?
  Execute each program instruction in a simulator
  If the instruction is permitted, do the instruction
  Otherwise, stop the process
  Basic model in Javascript, … (a browser essentially simulates the execution of the script, one line at a time)
How do we go faster?
  Run the unprivileged code directly on the CPU?
  Upsides? Downsides to this approach?
Privilege Levels Some processor functionality cannot be made accessible to untrusted user applications e.g. HALT, Read from disk, set clock, reset devices, manipulate device settings, … Need to have a designated mediator between untrusted/untrusting applications The operating system (OS) Need to delineate between untrusted applications and OS code Use a “privilege mode” bit in the processor 0 = Untrusted = user, 1 = Trusted = OS
Privilege Mode Privilege mode bit indicates if the current program can perform privileged operations On system startup, privilege mode is set to 1, and the processor jumps to a well-known address The operating system (OS) boot code resides at this address The OS sets up the devices, initializes the MMU, loads applications, and resets the privilege bit before invoking the application Applications must transfer control back to OS for privileged operations
Challenge: Protection How do we execute code with restricted privileges? Either because the code is buggy or because it might be malicious Some examples: A script running in a web browser A program you just downloaded off the Internet A program you just wrote that you haven’t tested yet Not just about OSes; not just about bugs
Hardware Support: Dual-Mode Operation Kernel mode Execution with the full privileges of the hardware Read/write to any memory, access any I/O device, read/write any disk sector, send/read any packet User mode Limited privileges Only those granted by the operating system kernel On the x86, the mode (current privilege level) is stored in the low two bits of the CS register; EFLAGS holds the I/O privilege level Obviously, you need the part that has full rights to be really reliable!
A Model of a CPU
A CPU with Dual-Mode Operation
Hardware Support: Dual-Mode Operation Privileged instructions Available to kernel Not available to user code Limits on memory accesses To prevent user code from overwriting the kernel Timer To regain control from a user program in a loop Safe way to switch from user mode to kernel mode, and vice versa
Atomic Instructions Hardware needs to provide special instructions to enable concurrent programs to operate correctly
Virtual Address Layout Plus shared code segments, dynamically linked libraries, memory mapped files, …
Example: Corrected (What Does this Do?)
    #include <stdio.h>
    #include <unistd.h>

    int staticVar = 0;      // a static variable

    int main(void) {
        int localVar = 0;   // a procedure local variable
        staticVar += 1;
        localVar += 1;
        sleep(10);          // sleep causes the program to wait for 10 seconds
        // %p with a (void *) cast is the portable way to print an address
        printf("static address: %p, value: %d\n", (void *)&staticVar, staticVar);
        printf("procedure local address: %p, value: %d\n", (void *)&localVar, localVar);
        return 0;
    }
Produces (for example):
    static address: 0x5328, value: 1
    procedure local address: 0xffffffe2, value: 1
Because modern systems randomize the stack address (address space layout randomization, to make attacks harder), this won’t actually produce the same output when run repeatedly
Switch between hardware and kernel: Hardware Timer
Hardware Timer Hardware device that periodically interrupts the processor Returns control to the kernel timer interrupt handler Interrupt frequency set by the kernel Not by user code! Interrupts can be temporarily deferred Crucial for implementing mutual exclusion
Question For a “Hello world” program, the kernel must copy the string from the user program memory into the screen memory. Why must the screen’s buffer memory be protected?
Question Suppose we had a perfect object-oriented language and compiler, so that only an object’s methods could access the internal data inside an object. If the operating system only ran programs written in that language, would it still need hardware memory address protection?
Context switch between user-mode and kernel
Mode Switch
From user-mode to kernel
  Interrupts
    Triggered by timer and I/O devices
  Exceptions
    Triggered by unexpected program behavior
    Or malicious behavior!
  System calls (aka protected procedure call)
    Request by program for kernel to do some operation on its behalf
    Only limited # of very carefully coded entry points
Mode Switch From kernel-mode to user New process/new thread start Jump to first instruction in program/thread Return from interrupt, exception, system call Resume suspended execution Process/thread context switch Resume some other process User-level upcall Asynchronous notification to user program
Context switch Interrupts
Basic Computer Organization Memory CPU ?
Keyboard Let’s build a keyboard Lots of mechanical switches Need to convert to a compact form (binary) We’ll use a special mechanical switch that, when pressed, connects two wires simultaneously
Keyboard When a key is pressed, a 7-bit key identifier is computed + 3-bit encoder (4 to 3) 4-bit encoder (16 to 4) not all 16 wires are shown
Keyboard A latch can store the keystroke indefinitely + Latch 3-bit encoder (4 to 3) 4-bit encoder (16 to 4) not all 16 wires are shown
Keyboard CPU + 3-bit encoder (4 to 3) 4-bit encoder (16 to 4) not all 16 wires are shown Latch The keyboard can then appear to the CPU as if it is a special memory address
Device Interfacing Techniques Memory-mapped I/O Device communication goes over the memory bus Reads/Writes to special addresses are converted into I/O operations by dedicated device hardware Each device appears as if it is part of the memory address space Programmed I/O CPU has dedicated, special instructions CPU has additional input/output wires (I/O bus) Instruction specifies device and operation Memory-mapped I/O is the predominant device interfacing technique in use
Polling vs. Interrupts One design is the CPU constantly needs to read the keyboard latch memory location to see if a key is pressed Called polling Inefficient An alternative is to add extra circuitry so the keyboard can alert the CPU when there is a keypress Called interrupt driven I/O Interrupt driven I/O enables the CPU and devices to perform tasks concurrently, increasing throughput Only needs a tiny bit of circuitry and a few extra wires to implement the “alert” operation
Interrupt Driven I/O CPU Memory Interrupt Controller intr CPU dev id An interrupt controller mediates between competing devices Raises an interrupt flag to get the CPU’s attention Identifies the interrupting device Can disable (aka mask) interrupts if the CPU so desires
Interrupt Management Interrupt controllers manage interrupts Maskable interrupts: can be turned off by the CPU for critical processing Nonmaskable interrupts: signify serious errors (e.g. unrecoverable memory error, power-out warning, etc.) Interrupts contain a descriptor of the interrupting device A priority selector circuit examines all interrupting devices, reports highest level to the CPU Interrupt controller implements interrupt priorities Can optionally remap priority levels
How do we take interrupts safely? Interrupt vector Limited number of entry points into kernel Kernel interrupt stack Handler works regardless of state of user code Interrupt masking Handler is non-blocking Atomic transfer of control Single instruction to change: Program counter Stack pointer Memory protection Kernel/user mode Transparent restartable execution User program does not know interrupt occurred
Interrupt Vector Table set up by OS kernel; pointers to code to run on different events Note: by “processor register” I do not mean %eax. Rather – these are special purpose registers.
Interrupt Masking Interrupt handler runs with interrupts off Reenabled when interrupt completes OS kernel can also turn interrupts off E.g., when determining the next process/thread to run If interrupts are deferred too long, I/O events can be dropped
Interrupt Handlers Non-blocking, run to completion Minimum necessary to allow device to take next interrupt Any waiting must be limited duration Wake up other threads to do any real work Pintos: semaphore_up Rest of device driver runs as a kernel thread Queues work for interrupt handler (Sometimes) wait for interrupt to occur
Atomic Mode Transfer Context Switch On interrupt (x86) Save current stack pointer Save current program counter Save current processor status word (condition codes) Switch to kernel stack; put SP, PC, PSW on stack Switch to kernel mode Vector through interrupt table Interrupt handler saves registers it might clobber
Before
During
After
At end of handler Handler restores saved registers Atomically return to interrupted process/thread Restore program counter Restore program stack Restore processor status word/condition codes Switch to user mode
Exceptional Situations
System calls are control transfers to the OS, performed under the control of the user application
Sometimes, need to transfer control to the OS at a time when the user program least expects it
  Division by zero
  Alert from the power supply that electricity is about to go out
  Alert from the network device that a packet just arrived
  Clock notifying the processor that the clock just ticked
Some of these causes for interruption of execution have nothing to do with the user application
Need a (slightly) different mechanism, that allows resuming the user application
Interrupts & Exceptions On an interrupt or exception Switches the stack pointer to the kernel stack Saves the old (user) SP value Saves the old (user) Program Counter value Saves the old privilege mode Saves cause of the interrupt/exception Sets the new privilege mode to 1 Sets the new PC to the kernel interrupt/exception handler Kernel interrupt/exception handler handles the event Saves all registers Examines the cause Performs operation required Restores all registers Performs a “return from interrupt” instruction, which restores the privilege mode, SP and PC
Before
After
Interrupt Stack Per-processor, located in kernel (not user) memory Usually a thread has both: kernel and user stack Why can’t interrupt handler run on the stack of the interrupted user process?
Interrupt Stack
Context switch System Calls
System Calls A system call is a controlled transfer of execution from unprivileged code to the OS A potential alternative is to make OS code read-only, and allow applications to just jump to the desired system call routine. Why is this a bad idea? A SYSCALL instruction transfers control to a system call handler at a fixed address
System Calls
[Figure: write(fd, buf, len) is called in the user portion of the address space and traps into the kernel; the kernel portion of the address space holds kernel text, the kernel stack, and other stuff]
System Calls
Sole interface between user and kernel
Implemented as library routines that execute trap instructions to enter kernel
Errors indicated by returns of -1; error code is in errno
    if (write(fd, buffer, bufsize) == -1) {
        // error!
        printf("error %d\n", errno); // see perror
    }
System calls, such as fork, execv, read, write, etc., are the only means for application programs to communicate directly with the kernel: they form an API (application program interface) to the kernel. When a program calls such a routine, it is actually placing a call to a subroutine in a system library. The body of this subroutine contains a hardware-specific trap instruction which transfers control and some parameters to the kernel. On return to this library routine, the kernel provides an indication of whether or not there was an error and what the error was. The error indication is passed back to the original caller via the functional return value of the library routine. If there was an error, a positive-integer code identifying it is stored in the global variable errno. Rather than simply print this code out, as shown in the slide, one might instead print out an informative error message. This can be done via the perror routine.
Sample System Calls Print character to screen Needs to multiplex the shared screen resource between multiple applications Send a packet on the network Needs to manipulate the internals of a device whose hardware interface is unsafe Allocate a page Needs to update page tables & MMU
Syscall vs. Interrupt The differences lie in how they are initiated, and how much state needs to be saved and restored Syscall requires much less state saving Caller-save registers are already saved by the application Interrupts typically require saving and restoring the full state of the processor Because the application got struck by a lightning bolt without anticipating the control transfer
System Calls
Kernel System Call Handler Locate arguments In registers or on user(!) stack Copy arguments From user memory into kernel memory Protect kernel from malicious code evading checks Validate arguments Protect kernel from errors in user code Copy results back into user memory
SYSCALL instruction SYSCALL instruction does an atomic jump to a controlled location Switches the sp to the kernel stack Saves the old (user) SP value Saves the old (user) PC value (= return address) Saves the old privilege mode Sets the new privilege mode to 1 Sets the new PC to the kernel syscall handler Kernel system call handler carries out the desired system call Saves callee-save registers Examines the syscall number Checks arguments for sanity Performs operation Stores result in v0 Restores callee-save registers Performs a “return from syscall” instruction, which restores the privilege mode, SP and PC
Web Server Example
System Boot
System Boot Operating system must be made available to hardware so hardware can start it A small piece of code, the bootstrap loader, locates the kernel, loads it into memory, and starts it Sometimes a two-step process where a boot block at a fixed location loads the bootstrap loader When the system is powered on, execution starts at a fixed memory location Firmware used to hold initial boot code
Booting
Virtual Machines
Virtual Machines A virtual machine takes the layered approach to its logical conclusion. It treats hardware and the operating system kernel as though they were all hardware A virtual machine provides an interface identical to the underlying bare hardware The operating system host creates the illusion that a process has its own processor and (virtual) memory Each guest is provided with a (virtual) copy of the underlying computer
Virtual Machine
Virtual Machines (Cont) (a) Nonvirtual machine (b) Virtual machine
User-Level Virtual Machine How does VM Player work? Runs as a user-level application How does it catch privileged instructions, interrupts, device I/O, …? Installs kernel driver, transparent to host kernel Requires administrator privileges! Modifies interrupt table to redirect to kernel VM code If interrupt is for VM, upcall If interrupt is for another process, reinstalls interrupt table and resumes kernel
Context switch System Upcalls
Upcall: User-level interrupt AKA UNIX signal Notify user process of event that needs to be handled right away Time-slice for user-level thread manager Interrupt delivery for VM player Direct analogue of kernel interrupts Signal handlers – fixed entry points Separate signal stack Automatic save/restore registers – transparent resume Signal masking: signals disabled while in signal handler
Upcall: Before
Upcall: After
Terminology
Trap
  Any kind of a control transfer to the OS
Syscall
  Synchronous, program-initiated control transfer from user to the OS to obtain service from the OS
  e.g. SYSCALL
Exception
  Asynchronous, program-initiated control transfer from user to the OS in response to an exceptional event
  e.g. Divide by zero, segmentation fault
Interrupt
  Asynchronous, device-initiated control transfer from user to the OS
  e.g. Clock tick, network packet