W4118 Operating Systems Interrupt and System Call in Linux Instructor: Junfeng Yang.

Logistics  Find two teammates before next Thursday  Post ads in CourseWorks

Last lecture  OS structure  Monolithic v.s. microkernel  Modern OS: modules  Virtual machine  Intro to Linux  Interrupts in Linux

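Interrupts in Linux  [Figure: devices raise IRQs on the PIC, which can mask individual lines and signals INTR to the CPU over the bus; the CPU uses the interrupt number to index the IDT (entries 0 to 255, located through idtr), and the selected IDT entry points to the ISR]  How to handle interrupts?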

Today  Interrupts in Linux (cont.)  Interrupt handlers  System calls in Linux  Intro to Process

Nested Interrupts  What if a second interrupt occurs while an interrupt routine is executing?  Generally a good thing to permit that — is it possible?  And why is it a good thing?

Maximizing Parallelism  You want to keep all I/O devices as busy as possible  In general, an I/O interrupt represents the end of an operation; another request should be issued as soon as possible  Most devices don’t interfere with each other’s data structures; there’s no reason to block out other devices

Handling Nested Interrupts  Hardware invokes the handler with interrupts disabled  As soon as possible, the handler re-enables interrupts globally  Interrupts from the same IRQ line?  These should be processed serially  Thus, interrupts from the same IRQ line are not enabled during interrupt handling

Interrupt Handling Philosophy  To preserve IRQ order on the same line, must disable incoming interrupts on the same line  New interrupts can get lost if the controller’s buffer overflows  An interrupt preempts what the CPU was doing, which may be important  Even if it is not important, it is undesirable to block a user program for long  So, the handler must run for a very short time!  Do as little as possible in the interrupt handler  Often just: queue a work item and set a flag  Defer non-critical actions till later

Intr handlers have no process context!  Interrupts (as opposed to exceptions) are not associated with particular instructions, nor the current process. It’s like an unexpected jump. Why? Context switch expensive  Implication  Interrupt handlers cannot call functions that may sleep (i.e. yield CPU to scheduler) !  Why not? Scheduler only schedules processes, so wouldn’t know to reschedule the interrupt handler The current process may be doing something dangerous, and cannot sleep

Linux Interrupt Handler Structure  Top half (th) and bottom half (bh)  Top-half: do minimum work and return (ISR)  Bottom-half: deferred processing (softirqs, tasklets, workqueues)  [Figure: the top half hands work to the bottom half: softirq, tasklet, workqueue]

Top half  Perform minimal, common functions: saving registers, unmasking other interrupts. Eventually, undoes that: restores registers, returns to previous context.  Most important: call proper interrupt handler provided in device drivers (C program)  Typically queue the request and set a flag for deferred processing  [Figure: the top half changes the softirq flag from 0 to 1]

Deferrable Work  Three deferred work mechanisms: softirqs, tasklets, and work queues (tasklet built on top of softirq)  All of these use request queues  Think of requests as (function, args)  All can be interrupted  [Figure: the top half hands work to the bottom half: softirq, tasklet, workqueue]

Softirqs  Types are statically allocated: at kernel compile time  Limited number:
  Priority  Type
  0         High-priority tasklets (generic)
  1         Timer interrupts
  2         Network transmission
  3         Network reception
  4         SCSI disks
  5         Regular tasklets (generic)
 Singled out timer, net and SCSI disk because these are the most important for server performance  Each type mapped to a bit in a per-CPU bitmask. To raise a softirq = simply set bit  What does a softirq handler do?

Example: Network card SoftIRQ  Parse packet header  Verify checksum  Deliver packet up to the network stack  return Linux-2.6.11/net/core/dev.c, function net_rx_action

Running Softirqs  When to execute softirq?  Run at various points by the kernel, using current process’s context  Most important: after handling IRQs and after timer interrupts  Essentially polling  Problem: while processing one softirq, another is raised. Process it?  No → long delay for the new one  Always → starves user programs during a long softirq burst: livelock!

Livelock  100% CPU utilization, but no progress   Why?  Need user program to eventually process requests E.g. webserver  However, if too many interrupt requests, starve user program  Big deal for networking in 90s  Solution: Eliminating receive livelock in an interrupt-driven kernel, Jeffrey C. Mogul, K. K. Ramakrishnan Adopted into Linux

Avoid Livelock  Goal: provide user program fair share of CPU time despite interrupt burst  Quota + dedicated context ksoftirqd  Process up to N softirqs per softirq handler invocation  Bounds time spent in the handler  Process the rest in ksoftirqd  ksoftirqd is subject to scheduling, like a user process  Provides fairness to user processes

Tasklets  Problem: softirq is static  To add a new type of Softirq, need to convince Linus!  Solution: tasklets  Built on top of softirq  New types are created and destroyed dynamically  Simplified for multicore processing: at any time, only one tasklet among all of the same type can run  Problem with softirq and tasklets: they have no process contexts either, thus cannot sleep

Work Queues  Softirqs and tasklets run in an interrupt context; work queues have a process context  The idea:  You throw work (fn, args) to a workqueue  The workqueue adds it to an internal FIFO queue  A dedicated workqueue process loops forever, dequeuing (fn, args), and running fn(args)  Because they have a process context, they can sleep

Monitoring Interrupt Activity  Linux has a pseudo-file system, /proc, for monitoring (and sometimes changing) kernel behavior  Run cat /proc/interrupts to see what’s going on

            CPU0
   0:        162  IO-APIC-edge     timer
   1:          0  IO-APIC-edge     i8042
   4:         10  IO-APIC-edge
   7:          0  IO-APIC-edge     parport0
   8:    1232299  IO-APIC-edge     rtc
   9:          0  IO-APIC-fasteoi  acpi
  12:          1  IO-APIC-edge     i8042
  16:   19256781  IO-APIC-fasteoi  uhci_hcd:usb1, …
  17:         79  IO-APIC-fasteoi  uhci_hcd:usb2, uhci_hcd:usb4 …
# Columns: IRQ, count, interrupt controller, devices

Today  Interrupts in Linux (cont.)  System calls in Linux  Intro to Process

API – System Call – OS Relationship
  User mode:    { printf("hello world!\n"); }  →  libc: %eax = sys_write; int 0x80
  Kernel mode:  IDT entry 0x80  →  system_call() { fn = syscalls[%eax] }  →  syscalls table  →  sys_write(…) { // do real work }

System Calls vs. Library Calls  A library call that stays in user space is much faster than a system call, which must trap into the kernel  If you can do it in user space, you should  strlen?  write?  Learn what a library call or system call does:  Documents are called “manpages,” divided into sections  Library calls (section 3) e.g. man 3 strlen  System calls (section 2) e.g. man 2 write

Next: Syscall Wrapper Macros  [Diagram repeated: printf() in user mode → libc (%eax = sys_write; int 0x80) → IDT entry 0x80 → system_call() → syscalls table → sys_write()]

Syscall Wrapper Macros  Generating the assembly code for trapping into the kernel is complex so Linux provides a set of macros to do this for you!  Macros with name _syscallN(), where N is the number of system call parameters  _syscallN(return_type, name, arg1type, arg1name, …)  in linux-2.6.11/include/asm-i386/unistd.h  The macro expands to a wrapper function  Example:  long open(const char *filename, int flags, int mode);  _syscall3(long, open, const char *, filename, int, flags, int, mode)  NOTE: _syscallN is obsolete after 2.6.18; use syscall(…) instead, which can take a varying number of arguments

Lib call/Syscall Return Codes  Library calls return -1 on error and place a specific error code in the global variable errno  System calls return specific negative values to indicate an error  Most system calls return -errno  The library wrapper code is responsible for conforming the return values to the errno convention

Next: Syscall Implementation  [Diagram repeated: printf() in user mode → libc (%eax = sys_write; int 0x80) → IDT entry 0x80 → system_call() → syscalls table → sys_write()]

System call handler
    .section .text
system_call:
    // copy parameters from registers onto stack
    …
    call *sys_call_table(, %eax, 4)   // indirect call through the table entry
    jmp ret_from_sys_call
ret_from_sys_call:
    // perform rescheduling and signal-handling
    …
    iret    // return to caller (in user-mode)
// File arch/i386/kernel/entry.S
Why jump table? Can’t we use if-then-else?

The system-call jump-table  There are approximately 300 system-calls  Any specific system-call is selected by its ID-number (it’s placed into register %eax)  It would be inefficient to use if-else tests or even a switch-statement to transfer to the service-routine’s entry-point  Instead an array of function-pointers is directly accessed (using the ID-number)  This array is named ‘sys_call_table[]’  Defined in file arch/i386/kernel/entry.S

System call table definition
    .section .data
sys_call_table:
    .long sys_restart_syscall
    .long sys_exit
    .long sys_fork
    .long sys_read
    .long sys_write
    …
NOTE: syscall numbers cannot be reused (why?); deprecated syscalls are implemented by a special “not implemented” syscall (sys_ni_syscall)

Syscall Naming Convention  Usually a library function “foo()” will do some work and then call a system call (“sys_foo()”)  In Linux, all system calls begin with “sys_”  Often “sys_foo()” just does some simple error checking and then calls a worker function named “do_foo()”

Tracing System Calls  Linux has a powerful mechanism for tracing system call execution for a compiled application  Output is printed for each system call as it is executed, including parameters and return codes  The ptrace() system call is used  Also used by debuggers (breakpoint, singlestep, etc)  Use the “strace” command (man strace for info)  You can trace library calls using the “ltrace” command

Passing system call parameters  The first parameter is always the syscall #  eax on Intel  Linux allows up to six additional parameters  ebx, ecx, edx, esi, edi, ebp on Intel  System calls that require more parameters package the remaining params in a struct and pass a pointer to that struct as the sixth parameter  Problem: must validate pointers  Could be invalid, e.g. NULL → crash OS  Or worse, could point to OS or device memory → security hole

How to validate user pointers?  Too expensive to do a thorough check  Need to check that the pointer is within all valid memory regions of the calling process  Solution: No comprehensive check  Linux does a simple check for address pointers and only determines if pointer variables are within the largest possible range of user memory (more details when talking about process)  Even if a pointer value passes this check, it is still quite possible that the specific value is invalid  Dereferencing an invalid pointer in kernel code would normally be interpreted as a kernel bug and generate an Oops message on the console and kill the offending process  Linux does something very sophisticated to avoid this situation

Handling faults due to user-pointers  Kernel code must access user-pointers using a small set of “paranoid” routines (e.g. copy_from_user)  Thus, kernel knows what addresses in its code can throw invalid memory access exceptions (page fault)  When a page fault occurs, the kernel’s page fault handler checks the faulting EIP (recall: saved by hw)  If EIP matches one of the paranoid routines, kernel will not oops; instead, will call “fixup” code  Many violations of this rule in Linux. Once built a checker and found tons of security holes

Paranoid functions to access user pointers
  Function                                       Action
  get_user(), __get_user()                       reads integer (1, 2, 4 bytes)
  put_user(), __put_user()                       writes integer (1, 2, 4 bytes)
  copy_from_user(), __copy_from_user()           copy a block from user space
  copy_to_user(), __copy_to_user()               copy a block to user space
  strncpy_from_user(), __strncpy_from_user()     copies null-terminated string from user space
  strnlen_user(), __strnlen_user()               returns length of null-terminated string in user space
  clear_user(), __clear_user()                   fills memory area with zeros

How to find “fixup” code?  Exception table  Faulting instruction address → fixup code  On page fault, kernel scans exception table to find the fixup code  Typically the fixup code terminates the system call with an EFAULT error code (means: bad address)  Some ELF tricks help to generate exception table and implement fixup code; see ULK Chapter 10 for gruesome details

Intel Fast System Calls  int 0x80 not used any more (I lied …)  Intel has a hardware optimization (sysenter) that provides an optimized system call invocation  Read the gory details in ULK Chapter 10

Today  Interrupts in Linux (cont.)  System calls in Linux  Intro to Process  What are processes?  Why need them?

What is a Process  “Program in execution” “virtual CPU”  Process is an execution stream in the context of a particular process state  Execution stream: a stream of instructions  Running piece of code  sequential sequence of instructions

What is a Process? (cont.)  Process state: determines the effects of the instructions.  Stuff that the running code can affect or be affected by  Registers  General purpose, floating point, EIP …  Memory: everything a process can address  Code, data, stack, heap  I/O  File descriptor table  More …  [Figure: CPU registers, including IP and SP, pointing into memory: code, data, heap, stack]

Program v.s. Process  Process != program  Program: static code and static data  Process: dynamic instantiation of code and data  Process ↔ program: no 1:1 mapping  Process > program: code + data + other things  [Figure: the program is just the code main() { f(x); } f(int x) { }; the process is that code plus IP, registers, heap, and the stack for f()]

Program v.s. Process (cont.)  Process ↔ program: no 1:1 mapping  Program > process: one program can invoke multiple processes  E.g. shell can run commands in different processes  Process > program: can have multiple processes of the same program  E.g. Multiple users run multiple /usr/bin/tcsh

Address Space (AS)  More details when discussing memory management  AS = All memory a process can address + addresses  Virtual address space:  Really large memory to use  Linear array of bytes: [0, N), N roughly 2^32 or 2^64  Process and virtual address space: 1 : 1 mapping  Key: an AS is a protection domain  One process can’t address another process’s address space (without permission)  E.g. Value stored at 0x800abcd in p1 is different from the value stored at 0x800abcd in p2  Thus p1 can’t read/write p2’s memory

Process v.s. Threads  More details when discussing threads  Process != Threads  Threads: many streams of executions in one process  Threads share address space  [Figure: a single-threaded process has code (main() { f(x); } f(int x) { }), heap, and one IP, register set, and stack for f(); with threads, the code and heap are shared while each thread has its own IP, registers, and stack for f()]

Why need processes?  Divide and conquer  Decompose a large problem into smaller ones  Easier to think about well-contained smaller problems  Sequential  Easier to think about  Increase performance  System has many concurrent jobs going on

System categorization  Most OSes support processes  Uniprogramming: only one process at a time  Good: simple  Bad: low utilization, low interactivity  Multiprogramming: multiple processes at a time  When one proc blocks (e.g. I/O), switch to another  NOTE: different from multiprocessing (systems with multiple processors)  Good: increase utilization, interactivity  Bad: complex

Multiprogramming  OS support for multiprogramming  Policy: scheduling, what proc to run? (next week)  Mechanism: dispatching, how to run/block process? how to protect from one another?  Separation of policy and mechanism  Recurring theme in OS  Policy: decision making with some performance metric and workload Scheduling (next week)  Mechanism: low-level code to implement decisions Dispatching (today)

Process state diagram  Process state  New: being created  Running: instructions are running on CPU  Waiting: waiting for some event (e.g. IO)  Ready: waiting to be assigned a CPU  Terminated: finished

Process dispatching mechanism  OS stores processes on system-wide lists  OS dispatching loop:
while(1) {
    proc = choose(ready procs);
    load proc state;
    run proc for a while;
    save proc state;
}
Q1: how to gain control?  Q2: what state must be saved?

Backup slides

How to gain control?  Cooperative v.s. preemptive  Cooperative: processes voluntarily yield control to the OS  When? System call  Why bad? OS trusts process! Malicious process? Bugs?  Preemptive  When? Interrupt, especially timer (studied before)  OS trusts no one!

What state must be saved?  Registers  Memory?  I/O?  Context switch  How?  Why hard? Need code to save registers, but code needs registers  Why expensive? A lot of registers to load and store

Interrupt Return Code Path  Interleaved assembly entry points:  ret_from_exception()  ret_from_intr()  ret_from_sys_call()  ret_from_fork()  Things that happen:  Run scheduler if necessary  Return to user mode if no nested handlers Restore context, user-stack, switch mode Re-enable interrupts if necessary  Deliver pending signals  (Some DOS emulation stuff – VM86 Mode)

Interrupt Handling Portability  Which has a higher priority, a disk interrupt or a network interrupt?  Different CPU architectures make different decisions  By not assuming or enforcing any priority, Linux becomes more portable

Interrupt Stack  When an interrupt occurs, what stack is used?  Exceptions: The kernel stack of the current process, whatever it is, is used (There’s always some process running: the “idle” process, if nothing else)  Interrupts: hard IRQ stack (1 per processor)  SoftIRQs: soft IRQ stack (1 per processor)  These stacks are configured in the IDT and TSS at boot time by the kernel

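Invoking System Calls  [Figure: in user mode (restricted privileges), the application making the system call calls xyz(), the wrapper routine in the standard C library; the wrapper executes int 0x80 to enter kernel mode (unrestricted privileges), where the system call handler system_call: calls the system call service routine sys_xyz(); control returns from the service routine with ret, from the handler to the wrapper with iret, and from the wrapper to the application with ret]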

The ‘jump-table’ idea  [Figure: sys_call_table laid out as an array indexed from 0, one entry per system call: sys_restart_syscall, sys_exit, sys_fork, sys_read, sys_write, sys_open, sys_close, …etc…]

Mode vs. Space  Recall that an address space is the collection of valid addresses that a process can reference at any given time (0 … 4GB on a 32 bit processor)  The kernel is a logically separate address space (separate code/data) from all user processes  Trapping to the kernel logically involves changing the address space (like a process context switch)  Modern OSes use a trick to optimize system calls: every process gets at most 3GB of virtual addresses (0..3GB) and a copy of the kernel address space is mapped into *every* process address space (3..4GB)  The kernel code is mapped but not accessible in user-mode so processes can only “see” the kernel code when they trap into the kernel and the mode bit is changed  Pros: save TLB flushes, make copying in/out of user space easy  Cons: steals processor address space, limited kernel mapping

Process Address Space  [Figure: 0 GB to 3 GB is user space (privilege level 3): the task’s code and data, shared runtime libraries, and the user-mode stack area; 3 GB to 4 GB is kernel space (privilege level 0), including the kernel-mode stack]

Syscall Wrapper Macros  Generating the assembly code for trapping into the kernel is complex so Linux provides a set of macros to do this for you!  There are seven macros with names _syscallN() where N is a digit between 0 and 6 and N corresponds to the number of parameters  The macros are a bit strange and accept data types as parameters  For each macro there are 2 + 2*N parameters; the first two correspond to the return type of syscall (usually long) and the syscall name; the remaining 2*N parameters are the type and name of each syscall parameter  Example:  long open(const char *filename, int flags, int mode);  _syscall3(long, open, const char *, filename, int, flags, int, mode)

Linux Interrupt Summary  Types of interrupts  Device  IRQs  APICs  Interrupt  ISR  SoftIRQs, tasklets, workqueues, ksoftirqd  Interrupt context vs. process context  Who can block  Nested interrupt

Blocking System Calls  System calls often block in the kernel (e.g. waiting for IO completion)  When a syscall blocks, the scheduler is called and selects another process to run  Linux distinguishes “slow” and “fast” syscalls:  Slow: may block indefinitely (e.g. network read)  Fast: should eventually return (e.g. disk read)  Slow syscalls can be “interrupted” by the delivery of a signal (e.g. Control-C)

Nested Execution  Interrupts can be interrupted  By different interrupts; handlers need not be reentrant  No notion of priority in Linux  Small portions execute with interrupts disabled  Interrupts remain pending until acked by CPU  Exceptions can be interrupted  By interrupts (devices needing service)  Exceptions can nest two levels deep  Exceptions indicate coding error  Exception code (kernel code) shouldn’t have bugs  Page fault is possible (trying to touch user data)

Example: Network ISR (receive)  If the NIC has incoming packets  Device has already copied the packet to an OS buffer  ISR allocates a descriptor (sk_buff) for this buffer  ISR queues this descriptor  ISR raises a softirq (== set a flag, explained next)  return (will enable interrupt)

More: lspci
$ cat /proc/pci
PCI devices found:
  Bus 0, device 0, function 0:
    Host bridge: PCI device 8086:2550 (Intel Corp.) (rev 3).
      Prefetchable 32 bit memory at 0xe8000000 [0xebffffff].
  Bus 0, device 29, function 1:
    USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 2).
      IRQ 19.
      I/O at 0xd400 [0xd41f].
  Bus 0, device 31, function 1:
    IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 2).
      IRQ 16.
      I/O at 0xf000 [0xf00f].
      Non-prefetchable 32 bit memory at 0x80000000 [0x800003ff].
  Bus 3, device 1, function 0:
    Ethernet controller: Broadcom NetXtreme BCM5703X Gigabit Eth (rev 2).
      IRQ 48.
      Master Capable. Latency=64. Min Gnt=64.
      Non-prefetchable 64 bit memory at 0xf7000000 [0xf700ffff].
