LAIO: Lazy Asynchronous I/O for Event-Driven Servers. Khaled Elmeleegy and Alan L. Cox.


LAIO: Lazy Asynchronous I/O for Event-Driven Servers
Khaled Elmeleegy, Alan L. Cox

Outline
- Available I/O APIs and their shortcomings.
- Event-driven programming and its challenges.
- Lazy Asynchronous I/O (LAIO).
- Experiments and results.
- Conclusions.

Key Idea
- Existing I/O APIs fall short of the needs of event-driven servers.
- LAIO fixes that.

Non-Blocking I/O
- The system call may return without fully completing the operation (e.g., a write to a socket).
- It may also return with the operation completed.
- Disadvantages:
  - Not available for disk operations.
  - A program using it must maintain state across partial completions.

Asynchronous I/O (AIO)
- The system call returns immediately.
- The operation always runs to completion and delivers a notification on completion (via a signal, an event, or polling).
- Disadvantages:
  - Missing disk operations such as open and stat.
  - A completion notification is always delivered, even if the operation did not block, which lowers performance.

Event-Driven Programming with I/O (what we have)

    event_loop(..)
    {
        ...
        while (true) {
            event_list = get available events;
            for each event ev in event_list
                call handler of ev;
        }
    }

    handler(...)
    {
        ... /* do stuff 1 */
        open(..);   /* may block; if it blocks, the whole server stalls */
        ... /* do stuff 2 */
        return;     /* to event_loop */
    }

Event-Driven Programming with I/O (what we want)

    event_loop(..)
    {
        ...
        while (true) {
            event_list = get available events;
            for each event ev in event_list
                call event_handler of ev;
        }
    }

    handler1(...)
    {
        ... /* do stuff 1 */
        open(..);   /* may block */
        if (open blocks) {
            set handler2 as callback for open;
            return;  /* to event_loop */
        }
        ... /* do stuff 2 */
        return;      /* to event_loop */
    }

    handler2(...)
    {
        ... /* do stuff 2 */
        return;      /* to event_loop */
    }

Lazy Asynchronous I/O (LAIO)
- Like AIO when the operation blocks: asynchronous completion notification.
- Also like AIO, operations run in one shot; there are no partial completions.
- Like non-blocking I/O when the operation completes without blocking: the result is returned directly.
- Based on scheduler activations: an upcall delivered by the kernel when a thread blocks or unblocks.

LAIO API

    Function name                        Description
    int laio_syscall(int number, ...)    Performs the specified syscall asynchronously.
    void *laio_gethandle(void)           Returns a handle to the last LAIO operation.
    laio_list laio_poll(void)            Returns a list of handles to completed LAIO operations.

laio_syscall(int number, ...) control flow:
1. Enable upcalls and save the current context.
2. Invoke the system call.
3. If the system call does not block: disable upcalls and return its return value.
4. If it blocks: the kernel delivers an upcall; the upcall handler steals the old stack using the stored context, and laio_syscall returns -1 with errno = EINPROGRESS.

Experimental Setup
- Performance evaluated using both micro-benchmarks and event-driven web servers (thttpd and Flash).
- 2.4 GHz Pentium Xeon machines with 2 GB of RAM.
- FreeBSD 5 with KSE, FreeBSD's scheduler-activation implementation.
- Two web traces, Rice and Berkeley, with working-set sizes of 1.1 GB and 6.4 GB, respectively.

Micro-benchmarks
- Read a byte from a pipe 100,000 times, in two cases:
  - Non-blocking case (byte ready on the pipe): LAIO is 320% faster than AIO; LAIO is 40% slower than non-blocking I/O.
  - Blocking case (byte not ready on the pipe): AIO is 8% faster than LAIO.
- Call getpid(2) 1,000,000 times with KSE enabled and disabled: with KSE disabled the program was 5% faster (KSE overhead).

thttpd Experiments
- thttpd is an event-driven server, modified to use libevent, an event-notification library.
- Two versions of thttpd: libevent-thttpd and LAIO-thttpd.
- For LAIO-thttpd, event handlers were broken up around potentially blocking operations such as open.

thttpd Results (Berkeley Throughput)

thttpd Results (Berkeley Response Time)

thttpd Results (Rice Throughput)

thttpd Results (Rice Response Time)

thttpd Results (Rice Throughput 512 MB RAM)

thttpd Results (Rice Response Time 512 MB RAM)

Flash
- An event-driven web server with three flavors:
  - Pure event-driven.
  - AMPED (Asymmetric Multi-Process Event-Driven): an event-driven core with potentially blocking I/O handed off to a helper process; the helper does an explicit read to bring data into memory.
  - LAIO: uses LAIO to perform all I/O asynchronously.
- For each of the three flavors, files are sent either with sendfile(2) or using mmap(2).

Flash Experiments
- All experiments use 500 clients; all sockets are blocking.
- mmap: the file is mapped into memory, then written to the socket; page faults may occur, so mincore(2) is used to check whether pages are in memory.
- sendfile: the file is sent via the sendfile(2) syscall, which may block.
- Optimized sendfile: the kernel is modified so that sendfile returns if blocking on disk occurs.

Flash Throughput (mmap)

    Configuration   Flash-event (mmap)   Flash-AMPED (mmap)   Flash-LAIO (mmap)
    Berkeley-Cold   81 Mbps              134 Mbps             132 Mbps
    Berkeley-Warm   78 Mbps              127 Mbps             131 Mbps
    Rice-Cold       203 Mbps             386 Mbps             299 Mbps
    Rice-Warm       830 Mbps             800 Mbps             797 Mbps

For Rice-Cold, AMPED pays for callouts to the helper process and LAIO pays for page faults; the performance difference is due to prefetching.

Flash Throughput (sendfile)

    Configuration   Flash-event (sendfile)   Flash-AMPED (sendfile)   Flash-LAIO (sendfile)
    Berkeley-Cold   122 Mbps                 171 Mbps                 (not listed)
    Berkeley-Warm   125 Mbps                 180 Mbps                 179 Mbps
    Rice-Cold       277 Mbps                 398 Mbps                 382 Mbps
    Rice-Warm       845 Mbps                 843 Mbps                 815 Mbps

Conclusions
- LAIO overcomes the shortcomings of the other I/O APIs.
- LAIO is more than 3 times faster than AIO when the data is in memory.
- LAIO serves event-driven servers well:
  - LAIO increased thttpd throughput by 38%.
  - LAIO matched Flash's performance with no kernel modifications.

Questions?