Servers: Concurrency and Performance
Jeff Chase, Duke University

HTTP Server
– Creates a socket (socket)
– Binds to an address (bind)
– Listens, setting up the accept backlog (listen)
– Can call accept to block waiting for connections
– (Can call select to check for data on multiple sockets)
Handle request
– GET /index.html HTTP/1.0\n\n
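
As a rough sketch (not the slides' own code), the sequence above might look like this in C; the port 8080 and buffer size are arbitrary example choices, and error handling is omitted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);       /* create a TCP socket */

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);                     /* arbitrary example port */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));  /* bind to an address */

        listen(lfd, 128);                                /* set up the accept backlog */

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);           /* block waiting for a connection */
            char buf[4096];
            ssize_t n = read(cfd, buf, sizeof(buf));     /* e.g., "GET /index.html HTTP/1.0" */
            (void)n;                                     /* ... parse request, send response ... */
            close(cfd);
        }
    }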

Inside your server
Request flow: packet queues → listen queue → accept queue → server application (Apache, Tomcat/Java, etc.)
Measures: offered load, response time, throughput, utilization

Example: Video On Demand

Client() {
  fd = connect("server");
  write(fd, "video.mpg");
  while (!eof(fd)) {
    read(fd, buf);
    display(buf);
  }
}

Server() {
  while (1) {
    cfd = accept();
    read(cfd, name);
    fd = open(name);
    while (!eof(fd)) {
      read(fd, block);
      write(cfd, block);
    }
    close(cfd); close(fd);
  }
}

[MIT/Morris]
How many clients can the server support? Suppose, say, a 200 Kb/s video on a 100 Mb/s network link?

Performance “analysis”
Server capacity:
– Network (100 Mb/s)
– Disk (20 MB/s)
Obtained performance: one client stream. The server is limited by its software structure: if a video is 200 Kb/s, a 100 Mb/s link should support about 500 concurrent streams (100,000 Kb/s ÷ 200 Kb/s), not one.
[MIT/Morris]

WebServer Flow

TCP socket space:
– state: listening; address: {*:6789, *:*}; completed connection queue; sendbuf; recvbuf
– state: listening; address: {*:25, *:*}; completed connection queue; sendbuf; recvbuf
– state: established; address: {<local addr>:6789, <remote addr>:<port>}; sendbuf; recvbuf

Server loop:
Create ServerSocket
connSocket = accept()
read request from connSocket
read local file
write file to connSocket
close connSocket

Discussion: what does each step do, and how long does it take?

Web Server Processing Steps
Accept Client Connection (may block waiting on network)
Read HTTP Request Header
Find File
Send HTTP Response Header
Read File (may block waiting on disk I/O)
Send Data (may block waiting on network)
We want to be able to process requests concurrently.

Process States and Transitions
States: running (user), running (kernel), ready, blocked.
Transitions: trap/return (user ↔ kernel), Sleep (running → blocked), Wakeup (blocked → ready), Run (ready → running), Yield or interrupt/exception (running → ready).

Server Blocking
– accept() when no connect requests are waiting on the listen queue
  What if the server has multiple ports to listen on? E.g., 80 for HTTP, 443 for HTTPS.
– open/read/write on server files
– read() on a socket, if the client is sending too slowly
– write() on a socket, if the client is receiving too slowly
  Yup, TCP has flow control, like pipes.
What if the server blocks while serving one client, and another client has work to do?

Under the Hood
Requests start (arrival rate λ), circulate between the CPU and an I/O device (I/O request → I/O completion), and exit (throughput λ, until some service center saturates).

Concurrency and Pipelining
Before: each request uses the CPU, disk, and network in sequence, so only one resource is busy at a time.
After: stages of multiple requests are overlapped (pipelined), so the CPU, disk, and network can all be busy concurrently.

Better single-server performance
Goal: run at the server’s hardware speed
– Disk or network should be the bottleneck
Method:
– Pipeline blocks of each request
– Multiplex requests from multiple clients
Two implementation approaches:
– Multithreaded server
– Asynchronous I/O
[MIT/Morris]

Concurrent threads or processes
Use multiple threads/processes
– so that only the flow processing a particular request is blocked
– Java: extend Thread or implement the Runnable interface
Example: a multithreaded web server, which creates a thread for each request

Multiple Process Architecture
Each of N processes runs the full pipeline (Accept Conn → Read Request → Find File → Send Header → Read File → Send Data) in a separate address space.
Advantages
– Simple programming while addressing the blocking issue
Disadvantages
– Many processes; large context-switch overheads
– Consumes much memory
– Optimizations that share information among processes (e.g., caching) are harder
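
A minimal sketch of the process-per-connection pattern, assuming POSIX fork; setup_listener and handle_request are hypothetical helpers, not part of the original slides:

    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int setup_listener(void);           /* hypothetical: socket/bind/listen as sketched earlier */
    void handle_request(int cfd);       /* hypothetical: read request, find file, send response */

    int main(void) {
        signal(SIGCHLD, SIG_IGN);       /* let the kernel reap exited children */
        int lfd = setup_listener();
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (fork() == 0) {          /* child: its own address space */
                close(lfd);
                handle_request(cfd);    /* a blocked child stalls only its own request */
                close(cfd);
                _exit(0);
            }
            close(cfd);                 /* parent: drop its copy of the connection */
        }
    }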

Using Threads
Each of N threads runs the same pipeline (Accept Conn → Read Request → Find File → Send Header → Read File → Send Data), sharing one address space.
Advantages
– Lower context-switch overheads
– Shared address space simplifies optimizations (e.g., caches)
Disadvantages
– Need kernel-level threads (why?)
– Some extra memory needed to support multiple stacks
– Need thread-safe programs, synchronization

Multithreaded server

server() {
  while (1) {
    cfd = accept();
    read(cfd, name);
    fd = open(name);
    while (!eof(fd)) {
      read(fd, block);
      write(cfd, block);
    }
    close(cfd); close(fd);
  }
}

for (i = 0; i < 10; i++)
  threadfork(server);

When a thread waits for I/O, the thread scheduler runs another thread.
What about references to shared data? Synchronization.
[MIT/Morris]
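
A hedged translation of this pseudocode to POSIX threads; setup_listener and serve_client are hypothetical helpers standing in for the socket setup and the copy loop above, and the pool of 10 threads matches the threadfork loop:

    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int setup_listener(void);            /* hypothetical: socket/bind/listen */
    void serve_client(int cfd);          /* hypothetical: the read/open/copy loop above */

    static int lfd;                      /* shared listening socket */

    void *server(void *arg) {
        (void)arg;
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);   /* each call returns a distinct connection */
            serve_client(cfd);                   /* blocking here stalls only this thread */
            close(cfd);
        }
        return NULL;
    }

    int main(void) {
        lfd = setup_listener();
        pthread_t t[10];
        for (int i = 0; i < 10; i++)             /* "threadfork" -> pthread_create */
            pthread_create(&t[i], NULL, server, NULL);
        for (int i = 0; i < 10; i++)
            pthread_join(t[i], NULL);
    }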

Event-Driven Programming
One execution stream: no CPU concurrency.
Register interest in events (callbacks).
An event loop waits for events and invokes handlers.
No preemption of event handlers; handlers are generally short-lived.
[Ousterhout 1995]

Single Process Event Driven (SPED)
Single-threaded, with asynchronous (non-blocking) I/O: one event dispatcher drives all stages (Accept Conn → Read Request → Find File → Send Header → Read File → Send Data).
Advantages
– Single address space
– No synchronization
Disadvantages
– In practice, disk reads still block

Asynchronous Multi-Process Event Driven (AMPED)
Like SPED, but uses helper processes/threads for disk I/O; the event dispatcher communicates with helpers via IPC.
Advantages
– Shared address space for most web server functions
– Concurrency for disk I/O
Disadvantages
– IPC between the main thread and helper threads
This hybrid model is used by the “Flash” web server.

Event-Based Concurrent Servers Using I/O Multiplexing
Maintain a pool of connected descriptors. Repeat the following forever:
– Use the Unix select function to block until:
  (a) a new connection request arrives on the listening descriptor, or
  (b) new data arrives on an existing connected descriptor.
– If (a), add the new connection to the pool of connections.
– If (b), read any available data from the connection; close the connection on EOF and remove it from the pool.
[CMU]

Select
If a server has many open sockets, how does it know when one of them is ready for I/O?

int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

select has scalability issues; alternative event interfaces (e.g., epoll on Linux, kqueue on BSD) have been offered.
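
A hedged sketch of the select loop described two slides back (same hypothetical setup_listener helper as before; error handling omitted):

    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int setup_listener(void);   /* hypothetical: socket/bind/listen */

    int main(void) {
        int lfd = setup_listener();
        fd_set pool;                          /* the pool of connected descriptors */
        FD_ZERO(&pool);
        FD_SET(lfd, &pool);
        int maxfd = lfd;

        for (;;) {
            fd_set ready = pool;              /* select overwrites its argument */
            select(maxfd + 1, &ready, NULL, NULL, NULL);

            if (FD_ISSET(lfd, &ready)) {      /* (a) new connection request */
                int cfd = accept(lfd, NULL, NULL);
                FD_SET(cfd, &pool);
                if (cfd > maxfd) maxfd = cfd;
            }
            for (int fd = 0; fd <= maxfd; fd++) {
                if (fd == lfd || !FD_ISSET(fd, &ready)) continue;
                char buf[4096];               /* (b) data on an existing connection */
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0) {                 /* EOF (or error): remove from pool */
                    close(fd);
                    FD_CLR(fd, &pool);
                }
                /* ... otherwise process buf ... */
            }
        }
    }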

Asynchronous I/O

struct callback {
  bool (*is_ready)();
  void (*cb)(void *arg);
  void *arg;
}

main() {
  while (1) {
    for (c = each callback) {
      if (c->is_ready())
        c->cb(c->arg);
    }
  }
}

Code is structured as a collection of handlers.
Handlers are nonblocking.
Create new handlers (callbacks) for blocking operations; when the operation completes, call the handler.
[MIT/Morris]

Asynchronous server

init() { on_accept(accept_cb); }
accept_cb() { on_readable(cfd, name_cb); }

on_readable(fd, fn) {
  c = new callback(test_readable, fn, fd);
  add c to callback list;
}

name_cb(cfd) {
  read(cfd, name);
  fd = open(name);
  on_readable(fd, read_cb);
}

read_cb(cfd, fd) {
  read(fd, block);
  on_writable(cfd, write_cb);
}

write_cb(cfd, fd) {
  write(cfd, block);
  on_readable(fd, read_cb);
}

[MIT/Morris]

Multithreaded vs. Async

Multithreaded:
– Hard to program: locking code; need to know what blocks
– Coordination explicit
– State stored on the thread’s stack; memory allocation implicit
– Context switch may be expensive
– Exploits multiprocessors

Async:
– Hard to program: callback code; need to know what blocks
– Coordination implicit
– State passed around explicitly; memory allocation explicit
– Lightweight context switch
– Suited to uniprocessors

[MIT/Morris]

Coordination example
Threaded server:
– A thread for the network interface
– An interrupt wakes up the network thread
– A protected (locks and condition variables) buffer shared between server threads and the network thread (sketched below)
Asynchronous I/O:
– Poll for packets. How often to poll?
– Or, an interrupt generates an event. Be careful: disable interrupts when manipulating the callback queue.
[MIT/Morris]
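
A minimal sketch of such a protected shared buffer, assuming pthreads; the capacity of 64 is an arbitrary choice:

    #include <pthread.h>

    #define CAP 64

    struct buffer {
        void *items[CAP];
        int head, tail, count;
        pthread_mutex_t lock;            /* initialize with pthread_mutex_init */
        pthread_cond_t nonempty, nonfull;/* initialize with pthread_cond_init */
    };

    /* network thread: called after the interrupt handler hands off a packet */
    void put(struct buffer *b, void *pkt) {
        pthread_mutex_lock(&b->lock);
        while (b->count == CAP)
            pthread_cond_wait(&b->nonfull, &b->lock);   /* wait for space */
        b->items[b->tail] = pkt;
        b->tail = (b->tail + 1) % CAP;
        b->count++;
        pthread_cond_signal(&b->nonempty);              /* wake a server thread */
        pthread_mutex_unlock(&b->lock);
    }

    /* server threads: block here until the network thread delivers work */
    void *get(struct buffer *b) {
        pthread_mutex_lock(&b->lock);
        while (b->count == 0)
            pthread_cond_wait(&b->nonempty, &b->lock);
        void *pkt = b->items[b->head];
        b->head = (b->head + 1) % CAP;
        b->count--;
        pthread_cond_signal(&b->nonfull);
        pthread_mutex_unlock(&b->lock);
        return pkt;
    }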

Threads! One View

Should You Abandon Threads?
No: important for high-end servers (e.g., databases).
But avoid threads wherever possible:
– Use events, not threads, for GUIs, distributed systems, low-end servers.
– Only use threads where true CPU concurrency is needed.
– Where threads are needed, isolate their usage in a threaded application kernel: keep most of the code single-threaded, with event-driven handlers around it.
[Ousterhout 1995]

Another view
Events obscure control flow, for programmers and tools.
Web server stages: Accept Conn → Read Request → Pin Cache → Read File → Write Response → Exit

Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  read_request(&s);
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);
  if (!in_cache(s))
    read_file(s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  ...; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if (!in_cache(s)) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
...
ExitHandler(struct session *s) {
  ...; unpin(s); free_session(s);
}

[von Behren]

Exceptions
Exceptions complicate control flow:
– Harder to understand program flow
– Cause bugs in cleanup code

Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  if (!read_request(&s))
    return;
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);
  if (!in_cache(s))
    read_file(s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  ...; if (error) return; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if (!in_cache(s)) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
...
ExitHandler(struct session *s) {
  ...; unpin(s); free_session(s);
}

With early returns, cleanup code is easily skipped: every exit path must release whatever the request has acquired (e.g., the pinned cache entry).
[von Behren]

State Management
Events require manual state management: in the event code above, the session state is passed explicitly from handler to handler, and it is hard to know when to free it. Use GC or risk bugs.
In the threaded code, live state sits on the thread’s stack and memory allocation is implicit.
[von Behren]

Internet Growth and Scale
The Internet: how to handle all those client requests raining on your server?

Servers Under Stress
Performance vs. load (concurrent requests, or arrival rate):
– Ideal: throughput rises with load.
– Peak: some resource is at its maximum.
– Overload: some resource is thrashing, and performance collapses.
[von Behren]

Response Time Components
Response time = wire time (request) + queuing time + service demand + wire time (response).
Latency depends on the cost/length of the request and on load conditions at the server (offered load).

Queuing Theory for Busy People
Big assumptions:
– The queue is first-come-first-served (FIFO, FCFS).
– Request arrivals are independent (Poisson arrivals).
– Requests have independent service demands.
– i.e., the arrival interval and service demand are exponentially distributed (denoted “M”).
M/M/1 service center: offered load (request arrival rate λ) waits in the queue, then is processed with mean service demand D.

Utilization
What is the probability that the center is busy?
– Answer: some number between 0 and 1.
What percentage of the time is the center busy?
– Answer: some number between 0 and 100.
These are interchangeable: called utilization U.
If the center is not saturated, i.e., it completes all its requests in some bounded time, then U = λD (arrivals per unit time × mean service demand): the “Utilization Law”.
The probability that the service center is idle is 1 − U.

Little’s Law
For an unsaturated queue in steady state, the mean response time R and mean queue length N are governed by Little’s Law: N = λR.
Why: suppose a task T is in the system for R time units. During that time:
– λR new tasks arrive.
– N tasks depart (all the tasks ahead of T).
In steady state, the flow in balances the flow out.
Note: this means that throughput X = λ.

Inverse Idle Time “Law”
Little’s Law gives response time R = D/(1 − U).
Intuitively, each task T’s response time is R = D + DN: its own demand, plus one demand D for each task ahead of it.
Substituting λR for N: R = D + DλR.
Substituting U for λD: R = D + UR, so R − UR = D, R(1 − U) = D, and R = D/(1 − U).
The service center saturates as 1/λ approaches D: small increases in λ cause large increases in the expected response time R.
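
A quick worked example (the numbers are illustrative, not from the slides). With mean service demand D = 10 ms and arrival rate λ = 50 requests/s:

\[
U = \lambda D = 50 \times 0.010 = 0.5, \qquad
R = \frac{D}{1-U} = \frac{0.010}{0.5} = 20\ \mathrm{ms}, \qquad
N = \lambda R = 50 \times 0.020 = 1.
\]

Pushing λ to 95 requests/s gives U = 0.95 and R = 0.010/0.05 = 200 ms: a 90% increase in load, a 10x increase in response time.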

Why Little’s Law Is Important
1. An intuitive understanding of FCFS queue behavior.
– Compute response time from demand parameters (λ, D).
– Compute N: how much storage is needed for the queue.
2. The notion of a saturated service center.
– Response times rise rapidly with load and are unbounded.
– At 50% utilization, a 10% increase in load increases R by 10%.
– At 90% utilization, a 10% increase in load increases R by 10x.
3. A basis for predicting the performance of queuing networks.
– Cheap and easy “back of napkin” estimates of system performance based on observed behavior and proposed changes, e.g., capacity planning and “what if” questions.

What does this tell us about server behavior at saturation?

Under the Hood
Requests start (arrival rate λ), circulate between the CPU and an I/O device (I/O request → I/O completion), and exit (throughput λ, until some service center saturates).

Common Bottlenecks
– No more file descriptors
– Sockets stuck in TIME_WAIT
– High memory use (swapping)
– CPU overload
– Interrupt (IRQ) overload
[Aaron Bannert]

Scaling Server Sites: Clustering
A smart switch with virtual IP addresses (VIPs) directs clients to a server array.
Goals: server load balancing, failure detection, access control filtering, priorities/QoS, request locality, transparent caching.
What to switch/filter on?
– L3: source IP and/or VIP
– L4 (TCP): ports, etc.
– L7: URLs and/or cookies, SSL session IDs

Scaling Services: Replication
Distribute service load across multiple sites (Site A, Site B, ...) over the Internet.
How to select a server site for each client or request? Is it scalable?

Extra Slides (Any new information on the following slides will not be tested.)

Problems of the Multithreaded Server
High resource usage, context-switch overhead, contended locks.
Too many threads → throughput meltdown, response time explosion.
Solution: bound the total number of threads (see the sketch below).
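
One hedged way to bound both the thread count and the request backlog, using a fixed worker pool with POSIX semaphores; NTHREADS, QDEPTH, and the helpers are illustrative assumptions:

    #include <pthread.h>
    #include <semaphore.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define QDEPTH   64

    int setup_listener(void);            /* hypothetical: socket/bind/listen */
    void handle_request(int cfd);        /* hypothetical: serve one request */

    static int queue[QDEPTH];
    static int qhead, qtail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static sem_t slots, items;           /* bound the backlog and count pending work */

    void *worker(void *arg) {
        (void)arg;
        for (;;) {
            sem_wait(&items);            /* sleep until a connection is queued */
            pthread_mutex_lock(&qlock);
            int cfd = queue[qhead];
            qhead = (qhead + 1) % QDEPTH;
            pthread_mutex_unlock(&qlock);
            sem_post(&slots);
            handle_request(cfd);
            close(cfd);
        }
        return NULL;
    }

    int main(void) {
        sem_init(&slots, 0, QDEPTH);
        sem_init(&items, 0, 0);
        pthread_t t[NTHREADS];           /* fixed pool: never more than NTHREADS */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);

        int lfd = setup_listener();
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            sem_wait(&slots);            /* applies backpressure when the queue is full */
            pthread_mutex_lock(&qlock);
            queue[qtail] = cfd;
            qtail = (qtail + 1) % QDEPTH;
            pthread_mutex_unlock(&qlock);
            sem_post(&items);
        }
    }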

Event-Driven Programming
Event-driven programming, also called asynchronous I/O.
Uses finite state machines (FSMs) to track the progress of requests.
Yields efficient and scalable concurrency.
Many examples: the Click router, the Flash web server, TP monitors, etc.
Java: asynchronous I/O – for an example see:

Traditional Processes
Expensive and “heavyweight”:
– One system call per process
– Fork overhead
– Coordination

Events
– Need async I/O and select
– Wasn’t originally available; not standardized; immature
– But efficient
– Code is distributed all through the program; harder to debug and understand

Threads
– Separate interface and implementation
– Pthreads interface
– Implementation is user-level or kernel (“native”)
– If user-level, needs async I/O underneath
– But the abstraction is hidden behind the thread interface

Reference
Valeria Cardellini, Emiliano Casalicchio, Michele Colajanni, and Philip S. Yu. “The State of the Art in Locally Distributed Web-server Systems.”