Capriccio: Scalable Threads for Internet Services, by von Behren et al. Presenter: Godmar Back (CS 5204)
What Happened So Far 1978: Lauer & Needham, Duality: event-based and thread-based models are duals. Programs can be transformed from one model to the other, and performance can be preserved across the transformation. This did not settle the issue. Fast forward 18 years to 1996: the Internet is becoming popular, and multi-threaded OSes are becoming commonplace.
Why Threads Are A Bad Idea 1996 talk by John Ousterhout: "Why threads are a bad idea" Threads are hard to program (synchronization is error-prone; deadlock, convoying, etc.); for experts only Threads can make modularization difficult Thread implementations are hard to make fast Threads aren't well supported (as of 1996) (Ousterhout's) Conclusion: use threads only when their power is needed, i.e., for true CPU concurrency on an SMP; otherwise use a single-threaded, event-based model CS 5204 Fall 2007
Thread/process-based Server Models Option A: fork a new process for every connection on-demand Option B: fork a new thread for every connection to handle it Option C/D: pre-fork a certain number of processes (or threads) and hand connections to them
Server Models (2) A/B: the number of processes/threads grows and shrinks with load C/D: fixed number of workers Q.: When would you use which?
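Model A (fork a process per connection) can be sketched in a few lines of C. This is a hedged illustration only: a socketpair() stands in for a socket returned by accept(), and serve_one and its canned reply are made-up names, not part of any real server.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Model A in miniature: handle one "connection" by forking a child.
 * A socketpair stands in for a connected socket returned by accept(). */
int serve_one(char *buf, size_t len) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                      /* child: the per-connection process */
        close(sv[0]);
        const char *reply = "HTTP/1.0 200 OK";
        write(sv[1], reply, strlen(reply));
        close(sv[1]);
        _exit(0);
    }
    close(sv[1]);                        /* parent: read the child's reply */
    ssize_t n = read(sv[0], buf, len - 1);
    buf[n > 0 ? n : 0] = '\0';
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

A real model-A server would loop on accept() and fork once per returned fd; models C/D instead fork the workers up front and hand fds to them.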
Most Widely Used Servers 1995-2007 (Source: netcraft.com) Typical Apache: pre-forked model, usually 5-25 processes
The /. effect Sudden spikes in load Need to build scalable server systems Even without /. effect, better scalability means better cost efficiency Q.: Which model provides highest throughput, given a fixed set of hardware resources, while providing for a robust and understandable programming model?
High-Concurrency Servers [Figure: performance vs. load (concurrent tasks). The ideal curve stays flat; at the peak some resource is at its maximum; under overload some resource thrashes. Source: von Behren 2003] Key goal: maintain throughput, the measure of useful work seen by clients. CPU and resource management is critical: must weigh which connections to accept vs. which to drop, and ensure that requests already in the pipeline are completed
Motivation for Events Performance results with LinuxThreads (pre-NPTL). (Source: Welsh 2001) NB: thread packages no longer perform this poorly.
Flash (Pai et al. 1999) Fast event-based web server Uses the Unix select() call to multiplex network connections Used helper processes for disk I/O; today one could use Linux aio For more recent alternatives, see [Chandra & Mosberger 2001] and Kegel's "C10k" site at http://www.kegel.com/c10k.html
Event-based Model (Welsh et al. 2001): SEDA Stages with explicit resource and concurrency control
Capriccio Thread-based model is widely used: allows linear control flow But the event-based model has been shown to outperform the thread-based model Q.: Is it possible to get the performance of the event-based model while maintaining the ease of programming of the threaded model? Fix the underlying threading system! (And never mind Ousterhout: "Why Events Are A Bad Idea" [von Behren 2003])
Idea: Write a scalable threading package What scalability issues plague threading packages? (a) Memory cost of many thread stacks, particularly virtual memory (b) High cost of context switching (c) Lack of explicit load control (all load looks to the scheduler like a "runnable thread")
Capriccio Solutions (a) Use Linked Stacks (b) Use user-level threading (c) Deduce resource usage from blocking graph
(a) Linked Stacks The compiler assumes the stack is unlimited and contiguous: "sub $n, %esp; mov _, (%esp)" will never fail. Moreover, the stack is not movable. Threading packages usually do static, conservative (worst-case) stack allocation, which leads to increased virtual memory use (plus fragmentation) on 32-bit systems. Linux: 4GB total, 3GB for processes; some of that is needed for heap, shared libraries, code, etc. Say 2GB is left for thread stacks. At 128KB per stack, that caps out at 16,384 threads. Idea: allocate stack space on demand, and change the compiler to do so
Compiler Analysis Whole-program analysis computes a static, weighted call graph Nodes: functions Edges: call-sites Node weight: size of activation record of that function
Computing Node Weights Compute the size of each function's local variables:

int sample() {
    int a, b;
    int c[10];
    double d;
    char e[8036];
    return 0;
}
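For sample() above, the node weight is roughly the sum of its local variable sizes. A real compiler also adds alignment padding, saved registers, and a return address; sample_frame_weight is an illustrative stand-in for the analysis, not Capriccio's code.

```c
#include <stddef.h>

/* Approximate activation-record weight of sample(): the sum of its locals.
 * Padding, saved registers, and the return address are deliberately ignored. */
size_t sample_frame_weight(void) {
    return 2 * sizeof(int)        /* a, b */
         + sizeof(int[10])        /* c    */
         + sizeof(double)         /* d    */
         + sizeof(char[8036]);    /* e    */
}
```

On a typical machine with 4-byte int and 8-byte double this comes to 8 + 40 + 8 + 8036 = 8092 bytes, so almost the entire frame is the 8 KB character buffer.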
Sizing stack chunks Extremes: Allocate all stack space beforehand: would only work if no recursion is present, and even then there is huge external wasted space. Allocate each function's stack space as it is entered: no external wasted space, but extremely expensive. Compromise, with configurable parameters: allocate at least MinChunk for each stack piece, and introduce stack breaks either on recursion or to break a path after length MaxPath
Example

main()   { char buf[512];  A(); C(); }
void A() { char buf[820];  B(); D(); }
void B() { char buf[1024]; }
void C() { char buf[205];  D(); E(); }
void D() { char buf[205];  }
void E() { char buf[205];  C(); }

Note the recursive cycle C -> E -> C.
External vs. Internal Wastage [Figure: stack chunk layout for the example with MaxPath = 1 KB, MinChunk = 1 KB, showing internal waste and external waste.]
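The break-placement rule can be sketched as a single pass over one call path: start a new chunk whenever adding the next frame would push the current chunk past MaxPath, and round every chunk up to MinChunk. Here plan_chunks is an illustrative reconstruction under those assumptions, not Capriccio's actual analysis.

```c
#include <stddef.h>

/* Walk one call path (frame sizes in bytes) and plan stack chunks.
 * A chunk is closed when the next frame would exceed max_path;
 * every chunk is rounded up to at least min_chunk bytes.
 * Returns total bytes reserved; *nchunks receives the chunk count. */
size_t plan_chunks(const size_t *frames, int n,
                   size_t max_path, size_t min_chunk, int *nchunks) {
    size_t total = 0, cur = 0;
    *nchunks = 0;
    for (int i = 0; i < n; i++) {
        if (cur > 0 && cur + frames[i] > max_path) {  /* break the path here */
            total += cur < min_chunk ? min_chunk : cur;
            (*nchunks)++;
            cur = 0;
        }
        cur += frames[i];
    }
    if (cur > 0) {                                    /* close the last chunk */
        total += cur < min_chunk ? min_chunk : cur;
        (*nchunks)++;
    }
    return total;
}
```

For the path main -> A -> B (512 + 820 + 1024 bytes) with MaxPath = MinChunk = 1 KB, this reserves three 1 KB chunks; the 2356 bytes actually used leave 716 bytes of internal waste.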
(b) User-level Threading Capriccio uses highly efficient, non-preemptive (cooperative) user-level threading. The scheduler works like an event loop: it waits for one or more fds to become ready, then schedules a thread waiting on such an fd. Scalable data structures. Pays a slight price for doing asynchronous I/O at user level.
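Cooperative user-level switching of the kind Capriccio relies on can be demonstrated with POSIX ucontext. This is a sketch only: Capriccio has its own fast context-switch code, and worker_a, worker_b, and run_round_robin are made-up names. Two threads run until they yield, and a scheduler context round-robins between them.

```c
#include <string.h>
#include <ucontext.h>

#define STACK_SZ 65536

static ucontext_t sched_ctx, tctx[2];
static char stacks[2][STACK_SZ];
static char trace[8];
static int tlen;

/* Cooperative yield: save this thread's context, resume the scheduler. */
static void yield_to_sched(int id) { swapcontext(&tctx[id], &sched_ctx); }

static void worker_a(void) { trace[tlen++] = 'a'; yield_to_sched(0); trace[tlen++] = 'c'; }
static void worker_b(void) { trace[tlen++] = 'b'; yield_to_sched(1); trace[tlen++] = 'd'; }

/* Run both workers to completion in two round-robin scheduling rounds;
 * returns the order in which the workers actually ran. */
const char *run_round_robin(void) {
    tlen = 0;
    for (int i = 0; i < 2; i++) {
        getcontext(&tctx[i]);
        tctx[i].uc_stack.ss_sp = stacks[i];
        tctx[i].uc_stack.ss_size = STACK_SZ;
        tctx[i].uc_link = &sched_ctx;    /* return to scheduler on exit */
        makecontext(&tctx[i], i == 0 ? worker_a : worker_b, 0);
    }
    for (int round = 0; round < 2; round++) {  /* the event-loop-like scheduler */
        swapcontext(&sched_ctx, &tctx[0]);
        swapcontext(&sched_ctx, &tctx[1]);
    }
    trace[tlen] = '\0';
    return trace;
}
```

Because the switch is just a user-level register save/restore with no kernel crossing, it is far cheaper than a kernel thread context switch, which is exactly the cost item (b) targets.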
(c) Blocking Graphs Resource needs are deduced from the blocking graph. [Figure: blocking graph for a web server, with nodes Accept, Open, Read, Write, Close.] Nodes represent program points plus backtraces. Edges are annotated with the time it takes to get from node to node. Capriccio monitors and schedules the resources associated with blocking points.
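The edge annotations can be pictured as simple bookkeeping at blocking points: each time a thread blocks, record which node it came from, which node it reached, and how long it ran in between. record_block and avg_edge_ms are illustrative names for this sketch, not Capriccio's API.

```c
#include <string.h>

#define MAX_EDGES 16

/* One edge of the blocking graph: from-node, to-node, accumulated run time. */
struct edge { const char *from, *to; double total_ms; long count; };

static struct edge edges[MAX_EDGES];
static int nedges;

/* Called when a thread blocks at `to`, having last blocked at `from`
 * and having run for ran_ms in between. */
void record_block(const char *from, const char *to, double ran_ms) {
    for (int i = 0; i < nedges; i++)
        if (!strcmp(edges[i].from, from) && !strcmp(edges[i].to, to)) {
            edges[i].total_ms += ran_ms;
            edges[i].count++;
            return;
        }
    edges[nedges++] = (struct edge){from, to, ran_ms, 1};
}

/* Average edge time, or -1.0 if the edge was never observed. */
double avg_edge_ms(const char *from, const char *to) {
    for (int i = 0; i < nedges; i++)
        if (!strcmp(edges[i].from, from) && !strcmp(edges[i].to, to))
            return edges[i].total_ms / edges[i].count;
    return -1.0;
}
```

The scheduler can then consult these averages, e.g., to favor threads whose next edge is short or whose blocking point holds a scarce resource.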
Performance Results ~15% speedup for Apache running on Capriccio