Capriccio: Scalable Threads for Internet Services

Slides:

Advertisements

Similar presentations

Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley.

Advertisements

Chess Review May 8, 2003 Berkeley, CA Compiler Support for Multithreaded Software Jeremy ConditRob von Behren Feng ZhouEric Brewer George Necula.

1 SEDA: An Architecture for Well- Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University.

CS533 Concepts of Operating Systems Jonathan Walpole.

1 Capriccio: Scalable Threads for Internet Services Matthew Phillips.

SEDA: An Architecture for Well- Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University.

Why Events Are A Bad Idea (for high-concurrency servers) By Rob von Behren, Jeremy Condit and Eric Brewer.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of.

Capriccio: Scalable Threads for Internet Services ( by Behren, Condit, Zhou, Necula, Brewer ) Presented by Alex Sherman and Sarita Bafna.

Operating System Support Focus on Architecture

Precept 3 COS 461. Concurrency is Useful Multi Processor/Core Multiple Inputs Don’t wait on slow devices.

Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu.

Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, George Necula and Eric Brewer University of California at Berkeley.

Capriccio: Scalable Threads For Internet Services Authors: Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, Eric Brewer Presentation by: Will.

Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley.

Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.

Why Events Are a Bad Idea (for high-concurrency servers) Author: Rob von Behren, Jeremy Condit and Eric Brewer Presenter: Wenhao Xu Other References: [1]

“Why Events are a Bad Idea (For high-concurrency servers)” Paper by Rob von Behren, Jeremy Condit and Eric Brewer, May 2003 Presentation by Loren Davis,

Operational Analysis L. Grewe. Operational Analysis Relationships that do not require any assumptions about the distribution of service times or inter-arrival.

Computer Organization and Architecture

CS533 Concepts of Operating Systems Class 3 Integrated Task and Stack Management.

CPS110: Implementing threads/locks on a uni-processor Landon Cox.

A. Frank - P. Weisberg Operating Systems Introduction to Tasks/Threads.

Why Events Are A Bad Idea (for high-concurrency servers) Rob von Behren, Jeremy Condit and Eric Brewer Computer Science Division, University of California.

CS 3013 & CS 502 Summer 2006 Threads1 CS-3013 & CS-502 Summer 2006.

Why Events Are A Bad Idea (for high-concurrency servers) Rob von Behren, Jeremy Condit and Eric Brewer University of California at Berkeley

Layers and Views of a Computer System Operating System Services Program creation Program execution Access to I/O devices Controlled access to files System.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

CSCI-1680 Network Programming II Rodrigo Fonseca.

Why Events Are A Bad Idea (for high-concurrency servers) Rob von Behren, Jeremy Condit and Eric Brewer Presented by: Hisham Benotman CS533 - Concepts of.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services by, Matt Welsh, David Culler, and Eric Brewer Computer Science Division University.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Servers: Concurrency and Performance Jeff Chase Duke University.

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Presented by Changdae Kim and Jaeung Han OFFENCE.

Concurrent Programming. Concurrency  Concurrency means for a program to have multiple paths of execution running at (almost) the same time. Examples:

Scalable Internet Services Cluster Lessons and Architecture Design for Scalable Services BR01, TACC, Porcupine, SEDA and Capriccio.

Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley.

Lecture 5: Threads process as a unit of scheduling and a unit of resource allocation processes vs. threads what to program with threads why use threads.

5204 – Operating Systems Threads vs. Events. 2 CS 5204 – Operating Systems Forms of task management serial preemptivecooperative (yield) (interrupt)

Presenting: Why Events Are A “Bad” Idea ( for high-concurrency servers) Paper by: Behren, Condit and Brewer, 2003 Presented by: Rania Elnaggar 1/30/2008.

CSE 60641: Operating Systems Next topic: CPU (Process/threads/scheduling, synchronization and deadlocks) –Why threads are a bad idea (for most purposes).

Capriccio: Scalable Threads for Internet Service

By: Rob von Behren, Jeremy Condit and Eric Brewer 2003 Presenter: Farnoosh MoshirFatemi Jan

SEDA An architecture for Well-Conditioned, scalable Internet Services Matt Welsh, David Culler, and Eric Brewer University of California, Berkeley Symposium.

Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004.

Threads versus Events CSE451 Andrew Whitaker. This Class Threads vs. events is an ongoing debate  So, neat-and-tidy answers aren’t necessarily available.

1 Why Events Are A Bad Idea (for high-concurrency servers) By Rob von Behren, Jeremy Condit and Eric Brewer (May 2003) CS533 – Spring 2006 – DONG, QIN.

An Efficient Threading Model to Boost Server Performance Anupam Chanda.

CS533 Concepts of Operating Systems Jonathan Walpole.

Paper Review of Why Events Are A Bad Idea (for high-concurrency servers) Rob von Behren, Jeremy Condit and Eric Brewer By Anandhi Sundaram.

Capriccio:Scalable Threads for Internet Services

SEDA: An Architecture for Scalable, Well-Conditioned Internet Services

Capriccio : Scalable Threads for Internet Services

Threads vs. Events SEDA – An Event Model 5204 – Operating Systems.

Why Events Are A Bad Idea (for high-concurrency servers)

Processes and Threads Processes and their scheduling

CS399 New Beginnings Jonathan Walpole.

William Stallings Computer Organization and Architecture

Presenter: Godmar Back

Capriccio – A Thread Model

CSE 451: Operating Systems Spring 2012 Module 6 Review of Processes, Kernel Threads, User-Level Threads Ed Lazowska 570 Allen.

Operating Systems Lecture 1.

Why Threads Are A Bad Idea (for most purposes)

Why Events Are a Bad Idea (for high concurrency servers)

Why Threads Are A Bad Idea (for most purposes)

Why Threads Are A Bad Idea (for most purposes)

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

CS 5204 Operating Systems Lecture 5

Dynamic Binary Translators and Instrumenters

Threads CSE 2431: Introduction to Operating Systems

Presentation transcript:

Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, George Necula and Eric Brewer University of California at Berkeley {jrvb, jcondit, zf, necula, brewer}@cs.berkeley.edu http://capriccio.cs.berkeley.edu

Load (concurrent tasks) The Stage Highly concurrent applications Internet servers & frameworks Flash, Ninja, SEDA Transaction processing databases Workload High performance Unpredictable load spikes Operate “near the knee” Avoid thrashing! Ideal Peak: some resource at max Performance Overload: some resource thrashing Load (concurrent tasks)

The Price of Concurrency What makes concurrency hard? Race conditions Code complexity Scalability (no O(n) operations) Scheduling & resource sensitivity Inevitable overload Performance vs. Programmability No current system solves Must be a better way! Threads Ideal Ease of Programming Events Threads Performance

The Answer: Better Threads Goals Simple programming model Good tools & infrastructure Languages, compilers, debuggers, etc. Good performance Claims Threads are preferable to events User-Level threads are key

“But Events Are Better!” Recent arguments for events Lower runtime overhead Better live state management Inexpensive synchronization More flexible control flow Better scheduling and locality All true but… Lauer & Needham duality argument Criticisms of specific threads packages No inherent problem with threads! Thread implementations can be improved

Threading Criticism: Runtime Overhead Criticism: Threads don’t perform well for high concurrency Response Avoid O(n) operations Minimize context switch overhead Simple scalability test Slightly modified GNU Pth Thread-per-task vs. single thread Same performance!

Threading Criticism: Synchronization Criticism: Thread synchronization is heavyweight Response Cooperative multitasking works for threads, too! Also presents same problems Starvation & fairness Multiprocessors Unexpected blocking (page faults, etc.) Both regimes need help Compiler / language support for concurrency Better OS primitives

Threading Criticism: Scheduling Criticism: Thread schedulers are too generic Can’t use application-specific information Response 2D scheduling: task & program location Threads schedule based on task only Events schedule by location (e.g. SEDA) Allows batching Allows prediction for SRCT Threads can use 2D, too! Runtime system tracks current location Call graph allows prediction Task Program Location

Threading Criticism: Scheduling Criticism: Thread schedulers are too generic Can’t use application-specific information Response 2D scheduling: task & program location Threads schedule based on task only Events schedule by location (e.g. SEDA) Allows batching Allows prediction for SRCT Threads can use 2D, too! Runtime system tracks current location Call graph allows prediction Task Program Location Threads

Threading Criticism: Scheduling Criticism: Thread schedulers are too generic Can’t use application-specific information Response 2D scheduling: task & program location Threads schedule based on task only Events schedule by location (e.g. SEDA) Allows batching Allows prediction for SRCT Threads can use 2D, too! Runtime system tracks current location Call graph allows prediction Task Program Location Events Threads

The Proof’s in the Pudding User-level threads package Subset of pthreads Intercept blocking system calls No O(n) operations Support > 100K threads 5000 lines of C code Simple web server: Knot 700 lines of C code Similar performance Linear increase, then steady Drop-off due to poll() overhead 100 200 300 400 500 600 700 800 900 1 4 16 64 256 1024 4096 16384 KnotC (Favor Connections) KnotA (Favor Accept) Haboob Concurrent Clients Mbits / second

Arguments For Threads More natural programming model Control flow is more apparent Exception handling is easier State management is automatic Better fit with current tools & hardware Better existing infrastructure

Arguments for Threads: Control Flow Events obscure control flow For programmers and tools Web Server Accept Conn. Threads Events thread_main(int sock) { struct session s; accept_conn(sock, &s); read_request(&s); pin_cache(&s); write_response(&s); unpin(&s); } pin_cache(struct session *s) { pin(&s); if( !in_cache(&s) ) read_file(&s); AcceptHandler(event e) { struct session *s = new_session(e); RequestHandler.enqueue(s); RequestHandler(struct session *s) { …; CacheHandler.enqueue(s); CacheHandler(struct session *s) { pin(s); if( !in_cache(s) ) ReadFileHandler.enqueue(s); else ResponseHandler.enqueue(s); . . . ExitHandler(struct session *s) { …; unpin(&s); free_session(s); } Read Request Pin Cache Read File Write Response Exit

Arguments for Threads: Control Flow Events obscure control flow For programmers and tools Web Server Accept Conn. Threads Events thread_main(int sock) { struct session s; accept_conn(sock, &s); read_request(&s); pin_cache(&s); write_response(&s); unpin(&s); } pin_cache(struct session *s) { pin(&s); if( !in_cache(&s) ) read_file(&s); CacheHandler(struct session *s) { pin(s); if( !in_cache(s) ) ReadFileHandler.enqueue(s); else ResponseHandler.enqueue(s); RequestHandler(struct session *s) { …; CacheHandler.enqueue(s); . . . ExitHandler(struct session *s) { …; unpin(&s); free_session(s); AcceptHandler(event e) { struct session *s = new_session(e); RequestHandler.enqueue(s); } Read Request Pin Cache Read File Write Response Exit

Arguments for Threads: Exceptions Exceptions complicate control flow Harder to understand program flow Cause bugs in cleanup code Web Server Accept Conn. Threads Events thread_main(int sock) { struct session s; accept_conn(sock, &s); if( !read_request(&s) ) return; pin_cache(&s); write_response(&s); unpin(&s); } pin_cache(struct session *s) { pin(&s); if( !in_cache(&s) ) read_file(&s); CacheHandler(struct session *s) { pin(s); if( !in_cache(s) ) ReadFileHandler.enqueue(s); else ResponseHandler.enqueue(s); RequestHandler(struct session *s) { …; if( error ) return; CacheHandler.enqueue(s); . . . ExitHandler(struct session *s) { …; unpin(&s); free_session(s); AcceptHandler(event e) { struct session *s = new_session(e); RequestHandler.enqueue(s); } Read Request Pin Cache Read File Write Response Exit

Arguments for Threads: State Management Events require manual state management Hard to know when to free Use GC or risk bugs Web Server Accept Conn. Threads Events thread_main(int sock) { struct session s; accept_conn(sock, &s); if( !read_request(&s) ) return; pin_cache(&s); write_response(&s); unpin(&s); } pin_cache(struct session *s) { pin(&s); if( !in_cache(&s) ) read_file(&s); CacheHandler(struct session *s) { pin(s); if( !in_cache(s) ) ReadFileHandler.enqueue(s); else ResponseHandler.enqueue(s); RequestHandler(struct session *s) { …; if( error ) return; CacheHandler.enqueue(s); . . . ExitHandler(struct session *s) { …; unpin(&s); free_session(s); AcceptHandler(event e) { struct session *s = new_session(e); RequestHandler.enqueue(s); } Read Request Pin Cache Read File Write Response Exit

Arguments for Threads: Existing Infrastructure Lots of infrastructure for threads Debuggers Languages & compilers Consequences More amenable to analysis Less effort to get working systems

Building Better Threads Goals Simplify the programming model Thread per concurrent activity Scalability (100K+ threads) Support existing APIs and tools Automate application-specific customization Mechanisms User-level threads Plumbing: avoid O(n) operations Compile-time analysis Run-time analysis

The Case for User-Level Threads Decouple programming model and OS Kernel threads Abstract hardware Expose device concurrency User-level threads Provide clean programming model Expose logical concurrency Benefits of user-level threads Control over concurrency model! Independent innovation Enables static analysis Enables application-specific tuning App User Threads OS OS – CPUs, disk network UL Threads – independent sessions, parallel accesses

The Case for User-Level Threads Decouple programming model and OS Kernel threads Abstract hardware Expose device concurrency User-level threads Provide clean programming model Expose logical concurrency Benefits of user-level threads Control over concurrency model! Independent innovation Enables static analysis Enables application-specific tuning App Threads User OS – CPUs, disk network UL Threads – independent sessions, parallel accesses OS

Capriccio Internals Cooperative user-level threads Kernel Mechanisms Fast context switches Lightweight synchronization Kernel Mechanisms Asynchronous I/O (Linux) Efficiency Avoid O(n) operations Fast, flexible scheduling

Safety: Linked Stacks The problem: fixed stacks Overflow vs. wasted space Limits thread numbers The solution: linked stacks Allocate space as needed Compiler analysis Add runtime checkpoints Guarantee enough space until next check overflow waste Linked Stack

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 Questions for Jeremy: Q: Why don’t we know exactly the amount of stack used? Q: What is the stack layout? (where do the args of the callee go?) 2 4 3 6 MaxPath = 8

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 2 4 3 6 MaxPath = 8

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 2 4 3 6 MaxPath = 8

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 2 4 3 6 MaxPath = 8

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 2 4 3 6 MaxPath = 8

Linked Stacks: Algorithm Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack 3 3 2 5 2 4 3 6 MaxPath = 8

Scheduling: The Blocking Graph Web Server Lessons from event systems Break app into stages Schedule based on stage priorities Allows SRCT scheduling, finding bottlenecks, etc. Capriccio does this for threads Deduce stage with stack traces at blocking points Prioritize based on runtime information Accept Read Open Read Close Write Close

Resource-Aware Scheduling Track resources used along BG edges Memory, file descriptors, CPU Predict future from the past Algorithm Increase use when underutilized Decrease use near saturation Advantages Operate near the knee w/o thrashing Automatic admission control Web Server Accept Read Open Read Close Write Close

Thread Performance Slightly slower thread creation Capriccio Capriccio-notrace LinuxThreads NPTL Thread Creation 21.5 37.5 17.7 Context Switch 0.56 0.24 0.71 0.65 Uncontested mutex lock 0.04 0.14 0.15 Time of thread operations (microseconds) Slightly slower thread creation Faster context switches Even with stack traces! Much faster mutexes

Runtime Overhead Tested Apache 2.0.44 Stack linking 78% slowdown for null call 3-4% overall Resource statistics 2% (on all the time) 0.1% (with sampling) Stack traces 8% overhead

Web Server Performance

The Future: Compiler-Runtime Integration Insight Automate things event programmers do by hand Additional analysis for other things Specific targets Live state management Synchronization Static blocking graph Improve performance and decrease complexity

Conclusions Threads > Events Capriccio simplifies concurrency Equivalent performance Reduced complexity Capriccio simplifies concurrency Scalable & high performance Control over concurrency model Stack safety Resource-aware scheduling Enables compiler support, invariants Themes User-level threads are key Compiler-runtime integration very promising Threads Capriccio Ease of Programming Events Mention stack linking and RA scheduling here? Threads Performance

Apache Blocking Graph

Microbenchmark: Buffer Cache

Microbenchmark: Disk I/O

Microbenchmark: Producer / Consumer

Microbenchmark: Pipe Test

Threads v.s. Events: The Duality Argument General assumption: follow “good practices” Observations Major concepts are analogous Program structure is similar Performance should be similar Given good implementations! Web Server Accept Conn. Read Request Threads Events Monitors Exported functions Call/return and fork/join Wait on condition variable Event handler & queue Events accepted Send message / await reply Wait for new messages Pin Cache Different example would be better Problem here is that web server structure doesn’t match L&N as well, since there is little shared data Read File Write Response Exit

Threads v.s. Events: The Duality Argument General assumption: follow “good practices” Observations Major concepts are analogous Program structure is similar Performance should be similar Given good implementations! Web Server Accept Conn. Read Request Threads Events Monitors Exported functions Call/return and fork/join Wait on condition variable Event handler & queue Events accepted Send message / await reply Wait for new messages Pin Cache Different example would be better Problem here is that web server structure doesn’t match L&N as well, since there is little shared data Read File Write Response Exit

Threads v.s. Events: The Duality Argument General assumption: follow “good practices” Observations Major concepts are analogous Program structure is similar Performance should be similar Given good implementations! Web Server Accept Conn. Read Request Threads Events Monitors Exported functions Call/return and fork/join Wait on condition variable Event handler & queue Events accepted Send message / await reply Wait for new messages Pin Cache Different example would be better Problem here is that web server structure doesn’t match L&N as well, since there is little shared data Read File Write Response Exit

Threads v.s. Events: Can Threads Outperform Events? Function pointers & dynamic dispatch Limit compiler optimizations Hurt branch prediction & I-cache locality More context switches with events? Example: Haboob does 6x more than Knot Natural result of queues More investigation needed!

Threading Criticism: Live State Management Event State (heap) Criticism: Stacks are bad for live state Response Fix with compiler help Stack overflow vs. wasted space Dynamically link stack frames Retain dead state Static lifetime analysis Plan arrangement of stack Put some data on heap Pop stack before tail calls Encourage inefficiency Warn about inefficiency Thread State (stack) Live Dead Live Unused

Threading Criticism: Control Flow Criticism: Threads have restricted control flow Response Programmers use simple patterns Call / return Parallel calls Pipelines Complicated patterns are unnatural Hard to understand Likely to cause bugs