Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, George Necula, and Eric Brewer Presented by Guoyang Chen
Overview Motivation Threads vs. Events User-Level Threads Capriccio Implementation Linked Stack Management Resource-Aware Scheduling Evaluation
Motivation High demand for web content Web services are getting more complex, requiring maximum server performance What is a well-conditioned service?
Well-conditioned Service A well-conditioned service behaves like a simple pipeline As offered load increases, throughput increases proportionally When saturated, throughput does not degrade substantially [Figure: throughput vs. load (concurrent tasks); ideal peak when some resource is at max; performance overload when some resource thrashes]
Threads vs. Events Thread-Based Concurrency There are two common ways to build a concurrent server: thread-based programming and event-based programming. Starting with the thread-based model: each incoming request is dispatched to a separate thread, which processes the request and returns a result to the client. Other I/O operations, such as disk access, are not shown here, but would be incorporated into each thread's request processing. However, too many threads lead to high resource usage, context-switch overhead, and contended locks Traditional solution: bound the total number of threads But how do you determine the ideal number of threads?
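The thread-per-request dispatch described above can be sketched in C with POSIX threads. This is a minimal, hypothetical simulation: `handle_request` just squares a request id, whereas a real server would parse the request and perform network and disk I/O.

```c
#include <pthread.h>

/* Hypothetical stand-in for request processing: each "request"
 * just squares its id. A real server would read from a socket,
 * do disk I/O, and write a response here. */
static void *handle_request(void *arg) {
    long id = (long)arg;
    return (void *)(id * id);
}

/* Dispatch n "requests", one thread per request, then gather
 * the results -- the basic thread-per-request model. */
int run_requests(long n, long *results) {
    pthread_t tid[64];
    if (n > 64)
        return -1;  /* keep the sketch bounded */
    for (long i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, handle_request, (void *)i))
            return -1;
    for (long i = 0; i < n; i++) {
        void *res;
        pthread_join(tid[i], &res);
        results[i] = (long)res;
    }
    return 0;
}
```

With kernel threads, each of these threads costs stack memory and kernel scheduling state, which is exactly why bounding the thread count becomes necessary at scale.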
Threads vs. Events Event-Based Concurrency Small number of event-processing threads with many FSMs Yields efficient and scalable concurrency Many examples: Click router, Flash web server, TP Monitors, etc.
SEDA Staged Event-Driven Architecture (SEDA) Decompose service into stages separated by queues Each stage performs a subset of request processing Stages internally event-driven Each stage contains a thread pool to drive stage execution However, threads are not exposed to applications Dynamic control grows/shrinks thread pools with demand
Drawbacks of Events Event systems hide the control flow Difficult to understand and debug Eventually evolve into call-and-return event pairs Programmers need to match related events Need to save/restore state Events require manual state management Capriccio: instead of the event-based model, fix the thread-based model
Threads vs. Events Why Threads? More natural programming model Control flow is more apparent Exception handling is easier State management is automatic Better fit with current tools & hardware Better existing infrastructure
Capriccio Goals Simplify the programming model: thread per concurrent activity Scalability (100K+ threads) Support existing APIs and tools Automate application-specific customization Mechanisms User-level threads Plumbing: avoid O(n) operations Compile-time analysis Run-time analysis
Thread Design Principles Decouple programming model and OS Kernel threads: abstract hardware, expose device concurrency User-level threads: provide a clean programming model, expose logical concurrency [Figure: App / User-Level Threads / OS layering]
Thread Design and Scalability User-Level Threads Flexibility Capriccio can use new asynchronous I/O mechanisms without changing application code The user-level thread scheduler can be built along with the application Extremely lightweight Performance Reduced thread synchronization overhead No kernel crossings for mutex acquisition or release More efficient memory management at user level
Drawbacks of User-Level Threads An increased number of kernel crossings Each blocking I/O call is replaced by a non-blocking mechanism (epoll) A wrapper layer translates blocking calls to non-blocking ones Difficult to use multiple processors Synchronization is no longer "for free"
Capriccio Implementation A user-level threading library. All thread operations are O(1) Linked stacks Address the problem of stack allocation for large numbers of threads Combination of compile-time and run-time analysis Resource-aware scheduler
Context Switches Built on top of Edgar Toernig’s coroutine library Fast context switches when threads yield
I/O Capriccio intercepts blocking I/O calls Uses epoll for non-blocking I/O
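The interception idea can be sketched as a wrapper around `read()` (Linux-only, since it uses epoll). This is a simplified illustration, not Capriccio's actual code: it sets the fd non-blocking, and where a real scheduler would switch to another user thread on would-block, this sketch simply parks on `epoll_wait`.

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Sketch of blocking-call interception: the fd is made
 * non-blocking; when a read would block, we wait on epoll.
 * In Capriccio, this wait point is where the user-level
 * scheduler runs another thread instead of blocking. */
ssize_t wrapped_read(int fd, void *buf, size_t len) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EAGAIN)
            return n;  /* data, EOF, or a real error */
        /* Would block: wait until the fd becomes readable.
         * (A real scheduler yields to another thread here.) */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
        epoll_wait(ep, &ev, 1, -1);
        close(ep);
    }
}
```

A production wrapper would keep one long-lived epoll instance shared by the scheduler rather than creating one per call.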
Scheduling Very much like an event-driven application Events are hidden from programmers
Synchronization Supports cooperative threading on single-CPU machines Requires only Boolean checks
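Because the threads are cooperative on a single CPU, a thread can only lose the processor at a yield point, so a mutex really can be a plain Boolean check. A minimal sketch, with `thread_yield` as a hypothetical scheduler hook (stubbed out here):

```c
/* Cooperative mutex: with user-level threads on one CPU, no
 * other thread can run between the check and the set, so a
 * plain Boolean flag suffices -- no atomic instructions and
 * no kernel crossing. */
typedef struct { int held; } coop_mutex;

/* Hypothetical scheduler hook; a real one switches threads. */
static void thread_yield(void) { /* no-op stub */ }

void coop_lock(coop_mutex *m) {
    while (m->held)      /* safe: we can't be preempted here */
        thread_yield();  /* let the holder run and release it */
    m->held = 1;
}

void coop_unlock(coop_mutex *m) {
    m->held = 0;
}
```

This is why the slide can claim uncontended locking costs only a few instructions, versus the atomic operations or syscalls kernel-thread mutexes need.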
Threading Microbenchmarks SMP, two 2.4 GHz Xeon processors 1 GB memory two 10 K RPM SCSI Ultra II hard drives Linux 2.5.70 Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux
Latencies of Thread Primitives (NPTL: Native POSIX Thread Library; times in microseconds)
Thread creation: Capriccio 21.5, LinuxThreads (not shown), NPTL 17.7
Thread context switch: Capriccio 0.24, LinuxThreads 0.71, NPTL 0.65
Uncontended mutex lock: Capriccio 0.04, LinuxThreads 0.14, NPTL 0.15
Thread Scalability Producers put empty messages into a shared buffer, and consumers “process” each message by looping for a random amount of time.
I/O Performance Network performance measured by passing a number of tokens among pipes Simulates the effect of slow client links 10% overhead compared to epoll Twice as fast as both LinuxThreads and NPTL with more than 1000 threads Disk I/O is comparable to kernel threads
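The token-passing benchmark can be sketched in miniature. This single-process simplification keeps the structure (a token hopping through a ring of pipes, each hop modeling one thread blocking on a read while another writes) without the thread scheduling the real benchmark measures.

```c
#include <unistd.h>

/* Minimal sketch of the token-passing network benchmark:
 * a token travels around a ring of pipes. Each hop stands in
 * for one user thread blocking on read() while another thread
 * performs the matching write(). Returns total hops, or -1. */
int pass_token(int npipes, int rounds) {
    int fds[16][2];
    if (npipes > 16)
        return -1;
    for (int i = 0; i < npipes; i++)
        if (pipe(fds[i]))
            return -1;
    char tok = 'T';
    int hops = 0;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < npipes; i++) {
            if (write(fds[i][1], &tok, 1) != 1) return -1;
            if (read(fds[i][0], &tok, 1) != 1) return -1;
            hops++;
        }
    }
    for (int i = 0; i < npipes; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    return hops;
}
```

In the real benchmark, timing many such hops with thousands of threads exposes the scheduler and I/O-dispatch overhead being compared across Capriccio, LinuxThreads, and NPTL.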
I/O Performance [Figure: network I/O throughput vs. concurrency, with epoll_wait() as the baseline]
I/O Performance Benefit from kernel’s disk head scheduling algorithm since Capriccio uses asynchronous I/O primitives
Disk I/O with Buffer Cache At low miss rates, Capriccio's throughput is 50% of NPTL's The source of the overhead is the asynchronous I/O interface
Linked Stack Management LinuxThreads allocates 2MB per stack 1 GB of VM holds only 500 threads Fixed Stacks
Safety: Linked Stacks The problem: fixed stacks Overflow vs. wasted space The solution: linked stacks Allocate space as needed Compiler analysis Add runtime checkpoints Guarantee enough space until next check [Figure: fixed stack showing overflow and waste vs. linked stack]
Linked Stacks: Algorithm Build a weighted call graph Insert checkpoints What is a checkpoint? A point that determines whether there is enough stack space left to reach the next checkpoint [Figure: call graph with node weights 3, 3, 2, 5, 2, 4, 3, 6]
Placing Checkpoints One checkpoint on every cycle's back edge in the call graph Bound the stack growth between checkpoints by the deepest call path
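The run-time side of these checkpoints can be sketched as follows. The chunk bookkeeping here is a simplification (a real implementation switches the stack pointer into the new chunk and unlinks it on return); `needed` is the largest stack use on any path to the next checkpoint, bounded by MaxPath.

```c
#include <stdlib.h>

/* Simplified linked-stack chunk: remaining bytes plus a link
 * back to the previous chunk. */
typedef struct chunk {
    size_t free_bytes;
    struct chunk *prev;
} chunk;

#define MIN_CHUNK 4096  /* MinChunk: smallest chunk worth allocating */

/* The compiler inserts a call like this at each checkpoint.
 * If the current chunk can cover the worst-case path to the
 * next checkpoint, execution continues on it; otherwise a new
 * chunk is allocated and linked to the old stack. */
chunk *checkpoint(chunk *cur, size_t needed) {
    if (cur->free_bytes >= needed)
        return cur;  /* common case: enough space left */
    size_t sz = needed > MIN_CHUNK ? needed : MIN_CHUNK;
    chunk *c = malloc(sizeof *c);
    c->free_bytes = sz;
    c->prev = cur;   /* link new chunk onto the existing stack */
    return c;
}
```

The common case is a single comparison, which is why the per-checkpoint cost stays low even though the check runs on every cycle through the call graph.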
Linked Stacks: Algorithm Parameters: MaxPath, MinChunk Steps: break cycles, trace back Special cases: function pointers, external calls (use a large stack) [Figure, shown over several animation steps: call graph with node weights 3, 3, 2, 5, 2, 4, 3, 6, annotated with checkpoints c1-c5; MaxPath = 8]
Dealing with Special Cases Function pointers Don’t know what procedure to call at compile time Can find a potential set of procedures
Dealing with Special Cases External functions Allow programmers to annotate external library functions with trusted stack bounds Allow larger stack chunks to be linked for external functions
Tuning the Algorithm Some stack space will be wasted Tradeoffs Internal vs. external wasted space Number of stack links (MaxPath) vs. external fragmentation (MinChunk)
Memory Benefits No preallocation of large stacks Reduces the memory required to run large numbers of threads Better paging behavior, since stacks are used LIFO
Case Study: Apache 2.0.44 MaxPath: 2KB MinChunk: 4KB Apache under SPECweb99 Overall slowdown is about 3% Dynamic allocation 0.1% Link to large chunks for external functions 0.5% Stack removal 10%
Scheduling: The Blocking Graph Lessons from event systems; Capriccio does this for threads Each node is a location in the program where a thread blocked Nodes are identified by the stack trace at the blocking point Record information about thread behavior [Blocking graph for a web server: Accept, Read, Open, Read, Close, Write, Close]
Scheduling: The Blocking Graph Annotate each edge with its average running time (measured with the cycle counter) Annotate each node with how long its next edge will take on average Annotate the changes in resource usage
Resource-Aware Scheduling Keep track of resource usage levels and decide dynamically whether each resource is at its limit Annotate each node with the resources used on its outgoing edges, to predict the impact on each resource of scheduling threads from that node Dynamically prioritize nodes (and thus threads) for scheduling based on this information Increase use when a resource is underutilized; decrease use near saturation Advantages Operate close to the limit without thrashing Automatic admission control
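One plausible scoring rule for this prioritization can be sketched as follows. This is an illustrative formula, not Capriccio's actual policy: nodes whose outgoing edges release a nearly saturated resource are boosted, and nodes that consume it are penalized, with the penalty growing sharply as utilization approaches 1.

```c
/* Hypothetical priority score for a blocking-graph node.
 * usage[r] is the node's annotated per-edge change in
 * resource r (positive = consumes, negative = releases);
 * util[r] is current utilization of resource r in [0,1). */
#define NRES 3  /* e.g. memory, file descriptors, CPU */

double node_priority(const double usage[NRES], const double util[NRES]) {
    double score = 0.0;
    for (int r = 0; r < NRES; r++) {
        /* weight grows sharply as the resource nears saturation */
        double pressure = util[r] / (1.0 - util[r] + 1e-9);
        score -= usage[r] * pressure;  /* consuming a scarce resource hurts */
    }
    return score;
}
```

With a rule like this, threads at nodes that free memory get scheduled first when memory is nearly exhausted, which is the behavior the slide describes as automatic admission control.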
Track Resources Memory usage: via the malloc() family Memory resource limit: by watching page-fault activity File descriptors: by tracking open() and close() calls File-descriptor limit: by estimating the number of open connections at which response time jumps up
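Tracking memory through the malloc() family amounts to interposing a counting wrapper. A minimal sketch, with the allocation size stashed in a hidden header so that free can subtract it (a simplification of what a real interposer does):

```c
#include <stdlib.h>

/* Running total of live allocations, visible to the scheduler
 * as the current memory-usage level. */
static size_t bytes_in_use = 0;

/* Allocate n usable bytes, recording n in a hidden header. */
void *tracked_malloc(size_t n) {
    size_t *p = malloc(n + sizeof(size_t));
    if (!p)
        return NULL;
    *p = n;              /* remember the size for tracked_free */
    bytes_in_use += n;
    return p + 1;        /* hand back the region past the header */
}

void tracked_free(void *ptr) {
    if (!ptr)
        return;
    size_t *p = (size_t *)ptr - 1;  /* step back to the header */
    bytes_in_use -= *p;
    free(p);
}

size_t memory_usage(void) { return bytes_in_use; }
```

File descriptors can be tracked the same way with thin wrappers around open() and close() that increment and decrement a counter.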
Pitfalls Tricky to determine the maximum capacity of a resource Thrashing depends on the workload The disk can handle more sequential requests than random ones Resources interact (e.g., VM vs. disk) Applications may manage memory themselves
Yield Profiling User-level threads are problematic if a thread fails to yield Such threads are easy to detect, since their running times are orders of magnitude longer than the normal interval between yields Yield profiling identifies places where programs fail to yield sufficiently often
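The detection step reduces to recording a timestamp at each yield and flagging gaps far above the norm. A minimal sketch; timestamps are passed in explicitly (e.g. from a cycle counter) to keep it deterministic:

```c
/* Yield-profiling sketch: given the timestamps recorded at
 * successive yield points, count how many inter-yield gaps
 * exceed `threshold` -- each long gap marks a stretch of code
 * that ran too long without yielding. */
int count_long_gaps(const long stamps[], int n, long threshold) {
    int bad = 0;
    for (int i = 1; i < n; i++)
        if (stamps[i] - stamps[i - 1] > threshold)
            bad++;
    return bad;
}
```

A real profiler would also capture a stack trace when a long gap is detected, pinpointing the code location that failed to yield.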
Web Server Performance 4x500 MHz Pentium server 2GB memory Intel e1000 Gigabit Ethernet card Linux 2.4.20 Workload: requests for 3.2 GB of static file data
Web Server Performance Request frequencies match those of SPECweb99 Each client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses Apache's performance improved by 15% with Capriccio
Runtime Overhead Tested with Apache 2.0.44 Stack linking: 78% slowdown for a null call, 3-4% overall Resource statistics: 2% (always on), 0.1% (with sampling) Stack traces: 8% overhead
Resource-Aware Admission Control Touching pages too quickly will cause thrashing Producer threads loop, adding memory to a global pool and randomly touching pages to force them to stay in memory Consumer threads loop, removing memory from the global pool and freeing it Capriccio quickly detects the overload condition and limits the number of producers
Related Work Programming models for high concurrency User-level threads (Capriccio is unique: blocking graph, resource-aware scheduling, targets large numbers of blocking threads, POSIX compliant) Application-specific optimization Stack management Resource-aware scheduling
Future Work Multi-CPU machines Improve resource-aware scheduler and stack analysis Produce profiling tools to help tune Capriccio’s stack parameters to the application’s needs.
Thanks! Questions?