Capriccio: Scalable Threads for Internet Services
Authors: Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, Eric Brewer
Presentation by: Will Hrudey

Introduction
Capriccio: "a spritely improvisational musical dance involving multiple voices"
Introduces a fast, scalable user-level thread package for thread management and synchronization

Motivation
Internet servers and databases:
– Have ever-increasing scalability needs
– Need to handle hundreds of thousands of simultaneous connections without significant degradation
– Need a programming model that achieves efficient, robust servers with ease

Approach
Uses user-level threads to provide a natural abstraction for high-concurrency programming
– Prior work discussed threads versus events
Decouples the thread package from the OS to take advantage of:
– Cooperative threading
– New asynchronous I/O interfaces
– Compiler support
Provides 3 key features:
– Scalability
– Linked stacks
– Resource-aware scheduling

Goals
Allow high performance without high complexity
Support existing thread APIs (POSIX)
Scale to hundreds of thousands of threads
Flexibility to address application-specific needs
Little or no modification of the application itself

User-Level Threads
Provide performance & flexibility advantages
Provide a clean programming model with useful invariants and semantics
Decouple the thread package from the OS
– Hides both OS variation & kernel evolution
– Allows compiler support to be integrated
Can complicate preemption
Can interact badly with the kernel scheduler

User-Level Threads
Flexibility
– Take advantage of new asynchronous I/O mechanisms
– Tailored scheduling
– Lightweight (scale to 100,000 threads)
Performance
– Reduced synchronization overhead on uniprocessors
– More efficient memory management
Disadvantages
– Blocking I/O requires a wrapper layer to translate blocking calls into non-blocking I/O
– Lightweight synchronization benefits are diminished on multiprocessors

User-Level Threads
Implementation (user-level library for Linux)
– Context switches: coroutine library
– I/O: intercepts blocking I/O calls; uses epoll() for pollable file descriptors and Linux AIO
– Scheduling: the main loop looks like an event-driven application; it runs threads and checks for I/O completions
– Synchronization: cooperative scheduling simplifies synchronization
– Efficiency: thread-management functions have bounded worst-case running times
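To make the wrapper-layer idea concrete, here is a minimal, self-contained sketch (not Capriccio's actual code) of how a "blocking" read can be translated into non-blocking I/O plus a yield point. The names wrapped_read and thread_wait_for_io are hypothetical; in this sketch the wait helper simply poll()s, whereas a real user-level scheduler would switch to another thread and learn about readiness through epoll().

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the thread library: park the calling thread until the fd is
 * ready.  A real user-level scheduler would switch to another thread here
 * and discover readiness via epoll(); this stub just poll()s the one fd. */
static void thread_wait_for_io(int fd, short events)
{
    struct pollfd p = { .fd = fd, .events = events };
    poll(&p, 1, -1);
}

/* A "blocking" read as the application sees it.  The fd is made
 * non-blocking; on EAGAIN the calling thread yields until the fd is
 * readable, letting the scheduler run other threads in the meantime. */
static ssize_t wrapped_read(int fd, void *buf, size_t count)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0 || errno != EAGAIN)
            return n;                        /* data, EOF, or a real error */
        thread_wait_for_io(fd, POLLIN);      /* yield until readable */
    }
}

int main(void)
{
    char buf[64];
    ssize_t n = wrapped_read(STDIN_FILENO, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("read %zd bytes: %s", n, buf);
    }
    return 0;
}
```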

User-Level Threads
Microbenchmark
– Testbed: 2 x 2.4 GHz Xeon / 1 GB RAM / 2 x 10K RPM SCSI Ultra II disks / 3 x Gigabit Ethernet / Linux
– Thread packages: Capriccio, LinuxThreads, NPTL

Efficient Stack Management
Optimizes stack allocation for many threads
– Reduces the amount of virtual memory dedicated to stacks
Small non-contiguous stack chunks
– Grow and shrink at run time
Compiler analysis and runtime checks
– Generate a weighted, directed call graph

Efficient Stack Management
Nodes are functions, weighted by maximum stack frame size
Edges indicate function calls between nodes
A path is a sequence of stack frames
Checkpoints are code inserted at call sites
[Figure: weighted call graph]
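The sketch below is a small, hypothetical illustration of this analysis: a toy weighted call graph (made-up function names and frame sizes) and a recursive bound on the worst-case stack needed before the next checkpoint is reached. Capriccio's real analysis also breaks cycles by forcing checkpoints on back edges and places checks on call-site edges rather than on whole functions; this sketch simplifies both.

```c
#include <stdio.h>

/* Tiny illustrative weighted call graph (hypothetical functions and sizes):
 * node weight = max stack frame size; edge[u][v] means u may call v. */
enum { MAIN, HANDLE, PARSE, LOG, NFUNCS };

static const char  *name[NFUNCS]  = { "main", "handle", "parse", "log" };
static const size_t frame[NFUNCS] = { 256, 1024, 512, 128 };
/* Simplification: calls into these functions are preceded by a checkpoint. */
static const int checkpoint[NFUNCS] = { [HANDLE] = 1 };
static const int edge[NFUNCS][NFUNCS] = {
    [MAIN]   = { [HANDLE] = 1 },
    [HANDLE] = { [PARSE] = 1, [LOG] = 1 },
    [PARSE]  = { [LOG] = 1 },
};

/* Worst-case stack needed starting at f before the next checkpoint runs.
 * This toy graph is acyclic, so plain recursion suffices. */
static size_t bound(int f)
{
    size_t worst = 0;
    for (int g = 0; g < NFUNCS; g++) {
        if (!edge[f][g])
            continue;
        size_t need = checkpoint[g] ? 0 : bound(g);  /* checkpoint re-checks */
        if (need > worst)
            worst = need;
    }
    return frame[f] + worst;
}

int main(void)
{
    for (int f = 0; f < NFUNCS; f++)
        printf("%-6s needs at most %zu bytes before the next checkpoint\n",
               name[f], bound(f));
    return 0;
}
```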

Efficient Stack Management
Places a reasonable bound on the amount of stack space consumed by each thread
Checkpoints determine whether there is enough space left to reach the next checkpoint without overflow
– If not, a new stack chunk is allocated and the stack pointer is adjusted
Checkpoint placement
– Break cycles in the call graph
– Scan nodes to ensure each path stays within the desired bound
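Below is a minimal sketch of what a compiler-inserted checkpoint conceptually does at run time. The structure and names (stack_chunk, stack_checkpoint, MIN_CHUNK) are hypothetical, and only the chunk bookkeeping is modeled: real Capriccio checkpoints also switch the hardware stack pointer onto the new chunk and unlink it when the call returns.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one thread's linked stack. */
struct stack_chunk {
    struct stack_chunk *prev;   /* chunk to return to on unwind */
    size_t size;                /* bytes available in this chunk */
    size_t used;                /* bytes consumed so far */
};

static struct stack_chunk *current_chunk;

#define MIN_CHUNK 4096          /* illustrative MinChunk-style tuning knob */

/* What a checkpoint at a call site conceptually does: the call-graph
 * analysis bounds how much stack is needed to reach the next checkpoint;
 * if the current chunk cannot hold it, link in a new chunk. */
static void stack_checkpoint(size_t bytes_to_next_checkpoint)
{
    if (current_chunk->size - current_chunk->used >= bytes_to_next_checkpoint)
        return;                                   /* enough room, keep going */

    struct stack_chunk *c = malloc(sizeof *c);    /* demo: never freed */
    c->prev = current_chunk;
    c->size = bytes_to_next_checkpoint > MIN_CHUNK
            ? bytes_to_next_checkpoint : MIN_CHUNK;
    c->used = 0;
    current_chunk = c;                            /* real code also moves SP */
    printf("linked a new %zu-byte chunk\n", c->size);
}

int main(void)
{
    struct stack_chunk first = { NULL, 8192, 0 };
    current_chunk = &first;

    current_chunk->used += 6000;      /* pretend some frames were pushed */
    stack_checkpoint(1000);           /* fits in the remaining 2192 bytes */
    stack_checkpoint(3000);           /* does not fit: a new chunk is linked */
    return 0;
}
```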

Efficient Stack Management
Special cases
– Function pointers complicate analysis
– External function calls
Tuning to optimize memory usage
– MaxPath
– MinChunk
Linked stacks can improve paging behavior
Apache SPECweb99 results: 3-4% slowdown overall

Resource-Aware Scheduling
Thread scheduling and admission control adapt to resource usage
The application is viewed as a sequence of stages separated by blocking points
Dynamic scheduling decisions are finer grained
Blocking graphs are generated at runtime
– Learn behavior dynamically to improve scheduling
– Determine the impact on resource utilization of scheduling each thread

Resource-Aware Scheduling
Nodes are program locations where threads block
Edges reflect consecutive blocking points
Edges are annotated with weighted averages reflecting resource usage
Nodes are annotated with weighted averages of their outgoing edge values
Threads walk this graph independently
[Figure: blocking graph]

Resource-Aware Scheduling
Promote nodes that release resources; demote nodes that acquire resources
Dynamically prioritize nodes (and hence threads) for scheduling
Responds to changes in resource consumption due to the type of work and the offered load
Implemented using separate run queues for each node
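As a rough illustration of this idea (not Capriccio's implementation), the sketch below keeps an exponentially weighted average of how much memory a blocking-graph node tends to acquire or release, and derives a priority from it: under memory pressure, a node that frees memory is promoted and a node that allocates is demoted. All names, constants, and the priority formula are hypothetical.

```c
#include <stdio.h>

/* Hypothetical per-node bookkeeping for a blocking-graph node. */
struct bg_node {
    const char *name;
    double mem_delta;   /* EWMA of memory acquired (+) or released (-) per visit */
    double priority;    /* scheduling priority for this node's run queue */
};

#define ALPHA 0.25      /* smoothing factor for the weighted average */

/* Update a node's resource annotation after a thread passes through it. */
static void bg_observe(struct bg_node *n, double mem_delta_bytes)
{
    n->mem_delta = (1.0 - ALPHA) * n->mem_delta + ALPHA * mem_delta_bytes;
}

/* Recompute priority as memory nears its limit: promote nodes that free
 * memory, demote nodes that consume it. */
static void bg_reprioritize(struct bg_node *n, double memory_pressure /* 0..1 */)
{
    n->priority = -n->mem_delta * memory_pressure;
}

int main(void)
{
    struct bg_node open_conn = { "accept->read", 0, 0 };
    struct bg_node send_resp = { "write->close", 0, 0 };

    bg_observe(&open_conn, +8192.0);   /* opening a connection allocates buffers */
    bg_observe(&send_resp, -8192.0);   /* finishing a response frees them */

    bg_reprioritize(&open_conn, 0.9);  /* memory nearly exhausted */
    bg_reprioritize(&send_resp, 0.9);

    printf("%s priority %.1f\n", open_conn.name, open_conn.priority);
    printf("%s priority %.1f\n", send_resp.name, send_resp.priority);
    return 0;
}
```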

Resource-Aware Scheduling
Usage
– Drive each resource toward maximum capacity, then throttle back; hysteresis keeps the system near full throttle
Challenges
– Determining the maximum capacity of a resource is tricky
– Resources interact with one another
– Thrashing can be difficult to detect
– Application-specific resources (e.g., application-level memory management) are hard to track
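The "throttle back with hysteresis" behavior can be sketched as a simple watermark scheme, shown below with assumed thresholds (0.90/0.75); estimating the actual capacity those thresholds refer to is exactly the tricky part listed under Challenges above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative admission-control knob with hysteresis (not Capriccio's
 * code): keep admitting work until a resource passes a high watermark,
 * then hold back until utilization drops below a low watermark. */
struct throttle {
    double high;        /* e.g., 0.90 of estimated capacity */
    double low;         /* e.g., 0.75 of estimated capacity */
    bool   admitting;
};

static bool throttle_admit(struct throttle *t, double utilization)
{
    if (t->admitting && utilization >= t->high)
        t->admitting = false;            /* throttle back */
    else if (!t->admitting && utilization <= t->low)
        t->admitting = true;             /* resume at full throttle */
    return t->admitting;
}

int main(void)
{
    struct throttle mem = { 0.90, 0.75, true };
    double samples[] = { 0.50, 0.88, 0.93, 0.85, 0.74, 0.60 };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("util %.2f -> %s\n", samples[i],
               throttle_admit(&mem, samples[i]) ? "admit" : "hold");
    return 0;
}
```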

Performance
Evaluated with a real-world web server workload
Testbed
– 4 x 500 MHz Pentium / 2 GB RAM / Gigabit Ethernet
– Linux; the kernel version doesn't support epoll or AIO (poll used instead)
– Client load generated by up to 16 machines of similar configuration
– 3.2 GB of static file data with various file sizes
– Clients repeatedly connect and issue 5 requests, waiting 20 ms between requests
– Cache sizes of Haboob and Knot limited to 200 MB to force disk activity
– Request frequencies for each size and file based on SPECweb99

Performance
15% throughput increase for Apache running on Capriccio
Knot is comparable to the event-based Haboob server

Performance
Overhead of maintaining per-node resource information
– Gathering and maintaining statistics: <2% for edges in Apache
– Statistics remained fairly steady in the tested workloads
– Sampling at a ratio of 1/20 reduces the aggregate overhead to 0.1%
– Stack-trace overhead is significant (8% for Apache, 36% for Knot); it could be reduced with compiler integration

Future Work
Incorporate multiprocessor support
Reduce kernel crossings under heavy load with a batching interface for asynchronous I/O
Improve thrashing detection
Improve stack analysis for function pointers (CCured)
Develop profiling tools to optimize the tuning parameters
Generate the blocking graph at compile time
Implement blocking-point fairness strategies

Conclusion
The thread package was "fixed" to support scalable, high-concurrency Internet servers
The threading model is more useful for high-concurrency programming
The user-level thread package is decoupled from the OS
– It can benefit from new I/O mechanisms and compiler support
Linked stacks and the resource-aware scheduler delivered significant improvements in scalability and performance compared with existing systems

Observations
External function call stack size doesn't scale
Offloads responsibility to compiler support: "compiler technology will play an important role in the evolution of the techniques described in this paper"
Performance tests
– Data not qualified: how many runs? Are the results repeatable?
– The kernel didn't have the same non-blocking call support, so the comparison is difficult; are the results still meaningful?
The stated goal of achieving hundreds of thousands of threads is not explicitly demonstrated

Discussion 1. It seems as though using a graph to dynamically adjust the stack size (vs a default large stack size) is a smart thing to do, especially if memory is a problem. I'm trying to figure out if this is a new era of more intelligent thread packages, or if this is an overly complex solution which has been avoided. So what is the expense (in terms of computation) of this intelligent stack management? Is it necessary for this application to succeed?

Discussion 2. Capriccio can scale to 100,000 threads, but what about more than 100,000 threads? Will the system just crash? Is there no mechanism in place if that happens? 3. I was wondering whether the dynamic stack chunks are mapped contiguously in the virtual memory of the thread. If that were the case, how could they add a chunk of memory to the stack as small as half a page?

Discussion 4. In the experimental section there is no mention of how many tests were performed, and from the looks of it, there was just one, since otherwise vanilla Apache seems to dip and then improve in bandwidth as more clients connect. Also, Knot seems to have approximately the same performance as Haboob, so I'm wondering how conclusive these tests really are.

Discussion 5. The authors continually refer to their program's "event-driven behavior" (pages 3, 8, and 11). In this way, it is a similar implementation to SEDA (in that both event and thread behaviors are exhibited). What is the implied advantage of fixing threads to behave like events over fixing events to behave like (or use) threads?

Discussion 6. What the authors seem to be doing with the scheduling of the system is to wrap event-based behavior (for I/O) in a thread-based abstraction. Is this extra layer of abstraction really needed? How much does the extra layer of abstraction affect the performance of the system in general? Also, why is it that people don't accept the fact that events are better for this type of task and just use them as they are, as opposed to dressing them up in thread costumes?

Discussion 7. One assumption that the authors make is that resource usage is likely to be similar for many tasks at a blocking point. They say that this assumption *seems* to hold in practice. This is of course not too convincing. Is this actually a good assumption to make? Are there any systems where this does not hold, and what would be the consequences on this piece of work?

Discussion 8. The authors comment that the resource-aware scheduling is completely adaptive, but they also confess that the system suffers from several parameter-tuning problems, such as knowing the maximum capacity of each resource and adjusting the speed of adaptation (no reason is given for using exponentially weighted averages). Finding optimal parameters can be a huge amount of additional work and may be too hard to tune by hand. Doesn't this make things more complicated or uncontrollable?

Discussion 9. One of the key features incorporated into Capriccio is a new method of stack management, linked stack management, whose goal is to improve performance by reducing the amount of wasted stack space typical of other types of stack management. Their approach is contingent on compiler support. Is it realistic to expect to see the development of a compiler for this purpose?

Discussion 10. In the case study, the authors choose MaxPath and MinChunk, the two tuning parameters available with their linked stack management algorithm, based on profiling information. Is it reasonable to expect the programmer to supply this information? How sensitive is the algorithm to these parameters?

Discussion 11. Would it be possible to use something like NPTL under low-load, since it performs better than Capriccio, then switch to Capriccio under higher loads when it begins to outperform NPTL? This would give the best of both and constantly maintain good performance.

Discussion 12. In Section 3.1, the authors used whole-program analysis to determine the maximum amount of stack space that a single stack frame for a function will consume. What about dynamic memory allocation? If the code allocates varying amounts of memory at run time, how can the program estimate the maximum stack size (or does it just give a rough estimate)?