Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004.

Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren (jrvb@cs.berkeley.edu) June 21, 2004

Server Applications High concurrency Internet servers & frameworks Web servers, e-commerce, etc. Transaction processing databases Many independent tasks Workload High performance Unpredictable load spikes Operate “near the knee” Avoid thrashing! Ideal Peak: some resource at max Overload: some resource thrashing Load (concurrent tasks) Performance

The Price of Concurrency What makes concurrency hard? Race conditions Code complexity Scalability (no O(n) operations) Scheduling & resource sensitivity Inevitable overload Performance vs. Programmability No current system solves Need tools to manage complexity Performance Ease of Programming Thr ead s Events Id ea l

Insight: Many Standard Idioms Memory management Object pools Generic containers (lists, trees, etc.) Ownership transfer Timeouts Error handling Protocol processing Locking Component interaction APIs

Insight: Wrong System Boundaries OS Assumes adversarial applications Provides “one size fits all” service Hides performance details Implicit performance API Runtime No application knowledge Generic, thin shim b/w app and OS Language / Compiler Throws away programmer idioms Assumes infinite stacks

Application control & extensible kernels ACPM, U-net, scheduler activiations SPIN, Exokernel Problem: hard to use OS-level adaptation Simple scheduling tweaks Disk access pattern prediction Problem: too general, ignores app structure Language Race detection: Flannagan & Qadeer, NesC first-class concurrency: Java, ML Problem: ignores the OS and runtime Related Work

Solution: Holistic System Programming OS Flexible and lightweight Greater user-level control (Declarative API) Integrated runtime & compiler Control concurrency model & scheduling Improve performance Provide guarantees Manage resources Analysis, prediction and adaptation Programming Environment Simple, expressive programming interface

Capriccio: Better Threads Goals Simplify the programming model Thread per concurrent activity Scalability (100K+ threads) Support existing APIs and tools Automate application-specific customization Mechanisms User-level threads Plumbing: avoid O(n) operations Compile-time analysis Run-time analysis

Capriccio Internals Cooperative user-level threads Fast context switches Lightweight synchronization Kernel Mechanisms Asynchronous I/O (Linux) Efficiency Avoid O(n) operations Fast, flexible scheduling Scales to over 100,000 threads

Thread Performance 0.150.140.04 Uncontested mutex lock 0.650.710.240.56Context Switch 17.737.521.5 Thread Creation NPTLLinuxThreadsCapriccio-notraceCapriccio Slightly slower thread creation Faster context switches Even with stack traces! Much faster mutexes Time of thread operations (microseconds)

The Blocking Graph Lessons from event systems Break app into stages Schedule based on stage priorities Allows SRCT scheduling, finding bottlenecks, etc. Capriccio does this for threads Deduce stage with stack traces at blocking points Prioritize based on runtime information Acce pt WriteRead Open Web Server Close

Resource-Aware Scheduling Track resources along BG edges Memory, file descriptors, CPU Predict future from the past Algorithm Increase use when underutilized Decrease use near saturation Advantages Operate near the knee w/o thrashing Automatic admission control Acce pt WriteRead Open Web Server Close

Apache Blocking Graph

Runtime Overhead Tested Apache 2.0.44 Stack linking 78% slowdown for null call 3-4% overall Resource statistics 2% (on all the time) 0.1% (with sampling) Stack traces 8% overhead

Web Server Performance Simple web server: Knot 700 lines of C code A few days to write Network-bound test SPEC web, 100MB workload KnotA: favor accept() KnotC: favor connections Results Low overhead  high throughput Overload in Haboob and KnotA Due to poll() overhead Scheduling prevents overload in KnotC 0 100 200 300 400 500 600 700 800 900 1 4 16 64 256 1024 4096 16384 KnotC (Favor Connections) KnotA (Favor Accept) Haboob Concurrent Clients Bandwidth (Mb/s)

Web Server Performance Disk-bound test SPEC Web, 3 GB workload Apache w/ and w/o Capriccio Results Knot and Haboob were similar Roughly 15% improvement for Apache w/ Capriccio No impact from scheduling Need further investigation

Timeline: Capriccio Core Goals Multiprocessor support Fast thread-local memory allocator Better resource tracking Additional scheduling options Status Multiprocessor support (mostly) working Recently finished rewrite should make most other features easier to add Deadline August 2004

Timeline: Kernel Work Goals Unified interface for disk & network I/O Support for simple atomic sections Status Initial design of kernel API complete Basic support for many things exists in Linux 2.6 Prototype of some kernel features done (Feng Zhou) Deadline December 2004

Timeline: Compiler Tools Goals Linked stacks Improve/integrate Jeremy Condit's work on this Stack fingerprints Limited thread-local state analysis Status Use Cil for program transformations and analysis framework Others working on related things Jeremy Condit & Bill McCloskey Deadline March 2005

Conclusion Holistic approach to systems programming Examine entire stack Powerful, adaptive runtime Exploit program structure Promising initial progress Capriccio threading system (+SOSP paper) Blocking graph Resource-aware scheduling Future work Kernel interface Simple compile-time tools

Microbenchmark: Buffer Cache

Microbenchmark: Disk I/O

Microbenchmark: Producer / Consumer

Microbenchmark: Pipe Test

The Case for User-Level Threads Decouple programming model and OS Kernel threads Abstract hardware Expose device concurrency User-level threads Provide clean programming model Expose logical concurrency Benefits of user-level threads Control over concurrency model! Independent innovation Enables static analysis Enables application-specific tuning Threads App OS User

The Case for User-Level Threads Threads OS User App Decouple programming model and OS Kernel threads Abstract hardware Expose device concurrency User-level threads Provide clean programming model Expose logical concurrency Benefits of user-level threads Control over concurrency model! Independent innovation Enables static analysis Enables application-specific tuning

Safety: Linked Stacks The problem: fixed stacks Overflow vs. wasted space Limits thread numbers The solution: linked stacks Allocate space as needed Compiler analysis Add runtime checkpoints Guarantee enough space until next check Fixed Stacks Linked Stack waste overflow

Linked Stacks: Algorithm 5 4 2 6 3 3 23 Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack MaxPath = 8

Linked Stacks: Algorithm 5 4 2 6 3 3 23 MaxPath = 8 Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack

Linked Stacks: Algorithm 54263 3 23 MaxPath = 8 Parameters MaxPath MinChunk Steps Break cycles Trace back Special Cases Function pointers External calls Use large stack

Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004.

Similar presentations

Presentation on theme: "Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004.

Similar presentations

Presentation on theme: "Holistic Systems Programming Qualifying Exam Presentation UC Berkeley, Computer Science Division Rob von Behren June 21, 2004."— Presentation transcript:

Similar presentations

About project

Feedback