Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications
Hyoun Kyu Cho (1), Tipp Moseley (2), Richard Hank (2), Derek Bruening (2), Scott Mahlke (1)
(1) University of Michigan  (2) Google
2 Datacenter Applications
- In 2010, US datacenters consumed 70-90 billion kWh [Koomey`11]
- Datacenter application performance is critical
- Profiling can help
3 Challenges for Datacenters
- Need to run on live traffic; difficult to isolate
- Overheads: value profiling 3.8x slowdown [Calder`99]; path profiling 31%, edge profiling 16% [Ball`96]
- Binary management: many programs, multiple versions
(Diagram: the traditional profiling workflow, where an instrumentation build turns source code into an instrumented binary, and a training run with input data produces profile data)
4 Google-Wide Profiling [Ren et al.`10]
- Continuous profiling infrastructure for datacenters
- Negligible overhead: sampling based, aggregated profiling overhead less than 0.01%
- Limitations: heavily relies on Performance Monitoring Units; limited flexibility and portability
5 Goals
- Unified profiling infrastructure for datacenters: flexible types of profile data, portable across heterogeneous datacenters
- While maintaining low overhead and without burdening binary management
- Approach: sampling + dynamic binary instrumentation
6 Instrumentation Sampling
(Diagram, built up in three steps: (1) the application runs on the operating system and hardware, entering the OS through the system call gateway; (2) DynamoRIO [Bruening`04] is added: an instrumentation engine with dispatch, a code cache, and a client, plus a context switch between the engine and the code cache; (3) a shepherding thread starts and stops profiling. A sketch of the shepherding thread's loop follows.)
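To make the shepherding thread's duty cycle concrete, here is a minimal C sketch under stated assumptions: start_profiling() and stop_profiling() are hypothetical hooks into the instrumentation engine, and the interval and duration values are illustrative, not the paper's settings.

/* Sketch of the shepherding thread's duty cycle (assumptions noted above). */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

extern void start_profiling(void); /* hypothetical: move threads into the code cache  */
extern void stop_profiling(void);  /* hypothetical: let threads return to native code */

static volatile bool shutting_down = false;

/* Profile for duration_us out of every interval_us microseconds. */
static void *shepherd(void *arg)
{
    const useconds_t interval_us = 1000000; /* example sampling interval */
    const useconds_t duration_us = 10000;   /* example profiling period  */
    (void)arg;
    while (!shutting_down) {
        usleep(interval_us - duration_us);  /* application runs natively     */
        start_profiling();
        usleep(duration_us);                /* application runs instrumented */
        stop_profiling();
    }
    return NULL;
}

/* Usage: pthread_t tid; pthread_create(&tid, NULL, shepherd, NULL); */

The ratio of duration to interval is the knob that trades profiling information for overhead.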
7 Problems with Basic Implementation
- Unbounded profiling periods due to fragment linking
- Latency degradation due to initial instrumentation
- Multi-threaded programs
8 Temporary Unlinking/Relinking of Fragments
(Diagram: code cache with fragments BB1 and BB2 and a direct link BB2 -> BB1; unlinking the fragments forces the next transition to context-switch back to dispatch, and relinking restores the direct jumps when profiling resumes. A conceptual sketch follows.)
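The unlink/relink mechanism can be sketched conceptually in C; fragment_t, fragment_list, dispatch_entry, and patch_jump() below are hypothetical stand-ins for the instrumentation engine's internals, not DynamoRIO's public API.

/* Conceptual sketch only: all names here are hypothetical stand-ins for
 * the engine's internal data structures, not DynamoRIO's public API. */
typedef struct fragment {
    unsigned char *exit_jump;     /* code-cache location of the exit branch */
    unsigned char *linked_target; /* direct target while linked (e.g., BB1) */
    struct fragment *next;
} fragment_t;

extern fragment_t *fragment_list;     /* all fragments in the code cache */
extern unsigned char *dispatch_entry; /* entry point back into dispatch  */
extern void patch_jump(unsigned char *site, unsigned char *target); /* hypothetical */

/* End of a profiling period: route every fragment exit back to dispatch so
 * no thread can stay in the code cache past its next block transition. */
static void unlink_all_fragments(void)
{
    for (fragment_t *f = fragment_list; f != NULL; f = f->next)
        patch_jump(f->exit_jump, dispatch_entry);
}

/* Start of the next profiling period: restore the direct links so execution
 * inside the code cache runs at full speed again. */
static void relink_all_fragments(void)
{
    for (fragment_t *f = fragment_list; f != NULL; f = f->next)
        patch_jump(f->exit_jump, f->linked_target);
}

Unlinking rather than flushing keeps the fragments (and their instrumentation) in the cache, so later profiling periods can reuse them instead of rebuilding.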
9 S/W Code Cache Pre-population
- Still have latency degradation during the initial instrumentation phases
(Diagram: the shepherding thread drives the instrumentation engine (dispatch, client, code cache) to populate the software code cache before profiling starts. A conceptual sketch follows.)
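A conceptual sketch of pre-population, assuming a list of basic-block start addresses recorded from an earlier run and a hypothetical build_fragment() hook that instruments a block without executing it (neither is part of the paper's published interface).

/* Conceptual sketch: build_fragment() is a hypothetical hook that decodes,
 * instruments, and emits a basic block into the code cache without
 * executing it; the block list is assumed to come from an earlier run. */
#include <stdint.h>
#include <stdio.h>

extern void build_fragment(void *bb_start_pc); /* hypothetical */

/* Called by the shepherding thread before the first profiling period so
 * that period does not pay the cost of initial instrumentation. */
static void prepopulate_code_cache(const char *bb_list_path)
{
    FILE *f = fopen(bb_list_path, "r");
    if (f == NULL)
        return;                 /* no recorded block list: skip */
    unsigned long long pc;
    while (fscanf(f, "%llx", &pc) == 1)
        build_fragment((void *)(uintptr_t)pc);
    fclose(f);
}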
10 Multithreaded Program Support
- Because profiling is only active during sampled periods, thread operations can be missed
- Forces registration of Instant Profiling's signal handler for every thread
- Enumerates all threads and sends the profiling start signal to each thread (sketch below)
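A minimal Linux sketch of the broadcast step: the shepherding thread enumerates /proc/self/task and signals each thread individually with tgkill. The choice of SIGRTMIN+1 as the profiling-start signal is an illustrative assumption.

/* Broadcast the profiling-start signal to every thread in the process. */
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define SIGPROF_START (SIGRTMIN + 1) /* illustrative choice of signal */

static void signal_all_threads(void)
{
    DIR *dir = opendir("/proc/self/task"); /* one entry per thread (tid) */
    if (dir == NULL)
        return;
    pid_t tgid = getpid();
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                      /* skip "." and ".." */
        pid_t tid = (pid_t)atoi(ent->d_name);
        /* tgkill delivers the signal to one specific thread. */
        syscall(SYS_tgkill, tgid, tid, SIGPROF_START);
    }
    closedir(dir);
}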
11 Experimental Setup
- 6-core Intel Xeon 2.67 GHz with 12 MB L3, 12 GB main memory, Linux kernel
- gcc with -O3
- Benchmarks: SPEC INT2006, BigTable, Web search
- Edge profiling client (a minimal sketch follows)
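As a reference point for the edge profiling client, here is a minimal sketch of a naïve DynamoRIO clean-call client. It only counts dynamic edges using a per-thread "previous block" slot; a real edge profiler would key counters by (previous, current) block pairs, and this is not the paper's actual client.

/* naive_edge.c: minimal sketch of a DynamoRIO clean-call edge client. */
#include "dr_api.h"

static uint64 edge_count; /* total dynamic edges seen (not thread-safe; illustrative) */

static void
at_bb(app_pc bb_pc)
{
    void *drcontext = dr_get_current_drcontext();
    app_pc prev = (app_pc)dr_get_tls_field(drcontext);
    if (prev != NULL) {
        /* A real edge profiler would increment a counter keyed by the
         * (prev, bb_pc) pair in a hash table here. */
        edge_count++;
    }
    dr_set_tls_field(drcontext, (void *)bb_pc);
}

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb, bool for_trace, bool translating)
{
    /* Clean call at the top of every basic block records its start PC. */
    dr_insert_clean_call(drcontext, bb, instrlist_first(bb), (void *)at_bb,
                         false /* no fpstate save */, 1,
                         OPND_CREATE_INTPTR(dr_fragment_app_pc(tag)));
    return DR_EMIT_DEFAULT;
}

static void
event_thread_init(void *drcontext)
{
    dr_set_tls_field(drcontext, NULL); /* no previous block yet */
}

static void
event_exit(void)
{
    dr_printf("dynamic edges observed: %llu\n", (unsigned long long)edge_count);
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
    dr_register_thread_init_event(event_thread_init);
    dr_register_exit_event(event_exit);
}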
12 Naïve Edge Profiling
13 Profiling Overhead
14 S/W Code Cache Prepopulation
15 Profiling Accuracy
16 Asymptotic Accuracy
17 Conclusion
- Low-overhead, portable, flexible profiling is needed
- Instant Profiling: combines sampling and DBI; pre-populates the S/W code cache
- Tunable tradeoff between overhead and information; provides eventual profiling accuracy
- Less than 5% overhead and more than 80% accuracy for the naïve edge profiling client
18 Thank you!