1
Static and Dynamic Program Analysis: Synergies and Applications
Mayur Naik (Intel Labs, Berkeley) CS 243, Stanford University, March 9, 2011
2
Today’s Computing Platforms
Trends: parallel, cloud, mobile. Traits: numerous, diverse, distributed. Unprecedented software engineering challenges in reliability, productivity, scalability, and energy efficiency. I'll begin with the context for my research. There are three major trends in computing today: parallel computing, by which I mean the multi-core CPUs and GPUs that drive today's desktops and laptops; cloud computing, which is Internet-based computing in which end users do not have to know the physical location or configuration of the hardware delivering their software services; and mobile computing, which is computing on mobile devices such as smartphones. These three kinds of computing platforms have unique traits that distinguish them from traditional computing platforms. I've listed some dominant traits here: in parallel computing, there are numerous cores on the same CPU; in cloud computing, the hardware devices are diverse, or heterogeneous, and geographically distributed. These traits pose unprecedented challenges. Next, I'm going to talk about one software challenge in each of these three kinds of computing platforms.
3
A Challenge in Mobile Computing
Rich apps are hindered by resource-constrained mobile devices (battery, CPU, memory, ...). How can we seamlessly partition mobile apps and offload compute-intensive parts to the cloud? Smartphones such as the iPhone and Android phones have become ubiquitous today, and so have the software applications that run on them. For instance, Apple's iPhone App Store hit its 10 billionth download last month. But rich software applications, for instance compute-intensive ones, are hindered by resource constraints on these smartphones, such as battery power, CPU speed, and memory size. The challenge, then, is: how can we enable programmers to develop rich mobile apps, and seamlessly partition and offload their compute-intensive parts to the cloud?
4
A Challenge in Cloud Computing
Service level agreements, energy efficiency, data locality, scheduling. Speaking of the cloud, cloud computing has become viable today thanks to the growth of the Internet, and it poses new software engineering challenges. For example, a cloud provider might be expected to meet service level agreements for jobs arriving in the cloud, and there are also various resource management questions in cloud computing, such as how to be energy-efficient, how to schedule jobs in a manner that improves, for instance, throughput, and how to exploit data locality in tasks accessing geographically distributed data. One approach to solving these problems involves automatically predicting performance metrics of programs: for instance, if we knew how much time an incoming job is expected to take, we could prioritize it so that it meets a desired SLA. Likewise, if we knew what data an incoming job in the cloud is expected to touch, we could schedule it in a way that takes advantage of data locality. How can we automatically predict performance metrics of programs?
5
A Challenge in Parallel Computing
“Most Java programs are so rife with concurrency bugs that they work only by accident.” – Brian Goetz, Java Concurrency in Practice. And finally, here's a software engineering challenge in parallel computing. Pre-2004, CPUs got faster every year, and sequential software ran faster without changing. But since 2004, CPUs have stopped getting faster due to various physical limits, and have instead started accumulating more and more cores, none of which is getting any faster. What this shift in hardware meant for software was that, unlike pre-2004, sequential software would no longer be able to exploit the performance benefits of multi-core CPUs with each passing year unless it was rewritten to be concurrent, that is, written in a way that can run different sub-tasks on different cores. But concurrent programming is significantly harder than sequential programming; for instance, here is a quote from a popular concurrent programming book. The software engineering challenge in parallel computing, then, is: how can we automatically make concurrent programs more reliable?
6
Terminology Program Analysis Dynamic Analysis Static Analysis
Discovering facts about programs Dynamic Analysis Program analysis using program executions Static Analysis Program analysis without running programs I’ll introduce some terminology before I proceed. Program analysis is a body of work that concerns discovering facts about programs. One can discover these facts either by running the program, in which case it is called dynamic analysis. Or one can discover these facts without running the program, in which case it is called static analysis. Static analysis typically involves inspecting the program’s source code or executable code.
7
This Talk Techniques Challenges
Synergistically combine diverse techniques to solve modern software engineering challenges. [Diagram: techniques (static analysis, dynamic analysis, machine learning) mapped to challenges (program scalability, program reliability, program performance).] This talk is about combining diverse techniques such as static and dynamic program analysis, which I just defined, as well as machine learning, in novel ways to solve modern software engineering challenges such as performance, reliability, and scalability.
8
Our Result: Mobile Computing
Seamless Program Partitioning: up to 20X decrease in energy used on phone. [Diagram: static and dynamic analysis applied to program scalability.] I'll first give you a preview of our main results. For mobile computing, the challenge is scaling rich apps on resource-constrained smartphones, and I will show how we have combined static and dynamic program analysis to seamlessly partition such apps and offload compute-intensive parts to the cloud, resulting in up to a 20X decrease in energy used on the phone.
9
Our Result: Cloud Computing
Automatic Performance Prediction: prediction error < 7% at < 6% program runtime cost. [Diagram: static analysis, dynamic analysis, and machine learning applied to program performance.] In cloud computing, the challenge is estimating the performance of jobs in the cloud, and I will show how we have combined static analysis, dynamic analysis, and machine learning to automatically predict the performance of general-purpose programs.
10
Our Result: Parallel Computing
Scalable Program Verification: 400 concurrency bugs in 1.5 MLOC Java programs. [Diagram: static and dynamic analysis applied to program reliability.] In parallel computing, the challenge is improving the reliability of concurrent programs, and I will show how we have combined static and dynamic program analysis to expose about 400 concurrency bugs in 1.5 MLOC of widely used concurrent Java programs, many of which were fixed by the developers of those programs within a week of reporting.
11
Talk Outline: Overview, Seamless Program Partitioning, Automatic Performance Prediction, Scalable Program Verification, Future Directions. That was an overview of my talk. I'm next going to talk about how we addressed each of the three software engineering challenges I mentioned in mobile, cloud, and parallel computing, and finally outline some future directions. I will also mention here that the technical depth of the program analyses employed is going to increase progressively.
12
Program Partitioning: CloneCloud [EuroSys’11]
Offline optimization using ILP: static analysis yields the constraints, which dictate correct solutions and use the program's call graph to avoid nested migration; dynamic analysis yields the objective function, which dictates optimal solutions and uses the program's profiles to minimize time or energy. [Diagram: a rich app annotated with offload and resume instructions migrating state between phone and cloud.] How do we automatically find which function(s) to migrate?
I'm now going to talk about a technique and system we have developed, called CloneCloud, for seamless program partitioning that enables rich apps on resource-constrained mobile devices. Suppose you have such a device, an iPhone here, on which you want to run a rich app, in this case a face detection algorithm on images stored on the phone. Since this algorithm is compute-intensive, you might want to offload the computation along with the data, in this case the stored images, to a more powerful machine in the cloud, and get back the results. One way to do this would be to provide offload and resume instructions, and automatically introduce these instructions at the entry and exit of functions that we wish to migrate to the cloud. When the app encounters the offload instruction, it migrates the state, and when it encounters the resume instruction, it reintegrates the state and resumes on the mobile device. The question then becomes: how do we automatically determine the functions at whose boundaries we must introduce offload/resume instructions? Offloading and resuming itself has data-transfer overhead, so we must avoid doing it unnecessarily. Our solution is to formulate this as an optimization problem, in particular an Integer Linear Programming (ILP) problem whose goal is to determine the optimal partitioning. We obtain the constraints of the ILP using static analysis, and the objective function using dynamic analysis. The constraints dictate which partitionings are correct, while the objective function dictates which partitionings are optimal.
The static analysis involves computing what is called a call graph, which summarizes all possible control-flow paths that might be executed. One of our constraints uses this call graph to ensure that there is no nested offload/resume along any path. Another constraint states that native functions, such as those accessing the camera or other sensors, must run on the phone. The dynamic analysis, on the other hand, uses program profiles obtained from both the phone and the cloud to express the cost of feasible partitionings, so that the ILP can pick the one with the least cost. We have experimented with two cost models, one minimizing total execution time and the other minimizing energy consumption on the phone. The entire ILP solving process occurs offline.
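The scheme above can be sketched in a few lines of code. This is a toy brute-force search over placements rather than a real ILP solver, and the call graph, method names, cost numbers, and migration cost are all hypothetical, not CloneCloud's actual model; it only illustrates how the call-graph constraints (no nested migration, native code pinned to the phone) and the profiled cost objective interact.

```python
from itertools import combinations

# Hypothetical call graph and per-method profiled costs (illustrative only).
calls = {"main": ["readCamera", "detect", "render"],
         "detect": ["match"], "readCamera": [], "match": [], "render": []}
phone_cost = {"main": 1, "readCamera": 2, "detect": 50, "match": 30, "render": 20}
cloud_cost = {"main": 1, "readCamera": 2, "detect": 5, "match": 3, "render": 4}
native = {"readCamera"}   # constraint: camera access must stay on the phone
migrate_cost = 10         # modeled cost of one offload/resume round trip

def reachable(m):
    seen, work = set(), [m]
    while work:
        for c in calls[work.pop()]:
            if c not in seen:
                seen.add(c)
                work.append(c)
    return seen

def valid(offloaded):
    # Correctness constraints from static analysis: no native method may be
    # dragged to the cloud, and no offload may be nested inside another.
    return all(not (native & ({m} | reachable(m))) and
               not (reachable(m) & offloaded) for m in offloaded)

def cost(offloaded):
    # Objective from dynamic analysis: profiled cost of each placement.
    cloud = set()
    for m in offloaded:
        cloud |= {m} | reachable(m)
    return (sum(cloud_cost[m] for m in cloud)
            + sum(phone_cost[m] for m in phone_cost if m not in cloud)
            + migrate_cost * len(offloaded))

methods = sorted(phone_cost)
best = min((frozenset(s) for r in range(len(methods) + 1)
            for s in combinations(methods, r) if valid(frozenset(s))),
           key=cost)
print(sorted(best), cost(best))   # → ['detect', 'render'] 35
```

A real ILP encodes the same constraints as linear inequalities over 0/1 migration variables and lets a solver scale to hundreds of methods; the brute-force search here is just small enough to read.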
13
CloneCloud on Face Detection App
phone = HTC G1 running Dalvik VM on Android OS; cloud = desktop running Android x86 VM on Linux. [Chart: energy used on phone (J) for 1 image vs 100 images.] We experimented with 3 rich apps, one of which is the face detection app I talked about. For this app, we did two experiments, one using just 1 image and another using 100 images for face detection. In the first experiment, the ILP chose not to partition the app, whereas in the second experiment it chose to partition it, resulting in a 20X decrease in energy used on the phone when a WiFi network was used to transfer the 100 images to a desktop, and a 14X decrease when a 3G network was used. Note that the energy consumed in this case includes the energy spent transferring the 100 images and getting back the results. We have similar results for the two other apps, and for total running time instead of energy used on the phone. In summary, I have shown you how we can combine static and dynamic analysis in a simple but powerful way to achieve dramatic scalability for rich apps on resource-constrained mobile devices. Up to 20X decrease in energy used on phone; similar for total running time, and other apps.
14
Talk Outline Overview Seamless Program Partitioning Automatic Performance Prediction Scalable Program Verification Future Directions
15
From Offline to Online Program Partitioning
CloneCloud uses the same partitioning regardless of input, but different partitionings are optimal for different inputs: with 1 image, phone-only is optimal; with 100 images, phone + cloud is optimal. Challenge: automatically predict the running time of a function on an input. This can be used to decide online whether or not to partition, but it also has many other applications. To show you how widely applicable this problem is, I'll motivate it using CloneCloud itself. … applications in many diverse domains, and particularly in cloud computing, for example to solve challenges such as scheduling, data locality, and meeting service level agreements.
16
The Problem: Predicting Program Performance
Inputs: Program P, Input I
class Bldg {
  List events, floors;
  main() {
    Bldg b = new Bldg();
    for (…) BP v = b.events.get(i);
    int t = v.time;
  }
  Bldg() {
    floors = new List();
    for (…) floors.add(new Floor());
    for (…) Elev e = new Elev(floors); e.start();
    events = new List();
    for (…) events.add(new BP(…));
  }
}
java Elev /foo/input.txt 9 2 1 0 6 2 1 5 5 4 8 7 7 0
Output: Estimated running time of P(I)
Goals: Accurate, Efficient, General-purpose, Automatic
17
Our Solution: Mantis [NIPS’10]
program P input I Offline Online performance model estimated running time of P(I) training inputs I1, …, IN
18
Offline Stage of Mantis
Instrument program with broad classes of features; collect feature values and time on training data; express time as a function of a few features using sparse regression; obtain evaluators for features using static slicing analysis. [The elevator example program, instrumented with feature counters f1++, f2++, f3++, f4 += t; diagram: feature matrix f1 … fM over training inputs I1 … IN with observed times R, and the learned model R = f(f4).] A hallmark of any program performance prediction or modeling approach is which features are used to model performance. Since we want to be both automatic and general-purpose, we cannot use expert or domain-specific knowledge, so we cast a wide net. Note that here, even though just f4 is chosen, in practice multiple features might be chosen, and we can also have cross terms, for example modeling the number of times a doubly nested loop runs (as in matrix multiplication). We use sparse regression because online we will need to evaluate the chosen features, and also because we do not want to overfit to the training data.
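To make the sparse-regression step concrete, here is a from-scratch coordinate-descent Lasso on toy data. The four feature counters and the cost model (running time depends only on f4, plus noise) are invented for illustration, not Mantis's actual features or regressor; the point is that soft-thresholding drives the weights of uncorrelated features to exactly zero, yielding a sparse model.

```python
import random

random.seed(0)

# Toy training data: four feature counters per run; time ~ 3 * f4 + noise.
n, d = 300, 4
X = [[random.uniform(0, 10) for _ in range(d)] for _ in range(n)]
y = [3.0 * x[3] + random.gauss(0, 0.1) for x in X]

# Center features and times so no intercept term is needed.
mx = [sum(x[j] for x in X) / n for j in range(d)]
my = sum(y) / n
Xc = [[x[j] - mx[j] for j in range(d)] for x in X]
yc = [t - my for t in y]

def lasso(X, y, lam, iters=100):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # Correlation of feature j with the residual (excluding w[j]).
            rho = sum(X[i][j] * (y[i] - sum(w[k] * X[i][k]
                      for k in range(d) if k != j)) for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # Soft-thresholding: weakly correlated features get weight 0.
            w[j] = (rho - lam) / z if rho > lam else \
                   (rho + lam) / z if rho < -lam else 0.0
    return w

w = lasso(Xc, yc, lam=5.0)
chosen = [j for j in range(d) if w[j] != 0.0]
print(chosen, [round(v, 2) for v in w])
```

With this data the regression keeps only f4 (index 3), mirroring the slide: of many instrumented features, a handful survive into the performance model.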
19
Static Slicing Analysis
static-slice(v) = { all actions that may affect value of v } Computed using data and control dependencies
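The definition above reduces to backward reachability over a dependence graph. The statements and edges below are a hand-made sketch loosely following the elevator example, not the output of a real slicer; the algorithm itself (a worklist traversal of data/control dependencies) is the standard one.

```python
# stmt -> statements it is data- or control-dependent on (illustrative).
deps = {
    "f4 += t":                 ["t = v.time"],
    "t = v.time":              ["v = b.events.get(i)"],
    "v = b.events.get(i)":     ["b = new Bldg()", "events.add(new BP(...))"],
    "events.add(new BP(...))": ["events = new List()"],
    "events = new List()":     ["b = new Bldg()"],
    "b = new Bldg()":          [],
    "floors.add(new Floor())": ["floors = new List()"],   # not in the slice
    "floors = new List()":     [],
}

def static_slice(stmt):
    """All statements that may affect stmt, via data/control dependencies."""
    work, seen = [stmt], {stmt}
    while work:
        for dep in deps.get(work.pop(), []):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return seen

slice_f4 = static_slice("f4 += t")
print(sorted(slice_f4))
```

Note that the floor-related statements fall outside the slice: running only the slice of f4 skips them, which is exactly why evaluating a feature via its slice can be far cheaper than running the whole program.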
20
From Dependencies to Slices
f4 += t; static-slice(f4) = [the subset of the elevator example program highlighted as the slice: the statements along the events chain, excluding the floor- and elevator-related code]
21
Offline Stage of Mantis
Instrument program with broad classes of features; collect feature values and time on training data; express time as a function of a few features using sparse regression; obtain evaluators for features using static slicing analysis; if a feature is costly, discard it and repeat the process. [Instrumented elevator example and feature-matrix diagram as before.] With each iteration, the accuracy of performance prediction decreases progressively, since the regression algorithm has fewer and fewer features strongly correlated with running time to pick for the performance model. But recall that our goal is not just accurate performance modeling but also efficient performance prediction, so the evaluators for the chosen features have to be cheap. Mantis automatically strikes this tradeoff between accuracy and cost of performance prediction.
22
Offline Stage of Mantis
Instrument program with broad classes of features Collect feature values and time on training data Express time as function of few features using sparse regression Obtain evaluators for features using static slicing analysis If feature is costly, discard it and repeat process dynamic analysis training data profile data rejected features machine learning static analysis chosen features
23
Mantis on Apache Lucene
Popular open-source indexing and search engine. Datasets used: Shakespeare and the King James Bible; 1000 inputs, 100 for training. Feature counts: instrumented = 6,900; considered = 410 (6%); chosen = 2. Prediction error = 4.8%; running time ratio = 28X (program/slices). Similar results for other apps.
24
Talk Outline Overview Seamless Program Partitioning Automatic Performance Prediction Scalable Program Verification Future Directions
25
Static Analysis of Concurrent Programs
Data or control flow of one thread can be affected by actions of other threads! [Elevator example program as before, with f4 += t.] Either prove these actions thread-local, or include these actions in the slice as well.
26
The Thread-Escape Problem
thread-local(p, v): Is v reachable from a single thread at p on all inputs? [Elevator example program as before; heap graph at a program state at p, with objects marked local or shared: Bldg reaches the events and floors lists, their Object[] arrays, and the BP and Floor objects.]
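On one concrete program state, checking the dynamic version of this query is plain graph reachability: an object is thread-local in that state if only one thread's roots reach it. The heap snapshot and thread roots below are an illustrative hand-drawn version of the elevator example, not a real JVM heap, and (as the slide says) the static query additionally quantifies over all inputs, which this one-state check does not.

```python
# Hypothetical heap snapshot: object -> objects it points to.
heap = {
    "Bldg":         ["List(events)", "List(floors)"],
    "List(events)": ["Object[](ev)"],
    "List(floors)": ["Object[](fl)"],
    "Object[](ev)": ["BP"],
    "Object[](fl)": ["Floor"],
    "Elev":         ["List(floors)"],   # elevator thread shares the floors list
    "BP": [], "Floor": [],
}
thread_roots = {"main": ["Bldg"], "elevator-thread": ["Elev"]}

def reach(root):
    seen, work = {root}, [root]
    while work:
        for o in heap[work.pop()]:
            if o not in seen:
                seen.add(o)
                work.append(o)
    return seen

def threads_reaching(obj):
    return {t for t, roots in thread_roots.items()
            if any(obj in reach(r) for r in roots)}

# BP is reachable only from the main thread: thread-local in this state.
print(threads_reaching("BP"))
# Floor is reachable from both threads: thread-escaping.
print(threads_reaching("Floor"))
```

The static analysis must answer the same question over every reachable state and every input, which is why it needs abstraction rather than enumeration.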
27
The Thread-Escape Problem
thread-local(p, v): Is v reachable from a single thread at p on all inputs? This needs reasoning about all inputs, so we use static analysis. The this.elems query can be proven by a flow-insensitive analysis as well; it serves as a placeholder for a library query and can be used to evoke the need for k-CFA and the like. The bp.time query is farthest away from the data structure's creation. The this.ctrl query cannot be proven by a flow-insensitive analysis; it needs recency.
28
The Need for Program Abstractions
All static analyses need abstraction: representing sets of concrete entities as abstract entities. Why? We cannot reason directly about infinitely many concrete entities, and abstraction is needed for scalability. Our static analysis: How are pointer locations abstracted? How is control flow abstracted? Our thread-escape analysis needs to reason about pointer values because thread-escape is about reachability in linked data structures, and about control flow because the same object can change state from thread-local to thread-shared at different program points.
29
Example: Trivial Pointer Abstraction
thread-local(p, v)? [Elevator example program as before. Abstract heap graph under the trivial pointer abstraction: all objects collapse into a single abstract node, so the query cannot be proven.]
30
Example: Allocation Sites Pointer Abstraction
thread-local(p, v)? [Elevator example program as before. Abstract heap graph under the allocation-sites pointer abstraction: one abstract node per allocation site, so objects created at the same site, such as the Object[] arrays allocated inside List's constructor, are merged.]
31
Example: k-CFA Pointer Abstraction
thread-local(p, v)? [Elevator example program as before. Abstract heap graph under the k-CFA pointer abstraction: allocation sites are further distinguished by calling context, so the Object[] arrays of the events and floors lists get separate abstract nodes.]
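The refinement from allocation sites to k-CFA can be sketched as a naming scheme: an abstract object is its allocation site qualified by the last k call sites of the allocating call string. The site label `h_elems` and call-site labels `c_events`, `c_floors` below are hypothetical stand-ins for the `new Object[...]` inside List's constructor and the two `new List()` calls in the example.

```python
def abstract_name(site, call_string, k):
    # k = 0 degenerates to the plain allocation-sites abstraction.
    return (site,) + tuple(call_string[-k:]) if k else (site,)

# The single array allocation in List's constructor, reached from the two
# List() call sites (for the events and floors lists respectively).
events_arr = abstract_name("h_elems", ["c_events"], 1)
floors_arr = abstract_name("h_elems", ["c_floors"], 1)
merged_e = abstract_name("h_elems", ["c_events"], 0)
merged_f = abstract_name("h_elems", ["c_floors"], 0)

print(merged_e == merged_f)      # → True: allocation sites merge the arrays
print(events_arr == floors_arr)  # → False: 1-CFA keeps them apart
```

This is exactly the tradeoff in the complexity table that follows: each extra level of context multiplies the number of abstract values by the number of call sites.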
32
Complexity of Static Analysis
Pointer abstraction (max abstract values, N): trivial = 1; allocation sites = H; k-CFA = H · I^k; our 2-partition = 2.
Control-flow abstraction (max abstract states): flow and context insensitive = 1; flow sensitive, context insensitive = L; flow and context sensitive = L · 2^(N² · F); ours (flow and context sensitive) = Q · L · 4^F.
(H = allocation sites, I = call sites, L = program points, F = fields, Q = queries. Moving down each table trades scalability for precision.)
Challenge: an abstraction that is both precise and scalable.
33
Drawback of Existing Static Analyses
Different queries require different parts of the program to be abstracted precisely, but existing analyses use the same abstraction to prove all queries simultaneously ⇒ existing analyses sacrifice precision and/or scalability. [Diagram: a single abstraction A fed with program P into one static analysis answering both P ⊢ Q1? and P ⊢ Q2?]
34
Insight 1: Client-Driven Static Analysis
Query-driven: allows using separate abstractions for proving different queries. Parametrized: the parameter dictates how much precision to use for each program part for a given query. [Diagram: two static analysis runs on program P, with abstraction A1 answering P ⊢ Q1? and abstraction A2 answering P ⊢ Q2?]
35
Example: Client-Driven Static Analysis
thread-local(p, v)? [Elevator example program as before, with allocation sites labeled h1 through h7; the heap graph is partitioned according to the parameter bits for h1 … h7.] We have two insights which achieve this. The first is what is called client-driven static analysis. The analysis takes not just a program and a query but also a parameter; in our case, it is a boolean vector with one bit per allocation site in the program. Each bit tells whether to treat locations allocated at that site as relevant or irrelevant to proving the query at hand, and we would like to set as few sites as possible to relevant, because the scalability of the analysis depends on it. Consider this particular value for the parameter: it states that locations allocated at sites h1, h5, h6, and h7 must be treated as relevant, and everything else as irrelevant. Even though it looks here like the majority of sites are relevant, you can imagine that in practice there will be hundreds of other sites creating locations that are irrelevant. This parameter value induces the following partitioning, because the five locations here are created at one of sites h1, h5, h6, and h7, and all the rest are created at sites h2, h3, or h4. And thus we are able to prove this query. Now, this parameter value will likely not be able to prove a different query arising from another part of the program, because that query will perhaps require certain other sites to be treated as relevant. But that is fine: our analysis can be run in parallel for different queries with different parameter values.
The key difference between our analysis and existing analyses is that those analyses try to prove all queries simultaneously using the same abstraction, requiring them to create a number of partitions linear in program size, whereas we tailor the abstraction to each query, enabling us to have just two partitions per query.
36
Insight 2: Leveraging Dynamic Analysis
Challenge: efficiently find a cheap parameter that proves the query. There are 2^H choices, and most choices are imprecise or unscalable. Our solution: use dynamic analysis. The parameter is inferred efficiently (linear in H); it can fail to prove the query, but it is precise in practice, and no cheaper parameter can prove the query. [Diagram: inputs I1 … In feed a dynamic analysis of P that infers abstraction A, which a static analysis uses to answer P ⊢ Q?]
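The inference can be sketched as a single pass over one observed heap: mark relevant exactly the allocation sites of objects from which the queried object is reachable, and leave every other site irrelevant. The site labels, points-to snapshot, and the reverse-reachability rule below are an illustrative reading of the h1 … h7 example, not Chord's actual representation or algorithm.

```python
# Hypothetical observed heap of the elevator example, with allocation sites.
alloc_site = {"Bldg": "h1", "List(floors)": "h2", "Object[](fl)": "h3",
              "Floor": "h4", "List(events)": "h5", "Object[](ev)": "h6",
              "BP": "h7"}
points_to = {"Bldg": ["List(events)", "List(floors)"],
             "List(events)": ["Object[](ev)"], "Object[](ev)": ["BP"],
             "List(floors)": ["Object[](fl)"], "Object[](fl)": ["Floor"],
             "BP": [], "Floor": []}

def infer_parameter(target):
    """One pass over the observed heap: linear in the number of sites."""
    # Reverse reachability: which objects can reach `target`?
    rev = {o: [] for o in points_to}
    for src, dsts in points_to.items():
        for d in dsts:
            rev[d].append(src)
    relevant, work = {target}, [target]
    while work:
        for o in rev[work.pop()]:
            if o not in relevant:
                relevant.add(o)
                work.append(o)
    return {alloc_site[o] for o in relevant}

param = infer_parameter("BP")
print(sorted(param))   # → ['h1', 'h5', 'h6', 'h7']
```

On this heap the inferred parameter is exactly the h1, h5, h6, h7 setting from the earlier slide, while the floor-related sites h2, h3, h4 stay irrelevant, so the static analysis gets a two-partition abstraction without searching the 2^H space.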
37
Example: Leveraging Dynamic Analysis
thread-local(p, v)? [Elevator example program as before, with allocation sites h1 through h7 and the heap observed in a concrete run.] In summary, what I've shown you is a way by which dynamic analysis can be used to circumvent the need to explore a search space exponential in program size.
38
Benchmark Characteristics
Columns: classes, methods (x1000), bytecodes (x1000), allocation sites (x1000), queries (x1000):
hedc: 309, 1.9, 151, 0.6, –
weblech: 532, 3.1, 230, 3.0, 0.7
lusearch: 611, 3.8, 267, 3.5, 7.2
hsqldb: 771, 6.4, 472, 5.1, 14.4
avrora: 1498, 5.9, 312, 5.9, –
sunflow: 992, 6.6, 478, 6.1, 10.0
39
Precision Comparison
Previous approach: pointer abstraction = allocation sites; control abstraction = flow and context insensitive.
Our approach: pointer abstraction = 2-partition; control abstraction = flow and context sensitive.
40
Precision Comparison: the previous scalable approach resolves 27% of queries; our approach resolves 82% of queries (55% of queries are proven thread-local, 27% are observed thread-shared).
41
Running Time Breakdown
Columns: baseline static analysis | our approach (dynamic analysis, static analysis total, static analysis per query group mean/max):
hedc: 24s | 6s, 38s, 1s/2s
weblech: 39s | 8s, 1m, 4s
lusearch: 43s | 31s, 8m, 3s
hsqldb: 1m08s | 35s, 86m, 11s/21s
avrora: 1m00s | 32s, 41m, 5s
sunflow: 1m18s | 3m, 74m, 9s/19s
42
Sparsity of Our Abstraction
Columns: total # sites | # sites set to relevant: all queries (mean/max), proven queries (mean/max):
hedc: 1,914 | 3.2/12, 1.4/5
weblech: 2,958 | 2.2/8, 1.5
lusearch: 3,549 | 18
hsqldb: 5,056 | 2.7/56, 1.3
avrora: 5,923 | 12.1/195, 2.3/31
sunflow: 6,053 | 15
43
Feedback from Real-World Deployments
16 bugs in jTDS. Before: “As far as we know, there are no concurrency issues in jTDS.” After: “Probably the whole synchronization approach in jTDS should be revised from scratch.”
17 bugs in Apache Commons Pool: “Thanks to an audit by Mayur Naik many potential synchronization issues have been fixed.” (Release notes for Commons Pool 1.3)
319 bugs in Apache Derby: “This looks like very valuable information … could this tool be run on a regular basis? It is likely that new races could get introduced as new code is submitted.”
44
Talk Outline Overview Seamless Program Partitioning Automatic Performance Prediction Scalable Program Verification Future Directions
45
Program Analyses as Building Blocks
CHORD: a versatile program analysis platform [PLDI'11 tutorial]. Open source, publicly available. Reusable: analyses reused across clients. Client-driven: analyses tailored to clients. Efficient runtime: demand-driven computation. [Diagram of analyses as building blocks: partitioning analysis, performance analysis, slicing analysis, call-graph analysis, pointer analysis, datarace analysis, thread-escape analysis, may-happen-in-parallel analysis. Annotations: applied to partitioning, but also applicable for better scheduling, etc.; applied to datarace analysis, but also used for deadlock detection, etc.; applied to slicing, but can also yield simpler memory models, etc.]
46
Dependency Graphs of Analyses in Chord
A pointer analysis in Chord: 37 tasks, 49 targets, 154 edges Core analyses in Chord: 147 tasks, 246 targets, 1050 edges
47
Applications Built Using Chord
Systems: CloneCloud: Program partitioning [EuroSys’11] Mantis: Performance prediction [NIPS’10] Tools: CheckMate: Dynamic deadlock detection [FSE’10] Static datarace detection [PLDI’06, POPL’07] Static deadlock detection [ICSE’09] Frameworks: Evaluating heap abstractions [OOPSLA’10] Learning minimal abstractions [POPL’11] Abstraction refinement [PLDI’11]
48
Example Application: Configuration Debugging
Program configuration options often badly documented Chord has been used to automatically produce list of options, along with types Can be more accurate than human-produced “Static Extraction of Program Configuration Options” Ariel Rabkin and Randy Katz, ICSE’11
49
Conclusion
Modern computing platforms pose exciting and unprecedented software engineering problems. Static analysis, dynamic analysis, and machine learning can be combined to solve these problems effectively. Program analyses can serve as reusable components in solving diverse software engineering problems. Download Chord at: http://jchord.googlecode.com