1
Recent Advances in Software Engineering in Microsoft Research
Judith Bishop, Microsoft Research, jbishop@microsoft.com
University of Nanjing, 28 May 2015
2
Statistics Trends WER, CRANE Testing IntelliTest Code Hunt Z3 And Friends Prevention Education Hardware Maintenance
3
Software runs on hardware – lots of it. Worldwide shipments of personal devices increased by 5% year over year in 1Q14, with sales of basic and utility tablets in emerging markets, plus smartphones, driving total device market growth during the quarter. (Gartner, June 2014)
4
Connected Devices and The Cloud
5
Most recent technology shift
6
Desktop operating system market share Source: www.netmarketshare.com
7
Mobile/tablet market share Source: www.netmarketshare.com
8
Market share of operating systems in the United States from January 2012 to September 2014. [chart callout: Not Windows]
9
Statistics Trends WER, CRANE Testing IntelliTest Code Hunt Z3 And Friends Prevention Education Hardware Maintenance
11
The Challenge for Microsoft
Microsoft ships software to 1 billion users around the world. We want to:
fix bugs regardless of source – application or OS, software, hardware, or malware
prioritize bugs that affect the most users
generalize the solution so it can be used by any programmer
get the solutions out to users most efficiently
try to prevent bugs in the first place
12
Debugging in the Large with WER… [diagram: 23,450,649 error reports funneled through bucketing stages down to a handful of minidumps]
13
The huge database can be mined to prioritize work:
Fix bugs from the most (not the loudest) users
Correlate failures to co-located components
Show when a collection of unrelated crashes all contain the same culprit (e.g. a device driver)
WER has proven itself "in the wild": it found and fixed 5,000 bugs in beta releases of Windows after programmers had found 100,000 with static analysis and model checking tools.
Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt. Debugging in the (Very) Large: Ten Years of Implementation and Experience. SOSP '09, Big Sky, MT, October 2009.
14
Bucketing Mostly Works
One bug can hit multiple buckets: up to 40% of error reports; duplicate buckets must be hand-triaged.
Multiple bugs can hit one bucket: up to 4% of error reports; harder to isolate each bug.
But what if bucketing is wrong 44% of the time? Solution: scale is our friend. With billions of error reports, we can afford to throw away a few million.
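The bucketing idea above can be illustrated with a minimal sketch. The signature fields below (module, version, fault offset) are a simplification for illustration; real WER applies many more heuristics. Note how the same driver bug in two builds lands in two buckets – the "duplicate buckets" problem the slide mentions.

```python
# Hedged sketch of WER-style bucketing: crashes are grouped ("bucketed")
# by a heuristic signature so that one bucket ideally maps to one bug.
from collections import Counter

def bucket_id(report):
    # Simplified signature: failing module, its version, and the offset
    # of the faulting instruction within that module.
    return (report["module"], report["version"], report["offset"])

reports = [
    {"module": "driver.sys", "version": "1.0",  "offset": 0x1A2B},
    {"module": "driver.sys", "version": "1.0",  "offset": 0x1A2B},
    {"module": "driver.sys", "version": "1.1",  "offset": 0x1A2B},  # same bug,
    # new build: lands in a second bucket (a duplicate bucket)
    {"module": "word.exe",   "version": "14.0", "offset": 0x99},
]

buckets = Counter(bucket_id(r) for r in reports)

# Triage fixes the bucket with the most reports first: most users helped.
top_bucket, hits = buckets.most_common(1)[0]
print(top_bucket, hits)
```

Sorting buckets by report count is what lets a fix for a handful of bugs help the largest number of users.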
15
Top 20 Buckets for MS Word 2010
3-week internal deployment to 9,000 users. Just 20 buckets account for 50% of all errors: fixing a small number of bugs will help many users. [chart: bucket # vs. CDF of error reports]
16
Hardware: Processor Bug
WER helped fix a hardware error; the manufacturer could have caught this earlier with WER. [chart: error reports by day]
17
WER works because … bucketing mostly works.
Windows Error Reporting (WER) is the first post-mortem reporting system with automatic diagnosis, and the largest client-server system in the world (by installs). It has helped 700 companies fix thousands of bugs and billions of errors, and it fundamentally changed software development at Microsoft. http://winqual.microsoft.com
18
CRANE: Risk Prediction and Change Risk Analysis Goal: to improve hotfix quality and response time
19
CRANE adoption in Windows
Retrospective evaluation of CRANE on Windows; categorization of fixes that failed in the field.
Recommendation: make metrics simple, empirical and insightful; project- and context-specific; non-redundant; and actionable.
Jacek Czerwonka, Rajiv Das, Nachiappan Nagappan, Alex Tarvo, Alex Teterev: CRANE: Failure Prediction, Change Analysis and Test Prioritization in Practice – Experiences from Windows. ICST 2011: 357-366.
20
Improving Testing Processes
Release cycles impact the verification process; testing becomes a bottleneck for development. How much testing is enough? How reliable and effective are tests? When should we run a test?
Kim Herzig, Michaela Greiler, Jacek Czerwonka, Brendan Murphy: The Art of Testing Less without Sacrificing Code Quality. ICSE 2015.
21
Engineering Process [diagram: from the engineer's desktop through the integration process]
22
System and Integration Testing
Quality gates: developers have to pass quality gates (with no control over test selection), which check system constraints, e.g. compatibility or performance.
Failures are not isolated: they require human inspection and cause a development freeze for the corresponding branch.
23
System and Integration Testing
Software testing is expensive: 10k+ gates executed, 1M+ test cases, across different branches, architectures, languages, … It aims to find code issues as early as possible, but it slows down product development.
24
Research Objective
Only run effective and reliable tests. Not every test performs equally well; it depends on the code base. Reduce the execution frequency of tests that cause false test alarms (failures due to test and infrastructure issues).
Do not sacrifice code quality: run every test at least once on every code change, and eventually find all code defects – taking the risk of finding some defects later is acceptable.
Running fewer tests increases code velocity. We cannot run all tests on all code changes anymore; identify the tests that are more likely to find defects (not coverage).
25
Historic Test Failure Probabilities
Analyzing past test runs yields failure probabilities from the execution history. These probabilities depend on the execution context!
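A minimal sketch of the kind of skip decision studied in this work: estimate, from a test's execution history, how often a failure is genuine versus a false alarm, and skip the test when the expected cost of running it exceeds the expected cost of skipping. The cost figures and the `should_skip` function are invented for illustration, not the paper's actual model.

```python
# Hedged sketch: cost-based test selection from historic failure data.

def failure_probs(history):
    """history: list of 'pass' | 'false_alarm' | 'true_failure' outcomes."""
    n = len(history)
    return (history.count("true_failure") / n,
            history.count("false_alarm") / n)

def should_skip(history, cost_exec=1.0, cost_inspect=50.0, cost_escape=500.0):
    # Illustrative cost units: machine time to run, human time to triage a
    # false alarm, and the (much larger) cost of a defect escaping to later.
    p_true, p_false = failure_probs(history)
    cost_run = cost_exec + p_false * cost_inspect   # run + triage noise
    cost_skip = p_true * cost_escape                # defect found later
    return cost_skip < cost_run

flaky = ["pass"] * 90 + ["false_alarm"] * 10   # never finds real bugs
useful = ["pass"] * 95 + ["true_failure"] * 5  # does find real defects

print(should_skip(flaky))    # True: false alarms make running it a net cost
print(should_skip(useful))   # False: skipping risks an escaped defect
```

The asymmetry between triage cost and escape cost is what makes it rational to keep running a rarely-failing but genuine test while skipping a noisy one.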
26
Does it Pay Off?
Fewer test executions reduce cost; taking risk increases cost.
Studied periods: ~11 months, >30 million test executions, multiple branches; ~3 months, >1.2 million test executions, single branch; ~12 months, >6.5 million test executions, multiple branches.
27
Across All Products
Results vary with the branching structure and the runtime of tests, but we save cost on all products. Fine-tuning is possible – better results, but not general.
28
Dynamic & Self-Adaptive
Probabilities are dynamic (they change over time), and skipping tests influences the risk factors of higher-level branches. Tests are re-enabled when code quality drops: a feedback loop between decision points and a training period automatically enable tests again.
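The self-adaptive behaviour can be sketched as a sliding-window estimate plus a forced re-enable after a training period. The window size, threshold, and class below are invented for illustration; the paper's actual mechanism differs in detail.

```python
# Hedged sketch: dynamic, self-adapting skip decision for a single test.
from collections import deque

class AdaptiveTest:
    def __init__(self, window=20, skip_threshold=0.02, max_skips=5):
        self.outcomes = deque(maxlen=window)  # recent 'pass'/'fail' results
        self.skipped_in_a_row = 0
        self.skip_threshold = skip_threshold
        self.max_skips = max_skips

    def p_fail(self):
        # Failure probability over the sliding window only: old history
        # ages out, so the estimate tracks current code quality.
        if not self.outcomes:
            return 1.0  # no history yet: always run
        return self.outcomes.count("fail") / len(self.outcomes)

    def should_run(self):
        # Training period: force a re-run after too many consecutive skips,
        # otherwise run only if recent history shows enough failures.
        if self.skipped_in_a_row >= self.max_skips:
            return True
        return self.p_fail() >= self.skip_threshold

    def record(self, ran, outcome=None):
        if ran:
            self.outcomes.append(outcome)
            self.skipped_in_a_row = 0
        else:
            self.skipped_in_a_row += 1

t = AdaptiveTest()
for _ in range(20):
    t.record(ran=True, outcome="pass")
print(t.should_run())   # False: no recent failures, below threshold
for _ in range(5):
    t.record(ran=False)
print(t.should_run())   # True: forced re-run after 5 consecutive skips
```

The forced re-run is the feedback loop: a skipped test cannot stay skipped forever, so a quality drop is eventually observed and the probabilities update.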
29
Impact on Development Process
Secondary improvements:
Machine setup – we may lower the number of machines allocated to the testing process.
Developer satisfaction – removing false test failures increases confidence in the testing process.
Development speed – the impact on development speed is hard to estimate through simulation.
Product teams invest because they believe that removing tests:
increases code velocity (at least a lower bound)
avoids additional changes due to merge conflicts
reduces the number of required integration branches, as their main purpose is to test the product
"We used the data your team has provided to cut a bunch of bad content and are running a much leaner BVT system […] we're panning out to scale about 4x and run in well under 2 hours" (Jason Means, Windows BVT PM)
30
Statistics Trends WER, CRANE Testing IntelliTest Code Hunt Z3 And Friends Prevention Education Hardware Maintenance
31
Prevention
32
Continual abstraction
33
Z3: Automated Theorem Prover
Won 19/21 divisions in the SMT 2011 competition; named the most influential tool paper in the first 20 years of TACAS (2014).
Z3 reasons over a combination of logical theories: Boolean algebra, bit vectors, linear arithmetic, floating point, first-order axioms, non-linear arithmetic over the reals, algebraic data types, sets/maps/…
Leonardo de Moura and Nikolaj Bjørner. Satisfiability modulo theories: introduction and applications. Commun. ACM, 54(9):69-77, 2011.
34
SAGE: Binary File Fuzzing
Symbolic execution of x86 traces to generate new input files; automated test generation and safety/termination checking. Z3 theories: bit vectors and arrays.
[chart: fuzzing bugs found in Win7 (over 100s of file parsers) – SAGE vs. random + regression vs. all others]
Corral: Whole-Program Analysis
Finds assertion violations using stratified inlining of procedures and calls to Z3. Z3 theories: arrays, linear arithmetic, bit vectors, uninterpreted functions.
As of Windows Threshold, Corral is the program analysis engine for SDV (Static Driver Verifier).
35
Validating Network ACLs in the Datacenter
Problem: 1000s of devices; low-level access control lists for different policies; updates to an edge ACL can break policies; complexity is "inhumane".
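The validation problem can be illustrated with a minimal sketch: check whether an edited ACL still behaves like the original by looking for a packet the two treat differently. The rule format and the brute-force search below are invented for illustration; the real work encodes rules as bit-vector constraints and asks an SMT solver such as Z3 for a counterexample over the full header space.

```python
# Hedged sketch: differencing two ACLs over a tiny sampled packet space.

# A rule: (action, src_range, dst_port_range); first match wins.
OLD_ACL = [
    ("deny",  range(10, 20), range(0, 1024)),   # block low ports from srcs 10-19
    ("allow", range(0, 256), range(0, 65536)),
]
NEW_ACL = [  # an edit that accidentally narrows the deny rule
    ("deny",  range(10, 15), range(0, 1024)),
    ("allow", range(0, 256), range(0, 65536)),
]

def decide(acl, src, port):
    for action, srcs, ports in acl:
        if src in srcs and port in ports:
            return action
    return "deny"  # default deny

def diff(acl_a, acl_b, srcs=range(0, 32), ports=range(0, 2048)):
    """Return packets the two ACLs treat differently (counterexamples)."""
    return [(s, p) for s in srcs for p in ports
            if decide(acl_a, s, p) != decide(acl_b, s, p)]

cex = diff(OLD_ACL, NEW_ACL)
print(len(cex) > 0)   # True: the edit changed behaviour
print(cex[0])         # (15, 0): this packet was denied, now it is allowed
```

Brute force only works for toy spaces; the point of the solver-based approach is that the same question can be answered symbolically over all 2^104 packet headers.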
36
Education
37
Statistics Trends WER, CRANE Testing IntelliTest Code Hunt Z3 And Friends Prevention Education Hardware Maintenance
38
IntelliTest in Visual Studio 2015
Available in Visual Studio since 2010 (as Pex and Smart Unit Tests).
Nikolai Tillmann, Jonathan de Halleux, Tao Xie: Transferring an automated test generation tool to practice: from Pex to Fakes and Code Digger. ASE 2014: 385-396.
39
Working and learning for fun
Enjoyment adds to long-term retention on a task. Discovery is a powerful driver, in contrast with direct instruction. Gaming joins these two, and is hugely popular. Can we add these elements to coding? Code Hunt can! www.codehunt.com
40
Code Hunt
A serious programming game. It works in C# and Java (Python coming), and appeals to coders wishing to hone their programming skills as well as to students learning to code. Code Hunt has had over 300,000 users since launching in March 2014, with around 1,000 users a day. Stickiness (loyalty) is very high.
45
Gameplay
1. User writes code in the browser.
2. Cloud analyzes the code – test cases show differences from the secret code.
As long as there are differences, the user must adapt the code and repeat. When there are no more differences, the user wins the level!
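The gameplay loop above can be sketched in a few lines: the engine compares the player's attempt against a hidden reference on generated inputs and reports a mismatch as a failing test case until none remain. The puzzle, the naive input sampler, and all names below are invented for illustration; real Code Hunt derives inputs with IntelliTest/Pex rather than sampling.

```python
# Hedged sketch of the Code Hunt check: player code vs. hidden secret code.

def secret(x):
    # The hidden puzzle solution: the x-th triangular number.
    return x * (x + 1) // 2

def find_mismatch(player, inputs=range(-50, 51)):
    """Return a failing test case (input, got, expected), or None if solved."""
    for x in inputs:
        if player(x) != secret(x):
            return x, player(x), secret(x)
    return None  # no differences: level won

attempt1 = lambda x: x * x   # first guess: wrong
attempt2 = lambda x: sum(range(x + 1)) if x >= 0 else x * (x + 1) // 2

print(find_mismatch(attempt1))   # a concrete counterexample to show the player
print(find_mismatch(attempt2))   # None: matches the secret on all inputs
```

Showing only concrete failing inputs, never the secret code itself, is what turns the exercise into a guessing-and-refinement game.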
46
Dynamic Symbolic Execution

void CoverMe(int[] a) {
    if (a == null) return;
    if (a.Length > 0)
        if (a[0] == 1234567890)
            throw new Exception("bug");
}

Execute & monitor, solve the flipped constraint, choose the next path:

Input        | Observed constraints                       | Constraint to solve → next input
null         | a==null                                    | a!=null → {}
{}           | a!=null && !(a.Length>0)                   | a!=null && a.Length>0 → {0}
{0}          | a!=null && a.Length>0 && a[0]!=1234567890  | a!=null && a.Length>0 && a[0]==1234567890 → {123…}
{123…}       | a!=null && a.Length>0 && a[0]==1234567890  | Done: there is no path left.
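The loop in the table can be sketched end to end in Python. This is a toy illustration of dynamic symbolic execution (the technique behind Pex/IntelliTest): run concretely, record the branch decisions, flip one decision, and "solve" for an input that takes the new path. The hand-rolled `solve` below stands in for a real SMT solver such as Z3, and all names are invented for illustration.

```python
# Hedged sketch of a DSE loop for the CoverMe example.

def cover_me(a):
    """Python port of the C# CoverMe method from the slide."""
    if a is None:
        return "early"
    if len(a) > 0:
        if a[0] == 1234567890:
            raise Exception("bug")
    return "ok"

def run_and_observe(a):
    """Concretely mirror cover_me and record the branch conditions taken."""
    path = []
    if a is None:
        path.append(("a==null", True))
        return path
    path.append(("a==null", False))
    if len(a) > 0:
        path.append(("len>0", True))
        path.append(("a0==magic", a[0] == 1234567890))
    else:
        path.append(("len>0", False))
    return path

def solve(constraints):
    """Tiny hand-rolled 'solver': map a constraint set to a concrete input.
    A real DSE engine hands these constraints to an SMT solver."""
    env = dict(constraints)
    if env.get("a==null", False):
        return None
    if not env.get("len>0", False):
        return []
    return [1234567890] if env.get("a0==magic") else [0]

# DSE loop: start from a seed input, then repeatedly flip a branch
# decision on an observed path to steer execution down a new path.
inputs, explored, worklist = [], set(), [None]
while worklist:
    a = worklist.pop()
    path = run_and_observe(a)
    if tuple(path) in explored:
        continue
    explored.add(tuple(path))
    inputs.append(a)
    for i in range(len(path)):
        flipped = path[:i] + [(path[i][0], not path[i][1])]
        worklist.append(solve(flipped))

# The generated inputs cover all four paths; one of them exposes the bug.
bugs = []
for a in inputs:
    try:
        cover_me(a)
    except Exception:
        bugs.append(a)
print(inputs)   # [None, [], [0], [1234567890]]
print(bugs)     # [[1234567890]]
```

Four runs suffice to cover every path, including the one that throws – exactly the four rows of the table above.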
47
Code Hunt – the APCS (default) zone
Opened in March 2014; 129 problems covering the Advanced Placement Computer Science course. By August 2014, over 45,000 users had started it.
48
Effect of difficulty on drop-off in sectors 1-3. [chart legend: yellow – division; blue – operators; green – sectors]
49
Aug 2014 and Feb 2015

Puzzle                               Level   Aug   Feb-A
Compute -X                           1.1     17    22
Compute 4 / X                        1.6     18    21
Compute X-Y                          1.7     18    22
Compute X/Y                          1.11    32    38
Compute X%3+1                        1.13    15    18
Compute 10%X                         1.14    12    16
Construct a list of numbers 0..N-1   2.1     37    48
Construct a list of multiples of N   2.2     19    23
Compute x^y                          3.1     11    18
Compute X! the factorial of X        3.2     16    19
Compute sum of i*(i+1)/2             3.5     17    22
50
Towards a Course Experience
51
Public data release in open source
For ImCupSept: 257 users × 24 puzzles × approx. 10 tries = about 13,000 programs, for experimentation on how people program and reach solutions.
Total Try Count / Average Try Count / Max Try Count / Total Solved / Users: 1337436313061581
github.com/microsoft/code-hunt
52
Upcoming events
PLOOC 2015 at PLDI 2015, June 14, 2015, Portland, OR, USA
CHESE 2015 at ISSTA 2015, July 14, 2015, Baltimore, MD, USA
Worldwide intern and summer school contests
Public Code Hunt contests are over for the summer
Special ICSE attendees contest – register at aka.ms/ICSE2015
Code Hunt workshop, February 2015
53
Summary – Code Hunt: A Game for Coding
1. Powerful and versatile platform for coding as a game.
2. Unique in working from unit tests, not specifications.
3. Contest experience fun and robust.
4. Large contest numbers with public data sets from cloud data – enables testing hypotheses and drawing conclusions about how players master coding, and what holds them up.
5. Has potential to be a teaching platform – collaborators needed.
57
Websites
Game: www.codehunt.com
Project: research.microsoft.com/codehunt
Community: research.microsoft.com/codehuntcommunity
Data release: github.com/microsoft/code-hunt
Blogs: linked on the Project page
Office Mix: mix.office.com
58
Conclusions
1. Software runs on hardware, and hardware is increasingly varied.
2. The hardware sector that is growing (mobile) is the trickiest.
3. Maintenance increases in complexity with the number of deployments.
4. Addressing human factors in large maintenance teams pays off.
5. Prevention is a hugely valuable aid to maintenance.
6. Gaming is a way of practicing software engineering skills.
Thank you! Questions?