Deep Learning for Program Repair Aditya Kanade Indian Institute of Science Dagstuhl Seminar on Automated Program Repair, January 2017
A Probabilistic Perspective on Program Repair
Specification = a set of examples = { (x, y) | x is a faulty program and y is its fixed version }
Represent the faulty and fixed programs as token sequences x = x1, x2, …, xn and y = y1, y2, …, yk
Learn a conditional probability distribution P(Y = y | X = x)
Repair procedure: to fix an unseen faulty program x, generate y = argmax_y' P(Y = y' | X = x)
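The repair procedure above can be sketched as follows. The candidate set and probability table here are invented for illustration; in practice the conditional distribution comes from a trained model:

```python
# Toy sketch of the probabilistic repair procedure: given a faulty token
# sequence x, pick the candidate fix y maximizing P(Y = y | X = x).
# The candidates and probabilities below are made up for illustration.

def repair(x, candidates, cond_prob):
    """Return argmax_y P(y | x) over the candidate fixes."""
    return max(candidates, key=lambda y: cond_prob(x, y))

# Hypothetical scored fixes for the faulty fragment "int x = 1"
table = {
    ("int", "x", "=", "1", ";"): 0.7,   # insert missing semicolon
    ("int", "x", "=", "1"): 0.1,        # leave unchanged
    ("int", "x", ";"): 0.2,             # drop the initializer
}

best = repair(
    x=("int", "x", "=", "1"),
    candidates=table.keys(),
    cond_prob=lambda x, y: table[y],
)
# best is the highest-probability fix: ("int", "x", "=", "1", ";")
```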
Seq2seq Neural Net with Attention Mechanism
The attention mechanism lets the decoder focus on different parts of the input sequence, rather than summarizing the entire input in a single fixed-size context vector
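A minimal numerical sketch of the idea, using simple dot-product attention (the actual network uses a learned attention variant; the matrices here are toy values):

```python
import numpy as np

# Sketch of dot-product attention: H holds one encoder hidden state per
# input token, s is the current decoder state. The context vector is a
# per-step weighted sum of encoder states, rather than a single
# fixed-size summary of the whole input.

def attention(H, s):
    scores = H @ s                      # one alignment score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax -> attention weights
    context = weights @ H               # weighted sum of encoder states
    return weights, context

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 input tokens
s = np.array([1.0, 0.0])                            # decoder state
w, c = attention(H, s)
# w sums to 1; input positions most aligned with s receive the largest weight
```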
Neural Network Architecture [Figure: encoder and decoder]
DeepFix: Fixing Common Programming Errors
Common programming errors: unlike logical errors, these are not specific to the programming task at hand but relate to the overall syntax and structure of the programming language
Analogous to grammatical errors in natural languages; both experienced developers and novice students make them
Detected by compilers as syntactic errors, but compilers neither localize them accurately nor fix them
Example [Figure: an input faulty program with attention weights, and the program as repaired by DeepFix]
The Iterative Repair Strategy of DeepFix An end-to-end deep learning solution where the neural net learns to both localize and fix the faults.
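The iterative strategy can be sketched as a fix-recompile loop. `predict_fix` and `compiles` below are hypothetical stand-ins for the trained network and the compiler oracle:

```python
# Sketch of an iterative repair loop: ask the network for one fix, splice
# it into the program, recompile, and repeat until the program compiles
# or no further fix is proposed. All helper functions here are toys.

def iterative_repair(program, predict_fix, compiles, max_iters=5):
    for _ in range(max_iters):
        if compiles(program):
            return program, True        # fully fixed
        fix = predict_fix(program)      # (line_number, replacement) or None
        if fix is None:
            break                       # no further suggestion
        line_no, replacement = fix
        program = program[:line_no] + [replacement] + program[line_no + 1:]
    return program, compiles(program)

# Toy oracle: a line is "broken" if it ends with none of ';', '{', '}'.
prog = ["int main() {", "int x = 1", "return 0;", "}"]
broken = lambda l: l != "" and not l.endswith((";", "{", "}"))
compiles = lambda p: not any(broken(l) for l in p)

def predict_fix(p):
    for i, l in enumerate(p):
        if broken(l):
            return i, l + ";"           # toy fix: append a semicolon
    return None

fixed, ok = iterative_repair(prog, predict_fix, compiles)
# ok is True; the line "int x = 1" becomes "int x = 1;"
```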
Experimental Setup
Identifiers (variable and method names) are canonicalized
Resulting vocabulary size is 129; tokens are embedded in 50-dimensional vectors
Seq2seq network with attention, implemented in TensorFlow
Both encoder and decoder are 4 stacked GRU layers with 300 cells per layer
Dropout at the rate of 0.2 on non-recurrent connections
Training configuration:
Trained using ~150K examples created synthetically
Stochastic gradient descent using the Adam optimizer with mini-batches of size 128
Gradients clipped to [-1, 1]
Training time of 1-2 hours per epoch
The model with peak validation performance over 20 epochs is selected
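Identifier canonicalization can be sketched as below: program-specific names are mapped to a small pool of placeholder tokens so the vocabulary stays fixed. The tokenizer and keyword list here are toys; real C lexing needs fuller handling:

```python
import re

# Sketch of identifier canonicalization: map each distinct identifier to
# a placeholder token ID1, ID2, ..., keeping keywords and symbols as-is.
# The keyword list is a small illustrative subset of C.

C_KEYWORDS = {"int", "return", "if", "else", "for", "while", "void"}

def canonicalize(tokens):
    mapping = {}
    out = []
    for t in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in C_KEYWORDS:
            mapping.setdefault(t, f"ID{len(mapping) + 1}")
            out.append(mapping[t])
        else:
            out.append(t)
    return out, mapping

toks, m = canonicalize(["int", "count", "=", "count", "+", "limit", ";"])
# toks == ["int", "ID1", "=", "ID1", "+", "ID2", ";"]
```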
Results
Evaluated on 6971 faulty C programs submitted to 93 programming tasks in an introductory programming course
Fixed 27% of the programs completely, such that they compile without errors
Fixed a further 19% of the programs partially
Evaluated on 9230 C programs with seeded faults
Fixed 56% of the programs completely
Fixed a further 17% of the programs partially
Localization accuracy on seeded faults: Top-1 = 79%, Top-5 = 89%
Types of errors fixed: missing delimiters, extraneous symbols, swapped symbols, missing variable declarations, etc.
Some Plots [Figures: PCA projections of vector representations of correct (circles) and faulty (triangles) programs; distribution of fixed programs by task]
Lightweight Machine Learning
Statistical Correlation to Search for Likely Fixes
Given a test suite and a potential faulty location, how to search for expressions that are likely to appear in the fixed version?
Use symbolic execution to obtain desired expression values
Enumerate a set of expressions and their values on all executions
Rank expressions using statistical correlation with the desired values
Used for synthesis of repair hints in MintHint [ICSE'14]
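The ranking step can be sketched as follows. Pearson correlation is used here for illustration; MintHint's actual scoring differs in detail, and the desired/candidate values are invented:

```python
import numpy as np

# Sketch of correlation-based ranking: given desired values at the faulty
# location (one per execution) and the values of enumerated candidate
# expressions on the same executions, rank candidates by the absolute
# Pearson correlation with the desired values.

def rank_by_correlation(desired, candidates):
    scored = []
    for expr, values in candidates.items():
        r = np.corrcoef(desired, values)[0, 1]
        scored.append((expr, 0.0 if np.isnan(r) else abs(r)))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

desired = np.array([2.0, 4.0, 6.0, 8.0])       # from symbolic execution
candidates = {
    "x + x": np.array([2.0, 4.0, 6.0, 8.0]),   # matches exactly
    "x - 1": np.array([0.0, 1.0, 2.0, 3.0]),   # linearly related
    "0":     np.array([0.0, 0.0, 0.0, 0.0]),   # constant, uninformative
}
ranking = rank_by_correlation(desired, candidates)
# the constant candidate "0" ranks last; the linear ones score 1.0
```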
Clustering Programs for Large-Scale Feedback Generation on Student Code
[Figure: student programs are clustered by solution strategy; the instructor certifies one submission from each cluster as correct; verified feedback is then generated for the remaining submissions]
Used in CoderAssist [FSE'16] to generate feedback on student programs
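The workflow above can be sketched by grouping submissions under a coarse strategy signature. The signature function here is a toy stand-in; deriving real solution strategies requires program analysis:

```python
from collections import defaultdict

# Sketch of clustering student submissions by solution strategy: programs
# with the same strategy signature land in the same cluster, and one
# representative per cluster would then be certified by the instructor.

def cluster_by_strategy(programs, strategy_signature):
    clusters = defaultdict(list)
    for p in programs:
        clusters[strategy_signature(p)].append(p)
    return clusters

# Toy signature for factorial submissions: iterative vs. recursive.
def sig(src):
    if "for" in src or "while" in src:
        return "loop"
    if "fact(" in src:
        return "recursive"
    return "other"

programs = [
    "int fact(int n){int r=1;for(;n>1;n--)r*=n;return r;}",
    "int fact(int n){return n<2 ? 1 : n*fact(n-1);}",
    "int fact(int n){int r=1;while(n>1)r*=n--;return r;}",
]
clusters = cluster_by_strategy(programs, sig)
# two clusters: "loop" with two programs, "recursive" with one
```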
Acknowledgements
DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. AAAI'17
Semi-supervised verified feedback generation. Shalini Kaleeswaran, Anirudh Santhiar, Aditya Kanade, Sumit Gulwani. FSE'16
MintHint: Automated synthesis of repair hints. Shalini Kaleeswaran, Varun Tulsian, Aditya Kanade, Alessandro Orso. ICSE'14
Some images are taken from the Deep Learning book by Goodfellow, Bengio and Courville
Discussion Points Improving performance of DeepFix, e.g., using a copying mechanism? Fixing more challenging programming errors? Handling larger programs? Learning better dependences? General-purpose deep nets versus special-purpose deep nets designed for program text? How to obtain high-quality training data? Mutation strategies?