PROGRAMMING USING AUTOMATA AND TRANSDUCERS
Loris D’Antoni, Margus Veanes




All features of a general-purpose language vs. the features needed: replace, match, char…

FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
– has only the features required by the task
– is simple to use
– enables automatic reasoning about what the programs do
– compiles into efficient code

OUTLINE
– Automata, transducers, and programs
– BEK and string sanitizers
– BEX and string encoders
– FAST and tree manipulating programs
– What’s next?

AUTOMATA, TRANSDUCERS, AND PROGRAMS

FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
– has only the features required by the task
– is simple to use
– enables automatic reasoning about what the programs do
– compiles into efficient code

Finite alphabet, languages of strings, and transformations from strings to strings

    type base = A | T | C | G

    (* Language of strings: every element is T or G *)
    let rec all_TG (l: base list) : bool =
      match l with
      | [] -> true
      | h :: t -> (h = T || h = G) && all_TG t

    (* Language of strings: every element is A or C *)
    let rec all_AC (l: base list) : bool =
      match l with
      | [] -> true
      | h :: t -> (h = A || h = C) && all_AC t

    (* Transformation: replace each base by its complement *)
    let rec map_base (l: base list) : base list =
      match l with
      | [] -> []
      | A :: t -> T :: map_base t
      | T :: t -> A :: map_base t
      | G :: t -> C :: map_base t
      | C :: t -> G :: map_base t

    (* Transformation: keep only A and C *)
    let rec filter_AC (l: base list) : base list =
      match l with
      | [] -> []
      | A :: t -> A :: filter_AC t
      | T :: t -> filter_AC t
      | G :: t -> filter_AC t
      | C :: t -> C :: filter_AC t

Corresponding automata and transducers (each has a single state q0):
– all_TG: automaton with loops on T, G
– all_AC: automaton with loops on A, C
– map_base: transducer with final output ε and loops A/T, T/A, G/C, C/G
– filter_AC: transducer with final output ε and loops A/A, T/ε, G/ε, C/C
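The correspondence between the four functions and their one-state machines can be sketched in a few lines of Python. This is only an illustration: the dictionary and set encodings, and the names `accepts` and `transduce`, are assumptions made for this sketch and do not come from the slides.

```python
# One-state automata: the set of symbols that have a loop on q0.
ALL_TG = {"T", "G"}
ALL_AC = {"A", "C"}

# One-state transducers: each input symbol maps to an output string ("" plays the role of ε).
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def accepts(allowed, word):
    """Run the one-state automaton: reject as soon as a symbol has no transition."""
    return all(symbol in allowed for symbol in word)


def transduce(table, word):
    """Run the one-state transducer and concatenate the output of each transition."""
    return "".join(table[symbol] for symbol in word)


print(accepts(ALL_TG, "TGGT"))           # True
print(transduce(MAP_BASE, "GATTACA"))    # CTAATGT
print(transduce(FILTER_AC, "GATTACA"))   # AACA
```

Strings stand in for `base list` values here, which keeps the example short while preserving the behavior of the recursive definitions.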

FINITE AUTOMATA
(Diagram: automaton with transitions labeled a and b)

Input   Accepted?
abab    Yes
aba     No
bb      Yes
a       No

FINITE STATE TRANSDUCERS
(Diagram: transducer with transitions labeled a/aa and b/bb and final output zz)

Input   Output
ab      aabbzz
b       bbzz
aba     UNDEFINED

BENEFITS OF AUTOMATA AND TRANSDUCERS
Closure and decidability for automata:
– Intersection, union, complement
– Decidable emptiness
– Decidable equivalence
– Can be minimized

BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer composition

    let m_f_DNA l : base list = filter_AC (map_base l)

Transitions on the single state q0:
– map_base: A/T, T/A, G/C, C/G
– filter_AC: A/A, T/ε, G/ε, C/C
– m_f_DNA (the composition): A/ε, T/A, G/C, C/ε
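For one-state transducers such as these, composition can be computed symbol by symbol: push every output of the first machine through the second. The sketch below assumes the same dictionary encoding as before; `compose` is a name chosen for this example.

```python
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def compose(first, second):
    """Compose two one-state transducers: run `second` on each output of `first`."""
    return {a: "".join(second[b] for b in out) for a, out in first.items()}


# m_f_DNA l = filter_AC (map_base l)
M_F_DNA = compose(MAP_BASE, FILTER_AC)
print(M_F_DNA)  # {'A': '', 'T': 'A', 'G': 'C', 'C': ''}
```

The resulting table matches the transitions A/ε, T/A, G/C, C/ε listed for m_f_DNA. Transducers with several states need a product construction instead; this shortcut only works because each machine has a single state.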

BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking: input in all_TG → map_base → output in all_AC
map_base o (¬ all_AC): map_base restricted so that it is only defined if the output is in (¬ all_AC)

BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking: input in all_TG → map_base → output in all_AC
dom(map_base o (¬ all_AC)): the inputs for which map_base does not output in all_AC

BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking: input in all_TG → map_base → output in all_AC
The check succeeds if and only if dom(map_base o (¬ all_AC)) ∩ all_TG = ∅
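The emptiness check can be carried out with ordinary automata operations. The following is a minimal sketch under stated assumptions: DFAs are plain dictionaries, the transducer has one state, and every function name here is hypothetical. It builds the pre-image of ¬all_AC under map_base, intersects it with all_TG, and searches for an accepted word.

```python
from collections import deque
from itertools import product

SIGMA = "ATCG"


def one_state_language(allowed):
    """Complete DFA for 'every symbol is in allowed': state 0 accepts, state 1 is a sink."""
    delta = {(0, a): 0 if a in allowed else 1 for a in SIGMA}
    delta.update({(1, a): 1 for a in SIGMA})
    return {"states": {0, 1}, "init": 0, "finals": {0}, "delta": delta}


def complement(dfa):
    """Complement of a complete DFA: swap accepting and rejecting states."""
    return {**dfa, "finals": dfa["states"] - dfa["finals"]}


def preimage(table, dfa):
    """DFA for the inputs that the one-state transducer `table` maps into L(dfa)."""
    def run(q, word):
        for b in word:
            q = dfa["delta"][(q, b)]
        return q
    delta = {(q, a): run(q, table[a]) for q in dfa["states"] for a in SIGMA}
    return {**dfa, "delta": delta}


def intersect(d1, d2):
    """Product construction for the intersection of two complete DFAs."""
    states = set(product(d1["states"], d2["states"]))
    delta = {((p, q), a): (d1["delta"][(p, a)], d2["delta"][(q, a)])
             for (p, q) in states for a in SIGMA}
    finals = {(p, q) for (p, q) in states
              if p in d1["finals"] and q in d2["finals"]}
    return {"states": states, "init": (d1["init"], d2["init"]),
            "finals": finals, "delta": delta}


def witness(dfa):
    """Breadth-first search for a shortest accepted word; None means the language is empty."""
    queue, seen = deque([(dfa["init"], "")]), {dfa["init"]}
    while queue:
        q, word = queue.popleft()
        if q in dfa["finals"]:
            return word
        for a in SIGMA:
            nxt = dfa["delta"][(q, a)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + a))
    return None


MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
all_TG, all_AC = one_state_language("TG"), one_state_language("AC")

# dom(map_base o (¬ all_AC)) ∩ all_TG
bad_inputs = intersect(preimage(MAP_BASE, complement(all_AC)), all_TG)
print(witness(bad_inputs))  # None: no input in all_TG is mapped outside all_AC
```

When the check fails, `witness` returns a concrete counterexample. For instance, replacing `all_TG` by `all_AC` as the input type yields the input "A", which map_base sends to "T".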

BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer equivalence

    let m_f_DNA l : base list = filter_AC (map_base l)
    let f_m_DNA l : base list = map_base (filter_AC l)

Is m_f_DNA equivalent to f_m_DNA?
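For this particular pair the question can be answered by hand. Both compositions are total one-state transducers, that is, string homomorphisms, so they are equivalent exactly when they agree on every single symbol. The sketch below relies on that special case; general transducer equivalence needs the algorithms the talk refers to. The names `compose` and `differentiating_input` are assumptions for this example.

```python
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def compose(first, second):
    """Compose two one-state transducers: run `second` on each output of `first`."""
    return {a: "".join(second[b] for b in out) for a, out in first.items()}


def differentiating_input(t1, t2):
    """Return a one-symbol input on which the two transducers differ, or None if equivalent."""
    for a in sorted(t1):
        if t1[a] != t2[a]:
            return a
    return None


m_f_DNA = compose(MAP_BASE, FILTER_AC)   # filter_AC (map_base l)
f_m_DNA = compose(FILTER_AC, MAP_BASE)   # map_base (filter_AC l)

print(differentiating_input(m_f_DNA, f_m_DNA))  # A
print(repr(m_f_DNA["A"]), repr(f_m_DNA["A"]))   # '' 'T'
```

So the two programs are not equivalent: on the input A, m_f_DNA outputs the empty list while f_m_DNA outputs T.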

FOR EACH DOMAIN-SPECIFIC TASK
Design a language that:
– has only the features required by the task
– is simple to use
– enables automatic reasoning about what the programs do
– compiles into efficient code

OUTLINE
– Automata, transducers, and programs
– BEK and string sanitizers
– BEX and string encoders
– FAST and tree manipulating programs
– What’s next?

BEK: ANALYSIS OF STRING SANITIZERS
P. Hooimeijer, M. Veanes, B. Livshits, D. Molnar, P. Saxena [USENIX11, POPL12]


QUESTION: What could possibly go wrong?

Attacker: gollum.png' onload='javascript:...
Result: <img src='gollum.png' onload='javascript:…

Attacker: im.png' onload='javascript:...
Result: <img src='im.png' onload='javascri…
I found my PRECIOUSSS.


FIRST LINE OF DEFENSE: SANITIZERS
Sanitizer: a string transformation function.
Untrusted data → sanitizer → sanitized data (strings shown: “im.png' …” and “img.png' …”)
PLDI'12 submission presentations, Dec 8, 2011
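A small Python illustration of the attack and of a sanitizer used as a plain string transformation. It is not taken from the talk: `render_unsafe` and `render_escaped` are hypothetical helpers, and the sanitizer is the standard library's `html.escape`, which with `quote=True` also encodes single and double quotes.

```python
import html


def render_unsafe(filename):
    """Build the tag by plain concatenation, as the vulnerable page does."""
    return "<img src='" + filename + "'>"


def render_escaped(filename):
    """Sanitize the untrusted string first; quote=True also encodes ' and "."""
    return "<img src='" + html.escape(filename, quote=True) + "'>"


attack = "gollum.png' onload='javascript:..."
print(render_unsafe(attack))
# <img src='gollum.png' onload='javascript:...'>
print(render_escaped(attack))
# <img src='gollum.png&#x27; onload=&#x27;javascript:...'>
```

In the escaped version the attacker's quotes no longer terminate the `src` attribute, so no `onload` attribute is created.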

COMPARING SANITIZERS

32 ' ' single quote html entity

Input: some untrusted input

                Library A                            Library B
Name:           HtmlEncode                           HtmlEncode
Around for:     Years                                Years
Availability:   Readily available to C# developers   Readily available to C# developers
Encoding of ':  ✔                                    ✘

.NET WebUtility

    public static string HtmlEncode(string s)
    {
        if (s == null) return null;
        int num = IndexOfHtmlEncodingChars(s, 0);
        if (num == -1) return s;
        StringBuilder builder = new StringBuilder(s.Length + 5);
        int length = s.Length;
        int startIndex = 0;
    Label_002A:
        if (num > startIndex)
        {
            builder.Append(s, startIndex, num - startIndex);
        }
        char ch = s[num];
        if (ch > '>')
        {
            builder.Append("&#");
            builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
            builder.Append(';');
        }
        else
        {
            char ch2 = ch;
            if (ch2 != '"')
            {
                switch (ch2)
                {
                    case '<': builder.Append("&lt;"); goto Label_00D5;
                    case '=': goto Label_00D5;
                    case '>': builder.Append("&gt;"); goto Label_00D5;
                    case '&': builder.Append("&amp;"); goto Label_00D5;
                }
            }
            else
            {
                builder.Append("&quot;");
            }
        }
    Label_00D5:
        startIndex = num + 1;
        if (startIndex < length)
        {
            num = IndexOfHtmlEncodingChars(s, startIndex);
            if (num != -1)
            {
                goto Label_002A;
            }
            builder.Append(s, startIndex, length - startIndex);
        }
        return builder.ToString();
    }

MS AntiXSS

    private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak)
    {
        if (string.IsNullOrEmpty(input)) { return input; }
        if (characterValues == null) { InitialiseSafeList(); }
        if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

        // Setup a new character array for output.
        char[] inputAsArray = input.ToCharArray();
        int outputLength = 0;
        int inputLength = inputAsArray.Length;
        char[] encodedInput = new char[inputLength * 10];

        SyncLock.EnterReadLock();
        try
        {
            for (int i = 0; i < inputLength; i++)
            {
                char currentCharacter = inputAsArray[i];
                int currentCodePoint = inputAsArray[i];
                char[] tweekedValue;

                // Check for invalid values
                if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF)
                {
                    throw new InvalidUnicodeValueException(currentCodePoint);
                }
                else if (char.IsHighSurrogate(currentCharacter))
                {
                    if (i + 1 == inputLength)
                    {
                        throw new InvalidSurrogatePairException(currentCharacter, '\0');
                    }

                    // Now peek ahead and check if the following character is a low surrogate.
                    char nextCharacter = inputAsArray[i + 1];
                    char nextCodePoint = inputAsArray[i + 1];
                    if (!char.IsLowSurrogate(nextCharacter))
                    {
                        throw new InvalidSurrogatePairException(currentCharacter, nextCharacter);
                    }

                    // Look-ahead was good, so skip.
                    i++;

                    // Calculate the combined code point
                    long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00);
                    char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint);
                    encodedInput[outputLength++] = '&';
                    for (int j = 0; j < encodedCharacter.Length; j++)
                    {
                        encodedInput[outputLength++] = encodedCharacter[j];
                    }
                    encodedInput[outputLength++] = ';';
                }
                else if (char.IsLowSurrogate(currentCharacter))
                {
                    throw new InvalidSurrogatePairException('\0', currentCharacter);
                }
                else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue))
                {
                    for (int j = 0; j < tweekedValue.Length; j++)
                    {
                        encodedInput[outputLength++] = tweekedValue[j];
                    }
                }
                else if (useNamedEntities && namedEntities[currentCodePoint] != null)
                {
                    char[] encodedCharacter = namedEntities[currentCodePoint];
                    encodedInput[outputLength++] = '&';
                    for (int j = 0; j < encodedCharacter.Length; j++)
                    {
                        encodedInput[outputLength++] = encodedCharacter[j];
                    }
                    encodedInput[outputLength++] = ';';
                }
                else if (characterValues[currentCodePoint] != null)
                {
                    // character needs to be encoded
                    char[] encodedCharacter = characterValues[currentCodePoint];
                    encodedInput[outputLength++] = '&';
                    for (int j = 0; j < encodedCharacter.Length; j++)
                    {
                        encodedInput[outputLength++] = encodedCharacter[j];
                    }
                    encodedInput[outputLength++] = ';';
                }
                else
                {
                    // character does not need encoding
                    encodedInput[outputLength++] = currentCharacter;
                }
            }
        }
        finally
        {
            SyncLock.ExitReadLock();
        }
        return new string(encodedInput, 0, outputLength);
    }

.NET WebUtility vs. MS AntiXSS (the two HtmlEncode implementations side by side):
– Same behavior on all inputs?
– If not, what is a differentiating input?
– Can it generate any known ‘bad’ outputs?

PHP Trunk: changes to html.c, 1999 to 2011
– R7,841 (April): loc
– R32,564 (September 2000): ENT_QUOTES introduced
– R242,949 (September 2007): $double_encode=true
– R309,482 (March): loc

PHP Trunk: changes to html.c, 1999 to 2011
– Safe to apply twice?
– Safe to combine with other sanitizers?

MOTIVATION
– Writing string sanitizers correctly is difficult
– There is no cheap way to identify problems with sanitizers
– ‘Correctness’ is a moving target
What if we could say more about sanitizer behavior?

CONTRIBUTIONS
BEK
– Frontend: a small language for string manipulation, similar to how sanitizers are written today
– Backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation

CONTRIBUTIONS
BEK
– Frontend: a small language for string manipulation, similar to how sanitizers are written today
– Backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation
Evaluation
– Converted sanitizers from a variety of sources
– Checked properties like reversibility, idempotence, equivalence, and commutativity

BEK ARCHITECTURE
Bek program:

    s := iter(c in t)[b := false;] {
      case (!b && c in "[\"\\]"):
        b := false; yield('\\', c);
      case (c == '\\'):
        b := !b; yield(c);
      case (true):
        b := false; yield(c);
    };

Pipeline:
– Transformation: Bek program → Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C

A BEK PROGRAM: ESCAPE QUOTES

    escape := iter(c in s)[b := false;] {
      case (!b && c in "['\"]"):
        b := false; yield('\\', c);
      case (c == '\\'):
        b := !b; yield(c);
      case (true):
        b := false; yield(c);
    };

– Iterate over the characters in string s while updating one boolean variable b
– Simple dedicated syntax
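To make the semantics of the `escape` program concrete, here is a direct Python transliteration. It is a sketch written for this transcript, not output of the BEK code generator, and it assumes that the cases are tried in order and that `"['\"]"` denotes the character class containing the single and double quote.

```python
def escape(s):
    """Python rendering of the Bek `escape` program.

    b is True exactly when the previous character was an unpaired backslash,
    so a quote that follows it is already escaped.
    """
    out = []
    b = False
    for c in s:
        if not b and c in "'\"":      # case (!b && c in "['\"]")
            b = False
            out.append("\\" + c)      # yield('\\', c)
        elif c == "\\":               # case (c == '\\')
            b = not b
            out.append(c)
        else:                         # case (true)
            b = False
            out.append(c)
    return "".join(out)


print(escape("it's"))      # it\'s
print(escape("it\\'s"))    # it\'s  (already escaped, left unchanged)
```

Because an already escaped quote is left alone, applying `escape` twice gives the same result as applying it once, which is the kind of idempotence property the analysis is meant to check automatically.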

BEK ARCHITECTURE
– Transformation: Bek program → Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C

FINITE STATE TRANSDUCERS
Transitions: a/A, b/B, …, z/Z, …, &/&
Problem: the alphabet has 2^16 characters. TOO MANY TRANSITIONS

SYMBOLIC FINITE TRANSDUCERS
Only two transitions!!
– x in [a-z] / x-32
– x not in [a-z] / x

SYMBOLIC FINITE TRANSDUCERS
Example transitions (guard / outputs):
– x>5 / x+1, x
– x%2=1 / x-1, x, x+4
– true / 5
– true / x-4
Guards are predicates; outputs are sequences of functions
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
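A symbolic transition can be represented as a guard predicate paired with a sequence of output functions. The sketch below encodes the two-transition upper-casing transducer (x in [a-z] / x-32, otherwise x) in that style. The tuple encoding and the name `run_sft` are assumptions for this example; Python lambdas replace the Z3 terms a real implementation would use, so nothing here can be analyzed symbolically.

```python
def is_lower(x):
    """Guard: x in [a-z], on character codes."""
    return ord("a") <= x <= ord("z")


# One state, two symbolic transitions: (guard, sequence of output functions).
TO_UPPER = [
    (is_lower, [lambda x: x - 32]),
    (lambda x: not is_lower(x), [lambda x: x]),
]


def run_sft(transitions, text):
    """Run a one-state symbolic transducer over the character codes of `text`."""
    out = []
    for ch in text:
        x = ord(ch)
        for guard, outputs in transitions:
            if guard(x):
                out.extend(chr(f(x)) for f in outputs)
                break
        else:
            raise ValueError("no transition is enabled: output undefined")
    return "".join(out)


print(run_sft(TO_UPPER, "bek & z3"))  # BEK & Z3
```

Two guarded transitions cover all 2^16 characters, where a classical transducer would need one transition per character.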

BEK ARCHITECTURE
– Transformation: Bek program → Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C
Now what?

SFT ALGORITHMS
EQUIVALENCE CHECKING IS DECIDABLE!
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability

SFT ALGORITHMS: EQUIVALENCE CHECKING
AntiXSS.HtmlEncode = WebUtility.HtmlEncode

CLOSED UNDER COMPOSITION
in → SFT A → SFT B → out
is computed by a single SFT:
in → SFT A ∘ B → out

SFT ALGORITHMS: COMPOSITION
in → SFT A → SFT B → out becomes in → SFT A ∘ B → out
JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))

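PRE-IMAGE COMPUTATION
Given SFT A and a regular language O of outputs, compute the regular language I of inputs that A maps into O
in (Regular Language I) → SFT A → out (Regular Language O)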

PRE-IMAGE COMPUTATION
in (MALICIOUS INPUTS) → SFT A → out (Vulnerability signature)

CONTRIBUTIONS
BEK
– Frontend: a small language for string manipulation, similar to how sanitizers are written today
– Backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation
Evaluation
– Converted sanitizers from a variety of sources
– Checked properties like reversibility, idempotence, equivalence, and commutativity

QUESTIONS
– Can BEK model existing sanitizers?
– Can we use it to check interesting properties on real sanitizers?

WHAT FEATURES ARE NEEDED?
Language Features
Data (inspected for feature counts):
– 1x OWASP HTMLencode
– 13x Google AutoEscape
– 21x IE 8 XSS Filter
– 7x Synthetic

WHAT FEATURES ARE NEEDED?
Language Features
– The majority (76%) of sanitizers can be ported without extending the language
– With multi-character lookahead: 90%

CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Data:
– 4x MS internal HtmlEncode
– 3x ‘for hire’ HtmlEncode based on an English-language specification (C#)
Properties: Commutative? Equivalent?

CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
EQ results take less than a minute to obtain:

      1  2  3  4  5  6  7
  1   ✔  ✔  ✔  ✘  ✘  ✔  ✘
  2      ✔  ✔  ✘  ✘  ✔  ✘
  3         ✔  ✘  ✘  ✔  ✘
  4            ✔  ✘  ✘  ✘
  5               ✔  ✘  ✘
  6                  ✔  ✘
  7                     ✔

DOES IT SCALE?
(Plots: Commutativity, Self-Equivalence)

WERE ALL SANITIZERS BROKEN?
The Cheat Sheet
One out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts

BEK IN A NUTSHELL
Conclusion
– BEK is a domain-specific language for writing string sanitizers
– BEK can model programs without approximation using symbolic finite transducers, enabling e.g. equivalence checks
– BEK was evaluated using real-world sanitizers from a variety of different sources

OUTLINE
– Automata, transducers, and programs
– BEK and string sanitizers
– BEX and string encoders
– FAST and tree manipulating programs
– What’s next?

BEX: ANALYSIS OF STRING ENCODERS
Loris D’Antoni, Margus Veanes [VMCAI13, CAV13]

Hi, I’m plain text! Nice to meet you!
→ Encoder →
SGkgSSdtIHBsYWluIHRleHQsIG5pY2UgdG8gbWVldCB5b3Uh
→ Decoder →
Hi, I’m plain text! Nice to meet you!

NOT SO EASY TO GET RIGHT

WHEN ARE THEY CORRECT?
T → Encoder → T’ → Decoder → T

CAN WE USE TRANSDUCERS?
T → Encoder → T’ → Decoder → T
Encoder o Decoder = Identity

BEK: WHAT FEATURES WERE NEEDED?
Language Features
– The majority (76%) of sanitizers can be ported without extending Bek
– With multi-character lookahead: 90%

BASE64 ENCODER
3 bytes → 4 Base64 characters

Text content      M         a         n
Bytes             77        97        110
Bit pattern       01001101  01100001  01101110
Index             19     22     5      46
Base64 encoded    T      W      F      u
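The 3-bytes-to-4-characters step can be reproduced with a few shifts and masks. The following is a minimal sketch for a single full block (no padding); the helper name `encode_block` is an assumption, and the standard `base64` module is only used to confirm the result.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_block(b1, b2, b3):
    """Split 3 bytes (24 bits) into four 6-bit indexes into the Base64 alphabet."""
    return [
        b1 >> 2,                        # top 6 bits of byte 1
        ((b1 & 3) << 4) | (b2 >> 4),    # low 2 bits of byte 1, top 4 bits of byte 2
        ((b2 & 0xF) << 2) | (b3 >> 6),  # low 4 bits of byte 2, top 2 bits of byte 3
        b3 & 0x3F,                      # low 6 bits of byte 3
    ]


indexes = encode_block(*b"Man")
print(indexes)                                 # [19, 22, 5, 46]
print("".join(ALPHABET[i] for i in indexes))   # TWFu
print(base64.b64encode(b"Man").decode())       # TWFu
```

The four index expressions are exactly what an encoder has to compute while reading three input symbols, which is why single-symbol transducers without memory are not enough for this task.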

HOW DO WE EXTEND BEK?

BEK ARCHITECTURE
– Transformation: Bek program → Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C
Symbolic finite transducers don’t have registers ☹

TRANSDUCERS WITH REGISTERS
Transitions of a three-state cycle starting at state 0, with one register r:
– x / [ x>>2 ], r := (x&3)<<4
– x / [ r|(x>>4) ], r := (x&0xF)<<2
– x / [ r|(x>>6), x&0x3F ], r := 0
Transducers with registers are closed under composition
Equivalent to Turing Machines ☹
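Read as a Base64 encoder, the three transitions form a cycle in which the register r carries the leftover bits from one byte to the next. The sketch below simulates that machine in Python. The cyclic state order 0, 1, 2 and the function name `base64_with_register` are assumptions of this example, and padding of incomplete blocks is deliberately not modelled.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_with_register(data):
    """Three-state transducer with one register r; input length must be a multiple of 3."""
    state, r, out = 0, 0, []
    for x in data:
        if state == 0:
            out.append(x >> 2)                 # x / [x>>2]
            r = (x & 3) << 4                   # r := (x&3)<<4
        elif state == 1:
            out.append(r | (x >> 4))           # x / [r|(x>>4)]
            r = (x & 0xF) << 2                 # r := (x&0xF)<<2
        else:
            out.extend([r | (x >> 6), x & 0x3F])   # x / [r|(x>>6), x&0x3F]
            r = 0                              # r := 0
        state = (state + 1) % 3
    if state != 0:
        raise ValueError("incomplete block: padding is not modelled in this sketch")
    return "".join(ALPHABET[i] for i in out)


print(base64_with_register(b"Man"))      # TWFu
print(base64_with_register(b"Hello!") == base64.b64encode(b"Hello!").decode())  # True
```

The register is what makes the machine compact, and also what puts it outside the class of symbolic finite transducers that the BEK analyses handle.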

EXPLORE REGISTER VALUES
Register has finitely many values: remember the last value in the state
2^|bits| states ☹

BASE64 IN BEX
DEMO


BEK ARCHITECTURE
– Transformation: Bek program → Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C

BEX ARCHITECTURE
– Transformation: Bex program → ? (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
– Code Gen: C#, JavaScript, C

EXTENDED SYMBOLIC FINITE TRANSDUCERS
A transition can read several input symbols at once. Here a transition from state p to state q reads 3 symbols x1, x2, x3:
  x1≤FF ∧ x2≤FF ∧ x3≤FF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
Input: Man…
Output: TWFu…

MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS
Example transition: x1>x2 / [x1+x2]
Do they still have nice properties?

WHAT DO WE NEED?
T → Encoder → T’ → Decoder → T
Encoder o Decoder = Identity
Needed operations: Composition, Equivalence

NEGATIVE RESULTS
ESFAs:
– equivalence is undecidable
– are not closed under intersection
– are not closed under complement
ESFTs:
– equivalence is undecidable
– are not closed under composition

A FRIENDLIER RESTRICTION

CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS
The negative results use binary predicates, and encoders do not use this feature
Only allow conjunctions of unary predicates:
– not allowed: x1<x2+1
– allowed: x1>5 ∧ x2=1 / [x1+x2, x1]

CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs
– ESFA: 0 → 1 on x1>5 ∧ x2=1
– SFA: 0 → (0,1) on x>5, then (0,1) → 1 on x=1

STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
ESFT: 0 → 1 on x1>5 ∧ x2=1 / [x1+x2]
SFT: ?

WHAT DO WE NEED?
T → Encoder → T’ → Decoder → T
Encoder o Decoder = Identity
Needed operations: Composition, Equivalence

RESULTS
Cartesian ESFTs:
– equivalence is decidable
– are not closed under composition

COMPOSITION IN PRACTICE

BEK WITH REGISTERS?

TRANSDUCERS WITH REGISTERS
Transitions of a three-state cycle starting at state 0, with one register r:
– x / [ x>>2 ], r := (x&3)<<4
– x / [ r|(x>>4) ], r := (x&0xF)<<2
– x / [ r|(x>>6), x&0x3F ], r := 0
Transducers with registers are closed under composition
Equivalent to Turing Machines ☹

COMPOSING CARTESIAN ESFTS
1. Cartesian ESFTs A and B → transducers with registers A’ and B’
2. Compose the transducers with registers: A’ o B’
3. A’ o B’ → Cartesian ESFT A o B ?

REGISTER ELIMINATION
Transducer with register r (states 0, 1, 2):
– x / [ x+4 ], r := (x-2)
– x / [ r+x, x+1 ], r := 0
ESFT after eliminating the register:
– [x1, x2] / [ x1+4, x1-2+x2, x2+1 ], r := 0
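The effect of register elimination can be checked on this example by running both machines on the same input. This is a sketch under assumptions: the two register transitions are taken to alternate (the first stores x-2, the second consumes it), inputs are lists of integers of even length, and the function names are made up for the illustration.

```python
def with_register(xs):
    """Two alternating transitions that communicate through the register r."""
    out, r, state = [], 0, 0
    for x in xs:
        if state == 0:
            out.append(x + 4)             # x / [x+4]
            r = x - 2                     # r := (x-2)
            state = 1
        else:
            out.extend([r + x, x + 1])    # x / [r+x, x+1]
            r = 0                         # r := 0
            state = 0
    if state != 0:
        raise ValueError("odd-length input is undefined in this sketch")
    return out


def register_free(xs):
    """Single ESFT rule with lookahead 2: [x1, x2] / [x1+4, x1-2+x2, x2+1]."""
    if len(xs) % 2 != 0:
        raise ValueError("odd-length input is undefined in this sketch")
    out = []
    for x1, x2 in zip(xs[0::2], xs[1::2]):
        out.extend([x1 + 4, x1 - 2 + x2, x2 + 1])
    return out


print(with_register([10, 3, 7, 0]))   # [14, 11, 4, 11, 5, 1]
print(register_free([10, 3, 7, 0]))   # [14, 11, 4, 11, 5, 1]
```

The register value x-2 is substituted directly into the second output, so the merged rule needs lookahead 2 but no memory.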

DOES IT WORK?

UNICODE
UTF8 to UTF16 encoder (E) and decoder (D)

Test                 Running Time
Dom(E) = UTF16       47 ms
Dom(EoD) = UTF16     109 ms
Dom(D) = UTF8        156 ms
Dom(DoE) = UTF8      320 ms
EoD = Identity       16 ms
DoE = Identity       24 ms

Complete analysis in about a second

BASE64
Base64 encoder (E) and decoder (D)

Test                 Running Time
Dom(E) = bytes       13 ms
Dom(EoD) = bytes     55 ms
Dom(D) = 6bits+      76 ms
Dom(DoE) = 6bits+    56 ms
EoD = Identity       53 ms
DoE = Identity       19 ms

BEX ARCHITECTURE
– Transformation: Bex program → Cartesian Extended Symbolic Finite Transducers (Microsoft.Automata, Z3)
– Analysis: does it do the right thing? EoD=I
– Code Gen: C#, JavaScript, C

BEX IN A NUTSHELL
Conclusion
– BEX is a domain-specific language for writing string encoders
– BEX can model programs without approximation using Cartesian extended symbolic finite transducers
– BEX was evaluated using real-world string encoders

OUTLINE
– Automata, transducers, and programs
– BEK and string sanitizers
– BEX and string encoders
– FAST and tree manipulating programs
– What’s next?

FAST ANALYSIS OF PROGRAMS MANIPULATING TREES
Loris D’Antoni, Margus Veanes, Ben Livshits, David Molnar [PLDI14]


SOLUTION: USE AN HTML SANITIZER
Remove malicious active code from HTML documents
Input: alert(“This is Sparta!”); I swear this HTML is safe!
SANITIZE
Output: I swear this HTML is safe!

TYPICAL TRANSFORMATIONS
Typical transformations
– Remove scripts
– Remove malicious URLs
– Replace deprecated tags
Interesting questions, given a sanitizer S:
– Does S always produce a safe and well-formed output?
– Is S defined on every possible HTML file?
– Does executing S twice produce the same output as executing S once?
– Can we execute S fast?

HOW DO WE WRITE ONE?
DEMO


KEY IDEA: HTML CODE IS A TREE
Input:
  body
    script
      malicious code
    div
      p
        I swear this HTML is safe!
SANITIZE
Output:
  body
    div
      p
        I swear this HTML is safe!
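
Viewed this way, sanitization is a recursive function on trees. The Python sketch below illustrates the idea only; the `(label, children)` encoding and the name `sanitize` are assumptions, and a real sanitizer would do far more than drop script nodes.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # assumed encoding: (label, children)

def sanitize(tree: Tree) -> Tree:
    """Copy the tree, dropping every subtree rooted at a script node."""
    label, children = tree
    kept = [sanitize(child) for child in children if child[0] != "script"]
    return (label, kept)

page: Tree = ("body", [
    ("script", [("malicious code", [])]),
    ("div", [("p", [("I swear this HTML is safe!", [])])]),
])
clean = sanitize(page)
```

Here `clean` is the body/div/p tree from the slide, and applying `sanitize` a second time leaves it unchanged.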

MOTIVATION
Trees are common input/output data structures
– XML query, type-checking, etc…
– Compilers/optimizers (from parse tree to parse tree)
– Tree manipulating programs: data structure algorithms, ontologies, etc…

FAST ARCHITECTURE
– Fast Program → Transformation → ?
– ? → Analysis (Microsoft.Automata, Z3) → Does it do the right thing? (Counterexample “\' vs. \\'”)
– Fast Program → Code Gen → C#, JavaScript, C
Sample program:
  s := iter(c in t)[b := false;] {
    case (!b && c in "[\"\\]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

CHOOSING THE RIGHT FORMALISM

SEMANTICS AS TRANSDUCERS
Goal: find a class of tree transducers that can express the previous examples and is closed under composition

TOP DOWN TREE TRANSDUCERS [ENGELFRIET75]
Rule: q(a(x1, x2)) → b(c, q1(x1))
Decidable properties: type-checking, etc…
Domain expressiveness: only finite alphabets
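
As a rough illustration, the rule can be read as a recursive function. In the Python sketch below the `(symbol, children)` encoding is an assumption, and so is the behaviour of q1: the slide does not define it, so it is taken here to copy its input.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # assumed encoding: (symbol, children)

def q1(tree: Tree) -> Tree:
    # Hypothetical: the slide does not define q1, so it copies its input.
    symbol, children = tree
    return (symbol, [q1(child) for child in children])

def q(tree: Tree) -> Tree:
    symbol, children = tree
    if symbol == "a" and len(children) == 2:
        x1, _x2 = children                    # x2 is dropped by the rule
        return ("b", [("c", []), q1(x1)])     # q(a(x1,x2)) -> b(c, q1(x1))
    raise ValueError(f"no rule for state q on symbol {symbol!r}")

result = q(("a", [("d", []), ("e", [])]))
```

The match on `symbol == "a"` is an exact comparison, which is why this model only handles finite alphabets.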

SYMBOLIC TREE TRANSDUCERS [PSI11]
Rule: q(λa.a>3, (x1, x2)) → λa.a+1, (λa.a-2, q1(x1))
A node whose label a satisfies a>3, with children x1 and x2, is rewritten to a node labeled a+1 whose children are a node labeled a-2 and q1(x1)
Example: the rule applies to a node labeled 5, since 5>3 is true
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets using predicates and functions
Structural expressiveness: can’t delete a node without reading it first
Alphabet theory has to be DECIDABLE: we’ll use Z3 to check predicate satisfiability
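
The difference from the finite-alphabet case is that the label is tested by a predicate and rewritten by functions. A minimal Python sketch, assuming integer labels, a `(label, children)` encoding, a hypothetical helper `make_rule`, and a q1 that copies its input (the slide does not define q1):

```python
from typing import Callable, List, Tuple

Tree = Tuple[int, List["Tree"]]  # assumed encoding: (integer label, children)

def q1(tree: Tree) -> Tree:
    # Hypothetical: q1 is not defined on the slide, so it copies its input.
    a, children = tree
    return (a, [q1(child) for child in children])

def make_rule(guard: Callable[[int], bool],
              root_label: Callable[[int], int],
              child_label: Callable[[int], int],
              child_state: Callable[[Tree], Tree]) -> Callable[[Tree], Tree]:
    """Build a state with one symbolic rule over binary nodes."""
    def state(tree: Tree) -> Tree:
        a, children = tree
        if len(children) == 2 and guard(a):
            x1, _x2 = children
            return (root_label(a), [(child_label(a), []), child_state(x1)])
        raise ValueError(f"no rule applies to label {a}")
    return state

# Guard a>3, output labels a+1 and a-2, as in the rule above.
q = make_rule(lambda a: a > 3, lambda a: a + 1, lambda a: a - 2, q1)
result = q((5, [(0, []), (9, [])]))
```

In this sketch the guard is just a Python lambda evaluated on concrete labels; an analysis tool would instead reason about the predicates themselves, which is where a solver such as Z3 comes in.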

IMPROVING STRUCTURAL EXPRESSIVENESS
Transformation: delete the left child if it contains a script
If we delete the node we can’t check that the left child contained a script
Regular Look-Ahead (RLA)?

REGULAR LOOK AHEAD
Transformation: delete the left child if it contains a script
Rules can ask whether the children are in particular languages
– p1: the language of trees that contain a script node
– p2: the language of all trees
The transformation is now safe
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets
Structural expressiveness: good enough to express our examples
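
A small Python sketch of the idea: before choosing a rule, the transducer asks whether the left child belongs to p1. Everything here is an illustrative assumption, including the `(label, children)` encoding, the choice to keep the div node while dropping its left child, and the use of a recursive predicate in place of a tree automaton for the look-ahead.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # assumed encoding: (label, children)

def in_p1(tree: Tree) -> bool:
    """Look-ahead language p1: trees that contain a script node."""
    label, children = tree
    return label == "script" or any(in_p1(child) for child in children)

def in_p2(tree: Tree) -> bool:
    """Look-ahead language p2: all trees."""
    return True

def q(tree: Tree) -> Tree:
    label, children = tree
    if label == "div" and len(children) == 2:
        left, right = children
        if in_p1(left) and in_p2(right):
            # The look-ahead licenses deleting the left child unread.
            return ("div", [q(right)])
    return (label, [q(child) for child in children])

dirty = ("div", [("p", [("script", [])]), ("p", [])])
safe = ("div", [("p", []), ("p", [])])
```

Here `q(dirty)` drops the left subtree, while `q(safe)` returns the tree unchanged.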

Model                                                              Decidability  Complexity  Structural Expressiveness  Infinite alphabets
Top Down Tree Transducers [Engelfriet75]                           V             V           X                          X
Top Down Tree Transducers with Regular Look-ahead [Engelfriet76]   V             V           ~                          X
Streaming Tree Transducers [AlurDantoni12]                         V             X           V                          X
Data Automata [Bojanczyk98]                                        ~             X           X                          V
Symbolic Tree Transducers [VeanesBjoerner11]                       V             V           X                          V
Symbolic Tree Transducers RLA                                      V             V           ~                          V
(V = good, X = poor, ~ = partial)

COMPOSITION OF STTR (SYMBOLIC TREE TRANSDUCERS WITH RLA)
Given T1 and T2, build a single transducer T1 o T2
This is not always possible!! Find the biggest class for which it is possible

WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if
– T1 is deterministic
All our examples fall in this category
Alphabet theory has to be DECIDABLE: we’ll use Z3 to check predicate satisfiability

FAST ARCHITECTURE
– Fast Program → Transformation → Symbolic Tree Transducers with RLA
– Symbolic Tree Transducers with RLA → Analysis (Microsoft.Automata, Z3) → Does it do the right thing? (Counterexample “\' vs. \\'”)
– Fast Program → Code Gen → C#, JavaScript, C
Sample program:
  s := iter(c in t)[b := false;] {
    case (!b && c in "[\"\\]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

CASE STUDIES AND EXPERIMENTS

CASE STUDIES AND EXPERIMENTS
Program Optimization:
– Deforestation of functional programs
Verification:
– HTML sanitization
– Analysis of functional programs
– Augmented reality app store
Infinite Alphabets: Integer Data types

DEFORESTATION
Removing intermediate data structures from programs
  alphabet IList [i : int] { nil(0), cons(1) }
  trans mapC: IList -> IList {
      nil() to nil [0]
    | cons(x) to cons [(i+5)%26] (mapC x)
  }
  def mapC2: IList -> IList := compose mapC mapC
ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition
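
The effect of composing mapC with itself can be imitated in Python with plain lists. This is only an analogy for what the composed transducer does; the helper names are assumptions, and Python lists stand in for the IList alphabet (the label on nil is ignored).

```python
from typing import Callable, List

def shift(i: int) -> int:
    """The label function of mapC: (i+5)%26."""
    return (i + 5) % 26

def map_with(f: Callable[[int], int]) -> Callable[[List[int]], List[int]]:
    return lambda xs: [f(i) for i in xs]

def compose(f: Callable[[int], int], g: Callable[[int], int]) -> Callable[[int], int]:
    return lambda i: g(f(i))

def two_pass(xs: List[int]) -> List[int]:
    # mapC applied twice: builds an intermediate list.
    return map_with(shift)(map_with(shift)(xs))

# Analogue of mapC2: the label functions are fused, so the list is read once.
one_pass = map_with(compose(shift, shift))
```

Both versions compute the same list; the fused one never materializes the intermediate result.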

STAGES BY EXAMPLE
Transducers: mapC, mapC2

DEFORESTATION: SPEEDUP
f(f(f(…f(x)…))) vs. (f;f;f;…;f)(x)

ANALYSIS OF FUNCTIONAL PROGRAMS

AR INTERFERENCE ANALYSIS
Recognizers output data that can be seen as a tree structure
Example node labels: Spine, Hip, Neck, Head, Knee, Ankle, Foot, ….

APPS AS TREE TRANSFORMATIONS
Applications that use recognizers can be modeled as FAST programs
  trans addHat: STree -> STree
      Spine(x,y) to Spine(addHat(x), y)
    | Neck(h,l,r) to Neck(addHat(h), l, r)
    | Head(a) to Head(Hat(a))
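
For readers more used to general-purpose code, the addHat rules can be transcribed into Python roughly as follows. The `(label, children)` encoding and the extra labels in the sample skeleton (Face, LeftArm, RightArm) are assumptions for the illustration.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # assumed encoding: (label, children)

def add_hat(tree: Tree) -> Tree:
    label, children = tree
    if label == "Spine":
        x, y = children
        return ("Spine", [add_hat(x), y])      # only the first child is visited
    if label == "Neck":
        h, l, r = children
        return ("Neck", [add_hat(h), l, r])
    if label == "Head":
        (a,) = children
        return ("Head", [("Hat", [a])])        # wrap the head's content in Hat
    raise ValueError(f"no addHat rule for {label!r}")

skeleton: Tree = ("Spine", [
    ("Neck", [("Head", [("Face", [])]), ("LeftArm", []), ("RightArm", [])]),
    ("Hip", []),
])
with_hat = add_hat(skeleton)
```

As in the FAST program, nodes with no matching rule make the transformation undefined, which the sketch signals with an exception.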

COMPOSITION OF PROGRAMS
Two FAST programs p1 and p2 can be composed into a single FAST program p1;p2

ANOTHER RECOGNIZER
Example node labels: Room, Floor, Wall, Table, Chair, ….

INTERFERENCE ANALYSIS
Apps can be malicious: try to overwrite outputs of other apps
Apps interfere when they annotate the same node of a recognizer’s output
We can compose them and check if they interfere statically!!
– Put checker in the AppStore and analyze Apps before approval
Interfering apps
– Add cat ears / Add hat
– Add pin to a city / Blur a city
– Amazon Buy Now button / Malicious Buy Now button
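
To make the definition of interference concrete, here is a Python sketch that checks it on one given tree. Note the swap: this is a dynamic check on a single input, not the static check over all inputs described on the slide. The encoding of an app as a predicate on node labels and all names are assumptions.

```python
from typing import Callable, List, Set, Tuple

Tree = Tuple[str, List["Tree"]]   # assumed encoding: (label, children)
Path = Tuple[int, ...]            # position of a node as child indices
App = Callable[[str], bool]       # assumed: an app annotates nodes by label

def annotated_paths(tree: Tree, app: App, path: Path = ()) -> Set[Path]:
    """Positions of all nodes that the app would annotate."""
    label, children = tree
    hits = {path} if app(label) else set()
    for i, child in enumerate(children):
        hits |= annotated_paths(child, app, path + (i,))
    return hits

def interfere_on(tree: Tree, app1: App, app2: App) -> bool:
    """True when both apps annotate at least one common node of this tree."""
    return bool(annotated_paths(tree, app1) & annotated_paths(tree, app2))

def add_cat_ears(label: str) -> bool:
    return label == "Head"

def add_hat(label: str) -> bool:
    return label == "Head"

def add_shoes(label: str) -> bool:
    return label == "Foot"

body: Tree = ("Spine", [("Neck", [("Head", [])]), ("Hip", [("Foot", [])])])
```

On this tree, `add_cat_ears` and `add_hat` both annotate the Head node and so interfere, while `add_hat` and `add_shoes` do not.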

INTERFERENCE ANALYSIS IN PRACTICE
– 100 generated FAST programs, up to 85 functions each
– Check statically if they conflict pairwise for ANY possible input
– Checked 99% of program pairs in less than 0.5 sec!
– For an App store these running times are perfectly fine

TWO PENDING PATENTS

FAST IN A NUTSHELL
Conclusion
– FAST is a domain-specific language for writing tree manipulating programs
– FAST can model programs without approximation using Symbolic tree transducers with regular lookahead
– FAST was evaluated using real-world programs

OUTLINE
– Automata, transducers, and programs
– BEK and string sanitizers
– BEX and string encoders
– FAST and tree manipulating programs
– What’s next?

WHAT’S NEXT

FOR EACH DOMAIN SPECIFIC TASK
Design a language that
– only has the features required by the task
– is simple to use
– enables automatic reasoning about what the programs do
– compiles into efficient code

DREX
EFFICIENT STRING MANIPULATION
Loris D’Antoni, Mukund Raghothaman, Rajeev Alur
Here at POPL15!

DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B)
Diagram labels: a/a, b/b, (a|b)*b
iterate(choice(a->a, b->b))
Execute this code in a linear-time left-to-right pass on the input string!!

BEX 2.0
PARALLEL EXECUTION OF STRING ENCODERS
Margus Veanes, David Molnar, Ben Livshits, Todd Mytkowicz
Here at POPL 15!!

FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B)
– x / [ x+4 ], r := (x-2)
– x / [ r+x, x+1 ], r := 0
Efficient data-parallel code

PROGRAM BOOSTING OR CROWD-SOURCING FOR CORRECTNESS
Loris D’Antoni, David Molnar, Benjamin Livshits, Margus Veanes, Robert Cochran
Here at POPL 15!!

CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B)
Specification

YOU CAN HELP TOO!

INTERESTING DIRECTIONS
A transducer-based language for
– Web scrapers
– Spreadsheet transformations
– Compiler optimizations
– XML processing
– HTML rendering

SUMMARIZING…

OUR RECIPE FOR EACH TASK
– DSL → Transformation → Transducer Model
– Transducer Model → Analysis (Microsoft.Automata, Z3) → Does it do the right thing? (Analysis question)
– DSL → Code Gen → C#, JavaScript, C
Sample program:
  s := iter(c in t)[b := false;] {
    case (!b && c in "[\"\\]"): b := false; yield('\\', c);
    case (c == '\\'): b := !b; yield(c);
    case (true): b := false; yield(c);
  };

BEK
– Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes. USENIX11
– Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner. POPL12
BEX
– Static analysis of string encoders and decoders. D’Antoni, Veanes. VMCAI13
– Equivalence of extended symbolic finite transducers. D’Antoni, Veanes. CAV13
– Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits. POPL15
FAST
– Fast: a transducer based language for tree manipulation. D’Antoni, Veanes, Livshits, Molnar. PLDI14