Static Analysis of String Encoders and Decoders Presented By: Loris D’Antoni Joint work with: Margus Veanes
Motivation
String encoders and decoders are:
– Ubiquitous: they transform Unicode text files on the Internet into in-memory representations of text
– Hard to write: they use unintuitive logic in order to be efficient
– Hard to verify: the state space is big and alphabets are very large (2^16 elements). Previous techniques blow up even for small decoders.
A simple example: BASE64 encoder
– 3 bytes become 4 Base64 characters
– The decoder is similar (every 4 characters encode 3 bytes)
– Uses bit manipulations to be efficient
– How do we model it and prove it correct?
Table (rows: Text content, Bytes, Bit Pattern, Index, Base64 Encoded): the text content "Man" is Base64-encoded as "TWFu".
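The rows of the table for "Man" can be recomputed with a few lines of Python. This is only an illustrative sketch: the helper name `table_rows` is made up for this example, and the alphabet is assumed to be the standard Base64 alphabet of RFC 4648.

```python
# Standard Base64 alphabet (RFC 4648); index i maps to ALPHABET[i].
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def table_rows(text: str) -> dict:
    """Recompute the table rows for a text whose length is a multiple of 3.

    Hypothetical helper for illustration; padding ('=') is not handled.
    """
    data = text.encode("ascii")
    if len(data) % 3 != 0:
        raise ValueError("this sketch only handles complete 3-byte groups")
    # Concatenate the 8-bit patterns, then cut the bit string into 6-bit groups.
    bits = "".join(f"{b:08b}" for b in data)
    indices = [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]
    return {
        "Text content": text,
        "Bytes": list(data),
        "Bit Pattern": bits,
        "Index": indices,
        "Base64 Encoded": "".join(ALPHABET[i] for i in indices),
    }


rows = table_rows("Man")
for name, value in rows.items():
    print(f"{name}: {value}")
```

Slicing a bit string is the slow but obvious way to see the 3-to-4 regrouping; the encoder in the talk gets the same indices with shifts and masks.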
What properties do we check?
Encoder and decoder are denoted by E and D.
– E o D = I
– D o E = I
– dom(E) = bytes
– dom(D) = Base64 bytes
We need:
– Equivalence checking
– Function composition (our model should be closed under composition)
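To make the properties above concrete, here is a small sketch that tests them on random samples using Python's standard `base64` module as a stand-in for E and D. It is testing, not verification: the analysis in this talk is symbolic and covers all inputs. The regular expression `BASE64_DOMAIN` and the helper `roundtrip_holds` are assumptions made for this example.

```python
import base64
import random
import re

# Assumed shape of dom(D): groups of 4 Base64 characters, with optional padding.
BASE64_DOMAIN = re.compile(
    rb"(?:[A-Za-z0-9+/]{4})*"
    rb"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def roundtrip_holds(data: bytes) -> bool:
    """Check that E(data) lies in dom(D) and that decoding gives data back."""
    encoded = base64.b64encode(data)
    in_domain = BASE64_DOMAIN.fullmatch(encoded) is not None
    return in_domain and base64.b64decode(encoded, validate=True) == data


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [
    bytes(rng.randrange(256) for _ in range(rng.randrange(20)))
    for _ in range(1000)
]
all_ok = all(roundtrip_holds(s) for s in samples)
print("round trip holds on all samples:", all_ok)
```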
Bek code

program base64encode(input){
  return iter(x in input)[q:=0;r:=0;]{
    case (x>0xFF): raise InvalidCharacter;
    case (q==0): yield (base64(x>>2)); q:=1; r:=(x&3)<<4;
    case (q==1): yield (base64(r|(x>>4))); q:=2; r:=(x&0xF)<<2;
    case (q==2): yield (base64((r|(x>>6))), base64(x&0x3F)); q:=0; r:=0;
    end case (q==1): yield (base64(r),'=','=');
    end case (q==2): yield (base64(r),'=');
  };
}

How do we analyze this code?
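For readers who do not know Bek, the program above can be transliterated almost line by line into Python. This is a sketch for illustration, not output of the Bek tool: `ALPHABET` stands in for Bek's `base64(...)` lookup and is assumed to be the standard RFC 4648 alphabet, and the comparison against Python's `base64` module is an added sanity check.

```python
import base64

# Assumed to match Bek's base64(...) lookup: the standard RFC 4648 alphabet.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


class InvalidCharacter(Exception):
    """Raised for input symbols that do not fit in one byte."""


def base64encode(symbols) -> str:
    """Transliteration of the Bek program: q is the state, r the register."""
    out = []
    q, r = 0, 0
    for x in symbols:
        if x > 0xFF:
            raise InvalidCharacter(x)
        if q == 0:
            out.append(ALPHABET[x >> 2])
            q, r = 1, (x & 3) << 4
        elif q == 1:
            out.append(ALPHABET[r | (x >> 4)])
            q, r = 2, (x & 0xF) << 2
        else:  # q == 2
            out.append(ALPHABET[r | (x >> 6)])
            out.append(ALPHABET[x & 0x3F])
            q, r = 0, 0
    # The two "end case" rules: flush the register and pad.
    if q == 1:
        out += [ALPHABET[r], "=", "="]
    elif q == 2:
        out += [ALPHABET[r], "="]
    return "".join(out)


for sample in (b"", b"M", b"Ma", b"Man", b"Many hands"):
    assert base64encode(sample) == base64.b64encode(sample).decode("ascii")
print(base64encode(b"Man"))
```

The three-way case split on `q` together with the register `r` is what the later slides model as a symbolic transducer with registers.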
Trust me! It is tricky!
[12/12/12 11:35:49 PM] Margus Veanes: I think it is doable, smth that is like ([A-Z2-7]{4}... )*
[12/12/12 11:35:57 PM] Loris D'Antoni: ok ill try
[12/12/12 11:36:22 PM] Margus Veanes: then you can ry to see the difference compared to the domain of the decoder
[12/12/12 11:37:42 PM] Loris D'Antoni: it seems that also on this counterex it doesn't work
[12/12/12 11:37:43 PM] Loris D'Antoni: DP2A====
[12/12/12 11:37:50 PM] Loris D'Antoni: which maybe it's a bad one in this sense
[12/12/12 11:37:52 PM] Loris D'Antoni: ill check now
[12/12/12 11:40:45 PM] Margus Veanes: actually the domain of the decoder looks wrong, it allows 8 and 9
[12/12/12 11:40:46 PM] Margus Veanes:
[12/12/12 11:40:58 PM] Loris D'Antoni: yeh i fixed that in my version
…COUPLE OF HACKS LATER…
[12/13/12 12:24:02 AM] Loris D'Antoni: ok, found bug and fixed it, now proved them correct. Will work on others tomorrow. Was very silly but hard to spot
[12/13/12 12:24:35 AM] Margus Veanes:... this is why the analysis we can do is useful :-)
[12/13/12 12:24:45 AM] Loris D'Antoni: yeh i was mapping
[12/13/12 12:24:46 AM] Loris D'Antoni: ==> 2..7
[12/13/12 12:24:58 AM] Loris D'Antoni: instead of ==> '2'..'7'
Brief DEMO
Attempt 1: Finite Transducers
Diagram: states M and Ma, transitions M / [], a / [], n / [TWFu], …
– Finite set of states
– Each transition reads an input symbol and outputs a sequence of symbols
– Defines a mapping from strings to strings
– Blue states are final: the mapping is defined for inputs that end there
– 2^8 edges out of every state and 2^16 states
– Decidable equivalence and closure under composition
Attempt 2: Symbolic Finite Transducers [POPL12]
Diagram: states M and Ma, transitions λx. x=='M' / [λx. x>>2], λx. x=='a' / [λx. x>>4, …], …
– Guards are predicates over any decidable theory instead of single characters
– Output is a function of the input
– In this case we use the theory of bit-vectors
– Supports symbolic updates, for example over bit-vectors
– Better reflects the operations of the implementation
– Analysis is still decidable (equivalence, composition)
– We did not improve much: there is still state explosion
Attempt 3: Symbolic Transducers [POPL12]
Diagram (states 0, 1, 2):
  0 -> 1: True / [x>>2], r := (x&3)<<4
  1 -> 2: True / [r|(x>>4)], r := (x&0xF)<<2
  2 -> 0: True / [r|(x>>6), x&0x3F], r := 0
– Registers can store values and are updated in transitions
– Inputs and outputs can inspect and use the register value
– The logic is the same as in the implementation!
– No state explosion!
– Closed under sequential composition
– Analysis (equivalence) is undecidable in general…
– We need a way to eliminate the registers
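A symbolic transducer like the one on this slide can be written down as plain data and simulated. The sketch below is an assumption-laden illustration: the `Transition` class and `run` function are invented for this example, outputs are Base64 indices rather than characters, and the end-of-input rules of the Bek program are left out because the diagram does not show them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Transition:
    """One ST transition: guard / outputs, register update, target state."""
    guard: Callable[[int, int], bool]          # (input x, register r) -> bool
    output: Callable[[int, int], List[int]]    # (x, r) -> output symbols
    update: Callable[[int, int], int]          # (x, r) -> new register value
    target: int


# The three transitions of the slide, keyed by source state.
ST: Dict[int, Transition] = {
    0: Transition(lambda x, r: True, lambda x, r: [x >> 2],
                  lambda x, r: (x & 3) << 4, 1),
    1: Transition(lambda x, r: True, lambda x, r: [r | (x >> 4)],
                  lambda x, r: (x & 0xF) << 2, 2),
    2: Transition(lambda x, r: True, lambda x, r: [r | (x >> 6), x & 0x3F],
                  lambda x, r: 0, 0),
}


def run(symbols) -> Tuple[List[int], int, int]:
    """Simulate the ST; returns (outputs, final state, final register)."""
    state, r, out = 0, 0, []
    for x in symbols:
        t = ST[state]
        if not t.guard(x, r):
            raise ValueError(f"guard failed in state {state} on input {x}")
        out.extend(t.output(x, r))  # outputs see the old register value
        r = t.update(x, r)
        state = t.target
    return out, state, r


print(run(b"Man"))
```

Three states and one register are enough here, which is the "no state explosion" point of the slide; the price is that the register makes equivalence checking undecidable in general.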
Register Elimination: the naïve way
ST (states 0, 1, 2):
  x / [x>>2], r := (x&3)<<4
  x / [r|(x>>4)], r := (x&0xF)<<2
  x / [r|(x>>6), x&0x3F], r := 0
SFT (states M, Ma, …):
  M / [M>>2]
  a / [((M&3)<<4)|(a>>4)]
  n / [((a&0xF)<<2)|(n>>6), n&0x3F]
  …
– Via enumeration: state explosion, but automatic
– Analysis is possible, but very slow…
– Does not work if the alphabet is infinite: a waste of symbolic analysis
– We need a better model
A simple example: BASE64
– 3 bytes become 4 Base64 characters
– The decoder is similar (every 4 characters encode 3 bytes)
– Uses bit manipulations to be efficient
– How do we model it and prove it correct?
Table (rows: Text content, Byte, Bit Pattern, Index, Base64 Encoded): the text content "Man" is Base64-encoded as "TWFu".
Extended Symbolic Finite Transducers
Diagram (single state 0):
  [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
– Transitions read sequences of symbols
– Output is a function of all 3 symbols
– No state explosion
– Analysis can be done for several interesting cases (in particular for encoders)
– But how do we pass from STs to ESFTs?
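The lookahead idea can be sketched as a rule table in which each rule consumes several symbols at once. This is a simplified illustration under stated assumptions: `RULES` and `run_esft` are invented names, the runner is deterministic and takes the first rule that applies, and only the single 3-symbol rule from the slide is included (no rules for leftover input).

```python
from typing import Callable, List, Tuple

# A rule is (lookahead, guard, output); guard and output take `lookahead` symbols.
Rule = Tuple[int, Callable[..., bool], Callable[..., List[int]]]

RULES: List[Rule] = [
    (
        3,
        lambda x1, x2, x3: True,
        lambda x1, x2, x3: [
            x1 >> 2,
            ((x1 & 3) << 4) | (x2 >> 4),
            ((x2 & 0xF) << 2) | (x3 >> 6),
            x3 & 0x3F,
        ],
    ),
]


def run_esft(symbols: bytes, rules: List[Rule] = RULES) -> List[int]:
    """Apply lookahead rules left to right; a single state, no registers."""
    out: List[int] = []
    pos = 0
    while pos < len(symbols):
        for lookahead, guard, output in rules:
            window = symbols[pos:pos + lookahead]
            if len(window) == lookahead and guard(*window):
                out.extend(output(*window))
                pos += lookahead
                break
        else:
            # No rule matched: the input is outside the domain of this sketch.
            raise ValueError(f"no rule applies at position {pos}")
    return out


print(run_esft(b"Man"))
```

Because every output is a function of the symbols in the current window only, there is no register to reason about, which is what makes analysis feasible in the cases the talk cares about.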
Register Elimination: the good way 1/2
ST (states 0, 1, 2):
  0 -> 1: x / [x>>2], r := (x&3)<<4
  1 -> 2: x / [r|(x>>4)], r := (x&0xF)<<2
  2 -> 0: x / [r|(x>>6), x&0x3F], r := 0
Toward the ESFT, intermediate step (states 0, 1):
  0 -> 1: x / [x>>2], r := (x&3)<<4
  1 -> 0: [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r := 0
Register Elimination: the good way 2/2
Intermediate step (states 0, 1):
  0 -> 1: x / [x>>2], r := (x&3)<<4
  1 -> 0: [x1,x2] / [r|(x1>>4), ((x1&0xF)<<2)|(x2>>6), x2&0x3F], r := 0
Resulting ESFT (single state 0):
  [x1,x2,x3] / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
– Fast and supports infinite alphabets
– Not always possible, but works for encoders/decoders
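The merging shown on these two slides amounts to substituting each register update into the next transition's outputs. Below is a toy sketch of that substitution on strings. It is not the algorithm from the talk: it only handles a straight-line cycle of transitions with a single register named `r` and an input named `x`, and the names `ST_CYCLE`, `substitute` and `eliminate_register` are invented for this example.

```python
import re

# (output terms, register update) along the cycle 0 -> 1 -> 2 -> 0 of the ST.
ST_CYCLE = [
    (["x>>2"], "(x&3)<<4"),
    (["r|(x>>4)"], "(x&0xF)<<2"),
    (["r|(x>>6)", "x&0x3F"], "0"),
]


def substitute(term: str, index: int, r_expr: str) -> str:
    """Rename the input x to x<index> and replace r by its defining expression."""
    # \b keeps the 'x' inside hex literals such as 0xF untouched.
    term = re.sub(r"\bx\b", f"x{index}", term)
    return re.sub(r"\br\b", f"({r_expr})", term)


def eliminate_register(cycle, initial_r: str = "0"):
    """Fuse the transitions of a cycle into one multi-symbol rule."""
    outputs, r_expr = [], initial_r
    for index, (terms, update) in enumerate(cycle, start=1):
        outputs.extend(substitute(t, index, r_expr) for t in terms)
        r_expr = substitute(update, index, r_expr)
    return outputs, r_expr


ESFT_OUTPUTS, FINAL_R = eliminate_register(ST_CYCLE)
print("[x1,x2,x3] /", ESFT_OUTPUTS)

# Sanity check on the bytes of "Man" by evaluating the generated terms.
env = {"x1": 77, "x2": 97, "x3": 110}
print([eval(term, {}, env) for term in ESFT_OUTPUTS])
```

The substitution never enumerates input values, which is why this style of elimination is independent of the alphabet size. It works here because the register is reset to a constant at the end of the cycle; when a register carries information across unboundedly many symbols, no finite lookahead rule exists.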
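Composition of ESFTs
Diagram: ESFT E, ESFT D -> ST E', ST D' -> ST E'oD' -> ESFT EoD
– From ESFTs to STs: registers are used to remember values
– From ST E' and ST D' to ST E'oD': uses ST closure under composition
– From ST E'oD' to ESFT EoD: register elimination
– ESFTs are not closed under composition in general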
Equivalence Semi-Decision Procedure
– First we check equivalence on the intersection of the domains (hard)
– Then we check domain equivalence (easier in this case)
– We build a product transducer
Diagram: transitions (λ(x1,x2).True) / [x1,x2] and λ(x).True / [x] (states 0 and 1); product transition λ(x1,x2).True / ([x1,x2],[x1,x2]) (state 0,1)
Unicode Case Study
We analyzed a UTF8 to UTF16 encoder (E) and decoder (D).

Test                              Running Time
Dom(E) = UTF16                    47 ms
Dom(EoD) = UTF16                  109 ms
Dom(D) = UTF8                     156 ms
Dom(DoE) = UTF8                   320 ms
EoD = Identity (naive)            82,000 ms
DoE = Identity (naive)            134,000 ms
EoD = Identity (new algorithm)    123 ms
DoE = Identity (new algorithm)    215 ms

Complete analysis in less than a second.
Result Summary
– ESFTs: a new transducer model for representing encoders and decoders
– A new register elimination algorithm from STs to ESFTs, independent of the input alphabet
– Correctness analysis of real programs: Unicode and Base64 encoders and decoders
– Automatic JavaScript code generation for the verified code
Check it out
Transducers are cool!!
Future Work
– Understand the theory of ESFTs (coming soon): composition closure, equivalence…
– Extend the model to tree transformations, which are widely used in NLP
– Analyze more complex scenarios, such as list-manipulating programs
Thank you
Loris D’Antoni