cs3102: Theory of Computation Class 10: DFAs in Practice Spring 2010 University of Virginia David Evans
Menu
Today:
– Preparing for Exam 1
– Language class for Deterministic PDAs
– Applications of DFAs
Thursday:
– Exam Review (if you send questions and/or topics)
– Applications of probabilistic DFAs and Grammars
Exam 1
In class, next Tuesday, 2 March
Covers:
[Figure: diagram of the sets Classes 1-9 (10 and 11), Sipser Ch 0-2, Problem Sets, Comments, and Exam 1]
Note: unlike nearly all other sets we draw in this class, all of these sets are finite, and the size (roughly) represents the relative size.
What’s on the Exam?
Definitions
– Language, problem, sets
Constructing and understanding computing models
– Finite automata (DFA, NFA)
– Pushdown automata (DPDA, NPDA)
– Grammars (Context-Free Grammar)
Language Classes: Regular and Context-Free
– Show a language is in the class
– Show a language is not in the class
– Prove or disprove a closure property
Proof Methods
– Proof by Induction
– Proof by Construction
– Understand and use the pumping lemmas for RL and CFL
Sample exam on website should give you a good idea what to expect
Your exam will probably also have “what’s wrong with this proof” questions
Exam 1 Notesheet
For Exam 1, you may use only:
– Your own brain and body
– A low-tech writing instrument (pen or pencil)
– A single page (both sides) of notes that you create
You may work with others to create your notes page.
[Photos: Admiral Grace Hopper, John von Neumann, Albert Einstein]
Exam Help Available
Office Hours:
– Thursdays, 8:30-9:30am
– Thursdays, after class
– Fridays, 10-11:30am (Sonali in Stacks)
– Mondays, 1:15-3pm
TA’s Exam Review Session
– This Sunday, 5-6:30pm, Olsson 228E
[Figure: nested language classes: All Languages ⊃ Context-Free (CFG or NPDA) ⊃ Regular Languages (DFA, NFA, RE, RG) ⊃ Finite Languages, with the example languages $w$, $a^n$, $a^n b^n c^n$, and $ww$ marked]
Where are the languages recognized by a Deterministic PDA?
Proving Set Equivalence
$A = B \iff A \subseteq B \text{ and } B \subseteq A$
Sets A and B are equivalent if A is a subset of B and B is a subset of A.
Proving Formalism Equivalence
Proving Formalism Non-Equivalence
[Figure: nested language classes: All Languages ⊃ Context-Free (CFG or NPDA) ⊃ Regular Languages (DFA, NFA, RE, RG), with $a^n b^n$ marked and candidate placements for the DPDA languages]
Which of these could be true?
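[Figure: two candidate diagrams. In one, the DPDA languages lie strictly between the Regular Languages (DFA, NFA, RE, RG) and the Context-Free Languages (NPDA). In the other, the DPDA languages are the same as the Context-Free Languages (NPDA).]
How can we distinguish these two plausible possibilities?
– To show the DPDA languages are a proper subset: find some language A that can be recognized by some NPDA but not by any DPDA.
– To show the two classes are equal: prove by construction that for any NPDA, there is a DPDA that recognizes the same language.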
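[Figure: state diagram of an NPDA for A. Transition labels, in the order they appear: ε, ε → $; a, ε → +; ε, ε → ε; b, + → ε; ε, $ → ε; ε, ε → ε; b, + → ε; b, ε → ε; ε, $ → ε]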
Proof by contradiction: Assume there is a DPDA that recognizes A. Show how to construct an NPDA that recognizes some language we know is not context-free.
Proved by construction: We showed an NPDA that recognizes A.
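The "proved by construction" half can be checked mechanically. The sketch below assumes A = { a^i b^i } ∪ { a^i b^(2i) } with i ≥ 0 (the two string forms used in the contradiction argument) and simulates an NPDA by exploring every reachable configuration. The state names (`start`, `push`, `one`, `two`, `two_mid`, `accept`) and the function `npda_accepts` are hypothetical; only the transition labels follow the slide's notation.

```python
from collections import deque

# Transition relation: (state, input symbol, popped symbol) -> [(next state, pushed symbol)].
# "" stands for epsilon (read nothing / pop nothing / push nothing).
# State names are assumptions made for this sketch.
DELTA = {
    ("start", "", ""): [("push", "$")],            # mark the bottom of the stack
    ("push", "a", ""): [("push", "+")],            # count each a
    ("push", "", ""): [("one", ""), ("two", "")],  # guess: one b per a, or two b's per a
    ("one", "b", "+"): [("one", "")],
    ("one", "", "$"): [("accept", "")],
    ("two", "b", "+"): [("two_mid", "")],          # first b of a pair pops the counter
    ("two_mid", "b", ""): [("two", "")],           # second b of the pair leaves the stack alone
    ("two", "", "$"): [("accept", "")],
}

def npda_accepts(w):
    """Breadth-first search over configurations (state, input position, stack)."""
    start = ("start", 0, ())
    seen = {start}
    queue = deque([start])
    while queue:
        state, pos, stack = queue.popleft()
        if state == "accept" and pos == len(w):
            return True
        for (src, sym, pop), moves in DELTA.items():
            if src != state:
                continue
            if sym and (pos >= len(w) or w[pos] != sym):
                continue
            if pop and (not stack or stack[-1] != pop):
                continue
            rest = stack[:-1] if pop else stack
            for nxt, push in moves:
                new_stack = rest + (push,) if push else rest
                config = (nxt, pos + len(sym), new_stack)
                if config not in seen:
                    seen.add(config)
                    queue.append(config)
    return False

print(npda_accepts("aabb"), npda_accepts("aabbbb"), npda_accepts("aabbb"))
```

The nondeterminism sits entirely in the `push` state: the machine has to guess which of the two forms it is reading before it has seen any b, which is the guess a DPDA cannot make.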
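Proof by contradiction. Suppose there is a DPDA M that recognizes A. It must be in an accept state only after processing $a^i b^i$ and $a^i b^{2i}$.
[Figure: M, with transitions labeled a, α → β and b, α → β. M takes 2i transitions, consuming $a^i b^i$, to reach an accept state, then i more transitions, consuming $b^i$, to reach an accept state again.]
Construct M’: copy all the states on the second half, replacing b with c.
[Figure: M’, with transitions labeled a, α → β and b, α → β in the original states and c, α → β in the copied states.]
What is the language of M’?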
M’ accepts the strings of the form $a^i b^i c^i$: Not a Context-Free Language!
We have a contradiction: if A is in L(DPDA), we could use the DPDA that recognizes A to construct a PDA that recognizes a non-context-free language! Hence, A must not be in L(DPDA).
[Figure: nested language classes: All Languages ⊃ Context-Free Languages (CFG or NPDA) ⊃ Deterministic Context-Free Languages (recognized by a DPDA, or DCFG) ⊃ Regular Languages (DFA, NFA, RE, RG). $a^n b^n$ is deterministic context-free but not regular; A is context-free but not deterministic context-free.]
DFAs in Practice
Malware Scanner
[Figure: scanner inputs: Files, Network Traffic]
W32.Bolzano.Gen: 576a222bd2c b4c240cd9ffff 07fbffffff{0-2}5c4e544c445200{0-2} 5c57494e4e545c d 33325c6e746f736b726e6c2e {0-29}3b4658
W32.MyLife.E: 7a *40656d 61696c2e636f6d
Note: These are the signatures from ClamAV, an open source virus scanner.
String Matching
[Figure: DFA for the signature "truth": q0 -t-> q1 -r-> q2 -u-> q3 -t-> q4 -h-> q5]
Text: We hold these truths to be self-evident, that …
How much work is it to scan a string of length N for a signature?
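A minimal sketch of the matching DFA, assuming the state number is the count of signature characters matched so far (so q5 means "truth" has just been seen). The names `build_dfa` and `find` are hypothetical. On a mismatch the machine falls back to the longest prefix of the signature that is still a suffix of what it has read, which the table builder computes by brute force.

```python
def build_dfa(pattern):
    """table[q][c] = next state after reading c in state q (q = characters matched)."""
    m = len(pattern)
    table = []
    for q in range(m):
        row = {}
        for c in set(pattern):
            seen = pattern[:q] + c
            k = min(m, q + 1)
            # Longest prefix of the pattern that is a suffix of what was just read.
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        table.append(row)
    return table

def find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_dfa(pattern)
    q = 0
    for i, c in enumerate(text):
        q = table[q].get(c, 0)  # characters outside the pattern send the DFA back to q0
        if q == len(pattern):
            return i - len(pattern) + 1
    return -1

print(find("We hold these truths to be self-evident, that", "truth"))
```

Each input character triggers exactly one table lookup, so scanning a text of length N for one signature takes N transitions once the table is built.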
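Faster String Matching
[Figure: DFA for "truth": q0 -t-> q1 -r-> q2 -u-> q3 -t-> q4 -h-> q5]
Text: We hold these truths to be self-evident, that …
Compare from the end of the signature and skip ahead on a mismatch (positions counted from 1):
– s[5] = h? No ('o' does not appear in truth), so skip 5.
– s[10] = h? Yes. s[9] = t? Yes. s[8] = u? No.
Skip table (how far to shift when the text character aligned with the end of truth is c):
a, b, c, d, e, f, g, i, j, k, l, m, n, o, p, q, s, v, w, x, y, z: 5
h: 0 (match: compare the preceding character)
r: 3
t: 1
u: 2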
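The skipping idea can be sketched with the Horspool simplification of Boyer-Moore, which uses only a per-character skip table and omits the rest of the full 1977 algorithm. The names `build_skip` and `skip_search` are hypothetical. One difference from a table that lists h as 0: here the last character of the signature gets no entry, so after a failed window ending in h the search shifts by the full signature length.

```python
def build_skip(pattern):
    """Distance from the last occurrence of each character to the end of the pattern.

    The final character is left out so that every shift is at least 1.
    """
    m = len(pattern)
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}

def skip_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0
    skip = build_skip(pattern)
    comparisons = 0
    end = m - 1  # text index aligned with the last pattern character
    while end < n:
        j = 0
        while j < m:  # compare right to left
            comparisons += 1
            if text[end - j] != pattern[m - 1 - j]:
                break
            j += 1
        if j == m:
            return end - m + 1, comparisons
        # Characters that never occur in the pattern allow a full-length shift.
        end += skip.get(text[end], m)
    return -1, comparisons

text = "We hold these truths to be self-evident, that"
print(build_skip("truth"))
print(skip_search(text, "truth"))
```

On this text the search looks at fewer characters than the 19 it passes over before the match ends, because most window ends land on characters that rule out several alignments at once.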
DFA / Skipping DFA Is a “Skipping DFA” still a DFA? (That is, does it still only accept the Regular Languages?)
Boyer-Moore Fast String Searching Algorithm (1977)
[Photo: J. Strother Moore (UT Austin)]
Best case: about N/w comparisons, where N is the length of the text and w is the length of the search string
Is this fast enough for a malware scanner?
Virus Detection Total number of signatures: 720,033 Nate Paul’s study Can we scan one input for many possible malware signatures quickly?
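Combining DFAs?
Regular languages are closed under union:
[Figure: a new start state q0 with ε-transitions to q_A0 and q_B0, the start states of machines A and B, which continue on a to q_A1 and q_B1, …]
How many states are there now?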
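To put numbers on the state-count question: the ε-transition construction gives an NFA with |A| + |B| + 1 states, while a deterministic machine that runs A and B in lockstep (the product construction) can need up to |A| · |B| states. The sketch below builds that product for two toy DFAs; `make_literal_dfa`, `union_dfa`, and `accepts` are hypothetical helpers, and the two-letter alphabet is an assumption chosen to keep the example small.

```python
from itertools import product

def make_literal_dfa(word, alphabet):
    """DFA accepting exactly `word`: states 0..len(word), plus one dead state."""
    dead = len(word) + 1
    delta = {}
    for q in range(len(word) + 2):
        for c in alphabet:
            delta[(q, c)] = q + 1 if q < len(word) and word[q] == c else dead
    return {"states": range(len(word) + 2), "start": 0,
            "accept": {len(word)}, "delta": delta}

def union_dfa(A, B, alphabet):
    """Product construction: each state is a pair (state of A, state of B)."""
    states = list(product(A["states"], B["states"]))
    delta = {((p, q), c): (A["delta"][(p, c)], B["delta"][(q, c)])
             for (p, q) in states for c in alphabet}
    accept = {(p, q) for (p, q) in states
              if p in A["accept"] or q in B["accept"]}
    return {"states": states, "start": (A["start"], B["start"]),
            "accept": accept, "delta": delta}

def accepts(M, w):
    """Run the DFA on w; every character of w must be in the alphabet."""
    q = M["start"]
    for c in w:
        q = M["delta"][(q, c)]
    return q in M["accept"]

A = make_literal_dfa("ab", "ab")  # 4 states
B = make_literal_dfa("ba", "ab")  # 4 states
U = union_dfa(A, B, "ab")
print(len(A["states"]) + len(B["states"]) + 1, len(U["states"]))
```

With two 4-state machines the NFA has 9 states and the full product has 16. Many product states are unreachable in practice, but the multiplicative worst case is why combining signatures one union at a time scales poorly.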
Signatures
Index the signatures by first byte: each of the 256 possible first bytes maps to a set of ~720000/256 (about 2,800) signatures.
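Try a Trie
[Figure: trie over signature bytes. From q0, edges 0x00, 0x01, 0x02, …, 0xFF lead to q00, q01, q02, …, qFF; from those states, edges 0x00, 0x01, 0x02, …, 0xFF lead to q0000, q0001, q0002, …, q01FF, …]
After two bytes: ~720000/(256*256) ≈ 11 candidate signatures
Alfred V. Aho and Margaret J. Corasick, 1975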
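A compact sketch of an Aho-Corasick style automaton: a trie over all the patterns plus failure links, so the input is scanned once no matter how many patterns there are. This is a sketch of the general technique, not how ClamAV implements it, and it handles only fixed byte strings (no wildcards such as `*` or `{0-2}`). The function names and the toy patterns are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    goto = [{}]  # goto[s][byte] -> child state (the trie edges)
    fail = [0]   # fail[s] -> state for the longest proper suffix that is also a trie path
    out = [[]]   # out[s] -> patterns that end when state s is reached
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    # Breadth-first pass: states at depth 1 fail to the root (state 0).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches that end at the suffix state
            queue.append(t)
    return goto, fail, out

def scan(data, automaton):
    """Return (start index, pattern) for every occurrence of any pattern in data."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(data):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits

automaton = build_automaton([b"he", b"she", b"his", b"hers"])
print(scan(b"ushers", automaton))
```

The trie shares common prefixes among patterns, so the number of states is bounded by the total length of all patterns plus one rather than growing multiplicatively.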
Scanner Demo
Evasive Malware Metamorphic Code: as virus propagates, each new copy is different How hard is it to automatically modify code without changing its behavior?
Detecting Evasive Malware
Less exact signatures (e.g., W32.MyLife.E: 7a *40656d61696c2e636f6d)
– Dangerous: start matching benign programs if you’re not careful!
Behavioral signatures: match the behavior, not the program text
– Undecidable in general (we’ll see in a few weeks)
– Expensive and difficult in practice (but done by all decent scanners)
Faster String Scanning
Charge
We focus on DFAs, NFAs, PDAs, CFGs, etc. as abstract models:
– Number of states, time to process, etc. don’t matter
Lots of real applications of these models:
– But in practice, what matters is different
If you have topics you want me to review, post comments (on today’s class announcement) by 5pm tomorrow.