1 Adaptive Language Parsing
Teaching Parsers to Program Themselves
J. Zdziarski
©2008 Secure Computing Corporation. All Rights Reserved. 10/20/2015
2 The Problem
Adaptive spam filters have proven to work
Heuristics have proven to decompose
So how is this a problem?
3 The Problem
It’s a problem because:
Adaptive spam filters still use heuristics!
Well, most do anyway
4 Rules-Based Parsing
The Problem with Rules-Based Parsers
They make assumptions about language syntax
  Many languages have their own set of rules
  Requires foreknowledge of the languages being used
Spammers can’t obey RFC, let alone proper English
A machine can learn how to read better than you can
Some languages don’t support whitespace…
  THEREDWELLSAMISSLATE
    THE RED WELL, SAM IS SLATE
    THERE DWELLS A MISS, LATE (G. Sinnamon)
  ENDANGERSSPARSEAMANSWORDS
    ENDANGER! SPAR, SEAMAN SWORDS!
    END ANGERS; PARSE A MAN’S WORDS (D. Higgs)
Parse-o-Grams Courtesy of Robert Craigen, Univ. of Manitoba
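The whitespace-free ambiguity on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the function name `segmentations` and the tiny word list `WORDS` are hypothetical, chosen solely so the first parse-o-gram yields both of its readings.

```python
from functools import lru_cache

# Hypothetical mini-dictionary, picked only to reproduce the slide's example.
WORDS = frozenset({
    "THE", "THERE", "RED", "DWELLS", "WELL", "SAM",
    "A", "IS", "MISS", "SLATE", "LATE",
})

def segmentations(text, words=WORDS):
    """Return every way to split `text` into words from `words`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # One empty parse once the whole string has been consumed.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return [" ".join(parts) for parts in split_from(0)]

if __name__ == "__main__":
    for parse in segmentations("THEREDWELLSAMISSLATE"):
        print(parse)
```

Even with a fixed dictionary, the same character run has more than one valid parse, which is the assumption a rules-based tokenizer has to make blindly.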
5 Adaptive Language Parsing
3 Steps to Adaptive Parsing
1. Build a statistical hypothesis space for all parsing options
   This can be all ASCII chars, wchars, biGram separators, legacy heuristic rules
2. Calculate the probability that each parsing rule yields interesting data
   For each potential delimiter or rule, how often was it found in an uninteresting token (LOW) vs. how often was it found in an interesting token (HI)*
3. Use this data to reprogram the parser
   Take the most uninteresting N possible delimiters and use them to parse the document differently; wash, rinse, and repeat.
* Counters are per-token, not per-message

\[
P_{\mathrm{DELIM}} = \frac{\mathrm{LOW}_{\mathrm{DELIM}} / \mathrm{LOW}_{\mathrm{TOTAL}}}{(\mathrm{LOW}_{\mathrm{DELIM}} / \mathrm{LOW}_{\mathrm{TOTAL}}) + (\mathrm{HI}_{\mathrm{DELIM}} / \mathrm{HI}_{\mathrm{TOTAL}})}
\]
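One pass of the three steps could look roughly like the sketch below. It assumes single-character candidates and assumes the caller has already sorted tokens into uninteresting (LOW) and interesting (HI) lists; how that sorting is done is not stated on the slide. The function names `delimiter_probabilities`, `pick_delimiters`, and `reparse` are hypothetical.

```python
import re
from collections import Counter

def delimiter_probabilities(low_tokens, hi_tokens, candidates):
    """Score each candidate with the slide's P_DELIM formula.

    Values near 1.0 mean the candidate shows up mostly in uninteresting
    tokens. `candidates` is a set of single characters (an assumption).
    """
    low_counts, hi_counts = Counter(), Counter()
    for tok in low_tokens:
        low_counts.update(set(tok) & candidates)  # per-token, not per-occurrence
    for tok in hi_tokens:
        hi_counts.update(set(tok) & candidates)

    scores = {}
    for d in candidates:
        low_rate = low_counts[d] / len(low_tokens) if low_tokens else 0.0
        hi_rate = hi_counts[d] / len(hi_tokens) if hi_tokens else 0.0
        if low_rate + hi_rate > 0:  # skip candidates never observed
            scores[d] = low_rate / (low_rate + hi_rate)
    return scores

def pick_delimiters(scores, n):
    """Take the N most uninteresting candidates (highest P_DELIM)."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:n]

def reparse(text, delimiters):
    """Re-tokenize `text`, splitting on any of the chosen delimiters."""
    if not delimiters:
        return [text] if text else []
    pattern = "[" + "".join(re.escape(d) for d in delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]

if __name__ == "__main__":
    low = ["a,b", "c,d", "e.f"]       # made-up uninteresting tokens
    hi = ["x.y", "click", "$888"]     # made-up interesting tokens
    scores = delimiter_probabilities(low, hi, set(",.$"))
    chosen = pick_delimiters(scores, 1)
    print(chosen, reparse("a,b.c", chosen))
```

The "wash, rinse, and repeat" part would wrap these calls in a loop that re-scores tokens after each reparse; the stopping rule is left out here because the slide does not give one.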
6 Adaptive Language Parsing
Some Examples
Final Delimiter Set for a SpamAssassin Corpus run:
Header Body Delimiters:T?N,I?OS.pEmroaicthldesn
Interesting Data Generated
[ ],+click (8s, 0i)
    Click more interesting when with comma
[ ] igh (105s, 2i)
    |-|igh, High, H-IGH Interest Mortgage
[ ] $888 (15s, 0i)
    Yup, we knew about this one. So does Hal.
[ ] ional_Inc.+Now (6s, 0i)
[ ] s0r+C|ubs (12s, 0i)
    Junk can be very useful to the machine
Foreign Character Sets
[ ]!+ESC(B (50s, 0i)
[ ]ESC$B(-ESC(B (19s, 0i)
[ ]ESC$B!!!!!!!!!!ESC(B (29s, 0i)
    I have no idea what this means, but the machine says it’s Japanese spam.
7 Adaptive Language Parsing
Some Tests
SpamAssassin Corpus
Columns: TP, TN, FP, FN, Precision, Recall, FScore
Rows: Whitespace; Static Defaults; Adaptive,W; Adaptive,W; Pure,W; Pure,W
Chinese ISP Corpus
Columns: TP, TN, FP, FN, Precision, Recall, FScore
Rows: Whitespace; Static Defaults; Kakasi*; Adaptive,W; Adaptive,W
* Kakasi was not designed for Chinese, but it works pretty well anyway
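For readers unfamiliar with the table columns, the derived columns can be computed from the four counts as sketched below. The slide does not say which F-score variant was reported; the balanced F1 is assumed here, and the function name `classification_metrics` is hypothetical. The numbers in the usage lines are invented for illustration and are not results from the talk.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive precision, recall and F-score from confusion-matrix counts.

    Assumes the balanced F1 variant of the F-score. `tn` is accepted to
    mirror the table columns but is not needed by these three measures.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}

if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(classification_metrics(tp=90, tn=100, fp=10, fn=30))
```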
8 Counter-Example
So can we break it too?
SpamAssassin Corpus
Columns: TP, TN, FP, FN, Precision, Recall, FScore
Rows: Whitespace; Static Defaults; Adaptive,W; Adaptive,W; Pure,W; Pure,W
Counter Example
More junk tokens = lower efficiency, and of course less ability to catch anything (good or bad)
Answer: To some degree
9 Future Work
What Else Could We Do With This?
Extend support to statistically stem words and parse inflection
  Extend the hypothesis space to include biGram and triGram delimiters, and position of split (before, after, or as delimiter).
Character Set Detection
  Certain parsing models will no doubt adhere to specific character sets
Fuzzy Data Mining
  Improve text retrieval by parsing documents to be more machine-coherent
Apply to binary parsing challenges
  Parse executable files, forensic recovery of hard drives, pixel border detection, etc.
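The "position of split" idea might be expressed as below. This is a guess at what the three positions mean, not something the slides specify: "as" drops the delimiter, "after" keeps it on the end of the left token, and "before" keeps it on the front of the right token. The function name `split_at` and the mode names are hypothetical; the delimiter may be one character or an n-gram.

```python
def split_at(text, delim, mode="as"):
    """Split `text` on `delim`, which may be a single char or an n-gram.

    mode="as":     the delimiter is consumed (classic tokenizing)
    mode="after":  cut after the delimiter, so it stays with the left token
    mode="before": cut before the delimiter, so it stays with the right token
    """
    pieces = text.split(delim)  # `delim` must be non-empty
    if mode == "as":
        tokens = pieces
    elif mode == "after":
        tokens = [p + delim for p in pieces[:-1]] + [pieces[-1]]
    elif mode == "before":
        tokens = [pieces[0]] + [delim + p for p in pieces[1:]]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [t for t in tokens if t]

if __name__ == "__main__":
    for mode in ("as", "after", "before"):
        print(mode, split_at("High-Interest", "-", mode))
```

Under this reading, each (delimiter, position) pair would be a separate entry in the hypothesis space and could be scored the same way as a plain delimiter.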
10 Questions
Questions?
Jonathan Zdziarski