©2008 Secure Computing Corporation. All Rights Reserved. 1 10/20/2015 Adaptive Language Parsing Teaching Parsers to Program Themselves J. Zdziarski

Presentation transcript:

1 Adaptive Language Parsing: Teaching Parsers to Program Themselves
J. Zdziarski

2 The Problem
- Adaptive spam filters have proven to work
- Heuristics have proven to decompose
- So how is this a problem?

3 The Problem
It’s a problem because:
- Adaptive spam filters still use heuristics!
- Well, most do anyway

4 Rules-Based Parsing
The Problem with Rules-Based Parsers
- They make assumptions about language syntax
- Many languages have their own set of rules
- Requires foreknowledge of the languages being used
- Spammers can’t obey RFC, let alone proper English
- A machine can learn how to read better than you can
- Some languages don’t support whitespace…
Parse-o-Grams (courtesy of Robert Craigen, Univ. of Manitoba)
THEREDWELLSAMISSLATE
  THE RED WELL, SAM IS SLATE
  THERE DWELLS A MISS, LATE (G. Sinnamon)
ENDANGERSSPARSEAMANSWORDS
  ENDANGER! SPAR, SEAMAN SWORDS!
  END ANGERS; PARSE A MAN’S WORDS (D. Higgs)
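The ambiguity in the first parse-o-gram can be reproduced with a few lines of code. The sketch below is only an illustration and is not taken from the slides: the function name `segmentations` and the small vocabulary are assumptions chosen so that both readings become visible.

```python
def segmentations(text, vocabulary):
    """Yield every way to split `text` into words found in `vocabulary`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Recurse on the remainder and prepend the word just matched.
            for rest in segmentations(text[end:], vocabulary):
                yield [word] + rest


# Hypothetical vocabulary: just the words of the two readings on the slide.
VOCAB = {"THE", "THERE", "RED", "WELL", "DWELLS", "SAM",
         "A", "IS", "MISS", "SLATE", "LATE"}

for words in segmentations("THEREDWELLSAMISSLATE", VOCAB):
    print(" ".join(words))
# THE RED WELL SAM IS SLATE
# THERE DWELLS A MISS LATE
```

A fixed rule set has to commit to one of these readings in advance, which is the weakness the slide is pointing at.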

5 Adaptive Language Parsing
3 Steps to Adaptive Parsing
1. Build a statistical hypothesis space for all parsing options
   This can be all ASCII chars, wchars, biGram separators, legacy heuristic rules
2. Calculate the probability that each parsing rule yields interesting data
   For each potential delimiter or rule, how often was it found in an uninteresting token (LOW) vs. how often was it found in an interesting token (HI)?
   P_DELIM = (LOW_DELIM / LOW_TOTAL) / ((LOW_DELIM / LOW_TOTAL) + (HI_DELIM / HI_TOTAL))
   * Counters are per-token, not per-message
3. Use this data to reprogram the parser
   Take the most uninteresting N possible delimiters and use them to parse the document differently; wash, rinse, and repeat.
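The three steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: candidates are restricted to single characters, each token is assumed to arrive already labeled interesting or uninteresting, LOW TOTAL and HI TOTAL are read as the number of uninteresting and interesting tokens, and all function names are hypothetical.

```python
import re
from collections import Counter


def delimiter_probability(low_delim, low_total, hi_delim, hi_total):
    """P_DELIM as given on the slide."""
    low_rate = low_delim / low_total if low_total else 0.0
    hi_rate = hi_delim / hi_total if hi_total else 0.0
    if low_rate + hi_rate == 0.0:
        return 0.5  # assumption: no evidence either way
    return low_rate / (low_rate + hi_rate)


def score_delimiters(labeled_tokens, candidates):
    """Steps 1 and 2: score every candidate character.

    Counters are per token: a character is counted once per token it occurs in.
    """
    candidates = set(candidates)
    low_counts, hi_counts = Counter(), Counter()
    low_total = hi_total = 0
    for token, interesting in labeled_tokens:
        present = candidates & set(token)
        if interesting:
            hi_total += 1
            hi_counts.update(present)
        else:
            low_total += 1
            low_counts.update(present)
    return {c: delimiter_probability(low_counts[c], low_total,
                                     hi_counts[c], hi_total)
            for c in candidates}


def top_delimiters(scores, n):
    """Step 3a: the N candidates most associated with uninteresting tokens."""
    return sorted(scores, key=lambda c: (-scores[c], c))[:n]


def reparse(text, delimiters):
    """Step 3b: split the text again using the chosen delimiters."""
    if not delimiters:
        return [text]
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


# Hypothetical labeled tokens: (token, is_interesting).
tokens = [("free,click", False), ("money.now", False), ("a.b", False),
          ("hello", True), ("meeting", True), ("re:lunch", True)]
scores = score_delimiters(tokens, ",.:")
chosen = top_delimiters(scores, 2)
print(chosen)                                   # [',', '.']
print(reparse("free,click money.now", chosen))  # ['free', 'click money', 'now']
```

In a full loop the re-parsed tokens would be labeled again and the scoring repeated; how a token gets its interesting or uninteresting label is left open here because the slide does not specify it.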

6 Adaptive Language Parsing
Some Examples
Final Delimiter Set for a SpamAssassin Corpus run:
  Header Body Delimiters: T?N,I?OS.pEmroaicthldesn
Interesting Data Generated
  [ ] ,+click (8s, 0i)
      Click more interesting when with comma
  [ ] igh (105s, 2i)
      |-|igh, High, H-IGH Interest Mortgage
  [ ] $888 (15s, 0i)
      Yup, we knew about this one. So does Hal.
  [ ] ional_Inc.+Now (6s, 0i)
  [ ] s0r+C|ubs (12s, 0i)
      Junk can be very useful to the machine
Foreign Character Sets
  [ ] !+ESC(B (50s, 0i)
  [ ] ESC$B(-ESC(B (19s, 0i)
  [ ] ESC$B!!!!!!!!!!ESC(B (29s, 0i)
      I have no idea what this means, but the machine says it’s Japanese spam.

7 Adaptive Language Parsing
Some Tests
SpamAssassin Corpus
                     TP  TN  FP  FN  Precision  Recall  FScore
  Whitespace
  Static Defaults
  Adaptive,W
  Adaptive,W
  Pure,W
  Pure,W
Chinese ISP Corpus
                     TP  TN  FP  FN  Precision  Recall  FScore
  Whitespace
  Static Defaults
  Kakasi*
  Adaptive,W
  Adaptive,W
* Kakasi was not designed for Chinese, but it works pretty well anyway
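For reference, the last three columns follow from the first four. The sketch below uses the standard definitions and assumes FScore means the balanced F1 measure, which the slide does not state; the counts in the example are made up for illustration and are not results from either corpus.

```python
def precision_recall_fscore(tp, fp, fn):
    """Standard precision, recall and balanced F-score (F1).

    TN does not enter any of the three measures.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_fscore(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} fscore={f:.3f}")
```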

8 Counter-Example
So can we break it too?
Answer: To some degree
SpamAssassin Corpus
                     TP  TN  FP  FN  Precision  Recall  FScore
  Whitespace
  Static Defaults
  Adaptive,W
  Adaptive,W
  Pure,W
  Pure,W
Counter Example
More junk tokens = lower efficiency, and of course less ability to catch anything (good or bad)

9 Future Work
What Else Could We Do With This?
- Extend support to statistically stem words and parse inflection
  Extend the hypothesis space to include biGram and triGram delimiters, and position of split (before, after, or as delimiter).
- Character Set Detection
  Certain parsing models will no doubt adhere to specific character sets
- Fuzzy Data Mining
  Improve text retrieval by parsing documents to be more machine-coherent
- Apply to binary parsing challenges
  Parse executable files, forensic recovery of hard drives, pixel border detection, etc.
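One possible reading of "position of split" with a multi-character delimiter is sketched below. The function name and the three mode strings are assumptions made for illustration; the slides do not describe how this would be implemented.

```python
def split_with_position(text, delim, mode):
    """Split `text` at each occurrence of `delim`.

    mode "as":     the delimiter is dropped (classic splitting)
    mode "before": the split falls before the delimiter, which starts the next token
    mode "after":  the split falls after the delimiter, which ends the current token
    """
    if not delim:
        raise ValueError("delimiter must be non-empty")
    if mode not in ("as", "before", "after"):
        raise ValueError("mode must be 'as', 'before' or 'after'")
    tokens, start = [], 0
    i = text.find(delim)
    while i != -1:
        if mode == "as":
            tokens.append(text[start:i])
            start = i + len(delim)
        elif mode == "before":
            tokens.append(text[start:i])
            start = i
        else:  # "after"
            tokens.append(text[start:i + len(delim)])
            start = i + len(delim)
        i = text.find(delim, i + len(delim))
    tokens.append(text[start:])
    return [tok for tok in tokens if tok]


print(split_with_position("a::b::c", "::", "as"))      # ['a', 'b', 'c']
print(split_with_position("a::b::c", "::", "before"))  # ['a', '::b', '::c']
print(split_with_position("a::b::c", "::", "after"))   # ['a::', 'b::', 'c']
```

Each (delimiter, mode) pair could then be treated as one more entry in the hypothesis space and scored like any single-character delimiter.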

10 Questions
Questions?
Jonathan Zdziarski