Ad Hoc Data and the Token Ambiguity Problem
Qian Xi*, Kathleen Fisher+, David Walker*, Kenny Zhu*
*: Princeton University, +: AT&T Labs Research
January 19, 2009
Ad Hoc Data

Standardized data formats (HTML, XML) have many data processing tools: visualizers (HTML browsers), query languages (XQuery), and so on. Ad hoc data is non-standard and semi-structured, and few data processing tools exist for it. Examples: web server logs (CLF), phone call provisioning data, ...

  [15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
  [16/Oct/1997:14:32: ] "POST /scpt/ddorg/confirm HTTP/1.0"

  | |1| | | | ||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|
  | |1| | | | ||no_ii15222|EDTF_6|0|MARVINS1|UNO|10| |20| |17| |19|1001
learnPADS Goals

Automatically generate a description of the format, and automatically generate a suite of data processing tools (XML converter, grapher, etc.).

Example records:
  "0,24"
  "bar,end"
  "foo,16"

Inferred declarative description:

```
Punion payload {
  Pint32 i;
  PstringFW(3) s2;
};
Pstruct source {
  '\"'; payload p1;
  ",";  payload p2;
  '\"';
};
```
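To make the description concrete, here is a minimal Python sketch (not part of learnPADS; the function names are ours) of how a parser generated from this description would read a record: for each payload it tries the Pint32 branch first and falls back to the fixed-width three-character string.

```python
def parse_payload(s):
    """Parse a 'payload' union: a 32-bit int, or a fixed-width 3-char string."""
    try:
        return ("Pint32", int(s))       # first branch: Pint32 i
    except ValueError:
        if len(s) == 3:
            return ("PstringFW(3)", s)  # second branch: PstringFW(3) s2
        raise ValueError("no branch of 'payload' matches %r" % s)

def parse_source(line):
    """Parse a 'source' record: '"' payload ',' payload '"'."""
    assert line.startswith('"') and line.endswith('"')
    p1, p2 = line[1:-1].split(",", 1)
    return parse_payload(p1), parse_payload(p2)

for rec in ['"0,24"', '"bar,end"', '"foo,16"']:
    print(parse_source(rec))
```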
learnPADS Architecture

Format inference engine: Raw Data → Chunking & Tokenization → Structure Discovery → Format Refinement → Data Description.

The inferred data description is fed to the PADS compiler, which generates tools such as a profiler (producing an analysis report) and an XML converter (producing XML).
learnPADS Framework (running example)

Input records: "0,24"  "bar,end"  "foo,bag"  "0,56"  "cat,name"

Chunking & tokenization turns each record into a token sequence, e.g. " INT , INT " and " STR , STR ". Structure discovery finds the shared shape: a struct consisting of a quote, a union of INT and STR, a comma, a second union, and a closing quote. Format refinement then rewrites this rough structure into the final description.
Token Ambiguity Problem (TAP)

Given a string, there are multiple ways to tokenize it, for example:
  Message
  Word White Word White Word White ... White URL
  Word White Quote Filepath Quote White Word White ...

The old learnPADS had the user define a set of base tokens with a fixed order, and took the first, longest match. The new solution is probabilistic tokenization: use probabilistic models to find the most likely token sequences. (A small illustration of the ambiguity follows.)
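A minimal sketch of the problem (the token set and regular expressions are illustrative, not learnPADS's actual base tokens): even a tiny token set admits several complete tokenizations of one string.

```python
import re

# A toy base-token set; real learnPADS base tokens are richer.
TOKENS = [
    ("INT",   re.compile(r"\d+")),
    ("WORD",  re.compile(r"[A-Za-z]+")),
    ("PATH",  re.compile(r"(/[A-Za-z0-9.]+)+")),
    ("PUNCT", re.compile(r"[^\w\s]")),
    ("WHITE", re.compile(r"\s+")),
]

def tokenizations(s, pos=0):
    """Yield every way to cover s[pos:] with the base tokens."""
    if pos == len(s):
        yield []
        return
    for name, rx in TOKENS:
        m = rx.match(s, pos)
        if m:
            for rest in tokenizations(s, m.end()):
                yield [(name, m.group())] + rest

for seq in tokenizations("GET /tk/p.txt"):
    print(seq)  # three distinct token sequences for one string
```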
Probabilistic Graphical Models

Node: a random variable. Edge: a probabilistic relationship. Classic example: burglar and earthquake are both possible causes of alarm, and alarm influences whether the parent comes home.
Hidden Markov Model (HMM)

Observation: character C_i, described by character features (upper/lower case, digit, punctuation, ...). Hidden state: pseudo-token T_i, one per character; runs of identical pseudo-tokens merge into tokens. Goal: maximize the probability P(token sequence | character sequence).

Transition probability: P(T_i | T_{i-1}). Emission probability: P(C_i | T_i).

Example: for the input "foo,16" (quotes included), the per-character pseudo-tokens are Quote Word Word Word Comma Int Int Quote, which merge into the tokens Quote Word Comma Int Quote.
Hidden Markov Model Formula

The probability of a token sequence T given a character sequence C is

  P(T \mid C) \propto P(T_1) \cdot \prod_{i=2}^{n} P(T_i \mid T_{i-1}) \cdot \prod_{i=1}^{n} P(C_i \mid T_i)

that is, the probability that token T_1 comes first, times the transition probability P(T_i | T_{i-1}) for all i, times the emission probability P(C_i | T_i) for all i.
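The most likely pseudo-token sequence under this formula can be found with Viterbi decoding. A minimal sketch, with a made-up two-state model for illustration (the state names, probabilities, and smoothing floor are ours, not the paper's):

```python
import math

def viterbi(chars, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for `chars` under an HMM.

    start_p[s]    = P(first state is s)
    trans_p[s][t] = P(next state is t | current state is s)
    emit_p[s][c]  = P(character c | state s)
    """
    floor = 1e-9  # crude smoothing for unseen events
    best = {s: (math.log(start_p.get(s, floor)) +
                math.log(emit_p[s].get(chars[0], floor)), [s])
            for s in states}
    for c in chars[1:]:
        best = {t: max((lp + math.log(trans_p[s].get(t, floor)) +
                        math.log(emit_p[t].get(c, floor)), path + [t])
                       for s, (lp, path) in best.items())
                for t in states}
    return max(best.values())[1]

# Toy model: two pseudo-token states over lowercase letters and digits.
states = ["Word", "Int"]
start  = {"Word": 0.5, "Int": 0.5}
trans  = {"Word": {"Word": 0.9, "Int": 0.1}, "Int": {"Word": 0.1, "Int": 0.9}}
emit   = {"Word": {c: 1 / 26 for c in "abcdefghijklmnopqrstuvwxyz"},
          "Int":  {c: 1 / 10 for c in "0123456789"}}
print(viterbi("foo16", states, start, trans, emit))
# ['Word', 'Word', 'Word', 'Int', 'Int'] -- runs merge into tokens Word, Int
```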
Hidden Markov Model Parameters

[The slide shows the formulas for the transition probabilities P(T_i | T_{i-1}) and the emission probabilities P(C_i | T_i), estimated from training data.]
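A reconstruction of the standard maximum-likelihood estimates for these parameters, computed by counting over labeled training data (the notation is ours, not necessarily the slide's):

```latex
\hat{P}(T_i = t \mid T_{i-1} = s) = \frac{\mathrm{count}(s \to t)}{\mathrm{count}(s)}
\qquad
\hat{P}(C_i = c \mid T_i = t) = \frac{\mathrm{count}(t \text{ emits } c)}{\mathrm{count}(t)}
```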
Hierarchical Models

[The slide illustrates a two-level model on the record "foo,16": an upper level over the token sequence Quote Word Comma Int Quote, and a lower level that models the characters inside each token with Maximum Entropy or Support Vector Machine classifiers.]
Three Probabilistic Tokenizers

Character-by-character Hidden Markov Model (HMM): each pseudo-token depends only on the previous one.

Hierarchical Maximum Entropy Model (HMEM): the upper level models the transition probabilities; the lower level constructs Maximum Entropy models for individual tokens.

Hierarchical Support Vector Machines (HSVM): same as HMEM, except that the lower level constructs Support Vector Machine models for individual tokens. (A sketch of the lower level follows.)
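A minimal sketch of the lower level, with scikit-learn's LogisticRegression standing in for a Maximum Entropy model (swap in sklearn.svm.SVC(probability=True) for the HSVM variant). The feature set, the toy training examples, and the use of a single multiclass model instead of per-token models are our simplifications:

```python
from sklearn.linear_model import LogisticRegression  # MaxEnt stand-in

def char_features(c):
    """Character features of the kind the lower-level models consume."""
    return [c.isupper(), c.islower(), c.isdigit(), not c.isalnum()]

def string_features(s, width=8):
    """Fixed-width feature vector for a candidate token string."""
    feats = [f for c in s[:width] for f in char_features(c)]
    return feats + [False] * (4 * width - len(feats))  # pad short strings

# Labeled token examples (toy data).
examples = [("Word", "foo"), ("Word", "GET"), ("Int", "16"), ("Int", "1997"),
            ("Quote", '"'), ("Comma", ",")]
clf = LogisticRegression(max_iter=1000).fit(
    [string_features(s) for _, s in examples], [t for t, _ in examples])

# Lower level: P(token class | candidate substring). The upper level combines
# these scores with token-to-token transition probabilities, as in the HMM.
probs = clf.predict_proba([string_features("24")])[0]
print(dict(zip(clf.classes_, probs)))
```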
Tokenization by the Old learnPADS, HMM, and HMEM

Input line:
  Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port

The slide compares three tokenizations of this line, by the old learnPADS, HMM, and HMEM:

  date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]

  date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]

  word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]
Test Data Sources

  Data source               KB/Chunks   Description
  1967Transactions.short    70/999      transaction records
  ai                           /3000    web server log
  yum.txt                   18/328      log from package install
  rpmpkgs.txt               218/886     package name list
  railroad.txt              6/67        US railroad information
  dibbler                      /999     AT&T phone provisioning data
  asl.log                   279/1500    log file of Mac ASL
  scrollkeeper.log          66/671      application log
  page_log                  28/354      printer logs
  MER_T01_01.csv            22/491      comma-separated records
  crashreporter.log         50/491      crash log
  ls-l.txt                  2/35        stdout from the Unix command ls -l
  windowserver_last.log     52/680      log from the LoginWindow server on Mac
  netstat_an                14/202      output from netstat
  boot.txt                  16/262      Mac OS boot log
  quarterlypersonalincome   10/62       personal income spreadsheet
  corald.log.head           83/78       application log from the Coral project
  coraldnssrv.log.head      41/21
  probed.log.head           309/100
  coralwebsrv.log.head      47/29
Evaluation 1 – Tokenization Accuracy

Token error rate: percentage of misidentified tokens. Token boundary error rate: percentage of misidentified token boundaries.

Example:
  input string:            qian Jan/19/09
  ideal token sequence:    id white date
  inferred token sequence: id white filepath
  token error rate = 1/3 (date misidentified as filepath)
  token boundary error rate = 0/3 (the wrong token still has the right boundaries)
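A minimal sketch of these two metrics, assuming each token is represented as a (name, start, end) triple over the input string (the representation is our assumption, not the paper's):

```python
def token_error_rate(ideal, inferred):
    """Fraction of ideal tokens not reproduced exactly (name and span)."""
    inferred_set = set(inferred)
    return sum(tok not in inferred_set for tok in ideal) / len(ideal)

def boundary_error_rate(ideal, inferred):
    """Fraction of ideal token spans whose boundaries are not reproduced."""
    bounds = lambda toks: {(s, e) for _, s, e in toks}
    return sum(b not in bounds(inferred) for b in bounds(ideal)) / len(bounds(ideal))

# "qian Jan/19/09": ideal is id white date; inferred is id white filepath.
ideal    = [("id", 0, 4), ("white", 4, 5), ("date", 5, 14)]
inferred = [("id", 0, 4), ("white", 4, 5), ("filepath", 5, 14)]
print(token_error_rate(ideal, inferred))     # 1/3: date misidentified as filepath
print(boundary_error_rate(ideal, inferred))  # 0/3: all boundaries agree
```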
Evaluation 1 – Tokenization Accuracy (Results)

                                                 Token Error           Token Boundary Error
                                                 HMM    HMEM   HSVM    HMM    HMEM   HSVM
  # data sources where PT decreases the error rate
  # data sources where PT increases the error rate
  # data sources where PT doesn't change the error rate
  average error-rate decrease on files
    where there is an improvement                59.2%  46.7%  50.3%   52.8%  55.4%  47.4%
  average error-rate increase on files
    where there is a decrease in effectiveness   13.3%  4.9%   14.8%   12.0%  6.1%   14.7%

PT: probabilistic tokenization. Number of testing data sources: 20.
Evaluation 2 – Type and Data Costs

Type cost: cost in bits of transmitting the description. Data cost: cost in bits of transmitting the data given the description.

                                          Type Cost             Data Cost
                                          HMM   HMEM   HSVM     HMM   HMEM   HSVM
  # data sources where PT decreases the cost
  # data sources where PT increases the cost
  # data sources where PT doesn't change the cost

PT: probabilistic tokenization. Number of testing data sources: 20.
Evaluation 3 – Execution Time

The old learnPADS system takes 10 seconds to 25 minutes. The new system using probabilistic tokenization takes a few seconds to several hours: it requires extra time to find all possible token sequences and then the most likely ones among them. Fastest: the Hidden Markov Model. Most time-consuming: Hierarchical Support Vector Machines.
Related Work

Grammar induction & structure discovery without the token ambiguity problem: Arasu & Garcia-Molina '03 (extracting structure from web pages), Garofalakis et al. '00 (XTRACT for inferring DTDs), Kushmerick et al. '97 (wrapper induction).
Detecting table row components with Hidden Markov Models & Conditional Random Fields: Pinto et al. '03.
Extracting certain fields in records from text: Borkar et al. '01.
Predicting exons and introns in DNA sequences using generalized HMMs: Kulp '96.
Part-of-speech tagging in natural language processing: Heeman '99 (decision trees).
Speech recognition: Rabiner '89.
Contributions

Identified the Token Ambiguity Problem and took initial steps towards solving it with statistical models, using all possible token sequences.
Integrated three statistical approaches into the learnPADS framework: the Hidden Markov Model, the Hierarchical Maximum Entropy Model, and the Hierarchical Support Vector Machines model.
Evaluated correctness and performance by a number of measures; the results show that multiple token sequences and statistical methods achieve partial success.
End
Future Work

How to make use of "vertical" information: one record is not independent of the others; the key is alignment; Conditional Random Fields.
Online learning: old description + new data → new description.
Evaluation 3 – Qualitative Comparison

Scores (0 = optimal). The two failure modes: a description that is too general (losing much useful information), or too verbose (with unclear structure).

  Data source        lex  HMM  HMEM  HSVM     Data source        lex  HMM  HMEM  HSVM
  1967Transactions    0    0    0    0        crashreporter       2    0    1    1
  ai                                          ls-l.txt            2    0    1    1
  yum.txt             2    1    0             windowserver        2    0    1    1
  rpmpkgs.txt         2   -2    0             netstat-an          2   -2    0    0
  railroad.txt        2    1    1    1        boot.txt            2    1    1
  dibbler                                     quarterlyincome     1    1    1    1
  asl.log             2   -2    2    2        corald.log          0    1    1    0
  scrollkeeper.log    1    2    1    1        coraldnssrv.log     0    1    1
  page_log            0    0    0    0        probed.log          0    0    0    0
  MER_T01_01.csv      0    1    0    0        coralwebsrv.log     0    1    1