
Ad Hoc Data and the Token Ambiguity Problem
Qian Xi*, Kathleen Fisher+, David Walker*, Kenny Zhu*
2009/1/19
*: Princeton University, +: AT&T Labs Research

Ad Hoc Data (1/19)
Standardized data: formats such as HTML and XML, with data processing tools such as visualizers (HTML browsers) and XQuery.
Ad hoc data: non-standard and semi-structured, with few data processing tools.
Examples: web server logs (CLF), phone call provisioning data, ...

[15/Oct/1997:18:46: ] "GET /tk/p.txt HTTP/1.0"
[16/Oct/1997:14:32: ] "POST /scpt/ddorg/confirm HTTP/1.0"
| |1| | | | ||no_ii152272|EDTF_6|0|MARVINS1|UNO|10| |
| |1| | | | ||no_ii15222|EDTF_6|0|MARVINS1|UNO|10| |20| |17| |19|1001

learnPADS Goal (2/19)
Automatically generate a description of the format, and from it a suite of data processing tools (XML converter, grapher, etc.).

Example records: "0,24"  "bar,end"  "foo,16"

Inferred declarative description:

Punion payload {
  Pint32 i;
  PstringFW(3) s2;
};
Pstruct source {
  '\"';
  payload p1;
  ",";
  payload p2;
  '\"';
};

learnPADS Architecture (3/19)
Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (Profiler, XML converter, ...) → Analysis Report, XML output.

learnPADS Framework Example (4/19)
Raw records such as "0,24", "bar,end", "foo,bag", "0,56", "cat,name" pass through Chunking & Tokenization to become token sequences (INT , INT) and (STR , STR). Structure Discovery then proposes a struct containing an opening quote, a union of INT and STR, a comma, a second union of INT and STR, and a closing quote; Format Refinement simplifies this candidate description further.

Token Ambiguity Problem (TAP) (5/19)
Given a string, there are multiple ways to tokenize it. A single input line (Message) might tokenize as
  Word White Word White Word White ... White URL
or as
  Word White Quote Filepath Quote White Word White ...
The old learnPADS has the user define a set of base tokens with a fixed order, and takes the first, longest match. The new solution is probabilistic tokenization: use probabilistic models to find the most likely token sequences. (A sketch of the ambiguity follows.)
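To make the ambiguity concrete, here is a small self-contained sketch (not learnPADS code; the token names and regular expressions are invented for illustration) that enumerates every way a set of overlapping base tokens can cover a string:

```python
import re

# Hypothetical base tokens; the real system's token set and order differ.
TOKENS = [
    ("Word",     r"[A-Za-z]+"),
    ("Int",      r"\d+"),
    ("Filepath", r"/[^\s\"]+"),
    ("Punct",    r"[^\sA-Za-z0-9]"),
    ("White",    r"\s+"),
]

def tokenizations(s, pos=0):
    """Yield every token sequence that exactly covers s[pos:]."""
    if pos == len(s):
        yield []
        return
    for name, rx in TOKENS:
        m = re.match(rx, s[pos:])
        if m:
            for rest in tokenizations(s, pos + m.end()):
                yield [(name, m.group())] + rest

# "/tk/p.txt" is both a single Filepath token and a Punct/Word/... sequence.
for seq in tokenizations('GET /tk/p.txt'):
    print(seq)
```

Even this tiny example admits two covers; real log lines admit many more, which is why choosing among them calls for a probabilistic model.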

Probabilistic Graphical Models (6/19)
A node is a random variable; an edge is a probabilistic relationship. The slide's example network: "burglar" and "earthquake" both point to "alarm", which points to "parent comes home".
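As a hedged illustration of those two definitions, the joint distribution of this network factors along its edges; the numbers below are invented, since the slide gives none:

```python
# Toy conditional probability tables for the burglar/earthquake network.
P_burglar, P_earthquake = 0.001, 0.002
P_alarm = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}  # P(alarm | burglar, quake)
P_home = {True: 0.70, False: 0.01}                      # P(parent home | alarm)

def joint(b, e, a, h):
    """P(b, e, a, h) = P(b) * P(e) * P(a | b, e) * P(h | a)."""
    pb = P_burglar if b else 1 - P_burglar
    pe = P_earthquake if e else 1 - P_earthquake
    pa = P_alarm[(b, e)] if a else 1 - P_alarm[(b, e)]
    ph = P_home[a] if h else 1 - P_home[a]
    return pb * pe * pa * ph

print(joint(True, False, True, True))  # burglar, no quake, alarm rings, parent home
```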

Hidden Markov Model (HMM) (7/19)
Observations are characters C_i, described by character features: upper/lower case, digit, punctuation, ...
Hidden states are pseudo-tokens T_i.
Goal: maximize the probability P(token sequence | character sequence).
Example: for the input characters " f o o , 1 6 ", the pseudo-token sequence is Quote Word Word Word Comma Int Int Quote, which collapses to the tokens Quote Word Comma Int Quote.
Transition probability: P(T_i | T_{i-1}). Emission probability: P(C_i | T_i).

Hidden Markov Model Formula (8/19)
The probability of a token sequence given the character sequence is: the probability that token T_1 comes first, times the probability that T_i follows T_{i-1} for all i (the transition probabilities), times the probability of seeing character C_i given token T_i for all i (the emission probabilities):

P(T_1 ... T_n | C_1 ... C_n) ∝ P(T_1) · ∏_{i=2..n} P(T_i | T_{i-1}) · ∏_{i=1..n} P(C_i | T_i)
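A minimal Viterbi sketch of this maximization, with an invented state set and hand-picked scores standing in for the learned parameters (the real tokenizer's states are the PADS base tokens):

```python
import math

STATES = ["Word", "Int", "Punct", "White"]
START = {s: math.log(1.0 / len(STATES)) for s in STATES}
TRANS = {(a, b): math.log(0.7 if a == b else 0.1)
         for a in STATES for b in STATES}

def emit(state, ch):
    """Crude stand-in for log P(C_i | T_i), based on character features."""
    ok = {"Word": ch.isalpha(), "Int": ch.isdigit(), "White": ch.isspace(),
          "Punct": not (ch.isalnum() or ch.isspace())}[state]
    return 0.0 if ok else -20.0

def viterbi(chars):
    # best[s] = (log score, path) of the best pseudo-token path ending in s
    best = {s: (START[s] + emit(s, chars[0]), [s]) for s in STATES}
    for ch in chars[1:]:
        best = {s: max(((p + TRANS[(q, s)] + emit(s, ch), path + [s])
                        for q, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in STATES}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi("foo,16"))  # ['Word', 'Word', 'Word', 'Punct', 'Int', 'Int']
```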

Hidden Markov Model Parameters (9/19)
The parameters are the transition probabilities P(T_i | T_{i-1}) and the emission probabilities P(C_i | T_i).
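One plausible way such parameters are estimated, sketched as normalized counting over hand-labeled character sequences (a real implementation would also smooth the counts):

```python
from collections import Counter, defaultdict

def train(labeled_seqs):
    """labeled_seqs: list of [(char, pseudo_token), ...] training sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for seq in labeled_seqs:
        for ch, tok in seq:
            emit[tok][ch] += 1                    # counts for P(C_i | T_i)
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans[prev][cur] += 1                 # counts for P(T_i | T_{i-1})
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return ({t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

trans_p, emit_p = train([[("f", "Word"), ("o", "Word"), ("o", "Word"),
                          (",", "Punct"), ("1", "Int"), ("6", "Int")]])
print(trans_p["Word"])  # {'Word': 0.666..., 'Punct': 0.333...}
```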

Hierarchical Models (10/19)
The slide's diagram: an upper level chains whole tokens (Quote Word Comma Int Quote for the input " foo , 16 "), while a lower level, built from Maximum Entropy or Support Vector Machine models, handles the characters inside each token.

Three Probabilistic Tokenizers (11/19)
Character-by-character Hidden Markov Model (HMM): one pseudo-token depends only on the previous one.
Hierarchical Maximum Entropy Model (HMEM): the upper level models the transition probabilities; the lower level constructs Maximum Entropy models for individual tokens.
Hierarchical Support Vector Machines (HSVM): same as HMEM, except that the lower level constructs Support Vector Machine models for individual tokens.
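A compact sketch of the hierarchical scheme: upper-level transition scores over whole tokens plus lower-level per-token scores. The toy token_score below stands in for the MaxEnt/SVM models, and the transition table is invented:

```python
import math

# Invented upper-level transition probabilities between whole tokens.
TRANS = {("Word", "Punct"): 0.4, ("Punct", "Int"): 0.4}

def token_score(tok_type, text):
    """Toy stand-in for the lower-level model's log P(text | tok_type)."""
    digit_frac = sum(ch.isdigit() for ch in text) / len(text)
    fit = digit_frac if tok_type == "Int" else 1.0 - digit_frac
    return math.log(max(fit, 1e-6))

def sequence_score(tokens):
    """Score a candidate tokenization [(tok_type, substring), ...]."""
    lower = sum(token_score(t, s) for t, s in tokens)
    upper = sum(math.log(TRANS.get((a[0], b[0]), 1e-6))
                for a, b in zip(tokens, tokens[1:]))
    return lower + upper

print(sequence_score([("Word", "foo"), ("Punct", ","), ("Int", "16")]))
```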

Tokenization by the Old learnPADS, HMM and HMEM (12/19)
Input: Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port

One probabilistic tokenization:
date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]

The old learnPADS:
date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]

Another probabilistic tokenization:
word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]

Test Data Sources (13/19)

Data source              KB/Chunks   Description
1967Transactions.short   70/999      transaction records
ai                        /3000      webserver log
yum.txt                  18/328      log from package install
rpmpkgs.txt              218/886     package name list
railroad.txt             6/67        US railroad information
dibbler                   /999       AT&T phone provision data
asl.log                  279/1500    log file of Mac ASL
scrollkeeper.log         66/671      application log
page_log                 28/354      printer logs
MER_T01_01.csv           22/491      comma-separated records
crashreporter.log        50/491      crash log
ls-l.txt                 2/35        stdout from Unix command ls -l
windowserver_last.log    52/680      log from LoginWindow server on Mac
netstat_an               14/202      output from netstat
boot.txt                 16/262      Mac OS boot log
quarterlypersonalincome  10/62       personal income spreadsheet
corald.log.head          83/78       application log from Coral project
coraldnssrv.log.head     41/21
probed.log.head          309/100
coralwebsrv.log.head     47/29

Evaluation 1 – Tokenization Accuracy (14/19)
Token error rate = % misidentified tokens. Token boundary error rate = % misidentified token boundaries.
Example: input string "qian Jan/19/09"; ideal token sequence: id white date; inferred token sequence: id white filepath. Token error rate = 1/3; token boundary error rate = 0/3.
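A sketch of the two measures, assuming tokens are (type, start, end) triples aligned positionally (the exact alignment used in the evaluation may differ):

```python
def token_error_rate(ideal, inferred):
    """Fraction of tokens whose (type, start, end) triple is misidentified."""
    wrong = sum(1 for a, b in zip(ideal, inferred) if a != b)
    return wrong / len(ideal)

def boundary_error_rate(ideal, inferred):
    """Fraction of tokens whose (start, end) span is wrong, ignoring type."""
    wrong = sum(1 for (_, s1, e1), (_, s2, e2) in zip(ideal, inferred)
                if (s1, e1) != (s2, e2))
    return wrong / len(ideal)

# The slide's example: "qian Jan/19/09" with date misread as filepath.
ideal    = [("id", 0, 4), ("white", 4, 5), ("date", 5, 14)]
inferred = [("id", 0, 4), ("white", 4, 5), ("filepath", 5, 14)]
print(token_error_rate(ideal, inferred))     # 0.333... = 1/3
print(boundary_error_rate(ideal, inferred))  # 0.0      = 0/3
```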

Evaluation 1 – Tokenization Accuracy (15/19)
PT = probabilistic tokenization; 20 testing data sources.

                                                  Token Error             Token Boundary Error
                                                  HMM    HMEM   HSVM     HMM    HMEM   HSVM
# data sources where PT decreases the error rate
# data sources where PT increases the error rate
# data sources where PT doesn't change the error rate
avg. error-rate decrease on files
  where there is an improvement                   59.2%  46.7%  50.3%    52.8%  55.4%  47.4%
avg. error-rate increase on files where
  there is a decrease in effectiveness            13.3%   4.9%  14.8%    12.0%   6.1%  14.7%

Evaluation 2 – Type and Data Costs (16/19)
Type cost: cost in bits of transmitting the description. Data cost: cost in bits of transmitting the data given the description. PT = probabilistic tokenization; 20 testing data sources.
For each of HMM, HMEM, and HSVM, on both type cost and data cost, the slide counts the data sources where PT decreases the cost, increases it, or doesn't change it.
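To illustrate the trade-off these costs measure, a minimal sketch under invented encodings (stand-ins, not PADS's actual codes): a more specific description costs more bits itself, but encodes the data more cheaply.

```python
import math

def int_bits(s):
    """Bits for a decimal integer under a simple universal-style code."""
    return 2.0 * math.log2(int(s) + 2)

def str_bits(s):
    """Naive 8-bits-per-character string code."""
    return 8.0 * len(s)

fields = ["24", "16", "303", "7"]
print("data cost as Pint32: ", sum(int_bits(f) for f in fields))  # ~ 41 bits
print("data cost as Pstring:", sum(str_bits(f) for f in fields))  # 64 bits
```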

Evaluation 3 – Execution Time (17/19)
The old learnPADS system takes 10 seconds to 25 minutes. The new system using probabilistic tokenization takes a few seconds to several hours: it needs extra time to find all possible token sequences and then the most likely ones. Fastest: the Hidden Markov Model. Most time-consuming: Hierarchical Support Vector Machines.

Related Work (18/19)
Grammar induction & structure discovery without the token ambiguity problem: Arasu & Garcia-Molina '03, "extracting structure from web pages"; Garofalakis et al. '00, "XTRACT for inferring DTDs"; Kushmerick et al. '97, "wrapper induction".
Detecting table row components with Hidden Markov Models & Conditional Random Fields: Pinto et al. '03.
Extracting certain fields in records from text: Borkar et al. '01.
Predicting exons and introns in DNA sequences using a generalized HMM: Kulp '96.
Part-of-speech tagging in natural language processing: Heeman '99 (decision trees).
Speech recognition: Rabiner '89.

Contributions (19/19)
Identified the Token Ambiguity Problem and took initial steps towards solving it with statistical models:
 Use all possible token sequences.
 Integrate three statistical approaches into the learnPADS framework: the Hidden Markov Model, the Hierarchical Maximum Entropy Model, and the Hierarchical Support Vector Machines model.
Evaluated correctness and performance by a number of measures:
 Results show that multiple token sequences and statistical methods achieve partial success.

End

Future Work
How to make use of "vertical" information:
 one record is not independent of the others
 key: alignment
 Conditional Random Fields
Online learning:
 old description + new data → new description

Evaluation 3 – Qualitative Comparison
Scores rate the inferred descriptions, from optimal down to descriptions that are too general (losing much useful information) or too verbose (structure unclear).

Data Source        lex  HMM  HMEM  HSVM   Data Source       lex  HMM  HMEM  HSVM
1967Transactions    0    0    0    0      crashreporter      2    0    1    1
ai                                        ls-l.txt           2    0    1    1
yum.txt             2    1    0           windowserver       2    0    1    1
rpmpkgs.txt         2   -2    0           netstat-an         2   -2    0    0
railroad.txt        2    1    1    1      boot.txt           2    1    1
dibbler                                   quarterlyincome    1    1    1    1
asl.log             2   -2    2    2      corald.log         0    1    1    0
scrollkeeper.log    1    2    1    1      coraldnssrv.log    0    1    1
page_log            0    0    0    0      probed.log         0    0    0    0
MER_T01_01.csv      0    1    0    0      coralwebsrv.log    0    1    1