© 2008 IBM Corporation Regular Expression Learning for Information Extraction Yunyao Li *, Rajasekar Krishnamurthy *, Sriram Raghavan *, Shivakumar Vaithyanathan.

Presentation transcript:

Regular Expression Learning for Information Extraction
Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*, Shivakumar Vaithyanathan*, H. V. Jagadish○
* IBM Almaden Research Center
○ University of Michigan

Outline
- Motivation
- Regex Learning Problem
- Regex Transformations
- ReLIE Search Algorithm
- Experiments
- Summary

Importance of Regular Expression (Regex)
- Regex is essential to many information extraction (IE) tasks:
  - addresses
  - Software names
  - Credit card numbers
  - Social security numbers
  - Gene and protein names
  - …
- Domains: Web collections, compliance, bioinformatics
- But writing regexes for an IE task is not straightforward.

Phone Number Extraction
- A simple pattern, blocks of digits separated by a non-word character: R0 = (\d+\W)+\d+
- Identifies valid phone numbers (e.g , )
- Produces invalid matches (e.g , 10/19/2002, 1.25 …)
- Misses valid phone numbers (e.g. (800) 865-CARE)
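The behaviour of R0 described on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the number "800-865-1125" is a made-up example (the slide's own valid examples did not survive in the transcript), and whole-string matching with `re.fullmatch` is an assumption made to keep the check simple.

```python
import re

# R0 from the slide: blocks of digits separated by a non-word character.
R0 = re.compile(r"(\d+\W)+\d+")

# "800-865-1125" is a hypothetical valid phone number used for illustration;
# the other strings are taken from the slide.
samples = ["800-865-1125", "10/19/2002", "1.25", "(800) 865-CARE"]

# Record whether R0 matches each sample in full.
matches = {s: bool(R0.fullmatch(s)) for s in samples}

for text, ok in matches.items():
    print(f"{text!r}: {'match' if ok else 'no match'}")
```

The date and the decimal number are accepted alongside the phone number, while the number written with letters is rejected, which is the precision and recall problem the slide points out.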

Software Name Extraction
- A simple pattern, blocks of capitalized words followed by a version number: R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
- Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
- Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)
- Misses valid software names (e.g. Windows XP)

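Conventional Regex Writing Process for IE
[Figure: iterative loop. Regex 0 is run over sample documents to produce matches (Match 1, Match 2, …). If the result is not good enough (N), the regex is revised (Regex 1, Regex 2, Regex 3, …) and run again; if it is good enough (Y), it becomes Regex final. Regexes shown in the figure, in order of appearance: (\d+\W)+\d+, (\d+\W)+\d{4}, (\d+[\.\s\-])+\d{4}, (\d{3}[\.\s\-])+\d{4}]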

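Our goal: learning Regex final automatically
[Figure: Regex 0 is run over sample documents to produce matches (Match 1, Match 2, …). The matches are labeled as positive (PosMatch 1 … PosMatch n0) or negative (NegMatch 1 … NegMatch m0). ReLIE takes Regex 0 and the labeled matches and outputs Regex final.]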

Intuition
- Generate candidate regular expressions by modifying the current regular expression.
- Compute the F-measure of each candidate and select the "best candidate" R'.
- If R' has a better F-measure than the current regular expression, repeat the process.

[Figure: search tree. Each candidate is annotated with its F-measure (F1, F7, F8, F34, F35, F48, …).]
R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidates generated from R0:
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
  ((?!(Copyright|Page|Physics|Question|···|Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  …
Candidates generated from R' = ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}:
  ([A-Z][a-z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  ((?!(Copyright|Page|Physics|Question|···|Article|Issue))[A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\d{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,3}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  …

Regex Learning Problem
- Ideally: find the best Rf among all possible regexes.
- How do we define the best? The highest F-measure over a document collection D.
- We can only compute the F-measure based on the labeled data.
- Rf must be restricted so that any match of Rf is also a match of R0.
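As a rough illustration of scoring a regex against labeled data, the sketch below computes an F-measure from labeled matches of R0. It rests on simplifying assumptions that are not in the slides: each labeled match is treated as a standalone string, a candidate "keeps" a match when `re.fullmatch` succeeds, and recall is measured against the positives among R0's matches. The function name `f_measure` and the sample strings are hypothetical.

```python
import re

def f_measure(pattern, positives, negatives):
    """F-measure of `pattern` over labeled matches of the initial regex.

    positives / negatives: strings matched by R0 and labeled correct / incorrect.
    """
    rx = re.compile(pattern)
    tp = sum(1 for s in positives if rx.fullmatch(s))  # correct matches kept
    fp = sum(1 for s in negatives if rx.fullmatch(s))  # incorrect matches kept
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labeled matches of R0 = (\d+\W)+\d+
positives = ["800-865-1125", "408.927.1234"]
negatives = ["10/19/2002", "1.25"]

f_r0 = f_measure(r"(\d+\W)+\d+", positives, negatives)
f_refined = f_measure(r"(\d{3}[\.\s\-])+\d{4}", positives, negatives)
print(f_r0, f_refined)
```

Because every candidate is required to match only a subset of what R0 matches, labeling R0's matches once is enough to score any candidate this way.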

Regex Learning as a Search Problem
[Figure: diagram containing a region labeled M(Rf, D)]
M(R, D): matches of R over document collection D.

Two Regex Transformations
- Drop-disjunct transformation:
  R = Ra (R1 | R2 | … Ri | Ri+1 | … | Rn) Rb → R' = Ra (R1 | … Ri | …) Rb
- Include-intersect transformation:
  R = Ra X Rb → R' = Ra (X ∩ Y) Rb, where Y ≠ ∅

Applying Drop-Disjunct Transformation
- Character class restriction, e.g. to restrict the matching of non-word characters:
  (\d+\W)+\d+ → (\d+[\.\s\-])+\d+
- Quantifier restriction, e.g. to restrict the number of digits in a block:
  (\d+\W)+\d+ → (\d{3}\W)+\d+
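The two restrictions can be compared against R0 on a handful of strings. The sketch below is an illustration under assumptions: the sample strings other than 10/19/2002 and 1.25 are made up, and whole-string matching via `re.fullmatch` stands in for matching inside documents.

```python
import re

R0 = r"(\d+\W)+\d+"
R_CHAR_CLASS = r"(\d+[\.\s\-])+\d+"  # \W restricted to '.', whitespace, '-'
R_QUANTIFIER = r"(\d{3}\W)+\d+"      # \d+ restricted to exactly three digits

# "800-865-1125" and "800 865 1125" are hypothetical phone numbers.
SAMPLES = ["800-865-1125", "10/19/2002", "1.25", "800 865 1125"]

def matched(pattern, strings):
    """Return the subset of strings that the pattern matches in full."""
    rx = re.compile(pattern)
    return {s for s in strings if rx.fullmatch(s)}

base = matched(R0, SAMPLES)
by_class = matched(R_CHAR_CLASS, SAMPLES)
by_quant = matched(R_QUANTIFIER, SAMPLES)

print("dropped by character class restriction:", sorted(base - by_class))
print("dropped by quantifier restriction:", sorted(base - by_quant))
```

Both restricted regexes accept only strings that R0 already accepts, so each transformation can remove matches but never add new ones.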

Applying Include-Intersect Transformation
- Negative dictionaries: disallow certain words from matching specific portions of the regex.
- E.g. a simple pattern for software name extraction, blocks of capitalized words followed by a version number: R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
  - Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
  - Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)
- Transformation:
  ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+ → ((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+
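A negative dictionary can be expressed in Python with a negative lookahead, as in the slide's transformed regex. The sketch below is illustrative only; it assumes whole-string matching with `re.fullmatch` and uses the example strings from the slide.

```python
import re

# R0 from the slide, and the same regex with a negative dictionary
# (ENGLISH, Room, Chapter) applied to the capitalized-word block.
R0 = re.compile(r"([A-Z]\w*\s*)+[Vv]?(\d+\.?)+")
R_NEG = re.compile(r"((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+")

SAMPLES = ["Eclipse 3.2", "Windows 2000", "ENGLISH 123", "Room 301", "Chapter 1.2"]

# Strings still accepted after the transformation.
kept = [s for s in SAMPLES if R_NEG.fullmatch(s)]
# Strings accepted by R0 but rejected once the dictionary is applied.
dropped = [s for s in SAMPLES if R0.fullmatch(s) and not R_NEG.fullmatch(s)]

print("kept:", kept)
print("dropped:", dropped)
```

Note that a lookahead written this way rejects any capitalized word that merely starts with a dictionary entry; a word boundary after the alternation would be one way to tighten it.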

ReLIE Algorithm
- Generate candidate regular expressions by applying a single transformation (character class restriction, quantifier restriction, or negative dictionary) to the current regular expression.
- Select the "best candidate" R' based on F-measure on the training corpus.
- If R' has a better F-measure than the current regular expression, repeat the process.
- Use a validation set to avoid over-fitting.

[Figure: the search tree from the Intuition slide, starting at R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}, with the candidates at each level grouped into character class restrictions, quantifier restrictions, and negative dictionaries, and each candidate annotated with its F-measure.]
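The greedy loop on this slide can be sketched as a small hill-climbing search. This is not the authors' implementation: the candidate generator below uses a fixed, hypothetical list of textual rewrites (`REWRITES`) in place of ReLIE's transformations, the labeled strings are made up, the F-measure is computed over whole-string matches, and the validation-set step is left out for brevity.

```python
import re

def f_measure(pattern, positives, negatives):
    """F-measure of `pattern` over labeled matches of the initial regex."""
    rx = re.compile(pattern)
    tp = sum(1 for s in positives if rx.fullmatch(s))
    fp = sum(1 for s in negatives if rx.fullmatch(s))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(positives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical single-step rewrites standing in for the transformations.
REWRITES = [
    (r"\W", r"[\.\s\-]"),  # character class restriction
    (r"\d+", r"\d{3}"),    # quantifier restriction
    (r"\d+", r"\d{4}"),    # quantifier restriction
]

def candidates(regex):
    """All regexes obtained by applying one rewrite at one position."""
    out = set()
    for old, new in REWRITES:
        start = regex.find(old)
        while start != -1:
            out.add(regex[:start] + new + regex[start + len(old):])
            start = regex.find(old, start + 1)
    out.discard(regex)
    return sorted(out)

def greedy_search(r0, positives, negatives):
    """Move to the best-scoring candidate until no candidate improves."""
    current, best_f = r0, f_measure(r0, positives, negatives)
    while True:
        scored = [(f_measure(c, positives, negatives), c) for c in candidates(current)]
        if not scored:
            return current, best_f
        f, cand = max(scored)
        if f <= best_f:
            return current, best_f
        current, best_f = cand, f

positives = ["800-865-1125", "408.927.1234"]  # made-up labeled data
negatives = ["10/19/2002", "1.25"]
learned, score = greedy_search(r"(\d+\W)+\d+", positives, negatives)
print(learned, score)
```

On this toy data the search stops after one step, because restricting the first digit block to three digits already separates the positives from the negatives.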

Experimental Set Up
- Data sets
  - EWeb: 50K web pages from the IBM intranet
  - AWeb: 50K web pages from the University of Michigan web site
  - AWeb-S: subset of 10K pages from AWeb
  - 10K emails from the Enron collection
- Extraction tasks
  - SoftwareNameTask
  - CourseNumberTask
  - PhoneNumberTask
  - URLTask
- Comparison study
  - ReLIE
  - Conditional Random Fields (CRF), with a base feature set of:
    - matches corresponding to the input regex
    - three adjacent words to each side of the matches

Extraction Quality
- ReLIE performs comparably with CRF, with a slight edge when training data is limited.
[Figure note: the program repeatedly failed at the training phase.]

Cross-domain Evaluation
- ReLIE significantly outperforms CRF for all three tasks.
[Figure note (b): CourseNameTask is not tested, as course names exist only in AWeb.]

Performance
- ReLIE is an order of magnitude faster than CRF for both training and testing.
[Table: average training/testing time (sec), with 40% of the data used for training]

What has ReLIE learned?
- Patterns learned by ReLIE are similar to the features manually given to CRF.

ReLIE as Feature Extractor for CRF
- C+RL: CRF + features learned by ReLIE
- Token-level features learned by ReLIE are helpful when the training data is small.
- Character-level features learned by ReLIE are always helpful.

ReLIE
- Effective for learning regexes for certain classes of IE
- Particularly useful when
  - working cross-domain, or
  - training data is limited
- Potentially a powerful feature extractor for CRF and other machine learning algorithms
