Published by Evangeline O’Connor. Modified over 9 years ago.
1
© 2008 IBM Corporation
Regular Expression Learning for Information Extraction
Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*, Shivakumar Vaithyanathan*, H. V. Jagadish○
* IBM Almaden Research Center
○ University of Michigan
http://www.almaden.ibm.com/cs/projects/avatar/
2
Outline
  Motivation
  Regex Learning Problem
  Regex Transformations
  ReLIE Search Algorithm
  Experiments
  Summary
3
Importance of Regular Expression (Regex)
Regex is essential to many information extraction (IE) tasks:
  Email addresses
  Software names
  Credit card numbers
  Social security numbers
  Gene and protein names
  …
But writing regexes for an IE task is not straightforward.
Application domains: web collections, email compliance, bioinformatics
4
Phone Number Extraction
A simple pattern: blocks of digits separated by a non-word character:
  R_0 = (\d+\W)+\d+
Identifies valid phone numbers (e.g. 800-865-1125, 725-1234)
Produces invalid matches (e.g. 123-45-6789, 10/19/2002, 1.25, …)
Misses valid phone numbers (e.g. (800) 865-CARE)
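The three behaviours listed on this slide can be checked directly. The following is a minimal sketch in Python, assuming each example is tested as a whole string with `re.fullmatch`; the list names are illustrative and not from the slides.

```python
import re

# R_0 from the slide: blocks of digits separated by one non-word character.
R0 = re.compile(r"(\d+\W)+\d+")

def matches_whole(s: str) -> bool:
    """True if the entire string is a match of R_0."""
    return R0.fullmatch(s) is not None

valid_found = ["800-865-1125", "725-1234"]            # correct matches
invalid_found = ["123-45-6789", "10/19/2002", "1.25"]  # false positives
valid_missed = ["(800) 865-CARE"]                      # false negative

for s in valid_found + invalid_found + valid_missed:
    print(f"{s!r}: {'match' if matches_whole(s) else 'no match'}")
```

R_0 cannot tell a phone number from a date, an SSN-like string, or a decimal, because all of them are digit blocks joined by non-word characters.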
5
Software Name Extraction
A simple pattern: blocks of capitalized words followed by a version number:
  R_0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)
Misses valid software names (e.g. Windows XP)
6
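Conventional Regex Writing Process for IE
[Diagram: Regex_0 is run over sample documents, producing Match 1, Match 2, …; the developer asks "Good enough?"; if N, the regex is revised and run again; if Y, the result is Regex_final.]
Phone number example, successive revisions:
  Regex 0: (\d+\W)+\d+
  Regex 1: (\d+\W)+\d{4}
  Regex 2: (\d+[\.\s\-])+\d{4}
  Regex 3: (\d{3}[\.\s\-])+\d{4}
Sample matches: 800-865-1125, 725-1234, …; 123-45-6789, 10/19/2002, 1.25, …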
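To see what each manual revision buys, the four phone-number patterns on this slide can be run against the sample matches. This sketch reads the patterns as successive revisions Regex 0 to Regex 3 (an interpretation of the flattened diagram) and treats each sample as a whole string; the helper name `surviving` is hypothetical.

```python
import re

# The four patterns from the slide, read as successive manual revisions.
refinements = [
    ("Regex 0", r"(\d+\W)+\d+"),
    ("Regex 1", r"(\d+\W)+\d{4}"),
    ("Regex 2", r"(\d+[\.\s\-])+\d{4}"),
    ("Regex 3", r"(\d{3}[\.\s\-])+\d{4}"),
]
valid = ["800-865-1125", "725-1234"]
invalid = ["123-45-6789", "10/19/2002", "1.25"]

def surviving(pattern: str, strings: list) -> list:
    """Strings that the pattern still matches in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# For each revision: (valid matches kept, invalid matches still produced).
report = {name: (surviving(p, valid), surviving(p, invalid))
          for name, p in refinements}

for name, (kept, still_bad) in report.items():
    print(name, "| valid kept:", kept, "| invalid still matched:", still_bad)
```

Each revision removes at least one invalid match while keeping both valid ones, which is the manual loop that ReLIE aims to automate.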
7
Our goal: learning Regex_final automatically
[Diagram: Regex_0 is run over sample documents, producing Match 1, Match 2, …; the matches are labeled as PosMatch 1 … PosMatch n_0 and NegMatch 1 … NegMatch m_0; ReLIE uses the labeled matches to produce Regex_final.]
8
Intuition
Starting regex R_0:
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidates generated from R_0, each scored by F-measure (F_1, F_7, F_8, F_34, F_35, F_48, …):
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
  ((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  …
Best candidate R’:
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidates generated from R’:
  ([A-Z][a-z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  ((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\d{0,2}\d[\.]?){1,4}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,3}
  ([A-Z][a-z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  …
Generate candidate regular expressions by modifying the current regular expression.
Select the “best candidate” R’.
If R’ has a better F-measure than the current regular expression, repeat the process.
10
Regex Learning Problem
Ideally: find the best R_f among all possible regexes.
How do we define the best?
  Highest F-measure over a document collection D.
We can only compute the F-measure based on the labeled data.
  R_f must therefore be limited so that any match of R_f is also a match of R_0.
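Because only the matches of R_0 are labeled, a candidate can be scored by counting how many labeled positive and negative matches it still accepts. The sketch below assumes the standard harmonic-mean F-measure, with recall measured relative to the positive matches of R_0; the slides do not spell out the formula, so both the function name and this choice of recall are assumptions.

```python
def f_measure(pos_kept: int, neg_kept: int, pos_total: int) -> float:
    """F-measure of a candidate regex over labeled matches of R_0.

    pos_kept  : labeled positive matches the candidate still matches
    neg_kept  : labeled negative matches the candidate still matches
    pos_total : all labeled positive matches of R_0 (assumed recall base)
    """
    if pos_kept == 0:
        return 0.0  # nothing correct retained; avoids division by zero
    precision = pos_kept / (pos_kept + neg_kept)
    recall = pos_kept / pos_total
    return 2 * precision * recall / (precision + recall)

# R_0 itself on a toy labeling: 2 positives, 3 negatives, all matched.
print(f_measure(pos_kept=2, neg_kept=3, pos_total=2))
# A candidate that keeps both positives and rejects every negative.
print(f_measure(pos_kept=2, neg_kept=0, pos_total=2))
```

Under this scoring, a candidate that only narrows R_0 can never gain recall, so any improvement has to come from precision.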
11
Regex Learning as a Search Problem
M(R, D): matches of R over document collection D.
[Figure: the match set M(R_f, D)]
13
Two Regex Transformations
Drop-Disjunct Transformation:
  R  = R_a (R_1 | R_2 | … | R_i | R_{i+1} | … | R_n) R_b
  R’ = R_a (R_1 | … | R_i | …) R_b
Include-Intersect Transformation:
  R  = R_a X R_b
  R’ = R_a (X ∩ Y) R_b, where Y …
14
Applying Drop-Disjunct Transformation
Character Class Restriction
  E.g. to restrict the matching of non-word characters:
  (\d+\W)+\d+  →  (\d+[\.\s\-])+\d+
Quantifier Restriction
  E.g. to restrict the number of digits in a block:
  (\d+\W)+\d+  →  (\d{3}\W)+\d+
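Both restrictions only narrow what the original pattern accepts. One way to read them as drop-disjunct steps is that `\W` stands for a disjunction over all non-word characters and `\d+` for a disjunction over all block lengths, so each restriction keeps a subset of the alternatives. The sketch below checks the narrowing on the phone-number samples from earlier slides, assuming whole-string matching; the variable names are illustrative.

```python
import re

samples = ["800-865-1125", "725-1234", "123-45-6789", "10/19/2002", "1.25"]

R0 = r"(\d+\W)+\d+"
char_class_restricted = r"(\d+[\.\s\-])+\d+"   # \W narrowed to '.', whitespace, '-'
quantifier_restricted = r"(\d{3}\W)+\d+"       # \d+ narrowed to exactly 3 digits

def accepted(pattern: str) -> list:
    """Samples that the pattern matches in full, in input order."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print("R_0:                  ", accepted(R0))
print("char class restricted:", accepted(char_class_restricted))
print("quantifier restricted:", accepted(quantifier_restricted))
```

The character class restriction drops the date, and the quantifier restriction drops everything except the two valid phone numbers; neither accepts a string that R_0 rejected.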
15
Applying Include-Intersect Transformation
Negative Dictionaries
  Disallow certain words from matching specific portions of the regex.
  E.g. a simple pattern for software name extraction: blocks of capitalized words followed by a version number:
    R_0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
  Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
  Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)
  ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+  →  ((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+
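The effect of the negative dictionary can be checked on the examples from this slide. This is a small sketch that assumes the transformed regex places the negative lookahead at the start of each capitalized-word block and that each example is matched as a whole string; `R_neg` is an illustrative name.

```python
import re

# R_0 from the slide, and the same pattern with a negative dictionary
# (a negative lookahead) guarding each capitalized word.
R0 = re.compile(r"([A-Z]\w*\s*)+[Vv]?(\d+\.?)+")
R_neg = re.compile(r"((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+")

valid = ["Eclipse 3.2", "Windows 2000"]
invalid = ["ENGLISH 123", "Room 301", "Chapter 1.2"]

for s in valid + invalid:
    before = R0.fullmatch(s) is not None
    after = R_neg.fullmatch(s) is not None
    print(f"{s!r}: R_0={before}, with negative dictionary={after}")
```

The whole-string check matters here: with an unanchored `search`, the lookahead would only push the match start one character to the right (for example to `NGLISH 123`), so a scanner over running text would presumably also need a word-boundary anchor.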
17
ReLIE Algorithm
Starting regex R_0:
  ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidates generated from R_0, each scored by F-measure (F_1, F_7, F_8, F_34, F_35, F_48, …):
  Character class restrictions
    ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
    ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
    …
  Quantifier restrictions
    ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
    ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
    …
  Negative dictionary
    ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
    ((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Best candidate R’:
  ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
Candidates generated from R’:
  Character class restrictions
    ([A-Z][a-z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
    ([A-Z][a-z]{1,10}\s){1,5}\s*(\d{0,2}\d[\.]?){1,4}
    …
  Quantifier restrictions
    ([A-Z][a-z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
    ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,3}
    …
  Negative dictionary
    ((?!(Copyright|Page|Physics|Question| · · · |Article|Issue))[A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
    ([A-Z][a-z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
    …
Generate candidate regular expressions by applying a single transformation.
Select the “best candidate” R’ based on F-measure on the training corpus.
If R’ has a better F-measure than the current regular expression, repeat the process.
Use a validation set to avoid over-fitting.
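The loop on this slide is a greedy hill climb over candidate regexes. The sketch below illustrates that loop only and is not the authors' implementation: candidate generation is replaced by a small hand-written table (`CANDIDATES`, hypothetical) for the phone-number example, scoring assumes the standard F-measure with recall relative to the labeled positive matches of R_0, and the validation-set check is omitted.

```python
import re
from typing import Callable, List

def f_measure(pattern: str, positives: List[str], negatives: List[str]) -> float:
    """F-measure of `pattern` over labeled matches (assumed scoring)."""
    kept_pos = sum(1 for s in positives if re.fullmatch(pattern, s))
    kept_neg = sum(1 for s in negatives if re.fullmatch(pattern, s))
    if kept_pos == 0:
        return 0.0
    precision = kept_pos / (kept_pos + kept_neg)
    recall = kept_pos / len(positives)
    return 2 * precision * recall / (precision + recall)

def greedy_search(r0: str,
                  candidates_of: Callable[[str], List[str]],
                  positives: List[str],
                  negatives: List[str]) -> str:
    """Repeatedly move to the best-scoring candidate while it improves."""
    current = r0
    current_f = f_measure(current, positives, negatives)
    while True:
        candidates = candidates_of(current)
        if not candidates:
            return current
        best = max(candidates, key=lambda c: f_measure(c, positives, negatives))
        best_f = f_measure(best, positives, negatives)
        if best_f <= current_f:      # no strict improvement: stop
            return current
        current, current_f = best, best_f

# Hypothetical stand-in for automatic candidate generation: each entry
# lists regexes reachable from the key by one restriction.
CANDIDATES = {
    r"(\d+\W)+\d+": [r"(\d+[\.\s\-])+\d+", r"(\d+\W){1,2}\d+"],
    r"(\d+[\.\s\-])+\d+": [r"(\d+[\.\s\-])+\d{4}", r"(\d{3}[\.\s\-])+\d+"],
    r"(\d{3}[\.\s\-])+\d+": [r"(\d{3}[\.\s\-])+\d{4}"],
}

positives = ["800-865-1125", "725-1234"]
negatives = ["123-45-6789", "10/19/2002", "1.25"]

learned = greedy_search(r"(\d+\W)+\d+",
                        lambda r: CANDIDATES.get(r, []),
                        positives, negatives)
print(learned, f_measure(learned, positives, negatives))
```

On this toy data the search takes two steps and stops when the only remaining candidate ties the current score, since the rule on the slide requires a strictly better F-measure to continue.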
19
Experimental Setup
Data Sets
  EWeb: 50K web pages from the IBM intranet
  AWeb: 50K web pages from the University of Michigan web site
  AWeb-S: subset of 10K pages from AWeb
  Email: 10K emails from the Enron collection
Extraction Tasks
  SoftwareNameTask
  CourseNumberTask
  PhoneNumberTask
  URLTask
Comparison Study
  ReLIE
  Conditional Random Fields (CRF), base feature set:
    matches corresponding to the input regex
    three adjacent words to each side of the matches
20
Extraction Quality
ReLIE performs comparably with CRF, with a slight edge when training data is limited.
[Chart annotation: program repeatedly failed at the training phase.]
21
Cross-domain Evaluation
ReLIE significantly outperforms CRF for all three tasks.
CourseNumberTask is not tested, as course numbers exist only in AWeb.
22
Performance
ReLIE is an order of magnitude faster than CRF for both training and testing.
[Chart: average training/testing time (sec), with 40% of the data used for training]
23
What has ReLIE learned?
Patterns learned by ReLIE are similar to the features manually given to CRF.
24
ReLIE as Feature Extractor for CRF
C+RL: CRF + features learned by ReLIE
Token-level features learned by ReLIE are helpful when the training data is small.
Character-level features learned by ReLIE are always helpful.
26
Summary: ReLIE
Effective for learning regexes for certain classes of IE tasks.
Particularly useful in cross-domain settings or when training data is limited.
Potentially a powerful feature extractor for CRF and other machine learning algorithms.
27
http://www.almaden.ibm.com/cs/projects/avatar/