
Regular Expression Learning for Information Extraction
Yunyao Li*, Rajasekar Krishnamurthy*, Sriram Raghavan*, Shivakumar Vaithyanathan*, H. V. Jagadish○
* IBM Almaden Research Center
○ University of Michigan
http://www.almaden.ibm.com/cs/projects/avatar/
© 2008 IBM Corporation

Importance of Regular Expression (Regex)
- Regex is essential to many information extraction (IE) tasks:
  - Email addresses
  - Software names
  - Credit card numbers
  - Social security numbers
  - Gene and protein names
  - …
- But writing regexes for an IE task is not straightforward
- Application areas: web collections, email compliance, bioinformatics

Phone Number Extraction
- A simple pattern, blocks of digits separated by a non-word character: R0 = (\d+\W)+\d+
- Identifies valid phone numbers (e.g. 800-865-1125, 725-1234)
- Produces invalid matches (e.g. 123-45-6789, 10/19/2002, 1.25, …)
- Misses valid phone numbers (e.g. (800) 865-CARE)
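
The behaviour of R0 on the slide's own examples can be checked with a few lines of Python. This is only an illustrative sketch: the use of Python's `re` module and of whole-string matching (`fullmatch`) are assumptions made here, not something the slides specify.

```python
import re

# R0 from the slide: blocks of digits separated by a non-word character.
R0 = re.compile(r"(\d+\W)+\d+")

# Example strings taken from the slide.
samples = [
    "800-865-1125",    # valid phone number
    "725-1234",        # valid phone number
    "123-45-6789",     # not a phone number (SSN-like)
    "10/19/2002",      # not a phone number (date)
    "1.25",            # not a phone number (decimal)
    "(800) 865-CARE",  # valid phone number that R0 misses
]

# True when the whole string is accepted by R0.
accepts = {s: R0.fullmatch(s) is not None for s in samples}

for s, ok in accepts.items():
    print(f"{s!r:18} accepted={ok}")
```

The first five strings are accepted and the last one is rejected, which is exactly the precision and recall problem the slide describes.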

Software Name Extraction
- A simple pattern, blocks of capitalized words followed by a version number: R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
- Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
- Produces invalid matches (e.g. English 123, Room 301, Chapter 1.2)
- Misses valid software names (e.g. Windows XP)

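Conventional Regex Writing Process for IE
[Figure: flowchart of the manual loop. Write Regex 0, run it on sample documents, inspect the matches (Match 1, Match 2, …), and ask "Good enough?"; if no, revise the regex (Regex 1, Regex 2, Regex 3, …) and repeat; if yes, output Regex final.]
- Regexes shown in the figure: (\d+\W)+\d+, (\d+\W)+\d{4}, (\d+[\.\s\-])+\d{4}, (\d{3}[\.\s\-])+\d{4}
- Sample matches shown in the figure: 800-865-1125, 725-1234, …; 123-45-6789, 10/19/2002, 1.25, …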

Intuition
[Figure: search tree over regexes. Each candidate derived from the current regex is scored by F-measure (F1, F7, F8, F34, F35, F48, …); the best candidate R' is then expanded in the same way.]
- R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
- Example candidates derived from R0:
  - ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  - ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  - ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
  - ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
  - ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  - ((?!(Copyright|Page|Physics|Question|…|Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
- The selected R' restricts [a-zA-Z] to [a-z] in the first block; its own candidates are generated and scored in the next round
- Steps:
  - Generate candidate regular expressions by modifying the current regular expression
  - Select the "best candidate" R'
  - If R' has a better F-measure than the current regular expression, repeat the process

Regex Learning Problem
- Ideally: find the best R_f among all possible regexes
- How do we define the best? Highest F-measure over a document collection D
- We can only compute F-measure based on the labeled data
- We must restrict R_f such that any match of R_f is also a match of R_0
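
To make the scoring criterion concrete, here is a minimal sketch of how the F-measure of a regex could be computed on labeled data. The span-based exact-match scoring, the helper names `f_measure` and `labeled`, and the two toy documents are assumptions introduced for illustration; the slides do not say how matches are compared with labels.

```python
import re

def f_measure(regex, docs, gold):
    """F-measure of regex matches against gold (doc_index, start, end) spans."""
    found = {(i, m.start(), m.end())
             for i, doc in enumerate(docs)
             for m in re.finditer(regex, doc)}
    hits = len(found & gold)          # matches that coincide with a labeled span
    if hits == 0:
        return 0.0
    precision = hits / len(found)     # fraction of matches that are correct
    recall = hits / len(gold)         # fraction of labeled spans that are found
    return 2 * precision * recall / (precision + recall)

def labeled(examples):
    """Turn (document, [correct strings]) pairs into (docs, gold span set)."""
    docs = [doc for doc, _ in examples]
    gold = {(i, doc.index(a), doc.index(a) + len(a))
            for i, (doc, answers) in enumerate(examples) for a in answers}
    return docs, gold

# Hypothetical labeled data: only the phone numbers are marked as correct.
docs, gold = labeled([
    ("Call 800-865-1125 by 10/19/2002.", ["800-865-1125"]),
    ("Fax: 725-1234, ratio 1.25", ["725-1234"]),
])

f_initial = f_measure(r"(\d+\W)+\d+", docs, gold)             # 2 of 4 matches correct
f_refined = f_measure(r"(\d{3}[\.\s\-])+\d{4}", docs, gold)   # 2 of 2 matches correct
print(round(f_initial, 3), f_refined)
```

On this toy data the initial regex has precision 0.5 and recall 1.0, while the refined regex matches only the two labeled phone numbers.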

Two Regex Transformations
- Drop-Disjunct Transformation: R = R_a (R_1 | R_2 | … | R_i | R_i+1 | … | R_n) R_b → R' = R_a (R_1 | … | R_i | …) R_b
- Include-Intersect Transformation: R = R_a X R_b → R' = R_a (X ∩ Y) R_b, where Y ≠ ∅

Applying Drop-Disjunct Transformation
- Character class restriction
  - E.g. to restrict the matching of non-word characters: (\d+\W)+\d+ → (\d+[\.\s\-])+\d+
- Quantifier restriction
  - E.g. to restrict the number of digits in a block: (\d+\W)+\d+ → (\d{3}\W)+\d+
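
The effect of the two restrictions can be seen by running all three regexes over the example strings from the phone number slide. This is a small sketch, assuming Python's `re` module and whole-string matching; the helper `accepted` is a name made up for this example.

```python
import re

R0 = r"(\d+\W)+\d+"               # starting regex
CHAR_CLASS = r"(\d+[\.\s\-])+\d+"  # \W restricted to ., whitespace, -
QUANTIFIER = r"(\d{3}\W)+\d+"      # \d+ restricted to exactly three digits

samples = ["800-865-1125", "725-1234", "123-45-6789", "10/19/2002", "1.25"]

def accepted(pattern):
    """Return the set of sample strings matched in full by the pattern."""
    return {s for s in samples if re.fullmatch(pattern, s)}

base = accepted(R0)
by_class = accepted(CHAR_CLASS)
by_quantifier = accepted(QUANTIFIER)

print(sorted(base - by_class))       # removed by the character class restriction
print(sorted(base - by_quantifier))  # removed by the quantifier restriction
```

Both restricted regexes accept a subset of what R0 accepts: the character class restriction drops the date, and the quantifier restriction drops the date, the decimal, and the SSN-like string while keeping both phone numbers.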

Applying Include-Intersect Transformation
- Negative dictionaries
  - Disallow certain words from matching specific portions of the regex
- E.g. a simple pattern for software name extraction, blocks of capitalized words followed by a version number: R0 = ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+
  - Identifies valid software names (e.g. Eclipse 3.2, Windows 2000)
  - Produces invalid matches (e.g. ENGLISH 123, Room 301, Chapter 1.2)
- Transformation: ([A-Z]\w*\s*)+[Vv]?(\d+\.?)+ → ((?!ENGLISH|Room|Chapter)[A-Z]\w*\s*)+[Vv]?(\d+\.?)+
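
A negative dictionary of this kind can be expressed with a negative lookahead. The sketch below assumes Python's `re` module; the helper `with_negative_dictionary` is a hypothetical name, and it hard-codes the software name regex from the slide rather than handling arbitrary regexes.

```python
import re

R0 = r"([A-Z]\w*\s*)+[Vv]?(\d+\.?)+"

def with_negative_dictionary(words):
    """Build the slide's transformed regex: no capitalized word may start with a listed word."""
    alternatives = "|".join(re.escape(w) for w in words)
    return rf"((?!{alternatives})[A-Z]\w*\s*)+[Vv]?(\d+\.?)+"

R1 = with_negative_dictionary(["ENGLISH", "Room", "Chapter"])

samples = ["Eclipse 3.2", "Windows 2000", "ENGLISH 123", "Room 301", "Chapter 1.2"]
before = {s for s in samples if re.fullmatch(R0, s)}
after = {s for s in samples if re.fullmatch(R1, s)}

print(R1)
print(sorted(before - after))  # strings filtered out by the negative dictionary
```

Note that the lookahead rejects any capitalized word that merely begins with a dictionary entry, so this form is slightly stricter than an exact word match.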

ReLIE Algorithm
[Figure: search tree over regexes. Candidates are produced from the current regex by three kinds of transformation (character class restrictions, quantifier restrictions, negative dictionary); each is scored by F-measure (F1, F7, F8, F34, F35, F48, …) and the best candidate R' is expanded in the next round.]
- R0 = ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
- Character class restrictions, e.g.:
  - ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*([a-zA-Z]{0,2}\d[\.]?){1,4}
  - ([A-Z][a-z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
- Quantifier restrictions, e.g.:
  - ([A-Z][a-zA-Z]{1,10}\s){1,2}\s*(\w{0,2}\d[\.]?){1,4}
  - ([A-Z][a-zA-Z]{1,10}\s){2,4}\s*(\w{0,2}\d[\.]?){1,4}
- Negative dictionary, e.g.:
  - ([A-Z][a-zA-Z]{1,10}\s){1,5}\s*(?!(201|…|330))(\w{0,2}\d[\.]?){1,4}
  - ((?!(Copyright|Page|Physics|Question|…|Article|Issue))[A-Z][a-zA-Z]{1,10}\s){1,5}\s*(\w{0,2}\d[\.]?){1,4}
- Steps:
  - Generate candidate regular expressions by applying a single transformation
  - Select the "best candidate" R' based on F-measure on the training corpus
  - If R' has a better F-measure than the current regular expression, repeat the process
  - Use a validation set to avoid over-fitting
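
The greedy loop described in the steps above can be sketched as follows. This is not the authors' implementation: the candidate generator `toy_candidates` is a hypothetical stand-in that applies three hard-coded string rewrites, the labeled documents are invented, and stopping when the validation F-measure drops is only one possible reading of "use a validation set to avoid over-fitting".

```python
import re

def f_measure(regex, docs, gold):
    """F-measure of regex matches against gold (doc_index, start, end) spans."""
    found = {(i, m.start(), m.end())
             for i, doc in enumerate(docs) for m in re.finditer(regex, doc)}
    hits = len(found & gold)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(found), hits / len(gold)
    return 2 * precision * recall / (precision + recall)

def labeled(examples):
    """Turn (document, [correct strings]) pairs into (docs, gold span set)."""
    docs = [doc for doc, _ in examples]
    gold = {(i, doc.index(a), doc.index(a) + len(a))
            for i, (doc, answers) in enumerate(examples) for a in answers}
    return docs, gold

# Hypothetical single-step rewrites standing in for real candidate generation.
REWRITES = [
    (r"\W", r"[\.\s\-]"),      # character class restriction
    (r"(\d+", r"(\d{3}"),      # quantifier restriction on the repeated block
    (r")+\d+", r")+\d{4}"),    # quantifier restriction on the final block
]

def toy_candidates(regex):
    """Apply each applicable rewrite once to produce candidate regexes."""
    return [regex.replace(old, new, 1) for old, new in REWRITES if old in regex]

def learn(r0, train, validation):
    """Greedy hill climbing: keep the best candidate while F-measure improves."""
    current = r0
    train_f = f_measure(current, *train)
    valid_f = f_measure(current, *validation)
    while True:
        candidates = toy_candidates(current)
        if not candidates:
            break
        best = max(candidates, key=lambda r: f_measure(r, *train))
        best_f = f_measure(best, *train)
        if best_f <= train_f:
            break                      # no improvement on the training data
        best_valid_f = f_measure(best, *validation)
        if best_valid_f < valid_f:
            break                      # assumed over-fitting guard
        current, train_f, valid_f = best, best_f, best_valid_f
    return current

train = labeled([
    ("Call 800-865-1125 by 10/19/2002.", ["800-865-1125"]),
    ("Fax: 725-1234, ratio 1.25", ["725-1234"]),
    ("SSN 123-45-6789", []),
])
validation = labeled([
    ("Dial 408-555-0100 on 1/2/2003", ["408-555-0100"]),
    ("Version 2.50 or 555-0199", ["555-0199"]),
])

learned = learn(r"(\d+\W)+\d+", train, validation)
print(learned)
```

On this toy data the loop first restricts the repeated block to three digits, then the final block to four digits, and stops when the only remaining rewrite no longer improves the training F-measure.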

Experimental Set Up
- Data sets
  - EWeb: 50K web pages from the IBM intranet
  - AWeb: 50K web pages from the University of Michigan web site; AWeb-S: subset of 10K pages from AWeb
  - Email: 10K emails from the Enron collection
- Extraction tasks: SoftwareNameTask, CourseNumberTask, PhoneNumberTask, URLTask
- Comparison study
  - ReLIE
  - Conditional Random Fields (CRF), base feature set:
    - matches corresponding to the input regex
    - three adjacent words to each side of the matches

ReLIE as Feature Extractor for CRF
- C+RL: CRF + features learned by ReLIE
- Token-level features learned by ReLIE are helpful when the training data is small
- Character-level features learned by ReLIE are always helpful

ReLIE
- Effective for learning regexes for certain classes of IE tasks
- Particularly useful when
  - the setting is cross-domain, or
  - training data is limited
- Potentially a powerful feature extractor for CRF and other machine learning algorithms
