1 Relational Learning of Pattern-Match Rules for Information Extraction
Presentation by Tim Chartrand of a paper by Mary Elaine Califf and Raymond J. Mooney
2 Introduction
- Information Extraction (IE) is the task of locating specific pieces of information in NL text
- IE is an important subpart of text understanding
- IE systems are difficult and time-consuming to build, and they don’t port well to different domains
- Researchers are combining learning methods with NLP methods to automate IE
3 Overview of RAPIER
- RAPIER: Robust Automated Production of Information Extraction Rules
- Learns IE rules automatically
- Uses a corpus of documents paired with filled templates
- Resulting rules do not require prior parsing or subsequent processing
- Uses limited syntactic information from a POS tagger
- Induced patterns incorporate semantic classes
- Rules characterize slot-fillers and their context
4 RAPIER Rules
- Each rule consists of three parts:
  - Pre-filler pattern: matches the text immediately preceding the extracted information
  - Filler pattern: matches the exact text to be extracted
  - Post-filler pattern: matches the text immediately following the extracted information
- Each pattern is a sequence of pattern items or pattern lists
  - A pattern item specifies constraints for exactly one word or symbol
  - A pattern list specifies constraints for 0..n words or symbols
- Constraints include:
  - A list of words, one of which must match the item
  - A POS tag
  - A semantic class
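The rule structure described on this slide can be sketched as a small data model. This is only an illustration, not RAPIER's actual implementation: the class names `PatternElement` and `Rule`, the field names, and the convention that an empty set means "unconstrained" are all assumptions made here. The example rule is the first one shown on the next slide.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class PatternElement:
    """Constraints on one token (pattern item) or on 0..max_len tokens (pattern list).

    Hypothetical representation: an empty set or None means "no constraint".
    """
    words: FrozenSet[str] = frozenset()   # token must be one of these words
    tags: FrozenSet[str] = frozenset()    # token's POS tag must be one of these
    sem: Optional[str] = None             # token must belong to this semantic class
    max_len: Optional[int] = None         # None = pattern item, n = pattern list of 0..n tokens

    @property
    def is_list(self) -> bool:
        return self.max_len is not None


@dataclass
class Rule:
    """A rule is three patterns: pre-filler, filler, post-filler."""
    pre_filler: List[PatternElement] = field(default_factory=list)
    filler: List[PatternElement] = field(default_factory=list)
    post_filler: List[PatternElement] = field(default_factory=list)


# "leading" + up to two nouns + "firm" or "company"
example_rule = Rule(
    pre_filler=[PatternElement(words=frozenset({"leading"}))],
    filler=[PatternElement(tags=frozenset({"nn", "nns"}), max_len=2)],
    post_filler=[PatternElement(words=frozenset({"firm", "company"}))],
)
```

Keeping items and lists in one class, distinguished only by `max_len`, is a design shortcut for the sketch; separate classes would work equally well.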
5 RAPIER Rules (cont.)
Example rule 1:
  Pre-filler:  1) word: leading
  Filler:      1) list: len 2, tags: [nn, nns]
  Post-filler: 1) word: [firm, company]
  Matches: "Leading telecommunications firm in need …"
Example rule 2:
  Pre-filler:  1) tag: [nn, nnp]  2) list: length 2
  Filler:      1) word: undisclosed, tag: [jj]
  Post-filler: 1) sem: price
  Matches: "… sold to the bank for an undisclosed amount …" and "… paid Honeywell an undisclosed price …"
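To make the matching behaviour of the second example rule concrete, here is a minimal matcher sketch. It is not RAPIER's matcher: the dict-based rule encoding, the function names, the tiny `SEM_CLASSES` lexicon (a stand-in for a real semantic class lookup), the POS tags assigned to the example words, and the requirement that a filler be non-empty are all assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (lowercased word, POS tag)

# Hypothetical mini-lexicon standing in for a semantic class lookup.
SEM_CLASSES: Dict[str, str] = {"amount": "price", "price": "price"}


def token_ok(elem: dict, tok: Token) -> bool:
    """Check one token against the word/tag/sem constraints of a pattern element."""
    word, tag = tok
    if "word" in elem and word not in elem["word"]:
        return False
    if "tag" in elem and tag not in elem["tag"]:
        return False
    if "sem" in elem and SEM_CLASSES.get(word) != elem["sem"]:
        return False
    return True


def match_ends(pattern: List[dict], toks: List[Token], start: int) -> List[int]:
    """Return every index at which `pattern` can end when it starts at `start`."""
    if not pattern:
        return [start]
    elem, rest = pattern[0], pattern[1:]
    ends: List[int] = []
    if "max_len" in elem:
        # Pattern list: may consume 0..max_len tokens.
        pos = start
        ends.extend(match_ends(rest, toks, pos))
        for _ in range(elem["max_len"]):
            if pos >= len(toks) or not token_ok(elem, toks[pos]):
                break
            pos += 1
            ends.extend(match_ends(rest, toks, pos))
    elif start < len(toks) and token_ok(elem, toks[start]):
        # Pattern item: consumes exactly one token.
        ends.extend(match_ends(rest, toks, start + 1))
    return ends


def extract(rule: dict, toks: List[Token]) -> List[str]:
    """Return fillers wherever pre-filler, filler and post-filler match in sequence."""
    found: List[str] = []
    for i in range(len(toks) + 1):
        for f_start in match_ends(rule["pre"], toks, i):
            for f_end in match_ends(rule["filler"], toks, f_start):
                # Assumption: an empty filler is not a useful extraction.
                if f_end > f_start and match_ends(rule["post"], toks, f_end):
                    text = " ".join(w for w, _ in toks[f_start:f_end])
                    if text not in found:
                        found.append(text)
    return found


RULE_2 = {
    "pre": [{"tag": {"nn", "nnp"}}, {"max_len": 2}],
    "filler": [{"word": {"undisclosed"}, "tag": {"jj"}}],
    "post": [{"sem": "price"}],
}

sold = [("sold", "vbd"), ("to", "to"), ("the", "dt"), ("bank", "nn"),
        ("for", "in"), ("an", "dt"), ("undisclosed", "jj"), ("amount", "nn")]
paid = [("paid", "vbd"), ("honeywell", "nnp"), ("an", "dt"),
        ("undisclosed", "jj"), ("price", "nn")]

print(extract(RULE_2, sold))  # ['undisclosed']
print(extract(RULE_2, paid))  # ['undisclosed']
```

The unconstrained pattern list in the pre-filler is what lets one rule cover both sentences: it absorbs "for an" in the first and only "an" in the second.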
6 Learning Algorithm
Example sentences: "located in Atlanta, Georgia." and "offices in Kansas City, Missouri."
SRULES (most specific rules):
  Rule 1
    Pre-filler:  1) word: located, tag: vbn  2) word: in, tag: in
    Filler:      1) word: atlanta, tag: nnp
    Post-filler: 1) word: ",", tag: ","  2) word: georgia, tag: nnp  3) word: ".", tag: "."
  Rule 2
    Pre-filler:  1) word: offices, tag: nns  2) word: in, tag: in
    Filler:      1) word: kansas, tag: nnp  2) word: city, tag: nnp
    Post-filler: 1) word: ",", tag: ","  2) word: missouri, tag: nnp  3) word: ".", tag: "."
RLIST (generalizations):
  Filler generalization 1
    Filler:      1) list: len 2, word: [atlanta, kansas, city], tag: nnp
  Filler generalization 2
    Filler:      1) list: len 2, tag: nnp
  Specialized rule
    Pre-filler:  1) word: in, tag: in
    Filler:      1) list: len 2, tag: nnp
    Post-filler: 1) word: ",", tag: ","  2) tag: nnp, semantic: state
Algorithm:
For each slot S in the template being learned
  SlotRules = most specific rules for S from the example documents
  while compression has failed fewer than lim times
    randomly select r pairs of rules from SlotRules
    find the set L of generalizations of the fillers of the rule pairs
    create rules from L, evaluate, and initialize RuleList
    let n = 0
    while best rule in RuleList produces spurious fillers and weighted information value of best rule is improving
      increment n
      specialize each rule in RuleList with generalizations of the last n items of the pre-filler patterns of the rule pair and add specializations to RuleList
      specialize each rule in RuleList with generalizations of the first n items of the post-filler patterns of the rule pair and add specializations to RuleList
    if best rule in RuleList produces only valid fillers
      add it to SlotRules
      remove empirically subsumed rules
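The "find the set L of generalizations of the fillers" step can be illustrated on the Atlanta / Kansas City fillers above. The sketch below is a deliberate simplification and not RAPIER's generalization procedure: it always collapses the two fillers into a single pattern list, ignores semantic classes, and assumes every item in a most-specific rule carries both a word and a tag. `Elem`, `generalize_sets`, `merged`, and `generalize_fillers` are hypothetical names.

```python
from itertools import product
from typing import FrozenSet, List, NamedTuple


class Elem(NamedTuple):
    words: FrozenSet[str]   # empty set = unconstrained
    tags: FrozenSet[str]    # empty set = unconstrained
    max_len: int = 1        # number of tokens the element may cover (simplification)


def generalize_sets(a: FrozenSet[str], b: FrozenSet[str]) -> List[FrozenSet[str]]:
    """Candidate generalizations of one constraint: keep the disjunction, or drop it."""
    if a == b:
        return [a]
    if not a or not b:
        return [frozenset()]
    return [a | b, frozenset()]


def merged(pattern: List[Elem], attr: str) -> FrozenSet[str]:
    """Union of one constraint type over all elements of a pattern."""
    return frozenset().union(*(getattr(e, attr) for e in pattern))


def generalize_fillers(p1: List[Elem], p2: List[Elem]) -> List[Elem]:
    """Collapse two filler patterns into pattern lists as long as the longer filler."""
    max_len = max(len(p1), len(p2))
    word_options = generalize_sets(merged(p1, "words"), merged(p2, "words"))
    tag_options = generalize_sets(merged(p1, "tags"), merged(p2, "tags"))
    return [Elem(w, t, max_len) for w, t in product(word_options, tag_options)]


nnp = frozenset({"nnp"})
atlanta = [Elem(frozenset({"atlanta"}), nnp)]
kansas_city = [Elem(frozenset({"kansas"}), nnp), Elem(frozenset({"city"}), nnp)]

for g in generalize_fillers(atlanta, kansas_city):
    print(sorted(g.words), sorted(g.tags), g.max_len)
```

Under these assumptions the two candidates correspond to the two filler generalizations listed under RLIST: one keeping the word disjunction and one keeping only the tag constraint.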
7 Experimental Results
- The task: extract information from computer-related job postings
- 17 slots used, including employer, salary, etc.
- Results do not employ semantic categories
- 100-document dataset with filled templates, evaluated with 10-fold cross-validation
- Measured precision, recall, and F-measure
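For reference, the evaluation measures named on this slide can be computed as in the sketch below. It assumes an equally weighted F-measure and exact-match scoring of (document, slot, text) triples; the slide does not state either choice. The function names, the fold-splitting scheme, and the toy data are illustrative and are not the reported results.

```python
from typing import Dict, Iterator, List, Set, Tuple

Filler = Tuple[str, str, str]  # (document id, slot name, extracted text)


def prf(predicted: Set[Filler], gold: Set[Filler]) -> Dict[str, float]:
    """Precision, recall and equally weighted F-measure over extracted fillers."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f": f_measure}


def k_fold(doc_ids: List[str], k: int = 10) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) splits; every document is tested exactly once."""
    for i in range(k):
        test = doc_ids[i::k]
        held_out = set(test)
        train = [d for d in doc_ids if d not in held_out]
        yield train, test


# Toy data: 4 gold fillers, 3 predictions, 2 of them correct.
gold = {("d1", "employer", "acme"), ("d1", "salary", "50k"),
        ("d2", "employer", "initech"), ("d2", "salary", "60k")}
predicted = {("d1", "employer", "acme"), ("d2", "salary", "60k"),
             ("d2", "employer", "initrode")}

scores = prf(predicted, gold)
print(scores)

# 100 documents and 10 folds give 90 training and 10 test documents per fold.
splits = list(k_fold([f"doc{i}" for i in range(100)]))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```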
8 Experimental Results (cont.)
- Performance:
  - Is comparable to Crystal on a medical domain
  - Is better than AutoSlog and AutoSlog-TS on the MUC-4 terrorism task
  - Is hard to compare directly because the systems were tested on different domains
  - Is good because precision is most important
9 Related Work
- Resolve
  - Uses decision trees
  - Uses annotated coreference examples
- Crystal
  - Uses a clustering algorithm to build a dictionary of extraction patterns
  - Requires patterns identified by an expert
  - Requires prior syntax analysis to identify syntactic elements and their relationships
- AutoSlog
  - Specializes a set of general syntactic patterns
  - An expert must examine the patterns it produces
  - Requires prior syntax analysis
- Liep
  - Requires prior syntax analysis
  - Makes no real use of semantic information
  - Has not been applied to complex domains
10 Related Work: BYU DEG
- RAPIER rules correspond closely to DEG data frames
  - Data frames are finer-grained: they are based on character patterns, whereas rules are based on word patterns
  - Pre-filler and post-filler patterns correspond closely to data frame contexts and key words
  - Semantic categories correspond closely to lexicons
- The paper does not mention how RAPIER handles multiple-record documents
- RAPIER's data structure is given by the template (slots) defined in the input data
- RAPIER is very similar in purpose to what Joe is trying to do: learn extraction rules based on a filled-in form
11 Conclusions
- Extracting desired pieces of information from NL text is important
- Manually constructing IE systems is too hard
- RAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templates
- Learned patterns employ syntactic and semantic information to match slot fillers and context
- Fairly accurate results can be obtained for a real-world problem with relatively small datasets
- RAPIER compares favorably with other IE learning systems