
1 Automatic Rule Refinement for Information Extraction
Bin Liu (University of Michigan), Laura Chiticariu (IBM Research - Almaden), Vivian Chu (IBM Research - Almaden), Frederic R. Reiss (IBM Research - Almaden), H. V. Jagadish (University of Michigan)
VLDB 2010
Presenter: Ajay Gupta. Date: 20th Oct 2011

2 Outline
- Introduction
- Rules Representation
- Method Overview
- Experimental Setup
- Results
- Conclusion & Future Work

3 Information Extraction (IE)
- Distill structured data from unstructured text
- Exploit the extracted data in your applications
Example text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...
Extracted annotations:
Name             | Title   | Organization
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Software Foundation
(Example from Frederick Reiss et al., SIGMOD 2010 tutorial)

4 Rule-Based Information Extraction
Most IE systems use rules to define important patterns in the text.
Example: person name extractor. If a match of a dictionary of common first names occurs in the text, followed immediately by a capitalized word, mark the two words as a "candidate person name".
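
As an illustration, the dictionary-plus-capitalized-word rule above can be sketched in plain Python (the dictionary contents and the function name are invented for the example):

```python
import re

# Illustrative first-name dictionary; a real extractor loads this from a file.
FIRST_NAMES = {"anna", "james", "john", "peter"}

def candidate_person_names(text):
    """Mark each dictionary first name followed immediately by a
    capitalized word as a candidate person name."""
    candidates = []
    for m in re.finditer(r"\b[A-Za-z]+\b", text):
        if m.group(0).lower() in FIRST_NAMES:
            # Check for a capitalized word right after the first name.
            follow = re.match(r"\s+[A-Z][a-z]+\b", text[m.end():])
            if follow:
                candidates.append(text[m.start():m.end() + follow.end()])
    return candidates
```

Matching dictionary entries case-insensitively and then inspecting the right context keeps candidate pairs from hiding each other, which a single non-overlapping regex pass could do.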

5 Example Extraction Rules – When Things Go Wrong
Text: Anna at James St. office (555-1234), or James, her assistant - 555-7789 have the details.
Extracted: Person Anna; Person James (twice, one of them from "James St."); Phone 555-1234; Phone 555-7789

6 Rule Development in Information Extraction
Develop → Test → Analyze, repeated in a cycle.
This iterative refinement process is labor intensive, time-consuming, and error prone.

7 Rule Refinement Is Hard
- The number of rules can be large.
- Rule interactions can be complex.
- Analyzing side effects is subtle: removing false positives improves precision, but removing correct results decreases recall.
- Identifying the right change can take hours; e.g., the Person extractor has 14 complex rules.

8 Rules Representation

9 SQL is used to represent rules.
SQL subset: Select, Project, Join, Union All, Except All
SQL extensions:
- Data type: span
- Table: Document(text span)
- Predicate functions: Follows, FollowsTok, Contains
- Scalar functions: Merge, Between, LeftContext
- Table functions: Regex, Dictionary
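
The span type and the predicate/scalar functions can be pictured with a small Python stand-in (Span and the function bodies here are illustrative sketches, not SystemT's actual implementation):

```python
from collections import namedtuple

# A span is a [begin, end) character range into the document text.
Span = namedtuple("Span", ["begin", "end"])

def follows(s1, s2, min_dist, max_dist):
    """Predicate: s2 begins between min_dist and max_dist characters
    after s1 ends (the character-distance version of Follows)."""
    gap = s2.begin - s1.end
    return min_dist <= gap <= max_dist

def contains(s1, s2):
    """Predicate: s1 fully contains s2."""
    return s1.begin <= s2.begin and s2.end <= s1.end

def merge(s1, s2):
    """Scalar function: the smallest span covering both inputs."""
    return Span(min(s1.begin, s2.begin), max(s1.end, s2.end))
```

These are exactly the primitives the example rules below combine: Dictionary and Regex produce spans, Follows joins them, Merge builds the result span, and Contains drives the subtraction.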

10 Rules Examples
Dictionary file first_names.dict: anna, james, john, peter, ...
R1: create view Phone as Regex('\d{3}-\d{4}', Document, text);
R2: create view FirstNameCand as Dictionary('first_names.dict', Document, text);
R3: create view FirstName as select * from FirstNameCand F where Not(ContainsDict('street_suffix.dict', RightContextTok(F.match, 1)));
Document t0: Anna at James St. Office (555-1234), or James, her assistant - 555-7789 have the details.
Phone: t1 555-1234, t2 555-7789
FirstNameCand: t3 Anna, t4 James, t5 James
FirstName: t6 Anna, t7 James

11 Rules Examples (cont.)
R4: create view PersonPhoneAll as select Merge(F.match, P.match) as match from FirstName F, Phone P where Follows(F.match, P.match, 0, 60);
R5: create table PersonPhone(match span);
insert into PersonPhone
  (select * from PersonPhoneAll A)
  except all
  (select A1.* from PersonPhoneAll A1, PersonPhoneAll A2
   where Contains(A1.match, A2.match) and Not(Equals(A1.match, A2.match)));
PersonPhoneAll: t8 Anna at James St. Office (555-1234); t9 James, her assistant - 555-7789; t10 Anna at James St. Office (555-1234), or James, her assistant - 555-7789
PersonPhone: t11 Anna at James St. Office (555-1234); t12 James, her assistant - 555-7789
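
The semantics of R4 and R5 can be emulated over plain (begin, end) offsets — a sketch, with span values chosen only to mirror the running example:

```python
def person_phone(first_names, phones):
    """R4: join first-name spans with phone spans when the phone starts
    0..60 characters after the name ends; the result span is the merge.
    R5: subtract matches that strictly contain another match."""
    all_matches = [(f[0], p[1]) for f in first_names for p in phones
                   if 0 <= p[0] - f[1] <= 60]

    def strictly_contains(a, b):
        return a != b and a[0] <= b[0] and b[1] <= a[1]

    return [a for a in all_matches
            if not any(strictly_contains(a, b) for b in all_matches)]
```

On spans mimicking the example (Anna at 0-4, the second James at 40-45, phones at 26-34 and 63-71), the long Anna-to-second-phone match (t10) is subtracted, leaving the two minimal matches.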

12 Canonical Representation of Rules

13 Method Overview

14 Method Overview. Data provenance: Boris Glavic, Gustavo Alonso, ICDE 2009.

15 Method Overview
Input: a set of correct and incorrect examples generated by an extractor.
Goal: generate refinements of the extractor that remove the incorrect examples while minimizing the effect on the rest of the results.
Basic idea: data provenance allows one to understand the origins of an output; cut any provenance link and the wrong output disappears.
(Simplified) provenance of the wrong output 'Anna ... 555-7789': Doc → Dictionary(FirstNames.dict) → 'Anna'; Doc → Regex /\d{3}-\d{4}/ → '555-7789'; both feed Join Follows(name, phone, 0, 60) → PersonPhoneAll.

16 Method Overview
Solution:
Stage 1: generate high-level changes of the form "remove tuple t from the output of operator Op in the canonical representation of the extractor". Problems: 1) feasibility, 2) side effects.
Stage 2: generate low-level changes: how to modify the operator to implement the high-level change, and how to rank the candidates.

17 High-Level Change
DEFINITION (HIGH-LEVEL CHANGE): Let t be a tuple in an output table V. A high-level change for t is a pair (t′, Op), where Op is an operator in the canonical operator graph of V and t′ is a tuple in the output of Op, such that eliminating t′ from the output of Op by modifying Op results in eliminating t from V.
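
For the join/select chain of the running example, enumerating high-level changes amounts to walking the provenance tree of a wrong output. A toy sketch (class and operator names invented; a real operator graph with unions needs more care, since a union output survives unless all of its duplicates are removed):

```python
class Prov:
    """One provenance node: the operator that produced a tuple,
    the tuple itself, and the provenance of its inputs."""
    def __init__(self, op, tup, inputs=()):
        self.op, self.tup, self.inputs = op, tup, list(inputs)

def high_level_changes(node):
    """Each (tuple, operator) pair on the provenance tree of a wrong
    output is a candidate HLC: cutting that link removes the output."""
    hlcs = [(node.tup, node.op)]
    for child in node.inputs:
        hlcs.extend(high_level_changes(child))
    return hlcs
```

Building the example's tree (Dictionary feeding Select for 'Anna', Regex for '555-7789', both feeding the Join) yields four HLCs, matching the HLC example on slide 20.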

18 Computing Provenance

19 Algorithm to Generate HLCs

20 HLC Example
Provenance of the wrong output 'Anna ... 555-7789': Doc → Dictionary(FirstNames.dict) → FirstNameCand ('Anna') → Select Not(ContainsDict('street_suffix.dict', ...)) → FirstName ('Anna'); Doc → Regex /\d{3}-\d{4}/ → Phone ('555-7789'); both feed Join Follows(name, phone, 0, 60) → PersonPhoneAll.
HLC1: remove 'Anna ... 555-7789' from the output of Join in R4
HLC2: remove '555-7789' from the output of Regex in R1
HLC3: remove 'Anna' from the output of Select in R3
HLC4: remove 'Anna' from the output of Dictionary in R2

21 Generating Low-Level Changes from HLCs
HLC1 (remove 'Anna ... 555-7789' from the output of Join in R4) → LLC: change the join predicate to Follows(name, phone, 0, 50)
HLC4 (remove 'Anna' from the output of Dictionary in R2) → LLC: remove 'anna' from FirstNames.dict

22 Generating Low-Level Changes from HLCs: Naive Approach
Input: set of HLCs. Output: list of LLCs, ranked by their effect on result quality.
Algorithm:
1) For each operator Op, consider all HLCs that target it.
2) For each HLC, enumerate all possible LLCs.
3) For each LLC: compute the set of local tuples it removes from the output of Op, then propagate these removals up through the provenance graph to compute the effect on the end-to-end result.
4) Rank the LLCs.
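
Steps 3-4 can be sketched as follows. The propagate function and the scoring are illustrative simplifications (the system's actual ranking may weigh effects differently):

```python
def rank_llcs(llcs, propagate, correct_outputs):
    """Naive ranking sketch: for each LLC, propagate its local removals
    to the end-to-end result, then score it by wrong results removed
    minus correct results removed (a simple precision/recall trade-off)."""
    scored = []
    for name, local_removals in llcs:
        removed = propagate(local_removals)   # end-to-end tuples that vanish
        score = len(removed - correct_outputs) - len(removed & correct_outputs)
        scored.append((score, name))
    return [name for _, name in sorted(scored, reverse=True)]
```

An LLC that only kills wrong results outranks one that also sacrifices correct ones, which is exactly the side-effect analysis the earlier slides called hard to do by hand.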

23 Problems with the Naive Approach
Problem 1: the number of possible LLCs for an HLC can be very large. Example: removing an output tuple of a Dictionary operator with 1000 entries admits 2^999 - 1 possible LLCs!
Solution: limit the LLCs considered to a tractable number, while still considering all feasible combinations of HLCs for a given operator:
1) Generate a single LLC for each of the k most promising combinations of HLCs for the operator.
2) k is the number of LLCs presented to the user.

24 Problems with the Naive Approach
Problem 2: traversing the provenance graph is expensive: O(n^2), where n is the size of the operator tree.
Solution: remember the mapping from each high-level change back to the affected output tuple.

25 Specific Classes of Low-Level Changes
1) Modify numerical join parameters. E.g., "Modify the max character distance of the Follows() predicate in the join operator of rule R4 from 60 to 20."
2) Remove dictionary entries. E.g., "Modify the Dictionary operator of rule R2 by removing entry 'anna' from first_names.dict."
3) Add a filtering dictionary. E.g., "Add predicate Not(ContainsDict('street_suffix.dict', RightContextTok(match, 1))) to the Dictionary operator of rule R3."
4) Add a filtering view (applies to an entire view). E.g., "Subtract from the result of rule R4 the PersonPhoneAll spans that are strictly contained within another PersonPhoneAll span."
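
For class 1, a natural way to pick the new parameter value is to tighten the maximum distance to the largest gap seen among correct matches — a hypothetical sketch (the system's actual choice of cut-off may differ):

```python
def tighten_follows_max(correct_gaps, wrong_gaps):
    """Shrink the max character distance of a Follows() join predicate
    to the largest gap observed among correct matches, and report how
    many wrong matches fall outside the new window."""
    new_max = max(correct_gaps)
    dropped_wrong = sum(1 for g in wrong_gaps if g > new_max)
    return new_max, dropped_wrong
```

For example, with correct name-phone gaps of 12 and 18 characters and wrong gaps of 22 and 59, the window shrinks from 60 to 18 and both wrong matches drop out, at no cost in recall on the labeled data.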

26 LLC Generation: Removing Dictionary Entries
Entries in FirstNameDict and the final FirstName results they produce:
'anna' → Anna XYZ, Anna ABC
'james' → James X, James Y, James Anderson
Effect of removing a dictionary entry: every result it produced disappears.
Generated LLCs — remove from dictionary FirstNameDict the following entries:
1) 'anna'
2) 'anna', 'james'
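
One way to realize the entry-combination idea: score each entry by the wrong-minus-correct results its matches produce, then emit one LLC per prefix of the ranked entries. A sketch with invented correctness labels, mirroring the slide's two generated LLCs:

```python
def dictionary_llcs(entry_to_outputs, wrong_outputs, k):
    """Rank dictionary entries by net benefit of removal (wrong results
    killed minus correct results killed), then generate one LLC for each
    of the k best prefixes of that ranking."""
    def net_benefit(entry):
        outs = entry_to_outputs[entry]
        return sum(1 if o in wrong_outputs else -1 for o in outs)
    ranked = sorted(entry_to_outputs, key=net_benefit, reverse=True)
    return [ranked[:i + 1] for i in range(min(k, len(ranked)))]
```

If all of 'anna''s results are wrong and 'james''s are mostly correct, the first LLC removes only 'anna' and the second removes both, so the user sees the safest refinement first.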

27 Experiments
- The rule refinement approach is implemented in the SystemT information extraction system and uses SystemT's AQL rule language.
Goals: quality evaluation of the generated refinements; performance evaluation.
Setup: Ubuntu Linux 9.10, 2.26 GHz Intel Xeon CPU, 8 GB RAM; 10-fold cross-validation.

28 Extraction Tasks and Rule Sets
Person task:
- 14 complex rules for identifying person names, e.g., "CapitalizedWord followed by FirstName", "LastName followed by Comma followed by CapitalizedWord".
- Rules for identifying other named entities (e.g., Organization, EmailAddress, Address). These can be used for filtering to enable refinement, e.g., "Morgan Stanley", "Georgia".
PersonPhone task:
- 11 complex rules for identifying phone numbers.
- A high-quality Person extractor.
- One rule to identify PersonPhone candidates: "Person followed by Phone within 0 to 60 characters".

29 Evaluation Datasets

Dataset  | Training: #docs / #labels | Test: #docs / #labels
ACE      | 273 / 5201                | 69 / 1220
CoNLL    | 946 / 6560                | 216 / 1842
Enron    | 434 / 4500                | 218 / 1969
EnronPP  | 322 / 157                 | 161 / 46

ACE: collection of newswire reports, broadcast news, and conversations, with Person-labeled data from the ACE05 dataset.
CoNLL: collection of news articles with Person-labeled data from the CoNLL 2003 Shared Task.
Enron, EnronPP: collections of emails from the Enron corpus, annotated with Person and PersonPhone labels respectively.

30 Quality Evaluation

31 Quality Evaluation
- F1-measure improves by 6% to 26% within a few iterations.
- Recall remains stable.
- F1-measure and precision reach a plateau after the first few high-ranked refinements.
- Some low-level changes are not yet implemented.

32 Quality Evaluation: Comparison with Experts
- Two experts, Enron dataset, Person task.
- Time budget: one hour.

33 Performance Evaluation
- Expert: one hour overall, 3 to 15 minutes per refinement.
- System refinement time: 2 minutes.

34 Conclusion & Future Work
- A database provenance technique for refining information extraction rules.
Future work:
- Extensions to other types of LLCs, e.g., modifying regular expressions.
- Addressing false negatives.

35 Thank You

