Published by Samantha Horton. Modified over 9 years ago.
1
India Research Lab © Copyright IBM Corporation 2006
Entity Annotation using Operations on the Inverted Index
Ganesh Ramakrishnan, with Sreeram Balakrishnan and Sachindra Joshi
IBM India Research Lab
2
Problem: Entity Annotation
Extract all instances of entities of type E from an unstructured source S.
- Company names, Designation, Person names, Date, Time

Example source text:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted entities:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
3
Document-at-a-time Based Approach
[Diagram: a single non-annotated document from the document collection goes through Feature Collection (Tokenizer, POS Lookup, Gazetteer Lookup, etc.) and an Instance Extractor (ML / hand-built rules), yielding an annotated document in the annotated document collection.]
A few rule-based annotators exist, e.g. GATE. We have built a rule-based annotator at IRL.
4
Example: Rules for identifying ORGANIZATIONs
How to identify?
- B.P. Marsh Plc
- The U.S.B. Holding Co.
- U.S.B. Holding Group
5
Example rule for identifying ORGANIZATION instances
[Figure: the rule, with callouts for regular expression macros, dictionary attribute, OR, and part-of-speech tag, applied to "The U.S.B. Holding Co."]
6
Problems with Document-at-a-time Based Approach on large corpora
Repeated computations for multiple occurrences of the same token:
- Dictionary lookups
- Regular expression matches
Large overheads while:
- Re-annotating a corpus after changing dictionary entries
  The user realizes that "Group" is too generic a word to be included as an ORGANIZATION:CLUE and wants to remove its entry.
7
Problems with Document-at-a-time Based Approach on large corpora (continued)
Large overheads while:
- Re-annotating a corpus with a slight modification in rules
  The user realizes that the optional "The" at the beginning introduces too many wrong annotations and modifies the rule.
8
The rule with the optional "The" at the beginning removed
9
Problems with Document-at-a-time Based Approach on large corpora (continued)
Large overheads while:
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies "C.B. Fairlie Holding & Finance Limited".
10
A new rule to capture an interspersed conjunction
11
Problems with Document-at-a-time Based Approach on large corpora (continued)
Large overheads while:
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies acquiring organizations: "AT&T Wireless, Inc." (that purchased Alaska Communications System in 1995).
12
A new rule to identify acquiring organizations
[Figure: the rule, with a callout marking the post-context specifier]
13
An alternative approach: Operating on the Inverted Index
Inverted Index
- A compact representation of the collection
- Captures redundancy/repetition information
14
Structure of Index
Example: "The company said that it will acquire the other company"
Index terms: the, company, said, that, it, will, acquire, other
Posting List: each entry is a triple (sid, first, last)
- sid: a sentence identifier
- first: beginning position of an occurrence
- last: end position of the same occurrence
Indexed features: Basic Entities, Orthographic properties, Dictionary Features
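The posting-list structure above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name build_index, whitespace tokenization, lowercasing, and 0-based token positions are all assumptions made here.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased token to a posting list of (sid, first, last) triples.

    Assumptions (not from the slides): whitespace tokenization, 0-based
    positions, and first == last for single-token entries.
    """
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            index[token].append((sid, pos, pos))
    return dict(index)

index = build_index(["The company said that it will acquire the other company"])
print(index["the"])      # [(0, 0, 0), (0, 7, 7)]
print(index["company"])  # [(0, 1, 1), (0, 9, 9)]
print(len(index))        # 8 distinct index terms
```

Each repeated token ("the", "company") gets a single index entry whose posting list records every occurrence, which is the redundancy the index-based approach exploits.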
15
An alternative approach: Operating on the Inverted Index (continued)
Many applications build an inverted index on the annotated corpus anyway.
- We directly update the inverted index with annotation entries
16
Our approach: Index Based Entity Annotation
17
Complexity Analysis for Document-based Approach
Problem: Find all annotations of length at most [bound not preserved]
Solution: Given a regular expression R, convert it into a DFA D_R
Complexity: [formula not preserved]
18
Operations on Index
- merge(L, L'): returns a posting list in which each entry occurs in posting list L, in L', or in both
- consint(L, L'): returns a posting list in which each entry points to a token sequence consisting of two consecutive subsequences @sa and @sb, such that L has a pointer to @sa and L' has a pointer to @sb
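A minimal sketch of the two operations, assuming posting-list entries are (sid, first, last) triples with 0-based token positions and that "consecutive" means the second subsequence starts at the token right after the first one ends. The set-based implementation is a simplification chosen for clarity; a production index would more likely walk sorted lists in a single pass.

```python
def merge(l1, l2):
    """Union of two posting lists, returned in sorted order."""
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    """Entries spanning an l1 occurrence immediately followed by an l2 occurrence."""
    # Group l2 entries by (sentence id, start position) for constant-time lookup.
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    out = set()
    for sid, first, last in l1:
        for end in starts.get((sid, last + 1), []):
            out.add((sid, first, end))
    return sorted(out)

# Posting lists for "The company said that it will acquire the other company"
the = [(0, 0, 0), (0, 7, 7)]
company = [(0, 1, 1), (0, 9, 9)]
other = [(0, 8, 8)]

print(merge(company, other))                    # [(0, 1, 1), (0, 8, 8), (0, 9, 9)]
print(consint(the, company))                    # [(0, 0, 1)]
print(consint(consint(the, other), company))    # [(0, 7, 9)]
```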
19
Implementing a DFA using Index
- With each pair of state s and length k, associate a posting list of token sequences of length k which end in state s
- Iteratively compute this list from the lists of the predecessor states
[Diagram: example DFA with states S1, S2, S3, S4 and transitions labeled a, b, c, c]
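The iteration described above might look like the following sketch. The function name run_dfa_on_index, the dictionary encoding of transitions, and the example automaton (which accepts "the", then any number of "other", then "company") are hypothetical and are not the DFA drawn on the slide. The helper operations assume (sid, first, last) entries as before.

```python
def merge(l1, l2):
    """Union of two posting lists."""
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    """Entries where an l1 occurrence is immediately followed by an l2 occurrence."""
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})

def run_dfa_on_index(index, transitions, start, accepting, max_len):
    """Return posting entries for token sequences of length <= max_len accepted by the DFA.

    transitions maps (state, token) -> next state. by_state[s] holds the posting
    list of sequences of the current length k that drive the DFA from start to s.
    """
    by_state = {}
    for (src, token), dst in transitions.items():
        if src == start:  # sequences of length 1
            by_state[dst] = merge(by_state.get(dst, []), index.get(token, []))
    matches = []
    for _ in range(max_len):
        for s in accepting:
            matches = merge(matches, by_state.get(s, []))
        # Extend every length-k list by one token to obtain the length-(k+1) lists.
        nxt = {}
        for (src, token), dst in transitions.items():
            if by_state.get(src):
                extended = consint(by_state[src], index.get(token, []))
                nxt[dst] = merge(nxt.get(dst, []), extended)
        by_state = nxt
    return matches

index = {"the": [(0, 0, 0), (0, 7, 7)],
         "company": [(0, 1, 1), (0, 9, 9)],
         "other": [(0, 8, 8)]}
transitions = {("S1", "the"): "S2", ("S2", "other"): "S2", ("S2", "company"): "S3"}
matches = run_dfa_on_index(index, transitions, "S1", {"S3"}, max_len=3)
print(matches)  # [(0, 0, 1), (0, 7, 9)]
```

Note that the work per step depends on posting-list sizes and the number of transitions, not on how many times the document text would have to be rescanned.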
20
Complexity Analysis for Index-based Approach
Observation: [not preserved]
21
Example: Simple Dictionary Match
- Let tokens in T be drawn from {a, b, …, z}
- Let D be a dictionary {a, e, i, o, u}
- A simple 2-state DFA that matches D is: S1 → S2 on any of a, e, i, o, u
Ratio of document-based match to index-based match: [formula not preserved]
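To make the contrast concrete, here is a small sketch that counts lookups for both strategies on a toy token stream. The lookup count is only a rough proxy chosen for illustration; it is not the ratio derived on the slide, and it ignores the cost of building the index. The token stream and function names are hypothetical.

```python
from collections import defaultdict

DICTIONARY = {"a", "e", "i", "o", "u"}

def document_based(tokens):
    """Scan every token and test it against the dictionary."""
    hits, lookups = [], 0
    for pos, tok in enumerate(tokens):
        lookups += 1
        if tok in DICTIONARY:
            hits.append((0, pos, pos))
    return hits, lookups

def index_based(index):
    """Fetch one posting list per dictionary entry and merge them."""
    hits, lookups = [], 0
    for entry in sorted(DICTIONARY):
        lookups += 1
        hits.extend(index.get(entry, []))
    return sorted(hits), lookups

tokens = list("abracadabra")  # single-letter tokens, all in sentence 0
index = defaultdict(list)
for pos, tok in enumerate(tokens):
    index[tok].append((0, pos, pos))

doc_hits, doc_lookups = document_based(tokens)
idx_hits, idx_lookups = index_based(index)
print(doc_hits == idx_hits)      # True: both find the same occurrences
print(doc_lookups, idx_lookups)  # 11 5
```

The document-based count grows with the length of the text, while the index-based count is bounded by the dictionary size.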
22
Index-based Annotation using Regular Expressions
- NFA to DFA conversion may cause an explosion of states
- Scan the regular expression from left to right and build an AND/OR graph recursively
- Compute the posting list using the AND/OR graph by propagating lists from the leaves to the root node
[Figure: example AND/OR graph]
23
Handling ? and Kleene Operators
Each node contains two binary properties:
- isOpt: 1 if the regular expression is of the form R? (selfRecursion = ? or *)
- selfLoop: 1 if the regular expression matched is of the form R+ (one or more times) (selfRecursion = * or +)
- For R* both properties are set
24
New Operations
- consint(L, L'): the generated list has isOpt set iff both arguments have isOpt set
- merge(L, L'): the generated list has isOpt set if any of the arguments has isOpt set
- consint(L, +): returns a posting list such that each entry points to at most [bound not preserved] subsequences in L
25
Computing Regular Expression using AND/OR Graph
- Compute the posting list of each node from the bottom up.
- For each AND node, use the consint operation on the posting lists of its child nodes.
- For each OR node, use the merge operation.
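The bottom-up evaluation can be sketched as a short recursive function. The tuple encoding of nodes ("leaf", "and", "or") and the name evaluate are assumptions made for this illustration, and the sketch deliberately leaves out the isOpt and selfLoop handling for ?, + and *.

```python
from functools import reduce

def merge(l1, l2):
    """Union of two posting lists."""
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    """Entries where an l1 occurrence is immediately followed by an l2 occurrence."""
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})

def evaluate(node, index):
    """Compute the posting list of an AND/OR graph node from its children."""
    kind = node[0]
    if kind == "leaf":                      # ("leaf", token)
        return index.get(node[1], [])
    children = [evaluate(child, index) for child in node[1:]]
    if kind == "and":                       # children must occur consecutively
        return reduce(consint, children)
    if kind == "or":                        # any child may match
        return reduce(merge, children)
    raise ValueError(f"unknown node kind: {kind}")

index = {"the": [(0, 0, 0), (0, 7, 7)],
         "company": [(0, 1, 1), (0, 9, 9)],
         "other": [(0, 8, 8)]}

# Graph for the expression: the (company | other)
graph = ("and", ("leaf", "the"), ("or", ("leaf", "company"), ("leaf", "other")))
result = evaluate(graph, index)
print(result)  # [(0, 0, 1), (0, 7, 8)]
```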
26
Experimental Results
Data sets
- Enron email: 2.3 GB
- Reuters+20NG: 93 MB
8 rules for 4 annotations
- Person name, company name, location and date

Data set   GATE      Index based   Speedup Factor
Enron      4974343   374926        13.26
Reuter+    752287    92238         8.15

A greater speedup is achieved on the larger corpus.
Incremental annotations achieve even larger performance gains:

Data set   GATE      Index based   Speedup Factor
Enron      1479954   62227         23.78
Reuter+    661157    17929         36.87
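As a quick consistency check, the speedup factor in each row appears to be the GATE figure divided by the index-based figure, truncated (not rounded) to two decimals. The snippet below reproduces the reported values under that assumption; the units of the raw figures are not stated on the slide.

```python
def speedup(gate, index_based):
    """Ratio gate / index_based, truncated to two decimal places."""
    return (gate * 100 // index_based) / 100

full_run = {"Enron": (4974343, 374926), "Reuter+": (752287, 92238)}
incremental = {"Enron": (1479954, 62227), "Reuter+": (661157, 17929)}

for name, table in (("full", full_run), ("incremental", incremental)):
    for data_set, (gate, index_based) in table.items():
        print(name, data_set, speedup(gate, index_based))
# full Enron 13.26
# full Reuter+ 8.15
# incremental Enron 23.78
# incremental Reuter+ 36.87
```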
27
THANK YOU