Entity Annotation using operations on the Inverted Index
Ganesh Ramakrishnan, with Sreeram Balakrishnan and Sachindra Joshi
IBM India Research Lab
© Copyright IBM Corporation 2006

Slide 2: Problem: Entity Annotation
Extract all instances of entities of type E from an unstructured source S.
- Company names, Designation, Person names, Date, Time

Example source text:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted entities:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slide 3: Document-at-a-time Based Approach
[Figure: a single non-annotated document passes through feature collection (Tokenizer, POS Lookup, Gazetteer Lookup, etc.) and an instance extractor driven by ML / hand-built rules, yielding an annotated document. Applied to every document, this turns a non-annotated document collection into an annotated document collection.]
- A few rule-based annotators exist, e.g. GATE.
- We have built a rule-based annotator at IRL.

Slide 4: Example: Rules for identifying ORGANIZATIONs
How to identify?
- B.P. Marsh Plc
- The U.S.B. Holding Co.
- U.S.B. Holding Group

Slide 5: Example rule for identifying ORGANIZATION instances
[Figure: a rule matching "The U.S.B. Holding Co.", with callouts for: regular expression macros, dictionary attribute, OR, part of speech tag.]

Slide 6: Problems with Document-at-a-time Based Approach on large corpora
- Repeated computations for multiple occurrences of the same token:
  - Dictionary lookups
  - Regular expression matches
- Large overheads while:
  - Re-annotating a corpus after changing dictionary entries
    - The user realizes that "Group" is too generic a word to be included as an ORGANIZATION:CLUE and wants to remove its entry.

Slide 7: Problems with Document-at-a-time Based Approach on large corpora (continued)
- Large overheads while:
  - Re-annotating a corpus with slight modification in rules
    - The user realizes that the optional "The" at the beginning introduces too many wrong annotations and modifies the rule.

Slide 8: The rule with the optional "The" at the beginning removed

Slide 9: Problems with Document-at-a-time Based Approach on large corpora (continued)
- Large overheads while:
  - Making incremental annotation updates by adding new rules
    - The user wants a new rule that identifies "C.B. Fairlie Holding & Finance Limited".

Slide 10: A new rule to capture an interspersed conjunction

Slide 11: Problems with Document-at-a-time Based Approach on large corpora (continued)
- Large overheads while:
  - Making incremental annotation updates by adding new rules
    - The user wants a new rule that identifies acquiring organizations: "AT&T Wireless, Inc." (that purchased Alaska Communications System in 1995).

Slide 12: A new rule to identify acquiring organizations
[Figure: the rule, with a callout for the post-context specifier.]

Slide 13: An alternative approach: Operating on the Inverted Index
- Inverted Index
  - A compact representation of the collection
  - Captures redundancies/repetition information

Slide 14: Structure of Index
Example: "The company said that it will acquire the other company"
Index terms: the, company, said, that, it, will, acquire, other
Each term points to a posting list whose entries have the fields (sid, first, last):
- sid: a sentence identifier
- first: beginning position of an occurrence
- last: end position of the same occurrence
Kinds of index entries:
- Basic Entities
- Orthographic properties
- Dictionary Features
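The index structure on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `build_index`, whitespace tokenization, lowercasing, and 0-based token positions are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(sentences):
    """Map each token to a posting list of (sid, first, last) triples.

    Assumptions (not from the slides): tokens are split on whitespace,
    lowercased, and positions are 0-based token offsets in the sentence.
    """
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            # A single token begins and ends at the same position.
            index[token].append((sid, pos, pos))
    return index

index = build_index(["The company said that it will acquire the other company"])
print(index["company"])  # [(0, 1, 1), (0, 9, 9)]
print(index["the"])      # [(0, 0, 0), (0, 7, 7)]
```

Repeated tokens such as "the" and "company" are stored once as index terms, with one posting per occurrence, which is the redundancy the index-based approach exploits.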

Slide 15: An alternative approach: Operating on the Inverted Index (continued)
- Many applications build an inverted index on the annotated corpus anyway
  - We directly update the inverted index with annotation entries

Slide 16: Our approach: Index Based Entity Annotation

Slide 17: Complexity Analysis for Document based Approach
- Problem: Find all annotations of length at most [symbol missing].
- Solution: Given a regular expression R, convert it into a DFA D_R.
- Complexity: [formula missing]
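For orientation, a document-based matcher of the kind analysed here might look like the sketch below. It is an assumption-laden illustration, not the deck's algorithm: the function name, the dictionary encoding of the DFA, and the toy ORGANIZATION rule are all hypothetical. The point it shows is that the DFA is restarted at every token position and run for up to `max_len` tokens, so repeated tokens are re-examined every time they occur.

```python
def document_based_annotate(tokens, transitions, start, accepting, max_len):
    """Scan one document and return (first, last) spans accepted by the DFA.

    transitions: dict mapping (state, token) -> next state (hypothetical encoding).
    max_len: upper bound on annotation length, in tokens.
    """
    annotations = []
    for first in range(len(tokens)):
        state = start
        # Run the DFA from this start position for at most max_len tokens.
        for last in range(first, min(first + max_len, len(tokens))):
            state = transitions.get((state, tokens[last]))
            if state is None:
                break  # no transition: this start position is exhausted
            if state in accepting:
                annotations.append((first, last))
    return annotations

# Toy rule (hypothetical): "holding" followed by "co." or "group".
dfa = {("S1", "holding"): "S2", ("S2", "co."): "S3", ("S2", "group"): "S3"}
tokens = ["the", "u.s.b.", "holding", "co.", "and", "holding", "group"]
print(document_based_annotate(tokens, dfa, "S1", {"S3"}, max_len=3))
# [(2, 3), (5, 6)]
```

In this sketch the work is bounded by the number of tokens times `max_len` transitions per document, and it has to be repeated for the whole corpus whenever a rule or dictionary entry changes.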

Slide 18: Operations on Index
- merge(L, L'): returns a posting list where each entry occurs in posting list L, in L', or in both.
- consint(L, L'): returns a posting list where each entry points to a token sequence which consists of two consecutive subsequences s_a and s_b such that L has a pointer to s_a and L' has a pointer to s_b.
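The two operations can be sketched over posting lists of (sid, first, last) triples as follows. This is a minimal reading of the definitions above, assuming inclusive token positions, so that "consecutive" means the second subsequence starts at `last + 1` of the first within the same sentence; the actual data layout used by the authors is not given in the slides.

```python
def merge(l1, l2):
    """Entries that occur in l1, in l2, or in both (sorted, no duplicates)."""
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    """Entries spanning a sequence from l1 immediately followed by one from l2."""
    # Group l2 entries by (sentence, start position) for constant-time lookup.
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    out = set()
    for sid, first, last in l1:
        # An l2 entry is adjacent if it starts right after this entry ends.
        for end in starts.get((sid, last + 1), []):
            out.add((sid, first, end))
    return sorted(out)

# Posting lists for "The company said that it will acquire the other company".
the = [(0, 0, 0), (0, 7, 7)]
other = [(0, 8, 8)]
company = [(0, 1, 1), (0, 9, 9)]

print(consint(the, company))                  # [(0, 0, 1)]
print(consint(consint(the, other), company))  # [(0, 7, 9)]
print(merge(the, other))                      # [(0, 0, 0), (0, 7, 7), (0, 8, 8)]
```

Note that each call touches every occurrence of the two terms in the collection at once, rather than visiting documents one by one.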

Slide 19: Implementing a DFA using Index
- With each pair of state s and length k, associate a list [symbol missing]: the posting list of token sequences of length k which end in state s.
- Iteratively compute this list from the lists of its predecessor states.
[Figure: a DFA with states S1, S2, S3, S4 and transitions labelled a, b, c, c.]
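One way to read this iteration is sketched below. It is an interpretation under assumptions, not the authors' code: the list for state s and length k is taken to be the merge, over every transition p --t--> s, of consint(list for p and length k-1, posting list of t). The function name, the transition encoding, and the three-transition example DFA are hypothetical (the slide's figure does not survive well enough to recover its exact transitions).

```python
def merge(l1, l2):
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})

def run_dfa_on_index(index, transitions, start, accepting, max_len):
    """transitions: list of (src_state, token, dst_state) triples."""
    # lists[k][s]: posting list of length-k token sequences that end in state s.
    lists = {1: {}}
    for src, tok, dst in transitions:
        if src == start:
            lists[1][dst] = merge(lists[1].get(dst, []), index.get(tok, []))
    for k in range(2, max_len + 1):
        lists[k] = {}
        for src, tok, dst in transitions:
            prev = lists[k - 1].get(src, [])
            if prev:
                # Extend every length-(k-1) match ending in src by one token.
                extended = consint(prev, index.get(tok, []))
                lists[k][dst] = merge(lists[k].get(dst, []), extended)
    result = []
    for per_state in lists.values():
        for s in accepting:
            result = merge(result, per_state.get(s, []))
    return result

# Index for the single sentence "a b c a b" (sid 0), built by hand.
index = {"a": [(0, 0, 0), (0, 3, 3)], "b": [(0, 1, 1), (0, 4, 4)], "c": [(0, 2, 2)]}
dfa = [("S1", "a", "S2"), ("S2", "b", "S3"), ("S3", "c", "S4")]
print(run_dfa_on_index(index, dfa, "S1", {"S4"}, max_len=3))  # [(0, 0, 2)]
```

The number of consint and merge calls depends on the number of transitions and on `max_len`, not on the number of documents; the corpus size only enters through the lengths of the posting lists.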

Slide 21: Example: Simple Dictionary Match
- Let tokens in T be drawn from {a, b, …, z}.
- Let D be a dictionary {a, e, i, o, u}.
- A simple 2-state DFA that matches D is: S1 → S2 on any of a, e, i, o, u.
- Ratio of document based match to index based match: [formula missing]
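The contrast in this example can be made concrete with a small sketch. The slide's own ratio formula is not preserved, so the code below does not reproduce it; it simply counts dictionary lookups as a rough, hypothetical cost measure, ignoring the cost of building the index and of concatenating posting lists. All function names and the toy token stream are assumptions.

```python
from collections import defaultdict

def build_index(tokens):
    """Posting lists of (sid, first, last) for a single sentence with sid 0."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append((0, pos, pos))
    return index

def document_based_match(tokens, dictionary):
    """One dictionary lookup per token occurrence."""
    matches, lookups = [], 0
    for pos, tok in enumerate(tokens):
        lookups += 1
        if tok in dictionary:
            matches.append((0, pos, pos))
    return matches, lookups

def index_based_match(index, dictionary):
    """One index lookup per dictionary entry, then combine the posting lists."""
    matches, lookups = [], 0
    for entry in sorted(dictionary):
        lookups += 1
        matches.extend(index.get(entry, []))
    return sorted(matches), lookups

tokens = list("banana" * 100)   # 600 single-letter tokens
D = {"a", "e", "i", "o", "u"}

doc_matches, doc_lookups = document_based_match(tokens, D)
idx_matches, idx_lookups = index_based_match(build_index(tokens), D)
print(doc_matches == idx_matches, doc_lookups, idx_lookups)  # True 600 5
```

Both approaches return the same 300 matches, but the document-based lookup count grows with the corpus while the index-based count is fixed by the dictionary size.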

Slide 22: Index based Annotation using Regular Expressions
- NFA to DFA conversion may cause an explosion of states.
- Scan the regular expression from left to right and build an AND/OR graph recursively.
- Compute the posting list using the AND/OR graph by propagating lists from the leaves to the root node.

Slide 23: Handling ? and Kleene Operators
- Each node contains two binary properties:
  - isOpt: 1 if the regular expression is of the form R? (selfRecursion = ? or *)
  - selfLoop: 1 if the regular expression matched is of the form R+, i.e. one or more times (selfRecursion = * or +)
  - For R*, both properties are set.

Slide 24: New Operations
- consint(L, L'): the generated list has isOpt set iff both arguments have isOpt set.
- merge(L, L'): the generated list has isOpt set if any of the arguments has isOpt set.
- consint(L, +): returns a posting list such that each entry points to at most [symbol missing] subsequences in L.
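The isOpt rules above can be illustrated with a small sketch. Only the flag propagation comes from the slide; the treatment of the entries themselves (an optional argument may be skipped, so the other argument's entries are kept on their own) is an assumption, as are the names `PList`, `consint_opt` and `merge_opt`. The repetition operation consint(L, +) is not covered because its bound is missing from the transcript.

```python
from dataclasses import dataclass

@dataclass
class PList:
    entries: list          # sorted (sid, first, last) triples
    is_opt: bool = False   # True if the sub-expression may match nothing

def _concat(a, b):
    """Plain consint: an entry of a immediately followed by an entry of b."""
    starts = {}
    for sid, first, last in b:
        starts.setdefault((sid, first), []).append(last)
    return {(sid, first, end)
            for sid, first, last in a
            for end in starts.get((sid, last + 1), [])}

def consint_opt(l1, l2):
    out = _concat(l1.entries, l2.entries)
    if l1.is_opt:
        out |= set(l2.entries)   # assumption: left part skipped
    if l2.is_opt:
        out |= set(l1.entries)   # assumption: right part skipped
    # From the slide: isOpt is set iff both arguments have isOpt set.
    return PList(sorted(out), l1.is_opt and l2.is_opt)

def merge_opt(l1, l2):
    # From the slide: isOpt is set if any argument has isOpt set.
    return PList(sorted(set(l1.entries) | set(l2.entries)), l1.is_opt or l2.is_opt)

# The? Holding  -- "the" is optional, "holding" is not.
the = PList([(0, 0, 0)], is_opt=True)
holding = PList([(0, 1, 1), (1, 0, 0)])
result = consint_opt(the, holding)
print(result.entries, result.is_opt)
# [(0, 0, 1), (0, 1, 1), (1, 0, 0)] False
```

With this reading, "The Holding" in sentence 0 and the bare "Holding" occurrences are all returned, which mirrors the optional "The" in the ORGANIZATION rule discussed earlier.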

Slide 25: Computing Regular Expression using AND/OR Graph
- Compute the posting list of each node from the bottom up.
- For each AND node, use the consint operation on the posting lists of its children.
- For each OR node, use the merge operation.
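The bottom-up evaluation can be sketched as a short recursive function. The nested-tuple encoding of the graph, the node labels "TOKEN", "AND" and "OR", and the function name `evaluate` are hypothetical choices for this example; optional and repeated sub-expressions are left out to keep the sketch small.

```python
def merge(l1, l2):
    return sorted(set(l1) | set(l2))

def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})

def evaluate(node, index):
    """Compute the posting list of an AND/OR node from its children.

    node is ("TOKEN", term), ("AND", [children]) or ("OR", [children]).
    """
    kind, payload = node
    if kind == "TOKEN":
        return index.get(payload, [])          # leaf: look up the index
    if kind not in ("AND", "OR"):
        raise ValueError(f"unknown node kind: {kind}")
    child_lists = [evaluate(child, index) for child in payload]
    result = child_lists[0]
    for nxt in child_lists[1:]:
        # AND = children in sequence (consint); OR = alternatives (merge).
        result = consint(result, nxt) if kind == "AND" else merge(result, nxt)
    return result

# holding (co. | group), over two hand-built sentences (sid 0 and 1).
index = {"holding": [(0, 1, 1), (1, 1, 1)], "co.": [(0, 2, 2)], "group": [(1, 2, 2)]}
graph = ("AND", [("TOKEN", "holding"),
                 ("OR", [("TOKEN", "co."), ("TOKEN", "group")])])
print(evaluate(graph, index))  # [(0, 1, 2), (1, 1, 2)]
```

Because each node's list depends only on its children, changing one leaf (for example removing "group" from a dictionary) only requires recomputing the nodes on the path from that leaf to the root.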

Slide 26: Experimental Results
- Data sets
  - Enron email: 2.3 GB
  - Reuters+20NG: 93 MB
- 8 rules for 4 annotations
  - Person name, company name, location and date

Data set   GATE      Index based   Speedup Factor
Enron      4974343   374926        13.26
Reuter+    752287    92238         8.15

- A greater speedup is achieved on the larger corpus.
- Incremental annotations achieve even larger performance gains.

Data set   GATE      Index based   Speedup Factor
Enron      1479954   62227         23.78
Reuter+    661157    17929         36.87
