Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones, AAAI 99. Presented by: Sunandan Chakraborty.


1 Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
Ellen Riloff and Rosie Jones, AAAI 99
Presented by: Sunandan Chakraborty

2 Information Extraction
- Extracting domain-specific information from natural-language (NL) text
- Example domains
  - Locations
  - Companies
  - Terrorism

3 Required Lexical Resources
- Semantic lexicons
  - Dictionary of words tagged with semantic categories
  - e.g. names of locations (countries, cities)
- Extraction patterns
  - e.g. "outlets in", "from"
  - Pattern form: "from <noun phrase>"
  - Example match: "outlets in New York"
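To make the two resources concrete, here is a minimal sketch of a semantic lexicon and of applying a pattern such as "outlets in <noun phrase>" to text. It is a toy under stated assumptions: the helper `apply_pattern`, the sample sentence, and the lexicon entries are hypothetical, and a noun phrase is approximated as a run of capitalized words, whereas AutoSlog works from a syntactic parse.

```python
import re

# Assumption: a noun phrase is approximated as one or more capitalized words.
# This is a stand-in for real syntactic analysis, not how AutoSlog works.
NP = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

def apply_pattern(trigger, text):
    """Return the capitalized phrases that directly follow `trigger` in `text`."""
    return re.findall(re.escape(trigger) + " " + NP, text)

# Hypothetical semantic lexicon: phrase -> semantic category.
semantic_lexicon = {"New York": "location", "Nicaragua": "location"}

text = "The firm has outlets in New York and offices in Nicaragua."
extracted = apply_pattern("outlets in", text)

# Look up each extracted phrase in the lexicon (None if unknown).
categories = {phrase: semantic_lexicon.get(phrase) for phrase in extracted}
```

The same function can be reused with other triggers, for example `apply_pattern("offices in", text)`.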

4 Mutual Bootstrapping
- No annotated corpus required
- Learns extraction patterns and a semantic lexicon
- Input
  - Unannotated corpus
  - Seed words

5 Mutual Bootstrapping
- Start from the seed words
- Identify NPs related to the seed words (to find extraction patterns)
- Use the extraction patterns to identify new terms
  - New terms should be in the same lexical category
- Use the new terms to search for more patterns

6 Algorithm
- Input
  - Candidate extraction patterns from AutoSlog
  - Seed words
- Data structures
  - EPdata: stores the candidate extraction patterns. Initial value: the extraction patterns from AutoSlog and their extractions
  - SemLex: stores the semantic lexicon entries as they are identified. Initial value: the seed words
  - Cat_EPlist: stores the selected extraction patterns. Initial value: null

7 Algorithm (contd...)
1. For every extraction pattern $P_i$ in EPdata, compute $\mathrm{score}(P_i) = R_i \cdot \log_2(F_i)$, where
   - $F_i$ = number of lexicon entries (SemLex members) among the NPs extracted by $P_i$
   - $N_i$ = number of NPs extracted by $P_i$
   - $R_i = F_i / N_i$
2. Insert the $P_i$ with the maximum score into Cat_EPlist
3. Insert $P_i$'s extractions into SemLex
4. Repeat from step 1
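The scoring rule and the loop on slides 6 and 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `rlogf` and `mutual_bootstrap`, the toy patterns, and the choice to treat patterns with no lexicon overlap as unselectable are all assumptions made for the example.

```python
import math

def rlogf(extractions, lexicon):
    """score(P_i) = R_i * log2(F_i), with R_i = F_i / N_i."""
    n = len(extractions)            # N_i: NPs extracted by the pattern
    f = len(extractions & lexicon)  # F_i: extractions already in the lexicon
    if f == 0:
        # Assumption: log2(0) is undefined, so such a pattern is never selected.
        return float("-inf")
    return (f / n) * math.log2(f)

def mutual_bootstrap(ep_data, seed_words, iterations):
    """ep_data maps each candidate pattern to the set of NPs it extracts."""
    sem_lex = set(seed_words)  # SemLex starts as the seed words
    cat_ep_list = []           # Cat_EPlist starts empty
    for _ in range(iterations):
        remaining = {p: nps for p, nps in ep_data.items() if p not in cat_ep_list}
        if not remaining:
            break
        # Steps 1-2: score every remaining pattern and pick the best one.
        best = max(remaining, key=lambda p: rlogf(remaining[p], sem_lex))
        if rlogf(remaining[best], sem_lex) == float("-inf"):
            break
        cat_ep_list.append(best)
        # Step 3: all of the chosen pattern's extractions join the lexicon.
        sem_lex |= remaining[best]
    return cat_ep_list, sem_lex

# Hypothetical toy data, loosely modelled on the location examples.
ep_data = {
    "offices in <x>": {"nicaragua", "colombia", "san miguel"},
    "shot in <x>": {"san miguel", "head"},
    "gripped <x>": {"colombia"},
}
seeds = {"nicaragua", "colombia"}
patterns, lexicon = mutual_bootstrap(ep_data, seeds, iterations=2)
```

Because step 3 adds every extraction of the chosen pattern, a pattern like "shot in <x>" can pull a non-location such as "head" into the lexicon once it is selected.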

8 Results (Locations)
Pattern | Seed word | Extracted
Headquartered in x | Nicaragua | San Miguel, Chapare region
Gripped x | Colombia | none
To occupy x | Nicaragua, town | small country, this northern region, San Sebastian neighbourhood…
Shot in x | city, Soyapango | Jauja, central square, head, clash central mountain region…

9 Multi-level Bootstrapping
- Problem with mutual bootstrapping
  - Inserting an incorrect word into SemLex can drastically reduce accuracy
- Solution
  - A second level of bootstrapping

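10 Meta-bootstrapping
- Outer level of bootstrapping
- Retains the best 5 NPs
- The corresponding lexicon entries are added to a permanent list
- Reliability score: $\mathrm{rel}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + \mathrm{score}(p_k)\bigr)$, where $N_i$ here is the number of patterns $p_k$ that extracted $NP_i$
- The reliable lexicon entries are used for the next iteration of mutual bootstrapping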
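The reliability score and the "best 5" filter can be sketched as below. The formula follows the slide as written; the function names, the pattern scores, and the candidate NPs are hypothetical values chosen only to illustrate the ranking.

```python
def reliability(noun_phrase, pattern_scores, ep_data):
    """rel(NP_i): sum of (1 + score(p_k)) over the patterns that extracted NP_i."""
    return sum(
        1 + score
        for pattern, score in pattern_scores.items()
        if noun_phrase in ep_data[pattern]
    )

def best_nps(candidates, pattern_scores, ep_data, keep=5):
    """Rank candidate NPs by reliability and keep the top `keep` (5 on the slide)."""
    ranked = sorted(
        candidates,
        key=lambda np_: reliability(np_, pattern_scores, ep_data),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical scores for patterns selected by one run of mutual bootstrapping.
pattern_scores = {"offices in <x>": 0.5, "operates in <x>": 1.0}
ep_data = {
    "offices in <x>": {"san miguel", "head"},
    "operates in <x>": {"san miguel"},
}
top = best_nps(["head", "san miguel"], pattern_scores, ep_data, keep=1)
```

An NP extracted by several selected patterns accumulates a term from each, so it outranks an NP that only one pattern produced.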

11 Results
Web Location | Web Company | Terrorism Weapon
Offices in | Owned by | Exploded
Facilities in | Employed | Threw
Operates in | Trust company | Quantity of
Expanded into | Sold to | Hurled

12 Evaluation
- Corpus
  - 4160 corporate web pages
  - 1500 terrorism texts
- AutoSlog candidate extraction patterns
  - 19,690 for the web pages
  - 14,064 for the terrorism texts
- Seed words
  - Web company: Co., Company, Corp…
  - Web location: different country names
  - Terrorism location: Bolivia, city, Colombia, district

13 Evaluation (contd…)
- 50 iterations of meta-bootstrapping
- Mutual bootstrapping ran until it produced 10 unique patterns

14 Evaluation (contd…)
After the 50th iteration:
Category | Correct/Total | Accuracy
Web company | 95/206 | 46%
Web location | 191/250 | 76%
Web title | 107/231 | 46%
Terrorism location | 158/250 | 63%
Terrorism weapon | 124/244 | 51%
Other systems' accuracy (weapon):
- 17% (Riloff & Shepherd, 1997)
- 36% (Roark & Charniak, 1998)
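The percentages on this slide follow directly from the correct/total counts. A short check, using only the figures reported above (the variable names are illustrative):

```python
# Correct/total counts as reported on the slide after the 50th iteration.
results = {
    "Web company": (95, 206),
    "Web location": (191, 250),
    "Web title": (107, 231),
    "Terrorism location": (158, 250),
    "Terrorism weapon": (124, 244),
}

# Accuracy as a whole-number percentage of correct entries.
accuracy = {
    category: round(100 * correct / total)
    for category, (correct, total) in results.items()
}

for category, pct in accuracy.items():
    print(f"{category}: {pct}%")
```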

15 Evaluation (contd…)
Tested on 233 new web pages
Recall/Precision (%) | Baseline | Lexicon | Union
Web company | 10/32 | 18/47 | 18/45
Web location | 11/98 | 51/77 | 54/74
Web title | 6/100 | 46/66 | 47/62
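For readers unfamiliar with the two measures in the table, recall is the share of the true items that were extracted, and precision is the share of the extracted items that are correct. A minimal sketch, using hypothetical gold and extracted sets rather than the evaluation data:

```python
def recall_precision(extracted, gold):
    """Return (recall, precision) for a set of extractions against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision

# Hypothetical example: four true locations, three extractions, two of them correct.
gold = {"new york", "nicaragua", "colombia", "bolivia"}
extracted = {"new york", "nicaragua", "head"}
recall, precision = recall_precision(extracted, gold)
```

A small, cautious lexicon tends to give high precision with low recall, which matches the baseline column of the table.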

