
1 IGK Colloquium - Winter 04/05, 2004-12-09
Suggestions to Improve the Flexibility and Adaptivity of Information Extraction
Irene M. Cramer
Supervisor: Prof. Dr. D. Klakow, Lehrstuhl für Sprachsignalverarbeitung, Saarland University

2 Outline
Information Extraction
- Some comments on IE
- An IE example system: FASTUS
- The challenge
Answers to the challenge
- The possible method
- Some case studies
- Dissertation roadmap

3 Information Extraction
Problem:
- Huge amount of textual information available
- Who is able to read and analyze it?

4 Information Extraction
Solution: IE
- Find relevant information
- Analyze relevant information automatically
- Structure relevant information

5 IE - Templates

6 Information Extraction
Input: specification of relevant information → templates and documents
Output: set of instantiated templates → e.g. stored in a database

7 Information Extraction
Evaluation: precision/recall, F-measure
Application:
- Text Classification
- Text Mining
- Text Summarization
- Question Answering
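The evaluation measures named on this slide can be sketched as follows. This is a minimal illustration, assuming that system output and gold standard are both given as sets of (slot, value) pairs; the function name and the example fills are hypothetical and not taken from the talk.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Compare extracted slot fills against gold-standard slot fills."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # F-measure: weighted harmonic mean of precision and recall.
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical slot fills, for illustration only.
gold = {
    ("Incident: Type", "Bombing"),
    ("Perpetrator: Organization ID", "FMLN"),
    ("Human Target: Name", "Roberto Garcia Alvarado"),
}
extracted = {
    ("Incident: Type", "Bombing"),
    ("Perpetrator: Organization ID", "FMLN"),
    ("Human Target: Name", "Alfredo Cristiani"),  # a wrong fill
}
p, r, f = precision_recall_f(extracted, gold)
```

With `beta=1.0` precision and recall are weighted equally; a larger `beta` gives recall more weight.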

8 IE example system: FASTUS
FASTUS (= Finite State Automaton-based Text Understanding System)
- MUC IE system
- Extraction of information from unstructured text
- No real text understanding!

9 IE example system: FASTUS
San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime....
Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador....
Vice President-elect Francisco Merino said that when the attorney general's car stopped at a light on a street in downtown San Salvador, an individual placed a bomb on the roof of the armored vehicle....
According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
http://www.ai.sri.com/natural-language/projects/fastus-schabes.html

10 IE example system: FASTUS
Incident: Date                          - 19 Apr 89
Incident: Location                      El Salvador: San Salvador (CITY)
Incident: Type                          Bombing
Perpetrator: Individual ID              "urban guerrillas"
Perpetrator: Organization ID            "FMLN"
Perpetrator: Organization Confidence    Suspected or Accused by Authorities: "FMLN"
Physical Target: Description            "vehicle"
Physical Target: Effect                 Some Damage: "vehicle"
Human Target: Name                      "Roberto Garcia Alvarado"
Human Target: Description               "attorney general": "Roberto Garcia Alvarado"; "driver"; "bodyguards"
Human Target: Effect                    Death: "Roberto Garcia Alvarado"; No Injury: "driver"; Injury: "bodyguards"
http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
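An instantiated template like the one above could be held in a simple record type before it is stored in a database. The sketch below is only one possible representation: the class name, the field names, and the grouping of slots are assumptions made for illustration, and only the slot values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentTemplate:
    # Field names are hypothetical; they loosely mirror the slot names on the slide.
    date: str = ""
    location: str = ""
    incident_type: str = ""
    perpetrator_individuals: list = field(default_factory=list)
    perpetrator_organizations: list = field(default_factory=list)
    physical_targets: list = field(default_factory=list)
    human_target_effects: dict = field(default_factory=dict)  # target -> effect

    def filled_slots(self):
        """Return only the slots that were actually instantiated."""
        return {name: value for name, value in vars(self).items() if value}


template = IncidentTemplate(
    date="19 Apr 89",
    location="El Salvador: San Salvador (CITY)",
    incident_type="Bombing",
    perpetrator_individuals=["urban guerrillas"],
    perpetrator_organizations=["FMLN"],
    physical_targets=["vehicle"],
    human_target_effects={
        "Roberto Garcia Alvarado": "Death",
        "driver": "No Injury",
        "bodyguards": "Injury",
    },
)
```

An empty `IncidentTemplate()` corresponds to the uninstantiated template that is given to the system as input.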

11 IE example system: FASTUS
Series of cascaded finite-state automata
Basically, 3 steps:
- Recognize phrases
  - Complex words (multi-words, proper names)
  - Simple phrases
  - Complex phrases
- Recognize patterns
- Merge bits of information found

12 Example: Complex Words & Phrases
Company Name: Bridgestone Sports Co. (→ multi-word/proper name)
Verb Group: said
Noun Group: Friday
Noun Group: it
Verb Group: had set up
Noun Group: a joint venture
Preposition: in
Location: Taiwan (→ proper name)
Preposition: with
Noun Group: a local concern
Conjunction: and
Noun Group: a Japanese trading house
Verb Group: to produce
Noun Group: golf clubs
Verb Group: to be shipped
Preposition: to
Location: Japan (→ proper name)
http://www.ai.sri.com/natural-language/projects/fastus-schabes.html

13 Example: Patterns
{Company/ies} {Set-up} {Joint-Venture} with {Company/ies}
{Produce} {Product}
{Company} {Capitalized} at {Currency}
{Company} {Start} {Activity} in/on {Date}
http://www.ai.sri.com/natural-language/projects/fastus-schabes.html
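To illustrate how a pattern such as the first one might be applied to an already chunked sentence, here is a heavily simplified sketch. It is not the FASTUS implementation: the matcher below is a plain left-to-right scan that skips unmatched phrases, and the function name, the slot names, and the string tests are all assumptions. The phrase sequence is a prefix of the chunking example from the previous slide.

```python
# Phrase sequence as (type, text) pairs, taken from the chunking example.
phrases = [
    ("Company", "Bridgestone Sports Co."),
    ("VerbGroup", "said"),
    ("NounGroup", "Friday"),
    ("NounGroup", "it"),
    ("VerbGroup", "had set up"),
    ("NounGroup", "a joint venture"),
    ("Preposition", "in"),
    ("Location", "Taiwan"),
    ("Preposition", "with"),
    ("NounGroup", "a local concern"),
]


def match_pattern(phrases, pattern):
    """Match pattern elements in order; phrases between elements are skipped."""
    slots, i = {}, 0
    for slot, test in pattern:
        while i < len(phrases) and not test(*phrases[i]):
            i += 1
        if i == len(phrases):
            return None  # a pattern element could not be found
        if slot:
            slots[slot] = phrases[i][1]
        i += 1
    return slots


# Hypothetical encoding of {Company/ies} {Set-up} {Joint-Venture} with {Company/ies}.
joint_venture = [
    ("company", lambda t, s: t == "Company"),
    (None, lambda t, s: t == "VerbGroup" and "set up" in s),
    (None, lambda t, s: t == "NounGroup" and "joint venture" in s),
    (None, lambda t, s: t == "Preposition" and s == "with"),
    ("partner", lambda t, s: t == "NounGroup"),
]
result = match_pattern(phrases, joint_venture)
```

The sketch only picks up the first partner; handling the conjunction ("and a Japanese trading house") would belong to the complex-phrase and merging steps.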

14 Information Extraction
Limitation
- Someone has to build the templates, which is time-consuming.
- Thus, the templates are normally static.
What about adaptation to a new domain ...?

15 Example – QA Ontology
http://www.cs.ust.hk/~hltc/semanet02/pdf/mann.pdf

16 The Challenge
To be more flexible (and to support open-domain QA):
- Have many more patterns than in a typical IE system
- Base work on (already existing) QA ontology
- Learn the patterns automatically!?

17 The Method – Constraints
We are looking for common entities (such as MUC Named Entities) ...
... and also for exceptional ones (book titles, sports, occupations etc.)
No annotated corpora
No hand-crafted rules
Thus, we will have to start with almost nothing → unsupervised or semi-supervised learning

18 The Method – Bootstrapping
"... a process where a simple system activates a more complicated system ..." (http://en.wikipedia.org)
"... a complex system emerges by starting simply and, bit by bit, developing more complex capabilities on top of the simpler ones ..." (http://en.wikipedia.org)

19 The Method – Bootstrapping
Start with seed
- Learn
- Evaluate what was learned
- Add the evaluated items to the seed
Restart with new seed

20 Excursus: Bootstrapping for WSD
Yarowsky 1995
- Start with a small set of contexts for a given word (e.g. plant)
- Determine log likelihood values from a small annotated corpus
- Arrange log likelihoods according to their values

21 Excursus: Bootstrapping for WSD
Look for the word (plant) and its context in the corpus
Assign a sense (sense 1 or sense 2) on the basis of the best applicable log likelihood ratio
Find new context words that co-occur with a known context often enough
Example:
target: plant
known context: species
co-occurrence: animal

22 Excursus: Bootstrapping for WSD
Calculate log likelihood ratios of this new context
Add them to the list
Note: smoothing is useful
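The ranking step of this excursus can be sketched as a small decision list. The code is an illustration under stated assumptions, not Yarowsky's implementation: it uses simple additive smoothing with a hypothetical constant `alpha`, treats each context as a bag of words, and the labeled contexts are made up (sense 1 stands for the living plant, sense 2 for the factory).

```python
import math
from collections import Counter


def decision_list(labeled_contexts, alpha=0.1):
    """Rank context words by the absolute smoothed log likelihood ratio."""
    counts = {1: Counter(), 2: Counter()}
    for sense, words in labeled_contexts:
        counts[sense].update(set(words))
    rules = []
    for word in set(counts[1]) | set(counts[2]):
        # Smoothing keeps the ratio finite when a word occurs with one sense only.
        llr = math.log((counts[1][word] + alpha) / (counts[2][word] + alpha))
        if llr != 0:
            rules.append((abs(llr), word, 1 if llr > 0 else 2))
    return sorted(rules, reverse=True)


def classify(context_words, rules):
    """Assign the sense of the strongest rule that applies to the context."""
    for _, word, sense in rules:
        if word in context_words:
            return sense
    return None


# Hypothetical seed contexts for the target word "plant".
seeds = [
    (1, ["species", "animal"]),
    (1, ["species", "life"]),
    (2, ["manufacturing", "equipment"]),
    (2, ["manufacturing", "worker"]),
]
rules = decision_list(seeds)
```

In a bootstrapping run, newly classified occurrences would be added to `seeds` and the list recomputed, which is how a co-occurring word like "animal" enters the list.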

23 The Method – Bootstrapping
What does this mean for Information Extraction?
Start with a small number of instances (and/or patterns)
- Learn patterns from them
- Evaluate patterns
- Add new patterns to pattern set
Derive more instances from these new patterns
Evaluate new instances
Add new instances to instance set
Restart with enlarged instance (or pattern) set
This is an iterative process. There are basically two "nested" bootstrapping loops.
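The loop described on this slide can be sketched on a toy corpus. Everything below is an assumption made for illustration: a pattern is reduced to the pair of words directly left and right of an instance, a pattern is "evaluated" by requiring that it surrounds at least `min_support` known instances, and the six sentences are invented. The talk leaves these choices open.

```python
def find_patterns(corpus, instances):
    """Collect (left word, right word) contexts around known instances."""
    hits = {}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            if tokens[i] in instances:
                hits.setdefault((tokens[i - 1], tokens[i + 1]), set()).add(tokens[i])
    return hits


def apply_patterns(corpus, patterns):
    """Extract every token that occurs inside one of the accepted contexts."""
    found = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            if (tokens[i - 1], tokens[i + 1]) in patterns:
                found.add(tokens[i])
    return found


def bootstrap(corpus, seeds, min_support=2, max_rounds=5):
    instances, patterns = set(seeds), set()
    for _ in range(max_rounds):
        candidates = find_patterns(corpus, instances)
        # Evaluate patterns: keep those supported by enough known instances.
        new_patterns = {p for p, h in candidates.items() if len(h) >= min_support} - patterns
        patterns |= new_patterns
        new_instances = apply_patterns(corpus, patterns) - instances
        if not new_patterns and not new_instances:
            break  # nothing new was learned: converged
        instances |= new_instances
    return instances, patterns


# Invented toy corpus.
corpus = [
    "hotels in Berlin and surroundings",
    "hotels in Paris and surroundings",
    "hotels in Hamburg and surroundings",
    "flights to Paris from anywhere",
    "flights to Hamburg from anywhere",
    "flights to Munich from anywhere",
]
instances, patterns = bootstrap(corpus, {"Berlin", "Paris"})
```

In this toy run "Munich" is only reachable in the second round, after "Hamburg" has been added and has made the second pattern frequent enough. The sketch accepts every extracted instance without evaluating it, which is exactly the step the talk flags as an open problem.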

24 The Method – Bootstrapping
Some fundamental problems:
- How to evaluate the patterns and the instances?
- Add all instances (patterns) to the instance (pattern) set?
- Start with instances or patterns or even with both?
- By the way, what is a pattern?
- What about convergence of the algorithm?
- What about corpus size?

25 Some Case Studies: Corpus and Method
Corpus: web and WSJ
Apply the algorithm described, but choose patterns/instances manually

26 Some Case Studies: City
Start with one instance: "Berlin"
→ pattern: "Hotels in * und Umgebung" (German: "hotels in * and surrounding area")
→ search for "Hotels in *"
Paris, München, Hamburg etc.
but also: Europa, Mecklenburg-Vorpommern
Now, restart web search with new instances to get new patterns
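Applying a wildcard search pattern like "Hotels in *" to retrieved text could look roughly like this. The sketch assumes that the wildcard stands for a single capitalized token, which is a simplification introduced here and not a claim from the talk; the helper name and the text snippets are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern with exactly one '*' into a regex with one capture group."""
    left, right = pattern.split("*")
    # Assumption: an instance is one capitalized token, possibly hyphenated.
    return re.compile(re.escape(left) + r"([A-ZÄÖÜ][\w-]*)" + re.escape(right))


# Hypothetical search-result snippets.
snippets = [
    "Hotels in Paris und Umgebung",
    "Günstige Hotels in München buchen",
    "Hotels in Mecklenburg-Vorpommern",
    "Hotels in der Nähe",
]
regex = wildcard_to_regex("Hotels in *")
candidates = [m for snippet in snippets for m in regex.findall(snippet)]
```

The capitalization filter drops "der" but still lets through a region such as "Mecklenburg-Vorpommern", so the pattern alone cannot separate cities from other locations.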

27 Some Case Studies: Professions
Start with one instance: "lawyer"
→ patterns: "lawyer's job", "hire a lawyer"
→ search for "*'s job"
forester, therapist, reporter etc.
but also: employee, John ...
Now, restart web search with new instances to get new patterns

28 Some Case Studies: Problems
Patterns match a lot of different instance types → possible criteria to choose good patterns
Instances could be multi-words → criteria that determine "instance boundaries"
Even if patterns are good, instances found could be wrong ones → criteria to decide about instances
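Two conceivable criteria of the kind asked for here are sketched below: a pattern score based on how many of its extractions are already known, and an instance score based on how many different patterns extracted it. Both are assumptions for illustration, not the criteria chosen in the talk, and the second pattern and its extractions are invented.

```python
def pattern_precision(extracted, known):
    """Share of a pattern's extractions that are already known instances."""
    extracted = set(extracted)
    return len(extracted & set(known)) / len(extracted) if extracted else 0.0


def instance_support(instance, extractions_by_pattern):
    """Number of distinct patterns that extracted the instance."""
    return sum(instance in found for found in extractions_by_pattern.values())


known_cities = {"Paris", "München", "Hamburg"}
extractions_by_pattern = {
    "Hotels in *": {"Paris", "München", "Hamburg", "Europa"},
    "* Stadtplan": {"Paris", "Hamburg"},  # hypothetical second pattern
}
score = pattern_precision(extractions_by_pattern["Hotels in *"], known_cities)
support_paris = instance_support("Paris", extractions_by_pattern)
support_europa = instance_support("Europa", extractions_by_pattern)
```

Thresholds on such scores would decide which patterns and instances enter the next bootstrapping round.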

29 Some Case Studies: List Search
Start web search with 5 instances at a time
→ instances: tennis, football, ballet, sailing, baseball
Get lists with lots of additional instances all at once
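Finding the actual list in a retrieved page could be approached as sketched below, assuming the instances appear as items of an HTML `<ul>` or `<ol>` list. The class and function names, the `min_seeds` threshold, and the page content are hypothetical; real pages often use tables or plain line breaks instead.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Collect the item texts of every <ul>/<ol> list in a page."""

    def __init__(self):
        super().__init__()
        self.lists = []       # finished lists
        self._open = []       # stack of lists that are still open
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._open.append([])
        elif tag == "li" and self._open:
            self._open[-1].append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self.lists.append(self._open.pop())
            self._in_item = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and self._open and self._open[-1]:
            self._open[-1][-1] += data


def expand_from_lists(html, seeds, min_seeds=2):
    """Return new items from lists that contain at least min_seeds seed instances."""
    collector = ListCollector()
    collector.feed(html)
    seeds = {s.lower() for s in seeds}
    new_items = set()
    for items in collector.lists:
        cleaned = {item.strip().lower() for item in items if item.strip()}
        if len(cleaned & seeds) >= min_seeds:
            new_items |= cleaned - seeds
    return new_items


# Hypothetical page content.
page = (
    "<ul><li>Home</li><li>Contact</li></ul>"
    "<ul><li>Tennis</li><li>Football</li><li>Golf</li><li>Rowing</li></ul>"
)
new_sports = expand_from_lists(page, ["tennis", "football", "ballet", "sailing", "baseball"])
```

Requiring several seeds in the same list is what filters out navigation menus such as the first list in the example page.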

30 Some Case Studies: Problems
Only works on the web!
For some instance types it doesn't work at all!
Decide about 5 instances → too similar or too different = find no lists
Find actual list in web page

31 Roadmap
Decide about bootstrapping and implement it
Run for MUC Named Entities
Run for "simple", "one word" classes (e.g. sports, occupations)
Run for "difficult" classes (e.g. book titles, movies)
Run for different classes at the same time

32 Literature survey
There are some publications which address either bootstrapping or flexible IE:
- E. Riloff, R. Jones (1999): Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.
- E. Agichtein, L. Gravano (2000): Snowball: Extracting Relations from Large Plain-Text Collections.
- R. Yangarber, et al. (2000): Automatic Acquisition of Domain Knowledge for Information Extraction.
- O. Etzioni, et al. (2004): Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison.
- D. Yarowsky (1995): Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.
- S. Abney (2002): Bootstrapping.
- S. Abney (2004): Understanding the Yarowsky Algorithm.

33 Thank you!

