Slide 1: Automatic Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen
NYU, ANLP-00
Slide 2: Outline
– Information Extraction: background
– Problems in IE
– Prior work: machine learning for IE
– Discovering patterns from raw text
– Experimental results
– Current work
Slide 3: Quick Overview
What is Information Extraction?
Definition:
– finding facts about a specified class of events in free text
– filling a table in a database (slots in a template)
Events are instances of relations, with many arguments.
Slide 4: Example: Management Succession
"George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."
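Extracting this event amounts to filling a record like the minimal Python sketch below; the slot names are invented for the example, not the official MUC-6 template fields.

    # Illustrative succession record for the sentence above; slot names
    # are hypothetical, not the actual MUC-6 template slots.
    event = {
        "type": "succession",
        "person_in": "George Garrick",       # incoming officer
        "post": "chief executive officer",
        "org": "Nielsen Marketing Research",
    }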
Slide 6: System Architecture: Proteus
[Pipeline diagram: Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns (sentence-level stages) → Reference Resolution → Discourse Analyzer (discourse-level stages) → Output Generation → Extracted Information]
Slide 8: Problems
– Customization
– Performance
Slide 9: Problems: Customization
To customize a system for a new extraction task, we have to develop:
– new patterns for new types of events
– word classes for the domain
– inference rules
This can be a large job requiring skilled labor; the expense of customization limits the uses of extraction.
Slide 10: Problems: Performance
Performance on event IE is limited: on MUC tasks, typical top performance is recall < 55%, precision < 75%.
Errors propagate through multiple phases:
– name recognition errors
– syntax analysis errors
– missing patterns
– reference resolution errors
– complex inference required
Slide 11: Missing Patterns
As with many language phenomena, there are:
– a few common patterns
– a large number of rare patterns
Rare patterns do not surface often enough in a limited corpus. Missing patterns make customization expensive and limit performance, so finding good patterns is necessary to improve both.
[Plot: pattern frequency vs. rank]
Slide 12: Prior Research
– Build patterns from examples: Yangarber '97
– Generalize from multiple examples in annotated text: Crystal, Whisk (Soderland), Rapier (Califf)
– Active learning to reduce annotation: Soderland '99, Califf '99
– Learning from a corpus with relevance judgements: Riloff '96, '99
– Co-learning / bootstrapping: Brin '98, Agichtein '00
Slide 13: Our Goals
Minimize the manual labor required to construct pattern bases for a new domain:
– un-annotated text
– un-classified text
– un-supervised learning
Use very large corpora (larger than we could ever tag manually) to improve the coverage of patterns.
Slide 14: Principle: Pattern Density
If we have relevance judgements for the documents in a corpus, for the given task, then the patterns which are much more frequent in relevant documents will generally be good patterns.
Riloff (1996) uses this principle to find patterns related to terrorist attacks.
Slide 15: Principle: Duality
Duality between patterns and documents:
– relevant documents are strong indicators of good patterns
– good patterns are strong indicators of relevant documents
Slide 16: Outline of Procedure
Initial query: a small set of seed patterns which partially characterize the topic of interest.
Repeat:
– Retrieve the documents containing the seed patterns: the "relevant documents"
– Rank the patterns in relevant documents by frequency in relevant docs vs. overall frequency
– Add the top-ranked pattern to the seed pattern set
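A minimal sketch of this loop in Python, assuming each document is represented simply as the set of candidate patterns it contains; the plain frequency ratio used here stands in for the Riloff-style score defined on slide 37.

    # Bootstrapping pattern discovery: grow the pattern set one pattern
    # per iteration, re-deriving the relevant-document set each time.
    def discover_patterns(corpus, seeds, iterations=80):
        accepted = set(seeds)
        all_patterns = set().union(*corpus)
        for _ in range(iterations):
            # Documents matching any accepted pattern count as relevant.
            relevant = [doc for doc in corpus if doc & accepted]
            scores = {}
            for p in all_patterns - accepted:
                in_relevant = sum(1 for doc in relevant if p in doc)
                overall = sum(1 for doc in corpus if p in doc)
                if in_relevant and overall:
                    scores[p] = in_relevant / overall
            if not scores:
                break
            # Promote the single top-ranked pattern and repeat.
            accepted.add(max(scores, key=scores.get))
        return accepted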
Slide 17: #1: pick seed pattern
Seed: [shown graphically on the original slide; not preserved in the transcript]
Slide 18: #2: retrieve relevant documents
Documents matching the seed are marked relevant, e.g.:
– "Fred retired. ... Harry was named president."
– "Maki retired. ... Yuki was named president."
[Diagram splits the corpus into relevant documents and other documents]
Slide 19: #3: pick new pattern
A new pattern appears in several relevant documents and is top-ranked by the Riloff metric:
– "Fred retired. ... Harry was named president."
– "Maki retired. ... Yuki was named president."
Slide 20: #4: add new pattern to pattern set
Pattern set: [shown graphically on the original slide; not preserved in the transcript]
Slide 21: Pre-processing
For each document, find and classify names: { person | location | organization | … }
Parse the document (regularize passives, relative clauses, etc.)
For each clause, collect a candidate pattern: the tuple of the heads of
– [ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
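A sketch of the candidate-pattern step, assuming parsing has already produced clauses with labeled constituents; the role names and clause representation are illustrative, not the system's actual data structures.

    # Reduce a parsed clause to the tuple of constituent heads; roles
    # missing from the clause become None.
    def candidate_pattern(clause):
        roles = ("subject", "verb", "direct_object",
                 "complement", "locative", "temporal")
        return tuple(clause.get(role) for role in roles)

    # For "Maki retired." after name classification (Maki -> person):
    clause = {"subject": "C-Person", "verb": "retire"}
    print(candidate_pattern(clause))
    # -> ('C-Person', 'retire', None, None, None, None)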
Slide 22: Experiment
Task: management succession (as in MUC-6)
Source: Wall Street Journal
Training corpus: ~6,000 articles
Test corpus:
– 100 documents: MUC-6 formal training
– plus 150 documents judged manually
Slide 23: Experiment: two seed patterns
v-appoint = { appoint, elect, promote, name }
v-resign = { resign, depart, quit, step-down }
Run the discovery procedure for 80 iterations.
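One plausible encoding of the seeds, following the tuple layout from the pre-processing slide; the slide gives only the verb classes, so the subject/object constraints below are assumptions.

    # Verb classes from the slide.
    V_APPOINT = {"appoint", "elect", "promote", "name"}
    V_RESIGN = {"resign", "depart", "quit", "step-down"}

    # Assumed seed tuples: (subject class, verb class, object class).
    SEEDS = [
        ("C-Company", V_APPOINT, "C-Person"),  # <company> appoints <person>
        ("C-Person", V_RESIGN, None),          # <person> resigns
    ]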
Slide 24: Evaluation
– Look at the discovered patterns: new patterns missed in manual training
– Document filtering
– Slot filling
Slide 25: Discovered patterns
[Table of discovered patterns shown on the original slide; not preserved in the transcript]
Slide 26: Evaluation: new patterns
Patterns discovered automatically that were not found in manual training.
Slide 27: Evaluation: Text Filtering
How effective are the discovered patterns at selecting relevant documents?
– IR-style evaluation
– a document counts as retrieved if it matches at least one pattern
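A minimal sketch of this IR-style measurement, under the same document-as-pattern-set representation as above; is_relevant holds the manual judgements and is an assumed input.

    # Precision/recall of "documents matching at least one pattern"
    # against manual relevance judgements.
    def text_filtering(corpus, is_relevant, patterns):
        retrieved = [i for i, doc in enumerate(corpus) if doc & patterns]
        hits = sum(1 for i in retrieved if is_relevant[i])
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / sum(is_relevant) if sum(is_relevant) else 0.0
        return precision, recall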
Slides 28–29: [Results graphs; not preserved in the transcript]
Slide 30: Evaluation: Slot filling
How effective are the patterns within a complete IE system?
MUC-style IE on the MUC-6 corpora
Caveat: [details not preserved in the transcript]
Slide 33: Conclusion: Automatic discovery
Performance comparable to manual customization by a human (4 weeks of development)
From un-annotated text: allows us to take advantage of very large corpora
– redundancy
– duality
Will likely help wider use of IE
Slide 35: Good Patterns
U = the universe of all documents
R = the set of relevant documents
H = H(p) = the set of documents where pattern p matched
Density criterion: |H ∩ R| / |H| >> |R| / |U|
(relevant documents are much denser among p's matches than in the corpus overall)
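A direct transcription of the criterion over sets of document ids; the "much denser" threshold alpha is an assumed parameter, since the slide does not preserve a specific value.

    # Density criterion: is pattern p's match set H much denser in
    # relevant documents R than the whole corpus U?
    def dense_enough(H, R, U, alpha=2.0):
        if not H:
            return False
        return len(H & R) / len(H) > alpha * (len(R) / len(U))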
Slide 36: Graded Relevance
Documents matching the seed patterns are considered 100% relevant.
Discovered patterns are considered less certain, so documents containing them are considered only partially relevant.
37
NYU 37 document frequency in relevant documents overall document frequency document frequency in relevant documents –(metrics similar to those used in Riloff-96) Scoring Patterns