Empirical Methods in Information Extraction
Claire Cardie
AI Magazine, 18(4):65-79, 1997
Summarized by Seong-Bae Park
Information Extraction
- A particular natural language understanding task
- Inherently domain-specific
- Input: unrestricted text
- Output: information in a structured form
- Skims a text to find relevant sections and focuses only on those sections
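To make the input/output contract concrete, here is a hedged, hypothetical illustration (not from the paper) of an input text and the structured template an IE system might produce for it; all field names are invented.

```python
# Hypothetical input text and a MUC-style output template;
# the slot names below are invented for illustration only.
text = "A bomb exploded near the U.S. embassy in Lima, injuring two guards."

template = {
    "incident_type": "bombing",
    "target": "U.S. embassy",
    "location": "Lima",
    "victims": ["two guards"],
}
```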
Problems in IE
1. The accuracy and robustness of systems can still be greatly improved.
2. Building a system in a new domain is difficult and time-consuming.
Architecture of an IE System
Architecture (1)
- Tokenizing and Tagging
- Sentence Analysis (see the sketch below)
  - Phrase identification and simple grammatical relations
  - Finds and labels the semantic entities relevant to the extraction topic
  - Difference from traditional parsers: IE does not need a complete, detailed parse tree
- Extraction
  - Identifies domain-specific relations among the relevant entities
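A minimal sketch of the tokenizing, tagging, and partial-parsing stages, assuming NLTK and its default English models (the paper does not prescribe any particular toolkit):

```python
# Shallow sentence analysis: tokenize, POS-tag, and chunk noun phrases
# without building a complete parse tree.  Requires nltk plus
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

sentence = "A bomb exploded near the U.S. embassy in Lima."
tokens = nltk.word_tokenize(sentence)   # tokenizing
tagged = nltk.pos_tag(tokens)           # tagging

# Partial parsing: identify base noun phrases only.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NNP|NN|NNS>+}")
print(chunker.parse(tagged))
```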
Architecture (2)
- Merging (a heuristic sketch follows below)
  - The main job: coreference resolution (anaphora resolution)
  - Optional: recovering implicit subjects
- Template Generation
  - Determine the number of distinct events
  - Map the individually extracted pieces onto each event
  - Produce the output templates
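A hedged sketch of the merging decision (this is not Cardie's algorithm, only a naive string heuristic standing in for it): decide whether two extracted mentions corefer so their slots can be merged into one event.

```python
# Naive coreference heuristic for merging: two mentions corefer if one
# is a suffix of the other or they share a head word, e.g.
# "the U.S. embassy" and "the embassy".  Purely illustrative.
def corefer(mention_a: str, mention_b: str) -> bool:
    a, b = mention_a.lower().split(), mention_b.lower().split()
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[-len(shorter):] == shorter or shorter[-1] == longer[-1]

print(corefer("the U.S. embassy", "the embassy"))   # True
print(corefer("two guards", "the embassy"))         # False
```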
Role of Corpus-Based Language Learning Algorithms
- The catch: obtaining enough training data
  - For many language tasks, annotated corpora like the Penn Treebank already exist
- For some problems, ML techniques are difficult to apply
  - Learning extraction patterns, coreference resolution, template generation
  - No annotated corpora exist for these tasks
  - Annotating them requires semantic and domain-specific language-processing skill
Learning Extraction Patterns
- A good pattern is (see the sketch below):
  - General enough to extract the correct information from more than one sentence
  - Specific enough not to apply in inappropriate contexts
- Learning methods differ in:
  - The class of patterns learned
  - The training corpus required
  - The amount and type of human feedback required
  - The degree of preprocessing necessary
  - The background knowledge required
  - The biases inherent in the learning algorithm itself
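A hedged illustration of the generality/specificity trade-off, using regular expressions as stand-ins for extraction patterns (the patterns in the paper are syntactic/semantic case frames, not regexes):

```python
import re

sentences = [
    "The embassy was bombed by rebels.",
    "The mayor was criticized by the press.",
]

too_general = re.compile(r"(\w+(?:\s\w+)*) was \w+ by")     # any passive verb
specific    = re.compile(r"(\w+(?:\s\w+)*) was bombed by")  # bombing events only

for s in sentences:
    print(too_general.search(s), specific.search(s))
# 'too_general' wrongly fires on the second sentence; 'specific' does not,
# yet still generalizes across every "<target> was bombed by ..." sentence.
```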
AutoSlog (1)
- Learns extraction patterns in the form of domain-specific "concept node" definitions for the CIRCUS parser
- Concept node: a domain-specific semantic case frame containing a maximum of one slot per frame
AutoSlog (2)
- A one-shot learning algorithm
- Training corpus
  - A set of texts with noun phrases annotated with the appropriate concept type
  - Or texts with associated answer keys, as in the MUC corpora
- Also required
  - A partial parser
  - A small set (approximately 13) of general linguistic patterns
AutoSlog (3)
To derive a pattern for extracting a target noun phrase (a simplified sketch follows the steps):
1. Find the sentence from which the NP originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order, identifying the NP's thematic role based on its syntactic position.
4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target NP, and the predefined semantic class for the filler.
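A much-simplified sketch of steps 3-4, assuming the partial parser returns the sentence as a flat list of (role, text) chunks; the two hard-coded patterns stand in for AutoSlog's roughly 13, and all names here are hypothetical:

```python
# Derive one concept-node definition for an annotated target NP.
# Two stand-in linguistic patterns:
#   <subject> passive-verb   ->  "<target> was {verb}"
#   active-verb <dobj>       ->  "{verb} <target>"
def derive_concept_node(chunks, target_np, concept_type, semantic_class):
    for i, (role, text) in enumerate(chunks):
        if text != target_np:
            continue
        if role == "subject" and i + 1 < len(chunks) and chunks[i + 1][0] == "passive_verb":
            verb = chunks[i + 1][1]
            pattern = f"<target> was {verb}"
        elif role == "dobj" and i > 0 and chunks[i - 1][0] == "verb":
            verb = chunks[i - 1][1]
            pattern = f"{verb} <target>"
        else:
            continue  # no linguistic pattern applies at this position
        return {"trigger": verb, "pattern": pattern,
                "slot": concept_type, "semantic_class": semantic_class}
    return None

chunks = [("subject", "the U.S. embassy"), ("passive_verb", "bombed"),
          ("pp", "by rebels")]
print(derive_concept_node(chunks, "the U.S. embassy", "target", "PHYS-OBJ"))
# -> {'trigger': 'bombed', 'pattern': '<target> was bombed', 'slot': 'target', ...}
```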
Other Systems
- PALKA (Kim and Moldovan, 1995)
  - Background knowledge: a concept hierarchy, a set of predefined keywords that can trigger each pattern, and a semantic class lexicon
- CRYSTAL (Soderland et al., 1995)
  - Learns extraction patterns in the form of semantic case frames
- Huffman's LIEP system
Coreference Resolution (1)
- An example from MUC-6
- Major weaknesses of existing IE systems:
  - They use manually generated heuristics (do these generalize?)
  - They assume the input is fully parsed, with grammatical functions and thematic roles available
  - Errors accumulate sentence after sentence
- A system must be able to handle the myriad forms of coreference
Coreference Resolution (2)
- Empirical method: inductive learning algorithms can be applied
  - MLR (Aone and Bennett, 1995): Japanese
  - RESOLVE (McCarthy and Lehnert, 1995): English
  - Both use C4.5 as the learning algorithm
- Training data
  - MLR: automatically generated
  - RESOLVE: manually generated, noise-free
Coreference Resolution (3)
MLR feature set: 66 features (see the sketch below)
(1) lexical features of each phrase
(2) the grammatical role of the phrase
(3) semantic class information
(4) relative positional information
(5) whether each phrase contains a proper name (2 features)
(6) whether one or both phrases refer to the entity formed by a joint venture (3 features)
(7) whether one phrase contains an alias of the other (1 feature)
(8) whether the phrases have the same base noun phrase (1 feature)
(9) whether the phrases originate from the same sentence (1 feature)
Features (1)-(4) are domain-independent.
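A hedged sketch of the pairwise-classification setup MLR and RESOLVE use: each instance describes a pair of phrases, and a decision tree predicts coreferent or not. scikit-learn's DecisionTreeClassifier stands in for C4.5, and the three toy features are a tiny invented subset of MLR's 66.

```python
from sklearn.tree import DecisionTreeClassifier

# One row per phrase pair:
# [same_sentence, alias/string match, same semantic class]
X = [
    [1, 1, 1],  # "the embassy" / "the U.S. embassy"
    [1, 0, 0],  # "the embassy" / "two guards"
    [0, 1, 1],  # "Mr. Smith"   / "Smith"
    [0, 0, 0],  # "a bomb"      / "the press"
]
y = [1, 0, 1, 0]   # 1 = coreferent, 0 = not coreferent

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[0, 1, 1]]))   # -> [1]
```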
Coreference Resolution (4)
- MLR and RESOLVE were evaluated on 50-250 texts
- RESOLVE: recall 80-85%, precision 87-92%
  - Default baseline (always answering negative): about 74%
- MLR: recall 67-70%, precision 83-88%
- Both significantly outperform manually developed IE systems
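To make the reported metrics concrete, here is how recall and precision are computed over pairwise coreference decisions, with hypothetical counts chosen only to illustrate the formulas:

```python
# Recall and precision over pairwise coreference decisions.
true_positives  = 82   # coreferent pairs correctly identified
false_negatives = 18   # coreferent pairs missed
false_positives = 10   # pairs wrongly called coreferent

recall    = true_positives / (true_positives + false_negatives)   # 0.82
precision = true_positives / (true_positives + false_positives)   # ~0.89
print(f"recall={recall:.2f}, precision={precision:.2f}")
```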
Coreference Resolution (5)
- Much research remains
  - The approaches should be tested on additional types of anaphors
  - Can they work without domain-specific information?
  - The errors contributed by the preceding phases must be investigated
  - There have been few attempts at other discourse-level problems
Future Directions
- Research in IE is very new; applying ML algorithms to it is even newer.
- A number of exciting directions:
  - Unsupervised learning, to sidestep the lack of annotated corpora
  - Eliminating the need for NLP experts when moving IE systems to new domains