Published by Elisabeth Morton. Modified over 9 years ago.
1
2007. 11. 14
2
Introduction
  Information Extraction (IE)
    A limited form of “complete text comprehension”
    Extracts entities and relationships from documents
    Relationship => fact or event
      Fact: static
      Event: dynamic
  Document => entity-relationship or frame (structured object)
3
Schematic view of IE
4
Information Extraction
  Simple IE system: term extraction
  Complex IE system: frame generation
5
Data Elements of IE
  Entities
    Basic building blocks
    Ex) people, locations, genes, and drugs
  Attributes
    Features of extracted entities
  Facts
    Relations that exist between entities
    Ex) an employment relationship between a person and a company, or phosphorylation between two proteins
  Events
    An activity or occurrence of interest in which entities participate
    Ex) a terrorist act, a merger between two companies, a birthday, and so on
6
Data Elements of IE
7
MUC IE Tasks
  MUC: Message Understanding Conference
    Sponsored by DARPA (Defense Advanced Research Projects Agency)
  MUC tasks
    Named Entity Recognition (NER)
    Template Element (TE) Task
    Template Relationship (TR) Task
    Scenario Template (ST) Task
    Coreference (CO) Task
8
Named Entity Recognition
  NER: identify all mentions of proper names and quantities in the text
    People names, geographic locations, and organizations
    Dates and times
    Monetary amounts and percentages
  Test with MUC corpora
    Proper names: 70%
      Organization: 45~50%
      Location: 12~32%
      People: 23~39%
    Dates and times: 25%
    Monetary amounts and percentages: 5%
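The quantity types on this slide (monetary amounts and percentages) are regular enough to illustrate with patterns. The following is a minimal sketch, not the method used by any MUC system; the two regular expressions and the function name `find_quantities` are assumptions made for illustration, and proper names would need far more than this.

```python
import re

# Hypothetical patterns for two of the quantity types named on the slide.
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?")
PERCENT = re.compile(r"\d+(?:\.\d+)?\s?(?:%|percent)")

def find_quantities(text):
    """Return (type, mention) pairs in order of appearance in the text."""
    hits = [(m.start(), "MONEY", m.group()) for m in MONEY.finditer(text)]
    hits += [(m.start(), "PERCENT", m.group()) for m in PERCENT.finditer(text)]
    # Sort by character offset so mentions come back in reading order.
    return [(label, mention) for _, label, mention in sorted(hits)]

print(find_quantities("Profit rose 12.5% to $3.2 million, up from $2,800,000."))
```

Pattern rules like these are brittle outside the formats they were written for, which is one reason the learning-based approaches later in the deck exist.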
9
Template Element Task
  TE: a generic object and its attributes
    Person
    Organization
    Location (airport, city, country, province, region, water, etc.)
    Artifact
10
Template Relationship (TR) Task
  TR: find the relationships that exist between the template elements extracted from text
  Ex) persons and companies can be related by an employee_of relation
    Employee_of (Fletcher Maddox, UCSD Business School)
    Employee_of (Fletcher Maddox, La Jolla Genomatics)
    Product_of (Geninfo, La Jolla Genomatics)
    Location_of (La Jolla, La Jolla Genomatics)
    Location_of (CA, La Jolla Genomatics)
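One way to hold the output of the TR task is as a list of typed tuples over template elements. This is a minimal sketch using the relations listed on the slide; the `Relation` type and the helper `related_to` are hypothetical names, not part of any MUC specification.

```python
from collections import namedtuple

# A binary relation between two template elements.
Relation = namedtuple("Relation", ["name", "arg1", "arg2"])

relations = [
    Relation("employee_of", "Fletcher Maddox", "UCSD Business School"),
    Relation("employee_of", "Fletcher Maddox", "La Jolla Genomatics"),
    Relation("product_of", "Geninfo", "La Jolla Genomatics"),
    Relation("location_of", "La Jolla", "La Jolla Genomatics"),
    Relation("location_of", "CA", "La Jolla Genomatics"),
]

def related_to(entity, relations):
    """Return every relation in which the entity appears as either argument."""
    return [r for r in relations if entity in (r.arg1, r.arg2)]

for r in related_to("Fletcher Maddox", relations):
    print(r.name, r.arg1, r.arg2)
```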
11
Scenario Template
  ST: expresses domain- and task-specific entities and relations
12
Coreference Task (CO)
  CO: captures information on coreferring expressions (e.g., pronouns or any other mentions of a given entity)
  Ex) David came home from school, and saw his mother, Rachel. She told him that his father will be late.
  Identified pronominal coreference
    (David, his, him, his)
    (mother, Rachel, she)
13
IE Examples
14
Architecture of IE Systems
15
Tokenization module
  Splits an input document into its basic building blocks
  Words, sentences, and paragraphs
Morphological and lexical analysis
  Assigns POS tags to the document's various words, creates basic phrases (like noun phrases and verb phrases), and disambiguates the sense of ambiguous words and phrases
Syntactic analysis
  Establishes the connection between the different parts of each sentence by doing full parsing or shallow parsing
Domain analysis
  Combines all the information collected from the previous components and creates complete frames that describe relationships between entities
  Can include ‘anaphora resolution’
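The first of these modules can be sketched in a few lines. This is a deliberately naive illustration, assuming blank lines separate paragraphs and that sentences end at ".", "!" or "?"; the function name `tokenize` is hypothetical, and a real system would also need to handle abbreviations, numbers, and similar cases.

```python
import re

def tokenize(document):
    """Split a document into paragraphs, then sentences, then word tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Naive sentence split: break after ., ! or ? when followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # Words are runs of word characters; punctuation marks are separate tokens.
        result.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
    return result

doc = "David came home. He saw Rachel.\n\nShe smiled."
for paragraph in tokenize(doc):
    print(paragraph)
```

The nested list (paragraph, sentence, token) is what the later modules would consume for tagging and parsing.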
16
Information Flow in IE System
  Processing initial lexical content: tokenization and lexical analysis
  Proper name identification
  Shallow parsing
  Building relations
  Inferencing
17
Information Flow in IE System
  Building relations
    Using domain-specific patterns
    Ex) Company [Temporal] @Announce Connector Person PersonDetail @Appoint Position
  Inferencing
    Infer missing values to complete the identification
    Ex) John Edgar was reported to live with Nancy Leroy. His address is 101 Forest Rd., Bethlehem, PA.
      person(John Edgar)
      person(Nancy Leroy)
      livetogether(John Edgar, Nancy Leroy)
      address(John Edgar, 101 Forest Rd., Bethlehem, PA)
    Rule
      address(P2, A) :- person(P1), person(P2), livetogether(P1, P2), address(P1, A).
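The inference rule on this slide can be applied mechanically to the extracted facts. The following is a minimal sketch in Python rather than a logic language; the function name `infer_addresses` and the choice of sets and dictionaries to hold the facts are assumptions, and the rule is applied in a single pass in the P1 to P2 direction only, as written on the slide.

```python
def infer_addresses(persons, live_together, address):
    """Apply: address(P2, A) :- person(P1), person(P2), livetogether(P1, P2), address(P1, A)."""
    inferred = dict(address)  # start from the explicitly extracted addresses
    for p1, p2 in live_together:
        both_are_persons = p1 in persons and p2 in persons
        # Only fill in a missing value; an extracted address is never overwritten.
        if both_are_persons and p1 in address and p2 not in inferred:
            inferred[p2] = address[p1]
    return inferred

persons = {"John Edgar", "Nancy Leroy"}
live_together = [("John Edgar", "Nancy Leroy")]
address = {"John Edgar": "101 Forest Rd., Bethlehem, PA"}

print(infer_addresses(persons, live_together, address)["Nancy Leroy"])
```

Keeping the extracted facts separate from the inferred ones (here, by copying the dictionary) makes it easy to tell afterwards which values came from the text and which from the rule.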
18
Anaphora Resolution
  Anaphora (coreference) resolution
    Process of matching pairs of NLP expressions that refer to the same entity in the real world
  Two main approaches
    Knowledge-based approach
      Linguistic analysis of sentences
    Machine learning-based approach
      Needs an annotated corpus
19
Anaphora Resolution
  Pronominal anaphora
    Reflexive/personal/possessive pronouns
  Proper name coreference
  Apposition
  Predicate nominative
  Identical sets
  Function-value coreference
  Ordinal anaphora
  One-anaphora
  Part-whole coreference
20
Approaches to Anaphora Resolution
  Focus on pronominal resolution
  Hobbs Algorithm
    Also called the ‘Naïve Algorithm’
    Constraints
      For two candidate antecedents a and b, if a is encountered before b in the search space, then a is preferred over b.
      No two antecedents will have the same salience.
21
Approaches to Anaphora Resolution
  CogNIAC: six ordered rules
  Kennedy and Boguraev: salience algorithm
  Mitkov: scoring algorithm
    Definiteness
    Givenness
    Indicating verbs
    Lexical reiteration
    Section heading preference
    “Non-prepositional” noun phrases
    Collocation pattern preference
    Immediate reference
    Referential distance
    Domain terminology preference
22
Approaches to Anaphora Resolution
  Machine Learning Approaches
    Markables
      NLP elements such as nouns, noun phrases, or pronouns
    Features for markables
      Sentence distance
      Pronouns
      Exact match
      Definite noun phrase
      Number agreement
      Semantic agreement
      Gender agreement
      Proper name alias
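A classifier for coreference sees each pair of markables as a feature vector. The sketch below computes a subset of the features listed above; the `Markable` fields, the feature names, and the "starts with 'the'" test for definiteness are all simplifying assumptions, and semantic agreement and proper name alias are left out because they need external resources.

```python
from dataclasses import dataclass

@dataclass
class Markable:
    text: str
    sentence: int     # index of the sentence containing the markable
    is_pronoun: bool
    number: str       # "sg" or "pl"
    gender: str       # "m", "f", "n", or "unknown"

def pair_features(antecedent, anaphor):
    """Build the feature dictionary for one (antecedent, anaphor) candidate pair."""
    genders = (antecedent.gender, anaphor.gender)
    return {
        "sentence_distance": anaphor.sentence - antecedent.sentence,
        "antecedent_is_pronoun": antecedent.is_pronoun,
        "anaphor_is_pronoun": anaphor.is_pronoun,
        "exact_match": antecedent.text.lower() == anaphor.text.lower(),
        # Crude definiteness check: the phrase begins with the article "the".
        "anaphor_is_definite_np": anaphor.text.lower().startswith("the "),
        "number_agreement": antecedent.number == anaphor.number,
        # Unknown gender never counts as agreement.
        "gender_agreement": "unknown" not in genders and genders[0] == genders[1],
    }

rachel = Markable("Rachel", sentence=0, is_pronoun=False, number="sg", gender="f")
she = Markable("She", sentence=1, is_pronoun=True, number="sg", gender="f")
print(pair_features(rachel, she))
```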
23
Machine Learning Approaches
  Generating Training Examples
    Positive examples
      {M1, M2, M3, M4}: same real-world entity
      Positive examples: {M1, M2}, {M2, M3}, {M3, M4}
    Negative examples
      Assume that markables a, b, c appear between M1 and M2
      Negative examples: {a, M2}, {b, M2}, {c, M2}
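The pairing scheme can be sketched as follows, under the reading that adjacent members of a coreference chain form positive pairs and every markable lying between them is paired with the later member as a negative. The function name `training_pairs` is hypothetical, and markables are assumed to be unique, hashable labels listed in document order.

```python
def training_pairs(markables, chain):
    """markables: all markables in document order.
    chain: the markables that refer to one entity, also in document order."""
    position = {m: i for i, m in enumerate(markables)}
    positives, negatives = [], []
    for antecedent, anaphor in zip(chain, chain[1:]):
        positives.append((antecedent, anaphor))
        # Every markable between the two chain members is a negative for the anaphor.
        for between in markables[position[antecedent] + 1 : position[anaphor]]:
            negatives.append((between, anaphor))
    return positives, negatives

markables = ["M1", "a", "b", "c", "M2", "M3", "M4"]
positives, negatives = training_pairs(markables, ["M1", "M2", "M3", "M4"])
print(positives)  # [('M1', 'M2'), ('M2', 'M3'), ('M3', 'M4')]
print(negatives)  # [('a', 'M2'), ('b', 'M2'), ('c', 'M2')]
```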
24
Machine Learning Approaches
25
WHISK
  Supervised learning algorithm that uses hand-tagged examples to learn information extraction rules in the form of regular expressions
  Ex)
    Input:: * (Digit) ‘BR’ * ‘$’ (number)
    Output:: Rental {Bedrooms $1} {Price $2}
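The example rule can be approximated with an ordinary regular expression: `*` skips text up to the next term, and each parenthesized term becomes a capture group that fills one output slot. This is a hand-written translation for illustration only, not something WHISK produces; the sample ad text, the case-insensitive matching, and the function name `extract_rentals` are assumptions.

```python
import re

# "* (Digit) 'BR' * '$' (number)" rendered as a Python regex.
# The leading "*" is implicit because findall scans forward through the text.
RULE = re.compile(r"(\d)\s*BR\b.*?\$\s*(\d+)", re.IGNORECASE | re.DOTALL)

def extract_rentals(text):
    """Fill one Rental frame per match: $1 -> Bedrooms, $2 -> Price."""
    return [{"Bedrooms": int(b), "Price": int(p)} for b, p in RULE.findall(text)]

ad = "Capitol Hill - 1 br twnhme, pkg incl $675. 3 BR upper flr, grt loc $995."
print(extract_rentals(ad))
```

Because the rule is applied repeatedly over the same text, a single ad that lists several units yields several Rental frames.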
26
Machine Learning Approaches: BWI (Boosted Wrapper Induction)
27
“Boundary Detectors” are pairs of token sequences ⟨p, s⟩
  A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
  Detectors can contain wildcards, e.g. “capitalized word”, “number”, etc.
  Example: matches beginning of Date: Thursday, October 25
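The matching condition can be sketched directly from this definition. The code below is an illustration only and does not cover the boosting step of BWI; the wildcard names (`<Cap>`, `<Num>`, `<Any>`), the function names, and the example detector are made up for this sketch and are not taken from the slide.

```python
# Wildcards map a pattern symbol to a test on a single token.
WILDCARDS = {
    "<Cap>": lambda t: t[:1].isupper(),
    "<Num>": lambda t: t.isdigit(),
    "<Any>": lambda t: True,
}

def token_matches(pattern, token):
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token

def detector_matches(detector, tokens, i):
    """True iff prefix p matches the tokens ending at boundary i
    and suffix s matches the tokens starting at boundary i."""
    prefix, suffix = detector
    before = tokens[max(0, i - len(prefix)):i]
    after = tokens[i:i + len(suffix)]
    if len(before) != len(prefix) or len(after) != len(suffix):
        return False  # not enough context on one side of the boundary
    return (all(token_matches(p, t) for p, t in zip(prefix, before))
            and all(token_matches(s, t) for s, t in zip(suffix, after)))

tokens = ["Date", ":", "Thursday", ",", "October", "25"]
detector = (["Date", ":"], ["<Cap>"])  # hypothetical detector
print([i for i in range(len(tokens) + 1) if detector_matches(detector, tokens, i)])
```

Boundary `i` sits between `tokens[i - 1]` and `tokens[i]`, so the detector above fires only at the position just before "Thursday".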
28
Machine Learning Approaches: (LP)2 Algorithm
  Induces two sets of rules
    Tagging rules
      Ex) stime (start time of a seminar)
    Correction rules
      Ex) “at 4 pm => “at 4 pm
29
Evaluation of IE systems

  Slot         BWI      HMM      (LP)2    WHISK
  Speaker      67.7%    76.6%    77.6%    18.3%
  Location     76.7%    78.6%    75.0%    66.4%
  Start Time   99.6%    98.5%    99.0%    92.6%
  End Time     93.9%    62.1%    95.5%    86%
30
Structural IE
  Introduction
    Considers structural or visual characteristics of the text
      Ex) font type, size, location
    A complement to conventional IE (text mining)
    Called ‘Visual Information Extraction (VIE)’
31
Structural IE
  VIE procedure
    Group the primitive elements into meaningful objects (e.g., lines, paragraphs, etc.)
    Establish the hierarchical structure among these objects
    Compare the structure of the query document with the structure of the training document to find the objects corresponding to the target fields
32
Object Tree
33
Object Tree Generation
  Fit(Y, X): a measure of how well Y fits as an additional member of X
  Figure labels: X, Y, paragraph, line
34
Computing Similarity in O-tree
35
Finding the target fields
36
Templates
38
Browsing
39
Topic distribution Browsing USA, UK => acq 42/19.09%
40
Browsing and filtering associations
41
Browsing associations
42
Taxonomy (Topic Hierarchy) Management
43
Taxonomy Editor
44
Clustering Display using Concept Hierarchy
45
Query Construction