
1 2007. 11. 14

2 Introduction  Information Extraction (IE)  A limited form of “complete text comprehension”  Extracts entities and relationships from documents  Relationship => fact, event  Fact: static  Event: dynamic  Document => entity-relationship model or frame, i.e., a structured object

3 Schematic view of IE

4 Information Extraction  Simple IE system  Term extraction  Complex IE system  Frame generation

5 Data Elements of IE  Entities  Basic building blocks  Ex) people, locations, genes, and drugs  Attributes  Features of the extracted entities  Facts  Relations that exist between entities  Ex) an employment relationship between a person and a company, or phosphorylation between two proteins  Events  An activity or occurrence of interest in which entities participate, such as a terrorist act, a merger between two companies, a birthday, and so on

6 Data Elements of IE

7 MUC IE Tasks  MUC  Message Understanding Conference  Sponsored by DARPA (Defense Advanced Research Projects Agency)  MUC tasks  Named Entity Recognition (NE)  Template Element (TE) Task  Template Relationship (TR) Task  Scenario Template (ST) Task  Coreference (CO) Task

8 Named Entity Recognition  NER  Identify all mentions of proper names and quantities in the text  People names, geographic locations, and organizations  Dates and times  Monetary amounts and percentages  Distribution in the MUC corpora  Proper names: 70%  Organizations: 45~50%  Locations: 12~32%  People: 23~39%  Dates and times: 25%  Monetary amounts and percentages: 5%
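The quantity classes above are the ones most amenable to simple surface patterns. A minimal sketch of rule-based quantity tagging (the regexes below are illustrative assumptions for this example, not the rules of any MUC system):

```python
import re

# Toy patterns for MONEY / PERCENT / DATE mentions (assumed, for illustration).
PATTERNS = {
    "MONEY":   re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "DATE":    re.compile(r"\b(?:January|February|March|April|May|June|July|"
                          r"August|September|October|November|December)\s\d{1,2}"
                          r"(?:,\s\d{4})?\b"),
}

def tag_quantities(text):
    """Return (label, surface string) pairs for every pattern match."""
    found = []
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            found.append((label, m.group()))
    return found
```

For example, `tag_quantities("The deal, worth $1.5 million, closed on October 25, 2007, lifting shares 12%.")` picks out the monetary amount, the date, and the percentage; proper-name classes (people, organizations) need much richer evidence than regexes.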

9 Template Element (TE) Task  TE  A generic object and its attributes  Person  Organization  Location (airport, city, country, province, region, water, etc.)  Artifact

10 Template Relationship (TR) Task  TR  Find the relationships that exist between the template elements extracted from the text  Ex) persons and companies can be related by an employee_of relation  Employee_of(Fletcher Maddox, UCSD Business School)  Employee_of(Fletcher Maddox, La Jolla Genomatics)  Product_of(Geninfo, La Jolla Genomatics)  Location_of(La Jolla, La Jolla Genomatics)  Location_of(CA, La Jolla Genomatics)
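A minimal sketch of pattern-based relation extraction in the spirit of the employee_of example. The appositive pattern below ("PERSON, a ROLE at ORG") is an assumption for illustration, not the actual TR component of any MUC system:

```python
import re

# Assumed appositive pattern: "Fletcher Maddox, a professor at the UCSD Business School"
EMPLOYEE_PAT = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+),\s+"       # two capitalized tokens
    r"(?:a|an|the)\s+\w+\s+(?:at|of)\s+(?:the\s+)?"  # "a professor at (the)"
    r"(?P<org>[A-Z][\w .]+?)(?=[,.])"                # org name up to punctuation
)

def extract_employee_of(text):
    """Return (relation, person, organization) triples found in the text."""
    return [("employee_of", m.group("person"), m.group("org"))
            for m in EMPLOYEE_PAT.finditer(text)]
```

On the slide's running example, `extract_employee_of("Fletcher Maddox, a professor at the UCSD Business School, announced the formation of La Jolla Genomatics.")` yields `[("employee_of", "Fletcher Maddox", "UCSD Business School")]`.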

11 Scenario Template (ST) Task  ST: expresses domain- and task-specific entities and relations

12 Coreference Task (CO)  CO: captures information on coreferring expressions (e.g., pronouns or any other mentions of a given entity)  Ex)  David came home from school, and saw his mother, Rachel. She told him that his father will be late.  Identified pronominal coreference  (David, his, him, his)  (mother, Rachel, She)

13 IE Examples

14 Architecture of IE Systems

15  Tokenization module  Splits an input document into its basic building blocks  Words, sentences, and paragraphs  Morphological and lexical analysis  Assigns POS tags to the document's words, creates basic phrases (like noun phrases and verb phrases), and disambiguates the sense of ambiguous words and phrases  Syntactic analysis  Establishes the connections between the different parts of each sentence by doing full parsing or shallow parsing  Domain analysis  Combines all the information collected by the previous components and creates complete frames that describe the relationships between entities  Can include ‘anaphora resolution’
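The module chain above can be sketched as a pipeline of functions. The stages below are toy stand-ins chosen for illustration (a real system would use a trained POS tagger and a parser, not these heuristics):

```python
def tokenize(doc):
    """Split a document into sentences, then whitespace tokens."""
    return [s.split() for s in doc.split(". ") if s]

def tag(sentences):
    """Toy lexical analysis: capitalized -> NNP, digits -> CD, else NN."""
    def pos(tok):
        if tok[0].isupper():
            return (tok, "NNP")
        if tok.isdigit():
            return (tok, "CD")
        return (tok, "NN")
    return [[pos(t) for t in sent] for sent in sentences]

def chunk(tagged):
    """Toy shallow parse: maximal runs of NNP form candidate name phrases."""
    phrases = []
    for sent in tagged:
        run = []
        for tok, t in sent:
            if t == "NNP":
                run.append(tok)
            elif run:
                phrases.append(" ".join(run))
                run = []
        if run:
            phrases.append(" ".join(run))
    return phrases

def pipeline(doc):
    return chunk(tag(tokenize(doc)))
```

For example, `pipeline("Fletcher Maddox joined La Jolla Genomatics in 1999")` returns the candidate names `["Fletcher Maddox", "La Jolla Genomatics"]`, which the domain-analysis stage would then assemble into frames.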

16 Information Flow in IE System  Processing initial lexical content: Tokenization and Lexical Analysis  Proper name identification  Shallow parsing  Building relations  Inferencing

17 Information Flow in IE System  Building relations  Using domain-specific patterns  Ex)  Company [Temporal] @ Announce Connector Person PersonDetail @Appoint Position  Inferencing  Infer missing values to complete the extracted facts  Ex)  John Edgar was reported to live with Nancy Leroy. His address is 101 Forest Rd., Bethlehem, PA.  Person(John Edgar)  Person(Nancy Leroy)  Livetogether(John Edgar, Nancy Leroy)  Address(John Edgar, 101 Forest Rd., Bethlehem, PA)  Address(P2,A) :- person(P1), person(P2), livetogether(P1, P2), address(P1,A)
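The inference step above can be sketched as a forward-chaining loop that applies the rule address(P2, A) :- livetogether(P1, P2), address(P1, A) until no new facts are derived (a minimal sketch over in-memory facts, not a full logic engine):

```python
def infer_addresses(livetogether, address):
    """livetogether: set of (p1, p2) pairs; address: dict person -> address.
    Repeatedly copy a known address across a livetogether link (in either
    direction) until a fixed point is reached."""
    changed = True
    while changed:
        changed = False
        # treat livetogether as symmetric
        for p1, p2 in list(livetogether) + [(b, a) for a, b in livetogether]:
            if p1 in address and p2 not in address:
                address[p2] = address[p1]   # derive the missing value
                changed = True
    return address
```

With the slide's facts, `infer_addresses({("John Edgar", "Nancy Leroy")}, {"John Edgar": "101 Forest Rd., Bethlehem, PA"})` fills in Nancy Leroy's address.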

18 Anaphora Resolution  Anaphora (coreference) resolution  The process of matching pairs of NLP expressions that refer to the same entity in the real world  Two main approaches  Knowledge-based approach  Linguistic analysis of sentences  Machine learning-based approach  Needs an annotated corpus

19 Anaphora Resolution  Pronominal anaphora  Reflexive/personal/possessive pronouns  Proper name coreference  Apposition  Predicative nominative  Identical sets  Function-value coreference  Ordinal anaphora  One-anaphora  Part-whole coreference

20 Approaches to Anaphora Resolution  Focus on pronominal resolution  Hobbs Algorithm  Also called ‘Naïve Algorithm’  Constraints  For two candidate antecedents a and b, if a is encountered before b in the search space, then a is preferred over b.  No two antecedents will have the same salience.

21 Approaches to Anaphora Resolution  CogNIAC  Six ordered rules  Kennedy and Boguraev  Salience algorithm  Mitkov  Scoring algorithm  Definiteness  Givenness  Indicating verbs  Lexical reiteration  Section heading preference  “Non-prepositional” noun phrases  Collocation pattern preference  Immediate reference  Referential distance  Domain terminology preference

22 Approaches to Anaphora Resolution  Machine Learning Approaches  Markables  NLP elements such as nouns, noun phrases, or pronouns  Features for markables  Sentence distance  Pronouns  Exact match  Definite noun phrase  Number agreement  Semantic agreement  Gender agreement  Proper name  Alias

23 Machine Learning Approaches  Generating Training Examples  Positive examples  {M1, M2, M3, M4} : mentions of the same real-world entity  Positive examples: {M1, M2}, {M2, M3}, {M3, M4}  Negative examples  Assume that markables a, b, c appear between M1 and M2  Negative examples: {a, M2}, {b, M2}, {c, M2}
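The pairing scheme above can be sketched as follows, assuming markables are given in document order (this is the Soon et al.-style generation scheme commonly used by such learners):

```python
def make_pairs(chain, all_markables):
    """chain: the coreferent markables in document order (a subset of
    all_markables, which is also in document order).
    Returns (positive_pairs, negative_pairs)."""
    pos, neg = [], []
    for m1, m2 in zip(chain, chain[1:]):
        pos.append((m1, m2))                       # adjacent coreferent pair
        i, j = all_markables.index(m1), all_markables.index(m2)
        for between in all_markables[i + 1 : j]:   # non-antecedents in between
            neg.append((between, m2))
    return pos, neg
```

With the slide's example, `make_pairs(["M1", "M2"], ["M1", "a", "b", "c", "M2"])` yields the positive pair {M1, M2} and the negative pairs {a, M2}, {b, M2}, {c, M2}.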

24 Machine Learning Approaches

25  WHISK  A supervised learning algorithm that uses hand-tagged examples to learn information extraction rules  Rules are expressed as regular expressions  Ex)  Pattern:: * (Digit) ‘BR’ * ‘$’ (Number)  Output:: Rental {Bedrooms $1} {Price $2}
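The WHISK rule above can be rendered as an ordinary Python regular expression (a hedged sketch of what the rule does; WHISK's own pattern language and matcher differ in detail):

```python
import re

# "* ( Digit ) 'BR' * '$' ( Number )": skip, capture digits, literal BR,
# skip, literal $, capture a number.
RENTAL_RULE = re.compile(r".*?(\d+)\s*BR.*?\$\s*(\d[\d,]*)")

def apply_rule(text):
    """Fill the Rental frame {Bedrooms $1} {Price $2}, or return None."""
    m = RENTAL_RULE.search(text)
    if m:
        return {"Bedrooms": m.group(1), "Price": m.group(2)}
    return None
```

On a classified-ad string such as "Capitol Hill - 1 BR twnhme. Pkg incl $675.", the rule binds $1 to the bedroom count and $2 to the price.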

26 Machine Learning Approaches: BWI (Boosted Wrapper Induction)

27  “Boundary detectors” are pairs of token sequences ⟨p, s⟩  A detector matches a boundary iff p matches the text before the boundary and s matches the text after it  Detectors can contain wildcards, e.g., “capitalized word”, “number”, etc.  Example: a detector that matches the beginning of the date in “Date: Thursday, October 25”
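A boundary detector ⟨prefix, suffix⟩ can be sketched as token-sequence matching with wildcards. The wildcard names "&lt;CAP&gt;" and "&lt;NUM&gt;" below are assumptions for this example, standing in for BWI's "capitalized word" and "number" wildcards:

```python
def tok_match(pattern, token):
    """Match one pattern element: a wildcard or a literal token."""
    if pattern == "<CAP>":
        return token[:1].isupper()
    if pattern == "<NUM>":
        return token.isdigit()
    return pattern == token

def detector_matches(prefix, suffix, tokens, i):
    """True iff the boundary before tokens[i] is matched: prefix matches the
    tokens ending at position i, suffix matches the tokens starting at i."""
    if i < len(prefix) or i + len(suffix) > len(tokens):
        return False
    before = tokens[i - len(prefix): i]
    after = tokens[i: i + len(suffix)]
    return (all(tok_match(p, t) for p, t in zip(prefix, before)) and
            all(tok_match(s, t) for s, t in zip(suffix, after)))
```

For the slide's example, the detector ⟨["Date", ":"], ["&lt;CAP&gt;"]⟩ fires at the boundary just before "Thursday" in the token sequence "Date : Thursday , October 25".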

28 Machine Learning Approaches: (LP) 2 Algorithm  Induces two sets of rules  Tagging rules  Ex) insert <stime> tags (start time of a seminar)  Correction rules  Shift tags that the tagging rules misplaced  Ex) “at <stime> 4 </stime> pm” => “at <stime> 4 pm </stime>”

29 Evaluation of IE systems

Slot        BWI     HMM     (LP)2   WHISK
Speaker     67.7%   76.6%   77.6%   18.3%
Location    76.7%   78.6%   75.0%   66.4%
Start Time  99.6%   98.5%   99.0%   92.6%
End Time    93.9%   62.1%   95.5%   86%

30 Structural IE  Introduction  Considers the structural or visual characteristics of the text  E.g.) font type, size, and location  A complement to conventional (plain-text) IE  Also called ‘Visual Information Extraction (VIE)’

31 Structural IE  VIE procedure  Group the primitive elements into meaningful objects (e.g., lines, paragraphs, etc.)  Establish the hierarchical structure among these objects  Compare the structure of the query document with the structure of the training document to find the objects corresponding to the target fields

32 Object Tree

33 Object Tree Generation  Fit(Y, X): a measure of how well Y fits as an additional member of X (e.g., a line Y joining a paragraph X)

34 Computing Similarity in O-tree

35 Finding the target fields

36 Templates

37

38 Browsing

39 Browsing  Topic distribution (example figure: USA, UK => acq, 42 / 19.09%)

40 Browsing and filtering associations

41 Browsing associations

42 Taxonomy (Topic Hierarchy) Management

43 Taxonomy Editor

44 Clustering Display using Concept Hierarchy

45 Query Construction

