Information Extraction
CS 652: Information Extraction and Integration
Outline
- Information Extraction (IE) task
- Information Retrieval (IR) and IE
- History of IE
- Evaluation metrics
- Approaches to IE
- Free, structured, and semistructured text
- Web documents
- IE systems
- Discussion
IR and IE
- IR: retrieves relevant documents from collections; draws on information theory, probability theory, and statistics
- IE: extracts relevant information from documents; draws on computational linguistics and natural language processing
History of IE
- Large amounts of textual data, both online and offline
- Message Understanding Conference (MUC): quantitative evaluation of IE systems
- MUC tasks: Latin American terrorism, joint ventures, microelectronics, company management changes
Evaluation Metrics
- Precision (P)
- Recall (R)
- F-measure
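These metrics can be computed directly from a system's extractions against an answer key. A minimal sketch, using a hypothetical answer key and treating F-measure as the balanced F1 (harmonic mean of precision and recall); MUC-style scoring also allowed weighting the two differently:

```python
# Precision, recall, and F1 for an IE run, computed against a
# hypothetical gold-standard answer key (illustrative data only).
def evaluate(extracted, gold):
    """Return (precision, recall, f1) for sets of extracted vs. gold items."""
    correct = len(set(extracted) & set(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"Carlos", "Parliament building", "bombing"}
extracted = {"Carlos", "Parliament building", "explosion"}
p, r, f = evaluate(extracted, gold)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=0.67 F1=0.67
```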
Approaches to IE
- Knowledge engineering approach
  - Grammars are constructed by hand
  - Domain patterns are discovered by human experts through introspection and inspection of a corpus
  - Much laborious tuning and "hill climbing"
- Automatic training approach
  - Use statistical methods where possible
  - Learn rules from annotated corpora
  - Learn rules from interaction with the user
Knowledge Engineering
- Advantages
  - With skill and experience, well-performing systems are not conceptually hard to develop
  - The best-performing systems have been hand-crafted
- Disadvantages
  - Very laborious development process
  - Some changes to specifications can be hard to accommodate
  - The required expertise may not be available
Automatic Training
- Advantages
  - Domain portability is relatively straightforward
  - System expertise is not required for customization
  - "Data-driven" rule acquisition ensures full coverage of the examples
- Disadvantages
  - Training data may not exist and may be very expensive to acquire
  - A large volume of training data may be required
  - Changes to specifications may require reannotating large quantities of training data
Texts
- Free text: requires natural language processing
- Structured text: textual information in a database or file following a predefined, strict format
- Semistructured text: ungrammatical, telegraphic (e.g., many Web documents)
Web Document Categorization [Hsu, 1998]
- Structured: itemized information with uniform syntactic clues (e.g., delimiters, attribute order)
- Semistructured: e.g., missing attributes, multi-valued attributes
- Unstructured: e.g., linguistic knowledge is required
Free-Text IE Systems
- AutoSlog
- LIEP
- PALKA
- HASTEN
- CRYSTAL
- WebFoot
- WHISK
AutoSlog [1993]
Example: The Parliament building was bombed by Carlos.
LIEP [1995]
Example: The Parliament building was bombed by Carlos.
PALKA [1995]
Example: The Parliament building was bombed by Carlos.
HASTEN [1995]
Example: The Parliament building was bombed by Carlos.
Egraphs: (SemanticLabel, StructuralElement) pairs
CRYSTAL [1995]
Example: The Parliament building was bombed by Carlos.
CRYSTAL + WebFoot [1997]
WHISK [1999]
Example: The Parliament building was bombed by Carlos.
WHISK rule: *( PhyObj ) *@passive *F 'bombed' * {PP 'by' *F ( Person )}
Context-based patterns
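A WHISK rule skips intervening text (`*`), captures slot fillers in parentheses, and anchors on literals and syntactic tags. As a rough sketch only, the rule above can be approximated with a regular expression; the `PHY_OBJ` and `PERSON` word lists below are invented stand-ins for WHISK's semantic classes, and the `@passive` constraint is flattened into the literal "was ... bombed":

```python
import re

# Hypothetical mini semantic classes standing in for WHISK's PhyObj and
# Person tags; a real system would draw these from a semantic lexicon.
PHY_OBJ = r"(?:[Tt]he\s+\w+\s+building|bridge|embassy)"
PERSON = r"[A-Z]\w+"

# Loose regex rendering of:
#   *( PhyObj ) *@passive *F 'bombed' * {PP 'by' *F ( Person )}
# Parenthesized groups play the role of the rule's extraction slots.
pattern = re.compile(rf"({PHY_OBJ})\s+was\s+bombed\s+by\s+({PERSON})")

m = pattern.search("The Parliament building was bombed by Carlos.")
if m:
    target, perpetrator = m.group(1), m.group(2)
    print(target, "|", perpetrator)  # The Parliament building | Carlos
```

This captures only the surface behavior on the running example; actual WHISK rules operate over syntactically analyzed, tagged text rather than raw strings.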
Comparison (extraction granularity)
- Dimensions compared: semantic class constraint, single-slot rule, multi-slot rule, syntactic constraints
- Systems compared: AutoSlog, LIEP, PALKA, HASTEN, CRYSTAL, WHISK
Web Documents
- Semistructured and unstructured
  - RAPIER (E. Califf, 1997)
  - SRV (D. Freitag, 1998)
  - WHISK (S. Soderland, 1998)
- Semistructured and structured
  - WIEN (N. Kushmerick, 1997)
  - SoftMealy (C.-H. Hsu, 1998)
  - STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning
- Task: inductive inference
- Learning systems
  - Zero-order
  - First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997]
- Based on Inductive Logic Programming
- Extraction rules use syntactic and semantic information
- Advantage: efficient (bottom-up) learning
- Drawback: single-slot extraction only
RAPIER Rule
SRV [1998]
- Relational (top-down) learning algorithm
- Features
  - Simple features (e.g., token length, character type)
  - Relational features (e.g., next-token)
- Advantage: expressive rule representation
- Drawbacks
  - Single-slot rule generation
  - Requires a large volume of training data
SRV Rule
WHISK [1998]
- Covering (top-down) algorithm
- Advantages
  - Learns multi-slot extraction rules
  - Handles varying order of the items to be extracted
  - Handles document types ranging from free text to structured text
- Drawbacks
  - Must see all permutations of the items
  - Less expressive feature set
  - Requires a large volume of training data
WHISK Rule
Wrapper Induction
- Wrapper: an IE application for one particular information source
- Delimiter-based rules
- No linguistic constraints
WIEN [1997]
- Assumes items always appear in a fixed, known order
- Introduces several types of wrappers
- Advantage: fast to learn and fast to extract
- Drawbacks
  - Cannot handle permutations or missing items
  - Entire pages must be labeled
  - Does not use semantic classes
WIEN Rule
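The simplest WIEN wrapper class pairs each attribute with left and right delimiter strings and applies the pairs in fixed order down the page. A minimal sketch of that delimiter-based extraction loop; the HTML fragment and delimiter choices are illustrative, not taken from WIEN itself:

```python
# Sketch of an LR-style wrapper: each attribute k is extracted by a
# (left_k, right_k) delimiter pair, applied in fixed, known order.
def lr_extract(page, delimiters):
    """Return the list of tuples found by scanning delimiter pairs in order."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no more tuples on the page
                return tuples
            start += len(left)
            end = page.find(right, start)
            row.append(page[start:end])
            pos = end + len(right)   # continue scanning after this item
        tuples.append(tuple(row))

page = "<b>France</b><i>Paris</i><b>Japan</b><i>Tokyo</i>"
rule = [("<b>", "</b>"), ("<i>", "</i>")]
print(lr_extract(page, rule))  # [('France', 'Paris'), ('Japan', 'Tokyo')]
```

The fixed-order scan is exactly why this wrapper class fails on permuted or missing items, as noted above.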
SoftMealy [1998]
- Learns a finite-state transducer
- Advantages
  - Learns the order of items
  - Allows item permutations and missing items
  - Allows both semantic classes and disjunctions
- Drawbacks
  - Must see all possible permutations
  - Cannot use delimiters that do not immediately precede and follow the relevant items
SoftMealy Rule
STALKER [1998, 1999, 2001]
- Hierarchical information extraction
- Embedded Catalog Tree (ECT) formalism
- Advantages
  - Extracts nested data
  - Allows item permutations and missing items
  - Need not see all of the permutations
  - One hard-to-extract item does not affect the others
- Drawback: does not exploit item order
STALKER Rule
Applications
- Product descriptions (ShopBot)
- Restaurant guides (STALKER)
- Seminar announcements (SRV)
- Job advertisements (RAPIER)
- Executive succession (WHISK)
Commercial Systems
- Junglee [1996]
- Jango [1997]
- MySimon [1998]
- …
Discussion