Published by Jesse Copeland. Modified over 9 years ago.
1
Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw
2
Outline Problem Definition of Information Extraction Semi-structured IE Plain Text Information Extraction Methods Specially designed programming language W4F, Xwrap, Lixto Supervised learning approach WIEN, Softmealy, Stalker Unsupervised learning approach IEPAD Semi-supervised learning approach OLERA Summary and Future Work
3
Introduction Information Extraction (IE) identifies relevant information in documents, pulling information from a variety of sources and aggregating it into a homogeneous form. The output template of the IE task: Several fields (slots) Several instances of a field
4
Problem Definition Plain Text Information Extraction The task of locating specific pieces of data in a natural language document To obtain useful structured information from unstructured text DARPA’s MUC program Semi-structured IE Different from traditional IE The necessity of extracting and integrating data from multiple Web-based sources e.g. generating 1000 wrappers/extractors
5
Types of IE from MUC Named Entity recognition (NE) Finds and classifies names, places, etc. Coreference Resolution (CO) Identifies identity relations between entities in texts. Template Element construction (TE) Adds descriptive information to NE results. Scenario Template production (ST) Fits TE results into specified event scenarios.
6
IE from Semi-structured Documents Output Template: k-tuple Multiple instances of a field Missing data Several permutations of attributes
7
Specially Designed Programming Language Programming by users General programming language Specially designed programming language W4F, Xwrap, Lixto How? Observing common delimiters as landmarks Writing extraction rules
8
Supervised Learning Approach Wrapper induction: WIEN (IJCAI-97; Kushmerick, Weld, Doorenbos), SoftMealy (IJCAI-99; Hsu), STALKER (AA-99; Muslea, Minton, Knoblock) Key components of IE systems: Interface for labeling Learning algorithm Extraction rules: Rule format Extractor
9
Example Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
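The labels above can be produced by a delimiter-based (left/right) extractor of the kind wrapper induction learns. The following is a minimal sketch, not WIEN's actual implementation: the HTML page and the delimiter pairs (`<B>`/`</B>`, `<I>`/`</I>`) are assumptions chosen for illustration, and the function name `lr_extract` is hypothetical.

```python
def lr_extract(page, delimiters):
    """Single-pass, single-loop extraction with one (left, right) delimiter pair per attribute."""
    records, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record starts here
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # incomplete record is dropped
            row.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(row))


# Hypothetical page; the real markup is not given in the slides.
page = (
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
)
labels = lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])
print(labels)
```

In a supervised setting the delimiter pairs would be learned from labeled pages rather than written by hand.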
10
Labeling Start and end positions for Scope Record Attribute Example
11
Learning Algorithm Token hierarchy for generalization Background knowledge Learning Algorithms Rule expression Delimiter-based Consecutive landmark Sequential landmark Context rule
12
Extractor Architecture WIEN Single-pass Single-loop, no branch STALKER Multi-pass Bi-directional scanning Softmealy Single-pass or multi-pass Finite-state transducer
13
Pattern-discovery based IE (Unsupervised Learning Approach) Motivation Display of multiple records often forms a repeated pattern The occurrences of the pattern are spaced regularly and adjacently Now the problem becomes... Find regular and adjacent repeats in a string
14
IEPAD Architecture [Figure. Components: Pattern Discoverer, Pattern Viewer, Extractor. Labels: HTML Page, Patterns, Users, Extraction Rule, HTML Pages, Extraction Results]
15
The Pattern Generator Translator PAT tree construction Pattern validator Rule Composer [Figure: HTML Page -> Token Translator -> A Token String -> PAT Tree Constructor -> PAT trees and Maximal Repeats -> Validator -> Advanced Patterns -> Rule Composer -> Extraction Rules]
16
1. Web Page Translation Encoding of HTML source Rule 1: Each tag is encoded as a token Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) HTML Example: Congo 242 Egypt 20 Encoded token string T( )T(_)T( )T( )T(_)T( )T( )
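The two encoding rules can be sketched in a few lines. This is an illustrative sketch only: the tag names in the sample input are assumptions (the slide's markup did not survive), skipping whitespace-only text is an assumption, and `encode` is a hypothetical name.

```python
import re

# A piece is either a whole tag or a run of text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def encode(html):
    """Rule 1: each tag becomes a token. Rule 2: text between tags becomes '_'."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            # Keep only the tag name (attributes dropped), normalized to upper case.
            m = re.match(r"</?\w+", piece)
            tokens.append((m.group(0) if m else piece.rstrip(">")).upper() + ">")
        elif piece.strip():
            # Assumption: whitespace-only text is ignored.
            tokens.append("_")
    return tokens


tokens = encode("<b>Congo</b> <i>242</i><br>")
print("".join(f"T({t})" for t in tokens))
```

Dropping attributes and text content is what lets records with different data collapse to the same token sequence.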
17
2. PAT Tree Construction PAT tree: binary suffix tree A Patricia tree constructed over all possible suffix strings of a text Example token codes: T( ) 000, T( ) 001, T( ) 010, T( ) 011, T( ) 100, T(_) 110 Token string T( )T(_)T( )T( )T(_)T( )T( ) is encoded as the binary string 000110001010110011100
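To make the binary encoding concrete, the sketch below maps a token string to the 3-bit codes listed on the slide and enumerates the suffixes that a PAT tree would index. The tag names attached to each code are assumptions, and a sorted suffix list is used here as a simple stand-in for the Patricia tree itself.

```python
# Codes from the slide; the tag names paired with them are assumed for illustration.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}


def to_binary(tokens):
    """Concatenate the fixed-length code of every token."""
    return "".join(CODES[t] for t in tokens)


def sorted_suffixes(tokens):
    """Suffixes start only at token boundaries; sorting groups shared prefixes together."""
    return sorted((to_binary(tokens[i:]), i) for i in range(len(tokens)))


tokens = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
binary = to_binary(tokens)
print(binary)
for bits, position in sorted_suffixes(tokens):
    print(position, bits)
```

Suffixes that share a long common prefix end up adjacent in the sorted list, which is the property a PAT tree exploits to find repeats without comparing all pairs.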
18
The Constructed PAT Tree
19
Definition of Maximal Repeats Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$. $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$. $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$. $\alpha$ is a maximal repeat if it is both left maximal and right maximal.
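The definition can be checked directly with a brute-force search. This sketch enumerates every substring, so it is only suitable for short strings; a PAT tree finds the same repeats far more efficiently. Treating the string boundary as a character that differs from every real character is an assumption of this sketch.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1):
    """Return {substring: positions} for substrings that are left and right maximal."""
    n = len(s)
    positions = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            positions[s[i:j]].append(i)

    result = {}
    for sub, occ in positions.items():
        if len(occ) < 2:
            continue  # not a repeat
        # None stands for "before the start" or "past the end" of the string.
        lefts = {s[p - 1] if p > 0 else None for p in occ}
        rights = {s[p + len(sub)] if p + len(sub) < n else None for p in occ}
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = occ
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats["adc"])
```

Here "adc" qualifies because its occurrences are preceded and followed by differing characters, while "ad" does not because it is always followed by "c".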
20
3. Pattern Validator Suppose the occurrences of a maximal repeat are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence. Characteristics of a Pattern Regularity: Variance coefficient Adjacency: Density
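The slide names the two measures without giving formulas. The sketch below assumes regularity is the coefficient of variation of the gaps between consecutive occurrences (population standard deviation divided by mean) and density is (k - 1) * |pattern| / (p_k - p_1); both formulas and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Assumed: coefficient of variation of gaps; 0 means perfectly regular spacing."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_length):
    """Assumed: fraction of the span from first to last occurrence covered by the pattern."""
    k = len(positions)
    return (k - 1) * pattern_length / (positions[-1] - positions[0])


# Occurrences of "adc" in "adcwbdadcxbadcxbdadcb".
occ = [0, 6, 11, 17]
print(regularity(occ), density(occ, 3))
```

Under these assumed formulas, a pattern whose occurrences tile the string back to back has regularity 0 and density 1, while "adc" above has density 9/17 because other tokens sit between its occurrences.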
21
4. Rule Composer Problem Patterns with density less than 1 can extract only part of the information Solution Align k-1 substrings among the k occurrences A natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
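The two-string case mentioned above can be sketched as a standard dynamic program with traceback. Unit costs for mismatch and gap are an assumption of this sketch; the slides do not state the scoring used.

```python
def align(a, b, gap="-"):
    """Global alignment of two strings in O(n*m) time; returns (aligned_a, aligned_b, cost)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or mismatch
                cost[i - 1][j] + 1,                            # gap in b
                cost[i][j - 1] + 1,                            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), cost[n][m]


print(align("adcwbd", "adcxb"))
```

Multiple string alignment extends this idea to more than two strings, for example by aligning each substring against a growing profile.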
22
Multiple String Alignment Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb". If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":
    a d c w b d
    a d c x b -
    a d c x b d
The extraction pattern can be generalized as "adc[w|x]b[d|-]"
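Given aligned rows like the ones above, generalizing them into a pattern is a column-by-column operation. A minimal sketch, assuming the alignment has already been computed and that a gap is written as "-"; `compose_rule` is a hypothetical name.

```python
def compose_rule(rows, gap="-"):
    """Merge equal-length aligned rows into one pattern with [x|y] alternatives."""
    parts = []
    for column in zip(*rows):
        symbols = []
        for ch in column:
            if ch not in symbols:
                symbols.append(ch)  # keep first-seen order
        # Stable sort that moves the gap symbol, if present, to the end.
        symbols.sort(key=lambda ch: ch == gap)
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


rule = compose_rule(["adcwbd", "adcxb-", "adcxbd"])
print(rule)
```

A column containing the gap symbol marks a token that may be missing in some records.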
23
Pattern Viewer / User Interface Java-application based GUI Web based GUI http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
24
The Extractor Matching the pattern against the encoded token string Knuth-Morris-Pratt’s algorithm Boyer-Moore’s algorithm Alternatives in a rule: matching the longest pattern What is extracted? The whole record
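As an illustration of the matching step, here is a sketch of Knuth-Morris-Pratt search over a token sequence. It handles only a fixed pattern without alternatives, so it is a simplification of what the extractor needs; the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


hits = kmp_find_all("adcwbdadcxbadcxbdadcb", "adc")
print(hits)
```

Because the functions only index and compare elements, they work on lists of tag tokens as well as on plain strings.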
25
Problem Deals only with multi-record pages Many patterns are composed due to Multiple string alignment Unknown start position Alignment error due to ignored text strings
26
Semi-supervised approach: OLERA A universal method for wrapping both single-record and multi-record pages OnLine Extraction Rule Analysis Drill-down/Roll-up operations Encoding hierarchy (What would you do?)
27
OLERA’s Framework Three simple operations: Block enclosing, Drill-down/Roll-up, Attribute designation [Figure. Components: Page Encoder, Approximate Matching, Multiple String Alignment. Labels: doc, Extraction Patterns]
28
Block Enclosing Multiple single-record pages
29
Enclosing (Cont.) Different from labeling The number of enclosing operations is far smaller than the number of training pages Encoding Approximate Matching Extension of global string alignment String Alignment Enhanced matching function
30
Attribute Designation
31
Drill-down/Roll-up Drill-down Encoding Multiple String Alignment Each column is given an identifier: 8_0, 8_1, 8_2 for a drill-down operation on column 8 Roll-up Several columns can be concatenated together The corresponding identifiers are recorded
32
Extractors Grammar Signature representation for alignment result Each drill-down and roll-up operation The columns to be extracted for each attribute Matching signature pattern in testing pages Variation of approximate matching Insertions and mismatches are not allowed Deletion is allowed only if indicated in the signature pattern
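The matching constraints above can be sketched as follows. The signature representation used here, a list of (allowed tokens, optional) columns, is a hypothetical simplification and not OLERA's actual grammar; `match_signature` is likewise a made-up name.

```python
def match_signature(tokens, signature, start=0):
    """Return the end index of a match beginning at `start`, or None.

    Each signature column is (allowed_tokens, optional). A column may be skipped
    (deletion) only if it is marked optional; extra tokens (insertion) and tokens
    outside the allowed set (mismatch) cause the match to fail.
    """
    def walk(col, pos):
        if col == len(signature):
            return pos
        allowed, optional = signature[col]
        # Prefer consuming a token, which yields the longest match first.
        if pos < len(tokens) and tokens[pos] in allowed:
            end = walk(col + 1, pos + 1)
            if end is not None:
                return end
        if optional:
            return walk(col + 1, pos)
        return None

    return walk(0, start)


# Hypothetical signature equivalent to the pattern adc[w|x]b[d|-].
signature = [({"a"}, False), ({"d"}, False), ({"c"}, False),
             ({"w", "x"}, False), ({"b"}, False), ({"d"}, True)]
print(match_signature("adcxbd", signature), match_signature("adcxb", signature))
```

The columns consumed during a successful match are the positions from which attribute values would then be read.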
33
Conclusion The input of training page Annotated or unlabeled The format of extraction rule Delimiter-based, content-based, contextual rule The background knowledge Implicitly or explicitly
34
Problems For different problems, a different encoding scheme is needed Designing an unsupervised approach for both single-record and multi-record documents
35
References Semi-structured IE C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents. Submitted for publication. C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.