

Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan

Outline
- Problem Definition of Information Extraction
  - Semi-structured IE
  - Plain Text
- Information Extraction Methods
  - Specially designed programming languages: W4F, XWrap, Lixto
  - Supervised learning approach: WIEN, SoftMealy, STALKER
  - Unsupervised learning approach: IEPAD
  - Semi-supervised learning approach: OLERA
- Summary and Future Work

Introduction
- Information Extraction (IE) identifies relevant information in documents, pulling information from a variety of sources and aggregating it into a homogeneous form.
- The output template of the IE task
  - Several fields (slots)
  - Several instances of a field

Problem Definition
- Plain Text Information Extraction
  - The task of locating specific pieces of data in a natural language document
  - To obtain useful structured information from unstructured text
  - DARPA's MUC program
- Semi-structured IE
  - Different from traditional IE
  - The necessity of extracting and integrating data from multiple Web-based sources
  - e.g. generating 1000 wrappers/extractors

Types of IE from MUC
- Named Entity recognition (NE)
  - Finds and classifies names, places, etc.
- Coreference Resolution (CO)
  - Identifies identity relations between entities in texts.
- Template Element construction (TE)
  - Adds descriptive information to NE results.
- Scenario Template production (ST)
  - Fits TE results into specified event scenarios.

IE from Semi-structured Documents
- Output Template: k-tuple
  - Multiple instances of a field
  - Missing data
  - Several permutations of attributes

Specially Designed Programming Language
- Programming by users
  - General programming language
  - Specially designed programming language: W4F, XWrap, Lixto
- How?
  - Observing common delimiters as landmarks
  - Writing extraction rules

Supervised Learning Approach
- Wrapper induction
  - WIEN, IJCAI-97: Kushmerick, Weld, Doorenbos
  - SoftMealy, IJCAI-99: Hsu
  - STALKER, AA-99: Muslea, Minton, Knoblock
- Key components of IE systems
  - Interface for labeling
  - Learning algorithm
    - Extraction rules: rule format
  - Extractor

Example Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
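The labels above can be reproduced with a delimiter-based rule of the kind the supervised systems learn. The following is a minimal sketch, not the WIEN implementation: the HTML markup of the page and the function name `exec_lr` are assumptions made for illustration, and the left/right delimiters are supplied by hand rather than induced from labeled pages.

```python
def exec_lr(page, delimiters):
    """Extract tuples by scanning for (left, right) delimiter pairs in order.

    `delimiters` holds one (left, right) pair per attribute. The scan is
    single-pass: the cursor only moves forward through the page.
    """
    if not delimiters:
        return []
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record starts here
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records          # unterminated attribute: stop
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical page markup; only the country/code values come from the slide.
page = ("<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>")
rule = [("<B>", "</B>"), ("<I>", "</I>")]
print(exec_lr(page, rule))
```

Values come back as strings; converting the second attribute to an integer would be a post-processing step.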

Labeling
- Start and end positions for
  - Scope
  - Record
  - Attribute
- Example

Learning Algorithm
- Token hierarchy for generalization
  - Background knowledge
- Learning algorithms
- Rule expression
  - Delimiter-based
    - Consecutive landmark
    - Sequential landmark
  - Context rule

Extractor Architecture
- WIEN
  - Single-pass
  - Single-loop, no branch
- STALKER
  - Multi-pass
  - Bi-directional scanning
- SoftMealy
  - Single-pass or multi-pass
  - Finite-state transducer

Pattern-discovery based IE (Unsupervised Learning Approach)
- Motivation
  - Display of multiple records often forms a repeated pattern
  - The occurrences of the pattern are spaced regularly and adjacently
- Now the problem becomes...
  - Find regular and adjacent repeats in a string

IEPAD Architecture
[Figure labels: Html Page, Pattern Discoverer, Patterns, Pattern Viewer, Users, Extraction Rule, Html Pages, Extractor, Extraction Results]

The Pattern Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule Composer
[Figure labels: HTML Page, Token Translator, A Token String, PAT Tree Constructor, PAT trees and Maximal Repeats, Validator, Advanced Patterns, Rule Composer, Extraction Rules]

1. Web Page Translation
- Encoding of HTML source
  - Rule 1: Each tag is encoded as a token
  - Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
- HTML Example: Congo 242 Egypt 20
- Encoded token string: T( )T(_)T( )T( )T(_)T( )T( )
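The two encoding rules can be sketched in a few lines. This is an illustrative sketch under assumptions: the function name `encode`, the upper-casing of tag names, the dropping of attributes, and the skipping of whitespace-only text are choices made here, and the sample markup is hypothetical because the tags of the slide's HTML example were lost in extraction.

```python
import re

# A tag (anything between '<' and '>') or a run of text up to the next '<'.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")
# Captures an optional closing slash and the tag name, ignoring attributes.
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html):
    """Translate HTML into a token list: one token per tag, '_' per text run."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        piece = match.group(0)
        if piece.startswith("<"):
            tag = TAG_NAME_RE.match(piece)
            if tag:  # comments and malformed tags are skipped
                tokens.append("<" + tag.group(1) + tag.group(2).upper() + ">")
        elif piece.strip():  # Rule 2: non-blank text becomes the TEXT token
            tokens.append("_")
    return tokens


# Hypothetical markup for one record of the slide's example.
print(encode("<B>Congo</B> <I>242</I><BR>"))
```

With this markup the result has seven tokens in the order tag, text, tag, tag, text, tag, tag, which is the same shape as the encoded token string on the slide.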

2. PAT Tree Construction
- PAT tree: binary suffix tree
- A Patricia tree constructed over all possible suffix strings of a text
- Example
  - Token codes: T( ) 000, T( ) 001, T( ) 010, T( ) 011, T( ) 100, T(_)
  - Token string: T( )T(_)T( )T( )T(_)T( )T( )
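Before a PAT tree can be built, the token string has to be turned into a binary string whose token-aligned suffixes are then indexed. The sketch below covers only that preprocessing and does not build the Patricia tree itself. The code assignment (fixed width, sorted token order) and the names `binary_encode` and `token_suffixes` are assumptions for illustration; the slide's own code table lost its tag names in extraction.

```python
def binary_encode(tokens):
    """Assign each distinct token a fixed-width binary code and concatenate."""
    alphabet = sorted(set(tokens))
    # Enough bits to number every distinct token (at least one bit).
    width = max(1, (len(alphabet) - 1).bit_length())
    codes = {tok: format(i, "0{}b".format(width)) for i, tok in enumerate(alphabet)}
    bits = "".join(codes[tok] for tok in tokens)
    return codes, bits, width


def token_suffixes(bits, width):
    """List suffixes that start on a token boundary, in lexicographic order.

    Each entry is (suffix_bits, token_position); these are the strings a
    binary suffix tree over the encoded page would index.
    """
    return sorted((bits[i:], i // width) for i in range(0, len(bits), width))


codes, bits, width = binary_encode(list("abab"))
print(codes, bits)
print(token_suffixes(bits, width))
```

Six distinct tokens, as in the slide's example, need three bits per token under this scheme.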

The Constructed PAT Tree

Definition of Maximal Repeats
- Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$
- $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$
- $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$
- $\alpha$ is a maximal repeat if it is both left maximal and right maximal
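The definition can be checked directly with a brute-force enumeration. The sketch below deliberately does not use a PAT tree: it tries every substring length, which is far slower but makes the left-maximal and right-maximal tests explicit. The function name `maximal_repeats` and the treatment of the string boundary as a distinct symbol (`None`) are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1, min_count=2):
    """Return {pattern: positions} for every maximal repeat in sequence s."""
    n = len(s)
    result = {}
    for length in range(min_len, n):
        occurrences = defaultdict(list)
        for p in range(n - length + 1):
            occurrences[tuple(s[p:p + length])].append(p)
        for pattern, positions in occurrences.items():
            if len(positions) < min_count:
                continue
            # Symbols just before and just after each occurrence.
            left = {s[p - 1] if p > 0 else None for p in positions}
            right = {s[p + length] if p + length < n else None for p in positions}
            # Two different neighbours on a side means some (i, j) pair differs.
            if len(left) > 1 and len(right) > 1:
                result[pattern] = positions
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])
```

The input may be a string or a list of tokens; patterns are returned as tuples so that multi-character tokens stay intact.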

3. Pattern Validator
- Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \cdots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence.
- Characteristics of a pattern
  - Regularity: variance coefficient
  - Adjacency: density
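The two measures can be sketched as follows. The slide names them without giving formulas, so the formulas here are assumptions: regularity is taken as the coefficient of variation (standard deviation divided by mean) of the gaps between adjacent occurrences, and density as the fraction of the span from the first to the last occurrence that the pattern covers, (k - 1) * |α| / (p_k - p_1).

```python
from statistics import mean, pstdev


def regularity(positions):
    """Coefficient of variation of the gaps between adjacent occurrences.

    0.0 means perfectly regular spacing. Needs at least two positions.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Share of the span between first and last occurrence that the pattern covers.

    1.0 means the occurrences are adjacent, with nothing in between.
    """
    return (len(positions) - 1) * pattern_len / (positions[-1] - positions[0])


# Occurrences of "adc" in "adcwbdadcxbadcxbdadcb".
positions = [0, 6, 11, 17]
print(regularity(positions), density(positions, 3))
```

A validator would keep a pattern only when its regularity is below one threshold and its density above another; those threshold values are not given on the slide.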

4. Rule Composer
- Problem
  - Patterns with density less than 1 can extract only part of the information
- Solution
  - Align the k-1 substrings among the k occurrences
  - Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
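The two-string case mentioned above can be sketched with the standard dynamic-programming table. This is a minimal global alignment under assumed unit costs (1 per mismatch, 1 per gap), not the scoring used by IEPAD; the function name `align` is hypothetical.

```python
def align(a, b, gap="-"):
    """Globally align sequences a and b; returns (row_a, row_b, cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or mismatch
                cost[i - 1][j] + 1,                            # gap in b
                cost[i][j - 1] + 1,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    return row_a[::-1], row_b[::-1], cost[n][m]


row_a, row_b, total = align("adcwbd", "adcxb")
print("".join(row_a), "".join(row_b), total)
```

Filling the table takes O(n*m) time and space, which is the bound quoted on the slide.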

Multiple String Alignment
- Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb"
- If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":
    a d c w b d
    a d c x b -
    a d c x b d
- The extraction pattern can be generalized as "adc[w|x]b[d|-]"
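Once the rows are aligned, the generalized pattern follows column by column. A minimal sketch, assuming the alignment is already given as equal-length rows with "-" marking a gap; the function name `generalize` is hypothetical.

```python
def generalize(rows):
    """Collapse aligned rows into one pattern with alternatives per column."""
    parts = []
    for column in zip(*rows):
        seen = []
        for symbol in column:
            if symbol not in seen:      # keep first-seen order of alternatives
                seen.append(symbol)
        if len(seen) == 1:
            parts.append(seen[0])
        else:
            parts.append("[" + "|".join(seen) + "]")
    return "".join(parts)


print(generalize(["adcwbd", "adcxb-", "adcxbd"]))  # adc[w|x]b[d|-]
```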

Pattern Viewer / User Interface
- Java-application based GUI
- Web based GUI

The Extractor
- Matching the pattern against the encoded token string
  - Knuth-Morris-Pratt's algorithm
  - Boyer-Moore's algorithm
- Alternatives in a rule
  - Matching the longest pattern
- What is extracted?
  - The whole record
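Matching a fixed pattern against the token string can be sketched with the Knuth-Morris-Pratt algorithm. The sketch handles a plain pattern only, without the alternatives of a generalized rule, and the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, token in enumerate(text):
        while k and token != pattern[k]:
            k = fail[k - 1]             # fall back without re-reading text
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]             # continue, allowing overlaps
    return hits


print(kmp_find_all("adcwbdadcxbadcxbdadcb", "adc"))
```

Both arguments may be strings or lists of tokens, so the same routine works on an encoded page.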

Problem
- Deals only with multi-record pages
- Many patterns are composed due to
  - Multiple string alignment
  - Unknown start position
- Alignment error due to ignored text strings

Semi-supervised approach: OLERA
- A universal method for wrapping both
  - single-record pages and
  - multi-record pages
- OnLine Extraction Rule Analysis
  - Drill-down/Roll-up operations
  - Encoding hierarchy
- (What would you do?)

OLERA's Framework
[Figure labels: doc, Block Enclosing, Attribute Designation, Drill down/Roll up, Extraction Patterns, Page Encoder, Approximate Matching, Multiple String Alignment]
- Three simple operations
  - Block enclosing
  - Drill-down/Roll-up
  - Attribute Designation

Block Enclosing
- Multiple single-record pages

Enclosing (Cont.)
- Different from labeling
  - The number of enclosing operations is far smaller than the number of training pages
- Encoding
- Approximate Matching
  - Extension of global string alignment
- String Alignment
  - Enhanced matching function

Attribute Designation

Drill-down/Roll-up
- Drill-down
  - Encoding
  - Multiple String Alignment
  - Each column is given an identifier: 8_0, 8_1, 8_2 for a drill-down operation on column 8
- Roll-up
  - Several columns can be concatenated together
  - The corresponding identifiers are recorded

Extractors
- Grammar
  - Signature representation for alignment result
  - Each drill-down and roll-up operation
  - The columns to be extracted for each attribute
- Matching signature pattern in testing pages
  - Variation of approximate matching
    - Insertion and mismatch are not allowed
    - Deletion is allowed only if indicated in the signature pattern
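The restricted matching described above can be sketched as follows. The representation is an assumption made for illustration and is not OLERA's signature format: a signature is a list of `(token, optional)` pairs, where `optional=True` marks a column that may be deleted. Every token in the page must match a signature column (no insertion, no mismatch), and only optional columns may be skipped.

```python
def match_signature(tokens, start, signature):
    """Try to match `signature` at tokens[start:].

    Returns the index just past the match, or None if there is no match.
    """
    def go(i, j):
        if j == len(signature):
            return i                        # every column accounted for
        token, optional = signature[j]
        if i < len(tokens) and tokens[i] == token:
            end = go(i + 1, j + 1)          # consume the matching token
            if end is not None:
                return end
        if optional:
            return go(i, j + 1)             # permitted deletion of this column
        return None                         # mismatch or insertion: reject

    return go(start, 0)


# Hypothetical signature: "x" may be absent, everything else is required.
signature = [("a", False), ("d", False), ("c", False), ("x", True), ("b", False)]
print(match_signature("adcxb", 0, signature))
print(match_signature("adcb", 0, signature))
print(match_signature("adcwb", 0, signature))
```

The recursion backtracks over optional columns, so a signature such as an optional "a" followed by a required "a" still matches a single "a".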

Conclusion
- The input of training pages
  - Annotated or unlabeled
- The format of extraction rules
  - Delimiter-based, content-based, contextual rule
- The background knowledge
  - Implicit or explicit

Problems
- For different problems, different encoding schemes are needed
- Designing an unsupervised approach for both single-record and multi-record documents

References
- Semi-structured IE
  - C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents. Submitted for publication.
  - C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, May 2-6, 2001, Hong Kong.