1
Character-Level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA 15213 USA
2
Summary
We illustrated…
  1. the construction of character-based wrappers used in SEAL
  2. a method to extend SEAL to learn binary relational concepts
We showed that…
  1. character-based wrappers perform better than HTML-based wrappers
  2. binary SEAL has good performance
3
Background – SEAL
Set Expander for Any Language (Wang & Cohen, ICDM 2007)
An example of set expansion:
  Given an input query (seeds): { survivor, amazing race }
  The output answer is: { american idol, big brother, ... }
4
Features
  Independent of human language and markup language
    Supports seeds in English, Chinese, Japanese, ...
    Accepts documents in HTML, XML, SGML, TeX, …
  Does not require pre-annotated training data
    Utilizes a readily available corpus: the World Wide Web
Research contributions
  Automatically constructs wrappers for extracting candidate items
  Ranks candidates using a random walk
5
SEAL’s Architecture
Fetcher: Download web pages containing all seeds
Extractor: Learn and construct wrappers
Ranker: Rank candidate items using Random Walk
Example items shown in the figure: Canon, Nikon, Olympus, Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
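To make the Ranker step concrete, here is a minimal sketch of a random walk with restart over a small graph, written in Python. The graph layout (a seed node linked to documents, documents to wrappers, wrappers to extracted items), the node names, the restart probability, and the iteration count are all assumptions made for illustration; the slides do not specify SEAL's actual graph or parameters.

```python
def random_walk_with_restart(edges, start, restart=0.15, iterations=50):
    """Approximate visit probabilities of a walk that restarts at `start`.

    `edges` is a list of undirected (node, node) pairs. The restart value
    and the iteration count are illustrative defaults, not SEAL's settings.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    neighbors.setdefault(start, set())

    scores = {node: 0.0 for node in neighbors}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in neighbors}
        # A fixed fraction of the probability mass jumps back to the start.
        updated[start] = restart * sum(scores.values())
        for node, mass in scores.items():
            # A node without neighbors sends its mass back to the start.
            targets = neighbors[node] or {start}
            share = (1.0 - restart) * mass / len(targets)
            for target in targets:
                updated[target] += share
        scores = updated
    return scores


# Hypothetical graph: two documents, one wrapper each, two candidate items.
EDGES = [
    ("seeds", "doc1"), ("seeds", "doc2"),
    ("doc1", "wrapper1"), ("doc2", "wrapper2"),
    ("wrapper1", "canon"), ("wrapper1", "nikon"),
    ("wrapper2", "canon"),
]

scores = random_walk_with_restart(EDGES, "seeds")
ranked_items = sorted(["canon", "nikon"], key=scores.get, reverse=True)
```

In this toy graph an item extracted by more wrappers receives more probability mass, which is the intuition behind ranking candidates by a walk that starts from the seeds.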
6
Wrapper Learner
Current WL only learns unary relations
  e.g., x is a mayor
  A unary wrapper consists of a pair of left (L) and right (R) context strings
  Extracts all strings between L and R
Extended WL learns binary relations
  e.g., x is the mayor of city y
  A binary wrapper has an additional middle (M) context string
  Extracts string pairs found between L and M, and between M and R
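The way a learned wrapper is applied can be sketched in a few lines of Python. This is an illustration only: the function names, the use of non-greedy regular expressions, and the sample documents are assumptions, not SEAL's implementation.

```python
import re


def extract_unary(document, left, right):
    """Return every string that appears between `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return [m.group(1) for m in re.finditer(pattern, document, flags=re.DOTALL)]


def extract_binary(document, left, middle, right):
    """Return (x, y) pairs where x lies between L and M, and y between M and R."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return [(m.group(1), m.group(2))
            for m in re.finditer(pattern, document, flags=re.DOTALL)]


# Made-up documents, used only to exercise the two functions.
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
binary_doc = "[mayor_a|city_a];[mayor_b|city_b];"

items = extract_unary(unary_doc, "<li>", "</li>")
pairs = extract_binary(binary_doc, "[", "|", "]")
```

Because `re.finditer` returns non-overlapping matches, a right context that also contains the next item's left context would skip every other item; a fuller implementation would need to handle that case.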
7
Unary Relation Wrapper Construction
8
Real Unary Wrappers
Given seeds: Ford, Nissan, Toyota
Examples of wrappers and extractions:
9
Mock Unary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:
10
Context tries for mock example:
Constructed unary wrappers:
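As a rough illustration of character-level wrapper construction, the Python sketch below finds, for each choice of one occurrence per seed, the longest left context and the longest right context shared by all of the chosen occurrences. It is a brute-force stand-in, not the trie-based procedure the slides refer to, and the mock document and function names are hypothetical.

```python
import os
from itertools import product


def occurrences(document, seed):
    """Start offsets of every occurrence of `seed` in `document`."""
    positions, start = [], document.find(seed)
    while start != -1:
        positions.append(start)
        start = document.find(seed, start + 1)
    return positions


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def build_unary_wrappers(document, seeds):
    """Return (left, right) context pairs that bracket every seed at least once."""
    spans = [[(p, p + len(seed)) for p in occurrences(document, seed)]
             for seed in seeds]
    if any(not seed_spans for seed_spans in spans):
        return set()  # a seed that never appears cannot be bracketed
    wrappers = set()
    for combo in product(*spans):
        left = common_suffix([document[:start] for start, _ in combo])
        right = common_prefix([document[end:] for _, end in combo])
        if left and right:
            wrappers.add((left, right))
    return wrappers


# Hypothetical document in a made-up markup language.
mock_document = "[#Ford#]~[#Nissan#]~[#Toyota#]~[#Honda#]~"
wrappers = build_unary_wrappers(mock_document, ["Ford", "Nissan", "Toyota"])
```

No knowledge of the markup is used: the contexts are plain character strings, which is why the same idea applies to HTML, XML, or an unknown format. Enumerating every combination of occurrences grows quickly with the number of seeds and occurrences, so this form is only suitable for small examples.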
11
Unary SEAL Evaluation
Metric – Mean Average Precision
Dataset – 36 datasets (Wang & Cohen, ICDM 2007)
Evaluated on 5 types of wrappers
  Type 1 is least strict – SEAL’s default
  Type 5 is most strict – but still less strict than any HTML wrapper
Result – stricter wrappers perform worse
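For readers unfamiliar with the metric, the following Python sketch computes mean average precision using its common textbook definition. The function names and the toy rankings are assumptions for illustration; the exact variant used in the evaluation is not stated on the slide.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a new relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    found, total = set(), 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant and item not in found:
            found.add(item)  # count each correct item only once
            total += len(found) / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy rankings, invented purely to exercise the functions.
runs = [
    (["a", "x", "b"], {"a", "b"}),
    (["x", "a"], {"a"}),
]
map_score = mean_average_precision(runs)
```

Dividing by the number of relevant items, rather than by the number retrieved, penalizes a ranking that misses correct items entirely.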
12
Binary Wrapper Construction
Keep track of all middle contexts:
In the unary code, replace Intersect with:
13
Real Binary Wrappers
15
Binary SEAL Evaluation
Relational datasets
  Surveyed more than a dozen
  Randomly selected five:
Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)
17
Unary SEAL Evaluation
18
Mock Binary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language: