Published by Mary Carroll. Modified over 9 years ago.
1
Implementing Automatic Value Extraction from Structured Web Pages
Varun Ganapathi, Jonathan Pines, Josh Wiseman
2
Problem
Context: Many web pages are generated by applying a template to structured data.
Goal: Given a set of pages generated from a template, infer the template. Extract values from previously unseen pages generated from the template.
Why? The template encodes structure that usually has semantic meaning. The structured values that back a page are all the important information in the page.
3
What is a Template?
It is a special case of a context-free grammar, built from two constructs:
Tuples (fixed-length ordered lists)
Sets (arbitrary-length lists denoted by separators)
Example of Instantiated Template: Ethan Hunt comes face to face with a dangerous and … 6.8 Tom Cruise Ethan Hunt Ving Rhames Luther Strickell
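The tuple/set view of a template can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class names (`Text`, `Slot`, `TupleNode`, `SetNode`), the `instantiate` function, the HTML markup, and the reading of the example values as plot, rating, and actor/role pairs are all assumptions made here, since the slide's original markup did not survive.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    """Fixed template text, emitted verbatim on every page."""
    value: str


@dataclass
class Slot:
    """Placeholder filled from one field of the backing data record."""
    name: str


@dataclass
class TupleNode:
    """Fixed-length ordered list of sub-templates."""
    parts: List["Node"]


@dataclass
class SetNode:
    """Arbitrary-length list: one sub-template repeated, joined by a separator."""
    name: str
    item: "Node"
    separator: str


Node = Union[Text, Slot, TupleNode, SetNode]


def instantiate(node, data):
    """Render a template node against a data record (a dict)."""
    if isinstance(node, Text):
        return node.value
    if isinstance(node, Slot):
        return str(data[node.name])
    if isinstance(node, TupleNode):
        return "".join(instantiate(part, data) for part in node.parts)
    if isinstance(node, SetNode):
        # Each element of data[node.name] is itself a record for the item template.
        return node.separator.join(instantiate(node.item, rec) for rec in data[node.name])
    raise TypeError(f"unknown template node: {node!r}")


# Hypothetical movie-page template; the markup is invented for illustration.
cast_item = TupleNode([Text("<li>"), Slot("actor"), Text(" as "), Slot("role"), Text("</li>")])
movie_template = TupleNode([
    Text("<p>"), Slot("plot"), Text("</p><b>"), Slot("rating"), Text("</b><ul>"),
    SetNode("cast", cast_item, separator=""),
    Text("</ul>"),
])

page = instantiate(movie_template, {
    "plot": "Ethan Hunt comes face to face with a dangerous and …",
    "rating": "6.8",
    "cast": [
        {"actor": "Tom Cruise", "role": "Ethan Hunt"},
        {"actor": "Ving Rhames", "role": "Luther Strickell"},
    ],
})
print(page)
```

Template inference is the inverse problem: given only several rendered pages like `page`, recover something equivalent to `movie_template`.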
4
Learning Templates
Use the following observations:
1. When tokens occur frequently together, it might be because they are derived from the same template
2. The strings derived from templates have certain properties
   1. Ordered
   2. Nested
Loop:
   Find equivalence classes of differentiated tokens
   Increase partial template
   Differentiate tokens based on partial template
Construct Template using Patterns
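The first loop step can be illustrated with a small Python sketch. It assumes that an "equivalence class" means a group of tokens with identical per-page occurrence counts, and that only tokens present on every page are template candidates; both are assumptions made for this example, as are the function names and the toy pages. It does not cover token differentiation or template construction.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to a tuple of its occurrence counts, one entry per page."""
    counts = [Counter(page) for page in pages]
    vocab = set().union(*pages)
    return {tok: tuple(c[tok] for c in counts) for tok in vocab}


def equivalence_classes(pages, min_size=2):
    """Group tokens that share an occurrence vector.

    Keeps only classes whose tokens appear on every page and that contain
    at least `min_size` tokens (a hypothetical threshold).
    """
    groups = defaultdict(list)
    for tok, vec in occurrence_vectors(pages).items():
        if min(vec) >= 1:  # token occurs on every page
            groups[vec].append(tok)
    return {vec: sorted(toks) for vec, toks in groups.items() if len(toks) >= min_size}


# Two toy pages, already tokenized on whitespace.
page1 = "<p> A </p> <li> x </li> <li> y </li>".split()
page2 = "<p> B </p> <li> z </li>".split()

classes = equivalence_classes([page1, page2])
for vec, toks in sorted(classes.items()):
    print(vec, toks)
```

Here `<p>` and `</p>` share the vector `(1, 1)` and `<li>` and `</li>` share `(2, 1)`, so each pair forms a class, while the data tokens (`A`, `x`, `z`, ...) are dropped because none of them appears on both pages.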
5
Evaluation
We manually extracted “interesting” data from several IMDB movie pages.
Example: Ethan Hunt comes face to face with a dangerous and … 6.8 Tom Cruise Ethan Hunt Ving Rhames Luther Strickell
Some attributes: title, writers, directors, plot summary, rating, actors, languages, trivia, …
Each attribute was graded as one of:
Correct: Our system extracted exactly the expected data.
Partially Correct: Our system extracted the expected data plus a bit too much.
Incorrect: Our system missed some data.
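The three grades can be expressed as a simple set comparison. The sketch below assumes each attribute's values are compared as sets of strings, which the slides do not state; the `grade` function and the sample "extracted" values are hypothetical and are not the reported results.

```python
from collections import Counter


def grade(expected, extracted):
    """Grade one attribute by comparing hand-labelled and extracted values."""
    expected, extracted = set(expected), set(extracted)
    if extracted == expected:
        return "correct"            # exactly the expected data
    if expected < extracted:
        return "partially correct"  # everything expected, plus extra
    return "incorrect"              # some expected data is missing


# Hand-labelled values for one page (taken from the slide's example).
gold = {
    "rating": ["6.8"],
    "actors": ["Tom Cruise", "Ving Rhames"],
    "roles": ["Ethan Hunt", "Luther Strickell"],
}

# Hypothetical system output, invented to show each grade once.
system = {
    "rating": ["6.8"],
    "actors": ["Tom Cruise", "Ving Rhames", "more"],
    "roles": ["Ethan Hunt"],
}

grades = {attr: grade(gold[attr], system.get(attr, [])) for attr in gold}
tally = Counter(grades.values())
print(grades)
print(tally)
```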
6
Results
8
Attributes: 5 correct, 5 partially correct, 6 incorrect