Published by Anna Marshall. Modified over 9 years ago.
1
May 11, 2005, WWW 2005 -- Chiba, Japan
Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web
Andrew Hogue
Google, MIT CSAIL
2
Acknowledgments
- David Karger (karger@csail.mit.edu)
- Haystack Group (http://haystack.csail.mit.edu)
3
Agenda
- Overview
- Demo
- Details
  - Induction
  - Matching
  - Semantics
  - Heuristics
5
Unwrapping the Web
- Majority of semantic content in “deep web”
- Transformed into human-readable HTML by scripts
- HTML is difficult for automated agents to understand
- Little incentive for content providers to provide RDF markup
- How to “unwrap” this content?
6
Thresher
- Simple UI for wrapper induction on structured web content
- “Demonstrate” examples of objects
- Induce wrapper, or pattern, based on DOM
- User may also label properties with RDF
7
Thresher
- Built on Haystack Semantic Web client
- Everything is RDF
- Everything has context menus
- Thresher brings RDF into the web browser
- Wrappers reify web objects for full interaction
8
Thresher
- Underlying wrapper algorithm based on tree edit distance
- Align user’s examples
- Keep aligned nodes (layout elements)
- Wildcard non-aligned nodes (content)
- Pattern matching is also alignment
9
Agenda
- Overview
- Demo
- Details
  - Induction
  - Matching
  - Semantics
  - Heuristics
11
Wrapper Induction
- Wrapper: pattern created from examples
- User provides positive examples
- Generalize examples into reusable pattern
- Existing techniques:
  - head-left-right-tail (HLRT) descriptors
  - Hidden Markov models
  - Support Vector Machines
  - Other Machine Learning
12
Wrapper Induction
- Our approach: take advantage of hierarchical structure of HTML
- Each example picks out a subtree of DOM
- Calculate tree edit distance between examples
- Least-cost edit distance gives best mapping
- Remove unmapped nodes to make pattern
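The generalization step can be illustrated with a minimal sketch. This is not Thresher's implementation: it assumes a simplified top-down alignment in which a label mismatch is charged as a delete plus an insert, and the names `Node`, `WILDCARD`, `merge`, and `generalize` are hypothetical. Aligned nodes are kept as layout and everything unaligned becomes a wildcard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A DOM-like node; text nodes are leaves whose label is their text."""
    label: str
    children: tuple = ()


WILDCARD = Node("*")  # hypothetical marker for variable content


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def merge(kids):
    """Collapse runs of adjacent wildcards into a single wildcard."""
    out = []
    for kid in kids:
        if kid == WILDCARD and out and out[-1] == WILDCARD:
            continue
        out.append(kid)
    return tuple(out)


def generalize(a, b):
    """Return (cost, pattern) for the cheapest top-down alignment of a and b."""
    if a.label != b.label:
        # Unaligned roots: both subtrees are treated as content.
        return size(a) + size(b), WILDCARD
    xs, ys = a.children, b.children
    # best[i][j] = (cost, pattern children) for aligning xs[:i] with ys[:j]
    best = [[(0, ())] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(len(xs) + 1):
        for j in range(len(ys) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # xs[i-1] has no partner in the other example
                cost, kids = best[i - 1][j]
                options.append((cost + size(xs[i - 1]), kids + (WILDCARD,)))
            if j > 0:  # ys[j-1] has no partner in the other example
                cost, kids = best[i][j - 1]
                options.append((cost + size(ys[j - 1]), kids + (WILDCARD,)))
            if i > 0 and j > 0:  # align the two children with each other
                cost, kids = best[i - 1][j - 1]
                sub_cost, sub = generalize(xs[i - 1], ys[j - 1])
                options.append((cost + sub_cost, kids + (sub,)))
            best[i][j] = min(options, key=lambda option: option[0])
    cost, kids = best[len(xs)][len(ys)]
    return cost, Node(a.label, merge(kids))


# Two made-up examples that share layout (li, b, i) but differ in content.
ex1 = Node("li", (Node("b", (Node("Talk A"),)), Node("i", (Node("3:30 PM"),))))
ex2 = Node("li", (Node("b", (Node("Talk B"),)), Node("i", (Node("4:00 PM"),))))
cost, pattern = generalize(ex1, ex2)
```

Here `pattern` keeps the `li`, `b`, and `i` elements and holds one wildcard where the two examples carried different text.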
13
Tree Edit Distance
- Calculate cost of sequence of operations to transform one tree into the other
- Operations: insert, delete, change a node
- Cost of an operation = size of subtree it affects
- Least-cost set of operations gives best mapping between elements
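A minimal sketch of such a distance, under stated assumptions: it uses a restricted top-down variant (children are aligned in order, and whole subtrees are inserted or deleted) rather than a general tree edit distance, and it charges 1 for changing a single node's label. `Node`, `size`, and `distance` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def distance(a, b):
    """Top-down edit distance; inserting or deleting a subtree costs its size."""
    change = 0 if a.label == b.label else 1  # assumed cost of relabeling one node
    xs, ys = a.children, b.children
    # d[i][j] = cheapest way to turn the first i children of a into the first j of b
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                    # delete subtree
                d[i][j - 1] + size(ys[j - 1]),                    # insert subtree
                d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # map the pair
            )
    return change + d[len(xs)][len(ys)]


# Made-up rows: t2 has one extra empty cell compared with t1.
t1 = Node("tr", (Node("td", (Node("b"),)), Node("td")))
t2 = Node("tr", (Node("td", (Node("b"),)), Node("td"), Node("td")))
print(distance(t1, t2))
```

The recursion is exponential in the worst case without memoization, which is acceptable for an illustration on small subtrees.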
14
Mapping Examples (figures, slides 14-16)
18
Pattern Matching
- Look for document subtrees with similar structure
- Find alignments of wrapper in tree
- Require every node in wrapper be mapped to some node in document subtree
- Wildcards match zero or more times
- Each valid alignment is a match
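The matching rules above can be sketched as follows. This is an illustrative simplification, not the talk's algorithm: it assumes wildcards are pattern nodes labeled `*` that absorb zero or more consecutive sibling subtrees, and that document children not covered by a wildcard must pair up with a wrapper node. The names `Node`, `matches`, `match_children`, and `find_matches` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def matches(pattern, node):
    """True if the wrapper pattern aligns with the subtree rooted at node."""
    return (pattern.label == node.label
            and match_children(pattern.children, node.children))


def match_children(ps, ds):
    """Align pattern children ps against document children ds, left to right."""
    if not ps:
        return not ds
    head, rest = ps[0], ps[1:]
    if head.label == "*":
        # A wildcard may absorb zero or more document siblings.
        return any(match_children(rest, ds[k:]) for k in range(len(ds) + 1))
    # Every non-wildcard wrapper node must map to some document node.
    return bool(ds) and matches(head, ds[0]) and match_children(rest, ds[1:])


def find_matches(pattern, root):
    """Collect every document subtree that the pattern aligns with."""
    found = [root] if matches(pattern, root) else []
    for child in root.children:
        found.extend(find_matches(pattern, child))
    return found


# Made-up wrapper and document.
star = Node("*")
wrapper = Node("li", (Node("b", (star,)), Node("i", (star,))))
doc = Node("ul", (
    Node("li", (Node("b", (Node("Talk A"),)), Node("i", (Node("3:30 PM"),)))),
    Node("li", (Node("b", (Node("Talk B"),)), Node("i"))),
    Node("li", (Node("b", (Node("No time given"),)),)),
))
hits = find_matches(wrapper, doc)
```

The first two list items match (the second through a wildcard that matches zero nodes); the third has no `i` element, so the wrapper's `i` node cannot be mapped and it is rejected.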
19
Matching Example (figure)
21
Adding Semantics
- How to tie wrappers to semantic content?
- Assert RDF statements about unwrapped objects
- Tied to wrapper structure
- Classes bound to wrappers
- Properties bound to wildcards
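One way to picture the binding of classes to wrappers and properties to wildcards is the sketch below. It is an assumption-laden illustration: statements are modeled as plain `(subject, predicate, object)` tuples rather than with an RDF library, and the `prop` field, the `ex:` names, the subject URI, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    prop: str = ""  # hypothetical: RDF property bound to a wildcard node


def text(node):
    """Text of a subtree; leaves contribute their label."""
    if not node.children:
        return node.label
    return " ".join(text(c) for c in node.children)


def bind(pattern, node):
    """Return {property: text} if the pattern matches node, else None."""
    if pattern.label != node.label:
        return None
    return bind_children(pattern.children, node.children)


def bind_children(ps, ds):
    if not ps:
        return {} if not ds else None
    head, rest = ps[0], ps[1:]
    if head.label == "*":
        # Try the shortest span first; record the text the wildcard absorbed.
        for k in range(len(ds) + 1):
            tail = bind_children(rest, ds[k:])
            if tail is not None:
                found = dict(tail)
                if head.prop:
                    found[head.prop] = " ".join(text(d) for d in ds[:k])
                return found
        return None
    if not ds:
        return None
    here = bind(head, ds[0])
    tail = bind_children(rest, ds[1:]) if here is not None else None
    return None if tail is None else {**here, **tail}


def to_triples(subject, rdf_class, bindings):
    """Class comes from the wrapper, properties from its wildcards."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, p, v) for p, v in sorted(bindings.items())]
    return triples


# Made-up wrapper: the class ex:Seminar is bound to the wrapper as a whole.
wrapper = Node("li", (Node("b", (Node("*", prop="ex:title"),)),
                      Node("i", (Node("*", prop="ex:time"),))))
item = Node("li", (Node("b", (Node("Sample Talk"),)), Node("i", (Node("3:30 PM"),))))
triples = to_triples("ex:item1", "ex:Seminar", bind(wrapper, item))
```

Each match of the wrapper then yields one typed resource plus one statement per labeled wildcard.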
22
Semantic Labels (figure)
23
Semantic Matching (figures, slides 23-24)
25
Semantic Matching
[ ; “Dertouzos Lect…” ; “Distributed Hash…” ; “3:30 PM” ]
27
Automatically Adding Examples
- Find additional examples automatically
- Consider nodes neighboring the example
- Require low normalized cost between the example and each neighbor
- Often allows us to create wrappers with a single example
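The neighbor search can be sketched as below. The slide's formula for normalized cost did not survive in this transcript, so the sketch assumes one plausible normalization, the edit distance divided by the combined size of the two subtrees; the threshold of 0.3 is likewise a made-up value, and `distance` is a simplified top-down variant. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(node):
    return 1 + sum(size(c) for c in node.children)


def distance(a, b):
    """Simplified top-down edit distance (subtree insert/delete costs its size)."""
    xs, ys = a.children, b.children
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(d[i - 1][j] + size(xs[i - 1]),
                          d[i][j - 1] + size(ys[j - 1]),
                          d[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]))
    return (0 if a.label == b.label else 1) + d[len(xs)][len(ys)]


def normalized_cost(a, b):
    """Assumed normalization: distance relative to the combined tree size."""
    return distance(a, b) / (size(a) + size(b))


def similar_siblings(parent, example, threshold=0.3):
    """Siblings of the user's example that are cheap to align with it."""
    return [c for c in parent.children
            if c is not example and normalized_cost(example, c) <= threshold]


# Made-up table: two data rows with the same layout and one header row.
row1 = Node("tr", (Node("td", (Node("a", (Node("A"),)),)), Node("td", (Node("1"),))))
row2 = Node("tr", (Node("td", (Node("a", (Node("B"),)),)), Node("td", (Node("2"),))))
header = Node("tr", (Node("th", (Node("Name"),)),))
table = Node("table", (header, row1, row2))
extra_examples = similar_siblings(table, row1)
```

With `row1` as the single user-supplied example, the structurally similar `row2` is picked up as an additional example while the header row is left out.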
28
Automatically Adding Examples (figure)
29
List Collapse
- Current wrappers generalize well for single elements
- Will not recognize variable length lists
- Collapse neighboring nodes with low normalized cost
- For matching, allow nodes to match more than once
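A rough sketch of list collapse, with a deliberate simplification: instead of a low normalized cost, it collapses only neighboring pattern nodes that are identical (a cost of zero). The `repeat` flag and the function names are hypothetical; a collapsed node is allowed to match one or more consecutive document nodes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    repeat: bool = False  # hypothetical flag: node may match more than once


def collapse(pattern):
    """Merge runs of identical neighboring children into one repeatable node."""
    kids = []
    for child in (collapse(c) for c in pattern.children):
        if kids and replace(kids[-1], repeat=False) == replace(child, repeat=False):
            kids[-1] = replace(kids[-1], repeat=True)
        else:
            kids.append(child)
    return replace(pattern, children=tuple(kids))


def matches(pattern, node):
    return (pattern.label == node.label
            and match_children(pattern.children, node.children))


def match_children(ps, ds):
    if not ps:
        return not ds
    head, rest = ps[0], ps[1:]
    if head.label == "*":
        # Wildcards absorb zero or more document siblings.
        return any(match_children(rest, ds[k:]) for k in range(len(ds) + 1))
    if not ds or not matches(head, ds[0]):
        return False
    if head.repeat and match_children(ps, ds[1:]):
        return True  # the same pattern node matched the next sibling too
    return match_children(rest, ds[1:])


def make_list(n):
    """Build a made-up document list with n items."""
    return Node("ul", tuple(Node("li", (Node(f"item {k}"),)) for k in range(n)))


star = Node("*")
two_items = Node("ul", (Node("li", (star,)), Node("li", (star,))))
collapsed = collapse(two_items)
```

The uncollapsed `two_items` pattern matches only lists of exactly two items, while `collapsed` also matches lists of one, three, or more.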
30
Wrapper Wrap-up
- Gather user example(s)
- Automatically find additional examples
- Generalize examples using best mapping
- Add semantic labels
- Match by finding alignments
- Overlay objects on the page for interaction
31
Additional Tools
- Wrapper Sharing
- RSS
- Web Operations
32
Our Contributions
- End-user wrapper induction
- Few examples required
- Bring object interaction into the browser
- Wrappers bridge syntactic-semantic gap
33
Future Work and Applications
- Document-level classes
- Page reformatting
- Autonomous agent interaction
- Negative examples
- Automatic wrapper induction
34
ahogue@google.com
http://haystack.csail.mit.edu
35
List Collapse Example (figures, slides 35-38)
39
Creating a Wrapper (figures, slides 39-41)
42
Adding an Example (figure)
43
Adding a Property (figures, slides 43-44)
45
Interacting with a Wrapped Object (figure)
46
Agenda
- Overview
- Demo
- Details
  - Induction
  - Matching
  - Semantics
  - Heuristics
- Results
47
Wrapper: Google Search Result (figure)
48
Wrapper: IMDB Actor (figure)