Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web
Andrew Hogue
Google, MIT CSAIL
WWW, Chiba, Japan, May 11, 2005
Acknowledgments
– David Karger
– Haystack Group (
Agenda
– Overview
– Demo
– Details
  – Induction
  – Matching
  – Semantics
  – Heuristics
Unwrapping the Web
– Majority of semantic content in “deep web”
– Transformed into human-readable HTML by scripts
– HTML is difficult for automated agents to understand
– Little incentive for content providers to provide RDF markup
– How to “unwrap” this content?
Thresher
– Simple UI for wrapper induction on structured web content
– “Demonstrate” examples of objects
– Induce wrapper, or pattern, based on DOM
– User may also label properties with RDF
Thresher
– Built on Haystack Semantic Web client
– Everything is RDF
– Everything has context menus
– Thresher brings RDF into the web browser
– Wrappers reify web objects for full interaction
Thresher
– Underlying wrapper algorithm based on tree edit distance
– Align user’s examples
– Keep aligned nodes (layout elements)
– Wildcard non-aligned nodes (content)
– Pattern matching is also alignment
Wrapper Induction
– Wrapper: pattern created from examples
– User provides positive examples
– Generalize examples into reusable pattern
– Existing techniques:
  – head-left-right-tail (HLRT) descriptors
  – Hidden Markov models
  – Support vector machines
  – Other machine learning
Wrapper Induction
– Our approach: take advantage of hierarchical structure of HTML
– Each example picks out a subtree of DOM
– Calculate tree edit distance between examples
– Least-cost edit distance gives best mapping
– Remove unmapped nodes to make pattern
Tree Edit Distance
– Calculate the cost of a sequence of operations that transforms one tree into the other
– Operations: insert, delete, or change a node
– Cost of an operation = size of the subtree it affects
– Least-cost set of operations gives the best mapping between elements
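The cost model on this slide can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Thresher's actual algorithm: trees are compared top-down, children are aligned in order, inserting or deleting a node costs the size of its subtree, and a changed node is modeled as a delete plus an insert. The names `Node`, `size`, `tree_dist`, and `forest_dist` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical DOM node: a tag name (or text) plus ordered children."""
    label: str
    children: tuple = ()


def size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_dist(a, b):
    """Cost of editing subtree a into subtree b."""
    if a.label != b.label:
        # Assumption: a changed node costs deleting a plus inserting b.
        return size(a) + size(b)
    return forest_dist(a.children, b.children)


def forest_dist(xs, ys):
    """Align two ordered child lists with a standard edit-distance table."""
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])  # delete xs[i-1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])  # insert ys[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),
                d[i][j - 1] + size(ys[j - 1]),
                d[i - 1][j - 1] + tree_dist(xs[i - 1], ys[j - 1]),
            )
    return d[m][n]


def td(text):
    """Build a table cell holding one text node."""
    return Node("td", (Node(text),))


row_a = Node("tr", (td("A"), td("B")))
row_b = Node("tr", (td("A"), td("C"), td("D")))
print(tree_dist(row_a, row_b))  # 4: change "B" to "C" (2) + insert td("D") (2)
```

Because whole subtrees are charged by size, large structural differences dominate the cost, while a changed text leaf stays cheap.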
Mapping Examples
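The induction step (map two examples onto each other, keep mapped nodes, wildcard the rest) might look roughly like the following Python sketch. It assumes a simplified top-down alignment of ordered children, so it is an illustration rather than the algorithm used in Thresher. `Node`, `WILDCARD`, `cost`, `align`, and `generalize` are hypothetical names, and the example rows are made up.

```python
from dataclasses import dataclass
from functools import lru_cache

WILDCARD = "*"  # hypothetical marker for "any content here"


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    return 1 + sum(size(c) for c in t.children)


@lru_cache(maxsize=None)
def cost(a, b):
    """Edit cost between subtrees; a label mismatch replaces the whole subtree."""
    if a.label != b.label:
        return size(a) + size(b)
    return align(a.children, b.children)[0]


def align(xs, ys):
    """Return (cost, steps); a step is a mapped pair, or None for an unmapped node."""
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + size(xs[i - 1]),
                          d[i][j - 1] + size(ys[j - 1]))
            if xs[i - 1].label == ys[j - 1].label:
                d[i][j] = min(d[i][j],
                              d[i - 1][j - 1] + cost(xs[i - 1], ys[j - 1]))
    steps, i, j = [], m, n
    while i > 0 or j > 0:  # walk the table backwards to recover the mapping
        if (i > 0 and j > 0 and xs[i - 1].label == ys[j - 1].label
                and d[i][j] == d[i - 1][j - 1] + cost(xs[i - 1], ys[j - 1])):
            steps.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + size(xs[i - 1]):
            steps.append(None)
            i -= 1
        else:
            steps.append(None)
            j -= 1
    return d[m][n], steps[::-1]


def generalize(a, b):
    """Keep nodes mapped between both examples; replace the rest with wildcards."""
    if a.label != b.label:
        return Node(WILDCARD)
    kids = []
    for step in align(a.children, b.children)[1]:
        kid = generalize(*step) if step else Node(WILDCARD)
        if kid.label == WILDCARD and kids and kids[-1].label == WILDCARD:
            continue  # merge neighboring wildcards into one
        kids.append(kid)
    return Node(a.label, tuple(kids))


def row(title, time):
    return Node("tr", (Node("td", (Node("b", (Node(title),)),)),
                       Node("td", (Node(time),))))


pattern = generalize(row("Talk A", "1:00 PM"), row("Talk B", "2:00 PM"))
# pattern is tr(td(b(*)), td(*)): layout kept, differing text wildcarded
```

Text that is identical in both examples stays in the pattern as a literal, which is how fixed labels on the page end up as layout rather than content.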
Pattern Matching
– Look for document subtrees with similar structure
– Find alignments of wrapper in tree
– Require every node in wrapper be mapped to some node in document subtree
– Wildcards match zero or more times
– Each valid alignment is a match
Matching Example
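A rough Python sketch of the matching rules: every non-wildcard node in the wrapper must map to a document node, and a wildcard may absorb zero or more sibling subtrees. It assumes a strict, ordered, top-down match, which is simpler than a full alignment. `Node`, `WILDCARD`, `matches`, `match_seq`, and `find_matches` are hypothetical names, and the sample table is made up.

```python
from dataclasses import dataclass

WILDCARD = "*"  # hypothetical wildcard label


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def matches(pattern, node):
    """True if the pattern root maps to node and the children align."""
    if pattern.label != node.label:
        return False
    return match_seq(pattern.children, node.children)


def match_seq(ps, ts):
    """Align pattern children ps against document children ts, in order."""
    if not ps:
        return not ts
    head, rest = ps[0], ps[1:]
    if head.label == WILDCARD:
        # A wildcard may absorb zero or more sibling subtrees.
        return any(match_seq(rest, ts[k:]) for k in range(len(ts) + 1))
    return bool(ts) and matches(head, ts[0]) and match_seq(rest, ts[1:])


def find_matches(pattern, doc):
    """Collect every document subtree that the pattern matches."""
    found = [doc] if matches(pattern, doc) else []
    for child in doc.children:
        found.extend(find_matches(pattern, child))
    return found


def row(title, time):
    return Node("tr", (Node("td", (Node("b", (Node(title),)),)),
                       Node("td", (Node(time),))))


pattern = Node("tr", (Node("td", (Node("b", (Node(WILDCARD),)),)),
                      Node("td", (Node(WILDCARD),))))
header = Node("tr", (Node("th", (Node("Title"),)), Node("th", (Node("Time"),))))
table = Node("table", (header, row("Talk A", "1:00 PM"), row("Talk B", "2:00 PM")))

print(len(find_matches(pattern, table)))  # 2: both data rows, not the header
```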
Adding Semantics
– How to tie wrappers to semantic content?
– Assert RDF statements about unwrapped objects
– Tied to wrapper structure
– Classes bound to wrappers
– Properties bound to wildcards
Semantic Labels
Semantic Matching
[ ; “Dertouzos Lect…” ; “Distributed Hash…” ; “3:30 PM” ]
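To make the binding concrete, here is a small Python sketch in which a class is attached to a wrapper and properties are attached to its wildcards, so that each match yields RDF-style triples. Everything named here is an assumption for illustration: the `ex:` names, the blank-node subject `_:talk1`, the plain-tuple triple format, and the helpers `bind`, `bind_seq`, `text_of`, and `to_triples`. The row content is made up apart from the time string.

```python
from dataclasses import dataclass
from typing import Optional

WILDCARD = "*"  # hypothetical wildcard label


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    prop: Optional[str] = None  # RDF property bound to a wildcard, if any


def text_of(nodes):
    """Concatenate leaf labels (treated as text) under the given nodes."""
    parts = [n.label if not n.children else text_of(n.children) for n in nodes]
    return " ".join(parts)


def bind(pattern, node):
    """Return {property: text} if the pattern matches node, else None."""
    if pattern.label != node.label:
        return None
    return bind_seq(pattern.children, node.children)


def bind_seq(ps, ts):
    if not ps:
        return {} if not ts else None
    head, rest = ps[0], ps[1:]
    if head.label == WILDCARD:
        for k in range(len(ts), -1, -1):  # let the wildcard absorb ts[:k]
            tail = bind_seq(rest, ts[k:])
            if tail is not None:
                found = {head.prop: text_of(ts[:k])} if head.prop else {}
                return {**found, **tail}
        return None
    if not ts:
        return None
    first = bind(head, ts[0])
    if first is None:
        return None
    tail = bind_seq(rest, ts[1:])
    return None if tail is None else {**first, **tail}


def to_triples(subject, rdf_class, bindings):
    """Assert the wrapper's class, then one statement per bound wildcard."""
    triples = [(subject, "rdf:type", rdf_class)]
    triples += [(subject, prop, value) for prop, value in bindings.items()]
    return triples


pattern = Node("tr", (
    Node("td", (Node("b", (Node(WILDCARD, prop="ex:title"),)),)),
    Node("td", (Node(WILDCARD, prop="ex:startTime"),)),
))
row = Node("tr", (Node("td", (Node("b", (Node("Talk A"),)),)),
                  Node("td", (Node("3:30 PM"),))))

triples = to_triples("_:talk1", "ex:Talk", bind(pattern, row))
for t in triples:
    print(t)
```

Keeping the property on the wildcard itself means the labels survive unchanged when the same wrapper is matched against other pages.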
Automatically Adding Examples
– Find additional examples automatically
– Consider nodes neighboring the example
– Require low normalized cost
– Often allows us to create wrappers with a single example
Automatically Adding Examples
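The neighbor heuristic can be sketched as follows. The exact normalization is not given here, so this Python sketch assumes one plausible choice: edit cost divided by the combined size of the two subtrees. The 0.5 threshold is arbitrary as well. `normalized_cost` and `similar_siblings` are hypothetical names, and the distance is the same simplified top-down edit distance used for illustration, not Thresher's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def size(t):
    return 1 + sum(size(c) for c in t.children)


def tree_dist(a, b):
    """Simplified edit cost; a label mismatch replaces the whole subtree."""
    if a.label != b.label:
        return size(a) + size(b)
    xs, ys = a.children, b.children
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(d[i - 1][j] + size(xs[i - 1]),
                          d[i][j - 1] + size(ys[j - 1]),
                          d[i - 1][j - 1] + tree_dist(xs[i - 1], ys[j - 1]))
    return d[-1][-1]


def normalized_cost(a, b):
    # Assumption: normalize by the combined size of both subtrees.
    return tree_dist(a, b) / (size(a) + size(b))


def similar_siblings(parent, example, threshold=0.5):
    """Siblings of the user's example that are cheap to edit into it."""
    return [s for s in parent.children
            if s is not example and normalized_cost(example, s) <= threshold]


def row(title, time):
    return Node("tr", (Node("td", (Node("b", (Node(title),)),)),
                       Node("td", (Node(time),))))


header = Node("tr", (Node("th", (Node("Title"),)), Node("th", (Node("Time"),))))
rows = [row("Talk A", "1:00 PM"), row("Talk B", "2:00 PM"), row("Talk C", "3:00 PM")]
table = Node("table", (header, *rows))

extra = similar_siblings(table, rows[0])
print(len(extra))  # 2: the other data rows; the header row is too different
```

With the extra rows found this way, a single user selection already provides several examples to generalize from.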
List Collapse
– Current wrappers generalize well for single elements
– Will not recognize variable-length lists
– Collapse neighboring nodes with low normalized cost
– For matching, allow nodes to match more than once
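A minimal Python sketch of list collapse, under a loudly stated simplification: exact equality of neighboring pattern nodes stands in for "low normalized cost". Collapsed nodes carry a hypothetical `repeat` flag, and the matcher lets such a node match one or more consecutive siblings. `collapse`, `matches`, and `match_seq` are illustrative names, not Thresher's API.

```python
from dataclasses import dataclass, replace

WILDCARD = "*"  # hypothetical wildcard label


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()
    repeat: bool = False  # True: may match one or more consecutive siblings


def collapse(p):
    """Merge runs of identical neighboring children into one repeating child."""
    kids = []
    for child in (collapse(c) for c in p.children):
        # Stand-in for "low normalized cost": the neighbors are identical.
        if kids and replace(kids[-1], repeat=False) == child:
            kids[-1] = replace(child, repeat=True)
        else:
            kids.append(child)
    return replace(p, children=tuple(kids))


def matches(p, t):
    return p.label == t.label and match_seq(p.children, t.children)


def match_seq(ps, ts):
    if not ps:
        return not ts
    head, rest = ps[0], ps[1:]
    if head.label == WILDCARD:
        return any(match_seq(rest, ts[k:]) for k in range(len(ts) + 1))
    if not ts or not matches(head, ts[0]):
        return False
    if head.repeat and match_seq(ps, ts[1:]):  # the same node matches again
        return True
    return match_seq(rest, ts[1:])


item = Node("li", (Node(WILDCARD),))
three_items = Node("ul", (item, item, item))  # pattern induced from a 3-item list
collapsed = collapse(three_items)             # ul with a single repeating li

five = Node("ul", tuple(Node("li", (Node(f"item {i}"),)) for i in range(5)))
print(matches(three_items, five))  # False: fixed-length pattern
print(matches(collapsed, five))    # True: the li may match more than once
```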
Wrapper Wrap-up
– Gather user example(s)
– Automatically find additional examples
– Generalize examples using best mapping
– Add semantic labels
– Match by finding alignments
– Overlay objects on the page for interaction
Additional Tools
– Wrapper Sharing
– RSS
– Web Operations
Our Contributions
– End-user wrapper induction
– Few examples required
– Bring object interaction into the browser
– Wrappers bridge syntactic-semantic gap
Future Work and Applications
– Document-level classes
– Page reformatting
– Autonomous agent interaction
– Negative examples
– Automatic wrapper induction
List Collapse Example
Creating a Wrapper
Adding an Example
Adding a Property
Interacting with a Wrapped Object
Agenda
– Overview
– Demo
– Details
  – Induction
  – Matching
  – Semantics
  – Heuristics
– Results
Wrapper: Google Search Result
Wrapper: IMDB Actor