Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen
Microsoft Research Asia
SIGKDD-2007, San Jose, California, USA
Outline
  Introduction
  Our approach
  Experiments
  Demo
  Conclusion
Motivations
  [Diagram: a page generation script (e.g., ASP, PHP, JSP) encodes database records into pages; a wrapper decodes them.]
Related Work
  Several automatic or semi-automatic wrapper learning methods have been proposed, e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], and ViNTs[21].
  Page clustering for wrapper induction has been treated as a trivial task:
    Manual: most previous work
    Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18]
Problems
  Dynamic URLs
    As dynamic URLs have become common, detecting templates from URLs is less effective than it used to be.
  (a) www.amazon.com/gp/product/B000BNLGJA/
  (b) www.amazon.com/gp/product/B00007J8SC/
  (c) www.amazon.com/gp/product/B0000DD95R/
  (d) www.amazon.com/gp/product/B0000A1AT9/
Problems (cont.)
  Complex Templates
    Even when URLs do group pages that share a template, generating a single wrapper for a complex template is sometimes far from optimal.
  (c) www.amazon.com/gp/product/B0000DD95R/
  (d) www.amazon.com/gp/product/B0000A1AT9/
Our Proposed Approach
  Main ideas
    Similarity-based templates instead of ground-truth templates
  Advantages
    More stable
    Optimizes the number of wrappers
Problem Definition
System Overview
Wrapper Generation [6, 4, 18]
Wrapper-DOM Distance
  Distance between a wrapper and a DOM tree
    Tree alignment
    Cost calculation
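The two steps named on this slide, tree alignment followed by cost calculation, can be illustrated with a small sketch. The slides do not give the actual cost model, so everything below is an assumption made for illustration: the `Node` class, the rule that an unmatched node costs the size of its subtree, and the normalization by combined tree size are all hypothetical stand-ins, not the authors' definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal tree node; stands in for both wrapper nodes and DOM nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def align_cost(w: Node, d: Node) -> int:
    """Cost of aligning wrapper subtree `w` with DOM subtree `d`.

    Assumed cost model (not from the slides): a node left without a
    counterpart costs the size of its subtree, and roots with different
    tags cannot be matched at all.
    """
    if w.tag != d.tag:
        return size(w) + size(d)
    a, b = w.children, d.children
    # dp[i][j] is the cheapest ordered alignment of a[:i] with b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + size(a[i - 1])
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + size(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a[i - 1]),  # wrapper child unmatched
                dp[i][j - 1] + size(b[j - 1]),  # DOM child unmatched
                dp[i - 1][j - 1] + align_cost(a[i - 1], b[j - 1]),  # paired
            )
    return dp[len(a)][len(b)]


def wrapper_dom_distance(w: Node, d: Node) -> float:
    """Alignment cost normalized to [0, 1] by the combined tree size."""
    return align_cost(w, d) / (size(w) + size(d))
```

Under these assumptions, a wrapper `div(h1, img, span)` aligned with a page `div(h1, span)` leaves only `img` unmatched, so the raw cost is 1 and the normalized distance is 1/7.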
Wrapper-Oriented Page Clustering (WPC)
  [Figure panels: (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper]
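The slides do not spell out the WPC procedure, so the following is only a rough sketch of one way a distance threshold could drive clustering and wrapper refinement together. It assumes a greedy, single-pass scheme: each page joins the closest existing wrapper if the distance is within the threshold, otherwise it seeds a new wrapper. The function names, the set-of-tag-paths page representation, and the Jaccard distance are hypothetical stand-ins for the real wrapper and Wrapper-DOM distance.

```python
from typing import Callable, FrozenSet, List, Tuple, TypeVar

P = TypeVar("P")  # page representation
W = TypeVar("W")  # wrapper representation


def cluster_pages(
    pages: List[P],
    threshold: float,
    distance: Callable[[W, P], float],
    new_wrapper: Callable[[P], W],
    refine: Callable[[W, P], W],
) -> List[Tuple[W, List[P]]]:
    """Greedy threshold clustering (an assumed scheme, not the paper's)."""
    clusters: List[Tuple[W, List[P]]] = []
    for page in pages:
        # Index of the closest existing wrapper, or None if there is none yet.
        best = min(
            range(len(clusters)),
            key=lambda i: distance(clusters[i][0], page),
            default=None,
        )
        if best is not None and distance(clusters[best][0], page) <= threshold:
            wrapper, members = clusters[best]
            clusters[best] = (refine(wrapper, page), members + [page])
        else:
            clusters.append((new_wrapper(page), [page]))
    return clusters


# Toy stand-ins: a page is a set of tag paths, a wrapper is the union of
# the paths seen in its cluster, and distance is Jaccard distance.
def jaccard_distance(wrapper: FrozenSet[str], page: FrozenSet[str]) -> float:
    union = wrapper | page
    return 1.0 - len(wrapper & page) / len(union) if union else 0.0


def union_refine(wrapper: FrozenSet[str], page: FrozenSet[str]) -> FrozenSet[str]:
    return wrapper | page
```

In a scheme like this, a looser threshold yields fewer, more general wrappers and a tighter one yields more, more specific wrappers. The result also depends on which page is processed first.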
Experiments
  Data
    Amazon: 1700 product pages from Amazon.com
    M10: 1000 mixed pages from 10 shopping sites
    Target product records: (name, image, price)
  Settings
    2-fold cross-validation
    Evaluation measures: precision, recall, and F1
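For reference, the three evaluation measures can be computed over extracted (name, image, price) records as sketched below. The slides do not say how an extracted record is judged correct; this sketch assumes exact match on all three fields, and the function name and sample data are made up for illustration.

```python
from typing import Iterable, Tuple

Record = Tuple[str, str, str]  # (name, image, price)


def precision_recall_f1(
    extracted: Iterable[Record], expected: Iterable[Record]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1, counting exact record matches as correct."""
    extracted_set, expected_set = set(extracted), set(expected)
    correct = len(extracted_set & expected_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sample: one of two extracted records matches the labels.
expected = [("Camera", "cam.jpg", "$199.00"), ("Lens", "lens.jpg", "$89.00")]
extracted = [("Camera", "cam.jpg", "$199.00"), ("Lens", "lens.jpg", "$98.00")]
print(precision_recall_f1(extracted, expected))  # (0.5, 0.5, 0.5)
```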
Effectiveness Test
  Amazon: 44 wrappers, F1: 94.88% vs. 78%
  M10:
WPC with Different Thresholds
Stability Test
  Objective
    Evaluate how the choice of the initial training page affects the performance of WPC
Demo!
  Microsoft Office Excel 2007 Web Data Add-In is coming soon!
  Please give it a try in two weeks! http://blogs.msdn.com/xaw
Conclusion
  Our system
    Takes a miscellaneous training set as input
    Conducts template detection and wrapper generation in a single step
    Can achieve a joint optimization under the criterion of extraction accuracy
  In the near future
    We will extend the approach to handle templates that contain content strings
Contacts:
  Ruihua Song (rsong@microsoft.com)
  Shuyi Zheng (shzheng@cse.psu.edu)
Poster No. 11
  Looking forward to talking with you at Poster Reception II this evening!
Labeling Cost
  Objective
    Show how many training pages are required to learn wrappers that achieve an F1 higher than 95%