Shuyi Zheng, Di Wu, Ruihua Song, Ji-Rong Wen Microsoft Research Asia SIGKDD-2007, San Jose, California, USA
Outline: Introduction, Our approach, Experiments, Demo, Conclusion
Motivations
Motivations (figure): Encoding: a page generation script (e.g., ASP, PHP, JSP) renders database content into pages. Decoding: a wrapper extracts that content back from the pages.
Related Work: Some automatic or semi-automatic wrapper learning methods have been proposed, e.g., WIEN[12], SoftMealy[11], Stalker[17], RoadRunner[6], EXALG[2], TTAG[4], the work in [18], and ViNTs[21]. Page clustering for wrapper induction is considered a trivial task. Manual: most previous work. Automatic but isolated from wrapper generation: RoadRunner[6,7] and [18].
Problems (cont.): Dynamic URLs. With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be.
Figure panels and their URLs: (a) …/gp/product/B000BNLGJA/ (b) …/gp/product/B00007J8SC/ (c) …/gp/product/B0000DD95R/ (d) …/gp/product/B0000A1AT9/
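To make the dynamic-URL problem concrete, the sketch below groups the four product URLs listed above by a simple URL pattern. The `url_pattern` helper and its ID heuristic are assumptions for illustration only, not part of the presented system, and the elided URL prefix ("…") is left out. If the four pages are rendered from different templates, as the slide suggests, a URL-based grouping cannot tell them apart because all four collapse into one pattern.

```python
import re
from collections import defaultdict

# Path suffixes copied from the slide; the elided prefix is intentionally omitted.
paths = [
    "/gp/product/B000BNLGJA/",
    "/gp/product/B00007J8SC/",
    "/gp/product/B0000DD95R/",
    "/gp/product/B0000A1AT9/",
]

def url_pattern(path):
    """Hypothetical helper: replace ID-like segments with a placeholder.

    A segment counts as ID-like if it consists only of uppercase letters
    and digits and is at least 8 characters long (an assumed heuristic).
    """
    segments = path.strip("/").split("/")
    normalized = [
        "{id}" if re.fullmatch(r"[A-Z0-9]{8,}", seg) else seg
        for seg in segments
    ]
    return "/" + "/".join(normalized) + "/"

# Group the paths by their normalized pattern.
groups = defaultdict(list)
for p in paths:
    groups[url_pattern(p)].append(p)

for pattern, members in groups.items():
    print(pattern, len(members))
```

All four URLs fall under the single pattern `/gp/product/{id}/`, so the URL alone carries no signal about which template produced each page.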
Problems: Dynamic URLs. With the popularity of dynamic URLs, detecting templates by URL is no longer as effective as it used to be. Complex templates. Even if URLs can group pages that share a template, generating only one wrapper for a complex template is sometimes far from optimal.
Our Proposed Approach: Main idea: similarity-based templates instead of ground-truth templates. Advantages: more stable; optimizes the number of wrappers.
Problem Definition
System Overview
Wrapper Generation [6, 4, 18]
Wrapper-DOM Distance: the distance between a wrapper and a DOM tree. Steps: tree alignment; cost calculation.
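The slide names only the two steps, tree alignment and cost calculation, without giving formulas. As a rough illustration, the sketch below uses a generic top-down tree alignment with unit node costs. This is an assumed stand-in, not the cost model of the presented system. Trees are modeled as hypothetical `(tag, children)` tuples, and `align_cost`, `size`, and `distance` are illustrative names.

```python
def size(tree):
    """Number of nodes in a (tag, children) tree."""
    _tag, children = tree
    return 1 + sum(size(c) for c in children)

def align_cost(a, b):
    """Top-down alignment cost between two trees (assumed unit node costs).

    If the root tags differ, both trees are counted as unmatched.
    Otherwise the child sequences are aligned with an edit-distance
    style dynamic program, recursing into matched children.
    """
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    if tag_a != tag_b:
        return size(a) + size(b)
    n, m = len(kids_a), len(kids_b)
    # dp[i][j]: minimal cost of aligning the first i children of a
    # with the first j children of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),        # child of a unmatched
                dp[i][j - 1] + size(kids_b[j - 1]),        # child of b unmatched
                dp[i - 1][j - 1] + align_cost(kids_a[i - 1], kids_b[j - 1]),
            )
    return dp[n][m]

def distance(wrapper, dom):
    """Alignment cost normalized to the range [0, 1]."""
    return align_cost(wrapper, dom) / (size(wrapper) + size(dom))

# Toy trees: the DOM lacks the <img> node that the wrapper expects.
wrapper = ("div", [("h1", []), ("img", []), ("span", [])])
dom = ("div", [("h1", []), ("span", [])])
print(distance(wrapper, dom))
```

In this toy case only the `img` node is unmatched, so the alignment cost is 1 out of 7 nodes in total. A distance of 0 means the trees align perfectly, and 1 means nothing aligns.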
Wrapper-Oriented Page Clustering (WPC). Figure panels: (a) Level-1 Wrapper, (b) Level-2 Wrapper, (c) Level-3 Wrapper, (d) Level-4 Wrapper.
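The slides do not spell out the clustering procedure. One way to picture similarity-based grouping is a greedy pass with a distance threshold, sketched below. Everything in it is an assumption for illustration: pages are reduced to sets of tag paths, a Jaccard distance stands in for the wrapper-DOM distance, and each cluster is represented by its first page rather than by a learned wrapper.

```python
def jaccard_distance(a, b):
    """Stand-in distance between two pages given as sets of tag paths."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def cluster_pages(pages, threshold):
    """Greedy threshold clustering (illustrative, not the WPC algorithm).

    Each page joins the nearest existing cluster if its distance to that
    cluster's representative is within the threshold. Otherwise it
    starts a new cluster, which would get its own wrapper.
    """
    clusters = []  # each entry: {"rep": feature set, "members": [page indices]}
    for idx, feats in enumerate(pages):
        best, best_d = None, None
        for c in clusters:
            d = jaccard_distance(c["rep"], feats)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            best["members"].append(idx)
        else:
            clusters.append({"rep": feats, "members": [idx]})
    return [c["members"] for c in clusters]

# Toy pages described by hypothetical tag paths.
pages = [
    {"html/body/div/h1", "html/body/div/img", "html/body/div/span"},
    {"html/body/div/h1", "html/body/div/span"},
    {"html/body/table/tr/td"},
]

for t in (0.2, 0.5, 1.0):
    print(t, cluster_pages(pages, t))
```

A tight threshold yields one cluster per page and a loose one merges everything, which is how the threshold controls the number of wrappers in this sketch.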
Experiments: Data: 1700 product pages from Amazon.com (Amazon); 1000 mixed pages from 10 shopping sites (M10). Target product records: (name, image, price). Settings: 2-fold cross-validation. Evaluation measures: Precision, Recall and F1.
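For reference, the three evaluation measures can be computed from sets of extracted and ground-truth field values as sketched below. The `(page_id, field, value)` record layout and the toy values are assumptions for illustration. The slides do not say how matches were counted.

```python
def precision_recall_f1(extracted, truth):
    """Precision, recall and F1 over sets of (page_id, field, value) items."""
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one page, three target fields, the image was missed.
truth = {
    ("p1", "name", "Widget"),
    ("p1", "image", "widget.jpg"),
    ("p1", "price", "$9.99"),
}
extracted = {
    ("p1", "name", "Widget"),
    ("p1", "price", "$9.99"),
}

p, r, f1 = precision_recall_f1(extracted, truth)
print(p, r, f1)
```

Here both extracted fields are correct, so precision is 1. Two of three target fields were found, so recall is 2/3 and F1 is 0.8.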
Effectiveness Test: Amazon: 44 wrappers, F1: 94.88% vs. 78%. M10:
WPC with Different Thresholds
Stability Test: Objective: evaluate how the choice of the initial training page affects the performance of WPC.
Demo! Microsoft Office Excel 2007 Web Data Add-In is coming soon! Please give it a try in two weeks!
Conclusion: Our system takes a miscellaneous training set as input, conducts template detection and wrapper generation in a single step, and can achieve a joint optimization under the criterion of extraction accuracy. In the near future, we will extend the approach to handle templates containing content strings.
Contacts: Ruihua Song, Shuyi Zheng
Poster No. 11. Looking forward to talking with you at Poster Reception II this evening!
Labeling Cost: shows how many training pages are required to learn wrappers that achieve an F1 above 95%.