1 Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming Shi, Yunbo Cao, and Hang Li
Microsoft Research Asia
1: Xi’an Jiaotong University
2: Peking University
3: University of Science and Technology of China

2 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

3 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

4 Motivation
The title of an HTML document should be defined in the title field
Title fields of HTML documents are not reliable
Data Set | Num. of HTML docs | Empty title fields | Duplicated title fields
TREC | 1,053,111 | 5.8% | 26.9%

5 Can We Extract Title from Body of HTML?

6 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

7 Related Work: Web Information Extraction
Information type: data record, news article, summary
Data structure: DOM tree, block
Approach: rule-based approach vs machine learning based approach
Domain specific vs domain independent
Not clear how to extract title from body

8 Related Work: Web Information Retrieval
Title field, anchor text, and URL are useful for web page retrieval
Not clear whether extracted title is useful

9 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

10 Title Extraction Task
Input: HTML document (web page)
Output: title(s) from body of HTML document
Condition: domain independent
[Figure: an HTML document and its extracted titles; the example text reads "National Weather Service Oxnard Los Angeles Marine Weather Statement"]

11 Spec on HTML Title
Intuitively, title is ‘most conspicuous’ part
Can have 0-2 titles
Must be on top region
Font size, font weight, etc. are noticeable
Can cross several lines, but usually in same format
Cannot be in bullets and list
Cannot be expressions like “under construction”, …
Image is not considered

12 Examples

13 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

14 Title Extraction Processing
Title extraction as information extraction
Using DOM tree
Leaf node containing ‘text’ as unit (instance)
Mainly using format information
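The idea of treating each text-bearing leaf of the DOM tree as one unit can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser`, not the authors' implementation; the names `LeafTextCollector` and `extract_units`, the choice of which tags to skip, and the tolerant handling of unclosed tags are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LeafTextCollector(HTMLParser):
    """Collect every non-empty text node with the tags that enclose it."""

    VOID = {"br", "hr", "img", "meta", "link", "input"}  # tags with no closing tag
    SKIP = {"script", "style", "title"}  # assumption: not part of the visible body

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current position
        self.units = []  # list of (text, tuple_of_enclosing_tags)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not (self.SKIP & set(self.stack)):
            self.units.append((text, tuple(self.stack)))


def extract_units(html):
    """Return the text units of an HTML string in document order."""
    parser = LeafTextCollector()
    parser.feed(html)
    parser.close()
    return parser.units


page = (
    "<html><head><title>x</title></head><body>"
    "<h1>National Weather Service</h1>"
    "<p>Marine <b>Weather</b> Statement</p></body></html>"
)
for text, path in extract_units(page):
    print("/".join(path), "->", text)
```

Each unit keeps its tag path, so format cues such as an enclosing `h1` or `b` remain available to a later feature-extraction step.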

15 DOM Tree HTML document DOM tree

16 General framework for Information Extraction Learning Tool Extraction Tool Model

17 HTML Title Extraction Learning Tool Extraction Tool Perceptron Classifier x: unit Y: title?
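To make the classifier step concrete, here is a minimal sketch of a plain perceptron that labels a unit as title (+1) or not (-1). The slide names only "Perceptron", so the exact variant, the three toy features (normalized font size, bold flag, inside-a-list flag), and the four training samples below are assumptions for illustration.

```python
def train_perceptron(samples, epochs=10):
    """Train on (feature_vector, label) pairs, with label in {+1, -1}."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(w, b, x):
    """Return +1 (title) or -1 (not a title) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical units: [normalized font size, is bold, is inside a list]
toy_samples = [
    ([1.0, 1.0, 0.0], +1),
    ([0.4, 0.0, 1.0], -1),
    ([0.9, 1.0, 0.0], +1),
    ([0.3, 0.0, 0.0], -1),
]
w, b = train_perceptron(toy_samples)
print([predict(w, b, x) for x, _ in toy_samples])
```

On this linearly separable toy set the weights stop changing after the first pass; real unit data would need many more features and samples.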

18 Information Used in Features (1)
Rich format information
  Font size: 1~7 levels
  Font weight: bold face or not
  Font family: Times New Roman, Arial, etc.
  Font style: normal or italic
  Font color: #000000, #FF0000, etc.
  Background color: #FFFFFF, #FF0000, etc.
  Alignment: center, left, right, and justify
Tag information
  H1, H2, …, H6: levels as header
  LI: a listed item
  DIR: a directory list
  A: a link or anchor
  U: an underline
  BR: a line break
  HR: a horizontal ruler
  IMG: an image
  Class name: ‘sectionheader’, ‘title’, ‘titling’, ‘header’, etc.

19 Information Used in Features (2)
Position information
  Position from beginning of body
  Width of unit in page
DOM tree information
  Number of sibling nodes in the DOM tree
  Relations with root node, parent node and sibling nodes in terms of font size change, etc.
  Relations with previous leaf node and next leaf node, in terms of font size change, etc.
Linguistic information
  Length of text: number of characters
  Length of real text: number of alphabetic letters
  Negative words: ‘by’, ‘date’, ‘phone’, ‘fax’, ‘email’, ‘author’, etc.
  Positive words: ‘abstract’, ‘introduction’, ‘summary’, ‘overview’, ‘subject’, ‘title’, etc.
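A few of the listed features can be computed as in the sketch below. The word lists come from the slide; everything else is assumed for illustration, including the function name `unit_features`, the dictionary keys of a unit (`text`, `tags`, `font_size`, `bold`), and the default font size of 3.

```python
NEGATIVE_WORDS = {"by", "date", "phone", "fax", "email", "author"}
POSITIVE_WORDS = {"abstract", "introduction", "summary", "overview", "subject", "title"}


def unit_features(unit):
    """Map one text unit (a dict, hypothetical layout) to a feature dict."""
    text = unit["text"]
    tags = set(unit.get("tags", ()))
    # Lowercased words with surrounding punctuation stripped.
    words = {w.strip(".,:;!?()").lower() for w in text.split()}
    # Header level: 1..6 if the unit sits inside H1..H6, else 0.
    levels = [int(t[1]) for t in tags if len(t) == 2 and t[0] == "h" and t[1] in "123456"]
    return {
        "font_size": unit.get("font_size", 3),  # assumed default on the 1~7 scale
        "is_bold": int(bool(unit.get("bold", False)) or "b" in tags),
        "header_level": min(levels) if levels else 0,
        "in_list": int("li" in tags),
        "text_length": len(text),  # number of characters
        "real_text_length": sum(c.isalpha() for c in text),  # alphabetic letters only
        "has_negative_word": int(bool(words & NEGATIVE_WORDS)),
        "has_positive_word": int(bool(words & POSITIVE_WORDS)),
    }


heading = {"text": "Marine Weather Statement", "tags": ("html", "body", "h1"),
           "font_size": 6, "bold": True}
contact = {"text": "Phone: 555-0100", "tags": ("html", "body", "ul", "li")}
print(unit_features(heading))
print(unit_features(contact))
```

Position and DOM-relation features (sibling counts, font size change relative to neighbours) would need the full tree and are left out of this sketch.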

20 Use of Extracted Title in Web Page Retrieval
Employing BM25 framework
BasicField: texts in body and title are used
BasicField+Title
BasicField+ExtTitle
BasicField+CombTitle
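The slide does not say how the extra title field enters the BM25 score, so the following is only a sketch under assumptions: it uses the common BM25 formula with a non-negative IDF variant, scores the body field and a title field separately, and adds them with a hypothetical `title_weight`. The function names and the toy documents are likewise invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def combined_scores(query, body_docs, title_docs, title_weight=1.0):
    """Assumed combination: body score plus weighted (extracted) title score."""
    body = bm25_scores(query, body_docs)
    title = bm25_scores(query, title_docs)
    return [bs + title_weight * ts for bs, ts in zip(body, title)]


body_docs = [
    ["marine", "weather", "statement", "for", "los", "angeles"],
    ["contact", "us", "by", "phone"],
    ["weather", "links"],
]
title_docs = [["marine", "weather", "statement"], [], ["links"]]
query = ["marine", "weather"]
print(combined_scores(query, body_docs, title_docs))
```

An empty list in `title_docs` stands for a page where no title was extracted; it simply contributes no title score.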

21 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

22 Data for Title Extraction Experiments
Name | Num. of HTML Docs | Title labeled | Docs having titles
TREC | about 1 million | 4,258 | 78.3%
MS | about 1 million | 4,137 | 63.8%

23 Title Extraction Results (TREC, Cross-Validation)
Approach | Precision | Recall | F1-Score | Accuracy
Largest font (baseline) | 0.528 | 0.643 | 0.580 | 0.523
First unit | 0.327 (-38.1%) | 0.402 (-37.5%) | 0.360 (-37.8%) | 0.327 (-37.5%)
Title-field | 0.270 (-48.8%) | 0.324 (-49.6%) | 0.295 (-49.1%) | 0.261 (-50.0%)
Perceptron | 0.698 (+32.3%) | 0.703 (+9.3%) | 0.701 (+20.9%) | 0.698 (+33.5%)

24 Title Extraction Results (MS, Cross-Validation)
Approach | Precision | Recall | F1-Score | Accuracy
Largest font (baseline) | 0.584 | 0.840 | 0.689 | 0.582
First unit | 0.606 (+3.7%) | 0.875 (+4.1%) | 0.716 (+3.9%) | 0.606 (+4.1%)
Title-field | 0.656 (+12.3%) | 0.834 (-0.7%) | 0.735 (+6.6%) | 0.673 (+15.6%)
Perceptron | 0.910 (+55.7%) | 0.919 (+9.4%) | 0.914 (+32.6%) | 0.909 (+56.1%)
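For readers checking the F1-Score column, F1 is the harmonic mean of precision and recall. The short sketch below applies the standard formula to two rows of the MS results; the function name is an assumption, and because the slide reports values rounded to three decimals, recomputing other rows from the rounded precision and recall can differ from the slide in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Inputs taken from the MS cross-validation results.
baseline_f1 = f1_score(0.584, 0.840)    # Largest font (baseline)
perceptron_f1 = f1_score(0.910, 0.919)  # Perceptron
print(round(baseline_f1, 3), round(perceptron_f1, 3))
```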

25 Title Extraction: Feature Contribution MS

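26 Title Extraction: Domain Adaptation
Training Set | Test Set | Precision | Recall | F1-Score | Accuracy
MS | TREC | 0.698 | 0.615 | 0.654 | 0.642
TREC | MS | 0.852 | 0.883 | 0.867 | 0.871
TREC | TREC (cross-validation) | 0.698 | 0.703 | 0.701 | 0.698
MS | MS (cross-validation) | 0.910 | 0.919 | 0.914 | 0.909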

27 Query Data for Retrieval Experiments
Year | Task | Num. of queries
2002 | NP | 150
2003 | TD | 50
2003 | HP | 150
2003 | NP | 150
2004 | TD | 75
2004 | HP | 75
2004 | NP | 75

28 Web Page Retrieval Results (TREC) TREC-2003 NP

29 Web Page Retrieval Results (TREC) TREC-2003 HP

30 Web Page Retrieval Results (TREC) 2003 TD

31 Average Precision for Each Method
Year | Task | BasicField | +Title | +ComTitle
2003 | TD | 0.528 | 0.606 | 0.650 (>>) (+23.1%)
2003 | HP | 0.302 | 0.397 (>>) (+31.4%) | 0.435 (>>) (+44.0%)
2003 | NP | 0.096 | 0.127 (+32.3%) | 0.145 (+51.0%)

32 Outline Motivation Related work Problem description Our approach Experimental results Conclusions

33 Conclusions
Title fields of HTML documents are not reliable
We propose conducting title extraction from bodies of HTML documents
Construct domain-independent model using machine learning and format features
Use of extracted titles can help improve precision of web page retrieval, particularly TREC name page finding

34 Thanks!

