Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Yunhua Hu 1, Guomao Xin 2, Ruihua Song, Guoping Hu 3, Shuming.


Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
Yunhua Hu (1), Guomao Xin (2), Ruihua Song, Guoping Hu (3), Shuming Shi, Yunbo Cao, and Hang Li
Microsoft Research Asia
(1) Xi'an Jiaotong University; (2) Peking University; (3) University of Science and Technology of China

Outline: Motivation, Related work, Problem description, Our approach, Experimental results, Conclusions

Motivation
The title of an HTML document should be defined in the title field
Title fields of HTML documents are not reliable

Data Set   Num. of HTML docs   Empty title fields   Duplicated title fields
TREC       1,053,111           5.8%                 26.9%
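The two percentages in the table can be reproduced on any collection with a short count. The sketch below is illustrative only: the slides do not say how "duplicated" was defined, so it assumes a title field counts as duplicated when the same non-empty string (case-insensitive, trimmed) occurs in more than one document. The function name `title_field_stats` is hypothetical.

```python
from collections import Counter

def title_field_stats(titles):
    """Return (empty_share, duplicated_share) for a list of title-field strings.

    Assumption: a title is "duplicated" if its normalized text appears in
    more than one document. None is treated as an empty title field.
    """
    normalized = [(t or "").strip().lower() for t in titles]
    total = len(normalized)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(t for t in normalized if t)
    empty = sum(1 for t in normalized if not t)
    duplicated = sum(1 for t in normalized if t and counts[t] > 1)
    return empty / total, duplicated / total

# Toy collection: two empty fields, two copies of "home", one unique title.
sample = ["Home", "home", "", "Marine Weather Statement", None]
empty_share, duplicated_share = title_field_stats(sample)
```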

Can We Extract Title from Body of HTML?

Related Work: Web Information Extraction
Information type: data record, news article, summary
Data structure: DOM tree, block
Approach: rule-based vs. machine-learning-based
Domain specific vs. domain independent
Not clear how to extract title from body

Related Work: Web Information Retrieval
Title field, anchor text, and URL are useful for web page retrieval
Not clear whether extracted title is useful

Problem description

Title Extraction Task
Input: HTML document (web page)
Output: title(s) from body of HTML document
Condition: domain independent
(Figure: an HTML document and its extracted titles, "National Weather Service Oxnard Los Angeles Marine Weather Statement")

Spec on HTML Title
Intuitively, the title is the 'most conspicuous' part
A document can have 0-2 titles
Must be in the top region
Font size, font weight, etc. are noticeable
Can cross several lines, but usually in the same format
Cannot be in bullets or lists
Cannot be expressions like "under construction", …
Images are not considered

Examples

Our approach

Title Extraction Processing
Title extraction as information extraction
Using DOM tree
Leaf node containing 'text' as unit (instance)
Mainly using format information

DOM Tree (figure: an HTML document and its DOM tree)
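To make the "leaf node containing text as unit" idea concrete, here is a minimal sketch that walks a page with Python's standard `html.parser` and collects each non-empty text leaf inside `body` together with the path of open tags above it. It is not the authors' tool: the class name `TextUnitCollector`, the set of void tags, and the path representation are all assumptions made for illustration, and real pages would need a more forgiving parser.

```python
from html.parser import HTMLParser

class TextUnitCollector(HTMLParser):
    """Collect (tag_path, text) pairs for text leaves inside <body>."""

    # Tags that never have a closing tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.units = []   # collected (path, text) units

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        in_body = "body" in self.stack
        hidden = "script" in self.stack or "style" in self.stack
        if text and in_body and not hidden:
            self.units.append(("/".join(self.stack), text))

def text_units(html):
    parser = TextUnitCollector()
    parser.feed(html)
    parser.close()
    return parser.units

page = ("<html><head><title>Untitled</title></head><body>"
        "<h1>Marine Weather Statement</h1>"
        "<p>Issued by <b>NWS</b> Oxnard</p></body></html>")
units = text_units(page)
```

Each unit keeps its tag path, so format and tag features (for example, whether the unit sits under `h1` or `li`) can be read off the path later.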

General framework for Information Extraction (figure: Learning Tool, Extraction Tool, Model)

HTML Title Extraction (figure: Learning Tool, Extraction Tool, Perceptron Classifier; input x: unit, output y: title?)
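The slide names a perceptron classifier that maps a unit x to a yes/no decision y ("title?"). As a rough illustration, the sketch below trains the textbook perceptron on toy binary features. The slides do not say which perceptron variant or features were used in training, so the update rule, the three features, and the function names here are assumptions.

```python
def train_perceptron(samples, epochs=10):
    """Train a basic perceptron.

    samples: list of (feature_vector, label) pairs with label +1 (title)
    or -1 (not a title). Returns the weight vector and the bias.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def is_title(w, b, x):
    """Classify one unit: True if the perceptron score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Hypothetical features per unit: [has_largest_font, is_bold, in_list_item]
training = [
    ([1, 1, 0], +1),
    ([1, 0, 0], +1),
    ([0, 0, 1], -1),
    ([0, 1, 1], -1),
    ([0, 0, 0], -1),
]
w, b = train_perceptron(training)
```

On this linearly separable toy set the weights stop changing after a few passes, so every training unit is classified correctly.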

Information Used in Features (1)
Rich format information
  Font size: levels 1 to 7
  Font weight: bold face or not
  Font family: Times New Roman, Arial, etc.
  Font style: normal or italic
  Font color: #000000, #FF0000, etc.
  Background color: #FFFFFF, #FF0000, etc.
  Alignment: center, left, right, and justify
Tag information
  H1, H2, …, H6: header levels
  LI: a list item
  DIR: a directory list
  A: a link or anchor
  U: underline
  BR: a line break
  HR: a horizontal rule
  IMG: an image
  Class name: 'sectionheader', 'title', 'titling', 'header', etc.

Information Used in Features (2)
Position information
  Position from beginning of body
  Width of unit in page
DOM tree information
  Number of sibling nodes in the DOM tree
  Relations with root node, parent node and sibling nodes in terms of font size change, etc.
  Relations with previous leaf node and next leaf node in terms of font size change, etc.
Linguistic information
  Length of text: number of characters
  Length of real text: number of alphabetic letters
  Negative words: 'by', 'date', 'phone', 'fax', ' ', 'author', etc.
  Positive words: 'abstract', 'introduction', 'summary', 'overview', 'subject', 'title', etc.
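A few of the listed features are simple enough to sketch directly. The example below turns one unit into a small feature dictionary covering some format, tag, and linguistic information. The input layout (a dict with `text`, `tags`, `font_size`, `bold`, `align`) and every feature name are hypothetical. The word lists contain only the words that survive in the slide, and the default font size of 3 is an assumption.

```python
NEGATIVE_WORDS = {"by", "date", "phone", "fax", "author"}
POSITIVE_WORDS = {"abstract", "introduction", "summary",
                  "overview", "subject", "title"}
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def unit_features(unit):
    """Map one text unit (a dict, hypothetical layout) to named features."""
    text = unit["text"]
    # Lowercased words with surrounding punctuation stripped.
    words = {w.strip(".,:;!?").lower() for w in text.split()}
    tags = set(unit.get("tags", []))
    return {
        "font_size": unit.get("font_size", 3),             # level 1 to 7
        "is_bold": int(unit.get("bold", False)),
        "is_centered": int(unit.get("align") == "center"),
        "in_header_tag": int(bool(tags & HEADER_TAGS)),
        "in_list_item": int("li" in tags),
        "length_chars": len(text),                          # length of text
        "length_letters": sum(ch.isalpha() for ch in text), # length of real text
        "has_negative_word": int(bool(words & NEGATIVE_WORDS)),
        "has_positive_word": int(bool(words & POSITIVE_WORDS)),
    }

heading = unit_features({"text": "Marine Weather Statement",
                         "tags": ["body", "h1"], "font_size": 6,
                         "bold": True, "align": "center"})
byline = unit_features({"text": "Posted by: J. Smith",
                        "tags": ["body", "ul", "li"]})
```

The heading unit scores high on the format features, while the byline is flagged by the list-item tag and the negative word 'by'.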

Use of Extracted Title in Web Page Retrieval
Employing BM25 framework
BasicField: texts in body and title are used
BasicField+Title
BasicField+ExtTitle
BasicField+CombTitle
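The slide only names the BM25 framework and the four field settings, so the following is a sketch under stated assumptions rather than the authors' ranking function. It scores the basic field (title plus body text) with standard Okapi BM25, using a non-negative idf variant, and optionally adds a weighted BM25 score for one extra field. The linear combination, the `weight` parameter, and the page keys `title`, `body`, and `ext_title` are all hypothetical. How CombTitle is built is not stated in the slides, so it is not modelled here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def score_pages(pages, query, extra_field=None, weight=1.0):
    """BasicField score, plus an optional weighted score for one extra field."""
    tokenize = lambda s: s.lower().split()
    q = tokenize(query)
    basic = [tokenize(p["title"] + " " + p["body"]) for p in pages]
    totals = bm25_scores(q, basic)
    if extra_field is not None:
        extra = [tokenize(p[extra_field]) for p in pages]
        totals = [t + weight * e
                  for t, e in zip(totals, bm25_scores(q, extra))]
    return totals

pages = [
    {"title": "", "body": "forecast for the coast today",
     "ext_title": "Marine Weather Statement"},
    {"title": "Untitled",
     "body": "weather links and marine statement archive weather",
     "ext_title": "Archive"},
]
base = score_pages(pages, "marine weather statement")
with_ext = score_pages(pages, "marine weather statement",
                       extra_field="ext_title")
```

In this toy collection the first page has an empty title field and no query words in its body, so it only becomes retrievable once the extracted title is added as a field.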

Experimental results

Data for Title Extraction Experiments

Name   Num. of HTML Docs   Title labeled   Docs having titles
TREC   about 1 million     4,…             …%
MS     about 1 million     4,…             …%

Title Extraction Results (TREC, Cross-Validation)

Approach                  Precision   Recall     F1-Score   Accuracy
Largest font (baseline)
First unit                (-38.1%)    (-37.5%)   (-37.8%)   (-37.5%)
Title-field               (-48.8%)    (-49.6%)   (-49.1%)   (-50.0%)
Perceptron                (+32.3%)    (+9.3%)    (+20.9%)   (+33.5%)

Title Extraction Results (MS, Cross-Validation)

Approach                  Precision   Recall     F1-Score   Accuracy
Largest font (baseline)
First unit                (+3.7%)     (+4.1%)    (+3.9%)    (+4.1%)
Title-field               (+12.3%)    (-0.7%)    (+6.6%)    (+15.6%)
Perceptron                (+55.7%)    (+9.4%)    (+32.6%)   (+56.1%)
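For reference, the four reported measures can be computed from per-unit decisions as sketched below. The slides do not define them, so this assumes the usual definitions over units: precision and recall over units predicted or labeled as titles, F1 as their harmonic mean, and accuracy as the share of units labeled correctly. The authors' accuracy may have been defined differently (for instance per document). The function name `evaluate` is hypothetical.

```python
def evaluate(gold, predicted):
    """Precision, recall, F1 and accuracy from parallel lists of booleans.

    gold[i] is True if unit i is a labeled title; predicted[i] is True if
    the extractor marked unit i as a title.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f1, accuracy

gold = [True, False, False, True, False]
predicted = [True, True, False, False, False]
precision, recall, f1, accuracy = evaluate(gold, predicted)
```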

Title Extraction: Feature Contribution MS

Title Extraction: Domain Adaptation
Columns: Training Set, Test Set, Precision, Recall, F1-Score, Accuracy
Rows: MS / TREC; TREC / MS; TREC; MS

Query Data for Retrieval Experiments

Year   Task   Num. of queries
2002   NP
       TD     50
       HP     150
       NP
       TD     75
       HP     75
       NP     75

Web Page Retrieval Results (TREC) TREC-2003 NP

Web Page Retrieval Results (TREC) TREC-2003 HP

Web Page Retrieval Results (TREC) TREC-2003 TD

Average Precision for Each Method
Columns: Year, Task, BasicField, +Title, +CombTitle
2003 TD: (>>) (+23.1%)
2003 HP: (>>) (+31.4%), (>>) (+44.0%)
2003 NP: (+32.3%), (+51.0%)

Conclusions

Title fields of HTML documents are not reliable
We propose conducting title extraction from bodies of HTML documents
Construct a domain-independent model using machine learning and format features
Use of extracted titles can help improve precision of web page retrieval, particularly for TREC name page finding

Thanks!