Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.

Presentation transcript:

Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval. Microsoft Research Asia. Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, and Hang Li. SIGIR 2005

INTRODUCTION Titles are the ‘names’ of documents and are useful for document processing. About 33.5% of the title fields of HTML documents in the TREC data set are bogus. Question: Can we extract titles from the bodies and use them in web applications?

INTRODUCTION We take a machine learning approach and employ “Perceptron with Uneven Margins”. We propose a specification on HTML titles. The specification is defined mainly based on format information. We also propose a new retrieval method for web page retrieval. The Okapi-based method combines text, title, and extracted title. Title fields, anchor texts, and URLs of web pages can enhance web page retrieval.

Example of a page with two titles: one title is “National Weather Service Oxnard” and the other is “Los Angeles Marine Weather Statement”.

Specification on HTML titles The specification is as follows:
1. Number
- An HTML document can have two titles, one title, or no title.
2. Position
- Titles must be in the top region (i.e., within the top ¼ region);
- Titles cannot be in a narrow pane (i.e., with a width of less than ¼ of the entire page).
3. Appearance
- The font sizes of titles are usually the largest and second largest;
- Titles are conspicuous in terms of font family, font weight, font color, font style, alignment, background color, and text length.

Specification on HTML titles
4. Neighbor
- Titles consist of consecutive lines in the same font size (i.e., subtitles in smaller font sizes should be ignored);
- If there exist two titles, then the two titles usually are in two different blocks. Lines, links, and images can be separators between the blocks.
5. Content
- Titles cannot be a link, time expression, address, etc.;
- Titles cannot be ‘under construction’, ‘last updated’, etc.;
- Titles can be the expressions immediately after ‘Title:’ and ‘Subject:’.
6. Other
- Titles in images are not considered.

TITLE EXTRACTION METHOD Our machine learning approach consists of two phases: training and extraction. In preprocessing, we extract units from the input document. The output of preprocessing is a sequence of units. A unit generally corresponds to a line in the HTML document and contains not only content information but also format information.

TITLE EXTRACTION METHOD We parse the body of the HTML document and construct a DOM (Document Object Model) tree. We take all the leaf nodes that contain text in the DOM tree as units. In training, we take labeled units (titles and others) in the sequences as training data and construct a model for identifying whether a unit is a title.
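
The preprocessing step can be illustrated with a minimal sketch, assuming Python's standard html.parser module stands in for a full DOM parser. The names UnitExtractor and extract_units are hypothetical, and a unit here records only its text and the path of enclosing tags, not the font and layout information that the slides describe.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class UnitExtractor(HTMLParser):
    """Collects text-bearing leaf nodes as units, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, root first
        self.units = []   # each unit: {"path": tuple of tags, "text": str}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and "script" not in self.stack and "style" not in self.stack:
            self.units.append({"path": tuple(self.stack), "text": text})


def extract_units(html):
    """Return the sequence of text units found in an HTML string."""
    parser = UnitExtractor()
    parser.feed(html)
    parser.close()
    return parser.units
```

Calling extract_units on a page yields one unit per text node; a real system would attach rendered font size, alignment, and similar attributes to each unit before feature extraction.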

TITLE EXTRACTION METHOD In extraction, we employ the model to identify each unit in the sequence to find whether it is a title. The model assigns a score to each unit. In post-processing of extraction, we extract titles using heuristics. We choose the units with the highest scores as the first title, and then choose the units with the second highest scores as the second title, provided that the scores are larger than zero.
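
The post-processing heuristic might look like the following sketch. The function name select_titles is hypothetical, and two details are assumptions rather than statements from the slides: units are grouped into one title only when their scores are exactly equal, and the grouped lines are joined with a space.

```python
def select_titles(scored_units, max_titles=2):
    """Pick up to max_titles titles from (text, score) pairs in document order.

    Units sharing the highest positive score form the first title, units
    sharing the second highest positive score form the second title.
    """
    positive = [(text, score) for text, score in scored_units if score > 0]
    # Distinct positive score levels, best first.
    levels = sorted({score for _, score in positive}, reverse=True)[:max_titles]
    titles = []
    for level in levels:
        lines = [text for text, score in positive if score == level]
        titles.append(" ".join(lines))
    return titles
```

When no unit scores above zero the function returns an empty list, which matches the "no title" case allowed by the specification.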

Model We describe the model in a general framework. The input is sequences of units x1 ~ xn with aligned sequences of labels y1~yn. Labels represent being or not being the target of extraction, i.e., title. Suppose that X1~Xn are random variables denoting a sequence of units, and Y1~Yn are random variables denoting a sequence of labels.

Model If the Ys are independent from each other, then we have P(Y1, ..., Yn | X1, ..., Xn) = P(Y1 | X1) ... P(Yn | Xn). Each conditional probability model is a classifier. In this paper, as the classifier we employ an improved variant of Perceptron, called Perceptron with Uneven Margins. This version of Perceptron works well especially when the number of positive training instances and the number of negative training instances differ greatly, which is exactly the case for the current problem.
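
As a rough illustration of the classifier, here is a sketch of a perceptron with uneven margins written from the commonly published form of the algorithm: an example triggers an update whenever its signed score does not exceed the margin for its class. The function names and the default values of tau_pos, tau_neg, eta, and epochs are assumptions chosen for the example; the slides do not give the settings used in the experiments.

```python
def train_paum(X, y, tau_pos=1.0, tau_neg=0.0, eta=1.0, epochs=100):
    """Train a linear classifier with separate margins for the two classes.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    A larger tau_pos pushes the boundary away from the rare positive class.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    r2 = max(sum(v * v for v in x) for x in X)  # squared radius of the data
    for _ in range(epochs):
        updates = 0
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            tau = tau_pos if label == 1 else tau_neg
            if margin <= tau:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label * r2
                updates += 1
        if updates == 0:  # every example clears its margin
            break
    return w, b


def score(w, b, x):
    """Signed score of a unit; positive means 'title' in this sketch."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

With binary features, as in the feature set described in the slides, each x is a vector of zeros and ones and score gives the per-unit value used in post-processing.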

Features We consider the following information in the design of features.
1. Rich format information
- Font size, Font weight, Font family, Font style, Font color, Background color, Alignment
2. Tag information
- HTML tags (the specific tag names are not preserved in this transcript)
3. Position information
- Position from the beginning of body
4. DOM tree information
- Number of sibling nodes,
- Relations with the root node, parent node and sibling nodes in terms of font size change
5. Linguistic information
- Length of text, Positive words, Negative words, etc.

Features With the information above, we create four types of features which can help identify the position (Pos), appearance (App), neighbor (Nei), and content (Con) of a title. There are in total 245 binary features.

DOCUMENT RETRIEVAL METHOD Our method takes BM25 as the basic function. BasicField: In this method, a document is represented by all the texts in the title and body. Given a query, we employ BM25 to calculate the score of each document with respect to the query. In the scoring function, dl is document length, and avdl is average document length; k1 and b are parameters. We set k1 = 1.1, b = 0.7.
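
The scoring formula itself is not preserved in this transcript, so the sketch below uses a common textbook form of BM25 rather than the exact variant on the slide. The function name bm25_scores, the whitespace tokenizer, and the smoothed idf term are assumptions; k1 = 1.1 and b = 0.7 are the values stated above.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.1, b=0.7):
    """Score every document in docs against query with a standard BM25 form."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avdl = sum(len(d) for d in tokenized) / n_docs        # average document length
    df = Counter(term for d in tokenized for term in set(d))  # document frequency
    scores = []
    for d in tokenized:
        tf = Counter(d)
        dl = len(d)                                        # document length
        norm = k1 * ((1 - b) + b * dl / avdl)              # length normalization
        total = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Smoothed idf (an assumption) keeps term weights non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            total += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(total)
    return scores
```

For the BasicField representation, each entry of docs would be the concatenated title and body text of one page.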

DOCUMENT RETRIEVAL METHOD BasicField+CombTitle: We combine the extracted title field and the title field and denote the result as ‘CombTitle’.

DOCUMENT RETRIEVAL METHOD BasicField+ExtTitle: We employ a method similar to BasicField+CombTitle, in which instead of using the combined title field, we use the extracted title field. BasicField+Title: This is a method similar to BasicField+CombTitle, in which instead of using the combined title field, we use the title field.
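
The slides do not preserve how the field scores are combined. Since the experiments describe the baseline as the case where alpha equals 1, one plausible reading is a linear interpolation between the BasicField score and the title-field score, sketched below. The interpolation form and the names combined_score and rank are assumptions, not the authors' stated formula.

```python
def combined_score(basic_score, title_score, alpha):
    """Interpolate between the BasicField score and a title-field score.

    alpha = 1 reduces to BasicField alone, i.e. the baseline.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * basic_score + (1.0 - alpha) * title_score


def rank(doc_scores, alpha):
    """Rank document ids given {doc_id: (basic_score, title_score)}."""
    return sorted(doc_scores,
                  key=lambda d: combined_score(*doc_scores[d], alpha),
                  reverse=True)
```

The same two functions cover all three variants: only the text used to compute title_score changes (title field, extracted title, or the combination of both).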

Data Sets As data sets, we used the .GOV data in the TREC Web Track and the data from an intranet of Microsoft (MS). There are 1,053,111 web pages in TREC and about 1,000,000 web pages in MS. We manually annotated titles in randomly selected HTML documents. There were 3,332 HTML documents with annotated titles in TREC, and there were 2,641 HTML documents with annotated titles in MS.

Evaluation Measures for Extraction We used ‘precision’, ‘recall’, ‘F1-score’, and ‘accuracy’ in the evaluation of title extraction results. In the evaluation, if the extracted title approximately matches the annotated title, then we view it as a correct extraction. We define the approximate match between two titles t1 and t2 in terms of d(t1,t2), the edit distance between t1 and t2, and l1 and l2, the lengths of t1 and t2 respectively.
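
The matching criterion is missing from this transcript, so the following is only a sketch of one way to build an approximate match from the quantities named above. Dividing the edit distance by the longer length and the threshold of 0.3 are both hypothetical choices, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approx_match(t1, t2, threshold=0.3):
    """True when the length-normalized edit distance is within threshold.

    Both the normalization by max(l1, l2) and the threshold are assumptions.
    """
    longest = max(len(t1), len(t2))
    if longest == 0:
        return True
    return edit_distance(t1, t2) / longest <= threshold
```

Precision, recall, and F1 then follow by counting an extracted title as correct whenever approx_match returns True against an annotated title.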

Evaluation of Title Fields We found that 33.5% of the HTML documents in the TREC data set have bogus titles. There are three cases:
1. Empty title field: There are 60,524 pages (5.8%).
2. ‘Untitled’ title field: There are 4,964 pages (0.8%) which have “untitled” or “untitled document” in their title fields.
3. Duplicated title field: 282,826 pages (26.9%) fall into this type. Many web sites contain web pages sharing the same title field but having different contents.
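
The three cases can be checked mechanically, as in this simplified sketch. The function name classify_title_fields is hypothetical, and the duplicate test is an assumption: it flags any title string that occurs more than once in the input, without checking that the pages come from the same site or differ in content.

```python
from collections import Counter

UNTITLED = {"untitled", "untitled document"}


def classify_title_fields(titles):
    """Label each title field as 'empty', 'untitled', 'duplicated', or 'ok'."""
    normalized = [t.strip().lower() for t in titles]
    counts = Counter(n for n in normalized if n)
    labels = []
    for n in normalized:
        if not n:
            labels.append("empty")
        elif n in UNTITLED:
            labels.append("untitled")
        elif counts[n] > 1:
            labels.append("duplicated")
        else:
            labels.append("ok")
    return labels
```

Because 'untitled' is tested before duplication, repeated "Untitled Document" fields are counted once under the second case rather than under the third.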

Title Extraction Experiment

The performance of title extraction in MS is better than that in TREC. We found that the HTML documents in MS have fewer patterns than those in TREC and that is the reason for the higher performance.

Web Retrieval Experiment In the experiment, we used the queries and relevance judgments of the Web Tracks in TREC-2002, TREC-2003, and TREC-2004. The queries have been classified into three types, i.e., named-page finding (NP), homepage finding (HP), and topic distillation (TD). The TD queries of TREC-2002 were not used because their specification differs from that in TREC-2003 and TREC-2004.

Web Retrieval Experiment The baseline method is BasicField, and its performance is obtained when alpha equals 1. The results indicate that the best of BasicField+CombTitle outperforms the baseline method in all three tasks. This is also true for BasicField+Title. The results indicate that the extracted titles are useful for web page retrieval, especially for NP, and that it is better to employ BasicField+CombTitle.

Web Retrieval Experiment

CONCLUSION We have proposed a specification for HTML titles. Our method can work significantly better than the baseline methods for title extraction. We can construct a domain-independent model for title extraction. Using extracted titles can indeed improve web page retrieval, particularly named-page finding in TREC. Many types of format information are useful for title extraction.