Block-based Web Search Deng Cai 1*, Shipeng Yu 2*, Ji-Rong Wen *, Wei-Ying Ma * SIGIR ’ 04 * Microsoft Research Asia Beijing, China {jrwen,

Presentation transcript:

Block-based Web Search Deng Cai 1*, Shipeng Yu 2*, Ji-Rong Wen *, Wei-Ying Ma * SIGIR ’ 04 * Microsoft Research Asia Beijing, China {jrwen, 1 Tsinghua University Beijing, China 2 Institute for Computer Science University of Munich

Introduction
Passage retrieval is a research topic with a long history in IR. It matters particularly when documents contain multiple drifting subjects. The content of a web page is usually diverse and encompasses multiple regions with unrelated topics. We argue that the characteristics of web pages make passage retrieval a more effective mechanism for IR: a highly relevant region may be obscured by the low overall relevance of the page. It is therefore necessary to segment a web page into semantically independent units.

Introduction
In document retrieval the similarity measure is sensitive to document length; some measures (e.g. the cosine measure) favor short documents. Web pages suffer from the same problem. We compare four page segmentation approaches for improving web IR, and show that, unlike the fixed-window approach, semantic partitioning can be easier and more accurate.

Web Page Segmentation
Passages can be categorized into three classes:
Discourse passages rely on the logical structure of documents, marked by punctuation.
Semantic passages partition a document into topics according to its semantic structure.
Fixed-length passages are defined to contain a fixed number of words.
Web pages have new characteristics. They have a two-dimensional logical structure: each region can have relationships with regions in four directions. Instead of "passage", we prefer the term "block" to denote a region of a web page.

Web Page Segmentation
There has been some research on web page segmentation:
Traditional passages: the results are not encouraging.
DOM: not targeted at web IR and thus difficult to evaluate.
We introduce VIPS (Vision-based Page Segmentation), a method that uses visual cues to obtain a more accurate content structure at the semantic level. VIPS blocks still have the varying-length problem, so we also introduce a combined algorithm that takes advantage of both visual layout and length normalization.

The Four Methods
Fixed-length Page Segmentation (FixedPS): for web documents it is identical to the traditional window approach, except that all HTML tags are removed.
DOM-based Page Segmentation (DomPS): partitions pages based on their pre-defined syntactic structure, i.e., the HTML tags. There is no consistent way to do this, and little work has been done on it for web IR. The DOM is still a linear structure, so visually adjacent blocks may be far from each other in it. The DOM also reflects presentation more than content.

The Four Methods
Vision-based Page Segmentation (VIPS): a closely packed block within a web page is very likely about a single semantic topic. The blocks obtained are based on the semantic structure of web pages. VIPS discards traditional content analysis and produces blocks based on visual cues. The DOM structure and visual information are used iteratively to generate a vision-based content structure.

The Four Methods
A Combined Approach (CombPS): on the WT10g dataset, the distribution of block lengths produced by VIPS is very diverse. Since the fixed-length window deals consistently with the varying-length problem, we propose this combined approach. After applying VIPS, apply fixed-length window extraction within each visual block: the first window starts at the first word of the block, and each subsequent window half-overlaps the preceding one, until the end of the block. Visual blocks shorter than the pre-defined length are output directly.
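The second step of CombPS can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, whitespace tokenization, and the representation of a visual block as a plain string are assumptions. The default window of 200 words matches the experiment setup.

```python
def half_overlapping_windows(words, window=200):
    """Split one visual block (a list of words) into fixed-length windows.

    Each window starts half a window after the previous one, so
    consecutive windows overlap by half. Short blocks are returned whole.
    """
    if len(words) <= window:
        return [words]  # block shorter than the window: output directly
    step = window // 2
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reaches the end of the block
        start += step
    return windows


def comb_ps(visual_blocks, window=200):
    """Apply fixed-length windowing inside each visual block.

    `visual_blocks` is assumed to be a list of strings, one per block
    produced by a vision-based segmenter (not implemented here).
    """
    result = []
    for text in visual_blocks:
        words = text.split()  # assumed tokenization: whitespace
        if words:
            result.extend(half_overlapping_windows(words, window))
    return result
```

Because windows never cross a visual block boundary, each output block stays within one visually coherent region while its length is bounded by the window size.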

VIPS: a Vision-based Page Segmentation Algorithm Deng Cai Shipeng Yu Ji-Rong Wen Wei-Ying Ma Microsoft Research

Visual Block Extraction
Judge whether a DOM node can be divided based on:
The properties of the DOM node itself: HTML tag, background color, size, shape, …
The properties of the children of the DOM node: same as above; the number of different kinds of children is also a consideration.
Definitions:
Inline node: a node with an inline text HTML tag, such as …
Line-break node: any other node.
Valid node: a node that can be seen in the browser (width and height not zero).
Text node: a node corresponding to free text.
Virtual text node: an inline node with only text node (and virtual text node) children.
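The node definitions above translate directly into small predicates. The sketch below is illustrative only: the `Node` class is a hypothetical stand-in for a rendered DOM node, and the `INLINE_TAGS` set is an assumed example list, since the transcript does not preserve the tags named on the slide. Treating a line-break node as any tagged node that is not inline is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed example set of inline text tags; not taken from the slides.
INLINE_TAGS = {"b", "i", "em", "strong", "span", "a", "font"}


@dataclass
class Node:
    """Hypothetical rendered DOM node; tag=None marks free text."""
    tag: Optional[str]
    width: int = 0
    height: int = 0
    children: List["Node"] = field(default_factory=list)


def is_text(node):
    return node.tag is None


def is_inline(node):
    return node.tag in INLINE_TAGS


def is_line_break(node):
    # "Others": here, any tagged node that is not an inline node.
    return not is_text(node) and not is_inline(node)


def is_valid(node):
    # Visible in the browser: width and height are both non-zero.
    return node.width > 0 and node.height > 0


def is_virtual_text(node):
    # Inline node whose children are all text or virtual text nodes.
    return (
        is_inline(node)
        and bool(node.children)
        and all(is_text(c) or is_virtual_text(c) for c in node.children)
    )
```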

Visual Block Extraction
Important cues for producing heuristics: tag cue, color cue, text cue, and size cue. A DoC (degree of coherence) is assigned to each block at the same time.
Example walk-through:
When … is met, trace into the node (R2: If the DOM node has only one valid child and the child is not a text node, divide this node).
Only three of the five children are valid. The node is split (R8: If the background color of this node is different from that of one of its children, divide this node; that child node will not be divided in this round).
The second and fourth children of the node are not valid (R1: If the node is not a text node and it has no valid children, then this node is cut).
The third and fifth children of … will not be divided in this round (R11: If the previous sibling node has not been divided, do not divide this node).

Visual Separator Detection
A block that is contained in, crosses, or covers a separator causes the separator to be split, updated, or removed, respectively. The weight of a separator is assigned based on the visual difference between its neighboring blocks:
Distance between the blocks on the two sides of the separator.
Overlap with certain HTML tags (e.g., …).
Background color of the two sides.
Different font properties, for horizontal separators.
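The split/update/remove rule can be illustrated along a single axis. In this sketch, blocks and separators are both assumed to be `(start, end)` intervals (for example, the vertical extents of blocks when looking for horizontal separators). Dropping separators that touch the page border is an added assumption, and separator weights are not modeled.

```python
def detect_separators(page_start, page_end, blocks):
    """Return the gaps between blocks along one axis.

    Starts with one separator spanning the whole page and refines it
    block by block:
      - block contained in a separator -> split the separator in two
      - block crosses a separator edge -> update (shrink) the separator
      - block covers a separator       -> remove the separator
    """
    separators = [(page_start, page_end)]
    for b_start, b_end in blocks:
        updated = []
        for s_start, s_end in separators:
            if b_end <= s_start or b_start >= s_end:
                updated.append((s_start, s_end))    # no interaction
            elif b_start <= s_start and b_end >= s_end:
                continue                            # covered: remove
            elif b_start > s_start and b_end < s_end:
                updated.append((s_start, b_start))  # contained: split
                updated.append((b_end, s_end))
            elif b_start <= s_start:
                updated.append((b_end, s_end))      # crosses start edge
            else:
                updated.append((s_start, b_start))  # crosses end edge
        separators = updated
    # Assumption: separators touching the page border are discarded.
    return [s for s in separators if s[0] > page_start and s[1] < page_end]
```

For a page spanning 0 to 100 with blocks at `(5, 20)` and `(30, 95)`, the only separator left is the gap `(20, 30)` between the two blocks.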

Visual Separator Detection
Six blocks are put in the pool and five separators are detected. S23 and S45 get a higher weight (different font).

Content Structure Construction
Construction starts from the separators with the lowest weight: the blocks beside them are merged, and merging continues until the separators with maximum weight are met. Each leaf node is then checked to see whether it meets the granularity requirement. If not, go back to the Visual Block Extraction step to construct a sub content structure within that node. In the example, the first iteration chooses the first, third and fifth separators to form VB_2_2_1, VB_2_2_2, and CB_2_2_3, and so on.
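The bottom-up merging can be sketched as follows, assuming the blocks are given in visual order together with one numeric weight per separator between neighbors. The nested-list output and the function name are illustrative choices, and the granularity check on leaf nodes is left out.

```python
def build_content_structure(blocks, weights):
    """Merge neighboring blocks, lowest-weight separators first.

    blocks:  n block labels in visual order
    weights: n - 1 separator weights, weights[i] lies between
             blocks[i] and blocks[i + 1]
    Returns nested lists; merging stops once only separators of
    maximum weight remain between the top-level nodes.
    """
    nodes = list(blocks)
    seps = list(weights)
    while seps and min(seps) < max(seps):
        low = min(seps)
        new_nodes, new_seps = [], []
        group = [nodes[0]]
        for sep, node in zip(seps, nodes[1:]):
            if sep == low:
                group.append(node)  # merge across a lowest-weight separator
            else:
                new_nodes.append(group if len(group) > 1 else group[0])
                new_seps.append(sep)
                group = [node]
        new_nodes.append(group if len(group) > 1 else group[0])
        nodes, seps = new_nodes, new_seps
    return nodes
```

With six blocks and weights `[1, 2, 1, 2, 1]` (the second and fourth separators weigh more, as in the example slide), the first iteration merges across the first, third and fifth separators and yields three pairs.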

Three steps: block extraction, separator detection and content structure construction.

Experiment Setup
Four page segmentation methods are evaluated:
FixedPS: window length set to 200 words.
DomPS: traverse the DOM tree looking for structural tags. If there are no more structural tags within the current structural tag, a block is constructed.
VIPS: the permitted degree of coherence is set to 0.6.
CombPS: in the second step, the window length is set to 200 words.
A full-document approach (FullDoc), in which no segmentation is performed, is also implemented for comparison. The block retrieval experiment verifies whether page segmentation helps to deal with the length normalization and multiple-topic problems.

Experiment Setup
The query expansion experiment tests whether page segmentation can benefit the selection of query terms. The experiments are based on the Web Tracks of TREC 2001 and TREC 2002. Okapi is chosen as the IR system, with BM2500 as the weighting function. Precision at … is the main evaluation metric; average precision (AvP) is also evaluated for TREC 2001, since that track is closer to ad-hoc retrieval.

Block Retrieval
The experiments are conducted in three steps: initial retrieval, page segmentation, and block retrieval. The initial retrieval gives the document rank (DR); pages can then be re-ranked based on the single best-ranked block within each page (BR). The final rank of each page combines DR and BR, weighted by a parameter α. Table 3 shows the results. FullDoc is not listed since it is the baseline. The last column shows the results of combining block and document rank, with α set to the optimal value for each method. The dependency between precision and α is illustrated in Figure 4.
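The re-ranking step can be sketched as below. The combination formula is not preserved in this transcript, so the linear interpolation `alpha * DR + (1 - alpha) * BR` over rank positions is an assumption made for illustration, as are the function names and the data layout.

```python
def page_ranks_from_blocks(ranked_blocks):
    """Derive a page ranking (BR) from a ranked list of blocks.

    ranked_blocks: list of (page_id, block_id), best block first.
    A page takes the position of its single best-ranked block.
    """
    ranks = {}
    for page, _block in ranked_blocks:
        if page not in ranks:
            ranks[page] = len(ranks) + 1
    return ranks


def combine_ranks(doc_rank, block_rank, alpha):
    """Re-rank pages by an assumed linear mix of DR and BR.

    doc_rank, block_rank: dict page_id -> rank position (1 is best).
    alpha = 1 keeps the document ranking; alpha = 0 uses blocks only.
    Ties are broken by the document rank.
    """
    combined = {
        page: alpha * doc_rank[page] + (1 - alpha) * block_rank[page]
        for page in doc_rank
    }
    return sorted(combined, key=lambda p: (combined[p], doc_rank[p]))
```

Sweeping `alpha` from 0 to 1 over such a function is one way to produce a curve like the one described for Figure 4.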

Block Retrieval
With BR only, DomPS performs worst, and in TREC 2002 no method exceeds the baseline. With BR+DR, all four methods improve significantly and all beat the baseline, which shows the effect of rank combination, similar to passage retrieval. DomPS is still the worst; VIPS and CombPS are still better, and the methods compare similarly to the non-combined case.

Block Retrieval
In Figure 4, the winner for either dataset shows a consistent improvement over the other methods. For TREC 2001, CombPS wins for almost every combination; for TREC 2002, CombPS shares rather similar trends when α > 0.4.

Block Retrieval
DomPS is always the worst, partly because the blocks it produces are too fine-grained and cannot be mapped to a single semantic part. FixedPS performs very well in AvP, which confirms that varying length is still an important factor in web IR. FixedPS gives way to VIPS and CombPS when precision at … is the main concern, partly because it lacks semantic partitioning and fails to recognize the best semantic blocks. FixedPS and VIPS have different advantages and should be selected for different purposes. By combining VIPS and FixedPS, CombPS aims for a tradeoff and achieves very good and stable results (the best or very close to the best).

Query Expansion
After the block ranks are obtained, the fourth and fifth steps are executed:
Expansion term selection: all terms in the selected blocks, except the original query terms, are weighted according to the term selection value TSV = w(1) * r / R, where R is the number of selected blocks and r is the number of selected blocks that contain the term. The top 10 terms are selected.
Final retrieval: for original query terms, the new weight is tf * 3; for an expansion term it is 1 - (n-1)/m, where n is the term's rank by TSV and m is the number of expansion terms, i.e., 10 in our experiments.
Figure 5 shows the results for different numbers of blocks, Table 4 lists the best value for each method, and Figure 6 shows the AvP comparison for TREC 2001.
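The term selection step can be sketched as follows. The w(1) weights are taken as a given input (a dict from term to weight) rather than computed, blocks are assumed to be token lists, and reading n as the position of a term in the TSV ordering is an interpretation of the slide, not something the transcript states unambiguously.

```python
def select_expansion_terms(blocks, query_terms, w1, top_k=10):
    """Pick expansion terms from the selected top-ranked blocks.

    blocks:      list of token lists, one per selected block
    query_terms: original query terms (excluded from selection)
    w1:          dict term -> w(1) weight, assumed precomputed
    Returns [(term, weight)] with weight = 1 - (n - 1) / m, where n is
    the assumed TSV rank (1 = highest) and m the number of terms kept.
    """
    if not blocks:
        return []
    R = len(blocks)
    query = set(query_terms)
    r = {}  # number of selected blocks containing each candidate term
    for tokens in blocks:
        for term in set(tokens) - query:
            r[term] = r.get(term, 0) + 1
    tsv = {term: w1.get(term, 0.0) * count / R for term, count in r.items()}
    # Sort by descending TSV; break ties alphabetically for determinism.
    ranked = sorted(tsv, key=lambda t: (-tsv[t], t))[:top_k]
    m = len(ranked)
    return [(term, 1 - (n - 1) / m) for n, term in enumerate(ranked, start=1)]
```

With ten expansion terms, the weights run from 1.0 for the top term down to 0.1 for the tenth.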

Query Expansion
DomPS is still unstable and sometimes even worse than the baseline. VIPS and FixedPS are similar, except that VIPS does better in AvP, and CombPS is always the best. Since TREC 2002 targets topic distillation, query expansion seems to bring little improvement over the baseline: although CombPS wins, its improvement is not significant.

Query Expansion
Since the baseline is very low, the top-ranked documents are actually irrelevant, so FullDoc obtains low results in all experiments. DomPS shows no significant improvement, partly because its segmentation is too fine-grained (the average block length is 540 bytes) and a block usually does not provide complete information. VIPS considers more visual information and is more likely to obtain a semantic partition of a web page. VIPS tends to reach its best performance with a small number of blocks, which means that the top blocks are of very good quality. FixedPS also achieves good performance: in some cases it can deal with "badly" presented pages that VIPS cannot. Because it gives no priority to short blocks, FixedPS is very stable.

Conclusion
We verified that page segmentation can significantly improve IR by dealing with the multiple-topic and mixed-length problems of web pages. By integrating semantic and fixed-length properties, we can overcome both problems and achieve the best performance.