Fusion-based Approach to Web Search Optimization. Kiduk Yang, Ning Yu. WIDIT Laboratory, SLIS, Indiana University. AIRS 2005.


Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
AIRS 2005

2 OUTLINE
- Introduction
- WIDIT in TREC Web Track
- Results & Discussion

3 Introduction
- Web IR challenges
  - Size, heterogeneity, and quality of data
  - Diversity of user tasks, interests, and characteristics
- Web IR opportunities
  - Diverse sources of evidence
  - Data abundance
- WIDIT approach to Web IR
  - Leverage multiple sources of evidence
  - Utilize multiple methods
  - Apply fusion
- Research questions
  - What to combine?
  - How to combine?

4 WIDIT in Web Track 2004
- Data
  - Documents: 1.25 million .gov Web pages (18 GB)
  - Topics (i.e., queries): 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
- Task
  - Retrieve relevant documents given a mixed set of query types (QT)
- Main strategy
  - Fusion of multiple data representations: static tuning (QT-independent)
  - Fusion of multiple sources of evidence: dynamic tuning (QT-specific)

5 WIDIT: Web IR System Architecture
[Architecture diagram: the Indexing Module builds sub-indexes (body, anchor, header) from the documents; topics are turned into simple and expanded queries; the Retrieval Module searches the sub-indexes; the Fusion Module combines the search results via static tuning into a fusion result; the Query Classification Module assigns query types; and the Re-ranking Module applies QT-specific dynamic tuning to produce the final result.]

6 WIDIT: Indexing Module
- Document indexing
  1. Strip HTML tags: extract title, meta keywords & description, and emphasized words; parse out hyperlinks (URLs & anchor texts); see the sketch below
  2. Create surrogate documents: anchor texts of inlinks; header texts (title, meta text, emphasized text)
  3. Create subcollection indexes: stop & stem (Simple, Combo stemmers); compute SMART & Okapi term weights
  4. Compute whole-collection term statistics
- Query indexing
  1. Stop & stem
  2. Identify nouns and phrases
  3. Expand acronyms
  4. Mine synonyms and definitions from Web search results
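A minimal sketch of the tag-stripping step, using Python's standard html.parser; the FieldExtractor class and its field choices are illustrative, not WIDIT's actual parser. It pulls out the title, meta keywords/description, and hyperlinks with their anchor texts, which feed the surrogate documents of step 2.

```python
# Hypothetical helper for step 1 of document indexing: strip tags and
# collect the fields the slide names (title, meta keywords/description,
# hyperlink URLs with anchor texts).
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title, self.meta, self.links = "", {}, []
        self._in_title = False
        self._href = None                 # URL of the currently open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a":
            self._href = attrs.get("href")
            self.links.append([self._href, ""])
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None and self.links:
            self.links[-1][1] += data     # accumulate anchor text

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._href = None

parser = FieldExtractor()
parser.feed('<title>EPA Home</title><a href="/air">Air quality</a>')
print(parser.title, parser.links)         # EPA Home [['/air', 'Air quality']]
```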

7 WIDIT: Retrieval Module
1. Parallel searching
   - Multiple document indexes: body text (title, body); anchor text (title, inlink anchor text); header text (title, meta keywords & description, first heading, emphasized words)
   - Multiple query formulations: stemming (Simple, Combo); expanded queries (acronyms, nouns)
   - Multiple subcollections: for search speed and scalability; search each subcollection using whole-collection term statistics
2. Merging subcollection search results
   - Merge & sort by document score, as sketched below
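A minimal sketch of step 2 under an assumed data layout (not WIDIT's actual code): because every subcollection is scored with whole-collection term statistics, the per-subcollection scores are directly comparable and a plain merge-and-sort suffices.

```python
# Merge (doc_id, score) result lists from several subcollections and
# rank by the comparable document scores.
def merge_results(subcollection_results):
    merged = [hit for results in subcollection_results for hit in results]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

ranked = merge_results([
    [("doc12", 4.1), ("doc7", 2.3)],   # subcollection 1
    [("doc99", 3.8), ("doc4", 1.0)],   # subcollection 2
])
print(ranked[:3])  # [('doc12', 4.1), ('doc99', 3.8), ('doc7', 2.3)]
```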

8 WIDIT: Fusion Module
- Fusion formula: weighted sum
  FS_ws = Σ (w_i * NS_i)
  where w_i = weight of system i (relative contribution of each system)
  and NS_i = normalized score of a document by system i = (S_i - S_min) / (S_max - S_min)
- Select candidate systems to combine
  - Top performers in each category (e.g., best stemmer, query expansion, document index)
  - Diverse systems (e.g., content-based, link-based)
  - One-time brute-force combinations for validation (complementary strength effect)
- Determine system weights (w_i) by static tuning
  - Evaluate fusion formulas over a fixed set of weight values on training data
  - Select the formulas with the best performance
A sketch of the weighted-sum fusion follows.
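A minimal sketch of the weighted-sum fusion FS_ws = Σ(w_i * NS_i) with min-max normalization, exactly as defined on the slide; the system names and weights in the example are illustrative, not the tuned WIDIT values.

```python
def normalize(scores):
    """Min-max normalize {doc_id: raw_score} to [0, 1]: NS = (S - Smin)/(Smax - Smin)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0            # guard against a constant score run
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(runs, weights):
    """runs: {system: {doc_id: raw_score}}; weights: {system: w_i}."""
    fused = {}
    for system, scores in runs.items():
        for doc, ns in normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[system] * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

runs = {"body":   {"d1": 12.0, "d2": 7.5, "d3": 3.0},
        "anchor": {"d2": 0.9,  "d4": 0.4}}
print(fuse(runs, {"body": 0.6, "anchor": 0.4}))
```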

9 WIDIT: Query Classification Module
- Statistical classification (SC)
  - Classifiers: Naïve Bayes, SVM
  - Training data: titles of the 2003 topics (50 TD, 150 HP, 150 NP), with and without stemming (Combo stemmer)
  - Training-data enrichment for the TD class: added top-level Yahoo Government category labels
- Linguistic classification (LC)
  - Word cues: HP and NP lexicons
  - Ad-hoc heuristics: e.g., HP if the query ends in all capitals, NP if it contains a year (YYYY), TD if it is a short topic
- Combination via further ad-hoc heuristics, sketched below
  - If there is a strong word cue, use LC; else if the query is a single word, label it TD; else use SC
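A minimal sketch of the LC/SC combination heuristic from the slide; linguistic_classify and statistical_classify are hypothetical stand-ins for WIDIT's actual LC and SC components.

```python
# Trust the linguistic classifier on strong word cues, default
# single-word queries to TD, and fall back to the statistical classifier.
def classify_query(query, linguistic_classify, statistical_classify):
    label, strong_cue = linguistic_classify(query)
    if strong_cue:                        # e.g. lexicon hit, YYYY, all caps
        return label                      # linguistic classification wins
    if len(query.split()) == 1:
        return "TD"                       # single-word queries default to TD
    return statistical_classify(query)    # Naive Bayes / SVM fallback

print(classify_query("IRS", lambda q: ("HP", q.isupper()),
                     lambda q: "TD"))     # -> HP (all-caps cue)
```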

10 WIDIT: Re-ranking Module
- Re-ranking features
  - Field-specific match: query words, acronyms, and phrases in URL, title, header, and anchor text
  - Exact match: title, header text, anchor text, body text
  - Indegree & outdegree
  - URL type (root, subroot, path, file), based on the URL ending and slash count
  - Page type (HPP, HP, NPP, NP, ??), based on word cues & heuristics
- Re-ranking formula
  - Weighted sum of re-ranking features (see the sketch below)
- Dynamic tuning
  - Dynamic, interactive optimization of the QT-specific re-ranking formula
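A minimal sketch of the re-ranking formula, a weighted sum of feature scores added to each document's fusion score; the feature names and weights here are illustrative, since the real QT-specific weights came out of dynamic tuning.

```python
def rerank(results, features, weights):
    """results: [(doc_id, fusion_score)]; features: {doc_id: {feature: value}}."""
    rescored = [
        (doc, score + sum(weights[f] * v
                          for f, v in features.get(doc, {}).items()))
        for doc, score in results
    ]
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

weights = {"exact_title_match": 2.0, "indegree": 0.5, "url_root": 1.5}
features = {"d1": {"exact_title_match": 1, "url_root": 1},
            "d2": {"indegree": 12}}
print(rerank([("d1", 0.7), ("d2", 0.9)], features, weights))
```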

11 WIDIT: Dynamic Tuning Interface
[Screenshot of the dynamic tuning interface for the Web track]

12 Dynamic Tuning: Observations
- Effective re-ranking factors
  - HP: indegree, outdegree, exact match, URL/page type; minimum outdegree = 1
  - NP: indegree, outdegree, URL type (about 1/3 of the impact observed for HP)
  - TD: acronym, outdegree, URL type; minimum outdegree = 10
- Strengths
  - Combines human intelligence (pattern recognition) with the computational power of the machine
  - Well suited to tuning systems with many parameters
  - Facilitates failure analysis
- Weaknesses
  - Prone to over-tuning
  - Sensitive to the initial results and to the choice of re-ranking parameters

13 Results
- Run descriptions
  - Best fusion run: F3 = 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + …*A + …*H (A = anchor, B = body, H = header)
  - Dynamic re-ranking runs (DR_o) using the official QT classifications
- Results (DR_o improvements are relative to the F3 baseline):

                 MAP (TD)    MRR (NP)    MRR (HP)
  DR_o           (+38.5%)    (+6.7%)     (+47.2%)
  F3 (baseline)  …           …           …
  TREC Median    …           …           …

- Observations
  - Dynamic tuning works well: significant improvement over the baseline for TD and HP
  - NP re-ranking still needs optimization: relatively small improvement from re-ranking

14 Discussion: Web IR Methods
- What worked?
  - Fusion: combining multiple sources of evidence (MSE)
  - Dynamic tuning: helps with multi-parameter tuning & failure analysis
- What next?
  - Expanded MSE: mining Web server and search engine logs
  - Enhanced re-ranking: feature selection & scoring; modified PageRank/HITS; link-noise reduction based on page layout
  - Streamlined fusion optimization

15 Discussion: Fusion Optimization
- Conventional fusion optimization approaches
  - Exhaustive parameter combination: step-wise search of the whole solution space; computationally demanding when the number of parameters is large (see the sketch below)
  - Parameter combination based on past evidence: targeted search of a restricted solution space, i.e., parameter ranges estimated from training data
- Next-generation fusion optimization approaches
  - Non-linear transformation of re-ranking feature scores, e.g., a log transformation to compensate for the power-law distribution of PageRank
  - Hybrid fusion optimization: semi-automatic dynamic tuning; automatic fusion optimization by category
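A minimal sketch of the exhaustive approach: sweep fusion-weight combinations over a grid and keep the best-scoring one. It reuses the fuse() function sketched under the Fusion Module slide; evaluate() is a hypothetical stand-in for a training-set metric such as MAP.

```python
from itertools import product

def grid_search(runs, evaluate, step=0.1):
    """Exhaustively search fusion weights on a grid, keeping the best."""
    systems = list(runs)
    grid = [round(step * i, 10) for i in range(int(1 / step) + 1)]
    best_weights, best_score = None, float("-inf")
    # Cost grows as O(len(grid) ** len(systems)), which is why the slide
    # calls the whole-solution-space search computationally demanding.
    for combo in product(grid, repeat=len(systems)):
        if abs(sum(combo) - 1.0) > 1e-9:   # keep weights summing to 1
            continue
        weights = dict(zip(systems, combo))
        score = evaluate(fuse(runs, weights))
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```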

16 Automatic Fusion Optimization
[Flowchart: result sets for different categories (Category 1 … Category n, e.g., the top 10 systems, the top system for each query length) are fetched from a results pool and passed to automatic fusion optimization; the process iterates while the performance gain exceeds a threshold, and outputs the optimized fusion formula once it does not.]

17 Resources
- WIDIT (Web Information Discovery Integrated Tool) Lab
- Dynamic Tuning Interface (Web track): WIDIT → projects → TREC → Web track
- Dynamic Tuning Interface (HARD track): WIDIT → projects → TREC → HARD track
Thank you! Questions?

18 SMART Weights
- Length-normalized term weights (see the formulas below)
  - SMART lnu weight for document terms
  - SMART ltc weight for query terms
  where f_ik = number of times term k appears in document i, idf_k = inverse document frequency of term k, and t = number of terms in the document/query
- Document score: the inner product of the document and query vectors
  where q_k = weight of term k in the query, d_ik = weight of term k in document i, and t = number of terms common to the query & document
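The slide's own formula images did not survive the transcript. For reference, here are standard formulations of the ltc and lnu weights (following Singhal et al.'s pivoted normalization) that are consistent with the definitions above; WIDIT's exact variant may differ.

```latex
% ltc query term weight: log-scaled tf times idf, cosine-normalized
q_k = \frac{(1 + \log f_k)\, idf_k}
           {\sqrt{\sum_{j=1}^{t} \left((1 + \log f_j)\, idf_j\right)^2}}

% lnu document term weight: log-scaled tf with pivoted unique
% normalization, where \bar{f}_i is the mean term frequency in
% document i, u_i its number of unique terms, and s (slope),
% p (pivot) are tuning constants
d_{ik} = \frac{(1 + \log f_{ik}) \,/\, (1 + \log \bar{f}_i)}
              {(1 - s)\, p + s\, u_i}

% document score: inner product of query and document vectors
S(q, d_i) = \sum_{k=1}^{t} q_k\, d_{ik}
```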

19 Okapi Weights
- Document term weight (simplified formula)
- Query term weight
- Document ranking (see the formulas below)
  where Q = query containing terms T
  K = k1 * ((1 - b) + b * (doc_length / avg_doc_length))
  tf = term frequency in a document
  qtf = term frequency in the query
  k1, b, k3 = parameters (1.2, 0.75, …)
  w_RS = Robertson-Sparck Jones weight
  N = total number of documents in the collection
  n = number of documents in which the term occurs
  R = total number of relevant documents in the collection
  r = number of relevant documents in which the term occurs
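The original formula images did not survive the transcript. Below is the standard Okapi BM25 formulation consistent with the definitions on the slide; WIDIT's simplified variant may differ in detail.

```latex
% Robertson-Sparck Jones relevance weight
w_{RS} = \log \frac{(r + 0.5) \,/\, (R - r + 0.5)}
                   {(n - r + 0.5) \,/\, (N - n - R + r + 0.5)}

% Document ranking: sum over the query terms T in Q, with
% K = k_1 \left((1 - b) + b \cdot \frac{doc\_length}{avg\_doc\_length}\right)
score(Q, d) = \sum_{T \in Q} w_{RS} \cdot
              \frac{(k_1 + 1)\, tf}{K + tf} \cdot
              \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
```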

20 Webpage Type Identification
- URL type (Tomlinson, 2003; Kraaij et al., 2002), sketched in code below
  - Heuristic (HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.):
    - root: slash_cnt = 0, or HP_end & slash_cnt = 1
    - subroot: HP_end & slash_cnt = 2
    - path: HP_end & slash_cnt >= 3
    - file: everything else
- Page type
  - Heuristic:
    - if "welcome" or "home" appears in the title, header, or anchor text → HPP
    - else if a year (YYYY) appears in the title or anchor → NPP
    - else if an NP lexicon word appears → NP
    - else if an HP lexicon word appears → HP
    - else if it ends in all capitals → HP
    - else → ??
  - NP lexicon: about, annual, report, guide, studies, history, new, how
  - HP lexicon: office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
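A minimal sketch of the URL-type heuristic above. The slash-counting convention and the exact list of home-page endings are assumptions (the slide only says "index.htm, default.htm, /, etc."), so treat this as illustrative rather than WIDIT's exact rule.

```python
from urllib.parse import urlparse

# Assumed expansion of the slide's "index.htm, default.htm, /, etc."
HP_ENDINGS = ("/", "index.htm", "index.html", "default.htm", "default.html")

def url_type(url):
    """Classify a URL as root / subroot / path / file per the slide heuristic."""
    path = urlparse(url).path
    hp_end = path == "" or path.endswith(HP_ENDINGS)
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

for u in ("http://epa.gov/", "http://epa.gov/air/",
          "http://epa.gov/air/ozone/index.html", "http://epa.gov/air/ozone.pdf"):
    print(u, "->", url_type(u))
# -> root, subroot, path, file respectively
```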