Slide 1: Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
AIRS 2005
Slide 2: Outline
- Introduction
- WIDIT in TREC Web Track
- Results & Discussion
Slide 3: Introduction
- Web IR challenges
  - Size, heterogeneity, and quality of data
  - Diversity of user tasks, interests, and characteristics
- Opportunities
  - Diverse sources of evidence
  - Data abundance
- WIDIT approach to Web IR
  - Leverage multiple sources of evidence
  - Utilize multiple methods
  - Apply fusion
- Research questions
  - What to combine? How to combine?
Slide 4: WIDIT in Web Track 2004
- Data
  - Documents: 1.25 million .gov Web pages (18 GB)
  - Topics (i.e., queries): 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
- Task
  - Retrieve relevant documents given a mixed set of query types (QT)
- Main strategy
  - Fusion of multiple data representations: static tuning (QT-independent)
  - Fusion of multiple sources of evidence: dynamic tuning (QT-specific)
Slide 5: WIDIT Web IR System Architecture
[System diagram: documents feed the Indexing Module, which builds body, anchor, and header sub-indexes; topics are turned into simple and expanded queries; the Retrieval Module searches the sub-indexes with those queries; the Fusion Module combines the search results under static tuning; the Query Classification Module assigns query types; and the Re-ranking Module applies QT-specific dynamic tuning to the fusion result to produce the final result.]
Slide 6: WIDIT Indexing Module
- Document indexing
  1. Strip HTML tags
     - Extract title, meta keywords & description, and emphasized words
     - Parse out hyperlinks (URLs & anchor texts)
  2. Create surrogate documents (see the sketch after this slide)
     - Anchor texts of inlinks
     - Header texts (title, meta text, emphasized text)
  3. Create subcollection indexes
     - Stop & stem (Simple and Combo stemmers)
     - Compute SMART & Okapi term weights
  4. Compute whole-collection term statistics
- Query indexing
  1. Stop & stem
  2. Identify nouns and phrases
  3. Expand acronyms
  4. Mine synonyms and definitions from Web search
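Below is a minimal sketch of step 2 of document indexing: building a surrogate document for each page from the anchor texts of its inlinks. The function name and the "outlinks" field are illustrative assumptions, not WIDIT code.

```python
from collections import defaultdict

def build_anchor_surrogates(pages):
    """Collect the anchor texts of each page's inlinks into one
    surrogate document per target URL (illustrative sketch)."""
    surrogates = defaultdict(list)
    for page in pages:
        for target, anchor in page["outlinks"]:
            if anchor.strip():                     # skip empty anchors
                surrogates[target].append(anchor.strip())
    # The concatenated anchor text is indexed as a separate "anchor" sub-index.
    return {url: " ".join(texts) for url, texts in surrogates.items()}

pages = [
    {"url": "a.gov", "outlinks": [("b.gov", "EPA home page")]},
    {"url": "c.gov", "outlinks": [("b.gov", "Environmental Protection Agency")]},
]
print(build_anchor_surrogates(pages)["b.gov"])
```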
Slide 7: WIDIT Retrieval Module
1. Parallel searching
   - Multiple document indexes
     - Body text (title, body)
     - Anchor text (title, inlink anchor text)
     - Header text (title, meta keywords & description, first heading, emphasized words)
   - Multiple query formulations
     - Stemming (Simple, Combo)
     - Expanded queries (acronym, noun)
   - Multiple subcollections
     - For search speed and scalability
     - Search each subcollection using whole-collection term statistics
2. Merge subcollection search results
   - Merge & sort by document score (see the sketch after this slide)
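A minimal sketch of step 2, merging per-subcollection result lists into one ranking. Because each subcollection is searched with whole-collection term statistics, the scores are directly comparable and a plain merge-and-sort suffices; the names here are assumptions.

```python
import heapq

def merge_subcollection_results(result_lists, k=1000):
    """Merge ranked (score, doc_id) lists, each already sorted by
    descending score, into one list sorted by document score (sketch)."""
    merged = heapq.merge(*result_lists, key=lambda pair: -pair[0])
    return list(merged)[:k]

print(merge_subcollection_results(
    [[(2.1, "d1"), (0.9, "d4")], [(1.5, "d7")]], k=3))
```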
Slide 8: WIDIT Fusion Module
- Fusion formula: weighted sum
  - FS_ws = Σ_i (w_i * NS_i)
  - w_i = weight of system i (relative contribution of each system)
  - NS_i = normalized score of a document by system i = (S_i - S_min) / (S_max - S_min)
- Select candidate systems to combine
  - Top performers in each category, e.g., best stemmer, query expansion, document index
  - Diverse systems, e.g., content-based, link-based
  - One-time brute-force combinations to validate the complementary-strength effect
- Determine system weights (w_i) via static tuning (see the sketch after this slide)
  - Evaluate fusion formulas over a fixed grid of weight values (e.g., 0.1 to 1.0) on training data
  - Select the formulas with the best performance
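A runnable sketch of the weighted-sum fusion formula above, with min-max score normalization. The run contents and weight values are made up for illustration; the weight pattern echoes the F3 run on slide 13.

```python
def min_max_normalize(scores):
    """Normalize a {doc_id: raw_score} map: NS = (S - S_min) / (S_max - S_min)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                               # degenerate case: all scores equal
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """FS_ws = sum_i w_i * NS_i, summed over the systems that retrieved the doc."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, ns in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: -kv[1])

anchor = {"d1": 3.2, "d2": 1.1, "d3": 0.5}
body   = {"d1": 0.4, "d3": 0.9}
header = {"d2": 7.0, "d3": 2.5}
print(weighted_sum_fusion([anchor, body, header], [0.4, 0.3, 0.3]))
```

Static tuning then amounts to sweeping the weight vector over the fixed grid and keeping the combination that scores best on the training topics.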
Slide 9: WIDIT Query Classification Module
- Statistical classification (SC)
  - Classifiers: Naïve Bayes, SVM
  - Training data: titles of 2003 topics (50 TD, 150 HP, 150 NP), with and without stemming (Combo stemmer)
  - Training data enrichment for the TD class: added top-level Yahoo Government category labels
- Linguistic classification (LC)
  - Word cues: create HP and NP lexicons
  - Ad-hoc heuristics, e.g., HP if the query ends in all caps, NP if it contains a year (YYYY), TD if it is a short topic
- Combination: more ad-hoc heuristics (see the sketch after this slide)
  - If strong word cue → LC
  - Else if single word → TD
  - Else → SC
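A hedged sketch of the combination heuristic above. The lexicons are abbreviated from slide 20, the year regex stands in for the "contains YYYY" cue, and statistical_classifier is a stand-in for the trained Naïve Bayes or SVM model.

```python
import re

HP_LEXICON = {"office", "bureau", "department", "agency", "commission"}  # abbreviated
NP_LEXICON = {"about", "annual", "report", "guide", "history"}           # abbreviated

def classify_query(query, statistical_classifier):
    """Combined linguistic + statistical query-type classifier (sketch)."""
    words = query.lower().split()
    # Strong linguistic cues take precedence.
    if re.search(r"\b(19|20)\d{2}\b", query) or any(w in NP_LEXICON for w in words):
        return "NP"
    if query.isupper() or any(w in HP_LEXICON for w in words):
        return "HP"
    # Single-word queries default to topic distillation.
    if len(words) == 1:
        return "TD"
    # Otherwise fall back to the statistical classifier.
    return statistical_classifier(query)

print(classify_query("EPA", lambda q: "TD"))                      # -> HP (all caps cue)
print(classify_query("wireless communications", lambda q: "TD"))  # -> TD (via SC)
```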
Slide 10: WIDIT Re-ranking Module
- Re-ranking features
  - Field-specific match: query words, acronyms, and phrases in URL, title, header, and anchor text
  - Exact match: title, header text, anchor text, body text
  - Indegree & outdegree
  - URL type (root, subroot, path, file), based on URL ending and slash count
  - Page type (HPP, HP, NPP, NP, ??), based on word cues & heuristics
- Re-ranking formula
  - Weighted sum of re-ranking features (see the sketch after this slide)
- Dynamic tuning
  - Dynamic/interactive optimization of the QT-specific re-ranking formula
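A sketch of the re-ranking formula as a weighted sum of features. The feature names, the additive combination with the fusion score, and the weight values are all assumptions for illustration; the QT-specific weights are exactly what the dynamic tuning interface (next slide) adjusts interactively.

```python
def rerank_score(fusion_score, features, weights):
    """Weighted sum of re-ranking features added to the fusion score (sketch)."""
    bonus = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return fusion_score + bonus

doc_features = {"exact_match": 1.0, "indegree": 0.7, "url_root": 1.0}
hp_weights   = {"exact_match": 0.5, "indegree": 0.3, "url_root": 0.4}  # assumed HP weights
print(rerank_score(0.62, doc_features, hp_weights))
```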
Slide 11: WIDIT Dynamic Tuning Interface
[Screenshot of the interactive dynamic tuning interface]
Slide 12: Dynamic Tuning: Observations
- Effective re-ranking factors
  - HP: indegree, outdegree, exact match, URL/page type (minimum outdegree = 1)
  - NP: indegree, outdegree, URL type (roughly 1/3 the impact seen for HP)
  - TD: acronym, outdegree, URL type (minimum outdegree = 10)
- Strengths
  - Combines human intelligence (pattern recognition) with the computational power of the machine
  - Good for tuning systems with many parameters
  - Facilitates failure analysis
- Weaknesses
  - Over-tuning
  - Sensitive to initial results & re-ranking parameter selection
Slide 13: Results
- Run descriptions
  - Best fusion run: F3 = 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)
  - Dynamic re-ranking runs (DR_o) with official QT
- Observations
  - Dynamic tuning works well: significant improvement over the baseline (TD, HP)
  - NP re-ranking needs to be optimized: relatively small improvement from re-ranking

                  MAP (TD)          MRR (NP)         MRR (HP)
  DR_o            0.1349 (+38.5%)   0.6545 (+6.7%)   0.6265 (+47.2%)
  F3 (baseline)   0.0974            0.6134           0.4256
  TREC Median     0.1010            0.5888           0.5838
Slide 14: Discussion: Web IR Methods
- What worked?
  - Fusion: combining multiple sources of evidence (MSE)
  - Dynamic tuning: helps multi-parameter tuning & failure analysis
- What next?
  - Expanded MSE: mining Web server and search engine logs
  - Enhanced re-ranking
    - Feature selection & scoring
    - Modified PageRank/HITS: link-noise reduction based on page layout
  - Streamlined fusion optimization
Slide 15: Discussion: Fusion Optimization
- Conventional fusion optimization approaches
  - Exhaustive parameter combination
    - Step-wise search of the whole solution space
    - Computationally demanding when the number of parameters is large
  - Parameter combination based on past evidence
    - Targeted search of a restricted solution space, i.e., parameter ranges estimated from training data
- Next-generation fusion optimization approaches
  - Non-linear transformation of re-ranking feature scores, e.g., log transformation to compensate for the power-law distribution of PageRank (see the sketch after this slide)
  - Hybrid fusion optimization
    - Semi-automatic dynamic tuning
    - Automatic fusion optimization by category
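A short sketch of the log-transformation idea: because PageRank-like scores follow a power law, a handful of hub pages would dominate a plain min-max normalization, so the scores are log-transformed before normalization and fusion.

```python
import math

def log_normalize(scores):
    """Log-transform, then min-max normalize, a {doc_id: score} map (sketch)."""
    logged = {doc: math.log(s) for doc, s in scores.items() if s > 0}
    lo, hi = min(logged.values()), max(logged.values())
    if hi == lo:
        return {doc: 1.0 for doc in logged}
    return {doc: (v - lo) / (hi - lo) for doc, v in logged.items()}

# Three documents spanning five orders of magnitude of PageRank.
print(log_normalize({"d1": 1e-7, "d2": 1e-4, "d3": 0.02}))
```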
Slide 16: Automatic Fusion Optimization
[Flowchart: result sets for different categories (category 1: top 10 systems; category 2: top system for each query length; ... category n) are fetched from a results pool; automatic fusion optimization is run on each, and the optimized fusion formula is adopted only if the performance gain exceeds a threshold.]
Slide 17: Resources
- WIDIT (Web Information Discovery Integrated Tool) Lab
  - http://widit.slis.indiana.edu/
  - http://elvis.slis.indiana.edu/
- Dynamic Tuning Interface (TREC Web track)
  - http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
- Dynamic Tuning Interface (TREC HARD track)
  - http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm

Thank you! Questions?
Slide 18: SMART Term Weights
- Length-normalized term weights (formula images not preserved; see the reconstruction after this slide)
  - SMART lnu weight for document terms
  - SMART ltc weight for query terms
  - where f_ik = number of times term k appears in document i; idf_k = inverse document frequency of term k; t = number of terms in the document/query
- Document score
  - Inner product of document and query vectors
  - where q_k = weight of term k in the query; d_ik = weight of term k in document i; t = number of terms common to the query & document
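The formula images on this slide did not survive extraction. The following is a hedged LaTeX reconstruction using the standard published SMART forms that match the variable definitions above; lnu is pivoted unique-term normalization (Singhal et al.), where s, p, and u_i are the slope, pivot, and number of unique terms, none of which the slide lists. Read it as the standard formulation, not a verbatim copy of the slide.

```latex
% ltc query-term weight: log tf times idf, cosine-normalized
q_k = \frac{(1 + \log f_{qk}) \cdot idf_k}
           {\sqrt{\sum_{j=1}^{t} \left((1 + \log f_{qj}) \cdot idf_j\right)^2}}

% lnu document-term weight: log tf with pivoted unique-term normalization
d_{ik} = \frac{(1 + \log f_{ik}) \,/\, (1 + \log \bar{f}_i)}
              {(1 - s) \cdot p + s \cdot u_i}

% document score: inner product over the t terms common to query and document
\mathit{score}(Q, D_i) = \sum_{k=1}^{t} q_k \cdot d_{ik}
```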
Slide 19: Okapi Term Weights
- Document term weight (simplified formula), query term weight, and document ranking (formula images not preserved; see the reconstruction after this slide)
  - where Q = query containing terms T; K = k1*((1-b) + b*(doc_length/avg_doc_length)); tf = term frequency in a document; qtf = term frequency in the query; k1, b, k3 = parameters (1.2, 0.75, 7..1000); w_RS = Robertson-Sparck Jones weight; N = total number of documents in the collection; n = number of documents in which the term occurs; R = total number of relevant documents in the collection; r = number of relevant documents in which the term occurs
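The Okapi formula images are likewise missing; this is the standard Okapi BM25 formulation (Robertson et al.) consistent with the definitions above, offered as a reconstruction rather than a copy of the slide.

```latex
% Robertson-Sparck Jones relevance weight
w^{RS} = \log \frac{(r + 0.5) \,/\, (R - r + 0.5)}
                   {(n - r + 0.5) \,/\, (N - n - R + r + 0.5)}

% document ranking: sum over query terms T in Q
\mathit{score}(Q, D) = \sum_{T \in Q}
    w^{RS} \cdot \frac{(k_1 + 1) \, tf}{K + tf}
           \cdot \frac{(k_3 + 1) \, qtf}{k_3 + qtf}

% where K = k_1 \left((1 - b) + b \cdot \frac{doc\_length}{avg\_doc\_length}\right)
```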
Slide 20: Webpage Type Identification
- URL type (Tomlinson, 2003; Kraaij et al., 2002); see the sketch after this slide
  - root: slash_cnt = 0, or (HP_end & slash_cnt = 1)
  - subroot: HP_end & slash_cnt = 2
  - path: HP_end & slash_cnt >= 3
  - file: everything else
  - (HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
- Page type heuristic
  - if "welcome" or "home" in title, header, or anchor text → HPP
  - else if a year (YYYY) in title or anchor → NPP
  - else if NP lexicon word → NP
  - else if HP lexicon word → HP
  - else if ends in all caps → HP
  - else → ??
- NP lexicon: about, annual, report, guide, studies, history, new, how
- HP lexicon: office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
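A runnable sketch of the URL-type heuristic; the list of home-page endings is abbreviated, and counting slashes over the scheme-stripped URL is an assumption about how slash_cnt is computed.

```python
def url_type(url):
    """Classify a URL as root / subroot / path / file per the slide's heuristic."""
    path = url.split("://", 1)[-1]            # drop the scheme if present
    hp_end = path.endswith(("/", "index.htm", "index.html",
                            "default.htm", "default.html"))
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

for u in ["www.epa.gov", "www.epa.gov/index.htm",
          "www.epa.gov/oar/", "www.epa.gov/oar/report.pdf"]:
    print(u, "->", url_type(u))
```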