Slide 1: Fusion-based Approach to Web Search Optimization
Kiduk Yang, Ning Yu
WIDIT Laboratory, SLIS, Indiana University
AIRS 2005
Slide 2: Outline
- Introduction
- WIDIT in TREC Web Track
- Results & Discussion
Slide 3: Introduction
- Web IR challenges
  - Size, heterogeneity, and quality of data
  - Diversity of user tasks, interests, and characteristics
- Opportunities
  - Diverse sources of evidence
  - Data abundance
- WIDIT approach to Web IR
  - Leverage multiple sources of evidence
  - Utilize multiple methods
  - Apply fusion
- Research questions
  - What to combine? How to combine?
Slide 4: WIDIT in Web Track 2004
- Data
  - Documents: 1.25 million .gov Web pages (18 GB)
  - Topics (i.e., queries): 75 Topic Distillation (TD), 75 Home Page (HP), 75 Named Page (NP)
- Task
  - Retrieve relevant documents given a mixed set of query types (QT)
- Main strategy
  - Fusion of multiple data representations: static tuning (QT-independent)
  - Fusion of multiple sources of evidence: dynamic tuning (QT-specific)
Slide 5: WIDIT Web IR System Architecture
[System diagram: documents feed the Indexing Module, which builds body, anchor, and header sub-indexes; topics are turned into simple and expanded queries; the Retrieval Module searches the sub-indexes with those queries; the Fusion Module combines the search results under static tuning; the Query Classification Module assigns query types; and the Re-ranking Module applies QT-specific dynamic tuning to the fusion result to produce the final result.]
Slide 6: WIDIT Indexing Module
- Document indexing
  1. Strip HTML tags
     - Extract title, meta keywords & description, and emphasized words
     - Parse out hyperlinks (URLs & anchor texts)
  2. Create surrogate documents (see the sketch after this slide)
     - Anchor texts of inlinks
     - Header texts (title, meta text, emphasized text)
  3. Create subcollection indexes
     - Stop & stem (Simple and Combo stemmers)
     - Compute SMART & Okapi term weights
  4. Compute whole-collection term statistics
- Query indexing
  1. Stop & stem
  2. Identify nouns and phrases
  3. Expand acronyms
  4. Mine synonyms and definitions from Web search
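Below is a minimal sketch of step 2 of document indexing: building a surrogate document for each page from the anchor texts of its inlinks. The function name and the "outlinks" field are illustrative assumptions, not WIDIT code.

```python
from collections import defaultdict

def build_anchor_surrogates(pages):
    """Collect the anchor texts of each page's inlinks into one
    surrogate document per target URL (illustrative sketch)."""
    surrogates = defaultdict(list)
    for page in pages:
        for target, anchor in page["outlinks"]:
            if anchor.strip():                     # skip empty anchors
                surrogates[target].append(anchor.strip())
    # The concatenated anchor text is indexed as a separate "anchor" sub-index.
    return {url: " ".join(texts) for url, texts in surrogates.items()}

pages = [
    {"url": "a.gov", "outlinks": [("b.gov", "EPA home page")]},
    {"url": "c.gov", "outlinks": [("b.gov", "Environmental Protection Agency")]},
]
print(build_anchor_surrogates(pages)["b.gov"])
```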
Slide 7: WIDIT Retrieval Module
1. Parallel searching
   - Multiple document indexes
     - Body text (title, body)
     - Anchor text (title, inlink anchor text)
     - Header text (title, meta keywords & description, first heading, emphasized words)
   - Multiple query formulations
     - Stemming (Simple, Combo)
     - Expanded queries (acronym, noun)
   - Multiple subcollections
     - For search speed and scalability
     - Search each subcollection using whole-collection term statistics
2. Merge subcollection search results
   - Merge & sort by document score (see the sketch after this slide)
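A minimal sketch of step 2, merging per-subcollection result lists into one ranking. Because each subcollection is searched with whole-collection term statistics, the scores are directly comparable and a plain merge-and-sort suffices; the names here are assumptions.

```python
import heapq

def merge_subcollection_results(result_lists, k=1000):
    """Merge ranked (score, doc_id) lists, each already sorted by
    descending score, into one list sorted by document score (sketch)."""
    merged = heapq.merge(*result_lists, key=lambda pair: -pair[0])
    return list(merged)[:k]

print(merge_subcollection_results(
    [[(2.1, "d1"), (0.9, "d4")], [(1.5, "d7")]], k=3))
```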
Slide 8: WIDIT Fusion Module
- Fusion formula: weighted sum
  - FS_ws = Σ_i (w_i * NS_i)
  - w_i = weight of system i (relative contribution of each system)
  - NS_i = normalized score of a document by system i = (S_i - S_min) / (S_max - S_min)
- Select candidate systems to combine
  - Top performers in each category, e.g., best stemmer, query expansion, document index
  - Diverse systems, e.g., content-based, link-based
  - One-time brute-force combinations to validate the complementary-strength effect
- Determine system weights (w_i) via static tuning (see the sketch after this slide)
  - Evaluate fusion formulas over a fixed grid of weight values (e.g., 0.1 to 1.0) on training data
  - Select the formulas with the best performance
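A runnable sketch of the weighted-sum fusion formula above, with min-max score normalization. The run contents and weight values are made up for illustration; the weight pattern echoes the F3 run on slide 13.

```python
def min_max_normalize(scores):
    """Normalize a {doc_id: raw_score} map: NS = (S - S_min) / (S_max - S_min)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                               # degenerate case: all scores equal
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """FS_ws = sum_i w_i * NS_i, summed over the systems that retrieved the doc."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, ns in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: -kv[1])

anchor = {"d1": 3.2, "d2": 1.1, "d3": 0.5}
body   = {"d1": 0.4, "d3": 0.9}
header = {"d2": 7.0, "d3": 2.5}
print(weighted_sum_fusion([anchor, body, header], [0.4, 0.3, 0.3]))
```

Static tuning then amounts to sweeping the weight vector over the fixed grid and keeping the combination that scores best on the training topics.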
Slide 9: WIDIT Query Classification Module
- Statistical classification (SC)
  - Classifiers: Naïve Bayes, SVM
  - Training data: titles of 2003 topics (50 TD, 150 HP, 150 NP), with and without stemming (Combo stemmer)
  - Training data enrichment for the TD class: added top-level Yahoo Government category labels
- Linguistic classification (LC)
  - Word cues: create HP and NP lexicons
  - Ad-hoc heuristics, e.g., HP if the query ends in all caps, NP if it contains a year (YYYY), TD if it is a short topic
- Combination: more ad-hoc heuristics (see the sketch after this slide)
  - If strong word cue → LC
  - Else if single word → TD
  - Else → SC
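A hedged sketch of the combination heuristic above. The lexicons are abbreviated from slide 20, the year regex stands in for the "contains YYYY" cue, and statistical_classifier is a stand-in for the trained Naïve Bayes or SVM model.

```python
import re

HP_LEXICON = {"office", "bureau", "department", "agency", "commission"}  # abbreviated
NP_LEXICON = {"about", "annual", "report", "guide", "history"}           # abbreviated

def classify_query(query, statistical_classifier):
    """Combined linguistic + statistical query-type classifier (sketch)."""
    words = query.lower().split()
    # Strong linguistic cues take precedence.
    if re.search(r"\b(19|20)\d{2}\b", query) or any(w in NP_LEXICON for w in words):
        return "NP"
    if query.isupper() or any(w in HP_LEXICON for w in words):
        return "HP"
    # Single-word queries default to topic distillation.
    if len(words) == 1:
        return "TD"
    # Otherwise fall back to the statistical classifier.
    return statistical_classifier(query)

print(classify_query("EPA", lambda q: "TD"))                      # -> HP (all caps cue)
print(classify_query("wireless communications", lambda q: "TD"))  # -> TD (via SC)
```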
Slide 10: WIDIT Re-ranking Module
- Re-ranking features
  - Field-specific match: query words, acronyms, and phrases in URL, title, header, and anchor text
  - Exact match: title, header text, anchor text, body text
  - Indegree & outdegree
  - URL type (root, subroot, path, file), based on URL ending and slash count
  - Page type (HPP, HP, NPP, NP, ??), based on word cues & heuristics
- Re-ranking formula
  - Weighted sum of re-ranking features (see the sketch after this slide)
- Dynamic tuning
  - Dynamic/interactive optimization of the QT-specific re-ranking formula
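A sketch of the re-ranking formula as a weighted sum of features. The feature names, the additive combination with the fusion score, and the weight values are all assumptions for illustration; the QT-specific weights are exactly what the dynamic tuning interface (next slide) adjusts interactively.

```python
def rerank_score(fusion_score, features, weights):
    """Weighted sum of re-ranking features added to the fusion score (sketch)."""
    bonus = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return fusion_score + bonus

doc_features = {"exact_match": 1.0, "indegree": 0.7, "url_root": 1.0}
hp_weights   = {"exact_match": 0.5, "indegree": 0.3, "url_root": 0.4}  # assumed HP weights
print(rerank_score(0.62, doc_features, hp_weights))
```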
Slide 11: WIDIT Dynamic Tuning Interface
[Screenshot of the interactive dynamic tuning interface]
Slide 12: Dynamic Tuning: Observations
- Effective re-ranking factors
  - HP: indegree, outdegree, exact match, URL/page type (minimum outdegree = 1)
  - NP: indegree, outdegree, URL type (roughly 1/3 the impact seen for HP)
  - TD: acronym, outdegree, URL type (minimum outdegree = 10)
- Strengths
  - Combines human intelligence (pattern recognition) with the computational power of the machine
  - Good for tuning systems with many parameters
  - Facilitates failure analysis
- Weaknesses
  - Over-tuning
  - Sensitive to initial results & re-ranking parameter selection
Slide 13: Results
- Run descriptions
  - Best fusion run: F3 = 0.4*A + 0.3*F1 + 0.3*F2, where F1 = 0.8*B + 0.05*A + 0.15*H (A = anchor, B = body, H = header)
  - Dynamic re-ranking runs (DR_o) with official QT
- Observations
  - Dynamic tuning works well: significant improvement over the baseline (TD, HP)
  - NP re-ranking needs to be optimized: relatively small improvement from re-ranking

                  MAP (TD)          MRR (NP)         MRR (HP)
  DR_o            0.1349 (+38.5%)   0.6545 (+6.7%)   0.6265 (+47.2%)
  F3 (baseline)   0.0974            0.6134           0.4256
  TREC Median     0.1010            0.5888           0.5838
Slide 14: Discussion: Web IR Methods
- What worked?
  - Fusion: combining multiple sources of evidence (MSE)
  - Dynamic tuning: helps multi-parameter tuning & failure analysis
- What next?
  - Expanded MSE: mining Web server and search engine logs
  - Enhanced re-ranking
    - Feature selection & scoring
    - Modified PageRank/HITS: link-noise reduction based on page layout
  - Streamlined fusion optimization
Slide 15: Discussion: Fusion Optimization
- Conventional fusion optimization approaches
  - Exhaustive parameter combination
    - Step-wise search of the whole solution space
    - Computationally demanding when the number of parameters is large
  - Parameter combination based on past evidence
    - Targeted search of a restricted solution space, i.e., parameter ranges estimated from training data
- Next-generation fusion optimization approaches
  - Non-linear transformation of re-ranking feature scores, e.g., log transformation to compensate for the power-law distribution of PageRank (see the sketch after this slide)
  - Hybrid fusion optimization
    - Semi-automatic dynamic tuning
    - Automatic fusion optimization by category
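A short sketch of the log-transformation idea: because PageRank-like scores follow a power law, a handful of hub pages would dominate a plain min-max normalization, so the scores are log-transformed before normalization and fusion.

```python
import math

def log_normalize(scores):
    """Log-transform, then min-max normalize, a {doc_id: score} map (sketch)."""
    logged = {doc: math.log(s) for doc, s in scores.items() if s > 0}
    lo, hi = min(logged.values()), max(logged.values())
    if hi == lo:
        return {doc: 1.0 for doc in logged}
    return {doc: (v - lo) / (hi - lo) for doc, v in logged.items()}

# Three documents spanning five orders of magnitude of PageRank.
print(log_normalize({"d1": 1e-7, "d2": 1e-4, "d3": 0.02}))
```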
Slide 16: Automatic Fusion Optimization
[Flowchart: result sets for different categories (category 1: top 10 systems; category 2: top system for each query length; ... category n) are fetched from a results pool; automatic fusion optimization is run on each, and the optimized fusion formula is adopted only if the performance gain exceeds a threshold.]
Slide 17: Resources
- WIDIT (Web Information Discovery Integrated Tool) Lab
  - http://widit.slis.indiana.edu/
  - http://elvis.slis.indiana.edu/
- Dynamic Tuning Interface (TREC Web track)
  - http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/wdf3oks0a.htm
- Dynamic Tuning Interface (TREC HARD track)
  - http://elvis.slis.indiana.edu/TREC/hard/results/test/postsub0/wdf3oks0a.htm

Thank you! Questions?
Slide 18: SMART Term Weights
- Length-normalized term weights (formula images not preserved; see the reconstruction after this slide)
  - SMART lnu weight for document terms
  - SMART ltc weight for query terms
  - where f_ik = number of times term k appears in document i; idf_k = inverse document frequency of term k; t = number of terms in the document/query
- Document score
  - Inner product of document and query vectors
  - where q_k = weight of term k in the query; d_ik = weight of term k in document i; t = number of terms common to the query & document
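The formula images on this slide did not survive extraction. The following is a hedged LaTeX reconstruction using the standard published SMART forms that match the variable definitions above; lnu is pivoted unique-term normalization (Singhal et al.), where s, p, and u_i are the slope, pivot, and number of unique terms, none of which the slide lists. Read it as the standard formulation, not a verbatim copy of the slide.

```latex
% ltc query-term weight: log tf times idf, cosine-normalized
q_k = \frac{(1 + \log f_{qk}) \cdot idf_k}
           {\sqrt{\sum_{j=1}^{t} \left((1 + \log f_{qj}) \cdot idf_j\right)^2}}

% lnu document-term weight: log tf with pivoted unique-term normalization
d_{ik} = \frac{(1 + \log f_{ik}) \,/\, (1 + \log \bar{f}_i)}
              {(1 - s) \cdot p + s \cdot u_i}

% document score: inner product over the t terms common to query and document
\mathit{score}(Q, D_i) = \sum_{k=1}^{t} q_k \cdot d_{ik}
```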
Slide 19: Okapi Term Weights
- Document term weight (simplified formula), query term weight, and document ranking (formula images not preserved; see the reconstruction after this slide)
  - where Q = query containing terms T; K = k1*((1-b) + b*(doc_length/avg_doc_length)); tf = term frequency in a document; qtf = term frequency in the query; k1, b, k3 = parameters (1.2, 0.75, 7..1000); w_RS = Robertson-Sparck Jones weight; N = total number of documents in the collection; n = number of documents in which the term occurs; R = total number of relevant documents in the collection; r = number of relevant documents in which the term occurs
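The Okapi formula images are likewise missing; this is the standard Okapi BM25 formulation (Robertson et al.) consistent with the definitions above, offered as a reconstruction rather than a copy of the slide.

```latex
% Robertson-Sparck Jones relevance weight
w^{RS} = \log \frac{(r + 0.5) \,/\, (R - r + 0.5)}
                   {(n - r + 0.5) \,/\, (N - n - R + r + 0.5)}

% document ranking: sum over query terms T in Q
\mathit{score}(Q, D) = \sum_{T \in Q}
    w^{RS} \cdot \frac{(k_1 + 1) \, tf}{K + tf}
           \cdot \frac{(k_3 + 1) \, qtf}{k_3 + qtf}

% where K = k_1 \left((1 - b) + b \cdot \frac{doc\_length}{avg\_doc\_length}\right)
```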
Slide 20: Webpage Type Identification
- URL type (Tomlinson, 2003; Kraaij et al., 2002); see the sketch after this slide
  - root: slash_cnt = 0, or (HP_end & slash_cnt = 1)
  - subroot: HP_end & slash_cnt = 2
  - path: HP_end & slash_cnt >= 3
  - file: everything else
  - (HP_end = 1 if the URL ends with index.htm, default.htm, /, etc.)
- Page type heuristic
  - if "welcome" or "home" in title, header, or anchor text → HPP
  - else if a year (YYYY) in title or anchor → NPP
  - else if NP lexicon word → NP
  - else if HP lexicon word → HP
  - else if ends in all caps → HP
  - else → ??
- NP lexicon: about, annual, report, guide, studies, history, new, how
- HP lexicon: office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
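A runnable sketch of the URL-type heuristic; the list of home-page endings is abbreviated, and counting slashes over the scheme-stripped URL is an assumption about how slash_cnt is computed.

```python
def url_type(url):
    """Classify a URL as root / subroot / path / file per the slide's heuristic."""
    path = url.split("://", 1)[-1]            # drop the scheme if present
    hp_end = path.endswith(("/", "index.htm", "index.html",
                            "default.htm", "default.html"))
    slash_cnt = path.count("/")
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"

for u in ["www.epa.gov", "www.epa.gov/index.htm",
          "www.epa.gov/oar/", "www.epa.gov/oar/report.pdf"]:
    print(u, "->", url_type(u))
```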