1
IR Theory: Web Information Retrieval
2
Fusion IR
Web IR
3
Evolution of IR: Phase I
[Diagram: evolution from brute-force search, where the user works directly with raw data, to library practice (collection development, quality control, classification, controlled vocabulary, bibliographic records) that produces organized/filtered data for browsing, to searching of organized/filtered data through an intermediary and metadata]
4
Evolution of IR: Phase II
- IR system: automatic indexing and pattern matching (user, computer, inverted index, raw data)
- Move from metadata to content-based search
- IR research goal: rank the documents by their relevance to a given query
- Approach: query-document similarity, with term weights based on term-occurrence statistics (query, document, term index, ranked list of matches)
- Controlled and restricted experiments with small, homogeneous, and high-quality data
5
Evolution of IR: Phase III
- World Wide Web: a massive, uncontrolled, heterogeneous, and dynamic environment
- Content-based Web search engines: Web crawler + basic IR technology; matching of query terms to document terms
- Web directories: browsing/searching of an organized Web; manual cataloging of a Web subset
- Content- and link-based Web search engines: pattern matching + link analysis; renewed interest in metadata and classification approaches
- Digital libraries? Integrated (content, link, metadata) information discovery
6
Fusion IR: Overview
Goal
- To achieve a whole that is greater than the sum of its parts
Approaches
- Tag team: use the method best suited to a given situation (single method, single set of results)
- Integration: use a combined method that integrates multiple methods (combined method, single set of results)
- Merging: merge the results of multiple methods (multiple methods, multiple sets of results; see the sketch below)
- Meta-fusion: all of the above
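To make the "Merging" approach concrete, here is a minimal sketch of one common way to merge the result sets of several methods. The CombMNZ-style combination rule (sum of scores times the number of methods that retrieved the document) and the function name are illustrative assumptions, not something this slide specifies.

    # Merge ranked result sets from multiple retrieval methods (CombMNZ-style).
    def merge_results(result_sets):
        """result_sets: list of {doc_id: score} dicts, one per retrieval method."""
        merged = {}
        for scores in result_sets:
            for doc, s in scores.items():
                total, hits = merged.get(doc, (0.0, 0))
                merged[doc] = (total + s, hits + 1)
        # CombMNZ: reward documents retrieved by more than one method
        ranked = [(doc, total * hits) for doc, (total, hits) in merged.items()]
        return sorted(ranked, key=lambda x: x[1], reverse=True)

    # Example with made-up scores from two methods:
    print(merge_results([{"d1": 0.9, "d2": 0.4}, {"d2": 0.8, "d3": 0.7}]))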
7
Fusion IR: Research Areas
- Data fusion: combining multiple sources of evidence (single collection, multiple representations, single IR method)
- Collection fusion: merging the results of multiple-collection search (multiple collections, single representation, single IR method)
- Method fusion: combining multiple IR methods (single collection, single representation, multiple IR methods)
- Paradigm fusion: combining content analysis, link analysis, and classification; integrating user, system, and data
8
Fusion IR: Research Findings
Findings from content-based IR experiments with small, homogeneous document collections:
- Different IR systems retrieve different sets of documents
- Documents retrieved by multiple systems are more likely to be relevant
- Combining different systems is likely to be more beneficial than combining similar systems
- Fusion is good for IR
Is fusion a viable approach for Web IR?
9
Web Fusion IR: Motivation
Web search has become a daily information-access mechanism:
- 79% of Web users access the Internet daily, and 85% of Web users use search engines to find information (GVU Center at Georgia Tech; Kehoe et al., 1999)
- 670M Web searches per day, 250M of them on Google (Search Engine Watch, 2/2003)
- 2 billion Internet users in 2010 (444% growth from 2000)
New challenges
- Data: massive, dynamic, heterogeneous, noisy
- Users: diverse, "transitory"
New opportunities
- Multiple sources of evidence: content, hyperlinks, document structure, user data, taxonomies
- Data abundance/redundancy
Review: Yang (2005). Information retrieval on the Web. ARIST, Vol. 39.
10
Link Analysis: PageRank
PageRank score R(p)
- Propagation of R(p_i) through the inlinks of the entire Web
- T = total number of pages on the Web; d = damping factor; p_i = a page linking to p (inlink); C(p_i) = outdegree of p_i
- Start with all R(p_i) = 1 and repeat the computation until convergence
- A global measure of a page, based on link analysis only
Interpretation
- Models the behavior of a random Web surfer
- A probability distribution/weighting function that estimates the likelihood of arriving at page p by link traversal and random jump (d)
- Importance/quality/popularity of a Web page: a link signifies a recommendation/citation; recommendations are aggregated recursively over the entire Web, each weighted by the recommender's own importance and normalized by its outdegree
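The PageRank formula itself appears to have been a graphic on the original slide. A standard form consistent with the definitions listed above (a reconstruction, written with 1 - d as the random-jump probability) is:

    \[
    R(p) \;=\; \frac{1-d}{T} \;+\; d \sum_{p_i \in \mathrm{inlinks}(p)} \frac{R(p_i)}{C(p_i)}
    \]

Iterating this update from the uniform starting values R(p_i) = 1 converges to a fixed point, as described on the slide.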
11
Link Analysis: HITS (Hyperlink-Induced Topic Search)
- Considers both inlinks and outlinks: estimates the value of a page from the aggregate value of its inlinks and outlinks
- Identifies "authority" and "hub" pages: an authority is a page pointed to by many good hubs; a hub is a page pointing to many good authorities
- A query-dependent measure: hub and authority scores are assigned for each query and computed from a small subset of the Web (e.g., the top N retrieval results)
- Premise: the Web contains mutually reinforcing communities of hubs and authorities on broad topics
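The mutual reinforcement described above is usually written as a pair of update rules, applied iteratively and followed by normalization (the slide gives no explicit formula; this is the standard Kleinberg formulation):

    \[
    a(p) \;=\; \sum_{q \rightarrow p} h(q), \qquad h(p) \;=\; \sum_{p \rightarrow q} a(q)
    \]

Good hubs raise the authority scores of the pages they point to, and good authorities raise the hub scores of the pages that point to them.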
12
Link Analysis: Modified HITS
HITS-based ranking
- Expand a set of text-based search results: root set S = top N documents (e.g., N = 200); add inlinks and outlinks of S (1 or 2 hops), with at most k inlinks per document (e.g., k = 50); delete intra-host links and stoplisted URLs
- Compute hub and authority scores: iterative algorithm with fractional weights for links by the same author
- Rank documents by authority/hub scores
13
Modified HITS: Scoring Algorithm
1. Initialize all h(p) and a(p) to 1.
2. Recompute h(p) and a(p) with fractional link weights that normalize the contribution of authorship (assumption: host = author, so each page's contribution is divided by the number of pages from the same host):
   a(p) = Σ h(q) · auth_wt(q, p), where q ranges over pages linking to p and auth_wt(q, p) = 1/m if q's host has m documents linking to p
   h(p) = Σ a(q) · hub_wt(p, q), where q ranges over pages linked from p and hub_wt(p, q) = 1/n if q's host has n documents linked from p
3. Normalize the scores: divide each score by the square root of the sum of squared scores, so that Σ a(p)² = Σ h(p)² = 1.
4. Repeat steps 2 and 3 until the scores stabilize (maximum 200 iterations; typical convergence in 10 to 50 iterations for 5,000 Web pages; Bharat & Henzinger report 150 iterations).

Kleinberg, J. (1997). Authoritative sources in a hyperlinked environment. Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms.
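The iteration above can be sketched in code as follows. The data structures (inlink, outlink, and host maps) and the function name are assumptions made for illustration; this is not Bharat and Henzinger's actual implementation, only the steps as described on this slide, with the host used as a proxy for the author.

    from collections import defaultdict
    from math import sqrt

    def modified_hits(pages, inlinks, outlinks, host, max_iter=200, tol=1e-6):
        """pages: iterable of page ids; inlinks/outlinks: {page: [page, ...]};
        host: {page: host name}. All linked pages are assumed to be in `pages`."""
        h = {p: 1.0 for p in pages}   # step 1: initialize hub scores to 1
        a = {p: 1.0 for p in pages}   # step 1: initialize authority scores to 1

        for _ in range(max_iter):
            # step 2a: a(p) = sum of h(q) * auth_wt(q, p) over pages q linking to p,
            # with auth_wt(q, p) = 1/m when q's host has m documents linking to p
            new_a = {}
            for p in pages:
                m = defaultdict(int)
                for q in inlinks.get(p, []):
                    m[host[q]] += 1
                new_a[p] = sum(h[q] / m[host[q]] for q in inlinks.get(p, []))

            # step 2b: h(p) = sum of a(q) * hub_wt(p, q) over pages q linked from p,
            # with hub_wt(p, q) = 1/n when q's host has n documents linked from p
            new_h = {}
            for p in pages:
                n = defaultdict(int)
                for q in outlinks.get(p, []):
                    n[host[q]] += 1
                new_h[p] = sum(new_a[q] / n[host[q]] for q in outlinks.get(p, []))

            # step 3: normalize so the sum of squared scores is 1
            norm_a = sqrt(sum(v * v for v in new_a.values())) or 1.0
            norm_h = sqrt(sum(v * v for v in new_h.values())) or 1.0
            new_a = {p: v / norm_a for p, v in new_a.items()}
            new_h = {p: v / norm_h for p, v in new_h.items()}

            # step 4: repeat until the scores stabilize
            delta = max(abs(new_a[p] - a[p]) + abs(new_h[p] - h[p]) for p in pages)
            a, h = new_a, new_h
            if delta < tol:
                break

        return a, h

With one page per host the fractional weights are all 1 and this reduces to plain HITS; they only matter when several pages on one host link to the same target, or one page links to several pages on the same host.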
14
Modified HITS: Link Weighting
[Diagram: example link neighborhood of page p and pages q1-q4, illustrating the fractional link weights]
a(p) = h(q1) + h(q2) + h(q3) + h(q4)/5
h(p) = a(q1) + a(q2) + a(q3) + a(q4)/6
15
WIDIT: Web IR System Overview
- Mine multiple sources of evidence (MSE): document content, document structure, link information, URL information
- Execute parallel searches: multiple document representations (body text, anchor text, header text) and multiple query formulations (query expansion)
- Combine the parallel search results: static tuning of the fusion formula (query-type independent; sketched below)
- Identify query types (QT): combination classifier
- Rerank the fusion result with MSE: compute reranking feature scores; dynamic tuning of the reranking formulas (QT-specific)
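A minimal sketch of the "combine the parallel search results" step, assuming a statically tuned weighted linear fusion over the per-representation runs. The run names, weights, and min-max normalization used here are illustrative assumptions, not WIDIT's published fusion formula.

    def fuse_runs(run_scores, weights):
        """run_scores: {run_name: {doc_id: score}}; weights: {run_name: float}
        (the weights are assumed to come from static tuning on training topics)."""
        fused = {}
        for run, scores in run_scores.items():
            lo, hi = min(scores.values()), max(scores.values())
            span = (hi - lo) or 1.0            # guard against a constant-score run
            for doc, s in scores.items():
                # min-max normalize within the run, then apply the run's weight
                fused[doc] = fused.get(doc, 0.0) + weights[run] * (s - lo) / span
        return sorted(fused.items(), key=lambda x: x[1], reverse=True)

    # Example with made-up scores from the body/anchor/header runs:
    runs = {
        "body":   {"d1": 12.0, "d2": 7.5, "d3": 3.1},
        "anchor": {"d1": 2.0,  "d3": 5.5},
        "header": {"d2": 1.0,  "d3": 0.7},
    }
    print(fuse_runs(runs, {"body": 0.6, "anchor": 0.3, "header": 0.1}))

Reranking with the remaining sources of evidence (link, URL, and structure features) would then adjust this fused list per query type.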
16
WIDIT: Web IR System Architecture
[Architecture diagram: documents pass through the indexing module into body, anchor, and header sub-indexes; topics become simple and expanded queries for the retrieval module; search results flow through the fusion module (static tuning), the query classification module (query types), and the re-ranking module (dynamic tuning) to produce the final result]
17
WIDIT: Dynamic Tuning Interface
18
SMART: Length-Normalized Term Weights and Document Score
- SMART lnu weight for document terms and SMART ltc weight for query terms, where: f_ik = number of times term k appears in document i; idf_k = inverse document frequency of term k; t = number of terms in the document/query
- Document score: inner product of the document and query vectors, where: q_k = weight of term k in the query; d_ik = weight of term k in document i; t = number of terms common to the query and the document
- Web documents vary widely in length; in the length-normalized term weights, the denominator is a document-length normalization factor that keeps short documents from being penalized

Buckley, C., Salton, G., Allan, J., & Singhal, A. (1995). Automatic query expansion using SMART: TREC 3. In D. K. Harman (Ed.), The Third Text REtrieval Conference (TREC-3) (NIST Spec. Publ., pp. 1-19). Washington, DC: U.S. Government Printing Office.
Buckley, C., Singhal, A., Mitra, M., & Salton, G. (1996). New retrieval approaches using SMART: TREC 4. In D. K. Harman (Ed.), The Fourth Text REtrieval Conference (TREC-4) (NIST Spec. Publ.). Washington, DC: U.S. Government Printing Office.
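The weight formulas were images on the original slide. The standard SMART definitions they refer to, reconstructed here from the variable list (treat the exact lnu normalization constants as assumptions), are:

    \[
    \mathrm{score}(d_i, q) \;=\; \sum_{k=1}^{t} q_k \, d_{ik}
    \]
    \[
    \text{ltc (query):}\quad q_k \;=\; \frac{(1 + \log f_{qk})\,\mathrm{idf}_k}
        {\sqrt{\sum_{j} \big((1 + \log f_{qj})\,\mathrm{idf}_j\big)^2}}
    \qquad
    \text{lnu (document):}\quad d_{ik} \;=\;
        \frac{(1 + \log f_{ik}) \,/\, (1 + \log \bar{f}_i)}
             {(1 - s)\,\rho + s\,u_i}
    \]

Here \bar{f}_i is the average term frequency in document i, u_i the number of unique terms in document i, and s and \rho the slope and pivot of the length normalization; the lnu denominator is the document-length normalization factor mentioned above.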
19
Okapi Document Ranking
Document term weight (simplified formula) and query term weight, where:
- Q = a query containing terms T
- K = k1 · ((1 - b) + b · (doc_length / avg_doc_length))
- tf = term frequency in the document; qtf = term frequency in the query
- k1, b, k3 = parameters (1.2, 0.75, …)
- w^RS = Robertson-Sparck Jones weight
- N = total number of documents in the collection; n = number of documents in which the term occurs
- R = total number of relevant documents in the collection; r = number of relevant documents in which the term occurs
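The Okapi formulas were likewise images; the usual BM25 form matching the definitions above (a reconstruction) is:

    \[
    \mathrm{score}(d, Q) \;=\; \sum_{T \in Q}
        w^{RS} \cdot \frac{(k_1 + 1)\,\mathit{tf}}{K + \mathit{tf}}
               \cdot \frac{(k_3 + 1)\,\mathit{qtf}}{k_3 + \mathit{qtf}},
    \qquad
    K = k_1\big((1 - b) + b \cdot \mathit{dl}/\mathit{avdl}\big)
    \]
    \[
    w^{RS} \;=\; \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}
                           {(n - r + 0.5)\,(R - r + 0.5)}
    \]

With no relevance information (R = r = 0), w^{RS} reduces to an idf-like weight, \log\big((N - n + 0.5)/(n + 0.5)\big).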