IR Theory: Web Information Retrieval

Presentation transcript:

IR Theory: Web Information Retrieval

Web IR. Fusion IR. Search Engine 2

Evolution of IR: Phase I
- Brute-force Search
  - User -> Raw Data
- Library
  - Collection Development
  - Quality Control
  - Classification
  - Controlled Vocabulary
  - Bibliographical Records
- Browsing
  - User -> Organized/Filtered Data
- Searching
  - User -> Intermediary -> Metadata -> Organized/Filtered Data

Evolution of IR: Phase II
- IR System
  - Automatic Indexing
  - Pattern Matching
  - User -> Computer -> Inverted Index -> Raw Data
  - Move from metadata to content-based search
- IR Research
  - Goal: rank the documents by their relevance to a given query
  - Approach: Query-Document Similarity
    - Term weights based on term occurrence statistics
    - Query -> Document Term Index -> Ranked list of matches
  - Controlled and restricted experiments with small, homogeneous, and high-quality data

Evolution of IR: Phase III
- World Wide Web
  - Massive, uncontrolled, heterogeneous, and dynamic environment
- Content-based Web Search Engines
  - Web Crawler + Basic IR technology
  - Matching of query terms to document terms
- Web Directories
  - Browse/Search of Organized Web
  - Manual cataloging of Web subset
- Content- & Link-based Web Search Engines
  - Pattern Matching + Link Analysis
- Renewed interest in metadata and classification approach
  - Digital Libraries?
  - Integrated (Content, Link, Metadata) Information Discovery

Fusion IR: Overview
- Goal
  - To achieve a whole that is greater than the sum of its parts
- Approaches
  - Tag team
    - Use the method best suited to a given situation
    - Single method, single set of results
  - Integration
    - Use a combined method that integrates multiple methods
    - Combined method, single set of results
  - Merging
    - Merge the results of multiple methods
    - Multiple methods, multiple sets of results
  - Meta-fusion
    - All of the above

Fusion IR: Research Areas
- Data Fusion
  - Combining multiple sources of evidence
  - Single collection, multiple representations, single IR method
- Collection Fusion
  - Merging the results of multiple collection search
  - Multiple collections, single representation, single IR method
- Method Fusion
  - Combining multiple IR methods
  - Single collection, single representation, multiple IR methods
- Paradigm Fusion
  - Combining content analysis, link analysis, and classification
  - Integrating user, system, and data

Fusion IR: Research Findings
- Findings from content-based IR experiments with small, homogeneous document collections
  - Different IR systems retrieve different sets of documents
  - Documents retrieved by multiple systems are more likely to be relevant
  - Combining different systems is likely to be more beneficial than combining similar systems
- Fusion is good for IR
- Is fusion a viable approach for Web IR?
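The finding that documents retrieved by multiple systems are more likely to be relevant can be illustrated with a small result-merging sketch. The slides do not name a specific fusion formula, so the rule used here is an assumption chosen for illustration: the sum of min-max normalized scores multiplied by the number of systems that retrieved the document (commonly called CombMNZ). All function names and the toy runs are hypothetical.

```python
from collections import defaultdict

def min_max_normalize(run):
    """Scale one system's scores to [0, 1] so that runs are comparable."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def comb_mnz(runs):
    """Merge several non-empty {doc_id: score} runs into one ranked list.

    Assumed rule (CombMNZ): fused score = sum of normalized scores
    multiplied by the number of runs that retrieved the document.
    """
    total = defaultdict(float)
    hits = defaultdict(int)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            total[doc] += score
            hits[doc] += 1
    fused = {doc: total[doc] * hits[doc] for doc in total}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy runs from two hypothetical systems (e.g. body-text and anchor-text search).
body_run = {"d1": 2.0, "d2": 1.0, "d3": 0.5}
anchor_run = {"d2": 0.9, "d4": 0.8, "d1": 0.1}
ranking = comb_mnz([body_run, anchor_run])
```

In this toy case `d2` moves to the top because both systems rank it reasonably well, even though neither `d1` nor `d4` is hurt by the other system's absence of evidence.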

Web Fusion IR: Motivation
- Motivation
  - Web search has become a daily information access mechanism
    - 2.4 billion Internet users (532% growth from 2000) (InternetWorldStats.com, 2012)
    - 96% of Web users access the Internet daily (2012)
    - 91% of Web users use search engines to find information (Pew Internet, 2012)
    - 5.1 billion Google searches per day (2012)
- New Challenges
  - Data: massive, dynamic, heterogeneous, noisy
  - Users: diverse, “transitory”
- New Opportunities
  - Multiple sources of evidence: content, hyperlinks, document structure, user data, taxonomies
  - Data abundance/redundancy
- Review
  - Yang (2005). Information Retrieval on the Web, ARIST Vol. 39

Link Analysis: PageRank
- PageRank score: R(p_i)
  - Propagation of R(p_i) through inlinks of the entire Web
  - T = total # of pages in the Web
  - d = damping factor
  - p_i = inlink of p
  - C(p_i) = outdegree of p_i
  - Start with all R(p_i) = 1, repeat computation until convergence
- Global measure of a page based on link analysis only
- Interpretation
  - Models the behavior of a random Web surfer: a probability distribution/weighting function that estimates the likelihood of arriving at page p by link traversal and random jump (d)
  - Importance/Quality/Popularity of a Web page: a link signifies recommendation/citation; all recommendations are aggregated recursively over the entire Web, where each recommendation is weighted by its importance and normalized by its outdegree
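The slide's PageRank formula is not preserved in this transcript. One form consistent with the symbol definitions above, treating d as the random-jump probability, is R(p) = d/T + (1 - d) * sum over inlinks p_i of R(p_i)/C(p_i). The sketch below iterates that assumed form on a toy graph; the function name, the default d = 0.15, the tolerance, and the graph are all illustrative assumptions.

```python
def pagerank(outlinks, d=0.15, tol=1e-10, max_iter=200):
    """Iterate R(p) = d/T + (1 - d) * sum(R(q) / C(q) for q linking to p).

    outlinks maps each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one outlink
    (no dangling pages), which keeps the sketch short.
    """
    pages = list(outlinks)
    T = len(pages)
    rank = {p: 1.0 for p in pages}            # slide: start with all R = 1
    for _ in range(max_iter):
        new = {p: d / T for p in pages}       # random-jump share
        for q, targets in outlinks.items():
            share = (1 - d) * rank[q] / len(targets)   # R(q) / C(q)
            for p in targets:
                new[p] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:                       # repeat until convergence
            break
    return rank

# Toy three-page graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

Under these assumptions the scores settle into a probability distribution over pages, with `c` ranked highest because it receives links from both other pages.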

PageRank Simplified [figure not preserved]

Link Analysis: HITS
- Hyperlink Induced Topic Search
  - Considers both inlinks & outlinks
    - estimates the value of a page based on the aggregate value of its in/outlinks
  - Identifies “authority” & “hub” pages
    - authority = a page pointed to by many good hubs
    - hub = a page pointing to many good authorities
  - Query-dependent measure
    - hub & authority scores are assigned for each query
    - computed from a small subset of the Web, i.e. the top N retrieval results
- Premise
  - The Web contains mutually reinforcing communities of hubs & authorities on broad topics

Link Analysis: Modified HITS
- HITS-based Ranking
  1. Expand a set of text-based search results
     - Root set S = top N documents (e.g. N=200)
     - Inlinks & outlinks of S (1 or 2 hops)
       - Max. k inlinks per document (e.g. k=50)
       - Delete intrahost links, stoplist URLs
  2. Compute Hub and Authority scores
     - Iterative algorithm
     - Fractional weights to links by same authors
  3. Rank documents by Authority/Hub scores
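Step 1 above can be sketched as a one-hop expansion of the root set. This is a minimal illustration under stated assumptions, not the system's actual code: the function names, the in-memory link dictionaries, and the choice to apply the k-inlink cap after filtering are all hypothetical.

```python
from urllib.parse import urlparse

def host(url):
    """Host part of a URL, used here as a stand-in for 'author'."""
    return urlparse(url).netloc

def expand_root_set(root, outlinks, inlinks, k=50, stoplist=frozenset()):
    """One-hop expansion of a root set, keeping only inter-host links.

    outlinks / inlinks map a URL to the URLs it links to / is linked from.
    Returns the expanded page set and the retained (source, target) edges.
    """
    base = set(root)
    edges = set()
    for p in root:
        for q in outlinks.get(p, []):
            if host(q) != host(p) and q not in stoplist:
                base.add(q)
                edges.add((p, q))
        kept = [q for q in inlinks.get(p, [])
                if host(q) != host(p) and q not in stoplist]
        for q in kept[:k]:          # at most k inlinks per document
            base.add(q)
            edges.add((q, p))
    return base, edges
```

A second hop, which the slide mentions as an option, could be obtained by calling the function again with the expanded set as the new root.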

Modified HITS: Scoring Algorithm
1. Initialize all h(p) and a(p) to 1
2. Recompute h(p) and a(p) with fractional weights, normalizing the contribution of authorship (assumption: host = author)
   - $a(p) = \sum_{q} h(q) \cdot \mathrm{auth\_wt}(q,p)$, where q is a page linking to p
     - auth_wt(q,p) = 1/m for page q, whose host has m documents linking to p
   - $h(p) = \sum_{q} a(q) \cdot \mathrm{hub\_wt}(p,q)$, where q is a page linked from p
     - hub_wt(p,q) = 1/n for page q, whose host has n documents linked from p
3. Normalize scores
   - divide each score by the square root of the sum of squared scores ($\sum_p a(p)^2 = \sum_p h(p)^2 = 1$)
4. Repeat steps 2 & 3 until scores stabilize
   - Typical convergence in 10 to 50 iterations for 5000 webpages
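The scoring steps above can be sketched as follows. This is an illustrative reading of the slide, not a reference implementation: the function name, the `host_of` mapping, and the use of a fixed iteration count instead of an explicit stability test are assumptions.

```python
import math
from collections import Counter

def modified_hits(edges, host_of, iterations=50):
    """Hub/authority scores with host-based fractional link weights.

    edges: iterable of unique (source, target) page pairs.
    host_of: dict mapping each page to its host (host = author).
    """
    edges = list(set(edges))
    pages = {p for e in edges for p in e}
    # m: number of documents on the source's host that link to the target
    m = Counter((host_of[src], dst) for src, dst in edges)
    # n: number of documents on the target's host linked from the source
    n = Counter((src, host_of[dst]) for src, dst in edges)
    hub = {p: 1.0 for p in pages}              # step 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):                # step 4 (fixed count here)
        new_auth = {p: 0.0 for p in pages}
        new_hub = {p: 0.0 for p in pages}
        for src, dst in edges:                 # step 2
            new_auth[dst] += hub[src] / m[(host_of[src], dst)]
            new_hub[src] += auth[dst] / n[(src, host_of[dst])]
        for scores in (new_auth, new_hub):     # step 3
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
        auth, hub = new_auth, new_hub
    return hub, auth
```

With these weights, two pages on the same host that both link to a target contribute as much authority together as a single page from an independent host, which is the point of the authorship normalization.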

Modified HITS: Link Weighting
[Figure: page p with linked pages q_1, q_2, q_3, q_4, shown once for hub scoring and once for authority scoring]
h(p) = a(q_1) + a(q_2) + a(q_3) + a(q_4)/6
a(p) = h(q_1) + h(q_2) + h(q_3) + h(q_4)/5

WIDIT: Web IR System Overview
1. Mine Multiple Sources of Evidence (MSE)
   - Document Content
   - Document Structure
   - Link Information
   - URL information
2. Execute Parallel Search
   - Multiple Document Representations: body text, anchor text, header text
   - Multiple Query Formulations: query expansion
3. Combine the Parallel Search Results
   - Static Tuning of fusion formula (QT-independent)
4. Identify Query Types (QT)
   - Combination Classifier
5. Rerank the fusion result with MSE
   - Compute Reranking Feature Scores
   - Dynamic Tuning of reranking formulas (QT-specific)

WIDIT: Web IR System Architecture
[Figure: architecture diagram. Labels: Documents; Topics; Queries (Simple Queries, Expanded Queries); Indexing Module; Sub-indexes (Body Index, Anchor Index, Header Index); Retrieval Module; Search Results; Fusion Module; Static Tuning; Fusion Result; Query Classification Module; Query Types; Re-ranking Module; Dynamic Tuning; Final Result]

WIDIT: Dynamic Tuning Interface [screenshot not preserved]

SMART
- Length-Normalized Term Weights
  - SMART lnu weight for document terms
  - SMART ltc weight for query terms
  - where:
    - f_ik = number of times term k appears in document i
    - idf_k = inverse document frequency of term k
    - t = number of terms in document/query
- Document Score
  - inner product of document and query vectors
  - where:
    - q_k = weight of term k in the query
    - d_ik = weight of term k in document i
    - t = number of terms common to query & document
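The weight formulas themselves are not preserved in this transcript. The sketch below assumes a common log-tf, length-normalized form that is consistent with the symbols above: document weight d_ik = (log f_ik + 1) / sqrt(sum_j (log f_ij + 1)^2), query weight q_k = (log f_k + 1) * idf_k divided by the corresponding norm, and document score = sum over common terms of q_k * d_ik. These forms, the function names, and the idf values are assumptions; in particular, plain cosine-style normalization stands in here for whatever normalization the slide showed.

```python
import math

def lnu_weights(term_freqs):
    """Document weights: log tf, no idf, length-normalized (assumed form)."""
    raw = {t: math.log(f) + 1.0 for t, f in term_freqs.items() if f > 0}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

def ltc_weights(term_freqs, idf):
    """Query weights: log tf times idf, cosine-normalized (assumed form)."""
    raw = {t: (math.log(f) + 1.0) * idf.get(t, 0.0)
           for t, f in term_freqs.items() if f > 0}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

def score(query_w, doc_w):
    """Inner product over the terms common to query and document."""
    return sum(w * doc_w[t] for t, w in query_w.items() if t in doc_w)

# Hypothetical idf values for a 10-document collection.
idf = {"web": math.log(10 / 2), "search": math.log(10 / 5)}
doc = lnu_weights({"web": 3, "search": 1, "engine": 2})
query = ltc_weights({"web": 1, "search": 1}, idf)
doc_score = score(query, doc)
```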

Okapi
- Document term weight (simplified formula)
- Query term weight
- Document Ranking
- where:
  - Q = query containing terms T
  - K = k_1 ((1 - b) + b * (doc_length / avg.doc_length))
  - tf = term frequency in a document
  - qtf = term frequency in a query
  - k_1, b, k_3 = parameters (1.2, 0.75, )
  - w_RS = Robertson-Sparck Jones weight
  - N = total number of documents in the collection
  - n = total number of documents in which the term occurs
  - R = total number of relevant documents in the collection
  - r = total number of relevant documents retrieved
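The Okapi formulas are not preserved in this transcript. The sketch below uses the widely published BM25 form, which matches the symbol list above: score = sum over query terms T of w_RS * ((k_1 + 1) tf / (K + tf)) * ((k_3 + 1) qtf / (k_3 + qtf)). Without relevance information (R = r = 0), w_RS reduces to log((N - n + 0.5) / (n + 0.5)). The function names are hypothetical, and the default k3 = 1000 is an assumption, since the slide's value for k_3 is missing.

```python
import math

def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones weight; with R = r = 0 it is idf-like."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def bm25_score(query_tf, doc_tf, doc_len, avg_len, N, df,
               k1=1.2, b=0.75, k3=1000.0):
    """Score one document against a query.

    query_tf / doc_tf map terms to frequencies; df maps terms to the
    number of documents containing them. k3 is an assumed value.
    """
    K = k1 * ((1 - b) + b * doc_len / avg_len)
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue                      # term contributes nothing
        w = rsj_weight(N, df[term])
        total += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return total
```

The tf component saturates as tf grows, so a document gains less from each additional occurrence of a query term, while K penalizes documents that are longer than average.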