1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
The PageRank Citation Ranking “Bringing Order to the Web”
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Advances & Link Analysis
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
(hyperlink-induced topic search)
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
Information Retrieval
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
Link Analysis HITS Algorithm PageRank Algorithm.
Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
Web Search Slides based on those of C. Lee Giles, who credits R. Mooney, S. White, W. Arms C. Manning, P. Raghavan, H. Schutze.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.
1 Page Link Analysis and Anchor Text for Web Search Lecture 9 Many slides based on lectures by Chen Li (UCI) an Raymond Mooney (UTexas)
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.
CSM06 Information Retrieval Lecture 4: Web IR part 1 Dr Andrew Salway
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
SEO  What is it?  Seo is a collection of techniques targeted towards increasing the presence of a website on a search engine.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Overview of Web Ranking Algorithms: HITS and PageRank
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Lecture #10 PageRank CS492 Special Topics in Computer Science: Distributed Algorithms and Systems.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
Understanding Google’s PageRank™ 1. Review: The Search Engine 2.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg ACM-SIAM Symposium, 1998 Krishna Venkateswaran 1.
Information Retrieval Part 2 Sissi 11/17/2008. Information Retrieval cont..  Web-Based Document Search  Page Rank  Anchor Text  Document Matching.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
“In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
The PageRank Citation Ranking: Bringing Order to the Web
DATA MINING Introductory and Advanced Topics Part III – Web Mining
HITS Hypertext-Induced Topic Selection
Methods and Apparatus for Ranking Web Page Search Results
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Thanks to Ray Mooney & Scott White
CS 440 Database Management Systems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Data Mining Chapter 6 Search Engines
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

2 Introduction Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search. It is the primary reason for Google’s success.

3 Anchor Text Indexing Anchor text: text which is visible in the hyperlink on which one clicks.Anchor text Extract anchor text (between and ). – Evil Empire – IBM Anchor text is usually descriptive of the document to which it points. Add anchor text to the content of the destination page to provide additional relevant keyword indices. Many times anchor text is not useful: –“click here”

4 Bibliometrics: Citation Analysis Link analysis for web search has antecedents in the field of citation analysis, overlap with an area known as bibliometrics. Many standard documents include bibliographies (or references), explicit citations to other previously published documents.

Bibliometrics: Citation Analysis Using citations as links, standard corpora can be viewed as a graph. The structure of this graph, independent of content, can provide interesting information about the similarity of documents and the structure of information. 5

6 Bibliographic Coupling Measure of similarity of documents introduced by Kessler in The bibliographic coupling of two documents A and B is the number of documents cited by both A and B. Maybe want to normalize by size of bibliographies. AB

7 Co-Citation An alternate citation-based measure of similarity introduced by Small in Number of documents that cite both A and B. Maybe want to normalize by total number of documents citing either A or B. AB

8 Citations vs. Links Web links are a bit different than citations: –Many pages with high in-degree are portals not content providers. –Not all links are endorsements. –Company websites don’t point to their competitors. –Citations to relevant literature is enforced by peer-review.

9 Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority. However in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?

10 Hubs Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). Hub pages for IR are included in this page: –

11 HITS Algorithm developed by Kleinberg in Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: –Hubs point to lots of authorities. –Authorities are pointed to by lots of hubs.

12 Hubs and Authorities Together they tend to form a bipartite graph: HubsAuthorities

13 HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. First determines a set of relevant pages for the query called the base set S. Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set.

14 Constructing a Base Subgraph For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R. S R

15 Authorities and In-Degree Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (may just be generally popular pages like Yahoo or Amazon). True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).

16 Iterative Algorithm Use an iterative algorithm to converge on a set of hubs and authorities. Maintain for each page p  S: –Authority score: a p –Hub score: h p Initialize all a p = h p = 1

17 HITS Update Rules Authorities are pointed to by lots of good hubs: Hubs point to lots of good authorities:

18 Illustrated Update Rules 2 3 a 4 = h 1 + h 2 + h h 4 = a 5 + a 6 + a 7

19 HITS Iterative Algorithm Initialize for all p  S: a p = h p = 1 For i = 1 to k: For all p  S: (update auth. scores) For all p  S: (update hub scores) For all p  S: a p = a p /c c: For all p  S: h p = h p /c c: (normalize a) (normalize h)

20 Comments In practice, 20 iterations produces fairly stable results. In most cases, the final authorities were not in the initial root set generated using Altavista.

21 PageRank Alternative link-analysis method used by Google (Brin & Page, 1998). Does not attempt to capture the distinction between hubs and authorities. Ranks pages just by authority. Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query.

22 Initial PageRank Idea Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link. Initial page rank equation for page p: –N q is the total number of out-links from page q. –A page, q, “gives” an equal fraction of its authority to all the pages it points to (e.g. p). –c is a normalizing constant set so that the rank of all pages always sums to 1.

23 Initial PageRank Idea (cont.) Can view it as a process of PageRank “flowing” from pages to the pages they cite

24 Initial Algorithm Iterate rank-flowing process until convergence: Let S be the total set of pages. Initialize  p  S: R(p) = 1/|S| Until ranks do not change (much) (convergence) For each p  S: For each p  S: R(p) = cR´(p) (normalize)

25 Sample Stable Fixpoint

26 Problem with Initial Idea A group of pages that only point to themselves but are pointed to by other pages act as a “rank sink” and absorb all the rank in the system. Rank flows into cycle and can’t get out

27 Random Surfer Model After modification, PageRank can be seen as modeling a “random surfer” that starts on a random page and then at each point: –With probability E(p) randomly jumps to page p. – Otherwise, randomly follows a link on the current page. R(p) models the probability that this random surfer will be on page p at any given time. “E jumps” are needed to prevent the random surfer from getting “trapped” in web sinks with no outgoing links.

28 Speed of Convergence Early experiments on Google used 322 million links. PageRank algorithm converged (within small tolerance) in about 52 iterations. Number of iterations required for convergence is empirically O(log n) (where n is the number of links). Therefore calculation is quite efficient.

29 Simple Title Search with PageRank Use simple Boolean search to search web- page titles and rank the retrieved pages by their PageRank. Sample search for “university”: –Altavista returned a random set of pages with “university” in the title (seemed to prefer short URLs). –Primitive Google returned the home pages of top universities.

30 Google Ranking Complete Google ranking includes (based on university publications prior to commercialization). –Vector-space similarity component. –Keyword proximity component. –HTML-tag weight component (e.g. title preference). –PageRank component. Details of current commercial ranking functions are trade secrets.

31 Google PageRank-Biased Spidering Use PageRank to direct (focus) a spider on “important” pages. Compute page-rank using the current set of crawled pages. Order the spider’s search queue based on current estimated PageRank.