Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
Analysis of Large Graphs: TrustRank and WebSpam
Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,
Link Analysis: PageRank
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Social Networks 101 P ROF. J ASON H ARTLINE AND P ROF. N ICOLE I MMORLICA.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
The PageRank Citation Ranking “Bringing Order to the Web”
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Information Retrieval
Link Analysis HITS Algorithm PageRank Algorithm.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Search Engine Optimization
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Presented By: - Chandrika B N
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
CPSC 534L Notes based on the Data Mining book by A. Rajaraman and J. Ullman: Ch. 5.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Lecture 2: Extensions of PageRank
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
The Business Model of Google MBAA 609 R. Nakatsu.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Search Engine Marketing SEM = Search Engine Marketing SEO = Search Engine Optimization optimizing (altering/changing) your page in order to get a higher.
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Ch 14. Link Analysis Padmini Srinivasan Computer Science Department
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
“In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post.
1 CS 430: Information Discovery Lecture 5 Ranking.
Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
IR Theory: Web Information Retrieval. Web IRFusion IR Search Engine 2.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
Extrapolation to Speed-up Query- dependent Link Analysis Ranking Algorithms Muhammad Ali Norozi Department of Computer Science Norwegian University of.
1 Ranking. 2 Boolean vs. Non-boolean Queries Until now, we assumed that satisfaction is a Boolean function of a query –it is easy to determine if a document.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Jeffrey D. Ullman Stanford University.  Web pages are important if people visit them a lot.  But we can’t watch everybody using the Web.  A good surrogate.
WEB SPAM.
15-499:Algorithms and Applications
Search Engines and Link Analysis on the Web
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS 440 Database Management Systems
Information retrieval and PageRank
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
Graph and Link Mining.
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα Link Analysis and Web Search Chapter 14, from D. Easley and J. Kleinberg Jure Leskovez slides CS224W course

Topics Web Search HITS PageRank

How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart

How to Organize the Web Second try: Web Search Information Retrieval investigates: Find relevant docs in a small and trusted set e.g., Newspaper articles, Patents, etc. (“needle-in-a-haystack”) Limitation of keywords (synonyms, polysemy, etc) But: Web is huge, full of untrusted documents, random things, web spam, etc. Everyone can create a web page of high production value Rich diversity of people issuing queries Dynamic and constantly-changing nature of web content

Size of the Search Index http://www.worldwidewebsize.com/

Link Analysis Not all web pages are equal

Link Analysis Using Hubs and Authorities Assume you search for “newspaper” First collect a large number of pages that are relevant to the topic (e.g., using text-only IR) Note that the pages may not be actual web sites of a “newspaper” but we can use them to access the “authority” of a page on the topic How? Relevant pages “vote” through their links Authority = #in-links from relevant pages

Link Analysis Using Hubs and Authorities Relevant pages “quality” of relevant pages? Best – pages that compile lists of resources relevant to the topic (hubs)

Link Analysis Using Hubs and Authorities Voted for many of the pages value of a hub = sum of the votes received by all pages it voted for

Link Analysis Using Hubs and Authorities: the principle of repeated improvement Vote (link) from a good “hub” should count more! Think about restaurant recommendations!!

Link Analysis Using Hubs and Authorities: the principle of repeated improvement Why stop here? Recompute the score of the hubs using the improved scores of the authorities! Repeat .. Recompute the score of the authories using the improved scores of the hubs! …

Link Analysis Using Hubs and Authorities Interesting pages fall into two classes: Authorities are pages containing useful information (the prominent, highly endorsed answers to the queries) Newspaper home pages Course home pages Home pages of auto manufacturers Hubs are pages that link to authorities (highly value lists) List of newspapers Course bulletin List of US auto manufacturers A good hub links to many good authorities A good authority is linked from many good hubs

Link Analysis Using Hubs and Authorities: HITS algorithm Each page p, has two scores A hub score (h) quality as an expert Total sum of votes of pages pointed to An authority score (a) quality as content Total sum of votes of experts Authority Update Rule: For each page i, update a(i) to be the sum of the hub scores of all pages that point to it. Hub Update Rule: For each page i, update h(i) to be the sum of the authority scores of all pages that it points to. A good hub links to many good authorities A good authority is linked from many good hubs

Link Analysis Using Hubs and Authorities: HITS algorithm Start with all hub scores and all authority scores equal to 1. Choose a number of steps k. Perform a sequence of k hub-authority updates. For each node: - First, apply the Authority Update Rule to the current set of scores. - Then, apply the Hub Update Rule to the resulting set of scores. At the end, hub and authority scores may be very large. Normalize: divide each authority score by the sum of all authority scores, and each hub score by the sum of all hub scores.

Link Analysis Using Hubs and Authorities: HITS algorithm

Link Analysis Using Hubs and Authorities: HITS algorithm Converges?

Hubs and Authorities: Spectral Analysis Vector representation

Hubs and Authorities: Spectral Analysis

Hubs and Authorities: Spectral Analysis

Hubs and Authorities: Spectral Analysis

PageRank

PageRank A page is important if it has many links (in-going? outgoing?) Are all links equal important? A page is important if it is cited by other important pages

PageRank: flow equations

PageRank: Algorithm In a network with n nodes, assign all nodes the same initial PageRank = 1/n. Choose a number of steps k. Perform a sequence of k updates to the PageRank values, using the following rule: Basic PageRank Update Rule: Each page divides its current PageRank equally across its out-going links, and passes these equal shares to the pages it points to. (If a page has no out-going links, it passes all its current PageRank to itself.) Each page updates its new PageRank to be the sum of the shares it receives.

PageRank As a kind of “fluid” that circulates through the network Initially, all nodes PageRank 1/8 As a kind of “fluid” that circulates through the network The total PageRank in the network remains constant (no need to normalize)

PageRank: equilibrium A simple way to check whether an assignment of numbers s forms an equilibrium set of PageRank values: check that they sum to 1, and check that when apply the Basic PageRank Update Rule, we get the same values back. If the network is strongly connected then there is a unique set of equilibrium values.

PageRank: Algorithm

Vector representation PageRank: analysis Vector representation

PageRank: flow equations

PageRank: flow equations

PageRank: Random walks Random walk interpretation Someone randomly browse a network of Web pages. Starts by choosing a page at random, picking each page with equal probability. Then follows links for a sequence of k steps: each step, picking a random out-going link from the current page, and follow it to where it leads. Claim: The probability of being at a page X after k steps of this random walk is precisely the PageRank of X after k applications of the Basic PageRank Update Rule.

PageRank:extensions

PageRank: Spider Trap

PageRank: Spider Trap Scaled PageRank Update Rule: First apply the Basic PageRank Update Rule. Then scale down all PageRank values by a factor of s. This means that the total PageRank in the network has shrunk from 1 to s. We divide the residual 1 − s units of PageRank equally over all nodes, giving (1 − s)/n to each. In practice, s between 0.8 and 0.9

PageRank: Spider Trap

PageRank: Random walks Random walk with jumps With probability s, the walker follows a random edge as before; and with probability 1 − s, the walker jumps to a random page anywhere in the network, choosing each node with equal probability

PageRank: Dead End

PageRank: Dead End

PageRank: Dead End

PageRank: Spectral Analysis

PageRank: Algorithm

PageRank: Example

PageRank: Personalized Page Rank

Other Issues Analysis of anchor text e.g., weight links by a factor indicating the quality of the associated anchor text User Feedback click or not on a result Attempts to score highly in search engine rankings a search engine optimization (SEO) industry (perfect ranking a moving target + secretive)

PageRank: Trust Rank

Size of the Search Index a method that combines word frequencies obtained in a large offline text collection (corpus), and search counts returned by the engines. Each day 50 words are sent to all four search engines. The number of webpages found for these words are recorded; with their relative frequencies in the background corpus, multiple extrapolated estimations are made of the size of the engine's index, which are subsequently averaged. The 50 words have been selected evenly across logarithmic frequency intervals (see Zipf's Law). The background corpus contains more than 1 million webpages from DMOZ, and can be considered a representative sample of the World Wide Web. When you know, for example, that the word 'the' is present in 67,61% of all documents within the corpus, you can extrapolate the total size of the engine's index by the document count it reports for 'the'. If Google says that it found 'the' in 14.100.000.000 webpages, an estimated size of the Google's total index would be 23.633.010.000. http://www.worldwidewebsize.com/

Link Analysis Using Hubs and Authorities: HITS algorithm (again)

PageRank: random web surfer

PageRank: random web surfer