Network Structure and Web Search Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.

Beyond Macroscopic Structure
Broder et al. report on coarse overall structure of the web
Use and construction of the web are more fine-grained
  –people browse the web for certain information or topics
  –people build pages that link to related or “similar” pages
How do we quantify & analyze this more detailed structure?
We’ll examine two related examples:
  –Kleinberg’s hubs and authorities
      automatic identification of “web communities”
  –PageRank
      automatic identification of “important” pages
      one of the main criteria used by Google
  –both rely mainly on the link structure of the web
  –both have an algorithm and a theory supporting them

Hubs and Authorities
Suppose we have a large collection of pages on some topic
  –possibly the results of a standard web search
Some of these pages are highly relevant, others not at all
How could we automatically identify the important ones?
What’s a good definition of importance?
Kleinberg’s idea: there are two kinds of important pages:
  –authorities: highly relevant pages
  –hubs: pages that point to lots of relevant pages
If you buy this definition, it further stands to reason that:
  –a good hub should point to lots of good authorities
  –a good authority should be pointed to by many good hubs
  –this logic is, of course, circular
We need some math and an algorithm to sort it out

The HITS System (Hyperlink-Induced Topic Search)
Given a user-supplied query Q:
  –assemble root set S of pages (e.g. first 200 pages by AltaVista)
  –grow S to base set T by adding all pages linked (in either direction) to S
  –might bound number of links considered from each page in S
Now consider directed subgraph induced on just pages in T
For each page p in T, define its
  –hub weight h(p); initialize all to be 1
  –authority weight a(p); initialize all to be 1
Repeat “forever”:
  –a(p) := sum of h(q) over all pages q → p
  –h(p) := sum of a(q) over all pages p → q
  –renormalize all the weights
This algorithm will always converge!
  –the computed weights are related to eigenvectors of the connectivity matrix
  –further substructure revealed by different eigenvectors
Here are some examples
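The update loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hits`, the dictionary-of-lists graph format, the fixed iteration count standing in for “forever”, and the choice of unit Euclidean length for the renormalization step are not from the slides.

```python
import math

def hits(links, iterations=50):
    """Hub/authority weights for a directed graph.

    links: dict mapping each page to the list of pages it points to
    (a hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}    # h(p), initialized to 1
    auth = {p: 1.0 for p in pages}   # a(p), initialized to 1

    for _ in range(iterations):      # finite stand-in for "forever"
        # a(p) := sum of h(q) over all pages q -> p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]

        # h(p) := sum of a(q) over all pages p -> q
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}

        # Renormalize (assumption: scale each vector to unit Euclidean length).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth

if __name__ == "__main__":
    toy = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
    hub, auth = hits(toy)
    print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

In the toy graph, a1 is pointed to by all three hubs, so it should end up with the larger authority weight, and h1 and h2 (which point to both authorities) should outrank h3 as hubs.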

The PageRank Algorithm
Let’s define a measure of page importance we will call the rank
Notation: for any page p, let
  –N(p) be the number of forward links (pages p points to)
  –R(p) be the (to-be-defined) rank of p
Idea: important pages distribute importance over their forward links
So we might try defining
  –R(p) := sum of R(q)/N(q) over all pages q → p
  –can again define iterative algorithm for computing the R(p)
  –if it converges, solution again has an eigenvector interpretation
  –problem: cycles accumulate rank but never distribute it
The fix:
  –R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
  –E(p) is some external or exogenous measure of importance
  –some technical details omitted here (e.g. normalization)
Let’s play with the PageRank calculator
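A minimal sketch of the iterative computation, assuming one common way of filling in the omitted normalization: the link term is scaled by a damping factor and E(p) by one minus that factor, so that the ranks keep summing to 1. The function name `pagerank`, the `damping` value of 0.85, the uniform default for E(p), and the handling of pages with no forward links are all assumptions of this sketch, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=100, external=None):
    """Iteratively compute R(p) for a directed graph.

    links: dict mapping each page to the list of pages it points to.
    external: optional dict giving E(p), assumed to sum to 1;
              defaults to the uniform distribution.
    """
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    E = external or {p: 1.0 / n for p in pages}
    R = dict(E)  # start from E(p)

    for _ in range(iterations):
        # Rank held by pages with no forward links; assumption: it is
        # redistributed according to E(p) so that no rank is lost.
        dangling = sum(R[p] for p in pages if not links.get(p))
        new_R = {p: (1 - damping) * E[p] + damping * dangling * E[p]
                 for p in pages}
        # Each page q spreads its rank evenly over its N(q) forward links.
        for q in pages:
            targets = links.get(q, [])
            for p in targets:
                new_R[p] += damping * R[q] / len(targets)
        R = new_R

    return R

if __name__ == "__main__":
    toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(toy).items()):
        print(page, round(rank, 4))
```

In the toy graph, C receives rank from both A and B, while B receives only half of A's rank, so C should come out ahead of B.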

[Figure: example graph on nodes A, B, C, D, E, F, G]

The “Random Surfer” Model
Let’s suppose that E(p) sums to 1 (normalized)
Then the resulting PageRank solution R(p)
  –will also be normalized
  –can be interpreted as a probability distribution
R(p) is the stationary distribution of the following process:
  –starting from some random page, just keep following random links
  –if stuck in a loop, jump to a random page drawn according to E(p)
  –so surfer periodically gets “bored” and jumps to a new page
  –E(p) can thus be personalized for each surfer
An important component of Google’s search criteria
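The process described above can be simulated directly. The sketch below is illustrative only: the function name `random_surfer`, the fixed “boredom” probability of 0.15 per step, the uniform choice of E(p), and the seed are assumptions made for the example. The fraction of time spent on each page estimates the stationary distribution.

```python
import random

def random_surfer(links, steps=100_000, boredom=0.15, seed=0):
    """Estimate the stationary distribution by simulating one surfer.

    links: dict mapping each page to the list of pages it points to.
    boredom: assumed probability of jumping to a random page at each step.
    """
    rng = random.Random(seed)
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)               # start from some random page

    for _ in range(steps):
        visits[page] += 1
        targets = links.get(page, [])
        if not targets or rng.random() < boredom:
            page = rng.choice(pages)       # jump drawn from a uniform E(p)
        else:
            page = rng.choice(targets)     # follow a random forward link

    return {p: count / steps for p, count in visits.items()}

if __name__ == "__main__":
    toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(random_surfer(toy))
```

Replacing the uniform jump with a draw weighted by a surfer-specific E(p) would give the personalized variant mentioned above.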

But What About Content?
PageRank and Hubs & Authorities
  –both based purely on link structure
  –often applied to a pre-computed set of pages filtered for content
So how do (say) search engines do this filtering?
This is the domain of information retrieval

Basics of Information Retrieval
Represent a document as a “bag of words”:
  –for each word in the English language, count number of occurrences
  –so d[i] is the number of times the i-th word appears in the document
  –usually ignore common words (the, and, of, etc.)
  –usually do some stemming (e.g. “washed” → “wash”)
  –vectors are very long (~100Ks) but very sparse
  –need some special representation exploiting sparseness
Note all that we ignore or throw away:
  –the order in which the words appear
  –the grammatical structure of sentences (parsing)
  –the sense in which a word is used
      firing a gun or firing an employee
  –and much, much more…
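A rough sketch of building such a representation, assuming a tiny hand-picked stop-word list and a deliberately crude suffix-stripping stemmer (real systems use much more careful versions of both). The names `STOP_WORDS`, `crude_stem` and `bag_of_words` are hypothetical. A `Counter` stores only the words that actually occur, which is one simple way to exploit sparseness.

```python
import re
from collections import Counter

# Assumption: a very small illustrative stop-word list.
STOP_WORDS = {"the", "and", "of", "a", "an", "to", "in", "is"}

def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return a sparse word-count vector, ignoring order and grammar."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)

if __name__ == "__main__":
    print(bag_of_words("The dog washed and the dogs wash"))
```

Note that word order is lost entirely: any reordering of the same sentence produces the same counts.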

Bag of Words Document Comparison
View documents as vectors in a very high-dimensional space
Can now import geometry and linear algebra concepts
Similarity between documents d and e:
  –Σ d[i]*e[i] over all words i (the dot product)
  –may normalize d and e first
  –this is their projection onto each other (the cosine of the angle between them, once normalized)
Improve by using TF/IDF weighting of words:
  –term frequency: how frequent is the word in this document?
  –inverse document frequency: how rare is the word across all documents?
  –give high weight to words with high TF and high IDF (frequent in this document, rare overall)
Search engines:
  –view the query as just another “document”
  –look for similar documents via above
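The similarity computation can be sketched as follows. The weighting used here, raw term count times log(N / document frequency), is one common TF/IDF variant and is an assumption of this example, as are the function names `tf_idf_vectors` and `cosine` and the representation of documents as lists of tokens.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict vector per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: term count * log(N / document frequency).
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors

def cosine(d, e):
    """Dot product of d and e after normalizing both to unit length."""
    dot = sum(w * e.get(t, 0.0) for t, w in d.items())
    norm = (math.sqrt(sum(w * w for w in d.values()))
            * math.sqrt(sum(w * w for w in e.values())))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    corpus = [
        ["web", "search", "rank"],
        ["web", "search", "engine"],
        ["gun", "fire", "web"],
    ]
    vecs = tf_idf_vectors(corpus)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy corpus the word “web” appears in every document, so its weight is zero and it contributes nothing to any similarity score; the first two documents remain similar only through the rarer shared word “search”.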

Looking Ahead: Left Side vs. Right Side
So far we are discussing the “left hand” search results on Google
  –a.k.a. “organic” search
“Right hand” or “sponsored” search: paid advertisements in a formal market
  –We will spend a lecture on these markets later in the term
Same two types of search/results on Yahoo!, MSN,…
Common perception:
  –organic results are “objective”, based on content, importance, etc.
  –sponsored results are subjective advertisements
But both sides are subject to “gaming” (strategic behavior)…
  –organic: invisible terms in the html, link farms and web spam, reverse engineering
  –sponsored: bidding behavior, “jamming”
  –optimization of each side has its own industry: SEO and SEM
… and perhaps to outright fraud
  –organic: typo squatting
  –sponsored: click fraud
More later…