Network Science and the Web: A Case Study Networked Life CIS 112 Spring 2009 Prof. Michael Kearns.


The Web as Network
Consider the web as a network
– vertices: individual (html) pages
– edges: hyperlinks between pages
– will view as both a directed and undirected graph
What is the structure of this network?
– connected components
– degree distributions
– etc.
What does it say about the people building and using it?
– page and link generation
– visitation statistics
What are the algorithmic consequences?
– web search
– community identification

Graph Structure in the Web [Broder et al. paper]
Report on the results of two massive “web crawls”
Executed by AltaVista in May and October 1999
Details of the crawls:
– automated script following hyperlinks (URLs) from pages found
– large set of starting points collected over time
– crawl implemented as breadth-first search
– have to deal with webspam, infinite paths, timeouts, duplicates, etc.
May ’99 crawl:
– 200 million pages, 1.5 billion links
Oct ’99 crawl:
– 271 million pages, 2.1 billion links
Unaudited, self-reported Sep ’03 stats:
– 3 major search engines claim > 3 billion pages indexed
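
The crawl procedure described above can be sketched as a breadth-first search over a link graph. This is a minimal illustration, not AltaVista's crawler: the names `bfs_crawl`, `fetch_links`, and `max_pages` are hypothetical, and a small in-memory dictionary stands in for the real web. The `seen` set handles duplicates, and the page cap is one simple guard against infinite paths.

```python
from collections import deque

def bfs_crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl from the seed URLs.

    fetch_links(url) is an assumed callable returning the URLs a page links to.
    Returns the pages in visit order and the number of links encountered.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:          # skip duplicate starting points
            seen.add(seed)
            queue.append(seed)

    pages, link_count = [], 0
    while queue and len(pages) < max_pages:   # cap guards against infinite paths
        url = queue.popleft()
        pages.append(url)
        for target in fetch_links(url):
            link_count += 1
            if target not in seen:    # each page is queued at most once
                seen.add(target)
                queue.append(target)
    return pages, link_count

# Toy "web": page -> list of pages it links to (illustrative data only).
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
pages, link_count = bfs_crawl(["a"], lambda url: toy_web.get(url, []))
print(pages, link_count)
```

A real crawler would also need timeouts and spam filtering, which are left out here.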

Five Easy Pieces
Authors did two kinds of breadth-first search:
– ignoring link direction → weak connectivity
– only following forward links → strong connectivity
They then identify five different regions of the web:
– strongly connected component (SCC): can reach any page in SCC from any other in directed fashion
– component IN: can reach any page in SCC in directed fashion, but not reverse
– component OUT: can be reached from any page in SCC, but not reverse
– component TENDRILS: weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
– SCC+IN+OUT+TENDRILS form weakly connected component (WCC)
– everything else is called DISC (disconnected from the above)
– here is a visualization of this structure
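
The five regions can be computed on a small graph with three breadth-first searches from one seed page: one following forward links, one following links backward, and one ignoring direction. The sketch below uses the slide's definitions and assumes the seed is known to lie in the giant SCC; the function names and the toy edge list are illustrative, not taken from the paper.

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from start by BFS over adj[node] -> neighbors."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, seed):
    """Split a directed graph into SCC, IN, OUT, TENDRILS, DISC around seed.

    Assumes seed is a page inside the giant strongly connected component.
    """
    forward, backward, undirected = {}, {}, {}
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)
        undirected.setdefault(u, set()).add(v)
        undirected.setdefault(v, set()).add(u)

    fwd = reachable(forward, seed)      # pages the seed can reach
    bwd = reachable(backward, seed)     # pages that can reach the seed
    scc = fwd & bwd
    wcc = reachable(undirected, seed)   # ignoring link direction
    return {
        "SCC": scc,
        "IN": bwd - scc,
        "OUT": fwd - scc,
        "TENDRILS": wcc - fwd - bwd,
        "DISC": nodes - wcc,
    }

# Toy graph: a, b, c form a cycle; i points in; o is pointed to; t hangs off i.
example_edges = [("a", "b"), ("b", "c"), ("c", "a"),
                 ("i", "a"), ("c", "o"), ("i", "t"), ("x", "y")]
print(bow_tie(example_edges, "a"))
```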

Size of the Five
SCC: ~56M pages, ~28%
IN: ~43M pages, ~21%
OUT: ~43M pages, ~21%
TENDRILS: ~44M pages, ~22%
DISC: ~17M pages, ~8%
WCC > 91% of the web (the giant component)
One interpretation of the pieces:
– SCC: the heart of the web
– IN: newer sites not yet discovered and linked to
– OUT: “insular” pages like corporate web sites

Diameter Measurements
Directed worst-case diameter of the SCC:
– at least 28
Directed worst-case diameter of IN ∪ SCC ∪ OUT:
– at least 503
Over 75% of the time, there is no directed path between a random start and finish page in the WCC
– when there is a directed path, average length is 16
Average undirected distance in the WCC is 7
Moral:
– web is a “small world” when we ignore direction
– otherwise the picture is more complex
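
The two quantities reported above, the fraction of page pairs joined by a path and the average path length when one exists, can be measured on a toy graph with breadth-first search. The sketch below checks every ordered pair, which is feasible only for small graphs; on a web-scale graph one would have to sample start pages instead. The names `bfs_distances` and `path_stats` are illustrative.

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest-path distance (in links) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def path_stats(edges, directed=True):
    """Return (fraction of ordered pairs with a path, average path length)."""
    adj, nodes = {}, set()
    for u, v in edges:
        nodes.update((u, v))
        adj.setdefault(u, set()).add(v)
        if not directed:              # ignore direction: add the reverse link
            adj.setdefault(v, set()).add(u)

    lengths = []
    for start in nodes:
        dist = bfs_distances(adj, start)
        lengths.extend(d for target, d in dist.items() if target != start)

    pairs = len(nodes) * (len(nodes) - 1)
    connected = len(lengths) / pairs if pairs else 0.0
    average = sum(lengths) / len(lengths) if lengths else float("nan")
    return connected, average

chain = [("a", "b"), ("b", "c")]
print(path_stats(chain, directed=True))   # only half the ordered pairs connect
print(path_stats(chain, directed=False))  # every pair connects
```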

Degree Distributions
They are, of course, heavy-tailed
Power law distribution of component size
– consistent with the Erdős-Rényi model
Undirected connectivity of web not reliant on “connectors”
– what happens as we remove high-degree vertices?

Exploiting Web Structure: Google and PageRank

The PageRank Algorithm
Let’s define a measure of page importance we will call the rank
Notation: for any page p, let
– N(p) be the number of forward links (pages p points to)
– R(p) be the (to-be-defined) rank of p
Idea: important pages distribute importance over their forward links
So we might try defining
– R(p) := sum of R(q)/N(q) over all pages q → p
– can define iterative algorithm for computing the R(p)
– (if it converges, solution has an eigenvector interpretation)
– problem: cycles accumulate rank but never distribute it
The fix:
– R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
– E(p) is some external or exogenous measure of importance
– some technical details omitted here (e.g. normalization)
Let’s play with the PageRank calculator
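
The update rule with the E(p) fix can be sketched as a simple iteration. The slide leaves normalization unspecified, so this sketch makes one assumption: after each pass the ranks are rescaled to sum to 1. The function name `pagerank`, the fixed iteration count, and the four-page example are all illustrative, and every page must appear as a key in `links`.

```python
def pagerank(links, E, iterations=100):
    """Iterate R(p) := [sum of R(q)/N(q) over q -> p] + E(p).

    links: dict mapping each page to the list of pages it points to.
    E:     dict mapping each page to its exogenous importance.
    Ranks are rescaled to sum to 1 after every pass (an assumed normalization).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: E[p] for p in pages}        # start from E(p)
        for q in pages:
            targets = links[q]
            if targets:                            # N(q) > 0
                share = rank[q] / len(targets)     # R(q) / N(q)
                for p in targets:
                    new_rank[p] += share
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

# Toy example: page d has no incoming links, so it keeps only its E(p) share.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
E = {p: 0.15 / len(links) for p in links}
for page, score in sorted(pagerank(links, E).items()):
    print(page, round(score, 3))
```

A production implementation would iterate until the ranks stop changing rather than for a fixed number of passes.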

The “Random Surfer” Model
Let’s suppose that E(p) sums to 1 (normalized)
Then the resulting PageRank solution R(p) will
– also be normalized
– can be interpreted as a probability distribution
R(p) is the stationary distribution of the following process:
– starting from some random page, just keep following random links
– if stuck in a loop, jump to a random page drawn according to E(p)
– so surfer periodically gets “bored” and jumps to a new page
– E(p) can thus be personalized for each surfer
PageRank is one (important) component of Google’s search tech
– don’t forget the words!
– information retrieval and statistical language processing
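
The process above can be simulated directly: follow random links, and occasionally jump to a page drawn according to E(p). The slide does not say how often the surfer gets bored, so `jump_prob` below is an assumed parameter, as are the function name, the step count, and the rule that a page with no forward links always triggers a jump. The visit frequencies approximate the stationary distribution.

```python
import random

def random_surfer(links, E, steps=20000, jump_prob=0.15, seed=0):
    """Simulate the random surfer and return each page's visit frequency.

    links:     dict mapping each page to the list of pages it points to.
    E:         dict of jump weights per page (the distribution E(p)).
    jump_prob: assumed chance per step that the surfer gets "bored".
    """
    rng = random.Random(seed)                  # seeded for repeatable runs
    pages = list(links)
    weights = [E[p] for p in pages]
    visits = {p: 0 for p in pages}

    page = rng.choices(pages, weights=weights)[0]
    for _ in range(steps):
        visits[page] += 1
        if not links[page] or rng.random() < jump_prob:
            page = rng.choices(pages, weights=weights)[0]   # jump via E(p)
        else:
            page = rng.choice(links[page])                  # follow a random link
    return {p: count / steps for p, count in visits.items()}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
E = {p: 1.0 / len(links) for p in links}       # uniform, sums to 1
print(random_surfer(links, E))
```

Passing a non-uniform `E` corresponds to the personalization mentioned above: the surfer's jumps favor the pages that particular surfer cares about.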

Looking Ahead: Left Side vs. Right Side
So far we are discussing the “left hand” search results on Google
– a.k.a. “organic” search
“Right hand” or “sponsored” search: paid advertisements in a formal market
– We will spend a lecture on these markets later in the term
Same two types of search/results on Yahoo!, MSN,…
Common perception:
– organic results are “objective”, based on content, importance, etc.
– sponsored results are subjective advertisements
But both sides are subject to “gaming” (strategic behavior)…
– organic: invisible terms in the html, link farms and web spam, reverse engineering
– sponsored: bidding behavior, “jamming”
– optimization of each side has its own industry: SEO and SEM
… and perhaps to outright fraud
– organic: typo squatting
– sponsored: click fraud
More later…