The web as a graph: structure and interpretation. Sridhar Rajagopalan IBM Almaden Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins (IBM, Almaden) Andrei Broder, Farzin Maghoul (AltaVista Corp.) Raymie Stata, Janet Wiener (Compaq SRC) Eli Upfal (Brown University)

A Picture of (~200M) pages.

Part A: Structure. The graph. The questions … and some answers. The picture.

The web graph. Nodes (or vertices) = web pages. Edges = (non-nepotistic) links. The graph = all web pages and links. Many nodes: estimates range from 500M to over 1B. The graph is very sparse. Average links/page between Average (links/page | more than 6 links) > 30. Concentrate on graph structure, ignore content.

Questions about the web graph. How big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)? Can one browse from any web page to any other? How many clicks? Can we pick a random page on the web? How different is browsing from a “random walk”? Can we exploit the structure of the web graph for searching and mining? What does the web graph reveal about the social processes that result in its creation and dynamics?

Power laws: How many pages point to a random page on the web? Indegrees. Slope = 2.1

How many links on a page? Slope = 2.7
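The "slope" on these two slides can be read as the exponent of a power law: the fraction of pages with degree k falls off roughly as k^(-slope), which shows up as a straight line on a log-log plot. A minimal sketch of how such a slope might be estimated from a degree histogram is given below. The function name `loglog_slope` is hypothetical, and a least-squares fit on the log-log plot is only a rough estimator, not necessarily the method used for the slides.

```python
import math

def loglog_slope(hist):
    """Least-squares slope of log(count) against log(degree).

    hist maps degree -> number of pages with that degree.
    Degrees or counts of zero are skipped, since log(0) is undefined.
    """
    pts = [(math.log(k), math.log(c)) for k, c in hist.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic histogram that follows count = C * k^(-2.1) exactly (illustrative data only).
synthetic = {k: 1e6 * k ** -2.1 for k in range(1, 200)}
slope = loglog_slope(synthetic)
```

On the synthetic histogram the fitted slope is the negative of the exponent, so an indegree "slope = 2.1" corresponds to a fitted value near -2.1.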

Yule/Pareto/Zipf and power laws. Inverse polynomial tail. Word frequency in text. Yule (later Mandelbrot): Statistical study of the literary vocabulary [Yule, 1944]. Citation analysis [Lotka, 1926]. Zipf: Human behavior and the principle of least effort [Zipf, 1949]. Pareto: Cours d’économie politique [Pareto, 1897]. Network graph [Faloutsos-Faloutsos-Faloutsos, 1999]. Oligonucleotide sequences [Martindale-Konopka, 1996]. Many other instances.

More germane. Access statistics for web pages (from server logs) [Glassman-97]. User behavior (by instrumenting browsers and proxies) [Lukose-Huberman-98, Crovella and others, 97-99]. Earliest analytical model [Herb Simon, 1955].

Co-citation and Bibliographic coupling: Signature of a community. Bipartite cores: small “complete” bipartite subgraphs. Bibliographic coupling, Co-citation analysis. Hubs and Authorities. Uses: Web searching (HITS/Clever). Mining communities (Campfire project). Backlink browsing, “find similar.”
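The two measures named on this slide can be stated concretely: two pages are co-cited when some third page links to both, and they are bibliographically coupled when they link to the same page. A minimal sketch, assuming the graph is given as a dictionary from each page to the pages it links to. The function names and the toy page names are hypothetical.

```python
def cocitation(links, p, q):
    """Number of pages that link to both p and q."""
    return sum(1 for outs in links.values() if p in outs and q in outs)

def coupling(links, p, q):
    """Number of pages that both p and q link to."""
    return len(set(links.get(p, ())) & set(links.get(q, ())))

# Toy graph: three "fan" pages pointing at a few "center" pages.
links = {
    "fan1": ["hotelA", "hotelB"],
    "fan2": ["hotelA", "hotelB", "hotelC"],
    "fan3": ["hotelC"],
}
co = cocitation(links, "hotelA", "hotelB")   # fan1 and fan2 cite both
bc = coupling(links, "fan1", "fan2")         # both link to hotelA and hotelB
```

A complete bipartite core is the extreme case: every fan is coupled to every other fan through the same centers, and every pair of centers is co-cited by all fans.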

Small world. (Small World Prediction) [Barabasi and Albert 99, Albert-Jeong-Barabasi 99]. Based on a simple model, they predict that most pages are within 19 links of each other. They justify the model by crawling nd.edu.

Facts (about the crawl). Most of the time (75%) a random page u is not reachable from another random page v. Indegree and Outdegree distributions satisfy the power law. Consistent over time and scale.

Component sizes. Component sizes follow a power-law distribution.

Reachability. How many vertices are reachable from a random vertex?
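The reachability question can be made concrete with a breadth-first search that follows links forward from a start page and counts what it finds. A minimal sketch, assuming an adjacency-list dictionary. The function name `reachable_count` and the toy graph are hypothetical, and a crawl-scale graph would need a far more compact representation.

```python
from collections import deque

def reachable_count(graph, start):
    """Count vertices reachable from start by following links forward (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

# Toy graph: a, b, c form a cycle; d links into the cycle but nothing links back to d.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
from_a = reachable_count(graph, "a")
from_d = reachable_count(graph, "d")
```

In the toy graph, d reaches every page while a cannot reach d, the same kind of asymmetry that makes a random page u often unreachable from another random page v.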

A Picture of (~200M) pages.

Part B: Interpretation. Random graph theory. Application 1: The Campfire Project. Application 2: Classical IR/Learning.

Random graphs. Erdos and Renyi’s model [Bollobas]. Graph with n vertices. Each of the n(n-1) arcs appears independently with probability p. Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1. [Shelah and Spencer] 0-1 law: most properties exhibit threshold, “phase change”-like behavior.

Facts about the Erdos-Renyi model. A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices. Indegrees and outdegrees are concentrated around the mean and have exponentially declining tails. Most vertices in the graph are close to most others (small world).
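The giant-component fact can be checked empirically on a small random graph. The sketch below makes several assumptions that are not in the slides: edges are treated as undirected, the edge probability is set to avg_degree / (n - 1), and components are tracked with a simple union-find. The function name `giant_component_fraction` is hypothetical.

```python
import random

def giant_component_fraction(n, avg_degree, seed=0):
    """Sample an undirected G(n, p) graph and return the share of vertices
    in its largest connected component."""
    rng = random.Random(seed)
    p = avg_degree / (n - 1)          # each vertex then has avg_degree neighbours on average
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:      # edge {u, v} is present
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

dense = giant_component_fraction(1000, 4)     # above the connectivity threshold
sparse = giant_component_fraction(1000, 0.5)  # below it: only small components
```

Comparing the two calls shows the threshold behavior from the previous slide: at average degree 4 most vertices sit in one component, while at average degree 0.5 no component holds more than a small fraction of them.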

A new random graph model.

Content creation hypothesis. Some page creators create content without regard to what exists on the web. Many create pages which are inspired by pre-existing content. Effectively, some links are random, others are copied from pre-existing pages.

Probabilistic analysis: Evolving graphs. Creation and deletion processes for nodes and edges. –e.g. at each time step, a new node is created with a fixed probability. –At each time step, a new edge is created with a fixed probability; it links two random nodes with some probability, or points to a node chosen in proportion to its indegree with another probability (copy a random link). –At each time step, a node (resp. edge) is deleted with a fixed probability. Simple model: creation probabilities are 1 and deletion probabilities are 0.
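A rough simulation of the simple model (creation probabilities 1, deletion probabilities 0) might look like the sketch below. It is an illustration under stated assumptions, not the authors' exact process: each new node adds exactly one outgoing link, the parameter `copy_prob` stands in for the probability whose symbol is missing from the transcript, and the graph is seeded with a single self-loop so that there is always a link to copy. Copying the destination of a uniformly random existing link selects a node in proportion to its indegree.

```python
import random
from collections import Counter

def copying_model(steps, copy_prob=0.5, seed=0):
    """Grow a graph one node and one edge per step; return the edge list."""
    rng = random.Random(seed)
    edges = [(0, 0)]                      # assumed seed graph: node 0 with a self-loop
    for new_node in range(1, steps):
        if rng.random() < copy_prob:
            # Copy a random link: destination chosen in proportion to indegree.
            _, target = rng.choice(edges)
        else:
            # Random link: destination chosen uniformly among existing nodes.
            target = rng.randrange(new_node)
        edges.append((new_node, target))
    return edges

def indegree_histogram(edges):
    """Map indegree -> number of nodes with that indegree (nodes with indegree 0 omitted)."""
    indegree = Counter(dst for _, dst in edges)
    return Counter(indegree.values())

edges = copying_model(5000)
hist = indegree_histogram(edges)
```

Plotting `hist` on log-log axes for different values of `copy_prob` is one way to see how copying, as opposed to purely random linking, produces a heavy-tailed indegree distribution.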

Theory

Why study models? Good predictors of macroscopic behavior. –Degree distributions. Existence and number of cores. [WWW8] Algorithmic advantages (speed and accuracy). –Better and analyzable algorithmic methods. Inclusion-Exclusion pruning. [VLDB]. –Applications to Data Mining. Better understanding of the data/corpus. –What is “surprising” depends on what is typical. To find interesting stuff, you must know what is expected.

Be careful about... Predicting and analyzing microscopic properties. – Microscopic Properties which can be changed by the addition/deletion of a few nodes/edges/features. Examples: Diameter and girth, rare terms and features. Very susceptible to noise and systematic but small inconsistencies in the model. – Macroscopic Major dataset surgery required to significantly alter the property. Examples: Degree distributions. Connectivity. Law of large numbers or equivalent applies.

Application 1: The campfire project.

Co-citation: Signature of a community. Bipartite cores: small “complete” bipartite subgraphs.

Campfire project Automatically find and organize communities on the web. Approach: –Find all cores. –Grow cores into the full community. –Do IR/Categorization/Clustering etc. to organize the community space. [KRRT] WWW8, and [KRRT] VLDB’99.
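The "find all cores" step can be sketched on a small graph. A core here is taken to be i "fan" pages that all link to the same j "center" pages. The sketch first prunes pages that cannot take part in any such core (a fan needs at least j outlinks, a center needs at least i inlinks) and then enumerates the survivors by brute force. This is an illustration only: the function name `find_cores` is hypothetical, and the enumeration is not the scalable procedure of the cited [KRRT] papers.

```python
from itertools import combinations

def find_cores(links, i, j):
    """Return (fans, centers) pairs where at least i fans all link to the same j centers."""
    out = {u: set(vs) for u, vs in links.items()}

    # Iterative pruning: repeat until no page is removed.
    changed = True
    while changed:
        changed = False
        indegree = {}
        for vs in out.values():
            for v in vs:
                indegree[v] = indegree.get(v, 0) + 1
        for u in list(out):
            kept = {v for v in out[u] if indegree[v] >= i}   # drop weak centers
            if len(kept) < j:                                # u cannot be a fan
                del out[u]
                changed = True
            elif kept != out[u]:
                out[u] = kept
                changed = True

    # Brute-force enumeration on the pruned graph.
    fans_of = {}
    for u, vs in out.items():
        for centers in combinations(sorted(vs), j):
            fans_of.setdefault(centers, set()).add(u)
    return [(sorted(fans), list(centers))
            for centers, fans in fans_of.items() if len(fans) >= i]

links = {"a": ["x", "y", "z"], "b": ["x", "y"], "c": ["x", "y", "w"], "d": ["w"]}
cores = find_cores(links, i=3, j=2)
```

In the toy graph the pruning removes z, w and d before enumeration, leaving the single core in which a, b and c all point to x and y.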

The cores are interesting. Explicit communities: Yahoo!, Excite, Infoseek, webrings, news groups, mailing lists. Implicit communities: hotels in Costa Rica, clipart, Japanese elementary schools, Turkish student associations, oil spills off the coast of Japan, Australian fire brigades, aviation/aircraft vendors, guitar manufacturers. (1) Implicit communities are defined by cores. (2) There are an order of magnitude more of these. There are efficient heuristics to compute all cores. (3) Can grow the core to the community using Clever.

Costa Rican Hotels. The Costa Rica Inte...ion on arts, busi... Informatica Interna...rvices in Costa Rica Cocos Island Research Center Aero Costa Rica Hotel Tilawa - Home Page COSTA RICA BY tamarindo.com Costa Rica New Page 5 The Costa Rica Internet Directory. Costa Rica, Zarpe Travel and Casa Maria Si Como No Resort Hotels & Villas Apartotel El Sesteo... de San José, Cos... Spanish Abroad, Inc. Home Page Costa Rica's Pura V...ry - Reservation... YELLOW\RESPALDO\HOTELES\Orquide1 Costa Rica - Summary Profile COST RICA, MANUEL A...EPOS: VILLA Hotels and Travel in Costa Rica Nosara Hotels & Res...els & Restaurants... Costa Rica Travel, Tourism & Resorts Association Civica de Nosara Costa Rica, Healthy...t Pura Vida Domestic & International Airline HOTELES / HOTELS - COSTA RICA tourgems Hotel Tilawa - Links Costa Rica Hotels T...On line Reservations Yellow pages Costa...Rica Export INFOHUB Costa Rica Travel Guide Hotel Parador, Manuel Antonio, Costa Rica Destinations

Elementary Schools in Japan. The American School in Japan The Link Page Kids' Space KEIMEI GAKUEN Home Page ( Japanese ) Shiranuma Home Page fuzoku-es.fukui-u.ac.jp welcome to Miasa E&J school fukui haruyama-es HomePage Torisu primary school goo Yakumo Elementary,Hokkaido,Japan FUZOKU Home Page Kamishibun Elementary School... schools LINK Page-13 100 Schools Home Pages (English) K-12 from Japan 10/...rnet and Education ) Koulutus ja oppilaitokset TOYODA HOMEPAGE Education Cay's Homepage(Japanese) UNIVERSITY DRAGON97-TOP

Application 2: Classical Learning/IR.

Vector space and other classical models. A document is a vector in a real-valued space with dimensions identified with “features.” [Salton] Some notion of similarity, usually cosine or dot product. Built-in assumption: features are independent.
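The cosine similarity mentioned on this slide can be sketched for documents stored as sparse term-weight dictionaries. The representation, the function name `cosine`, and the toy weights are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                     # convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

doc1 = {"web": 2.0, "graph": 1.0}
doc2 = {"web": 1.0, "graph": 0.5}      # same direction as doc1, half the length
doc3 = {"hotel": 1.0}                  # shares no terms with doc1
same_direction = cosine(doc1, doc2)
no_overlap = cosine(doc1, doc3)
```

Because each term is its own dimension, two documents that use different but related terms score zero, which is where the independence assumption becomes visible.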

Uses of the Vector Space model. Search, Clustering, Classification. Term weighting. [Salton, Dumais, Sparck-Jones] SVD (for instance, LSI [Deerwester et al.]). Gaussian assumption and classification (for instance, [Koller-Sahami], [Chakrabarti et al.]). Many ad-hoc methods and heuristics, some of which work remarkably well. [Modha et al.] Clustering. [Drineas et al.] Dimensionality reduction. Feature selection. [Johnson-Lindenstrauss, Koller-Sahami, Chakrabarti et al. and others]

Two (new?) ingredients. Hypertext -- the graph. Zipfian distributions on term occurrences.

Hypertext Classification/Clustering. Class of a page is a function of text + class of neighbor set. –Classification problem -- Markov Random fields. [Chakrabarti-Dom-Indyk] –Clustering problem -- [Modha]

Research issue. Rework applications in these new (rather old) contexts, OR explain why the standard algorithms continue to work despite the sometimes questionable assumptions behind their derivation.