CS347 Lecture 12 May 21, 2001 ©Prabhakar Raghavan.

Topics Web characterization Research Problems

The Web: A directed graph Nodes = static web pages (1+ billion) Edges = static hyperlinks (~10 billion) Web graph = Snapshot of web pages and hyperlinks Sparse graph: ~7 links/page on average Focus on graph structure, ignore content
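As a minimal sketch of the directed-graph view, the snippet below stores a toy set of hyperlinks as adjacency maps and reads off in-degree, out-degree, and average links per page. The function names and the five-edge toy graph are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def build_web_graph(edges):
    """Build forward and backward adjacency maps from (source, target) hyperlinks."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages that link to it
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

def average_out_degree(edges, num_pages):
    """Average number of links per page (edges divided by nodes)."""
    return len(edges) / num_pages

# Hypothetical toy web of four pages and five hyperlinks.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
out_links, in_links = build_web_graph(toy_edges)
```

Here `len(out_links[p])` is the out-degree of page `p` and `len(in_links[p])` its in-degree; keeping both maps is what makes back-link queries cheap.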

Questions about the web graph How big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)? Can one browse from any web page to any other? How many clicks? Can we pick a random page on the web? –Search engine measurement.

Questions about the web graph Can we exploit the structure of the web graph for searching and mining? What does the web graph reveal about social processes which result in its creation and dynamics? How different is browsing from a “random walk”?

Why? Exploit structure for Web algorithms –Crawl strategies –Search –Mining communities Classification/organization Web anthropology –Prediction, discovery of structures –Sociological understanding

Web snapshots Altavista crawls (May 99/Oct 99/Feb 00) 220/317/500M pages 1.5/2.1/5B hyperlinks Compaq CS2 connectivity server –back-link information –10 bytes/URL, 3.4 bytes/link, 0.15 µs/access –given pages, return their in/out neighborhood

Algorithms Weakly connected components (WCC) Strongly connected components (SCC) Breadth-first search (BFS) Diameter

Challenges from scale Typical diameter algorithm: –number of steps ~ pages × links. –For 500 million pages and 5 billion links, even at a very optimistic 0.15 µs/step, we need ~4 × 10^11 seconds (roughly 12,000 years). Hopeless. Will instead estimate diameter/distance metrics.
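The scale argument can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using the page count, link count, and per-step cost quoted on the slide (the cost is taken to be in microseconds); the variable names are illustrative.

```python
# Figures quoted on the slide; the per-step cost is assumed to be in microseconds.
pages = 500_000_000
links = 5_000_000_000
seconds_per_step = 0.15e-6

# A pages x links algorithm (e.g. one traversal per start page).
quadratic_steps = pages * links
quadratic_seconds = quadratic_steps * seconds_per_step
quadratic_years = quadratic_seconds / (365 * 24 * 3600)

# A single pass that is linear in the number of links.
linear_seconds = links * seconds_per_step
```

With these inputs the product-style algorithm needs about 3.75 × 10^11 seconds, on the order of ten thousand years, while a single linear pass over the links takes minutes.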

Scale On the other hand, can handle tasks linear in the links (5 billion) at a µs/step. –E.g., breadth-first search First eliminate duplicate pages/mirrors. Linear-time implementations for WCC and SCC.
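The lecture does not say which WCC and SCC implementations were used. As one possible sketch, the code below finds weakly connected components with union-find (near-linear rather than strictly linear) and strongly connected components with Kosaraju's two-pass algorithm, written with explicit stacks so that deep graphs do not hit Python's recursion limit. All names and the toy graph are assumptions for illustration.

```python
from collections import defaultdict

def weakly_connected_components(nodes, edges):
    """Union-find over the edges with direction ignored; returns a list of node sets."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for u, w in edges:
        ru, rw = find(u), find(w)
        if ru != rw:
            parent[ru] = rw

    groups = defaultdict(set)
    for v in nodes:
        groups[find(v)].add(v)
    return list(groups.values())

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: DFS finish order on the graph, then sweep the reversed graph."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, w in edges:
        fwd[u].append(w)
        rev[w].append(u)

    # Pass 1: record nodes in order of DFS finishing time on the forward graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(fwd[w])))
                    break
            else:
                order.append(v)  # all successors explored: v is finished
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in rev[v]:
                if w not in assigned:
                    assigned.add(w)
                    component.add(w)
                    stack.append(w)
        components.append(component)
    return components

# Hypothetical toy graph: a 3-cycle core, one page linking in, one linked out, one isolated.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("c", "e")]
wcc = weakly_connected_components(toy_nodes, toy_edges)
scc = strongly_connected_components(toy_nodes, toy_edges)
```

On the toy graph the core cycle forms one SCC, while the page that only links in and the page that is only linked to each end up as singleton SCCs inside the same WCC.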

May 1999 crawl 220 million pages after duplicate elimination. Giant WCC has ~186 million pages. Giant SCC has ~56 million pages. –Cannot browse your way from any page to any other –Next biggest SCC ~150K pages Fractions roughly the same in other crawls.

Tentative picture WCC 186M pages SCC 56M pages Disconnected debris 34M pages

Breadth-first search (BFS) Start at a page p –get its neighbors; –their neighbors, etc. Get profile of the number of pages reached by crawling out of p, as a function of distance d Can do this following links forwards as well as backwards
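A minimal sketch of the reachability profile described above: the function returns, for each distance d, how many pages are first reached at exactly d hops from the start page. Passing the forward adjacency map follows links forwards; passing the backward map follows them backwards. The names `bfs_profile`, `out_links`, and `in_links` are illustrative assumptions.

```python
def bfs_profile(start, neighbors):
    """Return a list whose entry d is the number of pages first reached at distance d."""
    seen = {start}
    frontier = [start]
    profile = []
    while frontier:
        profile.append(len(frontier))
        next_frontier = []
        for page in frontier:
            for nxt in neighbors.get(page, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
    return profile

# Hypothetical toy graph: e -> a, a -> b, a -> c, b -> d, c -> d.
out_links = {"e": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"]}
in_links = {"a": ["e"], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

forward_profile = bfs_profile("a", out_links)   # following links forwards
backward_profile = bfs_profile("a", in_links)   # following links backwards
```

Summing a profile gives the total number of pages reachable from the start page in that direction, which is the quantity the experiments report.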

BFS experiment Start at random pages For each start page, build BFS (reachability vs. distance) profiles going forwards, and backwards

Reachability How many pages are reachable from a random page?

Net of BFS experiments BFS out of a page –either dies quickly (~100 pages reached) –or “explodes” and reaches ~100 million pages somewhat over 50% of starting pages –SCC pages ~25% of total, reach >56M pages Qualitatively the same following in- or out-links

Interpreting BFS expts Need another ~44M pages reachable from SCC –gives us 100M pages reachable from SCC Likewise, need another ~44M pages reachable from SCC going backwards (pages that can reach the SCC) These together don’t account for all 186M pages in giant WCC.
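The bookkeeping behind this slide can be written out as a short calculation. It is a sketch using only the rounded figures quoted in the lecture (in millions of pages); the variable names are illustrative.

```python
# All figures in millions of pages, as quoted in the lecture.
giant_wcc = 186
giant_scc = 56
reached_by_exploding_bfs = 100

# Pages reachable from the SCC but outside it (forward direction).
downstream_of_scc = reached_by_exploding_bfs - giant_scc

# The slide quotes a similar figure for the backward direction.
upstream_of_scc = 44

accounted_for = giant_scc + downstream_of_scc + upstream_of_scc
unaccounted = giant_wcc - accounted_for
```

The three pieces sum to 144M, which leaves roughly 42M pages of the giant WCC that are neither in the SCC nor directly upstream or downstream of it.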

Web anatomy

Distance measurements For random pages p1,p2: Pr[p1 reachable from p2] ~ 1/4 Maximum directed distance between 2 SCC nodes: >28 Maximum directed distance between 2 nodes, given there is a path: > 900 Average directed distance between 2 SCC nodes: ~16 Average undirected distance: ~7

Exercise Given the BFS and component size measurements, how can we infer all of the above measurements?

Power laws on the Web Inverse polynomial distributions: $\Pr[k] \sim c/k^{\alpha}$ for constants $c$ and $\alpha > 0$. $\Rightarrow \log \Pr[k] \sim \log c - \alpha \log k$ Thus plotting $\log \Pr[k]$ against $\log k$ should give a straight line (of negative slope $-\alpha$).
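The straight-line property suggests a simple way to estimate the exponent (called `alpha` here): regress log Pr[k] on log k and negate the slope. The sketch below does this with ordinary least squares on synthetic counts; it is a rough estimator for illustration, not the fitting procedure used in the studies cited in the lecture, and the function name and data are assumptions.

```python
import math

def fit_power_law_exponent(counts):
    """Least-squares line through (log k, log Pr[k]); returns (alpha, log_c).

    counts maps a degree k >= 1 to the (positive) number of pages with that degree.
    """
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(n / total) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, intercept

# Synthetic degree counts that follow k^(-2.1) exactly (hypothetical data).
synthetic_counts = {k: 1_000_000 * k ** -2.1 for k in range(1, 101)}
alpha, log_c = fit_power_law_exponent(synthetic_counts)
```

On real degree data the tail is noisy, so a plain log-log regression should be treated as a first look rather than a careful estimate.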

In-degree distribution Slope = -2.1 Probability that a random page has k other pages pointing to it is ~$k^{-2.1}$ (Power law)

Out-degree distribution Slope = -2.7 Probability that a random page points to k other pages is ~$k^{-2.7}$

Connected components Largest WCC = 186M, SCC = 56M Connected component sizes:

Other Web/internet power laws Rates of visits to sites Degrees of nodes in physical network

Resources Broder et al. Graph structure in the Web. WWW9. Albert, R., Jeong, H., & Barabasi, A.L. (1999). Diameter of the world wide web, Nature, 401, M. Faloutsos, P. Faloutsos, and C. Faloutsos, On Power-Law Relationships of the Internet Topology. SIGCOMM '99, pp , Aug

Open Problems P $ Papers/prizes Money Difficulty

Computational bottlenecks If computation were not a limit, could we get better ranking in search results? Better classification? Better clustering? What does “better” mean? P $ PPP

Set intersection in search For query w/AND of two terms, we retrieve and intersect their postings’ sets –Can do work disproportionately large compared to the size of the output. Is there a data structure that does better than this - without keeping a postings entry for each pair of terms? P $$
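To make the "work disproportionate to output" point concrete, here is a minimal sketch of the standard merge-style intersection of two sorted postings lists, instrumented to count comparisons. The function name and the even/odd docID lists are illustrative assumptions.

```python
def intersect_postings(a, b):
    """Merge-intersect two sorted docID lists; returns (matches, comparisons made)."""
    i = j = comparisons = 0
    matches = []
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches, comparisons

# Hypothetical worst case: two long lists that interleave but never overlap.
term_one = list(range(0, 2000, 2))  # even docIDs
term_two = list(range(1, 2000, 2))  # odd docIDs
matches, comparisons = intersect_postings(term_one, term_two)
```

In this example the merge walks essentially the full length of both lists yet returns an empty result, which is exactly the gap between work done and output size that the open problem asks about.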

Text query optimization Recommended query processing order in early lectures - simple heuristics –infamous true/false question from midterm What can we do that’s more sophisticated but still fast in practice? P $$

Practical nearest-neighbor search In high-dimensional vector spaces –moderate preprocessing –fast query processing –nearly accurate nearest neighbors P $ P $$$ P

Classification Saw several schemes (Bayes, SVM) for classifying based on exemplary docs. Can also automatically classify based on persistent queries. How can we combine the two? Issues: –Combined representation of topic. –UI design vs. representation. $ P $$$

Benchmarks Web IR - search/classification benchmarks. Benchmarks for measuring recommendation systems. P $ P PP

Taxonomy construction Metrics of human effort –how much human effort vs. accuracy training by exemplary docs vs. persistent queries UI effects –what is the ideal user environment for building taxonomies What does it take to get to 98+ % accuracy? –Combination of UI, algorithms, best practices P $ PP $

Summarization How do you summarize a set of docs? –Results of clustering/trawling/ … –Visual vs. textual vs. combinations Measuring quality of summarization. P $$ PPP

Corpus analysis Given a corpus, extract its significant themes –organize into a navigation structure Visualization of themes in corpus Power set: all subsets of docs in a corpus –some subsets are interesting - which ones? –how do you organize them for human consumption? P $ P PPP $$

Intranets vs. internet How do intranet structures differ from internet structures? –Influence of policy on content creators. P $ P P

Recurring themes Not an exact science Focus on end-user –who? why? how? Bend rules for progress –ignore performance to start with –think huge - power sets