COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002.

1 COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002

2 Structure of the Web Courtesy: infotoday.com

3 Why is Web Structure Interesting? Design of search engines: Improved crawl strategies Make use of link information to give better ranking, e.g., Google Generate good representative structures for simulations Relationship to other Internet structures Traffic patterns User access patterns

4 Why is Web Structure Interesting? Understanding the sociology of content creation on the web: Six degrees of separation and the small-world phenomenon [Milgram 67] Is every web page just six clicks away from every other web page? Simply because it is out there!

5 Background for the Study Conducted by researchers at AltaVista, Compaq, and IBM Analyzed the connectivity of more than 200M web pages and 1.5B links AltaVista web crawl, May 1999 Start from a large number of sources Follow links in a breadth-first search manner and add pages to the database Structure determined by set of all web pages crawled together with their in-links and out-links
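The breadth-first crawl described on this slide can be sketched on a tiny in-memory link map. This is only an illustration, not the crawler used in the study: `TOY_LINKS` and `bfs_crawl` are made-up names, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical toy "web": page -> list of pages it links to.
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "e": ["a"],  # nothing links to "e", so a crawl seeded at "a" never finds it
}

def bfs_crawl(seeds, links):
    """Return (pages, edges) discovered breadth-first from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            edges.append((page, target))   # record the link itself
            if target not in seen:         # enqueue each page only once
                seen.add(target)
                queue.append(target)
    return seen, edges

pages, edges = bfs_crawl(["a"], TOY_LINKS)
```

Page "e" illustrates why the crawl starts from a large number of sources: a page that nothing links to is only recorded if it is itself a seed.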

6 A More Detailed Look Broder et al, WWW Conference, 1999

7 Bowtie Components SCC (Core) Largest strongly connected component Every page in core can reach every other page in core 56 million IN (Origination) All pages outside the core that can reach the core 44 million

8 Bowtie Components OUT (Termination): All pages outside the core that are reachable from the SCC; 44 million. Other pages: Can neither be reached from the SCC nor reach the SCC. Tendrils: Reachable from IN or can reach OUT. Disconnected: Completely disconnected from the rest. Total of 60 million.
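One way to see how the bowtie components follow from reachability is the sketch below. It assumes we already know one page that lies in the core (`core_page`); the function names and the toy edge list are invented for illustration and are not taken from the study.

```python
from collections import deque

def reachable(start, adj):
    """Set of nodes reachable from start (start included) by following adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bowtie(edges, core_page):
    """Split nodes into SCC, IN, OUT and OTHER relative to core_page's SCC."""
    fwd, rev, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)   # links as given
        rev.setdefault(v, []).append(u)   # links reversed
        nodes.update((u, v))
    down = reachable(core_page, fwd)      # pages the core can reach
    up = reachable(core_page, rev)        # pages that can reach the core
    scc = down & up
    return {
        "SCC": scc,
        "IN": up - scc,
        "OUT": down - scc,
        "OTHER": nodes - down - up,       # tendrils and disconnected pages
    }

# Hypothetical toy graph: a and b form the core.
toy_edges = [("in1", "a"), ("a", "b"), ("b", "a"),
             ("b", "out1"), ("in1", "t1"), ("x", "y")]
parts = bowtie(toy_edges, "a")
```

Here `t1` is a tendril hanging off IN, while `x` and `y` are disconnected from the rest; this sketch lumps both kinds into `OTHER`.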

9 Example Pages: SCC CCS! http://www.ccs.neu.edu Links to many communities and other authoritative sites outside CCS Authoritative sites such as http://www.ccs.neu.edu/home/rraj/Courses/172x/F02/ http://www.northeastern.edu http://www.boston.com http://www.yahoo.com

10 Example Pages: IN Individual home pages on web hosting services: Do not have links from authoritative sources and core pages Have connections to core pages through series of links New or obscure web pages that have not attracted attention

11 Example Pages: OUT Commercial sites Pages point to pages within the site Rarely point to pages outside the site http://www.ibm.com Can be reached from a core site, but does not have links back to core http://www.ccs.neu.edu/home/rraj/papers.html

12 Example Pages: Tendrils Pages not in OUT or CORE with paths to OUT Pages not in IN or CORE with paths from IN A private web page in IN points to a page with links to corporate sites

13 Example Pages: Disconnected Pages Temporary set of pages for working on a project http://www.ccs.neu.edu/home/chenj/rsch/discussions.htm Pages that were linked to the core, OUT, or IN earlier, with the links now removed

14 How was the Study Done? Crawlers started from many initial locations: Covered over 200 M webpages With 1.5 billion links among these pages 9.6 GB storage after compression Webpage characterized by URL and links to other URLs only Page content not relevant to study A view that extracts essential information relevant to the purpose and ignores inessential details Abstraction!

15 Finding the Structure Got a list of 200 M web pages and 1.5 billion links How do we find out: The distance between two pages? Which pages can be reached from a given page? Which is the most popular webpage? Represent the web as a graph!
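Once the link list is treated as a graph, two of these questions become short computations. The sketch below uses a made-up five-link list and assumes that "most popular" means "has the most in-links", which is only one possible reading.

```python
from collections import Counter, defaultdict

# Hypothetical link list: (source page, destination page).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

# Group destinations by source so we can follow links quickly.
out_links = defaultdict(list)
for src, dst in links:
    out_links[src].append(dst)

def reachable_from(page):
    """Pages reachable from page by following links (page itself included)."""
    seen, stack = {page}, [page]
    while stack:
        u = stack.pop()
        for v in out_links.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# Assumption: popularity is measured by in-degree (number of incoming links).
in_degree = Counter(dst for _, dst in links)
most_linked = in_degree.most_common(1)[0]   # (page, number of in-links)
```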

16 CCS Web as a Graph [Figure: link graph of pages around http://www.ccs.neu.edu; node labels: Chapters, Directory, US, CCS, Contact Us, IS, People, Help, Research, NU Orgns., Alumni, NU ACM]

17 Directed Graphs A directed graph is a pair G = (V,E) V: Set of vertices (nodes) E: Set of directed edges (links), each going from one vertex to another [Figure: example graph with nodes NU ACM, Chapters, Directory, US] V = {NUACM, Chapters, Directory, US} E = {(NUACM,Chapters), (Chapters, Directory), (Directory, US), (US, NUACM)}
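The slide's example translates almost directly into code. In this small sketch, `V` and `E` are copied from the slide, while `successors` and `path` are helper names introduced here for illustration.

```python
# Vertex and edge sets copied from the slide's example.
V = {"NUACM", "Chapters", "Directory", "US"}
E = {
    ("NUACM", "Chapters"),
    ("Chapters", "Directory"),
    ("Directory", "US"),
    ("US", "NUACM"),
}

# Every edge must start and end at a vertex of V.
assert all(u in V and v in V for u, v in E)

# successors[u] = pages that u links to (our own helper, not from the slide).
successors = {u: sorted(v for (x, v) in E if x == u) for u in V}

# Follow links starting at NUACM; in this example each page has one out-link.
path = ["NUACM"]
for _ in range(len(E)):
    path.append(successors[path[-1]][0])
```

After four steps the walk is back at NUACM, so the four edges form a directed cycle.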

18 Graph Terms In-degree: Number of edges into a node Out-degree: Number of edges out of a node Suppose a directed graph has n nodes and m edges: Average in-degree? Average out-degree?
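Both questions have the same answer: every edge leaves exactly one node and enters exactly one node, so the in-degrees and the out-degrees each sum to m, and both averages equal m/n. The sketch below checks this on a made-up graph with n = 4 and m = 5.

```python
from collections import Counter

# Hypothetical toy graph: 4 nodes, 5 directed edges.
nodes = {1, 2, 3, 4}
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

out_degree = Counter(u for u, _ in edges)   # edges leaving each node
in_degree = Counter(v for _, v in edges)    # edges entering each node

n, m = len(nodes), len(edges)
avg_out = sum(out_degree[x] for x in nodes) / n
avg_in = sum(in_degree[x] for x in nodes) / n
```

Individual nodes can still differ widely (node 3 has in-degree 3, node 4 has in-degree 0) even though the two averages agree.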

19 More Graph Terms Strongly connected graph: There is a directed path from every node to every other node Distance from node u to v: Number of links on the shortest path from u to v Diameter: Maximum distance between any two nodes Finite for strongly connected graphs only
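Distance and diameter can both be computed with breadth-first search. The sketch below is a straightforward version suitable only for small graphs; the function names are our own, and the example reuses the four-page cycle from the directed-graph slide.

```python
from collections import deque
import math

def distances_from(source, adj):
    """Shortest-path distance (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest distance between any ordered pair; infinite if some pair has no path."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    worst = 0
    for u in nodes:
        dist = distances_from(u, adj)
        if len(dist) < len(nodes):      # some node is unreachable from u
            return math.inf
        worst = max(worst, max(dist.values()))
    return worst

cycle = {"NUACM": ["Chapters"], "Chapters": ["Directory"],
         "Directory": ["US"], "US": ["NUACM"]}
broken = {"NUACM": ["Chapters"], "Chapters": ["Directory"],
          "Directory": ["US"]}          # same graph without the US -> NUACM link
```

Distance is not symmetric in a directed graph: in `cycle`, US is three links from NUACM, but NUACM is one link from US. Removing a single link, as in `broken`, makes the graph no longer strongly connected, so its diameter is infinite.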

20 Undirected Graphs Edges are undirected: (u,v) equivalent to (v,u) Degree of a node: Number of edges incident to it Connected: If there is a path between any two nodes [Figure: example graph on nodes 1, 2, 3, 4]
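A minimal sketch of these definitions follows. The slide's figure shows nodes 1 to 4, but its edges are not recoverable from the transcript, so the edge lists below are assumptions chosen for illustration.

```python
def build_undirected(edges):
    """Store each undirected edge in both directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_connected(adj):
    """True if every node can be reached from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

square = build_undirected([(1, 2), (2, 3), (3, 4), (4, 1)])   # assumed edges
two_pairs = build_undirected([(1, 2), (3, 4)])                # assumed edges
degrees = {node: len(neighbors) for node, neighbors in square.items()}
```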

21 Graphs: Useful Representation Tools Social networks Transportation networks Control flow of a program Flowchart of a manufacturing process Computer networks Bibliography citations …

22 Structure of the Web Broder et al, WWW Conference, 1999

23 Structural Properties of the Web Diameter of the SCC is at least 28 Pick a random source page u and a random destination page v: How many links is v away from u? 75% of the time, there is no path! The other 25% of the time, average distance is 16 Interesting distribution of degrees and sizes of connected components: power laws
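The two statistics on this slide (how often a path exists, and the average distance when it does) can be illustrated on a toy graph. The study sampled random pairs because the web graph is far too large to examine exhaustively; this sketch instead enumerates every ordered pair of a made-up four-page graph, so the numbers are exact but say nothing about the real web.

```python
from collections import deque
from itertools import permutations

# Hypothetical toy graph: in1 -> core {a, b} -> out1.
edges = [("in1", "a"), ("a", "b"), ("b", "a"), ("b", "out1")]
adj, nodes = {}, set()
for u, v in edges:
    adj.setdefault(u, []).append(v)
    nodes.update((u, v))

def bfs_distances(source):
    """Distance in links from source to each page it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

all_dist = {u: bfs_distances(u) for u in nodes}
pairs = list(permutations(nodes, 2))        # every ordered (source, destination)
path_lengths = [all_dist[u][v] for u, v in pairs if v in all_dist[u]]

fraction_with_path = len(path_lengths) / len(pairs)
average_distance = sum(path_lengths) / len(path_lengths)
```

Even in this tiny bowtie, 5 of the 12 ordered pairs have no path, because nothing leads back out of `out1` or into `in1`.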

24 Representations of a Graph Adjacency matrix
1 1 0 1
0 1 0 1
1 0 1 0
0 0 0 1
[Figure: the corresponding graph on nodes 1, 2, 3, 4]

25 Representations of a Graph Adjacency list
1: 1, 2, 4
2: 2, 4
3: 1, 3
4: 4
[Figure: the corresponding graph on nodes 1, 2, 3, 4]
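The two representations carry the same information and can be converted into one another. The sketch below starts from the matrix rows shown on the adjacency-matrix slide (1101, 0101, 1010, 0001) and assumes that a 1 in row i, column j means an edge from node i to node j; the function names are our own.

```python
# Rows of the adjacency matrix from the slide, for nodes 1..4.
matrix = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

def matrix_to_adjacency_list(matrix):
    """Map each node (numbered from 1) to the list of nodes it points to."""
    return {
        i + 1: [j + 1 for j, bit in enumerate(row) if bit]
        for i, row in enumerate(matrix)
    }

def has_edge_matrix(matrix, u, v):
    """Constant-time lookup: one cell of the matrix."""
    return matrix[u - 1][v - 1] == 1

def has_edge_list(adj_list, u, v):
    """Lookup by scanning u's list of out-neighbors."""
    return v in adj_list[u]

adj_list = matrix_to_adjacency_list(matrix)
```

The trade-off is space against lookup speed: a matrix needs n × n cells regardless of how many links exist, while a list needs one entry per node plus one per edge. For a graph with 200 M pages and 1.5 billion links, that is roughly 4 × 10^16 cells versus under 2 billion entries, which is why sparse graphs like the web are usually stored as lists.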

26 References Structure of the Web: Broder et al, WWW Conference 1999 Graphs: Books on elementary discrete math Graph Theory, by F. Harary Graph algorithms: Algorithms and data structures books and courses

