1 Graphs & more on Web search Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002
Announcements
3 Homework 5 Homework Assignment #5 will be out on Friday. Must do some reading in order to complete it. Must take a progress quiz. Get started today and as usual, think b4 u hack!
4 Reading About graphs: Chapter 14 About Web search: /srchad.htm An HTML tutorial:
Introduction to Graphs
6 Graphs — an overview Vertices (aka nodes)
7 Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges
8 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges
9 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges Weights
10 Terminology Graph G = (V,E) Set V of vertices (nodes) Set E of edges Elements of E are pairs (v,w) where v,w ∈ V. An edge (v,v) is a self-loop. (Usually assume no self-loops.) Weighted graph Elements of E are (v,w,x) where x is a weight.
11 Terminology, cont’d Directed graph (digraph) The edge pairs are ordered Every edge has a specified direction The Web is a directed graph Undirected graph The edge pairs are unordered E is a symmetric relation: (v,w) ∈ E implies (w,v) ∈ E In an undirected graph (v,w) and (w,v) are usually treated as though they are the same edge
12 Directed Graph (digraph) PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges
13 Undirected Graph PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges
14 Terminology, cont’d v and w are adjacent (neighbors) if (v,w) ∈ E or (w,v) ∈ E d(v) (degree of v) = # of neighbors of v (for undirected graphs) d+(v) (out-degree of v) = # of edges (v,w) ∈ E d-(v) (in-degree of v) = # of edges (w,v) ∈ E
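The degree definitions above can be sketched in a few lines of Python. This is only an illustration: the edge list is hypothetical (it reuses the airport codes from the slides, but the routes are made up), and the function names are assumptions, not part of the course material.

```python
from collections import defaultdict

# Hypothetical directed edges (v, w); the routes are invented for illustration.
edges = [("PIT", "BOS"), ("PIT", "JFK"), ("JFK", "BOS"), ("LAX", "SFO")]

def degrees(edge_list):
    """Return (out_degree, in_degree) dictionaries for a directed graph."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for v, w in edge_list:
        out_deg[v] += 1  # edge (v, w) leaves v
        in_deg[w] += 1   # edge (v, w) enters w
    return dict(out_deg), dict(in_deg)

def undirected_degree(edge_list):
    """Return d(v), treating every pair as an undirected edge."""
    neighbors = defaultdict(set)
    for v, w in edge_list:
        neighbors[v].add(w)
        neighbors[w].add(v)
    return {v: len(ns) for v, ns in neighbors.items()}

out_deg, in_deg = degrees(edges)
deg = undirected_degree(edges)
print(out_deg, in_deg, deg)
```

Vertices with no outgoing (or incoming) edges simply do not appear in the corresponding dictionary, so a lookup with `.get(v, 0)` is the safe way to read a degree.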
15 Terminology, cont’d Path a list of nodes (v[1], v[2],...,v[n]) s.t. (v[i],v[i+1]) ∈ E for all 0 < i < n The length of the above path is n-1 Cycle a path that begins and ends with the same node Cyclic graph - contains at least one cycle Acyclic graph - no cycles
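As a small sketch of the path and cycle definitions, the functions below check a node list against a directed edge set. The edge set and the helper names are hypothetical and chosen only for illustration.

```python
def is_path(edges, nodes):
    """True if every consecutive pair of nodes is an edge in the edge set."""
    return all((nodes[i], nodes[i + 1]) in edges for i in range(len(nodes) - 1))

def path_length(nodes):
    """A path listing n nodes has length n - 1 (its number of edges)."""
    return len(nodes) - 1

def is_cycle(edges, nodes):
    """A cycle is a path that begins and ends with the same node."""
    return len(nodes) > 1 and nodes[0] == nodes[-1] and is_path(edges, nodes)

# Hypothetical directed edges forming one triangle.
edges = {("PIT", "JFK"), ("JFK", "BOS"), ("BOS", "PIT")}

print(is_path(edges, ["PIT", "JFK", "BOS"]))          # a path of length 2
print(is_cycle(edges, ["PIT", "JFK", "BOS", "PIT"]))  # returns to its start
```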
16 Elements of a Graph PIT BOS JFK DTW LAX SFO
17 Terminology, cont’d Subgraph of a graph G a subset of V with the corresponding edges from E. Connected graph a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other. Connected component of a graph G a maximal connected subgraph of G (one that cannot be extended by adding another node of G).
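One common way to find connected components of an undirected graph is a breadth-first search from each not-yet-visited vertex. The sketch below assumes a vertex list and an undirected edge list; the sample data is hypothetical.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return a list of vertex sets, one per connected component."""
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)  # undirected: record the edge in both directions

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

# Hypothetical example: DTW has no edges, so it is a component by itself.
vertices = ["PIT", "BOS", "JFK", "DTW", "LAX", "SFO"]
edges = [("PIT", "BOS"), ("BOS", "JFK"), ("LAX", "SFO")]
components = connected_components(vertices, edges)
print(components)
```

A graph is connected exactly when this function returns a single component.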
18 Elements of a Graph, cont’d PIT BOS JFK DTW LAX SFO
19 Terminology, cont’d Unrooted (undirected) tree an acyclic connected undirected graph Theorem: in any unrooted tree T=(V,E), |V|=|E|+1. Proof: by induction on |V| Base case |V|=1 (|E|=0) Show there exists a node of degree one Remove that node and apply induction hypothesis
20 Example of an unrooted tree PIT BOS JFK DTW LAX SFO
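The theorem suggests a simple test, sketched below under the assumption that the graph is given as a vertex list and an undirected edge list: a graph is an unrooted tree exactly when it is connected and |V| = |E| + 1. The function name and sample data are hypothetical.

```python
def is_unrooted_tree(vertices, edges):
    """True if the undirected graph is connected and has |V| = |E| + 1."""
    if len(vertices) != len(edges) + 1:
        return False
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)
    # Depth-first search from an arbitrary vertex to test connectivity.
    start = vertices[0]
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)

# Hypothetical star-shaped tree centered on PIT: 6 vertices, 5 edges.
vertices = ["PIT", "BOS", "JFK", "DTW", "LAX", "SFO"]
tree_edges = [("PIT", v) for v in vertices[1:]]
print(is_unrooted_tree(vertices, tree_edges))
```

The edge count alone is not enough: a triangle plus one isolated vertex also has |V| = |E| + 1, which is why the connectivity check is needed.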
21 Quiz Break
22 So, is this a connected graph? Cyclic or Acyclic? Directed or Undirected?
23 Directed graph (unconnected) Cyclic or Acyclic?
Representing Graphs
25 Representing graphs Adjacency matrix 2-dimensional array For each edge (u,v), set A[u][v] to true; otherwise false [Figure: example adjacency matrix] Adjacency lists For each vertex, keep a list of adjacent vertices
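Both representations can be built from the same edge list. The sketch below assumes vertices are numbered 0 to n-1; the `directed` flag, the function names, and the sample edges are illustrative assumptions.

```python
def adjacency_matrix(n, edges, directed=False):
    """n x n boolean matrix with A[u][v] True when edge (u, v) exists."""
    A = [[False] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = True
        if not directed:
            A[v][u] = True  # an undirected edge is stored symmetrically
    return A

def adjacency_lists(n, edges, directed=False):
    """For each vertex, a list of its adjacent vertices."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

# Hypothetical 5-vertex undirected graph.
edges = [(0, 1), (0, 2), (1, 3), (3, 4)]
A = adjacency_matrix(5, edges)
adj = adjacency_lists(5, edges)
print(A[0][1], adj[3])
```

Checking whether an edge exists is a single array lookup in the matrix, while the list version has to scan `adj[u]`. In exchange, the lists store only the edges that are present.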
26 Choosing a representation Size of V relative to size of E is a primary factor. Dense: |E|/|V| is large Sparse: |E|/|V| is small Adjacency matrix is expensive in terms of space if the graph is sparse: it needs O(|V|^2) space, versus O(|E|+|V|) for adjacency lists. Adjacency list is expensive in terms of checking edges if the graph is dense.
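27 Size of a Graph How many undirected graphs for a set of n given vertices? Answer: 2^(n(n-1)/2), since each of the n(n-1)/2 possible edges is either present or absent (no self-loops). How many edges in an undirected graph with n vertices? Minimum: 0 Maximum: n(n-1)/2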
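For small n the counts can be checked by brute force. Assuming no self-loops, an undirected graph on n labeled vertices has at most n(n-1)/2 edges, and every subset of those possible edges is a distinct graph, giving 2^(n(n-1)/2) graphs. The sketch below enumerates the subsets directly; the function names are illustrative.

```python
from itertools import combinations

def max_edges(n):
    """Number of unordered vertex pairs: n choose 2."""
    return n * (n - 1) // 2

def count_graphs_brute_force(n):
    """Enumerate every edge subset on n labeled vertices and count them."""
    pairs = list(combinations(range(n), 2))  # all possible undirected edges
    graphs = set()
    for k in range(len(pairs) + 1):
        for edge_set in combinations(pairs, k):
            graphs.add(frozenset(edge_set))
    return len(graphs)

for n in range(1, 5):
    print(n, max_edges(n), count_graphs_brute_force(n), 2 ** max_edges(n))
```

The enumeration grows doubly fast, so it is only practical for very small n.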
Graphs are Everywhere
29 Graphs as models The Internet Communication pathways DNS hierarchy The WWW The physical world Road topology and maps Airline routes and fares Electrical circuits Job and manufacturing scheduling
30 Graphs as models Physical objects are often modeled by meshes, which are a particular kind of graph structure. By Jonathan Shewchuk
31 More graph models See also and NASA CFD labs By Paul Heckbert and David Garland
32 Structure of the Internet Europe Japan Backbone 1 Backbone 2 Backbone 3 Backbone 4, 5, N Australia Regional A Regional B NAP SOURCE: CISCO SYSTEMS MAPS UUNET MAP
33 Relationship graphs Graphs are also used to model relationships among entities. Scheduling and resource constraints. Inheritance hierarchies
34 Where are we right now?
The Web Graph
36 Web Graph Documents written in HTML HTML (HyperText Markup Language) TAGS: ,, , (anchor, link)
37 A simple HTML example A Simple HTML Example Carnegie Mellon University
38 Web Graph A directed graph where : V = (all web pages) E = (all HTML-defined links from one web page to another)
39 Web Graph Web Pages are nodes (vertices) HTML references are links (edges)
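To make the "pages are nodes, links are edges" view concrete, the sketch below pulls the anchor links out of one page and turns them into directed edges. It uses Python's standard `html.parser` and `urllib.parse` modules. The page text, the `example.org` URL, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor (<a>) tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def out_edges(page_url, html_text):
    """Return the directed edges (page_url, target) defined by the page."""
    parser = LinkExtractor(page_url)
    parser.feed(html_text)
    return [(page_url, target) for target in parser.links]

# Hypothetical page content for illustration.
page = ('<html><body><a href="http://www.cmu.edu/">Carnegie Mellon University</a> '
        '<a href="about.html">About</a></body></html>')
edges = out_edges("http://example.org/index.html", page)
print(edges)
```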
40 Is the Web Graph connected? Sparse, unconnected graph AUTHORITIES web pages containing a “reasonable” amount of relevant information about a specific topic HUBS web pages that point (link) to many pages containing relevant information about a given topic
41 Finding Hubs & Authorities Nice iterative algorithm by Jon Kleinberg HUB: Avrim’s Machine Learning page AUTHORITY: Extra credit opportunity for homework 5
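A rough sketch of the hubs-and-authorities iteration, as it is commonly described: a page's authority score is the sum of the hub scores of pages linking to it, its hub score is the sum of the authority scores of pages it links to, and both are renormalized each round. The link data and names below are hypothetical, and the details may differ from the version presented in lecture.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores over outgoing links.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        auth = {n: x / a_norm for n, x in new_auth.items()}
        hub = {n: x / h_norm for n, x in new_hub.items()}
    return hub, auth

# Hypothetical link structure: two hub pages and one other page.
links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "other": ["authA"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```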
Graphs : Application Search Engines
43 Search Engines
44 What are they? Tools for finding information on the Web Problem: “hidden” databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.) Search engine A machine-constructed index (usually by keyword) So many search engines, we need search engines to find them. Searchenginecollosus.com
45 Did you know? Vivisimo was developed here at CMU Developed by Prof. Raul Valdes-Perez Developed in 2000
46 SE Architecture Spider Crawls the web to find pages. Follows hyperlinks. Never stops Indexer Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon) Retriever Query interface Database lookup to find hits 1 billion documents 1 TB RAM, many terabytes of disk Ranking
47 A look at 10,000 servers (WOW!) Web site traffic grows over 20% per month Spiders over 2 Billion URLs Supports 28 language searches Over 100 million searches per day “Even CMU uses it!”
48 Google’s server farm
49 Web Crawlers Start with an initial page P 0. Find URLs on P 0 and add them to a queue When done with P 0, pass it to an indexing program, get a page P 1 from the queue and repeat Can be specialized (e.g. only look for addresses) Issues Which page to look at next? (Special subjects, recency) How deep within a site do you go (depth search)? How frequently to visit pages?
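The crawl loop above can be sketched against a toy in-memory "web" so that it runs without network access. The `seen` set and the `max_pages` limit are additions not on the slide (without them, links that point back to earlier pages would keep the loop running forever). The function names and toy data are hypothetical.

```python
from collections import deque

def crawl(start, get_links, index_page, max_pages=100):
    """Breadth-first crawl: take a page from the queue, enqueue its
    unseen links, then hand the page to the indexer."""
    queue = deque([start])
    seen = {start}
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in get_links(page):
            if url not in seen:  # avoid revisiting pages
                seen.add(url)
                queue.append(url)
        index_page(page)
        visited.append(page)
    return visited

# Hypothetical toy web: page name -> pages it links to.
toy_web = {"P0": ["P1", "P2"], "P1": ["P2", "P3"], "P2": ["P0"], "P3": []}
indexed = []
order = crawl("P0", lambda page: toy_web.get(page, []), indexed.append)
print(order)
```

Swapping the FIFO queue for a priority queue is one way to address the "which page to look at next?" issue, for example by ranking pages on recency or subject.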
50 So, why Spider the Web? Refresh Collection by deleting dead links OK if index is slightly smaller Done every 1-2 weeks in best engines Finding new sites Respider the entire web Done every 2-4 weeks in best engines
51 Cost of Spidering Spider can (and does) run in parallel on hundreds of servers Very high network connectivity (e.g. T3 line) Servers can migrate from spidering to query processing depending on time-of-day load Running a full web spider takes days even with hundreds of dedicated servers
52 Indexing Arrangement of data (data structure) to permit fast searching Which list is easier to search? Unsorted: sow fox pig eel yak hen ant cat dog hog Sorted: ant cat dog eel fox hen hog pig sow yak Sorting helps. Why? Permits binary search: about log2(n) probes into the list; log2(1 billion) ~ 30 Permits interpolation search: about log2(log2(n)) probes; log2(log2(1 billion)) ~ 5
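Both search methods can be sketched with a probe counter to make the comparison concrete. Interpolation search needs numeric keys, and its log2(log2(n)) behavior is an average that assumes roughly uniformly distributed keys. The function names and the numeric test data are illustrative assumptions.

```python
import math

def binary_search(sorted_items, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

def interpolation_search(sorted_numbers, target):
    """Like binary search, but guesses the position from the key's value."""
    lo, hi = 0, len(sorted_numbers) - 1
    probes = 0
    while lo <= hi and sorted_numbers[lo] <= target <= sorted_numbers[hi]:
        span = sorted_numbers[hi] - sorted_numbers[lo]
        if span == 0:
            pos = lo
        else:
            pos = lo + (target - sorted_numbers[lo]) * (hi - lo) // span
        probes += 1
        if sorted_numbers[pos] == target:
            return pos, probes
        if sorted_numbers[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
numbers = list(range(0, 1000, 10))  # evenly spaced keys
print(binary_search(words, "pig"), interpolation_search(numbers, 370))

# Probe estimates for one billion entries.
binary_probes = math.ceil(math.log2(10 ** 9))
interpolation_probes = round(math.log2(math.log2(10 ** 9)))
print(binary_probes, interpolation_probes)
```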
53 Inverted Files A file is a list of words by position - First entry is the word in position 1 (first word) - Entry 4562 is the word in position 4562 (4562nd word) - Last entry is the last word An inverted file is a list of positions by word! [Figure: POS FILE and INVERTED FILE] Inverted file entries: a (1, 4, 40); entry (11, 20, 31); file (2, 38); list (5, 41); position (9, 16, 26); positions (44); word (14, 19, 24, 29, 35, 45); words (7); 4562 (21, 27)
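Inverting a file takes one pass over the words. The sketch below numbers positions from 1, as the slide does, and uses the first nine words of the slide's own text as input. The function name is an assumption.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the list of positions (1-based) where it occurs."""
    inverted = defaultdict(list)
    for position, word in enumerate(words, start=1):
        inverted[word.lower()].append(position)
    return dict(inverted)

text = "a file is a list of words by position"
inverted = invert(text.split())
for word in sorted(inverted):
    print(word, inverted[word])
```

Looking up a word is now a dictionary access that returns every position at once, instead of a scan through the whole file.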
54 Inverted Files for Multiple Documents DOCID OCCUR POS 1 POS “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document LEXICON WORD INDEX
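For several documents, each lexicon word can point to a list of (document id, occurrence count, positions) entries. The sketch below is one possible layout, not necessarily the one pictured on the slide. The document ids and texts are hypothetical.

```python
from collections import defaultdict

def build_index(documents):
    """documents maps doc id -> text. Returns word -> {doc id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index

def postings(index, word):
    """Return (doc id, occurrence count, positions) entries, sorted by doc id."""
    by_doc = index.get(word.lower(), {})
    return [(doc_id, len(pos), pos) for doc_id, pos in sorted(by_doc.items())]

# Hypothetical documents for illustration.
documents = {
    34: "jezebel met ahab and jezebel left",
    44: "ahab saw jezebel",
}
index = build_index(documents)
print(postings(index, "jezebel"))
```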
55 Ranking (Scoring) Hits Hits must be presented in some order What order? Relevance, recency, popularity, reliability, alphabetic? Some ranking methods Presence of keywords in title of document Closeness of keywords to start of document Frequency of keyword in document Link popularity (how many pages point to this one)
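As a toy illustration of how the ranking signals listed above might be combined, the sketch below adds up a title hit, keyword frequency, closeness to the start of the document, and an in-link count. The weights (3.0, 2.0, 1.0, 0.1) are arbitrary assumptions picked for the example. Real engines keep their formulas secret.

```python
def score(title, text, keyword, in_links=0):
    """Combine four simple ranking signals into one number."""
    kw = keyword.lower()
    words = text.lower().split()
    # Presence of the keyword in the title.
    title_hit = 1.0 if kw in title.lower().split() else 0.0
    # Frequency of the keyword in the document.
    frequency = words.count(kw) / len(words) if words else 0.0
    # Closeness of the first occurrence to the start of the document.
    closeness = 1.0 / (1 + words.index(kw)) if kw in words else 0.0
    # Weights are arbitrary and only for illustration.
    return 3.0 * title_hit + 2.0 * frequency + 1.0 * closeness + 0.1 * in_links

# Hypothetical documents.
relevant = score("Graph theory", "graph algorithms and graph search", "graph")
passing = score("Cooking", "recipes mention a graph once", "graph")
print(relevant, passing)
```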
56 Spamdexing & Link Popularity Spamdexing means influencing retrieval ranking by altering a web page. (Puts “spam” in the index) Link popularity is used for ranking Many measures Number of links in (In-links) Weighted number of links in (by weight of referring page)
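The two link-popularity measures named above can be sketched as follows. The weights assigned to referring pages are hypothetical inputs; how an engine would actually derive them is not specified here.

```python
def in_link_counts(links):
    """Number of links pointing at each page."""
    counts = {page: 0 for page in links}
    for src, targets in links.items():
        for dst in targets:
            counts[dst] = counts.get(dst, 0) + 1
    return counts

def weighted_in_links(links, weight):
    """Sum of the referring pages' weights for each page."""
    scores = {page: 0.0 for page in links}
    for src, targets in links.items():
        for dst in targets:
            scores[dst] = scores.get(dst, 0.0) + weight.get(src, 0.0)
    return scores

# Hypothetical web: D is pointed to only by three low-weight pages.
links = {
    "A": ["C"], "B": ["C"], "C": ["A"],
    "S1": ["D"], "S2": ["D"], "S3": ["D"], "D": [],
}
weight = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "S1": 0.1, "S2": 0.1, "S3": 0.1}

counts = in_link_counts(links)
weighted = weighted_in_links(links, weight)
print(counts["C"], counts["D"], weighted["C"], weighted["D"])
```

In this example D wins on raw in-link count but loses once referring pages are weighted, which is one reason weighting is harder to manipulate than simple counting.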
57 Search Engine Sizes [Figure legend] AV = AltaVista, EX = Excite, FAST, GG = Google, INK = Inktomi, NL = Northern Light SOURCE: SEARCHENGINEWATCH.COM
58 Historical Notes WebCrawler: first documented spider Lycos: first large-scale spider Top-honors for most web pages spidered: First Lycos, then AltaVista, then Google...
59 Overview Engines are a critical Web resource Very sophisticated, high technology, but secret Most spidering re-traverses stable web graph They don’t spider the Web completely Spamdexing is a problem New paradigms needed as Web grows What about images, music, video? Google’s image search engine Napster