1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004.


1 1 Introduction to Graphs 15-211 Fundamental Data Structures and Algorithms Aleks Nanevski March 16, 2004

2 2 Announcements  Homework Assignment #6 out today  Building a Web Search Engine  Web Crawler (due March 25)  Web Reader (due March 25)  Web Search (due April 5)  Start early and ask questions

3 3 Graphs — an overview Vertices (aka nodes)

4 4 Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes)

5 5 Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

6 6 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

7 7 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges 1987 2273 344 2145 2462 618 211 318 190 Weights

8 8 Examples of Graphs  Questions  What is the fastest way to get from Pittsburgh to St Louis?  What is the cheapest way to get from Pittsburgh to St Louis? Vanguard Airlines Route Map

9 9 Web Graph  Web Pages are nodes (vertices)  HTML references are links (edges)

10 10 Graphs as models  Physical objects are often modeled by meshes, which are a particular kind of graph structure. By Jonathan Shewchuk

11 11 Structure of the Internet Europe Japan Backbone 1 Backbone 2 Backbone 3 Backbone 4, 5, N Australia Regional A Regional B NAP SOURCE: CISCO SYSTEMS

12 12 Relationship graphs  Graphs are also used to model relationships among entities.  Scheduling and resource constraints.  Inheritance hierarchies.  [Figure: graph of courses 15-113, 15-151, 15-211, 15-212, 15-213, 15-312, 15-411, 15-412, 15-251, 15-451, 15-462]

13 13 Terminology  Graph $G = (V, E)$  Set $V$ of vertices (nodes)  Set $E$ of edges  Elements of $E$ are pairs $(v, w)$ where $v, w \in V$.  An edge $(v, v)$ is a self-loop. (Usually assume no self-loops.)  Weighted graph  Elements of $E$ are triples $(v, w, x)$ where $x$ is a weight, i.e., a cost associated with the edge.

14 14 Terminology, cont’d  Directed graph (digraph)  The edge pairs are ordered  Every edge has a specified direction  Undirected graph  The edge pairs are unordered  $E$ is a symmetric relation  $(v, w) \in E$ implies $(w, v) \in E$  In an undirected graph $(v, w)$ and $(w, v)$ are usually treated as though they are the same edge
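
The terminology on the two slides above can be written down almost literally as sets. This is a minimal sketch, not code from the course: the edge set and the weights are made up for illustration (the slides do not say which airports are joined), and the names `is_symmetric` and `undirected` are hypothetical.

```python
# Vertex set V: the airport codes used on the slides.
V = {"PIT", "BOS", "JFK", "DTW", "LAX", "SFO"}

# Directed graph: E is a set of ordered pairs (v, w). Edges are illustrative only.
directed_E = {("PIT", "BOS"), ("BOS", "JFK"), ("JFK", "PIT")}

def is_symmetric(E):
    """True if (v, w) in E implies (w, v) in E, i.e. E can be read as undirected."""
    return all((w, v) in E for (v, w) in E)

def undirected(E):
    """Symmetric closure: add the reverse of every edge."""
    return E | {(w, v) for (v, w) in E}

# Weighted graph: elements of E are triples (v, w, x), x being the weight.
# The weights here are placeholders, not the ones drawn on the slide.
weighted_E = {("PIT", "BOS", 500), ("BOS", "JFK", 200)}
```

Storing both `(v, w)` and `(w, v)` is one way to honor "treated as the same edge"; another common choice is to store each undirected edge once as an unordered pair.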

15 15 Directed Graph (digraph) PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

16 16 Undirected Graph PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

17 17 Paths  A path is a sequence of edges from one node to another  The length of a path is the number of edges in it.  [Figure: graph on PIT, BOS, JFK, DTW, LAX, SFO]  Question: What are the paths of length 2 in the above graph?

18 18 Paths  A simple path is a path where no vertex is repeated  first and last vertices can be the same Question: an example of a simple path? A non-simple path? PIT BOS JFK DTW LAX SFO
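
The path definitions can be checked mechanically. The sketch below assumes a path is given as a list of vertices (so its length is one less than the number of vertices listed) and uses a small made-up directed edge set, since the edges of the slide's figure are not recoverable from the transcript. All function names are hypothetical.

```python
# Illustrative directed edges; not the edges of the figure on the slide.
E = {("PIT", "BOS"), ("BOS", "JFK"), ("JFK", "PIT"), ("JFK", "LAX")}

def is_path(E, vertices):
    """Every consecutive pair of vertices must be an edge."""
    return all((a, b) in E for a, b in zip(vertices, vertices[1:]))

def path_length(vertices):
    """Number of edges in the path."""
    return len(vertices) - 1

def is_simple(vertices):
    """No vertex repeated, except that first and last may coincide."""
    closed = len(vertices) > 1 and vertices[0] == vertices[-1]
    inner = vertices[:-1] if closed else vertices
    return len(set(inner)) == len(inner)

def paths_of_length_2(E):
    """All (a, b, c) such that (a, b) and (b, c) are both edges."""
    return sorted((a, b, d) for (a, b) in E for (c, d) in E if b == c)
```

For example, `["PIT", "BOS", "JFK", "PIT"]` is a simple path under the slide's definition, while `["PIT", "BOS", "JFK", "PIT", "BOS"]` is a path but not a simple one.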

19 19 Connected graphs  Connected graph  An undirected graph is connected if for all $u, v \in V$ there exists a path from $u$ to $v$.  A directed graph is strongly connected if for all $u, v \in V$ there exists a directed path from $u$ to $v$.  A directed graph is weakly connected if for all $u, v \in V$ there exists a path from $u$ to $v$ once edge directions are ignored.  Complete graph  A graph in which every pair of distinct nodes is joined directly by an edge
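
One way to test these properties is a breadth-first search from each vertex. The sketch below is an illustration under stated assumptions: the graph is a dict mapping every vertex to a list of its successors (every vertex must appear as a key), and weak connectivity is taken in the standard sense of "connected once edge directions are ignored". Function names are hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Set of vertices reachable from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def is_strongly_connected(adj):
    """Every vertex reaches every other vertex along directed edges."""
    vertices = set(adj)
    return all(reachable(adj, v) == vertices for v in vertices)

def is_weakly_connected(adj):
    """Connected after adding the reverse of every edge."""
    if not adj:
        return True
    und = {v: set(ws) for v, ws in adj.items()}
    for v, ws in adj.items():
        for w in ws:
            und[w].add(v)
    start = next(iter(und))
    return reachable(und, start) == set(und)
```

Running one search per vertex costs O(|V| (|V| + |E|)); faster strong-connectivity algorithms exist, but this version mirrors the definition directly.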

20 20 Cycles  A cycle (in a directed graph) is a path that begins and ends in the same vertex.  A directed acyclic graph (DAG) is a directed graph having no cycles.
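
Whether a directed graph is a DAG can be decided with a depth-first search that looks for an edge back into the current search path. This is a common textbook technique, sketched here under the assumption that the graph is a dict mapping every vertex to a list of successors; it is not taken from the slides, and `has_cycle` is a hypothetical name.

```python
def has_cycle(adj):
    """True if the directed graph contains a cycle (so it is not a DAG)."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current DFS path, finished
    color = {v: WHITE for v in adj}

    def visit(v):
        color[v] = GRAY
        for w in adj[v]:
            if color[w] == GRAY:          # edge back into the current path
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in adj)
```

The recursion depth grows with the longest path, so a very large graph would need an explicit stack instead.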

21 21 So, is this a connected graph? Cyclic or Acyclic? Directed or Undirected?

22 22 Directed graph (unconnected) Cyclic or Acyclic?

23 23 Tree is a Graph  A connected graph with no cycles is called a tree.  This is a general definition of a tree  [Figure: two drawings of a tree on vertices A, B, C, D, E, F, G]

24 24 Edges and Degree  Degree  number of edges incident to a vertex  in a directed graph  in-degree(v) - number of edges into vertex v  out-degree(v) - number of edges from vertex v
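
In-degree and out-degree fall out of a single pass over the edge set. A minimal sketch, assuming a directed graph given as a collection of `(v, w)` pairs; the function name and the sample edges are hypothetical.

```python
from collections import Counter

def degrees(vertices, E):
    """Return (in_degree, out_degree) counters for a directed edge collection E."""
    in_deg = Counter({v: 0 for v in vertices})
    out_deg = Counter({v: 0 for v in vertices})
    for v, w in E:
        out_deg[v] += 1   # edge leaves v
        in_deg[w] += 1    # edge enters w
    return in_deg, out_deg

in_deg, out_deg = degrees("abc", [("a", "b"), ("a", "c"), ("b", "c")])
```

Every edge contributes exactly one to some out-degree and one to some in-degree, so both totals equal |E|.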

25 25 Dense Graphs  For most graphs $|E| \le |V|^2$  A graph is dense when most of the possible edges are present  $|E| = \Theta(|V|^2)$  large number of edges (quadratic)  Also $|E| > |V| \log |V|$ can be considered dense

26 26 Sparse Graphs  A sparse graph is a graph with relatively few edges  no clear definition  metric for sparsity |E| < |V| log |V|  Examples of sparse graphs  computer network
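
The $|V| \log |V|$ rule of thumb is easy to evaluate for concrete sizes. The sketch below assumes a base-2 logarithm, which the slides do not specify, and the function name is hypothetical.

```python
import math

def is_sparse(num_vertices, num_edges):
    """Rule of thumb from the slides: sparse when |E| < |V| log |V| (base 2 assumed)."""
    if num_vertices < 2:
        return False  # the logarithm is zero or undefined; the rule is not informative
    return num_edges < num_vertices * math.log2(num_vertices)

# 1000 vertices: the threshold |V| log2 |V| is just under 10,000 edges.
threshold = 1000 * math.log2(1000)
```

With 1000 vertices, a network with 3,000 links counts as sparse under this metric, while one with 500,000 links (half of the $|V|^2$ bound) does not.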

27 Representing Graphs

28 28 Graph operations  Navigation in a graph.  $(v, w) \in E$  $R_v = \{w \mid (v, w) \in E\}$  $D_w = \{v \mid (v, w) \in E\}$  Enumeration of the elements of a graph.  $E$  $V$  Size of a graph  $|V|$  $|E|$

29 29 Representing graphs  Adjacency matrix  2-dimensional array  For each edge (u,v), set A[u][v] to true; otherwise false  Adjacency lists  For each vertex, keep a list of adjacent vertices  Q: How to represent weights?
[Figure: an example graph on vertices 1-7 with its adjacency matrix and adjacency lists. Lists as shown: 1: 3, 4;  2: 4, 5;  3: 6;  4: 3, 6, 7;  5: 4, 7;  6: (none);  7: (none)]
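
Both representations take only a few lines to build. This is a sketch under assumptions: vertices are numbered 1..n, the edge list is illustrative, and weights are handled in one common way (storing the weight next to the neighbor in each list entry), which is a possible answer to the question on the slide rather than the one given in lecture.

```python
def adjacency_matrix(n, edges):
    """A[u][v] is True exactly when (u, v) is an edge. Row and column 0 are unused."""
    A = [[False] * (n + 1) for _ in range(n + 1)]
    for u, v in edges:
        A[u][v] = True
    return A

def adjacency_lists(n, weighted_edges):
    """adj[u] is a list of (v, weight) pairs, one per edge leaving u."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v, x in weighted_edges:
        adj[u].append((v, x))
    return adj

# Illustrative data: a 7-vertex directed graph with placeholder weights.
edges = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 6), (4, 3), (4, 6), (4, 7), (5, 4), (5, 7)]
A = adjacency_matrix(7, edges)
adj = adjacency_lists(7, [(u, v, 1.0) for u, v in edges])
```

In the matrix, a weight can replace the boolean (with a sentinel such as `None` for "no edge"); testing for an edge is then a single lookup, whereas the list form requires scanning `adj[u]`.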

30 30 Choosing a representation  Size of V relative to size of E is a primary factor.  Dense: E/V is large  Sparse: E/V is small  Adjacency matrix is expensive if the graph is sparse. WHY?  Adjacency list is expensive if the graph is dense. WHY?  Dynamic changes to V.  Adjacency matrix is expensive to copy/extend if V is extended. WHY?
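
A rough storage count makes the trade-off concrete. The sketch below counts matrix cells against list entries and ignores per-entry overhead such as pointers; the graph sizes are invented for illustration and the function names are hypothetical.

```python
def matrix_cells(num_vertices):
    """An adjacency matrix always stores |V|^2 cells, however few edges exist."""
    return num_vertices ** 2

def list_entries(num_vertices, num_edges):
    """Adjacency lists store one header per vertex plus one entry per edge."""
    return num_vertices + num_edges

# Hypothetical sparse web graph: one million pages, ten links per page on average.
sparse_matrix = matrix_cells(1_000_000)
sparse_lists = list_entries(1_000_000, 10_000_000)
```

For this sparse graph the matrix needs 10^12 cells against about 1.1 x 10^7 list entries. When the graph is dense, the list form holds roughly as many entries as the matrix has cells, each entry is larger than a one-bit matrix cell, and an edge test has to scan a long list. Adding a vertex to the matrix means allocating a larger array and copying every existing cell.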

31 Graphs : Application Search Engines

32 32 Search Engines

33 33 What are they?  Tools for finding information on the Web  Problem: “hidden” databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.)  Search engine  A machine-constructed index (usually by keyword)  So many search engines, we need search engines to find them.  Searchenginecollosus.com

34 34 What are they?  Search engines: key tools for ecommerce  Buyers and sellers must find each other  How do they work?  How much do they index?  Are they reliable?  How are hits ordered?  Can the order be changed?

35 35 The Process 1. Acquire the collection, i.e. all the documents [Off-line process] 2. Create an inverted index [Off-line process] 3. Match queries to documents [On-line process, the actual retrieval] 4. Present the results to user [On-line process: display, summarize,...]

36 36 SE Architecture  Spider  Crawls the web to find pages. Follows hyperlinks. Never stops  Indexer  Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)  Retriever  Query interface  Database lookup to find hits  Billions of documents  1 TB RAM, many terabytes of disk  Ranking

37 37 Did you know?  The concept of a Web spider was developed by Dr. Fuzzy Mauldin  Implemented in 1994 on the Web  Went into the creation of Lycos  Lycos propelled CMU into the top 5 most successful schools  Commercialization proceeds  Tangible evidence  Newell-Simon Hall  [Photo: Dr. Michael L. (Fuzzy) Mauldin]

38 38 Did you know?  A meta-search engine: send a query to many search engines, then rank and cluster the results. (especially useful if you’re not sure which keywords produce optimal results)  Vivisimo was developed here at CMU  Developed by Prof. Raul Valdes-Perez  Developed in 2000

39 39 A look at Google  > 20,000 servers  Spiders over:  4.28 billion pages  880 million images  Supports 35 non-English languages  Over 200 million queries per day

40 40 Google’s server farm

41 41 Web Spiders  Start with an initial page $P_0$. Find URLs on $P_0$ and add them to a queue  When done with $P_0$, pass it to an indexing program, get a page $P_1$ from the queue and repeat  Can be specialized (e.g. only look for email addresses)  Issues  Which page to look at next? (Special subjects, recency)  Avoid overloading a site  How deep within a site do you go (depth search)?  How frequently to visit pages?
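
The queue-driven loop described above is a breadth-first traversal of the web graph. The sketch below is a toy version that makes no network requests: the "web" is an in-memory dict from a URL to the URLs found on that page, the `max_pages` cap is an added assumption, and "passing a page to the indexer" is modeled by appending it to a list. It is not the homework solution.

```python
from collections import deque

def crawl(web, start, max_pages=100):
    """Visit pages breadth-first from start; return them in the order they were indexed."""
    queue = deque([start])
    seen = {start}            # URLs already queued, so no page is fetched twice
    indexed = []
    while queue and len(indexed) < max_pages:
        page = queue.popleft()
        for url in web.get(page, []):   # "find URLs on the page and add them to a queue"
            if url not in seen:
                seen.add(url)
                queue.append(url)
        indexed.append(page)            # "pass it to an indexing program"
    return indexed

# Hypothetical four-page web.
web = {"P0": ["P1", "P2"], "P1": ["P2", "P0"], "P2": ["P3"]}
order = crawl(web, "P0")
```

Swapping the FIFO queue for a priority queue is one way to address "which page to look at next"; politeness delays and revisit schedules are outside this sketch.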

42 42 So, why Spider the Web? User Perceptions  Most annoying: Engine finds nothing (too small an index, but not an issue since 1997 or so).  Somewhat annoying: Obsolete links  Refresh collection by deleting dead links  OK if index is slightly smaller  Done every 1-2 weeks in best engines  Mildly annoying: Failure to find new site => Re-spider entire web => Done every 2-4 weeks in best engines

43 43 So, why Spider the Web?, cont’d Cost of Spidering  Semi-parallel algorithmic decomposition  Spider can (and does) run on hundreds of servers simultaneously  Very high network connectivity  Servers can migrate from spidering to query processing depending on time-of-day load  Running a full web spider takes days even with hundreds of dedicated servers

44 44 Indexing  Arrangement of data (data structure) to permit fast searching  Which list is easier to search?  Unsorted: sow fox pig eel yak hen ant cat dog hog  Sorted: ant cat dog eel fox hen hog pig sow yak  Sorting helps. Why?  Permits binary search. About $\log_2 n$ probes into list  $\log_2(1\ \text{billion}) \approx 30$  Permits interpolation search. About $\log_2(\log_2 n)$ probes  $\log_2 \log_2(1\ \text{billion}) \approx 5$
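
The probe counts can be checked directly. This sketch implements a plain binary search that also reports how many probes it made, run on the sorted animal list from the slide; the function name is hypothetical. Interpolation search is not implemented here, and its $\log_2 \log_2 n$ figure is an average-case estimate that assumes roughly uniformly distributed keys.

```python
import math

def binary_search(sorted_list, target):
    """Return (index, probes); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_list) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                      # one comparison against the list
        if sorted_list[mid] == target:
            return mid, probes
        if sorted_list[mid] < target:
            lo = mid + 1                 # target can only be in the upper half
        else:
            hi = mid - 1                 # target can only be in the lower half
    return -1, probes

animals = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
found = binary_search(animals, "pig")

probes_for_a_billion = math.log2(1e9)                    # about 29.9
interpolation_estimate = math.log2(math.log2(1e9))       # about 4.9
```

On the unsorted list the only option is a linear scan, which for a billion entries means up to a billion probes instead of about 30.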

45 45 Inverted Files
A file is a list of words by position
- First entry is the word in position 1 (first word)
- Entry 4562 is the word in position 4562 (4562nd word)
- Last entry is the last word
An inverted file is a list of positions by word!
[Figure: the text above shown as a FILE with positions (POS) 1, 10, 20, 30, 36 marked, beside its INVERTED FILE:]
  a          (1, 4, 40)
  entry      (11, 20, 31)
  file       (2, 38)
  list       (5, 41)
  position   (9, 16, 26)
  positions  (44)
  word       (14, 19, 24, 29, 35, 45)
  words      (7)
  4562       (21, 27)
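
Inverting a file is a single pass over its words. A minimal sketch, assuming words are separated by whitespace, positions start at 1 as on the slide, and case is ignored; the function name is hypothetical.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the list of positions (1-based) at which it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        index[word.lower()].append(pos)
    return dict(index)

# First nine words of the slide's own text.
inverted = invert("A file is a list of words by position".split())
```

Here `inverted["a"]` is `[1, 4]` and `inverted["words"]` is `[7]`, matching the start of the corresponding rows in the slide's inverted file.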

46 46 Inverted Files for Multiple Documents  [Figure: a LEXICON pointing into a WORD INDEX; each index record holds DOCID, OCCUR, POS 1, POS 2, ...]  “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56...
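
With several documents, each word's entry records which documents contain it and where. The sketch below is one possible layout, not the structure required by the homework: a dict plays the role of the lexicon, each posting maps a document ID to a position list, and the occurrence count (OCCUR) is simply the length of that list. The sample documents are invented.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps a document ID to its text. Returns word -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def occurrences(index, word):
    """How many times the word occurs in each document that contains it."""
    return {doc_id: len(positions) for doc_id, positions in index.get(word, {}).items()}

# Hypothetical documents; IDs borrowed from the slide, contents made up.
docs = {34: "jezebel met jezebel", 44: "no match here", 56: "call jezebel"}
index = build_index(docs)
```

A query for one word is then a single dictionary lookup, and a multi-word query can intersect the document IDs of the individual postings.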

47 47 Ranking (Scoring) Hits  Hits must be presented in some order  What order?  Relevance, recency, popularity, reliability?  Some ranking methods  Presence of keywords in title of document  Closeness of keywords to start of document  Frequency of keyword in document  Link popularity (how many pages point to this one)

48 48 Ranking (Scoring) Hits, cont’d  Can the user control? Can the page owner control?  Can you find out what order is used?  Spamdexing: influencing retrieval ranking by altering a web page. (Puts “spam” in the index)

49 49 Link Popularity  How many pages link to this page?  on the whole Web  in our database?  http://www.linkpopularity.com  Link popularity is used for ranking  Many measures  Number of links in (In-links)  Weighted number of links in (by weight of referring page)
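
Both measures are in-degree computations on the web graph. The sketch below assumes the graph is a dict from a page to the pages it links to, counts at most one link per referring page, and ignores self-links; those choices, the page weights, and the function names are illustrative assumptions, and this is not an implementation of any particular search engine's ranking.

```python
from collections import Counter

def in_link_counts(web):
    """Number of distinct other pages that link to each page."""
    counts = Counter({page: 0 for page in web})
    for page, links in web.items():
        for target in set(links):        # duplicate links on one page count once
            if target != page:           # a page linking to itself does not count
                counts[target] += 1
    return counts

def weighted_in_links(web, weight):
    """Sum of the weights of the referring pages, per target page."""
    score = {page: 0.0 for page in web}
    for page, links in web.items():
        for target in set(links):
            if target != page:
                score[target] = score.get(target, 0.0) + weight.get(page, 0.0)
    return score

# Hypothetical four-page web and page weights.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "c"]}
counts = in_link_counts(web)
scores = weighted_in_links(web, {"a": 1.0, "b": 1.0, "c": 3.0, "d": 0.5})
```

Page `c` has the most in-links (three), yet page `a` gets the highest weighted score because its single in-link comes from the heavily weighted page `c`.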

50 50 Search Engine Sizes  [Chart: billions of textual documents indexed.]  Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma  SOURCE: SEARCHENGINEWATCH.COM

51 51 Search Engine Sizes, cont’d  [Chart: billions of textual documents indexed (as of Sept 2, 2003)]  Legend: ATW = AllTheWeb, AV = Altavista, GG = Google, INK = Inktomi, TMA = Teoma  SOURCE: SEARCHENGINEWATCH.COM

52 52 Current Status of Web Spiders Historical Notes  WebCrawler: first documented spider  Lycos: first large-scale spider

53 53 Current Status of Web Spiders Enhanced Spidering  In-link counts to pages can be established during spidering.  Hint: Hmmm… hw6?  In-link counts are the basis for Google’s page-rank method

54 54 Current Status of Web Spiders Unsolved Problems  Most spidering re-traverses stable web graph => on-demand re-spidering when changes occur  Completeness or near-completeness is still a major issue  Cannot Spider information stored in local-DB  Spidering non-textual data

55 55 Reading  About graphs:  Chapter 14.  About Web search:  http://www.cadenza.org/search_engine_terms/srchad.htm

