1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002.

Presentation transcript:

1 Graphs & more on Web search Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002

Announcements

3 Homework 5  Homework Assignment #5 will be out on Friday.  Must do some reading in order to complete it.  Must take a progress quiz.  Get started today and as usual, think b4 u hack!

4 Reading  About graphs:  Chapter 14  About Web search:  /srchad.htm  An HTML tutorial: 

Introduction to Graphs

6 Graphs — an overview Vertices (aka nodes)

7 Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

8 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

9 Undirected Graphs — an overview PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges Weights

10 Terminology  Graph G = (V,E)  Set V of vertices (nodes)  Set E of edges  Elements of E are pairs (v,w) where v,w ∈ V.  An edge (v,v) is a self-loop. (Usually assume no self-loops.)  Weighted graph  Elements of E are (v,w,x) where x is a weight.

11 Terminology, cont’d  Directed graph (digraph)  The edge pairs are ordered  Every edge has a specified direction  The Web is a directed graph  Undirected graph  The edge pairs are unordered  E is a symmetric relation  (v,w) ∈ E implies (w,v) ∈ E  In an undirected graph (v,w) and (w,v) are usually treated as though they are the same edge

12 Directed Graph (digraph) PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

13 Undirected Graph PIT BOS JFK DTW LAX SFO Vertices (aka nodes) Edges

14 Terminology, cont’d  v and w are adjacent (neighbors) if (v,w) ∈ E or (w,v) ∈ E  d(v) (degree of v) = # of neighbors of v (for undirected graphs)  d⁺(v) (out-degree of v) = # of edges (v,w) ∈ E  d⁻(v) (in-degree of v) = # of edges (w,v) ∈ E
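
The degree definitions above can be checked on a small example. The following is a minimal Python sketch, not part of the original slides; the function name `degrees` and the flight list are hypothetical, chosen only to reuse the airport codes from the diagrams.

```python
from collections import defaultdict

def degrees(edges):
    """Return (out_degree, in_degree) dicts for a list of directed edges (v, w)."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for v, w in edges:
        out_deg[v] += 1   # edge (v, w) leaves v
        in_deg[w] += 1    # edge (v, w) enters w
    return dict(out_deg), dict(in_deg)

# Hypothetical directed flight graph (illustrative data only).
flights = [("PIT", "BOS"), ("PIT", "JFK"), ("JFK", "BOS"), ("BOS", "PIT")]
out_deg, in_deg = degrees(flights)
print(out_deg["PIT"], in_deg["BOS"])
```

Vertices with no outgoing (or incoming) edges simply do not appear in the corresponding dictionary, so callers may want `out_deg.get(v, 0)`.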

15 Terminology, cont’d  Path: a list of nodes (v[1], v[2],...,v[n]) s.t. (v[i],v[i+1]) ∈ E for all 0 < i < n  The length of the above path is n-1  Cycle: a path that begins and ends with the same node  Cyclic graph: contains at least one cycle  Acyclic graph: no cycles

16 Elements of a Graph PIT BOS JFK DTW LAX SFO

17 Terminology, cont’d  Subgraph of a graph G: a subset of V with the corresponding edges from E.  Connected graph: a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.  Connected component of a graph G: a maximal connected subgraph of G.
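
One way to make the notion of a connected component concrete is to compute the components of a small undirected graph. This is a sketch under stated assumptions: the function name `connected_components` and the sample edges are hypothetical, and breadth-first search is just one of several traversals that would work here.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return a list of vertex sets, one per connected component (undirected graph)."""
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)
        adj[w].add(v)          # undirected: store both directions
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:           # breadth-first search from `start`
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

airports = ["PIT", "BOS", "JFK", "DTW", "LAX", "SFO"]
routes = [("PIT", "BOS"), ("BOS", "JFK"), ("LAX", "SFO")]  # illustrative data
print(connected_components(airports, routes))
```

A graph is connected exactly when this function returns a single component.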

18 Elements of a Graph, cont’d PIT BOS JFK DTW LAX SFO

19 Terminology, cont’d  Unrooted (undirected) tree: an acyclic connected undirected graph  Theorem: in any unrooted tree T=(V,E), |V|=|E|+1. Proof: by induction on |V|  Base case |V|=1 (|E|=0)  Show there exists a node of degree one  Remove that node and apply induction hypothesis
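
The theorem also gives a cheap test: a connected undirected graph with |V| = |E| + 1 cannot contain a cycle, so it is a tree. The sketch below uses that fact; the function name `is_unrooted_tree` is hypothetical, and the code assumes the edges are distinct unordered pairs with no self-loops.

```python
def is_unrooted_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is a tree.

    Assumes edges are distinct unordered pairs with no self-loops.
    """
    vertices = list(vertices)
    if not vertices:
        return False
    # By the theorem, a tree must satisfy |V| = |E| + 1.
    if len(vertices) != len(edges) + 1:
        return False
    adj = {v: [] for v in vertices}
    for v, w in edges:
        adj[v].append(w)
        adj[w].append(v)
    # Depth-first search from an arbitrary vertex to test connectivity.
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)
```

Checking the edge count first rejects most non-trees without traversing the graph at all.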

20 Example of an unrooted tree PIT BOS JFK DTW LAX SFO

21 Quiz Break

22 So, is this a connected graph? Cyclic or Acyclic? Directed or Undirected?

23 Directed graph (unconnected) Cyclic or Acyclic?

Representing Graphs

25 Representing graphs  Adjacency matrix  2-dimensional array  For each edge (u,v), set A[u][v] to true; otherwise false  [Figure: example adjacency matrix for a five-vertex graph]  Adjacency lists  For each vertex, keep a list of adjacent vertices
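
Both representations can be built from the same edge list. The following Python sketch is illustrative only; it assumes vertices are numbered 0 to n-1 and edges are directed, and the function names and sample edges are hypothetical.

```python
def build_adjacency_matrix(n, edges):
    """Return an n x n boolean matrix A with A[u][v] True for each edge (u, v)."""
    A = [[False] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = True
    return A

def build_adjacency_lists(n, edges):
    """Return a list where entry u holds the vertices adjacent to u."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    return adj

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]   # illustrative directed edges
A = build_adjacency_matrix(4, edges)
adj = build_adjacency_lists(4, edges)
print(A[0][1], adj[0])
```

Testing whether an edge exists is a single lookup `A[u][v]` in the matrix, but requires scanning `adj[u]` in the list form. For an undirected graph, each edge would be stored in both directions.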

26 Choosing a representation  Size of V relative to size of E is a primary factor.  Dense: |E|/|V| is large  Sparse: |E|/|V| is small  Adjacency matrix is expensive in terms of space if the graph is sparse (O(|V|²) for the matrix versus O(|E|+|V|) for adjacency lists).  Adjacency list is expensive in terms of checking edges if the graph is dense.

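27 Size of a Graph  How many undirected graphs for a set of n given vertices?  Answer: 2^(n(n-1)/2), since each possible edge is either present or absent (assuming no self-loops)  How many edges in an undirected graph with n vertices?  Minimum: 0  Maximum: n(n-1)/2 (assuming no self-loops)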
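
These counts are easy to tabulate for small n. The sketch below assumes simple undirected graphs on n labeled vertices (no self-loops, at most one edge per pair), so there are n(n-1)/2 possible edges and each one is independently present or absent; the function names are hypothetical.

```python
def max_edges(n):
    """Maximum number of edges in a simple undirected graph on n vertices."""
    return n * (n - 1) // 2

def count_graphs(n):
    """Number of distinct simple undirected graphs on n labeled vertices."""
    return 2 ** max_edges(n)

for n in range(1, 6):
    print(n, max_edges(n), count_graphs(n))
```

The number of graphs grows extremely quickly, which is one reason graph algorithms work on a single given graph and do not enumerate all of them.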

Graphs are Everywhere

29 Graphs as models  The Internet  Communication pathways  DNS hierarchy  The WWW  The physical world  Road topology and maps  Airline routes and fares  Electrical circuits  Job and manufacturing scheduling

30 Graphs as models  Physical objects are often modeled by meshes, which are a particular kind of graph structure. By Jonathan Shewchuk

31 More graph models See also and NASA CFD labs By Paul Heckbert and David Garland

32 Structure of the Internet Europe Japan Backbone 1 Backbone 2 Backbone 3 Backbone 4, 5, N Australia Regional A Regional B NAP SOURCE: CISCO SYSTEMS MAPS UUNET MAP

33 Relationship graphs  Graphs are also used to model relationships among entities.  Scheduling and resource constraints.  Inheritance hierarchies

34 Where are we right now?

The Web Graph

36 Web Graph  Documents written in HTML  HTML (HyperText Markup Language)  TAGS: ,, ,  (anchor, link)

37 A simple HTML example A Simple HTML Example Carnegie Mellon University

38 Web Graph  A directed graph where :  V = (all web pages)  E = (all HTML-defined links from one web page to another)

39 Web Graph  Web Pages are nodes (vertices)  HTML references are links (edges)
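
To see how the edges of the Web graph arise from HTML, the out-links of a page can be read from its anchor tags. This is a minimal sketch using Python's standard `html.parser` module; the class name `LinkExtractor`, the helper `out_links`, and the sample page (including its URL) are assumptions made for illustration, not material from the slides.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def out_links(html):
    """Return the link targets found in one HTML page (the page's out-edges)."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# Hypothetical page; the URL is an assumption made for this example.
page = ('<html><head><title>A Simple HTML Example</title></head>'
        '<body><a href="http://www.cmu.edu">Carnegie Mellon University</a></body></html>')
print(out_links(page))
```

Each returned URL corresponds to one directed edge from the page containing the link to the page it names.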

40 Is the Web Graph connected?  Sparse, unconnected graph  AUTHORITIES web pages containing a “reasonable” amount of relevant information about a specific topic  HUBS web pages that point (link) to many pages containing relevant information about a given topic

41 Finding Hubs & Authorities  Nice iterative algorithm by Jon Kleinberg  HUB: Avrim’s Machine Learning page  AUTHORITY:  Extra credit opportunity for homework 5
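
The iterative idea can be sketched as follows: a page's authority score is the sum of the hub scores of pages linking to it, a page's hub score is the sum of the authority scores of pages it links to, and both are renormalized each round. This is a simplified sketch in the spirit of Kleinberg's hubs-and-authorities method, not the homework's required solution; the function name, the fixed iteration count, and the toy graph are assumptions.

```python
import math

def hubs_and_authorities(graph, iterations=50):
    """graph: dict mapping each page to the list of pages it links to.

    Returns (hub, auth) dicts of normalized scores.
    """
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum authority scores over outgoing links.
        hub = {p: sum(auth[dst] for dst in graph.get(p, ())) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (hub, auth):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Toy link graph (illustrative data only).
toy = {"hub1": ["a", "b"], "hub2": ["a", "b", "c"], "c": []}
hub, auth = hubs_and_authorities(toy)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In practice the iteration is usually run on a small subgraph of pages related to a query, not on the whole Web graph.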

Graphs : Application Search Engines

43 Search Engines

44 What are they?  Tools for finding information on the Web  Problem: “hidden” databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself. These cannot be accessed by Yahoo, Google etc.)  Search engine  A machine-constructed index (usually by keyword)  So many search engines, we need search engines to find them. Searchenginecollosus.com

45 Did you know?  Vivisimo was developed here at CMU  Developed by Prof. Raul Valdes-Perez  Developed in 2000

46 SE Architecture  Spider  Crawls the web to find pages. Follows hyperlinks. Never stops  Indexer  Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)  Retriever  Query interface  Database lookup to find hits  1 billion documents  1 TB RAM, many terabytes of disk  Ranking

47 A look at  10,000 servers (WOW!)  Web site traffic grows over 20% per month  Spiders over 2 Billion URLs  Supports 28 language searches  Over 100 million searches per day  “Even CMU uses it!”

48 Google’s server farm

49 Web Crawlers  Start with an initial page P0. Find URLs on P0 and add them to a queue  When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat  Can be specialized (e.g. only look for addresses)  Issues  Which page to look at next? (Special subjects, recency)  How deep within a site do you go (depth search)?  How frequently to visit pages?
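
The queue-driven loop described above can be sketched in a few lines. To keep the example self-contained it crawls a toy in-memory "web" and performs no real HTTP fetches; the names `crawl`, `get_links`, `index_page`, and `max_pages` are hypothetical. The `seen` set is an addition not mentioned on the slide, included so that a page is never queued twice.

```python
from collections import deque

def crawl(start, get_links, index_page, max_pages=100):
    """Breadth-first crawl from `start`.

    get_links(page) returns the URLs found on a page;
    index_page(page) hands a finished page to the indexer.
    """
    queue = deque([start])
    seen = {start}            # avoids queueing the same page twice
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in get_links(page):
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)      # done with this page: pass it to the indexer
        visited.append(page)
    return visited

# Toy in-memory web standing in for real pages (illustrative data only).
toy_web = {"P0": ["P1", "P2"], "P1": ["P2", "P0"], "P2": ["P3"], "P3": []}
indexed = []
order = crawl("P0", lambda p: toy_web.get(p, []), indexed.append)
print(order)
```

Replacing the FIFO queue with a priority queue is one way to address the "which page to look at next?" issue, for example by preferring recently changed or on-topic pages.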

50 So, why Spider the Web?  Refresh Collection by deleting dead links  OK if index is slightly smaller  Done every 1-2 weeks in best engines  Finding new sites  Respider the entire web  Done every 2-4 weeks in best engines

51 Cost of Spidering  Spider can (and does) run in parallel on hundreds of servers  Very high network connectivity (e.g. T3 line)  Servers can migrate from spidering to query processing depending on time-of-day load  Running a full web spider takes days even with hundreds of dedicated servers

52 Indexing  Arrangement of data (data structure) to permit fast searching  Which list is easier to search?  Unsorted: sow fox pig eel yak hen ant cat dog hog  Sorted: ant cat dog eel fox hen hog pig sow yak  Sorting helps. Why?  Permits binary search. About log₂ n probes into list  log₂(1 billion) ~ 30  Permits interpolation search. About log₂(log₂ n) probes  log₂ log₂(1 billion) ~ 5
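
The probe count for binary search can be observed directly on the sorted word list from the slide. This is a minimal sketch; the function name `binary_search` and the choice to return the probe count alongside the index are assumptions made for illustration.

```python
import math

def binary_search(sorted_items, target):
    """Return (index of target or -1, number of probes made)."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                       # one probe into the list
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1                  # target is in the upper half
        else:
            hi = mid - 1                  # target is in the lower half
    return -1, probes

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
print(binary_search(words, "pig"))
print(math.log2(1_000_000_000))           # just under 30
```

Each probe halves the remaining range, which is where the roughly log₂ n bound comes from.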

53 Inverted Files  A file is a list of words by position  First entry is the word in position 1 (first word)  Entry 4562 is the word in position 4562 (4562nd word)  Last entry is the last word  An inverted file is a list of positions by word!  Inverted file for the text above (word: positions):  a (1, 4, 40)  entry (11, 20, 31)  file (2, 38)  list (5, 41)  position (9, 16, 26)  positions (44)  word (14, 19, 24, 29, 35, 45)  words (7)  4562 (21, 27)
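
Building an inverted file is a single pass over the words. The sketch below inverts the slide's first sentence, using 1-based positions as the slide does; the function name `invert` is hypothetical.

```python
from collections import defaultdict

def invert(words):
    """Map each word to the list of (1-based) positions where it occurs."""
    inverted = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        inverted[word].append(pos)
    return dict(inverted)

text = "a file is a list of words by position"
inverted = invert(text.split())
print(inverted["a"], inverted["words"])
```

Looking up a word is now a dictionary access, with no scan over the whole file.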

54 Inverted Files for Multiple Documents DOCID OCCUR POS 1 POS “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document LEXICON WORD INDEX
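
For several documents, each lexicon entry can point to a list of postings holding a document ID, an occurrence count, and the positions. The following sketch shows one possible layout under assumptions: the function names `build_index` and `postings`, the nested-dictionary structure, and the sample documents are all hypothetical and are not taken from the slide's figure.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping doc ID to text. Returns word -> {doc ID: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def postings(index, word):
    """Return (doc ID, occurrence count, positions) tuples for a word, by doc ID."""
    return [(doc_id, len(positions), positions)
            for doc_id, positions in sorted(index.get(word, {}).items())]

# Hypothetical documents (illustrative data only).
docs = {
    34: "jezebel met ahab and jezebel left",
    44: "no match here",
    56: "ahab saw jezebel",
}
index = build_index(docs)
print(postings(index, "jezebel"))
```

The occurrence count is redundant with the length of the position list, but storing it separately lets a ranker read frequencies without touching the positions.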

55 Ranking (Scoring) Hits  Hits must be presented in some order  What order?  Relevance, recency, popularity, reliability, alphabetic?  Some ranking methods  Presence of keywords in title of document  Closeness of keywords to start of document  Frequency of keyword in document  Link popularity (how many pages point to this one)

56 Spamdexing & Link Popularity  Spamdexing means influencing retrieval ranking by altering a web page. (Puts “spam” in the index)   Link popularity is used for ranking  Many measures  Number of links in (In-links)  Weighted number of links in (by weight of referring page)
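
The two link-popularity measures listed above can be sketched on a small link graph. The function names, the graph, and the page weights below are assumptions for illustration; in particular, how a real engine assigns a weight to a referring page is not specified on the slides.

```python
def in_link_counts(graph):
    """graph: dict page -> list of pages it links to. Returns page -> number of in-links."""
    counts = {page: 0 for page in graph}
    for src, targets in graph.items():
        for dst in targets:
            counts[dst] = counts.get(dst, 0) + 1
    return counts

def weighted_in_links(graph, weight):
    """Sum the weights of the referring pages for every page."""
    scores = {page: 0.0 for page in graph}
    for src, targets in graph.items():
        for dst in targets:
            scores[dst] = scores.get(dst, 0.0) + weight.get(src, 0.0)
    return scores

# Toy link graph and hypothetical page weights (illustrative data only).
links = {"A": ["C"], "B": ["C", "A"], "C": []}
weights = {"A": 0.5, "B": 2.0, "C": 1.0}
print(in_link_counts(links))
print(weighted_in_links(links, weights))
```

Weighting by the referring page makes a link from an important page count for more than a link from an obscure one, which also makes the measure somewhat harder to inflate with many low-value pages.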

57 Search Engine Sizes  [Figure: search engine index sizes. Legend: AV = Altavista, EX = Excite, FAST, GG = Google, INK = Inktomi, NL = Northern Light]  SOURCE: SEARCHENGINEWATCH.COM

58 Historical Notes  WebCrawler: first documented spider  Lycos: first large-scale spider  Top-honors for most web pages spidered: First Lycos, then AltaVista, then Google...

59 Overview  Engines are a critical Web resource  Very sophisticated, high technology, but secret  Most spidering re-traverses stable web graph  They don’t spider the Web completely  Spamdexing is a problem  New paradigms needed as Web grows  What about images, music, video?  Google’s image search engine  Napster