Searching in Graphs
Google: lifetime of a query. All web pages need to be in Google’s index: over 20 billion web pages, with new ones constantly being added. How can Google keep searching for new web pages?
Web Crawlers. First crawler: Web Wanderer from MIT, 1993 (measure the growth of the web). Well-known crawlers: GoogleBot, MSNBot, Slurp (from Yahoo!), Teoma (from AskJeeves).
Crawler Architecture. [Figure: crawler architecture. Components shown: seed URLs, URL frontier, scheduler, retrievers (DNS, HTTP), Internet, parser, HREFs extractor and normalizer, duplicate URL eliminator, filter hosts, load monitor, crawl metadata, citations.]
Web Crawler Architecture. High-level structure: start with a set of URLs; repeatedly get web pages and scan them for outlinks. Issues: latency of several seconds per page; DNS lookup delays; duplicate pages; “spider traps” (hyperlinks constructed to trap the crawler); crashing the server due to overload; delays in server response.
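The high-level loop above can be sketched without any network access. This is a minimal illustration, not how any production crawler works: `TOY_WEB`, `crawl`, and the `max_pages` budget are hypothetical names, and the dictionary stands in for fetching a page and extracting its outlinks.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outlinks (stands in for HTTP fetch + parse).
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/", "b.example/"],
    "b.example/": ["b.example/y"],
    "b.example/y": [],
}

def crawl(seeds, get_outlinks, max_pages=1000):
    """Return URLs in the order they are fetched, breadth-first from the seeds."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # skip URLs that were already queued (duplicates)
    fetched = []
    # The page budget is a crude guard against spider traps.
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in get_outlinks(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

order = crawl(["a.example/"], lambda u: TOY_WEB.get(u, []))
```

A real crawler would also add per-host politeness delays and DNS caching to address the latency and overload issues listed above.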
Web Crawler Architecture. www.vt.edu/robots.txt. Spider traps (dummy links), for example: http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/ Basic web crawl: searching a graph.
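Two of the safeguards on this slide can be sketched with the Python standard library. The robots.txt rules below are invented for illustration (they are not the contents of www.vt.edu/robots.txt), and `looks_like_trap` is a hypothetical heuristic with arbitrary thresholds, not a standard algorithm.

```python
from collections import Counter
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def make_robots(lines):
    """Build a robots.txt checker from already-downloaded lines (no network)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

def looks_like_trap(url, max_depth=8, max_repeats=2):
    """Heuristic only: flag very deep paths or paths repeating one segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(c > max_repeats for c in counts.values())

# Made-up rules, used only to exercise the parser.
robots = make_robots(["User-agent: *", "Disallow: /private/"])

# A shortened prefix of the trap URL shown on the slide.
trap = ("http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/"
        "hatchline/hatchline/flyfactory/")
```

A crawler would call `robots.can_fetch(agent, url)` and `looks_like_trap(url)` before adding a URL to its frontier.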
Graph Theory: Basic Definitions and Applications Section 3.1 of [KT]
Connections between web links. [Figure: linked pages labeled VT home page, College of Engineering, Academics, Computer Science, Sports.]
Road Map
Airline routes
Directed Graphs. Directed graph: G = (V, E). V = nodes. E = edges between pairs of nodes. Captures pairwise relationship between objects. Graph size parameters: n = |V|, m = |E|. Maximum number of distinct edges = O(n^2). Edges are asymmetric: edge (1,4) but not (4,1). Example: V = {1, 2, 3, 4}, E = {(1,2), (1,3), (1,4), (2,4), (4,2), (4,3)}, n = 4, m = 6.
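One common way to store such a graph is an adjacency list built from the edge list. The sketch below uses the slide's four-node example; the function and variable names are illustrative choices.

```python
V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (1, 4), (2, 4), (4, 2), (4, 3)]

def adjacency_lists(nodes, edges):
    """Map each node to the list of nodes its outgoing edges point to."""
    out = {v: [] for v in nodes}
    for u, w in edges:
        out[u].append(w)   # directed: record u -> w only
    return out

out_adj = adjacency_lists(V, E)
n, m = len(V), len(E)
max_edges = n * (n - 1)    # ordered pairs without self-loops, which is O(n^2)
```

Here `4 in out_adj[1]` holds while `1 in out_adj[4]` does not, matching the asymmetry noted on the slide.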
Adjacencies. In(v) = {u : (u,v) is an edge}. Indegree(v) = |In(v)|. Out(v) = {w : (v,w) is an edge}. Outdegree(v) = |Out(v)|. Maximum indegree, outdegree = O(n). [Figure labels: Outdegree(1), Indegree(2).]
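These definitions translate directly into code. The sketch assumes the figure on this slide is the four-node example from the Directed Graphs slide; the edge list is copied from there and the names are illustrative.

```python
from collections import defaultdict

# Assumed to be the graph drawn on this slide (taken from the Directed Graphs slide).
E = [(1, 2), (1, 3), (1, 4), (2, 4), (4, 2), (4, 3)]

def in_out_sets(edges):
    """Return (In, Out): In[v] = {u : (u, v) in E}, Out[v] = {w : (v, w) in E}."""
    in_sets = defaultdict(set)
    out_sets = defaultdict(set)
    for u, w in edges:
        out_sets[u].add(w)
        in_sets[w].add(u)
    return in_sets, out_sets

in_sets, out_sets = in_out_sets(E)
outdegree_1 = len(out_sets[1])   # |Out(1)|
indegree_2 = len(in_sets[2])     # |In(2)|
```

Under that assumption, Out(1) = {2, 3, 4} and In(2) = {1, 4}.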
Undirected Graphs. Undirected graph: G = (V, E). V = nodes. E = edges between pairs of nodes. Captures symmetric pairwise relationship between objects. Graph size parameters: n = |V|, m = |E|. Example: V = {1, 2, 3, 4, 5, 6, 7, 8}, E = {(1,2), (1,3), (2,3), (2,4), (2,5), (3,5), (3,7), (3,8), (4,5), (5,6), (7,8)}, n = 8, m = 11.
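For an undirected graph, each edge is stored in both directions. The sketch below is one possible representation; the edge (7, 8) is included as an assumption so that the edge count agrees with the stated m = 11.

```python
# Edge list for the example; (7, 8) is an assumption so that m = 11.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5),
         (3, 7), (3, 8), (4, 5), (5, 6), (7, 8)]

def undirected_adjacency(n, edges):
    """Adjacency sets for nodes 1..n; each edge is recorded in both directions."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

adj = undirected_adjacency(8, EDGES)
# Every edge contributes to two degrees, so the degrees sum to 2m.
degree_sum = sum(len(neighbors) for neighbors in adj.values())
```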
Some Graph Applications
Graph | Nodes | Edges
transportation | street intersections | highways
communication | computers | fiber optic cables
World Wide Web | web pages | hyperlinks
social | people | relationships
food web | species | predator-prey
software systems | functions | function calls
scheduling | tasks | precedence constraints
circuits | gates | wires
World Wide Web. Web graph: directed graph. Node: web page. Edge: hyperlink from one page to another. [Figure nodes: cnn.com, netscape.com, novell.com, cnnsi.com, timewarner.com, hbo.com, sorpranos.com.]
Ecological Food Web Food web graph. Directed graph Node = species. Edge = from prey to predator. Reference: http://www.twingroves.district96.k12.il.us/Wetlands/Salamander/SalGraphics/salfoodweb.giff
Road Map Nodes: intersections Edges: roads
Other graphs in the real world. Airline routes: nodes are cities, edges are flights. Yeast protein network: nodes are proteins, edges are interacting pairs.
Other graphs in the real world Sexual interaction network High school dating network
Phylogeny Trees. Phylogeny trees describe the evolutionary history of species. (Biologists draw their trees from left to right.) The phylogeny states that there was an ancestral species that gave rise to mammals and birds, but not to the other species shown in the tree (that is, mammals and birds share a common ancestor that they do not share with other species on the tree), that all animals are descended from an ancestor not shared with mushrooms, trees, and bacteria, and so on.
GUI Containment Hierarchy GUI containment hierarchy. Describe organization of GUI widgets. Reference: http://java.sun.com/docs/books/tutorial/uiswing/overview/anatomy.html
Paths and Connectivity. Def. A path in an undirected graph G = (V, E) is a sequence P of nodes v_1, v_2, …, v_{k-1}, v_k with the property that each consecutive pair v_i, v_{i+1} is joined by an edge in E. Def. A path is simple if all nodes are distinct. Def. An undirected graph is connected if for every pair of nodes u and v, there is a path between u and v.
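The three definitions can be checked mechanically. The sketch below uses a small hypothetical graph; breadth-first search is one standard way to test connectivity, chosen here for illustration.

```python
from collections import deque

def build_adj(edges):
    """Adjacency sets for an undirected graph (nodes without edges are omitted)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_path(adj, nodes):
    """True if each consecutive pair of nodes is joined by an edge."""
    return all(b in adj.get(a, ()) for a, b in zip(nodes, nodes[1:]))

def is_simple(nodes):
    """True if all nodes in the sequence are distinct."""
    return len(set(nodes)) == len(nodes)

def is_connected(adj):
    """Search from one node; the graph is connected iff every node is reached."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(adj)

adj = build_adj([(1, 2), (2, 3), (3, 4), (5, 6)])   # hypothetical example graph
```

In this example 1, 2, 3, 4 is a simple path, but the graph is not connected because no path joins node 1 to node 5.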
Cycles. Def. A cycle is a path v_1, v_2, …, v_{k-1}, v_k in which v_1 = v_k, k > 2, and the first k-1 nodes are all distinct. Example: cycle C = 1-2-4-5-3-1.
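As a quick check, the definition can be applied to the example cycle. The sketch tests the conditions exactly as stated on the slide, using the edge list from the Undirected Graphs slide; `is_cycle` is an illustrative name.

```python
# Edges as listed on the Undirected Graphs slide.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5),
         (3, 5), (3, 7), (3, 8), (4, 5), (5, 6)]

def is_cycle(edges, nodes):
    """Check the stated conditions: v_1 = v_k, k > 2, first k-1 nodes distinct,
    and every consecutive pair joined by an edge."""
    edge_set = {frozenset(e) for e in edges}   # undirected: order is irrelevant
    k = len(nodes)
    if k <= 2 or nodes[0] != nodes[-1]:
        return False
    if len(set(nodes[:-1])) != k - 1:
        return False
    return all(frozenset((a, b)) in edge_set for a, b in zip(nodes, nodes[1:]))

example_ok = is_cycle(EDGES, [1, 2, 4, 5, 3, 1])
```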
Trees. Def. An undirected graph is a tree if it is connected and does not contain a cycle. Theorem. Let G be an undirected graph on n nodes. Any two of the following statements imply the third: (1) G is connected; (2) G does not contain a cycle; (3) G has n-1 edges.
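The theorem suggests a simple tree test: verify two of the statements and the third follows. The sketch below checks the edge count and connectivity, assuming nodes are numbered 1..n with n >= 1; `is_tree` is an illustrative name.

```python
from collections import deque

def is_tree(n, edges):
    """Nodes are 1..n. Checks 'n-1 edges' and 'connected'; by the theorem,
    the graph then contains no cycle."""
    if len(edges) != n - 1:
        return False
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = {1}
    queue = deque([1])
    while queue:                      # breadth-first search from node 1
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

tree_ok = is_tree(4, [(1, 2), (1, 3), (3, 4)])
```

The second test case in practice would be a graph such as a triangle plus an isolated node: it has n-1 edges but is not connected, so it is not a tree.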
Rooted Trees. Rooted tree: given a tree T, choose a root node r and orient each edge away from r. Importance: models hierarchical structure. By rooting a tree, it is easy to see that it has n-1 edges (exactly one edge leading upward from each non-root node). [Figure: a tree, and the same tree rooted at 1, with labels for the root r, a node v, the parent of v, and a child of v.]
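Rooting can be sketched as a search from r that records, for each node, the node it was discovered from. The tree below is a hypothetical example, not the one in the slide's figure.

```python
from collections import deque

def root_tree(n, edges, r):
    """Return parent[v] for every non-root node of a tree on nodes 1..n rooted at r."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent = {}
    seen = {r}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:     # w is discovered from u, so u is its parent
                seen.add(w)
                parent[w] = u
                queue.append(w)
    return parent

TREE = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6)]   # hypothetical tree on 6 nodes
parent = root_tree(6, TREE, 1)
```

The parent map has exactly one entry per non-root node, which is the n-1 edge count noted above. Rooting the same tree at a different node changes the parent relationships but not the number of edges.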
Binary Trees. Binary tree: a rooted tree in which every node has either two or zero children. Complete binary tree: all leaf nodes are at the same level. How many nodes are in a complete binary tree with k levels?
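One way to answer the closing question is to count nodes level by level. The derivation below assumes levels are numbered 1 through k with the root alone on level 1 (a numbering convention not stated on the slide), and uses the definition above that every non-leaf node has exactly two children.

```latex
% Level i holds twice as many nodes as level i-1, so level i has 2^{i-1} nodes.
\[
  \#\text{nodes} \;=\; \sum_{i=1}^{k} 2^{\,i-1}
  \;=\; 1 + 2 + 4 + \dots + 2^{\,k-1}
  \;=\; 2^{k} - 1 .
\]
% Check for k = 3: 1 + 2 + 4 = 7 = 2^3 - 1.
```

If the root is instead counted as level 0 and the leaves sit on level k, the same sum gives 2^{k+1} - 1.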