Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
University of Texas at Arlington
Structured Web Search
- Existing search engines use linear feature match
- Web contains structural information as well
  - Hyperlink information
  - Web viewed as a graph [Kleinberg]
- Subdue searches based on structure
  - Use as foundation of a structural search engine
- Incorporation of WordNet allows for synonym match
- Subdue discovers structural patterns in input graphs
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
- Pattern discovery, classification, clustering
[Figure: Input Database; Substructure S1 (graph form); Compressed Database]
Subdue Algorithm
- Start with individual vertices
- Keep only best substructures on queue
- Expand substructure by adding edge/vertex
- Compress graph and repeat to generate hierarchical description
- Optional use of background knowledge
Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
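The match rule above can be sketched in a few lines. This is a minimal illustration, not Subdue's implementation: the function names are hypothetical, the transformation cost is assumed to be computed elsewhere, and "size" is assumed to be a positive count of graph elements.

```python
def is_inexact_match(transformation_cost: float, graph_size: int, threshold: float) -> bool:
    """Apply the slide's rule: match if cost/size < threshold.

    transformation_cost is assumed to come from a separate graph-edit
    procedure; graph_size is assumed to be a positive element count.
    """
    if graph_size <= 0:
        raise ValueError("graph_size must be positive")
    return transformation_cost / graph_size < threshold


def max_allowed_cost(graph_size: int, threshold: float) -> float:
    """Upper bound (exclusive) on the cost that still counts as a match."""
    return threshold * graph_size
```

For instance, with a threshold of 0.2 and a hypothetical graph size of 21, the bound is 4.2, so a cost of 4 would match and a cost of 5 would not.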
Application Domains
- Protein data
- Human Genome DNA data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web data
Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example graph of page vertices joined by hyperlink edges, with word edges to keywords such as university, texas, learning, group, projects, subdue, robotics, parallel, work, planning]
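A rough sketch of the graph construction described above, under stated assumptions: the crawl is simulated with in-memory dictionaries (`links` and `keywords` are hypothetical inputs standing in for fetched pages), each keyword gets its own vertex per page, and `max_pages` is an illustrative cap rather than a WebSubdue parameter.

```python
from collections import deque


def build_web_graph(start_url, links, keywords, max_pages=100):
    """Breadth-first traversal that builds a labeled graph.

    links:    dict mapping a URL to the URLs it links to (assumed input)
    keywords: dict mapping a URL to its keyword list (assumed input)
    Returns (vertices, edges, page_id), where vertices maps id -> label
    and edges is a list of (source id, target id, label) triples.
    """
    vertices, edges, page_id = {}, [], {}

    def add_vertex(label):
        vertex_id = len(vertices) + 1
        vertices[vertex_id] = label
        return vertex_id

    page_id[start_url] = add_vertex("page")
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        source = page_id[url]
        # One extra vertex per keyword, attached by a "word" edge.
        for word in keywords.get(url, []):
            edges.append((source, add_vertex(word), "word"))
        for target in links.get(url, []):
            if target not in page_id:
                if len(page_id) >= max_pages:
                    continue  # cap reached: skip pages not yet seen
                page_id[target] = add_vertex("page")
                queue.append(target)
            edges.append((source, page_id[target], "hyperlink"))
    return vertices, edges, page_id
```

A real crawler would replace the two dictionaries with page fetching and keyword extraction; the traversal order and the vertex and edge labels are the parts that mirror the slide.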
WebSubdue’s Structural Search
- Formulate query as graph
- Use Subdue’s predefined substructure option to search for instances of the query
[Figure: example query graph with labels Instructor, Teaching, Robotics, Research, Publication, http, Postscript | PDF]
Query: Find all pages which link to a page containing the term ‘Subdue’
Query graph file:
    /* Vertex ID Label */
    s
    v 1 page
    v 2 page
    v 3 Subdue
    /* Edge Vertex 1 Vertex 2 Label */
    d 1 2 hyperlink
    d 2 3 word
WebSubdue output:
    Subgraph vertices:
    1 page URL:
    7 page URL:
    8 Subdue
    [1->7] hyperlink
    [7->8] word
[Figure: query graph, page -hyperlink-> page -word-> Subdue]
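To make the query format concrete, here is a small sketch that reads `v` and `d` lines like those in the example and finds instances of this one query shape by direct lookup. It is not Subdue's matcher: both function names are hypothetical, the `s` line is simply skipped (it presumably marks the start of a substructure definition), and only exact label matches are considered.

```python
def parse_subdue_graph(text):
    """Parse 'v <id> <label>' and 'd <v1> <v2> <label>' lines.

    Comment lines (starting with '/*') and the 's' marker are skipped.
    Returns (vertices, edges): id -> label, and (v1, v2, label) triples.
    """
    vertices, edges = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("/*") or line == "s":
            continue
        fields = line.split()
        if fields[0] == "v":
            vertices[int(fields[1])] = " ".join(fields[2:])
        elif fields[0] == "d":
            edges.append((int(fields[1]), int(fields[2]), " ".join(fields[3:])))
    return vertices, edges


def pages_linking_to_term(vertices, edges, term):
    """Return (linking page, linked page, word vertex) triples.

    Hand-coded for the pattern page -hyperlink-> page -word-> term.
    """
    matches = []
    for page, word_vertex, label in edges:
        if label != "word" or vertices[word_vertex] != term:
            continue
        for source, target, link_label in edges:
            if link_label == "hyperlink" and target == page and vertices[source] == "page":
                matches.append((source, page, word_vertex))
    return matches
```

On a graph containing the edges `d 1 7 hyperlink` and `d 7 8 word`, with vertex 8 labeled `Subdue`, the lookup returns the triple `(1, 7, 8)`, which corresponds to the instance shown in the slide's output.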
Search for Presentation Pages
- WebSubdue: 22 instances
- AltaVista: 12 instances
  - Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
[Figure: query graph of page vertices joined by hyperlink edges]
Search for Reference Pages
- Search for page with at least 35 in-links
- WebSubdue found 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph of a page with many incoming hyperlink edges]
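The in-link criterion can also be expressed as a direct count over the edge list. The sketch below is an illustration and not how WebSubdue performs the search (the slides describe it as a structural query). The function name is hypothetical, edges are assumed to be (source, target, label) triples, and repeated links from the same page are assumed to count once.

```python
from collections import defaultdict


def pages_with_min_in_links(edges, min_in_links):
    """Return the sorted ids of pages with at least min_in_links distinct linking pages."""
    linking_pages = defaultdict(set)
    for source, target, label in edges:
        if label == "hyperlink":
            linking_pages[target].add(source)
    return sorted(
        page for page, sources in linking_pages.items()
        if len(sources) >= min_in_links
    )
```

Calling `pages_with_min_in_links(edges, 35)` on a graph of the crawled domain would correspond to the slide's criterion.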
Inclusion of WordNet
- When generating graph
  - Use common stopword list
- When searching for subgraph instances
  - Morphology functions
    - October = Oct
    - teaching = teach
  - Synsets
    - Optional allowance of synonyms
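The sketch below shows where each of these steps might plug in. It does not call WordNet: the stopword set, base-form table, and synonym set are tiny hand-written stand-ins (assumptions for illustration only), and the function names are hypothetical.

```python
# Hand-written stand-ins for WordNet data (illustrative assumptions only).
STOPWORDS = {"a", "an", "the", "in", "of", "and"}
BASE_FORMS = {"oct": "october", "teaching": "teach", "jobs": "job"}
SYNSETS = [{"job", "employment", "work", "problem", "task"}]


def normalize(word):
    """Lowercase a word and map it to a base form when one is listed."""
    word = word.lower()
    return BASE_FORMS.get(word, word)


def keywords(text):
    """Graph generation step: drop stopwords, normalize what remains."""
    return [normalize(w) for w in text.split() if w.lower() not in STOPWORDS]


def labels_match(query_word, page_word, allow_synonyms=False):
    """Search step: compare labels by base form, optionally via a shared synonym set."""
    q, p = normalize(query_word), normalize(page_word)
    if q == p:
        return True
    if allow_synonyms:
        return any(q in synset and p in synset for synset in SYNSETS)
    return False
```

With these tables, `labels_match("jobs", "employment")` is false by default and true when `allow_synonyms=True`, which mirrors the optional allowance of synonyms. A real system would draw base forms and synsets from WordNet itself.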
Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- WebSubdue found 33 matches
  - Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph of a page with word edges to jobs, computer, science]
Search for ‘authority’ hub and authority pages
- WebSubdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
  - WebSubdue found 13 matches
[Figure: query graph with HUBS and AUTHORITIES page vertices, hyperlink edges, and a word edge to algorithms]
Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
[Figure: learned substructure with page vertices and a word edge to box]
Conclusions
- WebSubdue can be used to search for structural web data
- Could be enhanced with additional WordNet features such as synset path length
- Efficient structural search necessary for future of web search tools
To Learn More
cygnus.uta.edu/subdue