1
Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
University of Texas at Arlington
cook@cse.uta.edu
http://www-cse.uta.edu/~cook
2
Structured Web Search
- Existing search engines use linear feature match
- Web contains structural information as well
- Hyperlink information
- Web viewed as a graph [Kleinberg]
- Subdue searches based on structure
- Use as foundation of a structural search engine
- Incorporation of WordNet allows for synonym match
3
- Discovers structural patterns in input graphs
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
- Pattern discovery, classification, clustering
[Figure: input database (R1, C1, T1 to T4, S1 to S4), substructure S1 in graph form (labels: object, triangle, square, on, shape), and compressed database (R1, C1)]
4
Subdue Algorithm
- Start with individual vertices
- Keep only best substructures on queue
- Expand substructure by adding edge/vertex
- Compress graph and repeat to generate hierarchical description
- Optional use of background knowledge
5
Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
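The match rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Subdue's implementation: the vertex correspondence between the two graphs is taken as given (same vertex ids), every edit has unit cost, and size is counted as vertices plus edges of the query graph. The function names are hypothetical. A real inexact matcher would search over vertex mappings for the cheapest transformation.

```python
def transformation_cost(v1, e1, v2, e2):
    """Unit-cost edits to turn graph 1 into graph 2.

    Assumes vertices with the same id correspond to each other.
    v1, v2: dict of vertex id -> label; e1, e2: sets of (src, dst, label).
    """
    ids = set(v1) | set(v2)
    # A vertex that is missing on one side, or relabeled, costs 1.
    vertex_cost = sum(1 for v in ids if v1.get(v) != v2.get(v))
    # An edge present in only one graph costs 1 (a relabeled edge costs 2).
    edge_cost = len(e1 ^ e2)
    return vertex_cost + edge_cost


def graph_size(vertices, edges):
    """Size measured as vertices plus edges (an assumption of this sketch)."""
    return len(vertices) + len(edges)


def is_inexact_match(cost, size, threshold):
    """The slide's rule: match if cost/size < threshold."""
    return cost / size < threshold


# Usage: a query and a candidate that differ in one vertex label.
query_v = {1: "page", 2: "page", 3: "Subdue"}
query_e = {(1, 2, "hyperlink"), (2, 3, "word")}
cand_v = {1: "page", 2: "page", 3: "Holder"}
cand_e = {(1, 2, "hyperlink"), (2, 3, "word")}

cost = transformation_cost(query_v, query_e, cand_v, cand_e)
print(cost, is_inexact_match(cost, graph_size(query_v, query_e), 0.25))
```

Because the comparison is strict, a candidate whose cost/size equals the threshold exactly is rejected.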
6
Application Domains
- Protein data
- Human Genome DNA data
- Spatial-temporal domains
- Earthquake data
- Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web data
7
Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example web graph with page vertices joined by hyperlink edges, and word edges to keyword vertices such as university, texas, learning, group, projects, subdue, robotics, parallel, work, planning]
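A minimal Python sketch of this construction follows. It makes several assumptions that the slides do not confirm: pages are read from an in-memory dictionary (`site`, a hypothetical stand-in for fetching and parsing HTML), each keyword occurrence gets its own word vertex, and the crawl stays on the start URL's host. The `example.edu` URLs are placeholders.

```python
from collections import deque
from urllib.parse import urlparse


def build_web_graph(start_url, site, max_pages=100):
    """Breadth-first traversal of one domain into a labeled graph.

    site: dict of url -> {"links": [...], "words": [...]} (hypothetical).
    Returns (vertices, edges, page_id): vertex id -> label,
    a list of (src, dst, label), and url -> page vertex id.
    """
    domain = urlparse(start_url).netloc
    vertices, edges, page_id = {}, [], {}

    def new_vertex(label):
        vid = len(vertices) + 1          # ids are assigned 1, 2, 3, ...
        vertices[vid] = label
        return vid

    page_id[start_url] = new_vertex("page")
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        page = site.get(url, {"links": [], "words": []})
        # One word vertex per keyword on this page.
        for word in page["words"]:
            edges.append((page_id[url], new_vertex(word), "word"))
        for target in page["links"]:
            if urlparse(target).netloc != domain:
                continue                 # stay inside the crawled domain
            if target not in page_id:
                if len(page_id) >= max_pages:
                    continue             # page budget exhausted
                page_id[target] = new_vertex("page")
                queue.append(target)
            edges.append((page_id[url], page_id[target], "hyperlink"))
    return vertices, edges, page_id


# Usage with a two-page toy site.
site = {
    "http://example.edu/": {
        "links": ["http://example.edu/projects.html", "http://other.org/"],
        "words": ["university"],
    },
    "http://example.edu/projects.html": {
        "links": ["http://example.edu/"],
        "words": ["subdue", "robotics"],
    },
}
vertices, edges, page_id = build_web_graph("http://example.edu/", site)
print(len(vertices), len(edges))
```

Sharing one vertex per distinct word across pages would be an equally plausible design; the slides do not say which representation WebSubdue uses.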
8
WebSubdue’s Structural Search
- Formulate query as graph
- Use Subdue’s predefined substructure option to search for instances of query
[Figure: example query graph; labels include Instructor, Teaching, Robotics, Research, Publication, http, Postscript | PDF]
9
Query: Find all pages which link to a page containing term ‘Subdue’

Query graph (Subdue input format):
    /* Vertex ID Label */
    s
    v 1 page
    v 2 page
    v 3 Subdue
    /* Edge Vertex 1 Vertex 2 Label */
    d 1 2 hyperlink
    d 2 3 word

Matching instance:
    Subgraph vertices:
      1 page URL: http://cygnus.uta.edu
      7 page URL: http://cygnus.uta.edu/projects.html
      8 Subdue
    [1->7] hyperlink
    [7->8] word

[Figure: query graph, a page vertex with a hyperlink edge to a second page vertex, which has a word edge to Subdue]
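To make the query format concrete, here is a small Python sketch that reads the `v` and `d` lines of the query and finds exact instances in a graph by backtracking. It is not Subdue's matcher: the parser simply skips the `s` line and the comment lines, matching is exact on labels and edge direction, and the function names and the four-vertex example graph are hypothetical.

```python
QUERY_TEXT = """\
/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word
"""


def parse_query(text):
    """Read 'v id label' and 'd src dst label' lines; ignore everything else."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in ("v", "d"):
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        else:
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges


def find_instances(qv, qe, gv, ge):
    """Return every mapping query vertex -> graph vertex that preserves
    vertex labels and all directed, labeled query edges."""
    graph_edges = set(ge)
    order = sorted(qv)
    results = []

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        q = order[len(mapping)]
        for g, label in gv.items():
            if label != qv[q] or g in mapping.values():
                continue
            mapping[q] = g
            # Check every query edge whose endpoints are both mapped so far.
            if all((mapping[a], mapping[b], lab) in graph_edges
                   for a, b, lab in qe if a in mapping and b in mapping):
                extend(mapping)
            del mapping[q]

    extend({})
    return results


# Usage: a toy graph containing the instance reported on the slide.
graph_v = {1: "page", 7: "page", 8: "Subdue", 9: "page"}
graph_e = [(1, 7, "hyperlink"), (7, 8, "word"), (9, 1, "hyperlink")]
qv, qe = parse_query(QUERY_TEXT)
print(find_instances(qv, qe, graph_v, graph_e))
```

Backtracking like this is exponential in the worst case, which is acceptable for three-vertex queries but is one reason a production system needs a more careful search.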
10
Search for Presentation Pages
- WebSubdue: 22 instances
- AltaVista: 12 instances, using the query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif”
[Figure: query graph of page vertices connected by hyperlink edges]
11
Search for Reference Pages
- Search for a page with at least 35 in-links
- WebSubdue found 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph of many page vertices, each with a hyperlink edge to one page]
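The in-link criterion can be illustrated directly on an edge list. The Python sketch below counts distinct linking pages per target and ignores self-links; both choices are assumptions, since the slide does not say how repeated or self-referential links are treated, and the function name is hypothetical. WebSubdue itself expresses this search as a subgraph query rather than a count.

```python
def pages_with_min_in_links(edges, min_in_links):
    """Return sorted page ids that at least `min_in_links` distinct
    other pages link to. edges: iterable of (src, dst, label)."""
    sources_by_target = {}
    for src, dst, label in edges:
        if label == "hyperlink" and src != dst:
            sources_by_target.setdefault(dst, set()).add(src)
    return sorted(page for page, sources in sources_by_target.items()
                  if len(sources) >= min_in_links)


# Usage on a toy edge list (the slide's search used a minimum of 35).
edges = [
    (1, 3, "hyperlink"),
    (2, 3, "hyperlink"),
    (2, 3, "hyperlink"),   # duplicate link, counted once
    (3, 3, "hyperlink"),   # self-link, ignored
    (1, 2, "hyperlink"),
    (3, 4, "word"),        # not a hyperlink
]
print(pages_with_min_in_links(edges, 2))
```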
12
Inclusion of WordNet
- When generating graph
  - Use common stopword list
- When searching for subgraph instances
  - Morphology functions
    - October = Oct
    - teaching = teach
  - Synsets
    - Optional allowance of synonyms
13
Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- WebSubdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph, a page vertex with word edges to jobs, computer, and science]
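How morphology and one level of synonyms could relax a word-label comparison is sketched below in Python. The sketch does not call WordNet: `MORPH` and `SYNONYMS` are small hand-written tables standing in for WordNet's morphology functions and synsets. The entries come from the examples on the slides, but grouping all of the listed words as synonyms of "job" is an assumption.

```python
# Toy stand-ins for WordNet lookups (hypothetical tables).
MORPH = {"jobs": "job", "teaching": "teach", "october": "oct"}
SYNONYMS = {"job": {"employment", "work", "problem", "task"}}


def normalize(word):
    """Lower-case a word and reduce it to a base form when one is known."""
    w = word.lower()
    return MORPH.get(w, w)


def words_match(query_word, page_word, allow_synonyms=False):
    """Compare two word labels after normalization.

    With allow_synonyms=True, one level of synonyms is accepted:
    the words match if either is listed as a synonym of the other.
    """
    q, p = normalize(query_word), normalize(page_word)
    if q == p:
        return True
    if not allow_synonyms:
        return False
    return p in SYNONYMS.get(q, set()) or q in SYNONYMS.get(p, set())


# Usage
print(words_match("jobs", "employment"))
print(words_match("jobs", "employment", allow_synonyms=True))
```

Keeping synonym matching optional mirrors the slide's "optional allowance of synonyms": exact search stays precise, and the relaxed mode trades precision for recall.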
14
Search for ‘authority’ hub and authority pages
- WebSubdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- WebSubdue found 13 matches
[Figure: query graph of HUBS and AUTHORITIES; page vertices joined by hyperlink edges, with word edges to algorithms]
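The figure of 4.2 allowed transformations can be checked against one plausible query layout. The Python sketch below assumes a query in which each of 3 hub pages links to each of 3 authority pages, each authority page carries the word "algorithms", and graph size is counted as vertices plus edges. None of this is stated on the slides, and the function name is hypothetical. Under these assumptions the size comes to 21, and 0.2 × 21 gives 4.2, which is consistent with the slide but does not confirm the layout.

```python
def hub_authority_query(num_hubs, num_authorities, word):
    """Build a query graph in which every hub page links to every
    authority page and every authority page contains `word`."""
    hubs = list(range(1, num_hubs + 1))
    auths = list(range(num_hubs + 1, num_hubs + num_authorities + 1))
    vertices = {v: "page" for v in hubs + auths}
    edges = [(h, a, "hyperlink") for h in hubs for a in auths]
    next_id = num_hubs + num_authorities + 1
    for a in auths:
        vertices[next_id] = word          # one word vertex per authority
        edges.append((a, next_id, "word"))
        next_id += 1
    return vertices, edges


vertices, edges = hub_authority_query(3, 3, "algorithms")
size = len(vertices) + len(edges)   # 9 vertices + 12 edges
threshold = 0.2
allowed = threshold * size          # transformations permitted by the rule
print(size, allowed)
```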
15
Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
- Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
- Learned concept (stores have more levels in graph)
[Figure: learned substructure, a page vertex with a word edge to box]
16
Conclusions
- WebSubdue can be used to search for structural web data
- Could be enhanced with additional WordNet features such as synset path length
- Efficient structural search necessary for future of web search tools
17
To Learn More
cygnus.uta.edu/subdue
cook@cse.uta.edu
http://www-cse.uta.edu/~cook