Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
University of Texas at Arlington

Structured Web Search
- Existing search engines use linear feature match
- Web contains structural information as well
- Hyperlink information
- Web viewed as a graph [Kleinberg]
- Subdue searches based on structure
- Use as foundation of a structural search engine
- Incorporation of WordNet allows for synonym match

- Discovers structural patterns in input graphs
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
- Pattern discovery, classification, clustering
[Figure: three panels. Input Database (R1, C1, T1 to T4, S1 to S4); Substructure S1 in graph form (labels: object, triangle, square, on, shape); Compressed Database (R1, C1).]

Subdue Algorithm
- Start with individual vertices
- Keep only best substructures on queue
- Expand substructure by adding edge/vertex
- Compress graph and repeat to generate hierarchical description
- Optional use of background knowledge
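The first three steps of the algorithm can be sketched as a small beam search. This is only an illustration under stated assumptions, not Subdue's implementation: the function name `discover`, the score (instance count times number of edges), and the label-triple signature used in place of a real isomorphism test are all hypothetical, and the compression and background-knowledge steps are left out.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=3, max_edges=2):
    """vertices: {id: label}; edges: list of (source id, target id, label)."""
    def signature(edge_set):
        # Assumption: sorted label triples stand in for an isomorphism test.
        return tuple(sorted((vertices[s], lbl, vertices[d]) for s, d, lbl in edge_set))

    # Start with individual vertices; an instance is (vertex ids, edges).
    queue = defaultdict(set)
    for v, label in vertices.items():
        queue[(label,)].add((frozenset([v]), frozenset()))

    best = (0, None, 0)  # (score, pattern, number of instances)
    for _ in range(max_edges):
        # Expand every instance by one edge that touches it.
        expanded = defaultdict(set)
        for instances in queue.values():
            for vs, es in instances:
                for edge in edges:
                    s, d, _lbl = edge
                    if edge not in es and (s in vs or d in vs):
                        new_es = es | {edge}
                        expanded[signature(new_es)].add((vs | {s, d}, new_es))
        # Keep only the best substructures on the queue.
        # Assumption: score = instance count * pattern size in edges.
        ranked = sorted(expanded.items(),
                        key=lambda kv: len(kv[1]) * len(kv[0]), reverse=True)
        queue = dict(ranked[:beam_width])
        for pattern, instances in queue.items():
            score = len(instances) * len(pattern)
            if score > best[0]:
                best = (score, pattern, len(instances))
    return best[1], best[2]

# Toy input: three triangle-on-square pairs plus one circle on a triangle.
shapes = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
          5: "triangle", 6: "square", 7: "circle"}
links = [(1, 2, "on"), (3, 4, "on"), (5, 6, "on"), (7, 1, "on")]
pattern, count = discover(shapes, links)
```

On this toy input the triangle-on-square pattern wins because it occurs three times, while every larger pattern occurs once.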

Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
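The match rule can be illustrated with a brute-force sketch for very small graphs. The assumptions here are mine, not the slides': every edit (changing a vertex label, adding or deleting an edge) costs 1, "size" is the number of vertices plus edges of the first graph, and the names `transform_cost` and `inexact_match` are hypothetical.

```python
from itertools import permutations

def transform_cost(g1, g2):
    """g = (list of vertex labels, set of (i, j, label) edges). Unit-cost edits."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with unlabeled dummy vertices.
    labels1 = list(labels1) + [None] * (n - len(labels1))
    labels2 = list(labels2) + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):  # try every vertex mapping
        vertex_cost = sum(1 for i in range(n) if labels1[i] != labels2[perm[i]])
        mapped = {(perm[s], perm[d], lbl) for s, d, lbl in edges1}
        edge_cost = len(mapped ^ edges2)  # edges to delete plus edges to add
        cost = vertex_cost + edge_cost
        if best is None or cost < best:
            best = cost
    return best

def inexact_match(g1, g2, threshold):
    size = len(g1[0]) + len(g1[1])  # assumption: vertices + edges of g1
    return transform_cost(g1, g2) / size < threshold

query = (["page", "page", "Subdue"], {(0, 1, "hyperlink"), (1, 2, "word")})
candidate = (["page", "page", "Subduing"], {(0, 1, "hyperlink"), (1, 2, "word")})
```

Here the candidate differs by one vertex label, so the cost is 1 and the size is 5; the pair matches at threshold 0.25 but not at 0.2, because the test is strict.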

Application Domains
- Protein data
- Human Genome DNA data
- Spatial-temporal domains
- Earthquake data
- Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web data

Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example graph of page vertices joined by hyperlink edges, with word edges to keyword vertices (university, texas, learning, group, projects, subdue, robotics, parallel, work, planning).]
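A minimal sketch of this graph construction follows. It assumes the pages have already been fetched into an in-memory dictionary, so no network access is shown; the function name `build_web_graph` and the `pages` layout are hypothetical.

```python
from collections import deque

def build_web_graph(pages, start_url):
    """pages: {url: (list of keywords, list of outgoing urls)}, a stand-in for fetching."""
    vertices = {1: "page"}        # vertex id -> label
    edges = []                    # (source id, target id, label)
    page_ids = {start_url: 1}
    queue = deque([start_url])    # FIFO queue gives breadth-first order
    while queue:
        url = queue.popleft()
        words, links = pages[url]
        for word in words:
            # Each keyword becomes its own vertex joined by a "word" edge.
            word_id = len(vertices) + 1
            vertices[word_id] = word
            edges.append((page_ids[url], word_id, "word"))
        for target in links:
            if target not in pages:
                continue          # stay inside the crawled domain
            if target not in page_ids:
                page_ids[target] = len(vertices) + 1
                vertices[page_ids[target]] = "page"
                queue.append(target)
            edges.append((page_ids[url], page_ids[target], "hyperlink"))
    return vertices, edges, page_ids

site = {"a": (["subdue"], ["b"]), "b": (["robotics"], ["a", "x"])}
web_vertices, web_edges, ids = build_web_graph(site, "a")
```

The link to "x" is dropped because that page lies outside the toy domain.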

WebSubdue’s Structural Search
- Formulate query as graph
- Use Subdue’s predefined substructure option to search for instances of query
[Figure: example query graphs with vertex labels Instructor, Teaching, Robotics, Research, Publication, http, Postscript | PDF.]

Query: Find all pages which link to a page containing term ‘Subdue’

Query graph:
    /* Vertex ID Label */
    s
    v 1 page
    v 2 page
    v 3 Subdue
    /* Edge Vertex 1 Vertex 2 Label */
    d 1 2 hyperlink
    d 2 3 word

Result:
    Subgraph vertices:
    1 page URL:
    7 page URL:
    8 Subdue
    [1->7] hyperlink
    [7->8] word

[Figure: query graph with labels page, hyperlink, page, word, Subdue.]
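To make the query format concrete, the sketch below parses `v` and `d` lines like those on the slide and finds exact instances by brute force. It is not Subdue's matcher: `parse_graph` and `find_instances` are hypothetical names, labels are assumed to be single words, and the data graph is a made-up example chosen to reproduce the vertex numbers in the result above.

```python
from itertools import permutations

def parse_graph(text):
    """Read 'v <id> <label>' and 'd <from> <to> <label>' lines; skip anything else."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges

def find_instances(query, graph):
    """Return every mapping of query vertex ids to graph vertex ids that
    preserves vertex labels and all directed, labeled query edges."""
    q_vertices, q_edges = query
    g_vertices, g_edges = graph
    edge_set = set(g_edges)
    q_ids = sorted(q_vertices)
    instances = []
    for candidate in permutations(sorted(g_vertices), len(q_ids)):
        mapping = dict(zip(q_ids, candidate))
        labels_ok = all(g_vertices[mapping[q]] == q_vertices[q] for q in q_ids)
        edges_ok = all((mapping[s], mapping[d], lbl) in edge_set
                       for s, d, lbl in q_edges)
        if labels_ok and edges_ok:
            instances.append(mapping)
    return instances

query = parse_graph("""/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word""")

# Hypothetical data graph: page 1 links to page 7, which contains 'Subdue'.
data = parse_graph("""v 1 page
v 7 page
v 8 Subdue
v 9 page
d 1 7 hyperlink
d 7 8 word
d 9 1 hyperlink""")
matches = find_instances(query, data)
```

Trying every vertex assignment is exponential in the query size, so this only suits toy graphs.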

Search for Presentation Pages
- WebSubdue: 22 instances
- AltaVista: 12 instances
  - Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
[Figure: query graph of page vertices joined by hyperlink edges.]

Search for Reference Pages
- Search for page with at least 35 in-links
- WebSubdue found 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph of page vertices joined by hyperlink edges …]
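The in-link criterion can also be checked directly by counting incoming hyperlink edges. This is a plain degree count, not the structural query WebSubdue runs; the function name `reference_pages` and the tuple layout of the graph are assumptions.

```python
from collections import Counter

def reference_pages(vertices, edges, min_in_links=35):
    """vertices: {id: label}; edges: list of (source id, target id, label)."""
    # Count incoming hyperlink edges per target vertex.
    in_links = Counter(dst for _src, dst, label in edges if label == "hyperlink")
    return sorted(v for v, n in in_links.items()
                  if n >= min_in_links and vertices.get(v) == "page")

toy_vertices = {1: "page", 2: "page", 3: "page", 4: "algorithms"}
toy_edges = [(1, 3, "hyperlink"), (2, 3, "hyperlink"),
             (3, 4, "word"), (1, 2, "hyperlink")]
popular = reference_pages(toy_vertices, toy_edges, min_in_links=2)
```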

Inclusion of WordNet
- When generating graph
  - Use common stopword list
- When searching for subgraph instances
  - Morphology functions
    - October = Oct
    - teaching = teach
  - Synsets
    - Optional allowance of synonyms
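The sketch below shows where each of these steps would sit. It does not call WordNet: the stopword list, the base-form table (limited to the two examples above), and the single synonym set are small hand-written stand-ins, and all names are hypothetical.

```python
STOPWORDS = {"the", "a", "of", "in", "and"}            # hypothetical short list
BASE_FORMS = {"october": "oct", "teaching": "teach"}   # the two slide examples
SYNSETS = [{"job", "employment", "work"}]              # stand-in for one synset

def keywords(text):
    """Graph generation: drop stopwords, reduce the rest to base forms."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [BASE_FORMS.get(w, w) for w in words]

def labels_match(a, b, allow_synonyms=False):
    """Search: compare two word labels by base form, optionally by synonym set."""
    a, b = a.lower(), b.lower()
    a, b = BASE_FORMS.get(a, a), BASE_FORMS.get(b, b)
    if a == b:
        return True
    return allow_synonyms and any(a in s and b in s for s in SYNSETS)
```

With these tables, `labels_match("October", "Oct")` holds, and `labels_match("job", "employment")` holds only when `allow_synonyms=True`.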

Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- WebSubdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph of a page vertex with word edges to jobs, computer, science.]

Search for ‘authority’ hub and authority pages
- WebSubdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- WebSubdue found 13 matches
[Figure: query graph with labels page, hyperlink, word, algorithms, HUBS, AUTHORITIES.]

Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
[Figure: learned substructures with labels page, word, box.]

Conclusions
- WebSubdue can be used to search for structural web data
- Could be enhanced with additional WordNet features such as synset path length
- Efficient structural search necessary for future of web search tools

To Learn More
cygnus.uta.edu/subdue