Research Issues in Web Data Mining Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette,

Research Issues in Web Data Mining
Sanjay Kumar Madria, Department of Computer Science, Purdue University, West Lafayette, IN
Sourav Bhowmick, Ng Wee Keong and Lim Ee Peng, Nanyang Technological University, Singapore

WHOWEDA! A WareHouse Of WEb DAta
- Web Information Coupling Model (WICM)
  - Web Objects
  - Web Schema
- Web Information Coupling Algebra
- Web Information Maintenance
- Web Mining and Knowledge Discovery

Web Objects
- Node: url, title, format, size, date, text
- Link: source-url, target-url, label, link-type
- Web tuple
- Web table
- Web schema
- Web database
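A minimal sketch of how these web objects might be represented in code. The attribute names come from the slide. The types and defaults, and the idea that a web tuple is a collection of nodes and links and a web table a named collection of web tuples, are assumptions made for illustration. The URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A web document, with the attributes listed on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass
class Link:
    """A hyperlink between two documents."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""


@dataclass
class WebTuple:
    """Assumed shape: a set of nodes plus the links that connect them."""
    nodes: List[Node] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)


@dataclass
class WebTable:
    """Assumed shape: a named collection of web tuples."""
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


# Hypothetical example data.
home = Node(url="http://news.example/", title="Home")
article = Node(url="http://news.example/article1.html", title="Article 1")
link = Link(source_url=home.url, target_url=article.url, label="Headline")
table = WebTable(name="News", tuples=[WebTuple(nodes=[home, article], links=[link])])
```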

[Figure: WHOWEDA architecture. Components shown: Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Warehouse Concept Mart, Web Warehouse, Web Marts, Web Querying & Analysis Component, WWW, User.]

[Figure: web information coupling components. Labels shown: WWW, Web Warehouse, Warehouse Concept Mart, Web Query & Display, User, Preprocessing, Global Web Manipulation, Global Web Coupling, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Web Select, Web Project, Web Join, Web Union, Web Intersection, Schema Search, Schema Match, Schema Tightness, Data Visualization.]

Web Schema
- Structural "summary" of a web table
- Information coupling using a query graph
- Query graph -> web schema
- Directed graph as an ordered 4-tuple:
  - Set of node variables
  - Set of link variables
  - Connectivities
  - Predicates

[Figure: structure of the Information Square site. Labels shown: Information Square's homepage, Headline article 1 ... Headline article n, News specials, Airport info (list of video files), List of links to local news, List of links to world news, Local news 1 ... Local news k, World news 1 ... World news t.]

[Figure: query graph over node variables x, y, z, w and link variables e, f, g, h. Annotations shown: label CONTAINS "Local News"; target_url CONTAINS "newshub/specials"; z url CONTAINS "local"; label CONTAINS "World News"; w url CONTAINS "world"; target_url CONTAINS "article"; url CONTAINS "headlines".]

[Figure: example instance. Labels shown: Information Square's homepage, Headline article 1, News specials, List of links to local news, List of links to world news, Local news 1, World news 1.]

Schema - example
- Node variables: Xn = { x, y, z, w }
- Link variables: Xl = { e, f, g }
- Connectivities: C = { x y and x z and x w }

Predicates
P = { x.url = " -square",
      y.url CONTAINS "headlines",
      e.target_url CONTAINS "article",
      f.target_url CONTAINS "newshub/specials",
      g.label CONTAINS "Local News",
      z.url CONTAINS "local",
      h.label CONTAINS "World News",
      w.url CONTAINS "world" }
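The 4-tuple and its predicates can be sketched as a small data structure with a checker. This is an illustrative sketch only, not the WHOWEDA implementation. The (source, link, target) form of a connectivity, the (variable, attribute, operator, value) form of a predicate, and the example URLs are all assumptions. Only two of the predicates from the slide are used.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Assumed encoding: (variable, attribute, operator, value)
Predicate = Tuple[str, str, str, str]


@dataclass
class WebSchema:
    node_vars: Set[str]
    link_vars: Set[str]
    # Assumed encoding: (source node variable, link variable, target node variable)
    connectivities: List[Tuple[str, str, str]]
    predicates: List[Predicate]


def holds(pred: Predicate, binding: Dict[str, Dict[str, str]]) -> bool:
    """Check one predicate against a binding of variables to attribute dicts."""
    var, attr, op, value = pred
    actual = binding.get(var, {}).get(attr)
    if actual is None:
        return False
    if op == "CONTAINS":
        return value in actual
    if op == "=":
        return actual == value
    raise ValueError(f"unsupported operator: {op}")


def satisfies(schema: WebSchema, binding: Dict[str, Dict[str, str]]) -> bool:
    """True if every predicate of the schema holds for the binding."""
    return all(holds(p, binding) for p in schema.predicates)


schema = WebSchema(
    node_vars={"x", "y"},
    link_vars={"e"},
    connectivities=[("x", "e", "y")],  # hypothetical: x links to y via e
    predicates=[
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
    ],
)

# Hypothetical bindings.
good = {
    "y": {"url": "http://news.example/headlines/"},
    "e": {"target_url": "http://news.example/article1.html"},
}
bad = {
    "y": {"url": "http://news.example/sports/"},
    "e": {"target_url": "http://news.example/article1.html"},
}
```

Here `satisfies(schema, good)` is true and `satisfies(schema, bad)` is false, because the second binding fails the "headlines" predicate.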

Query Graph - Example
- A query graph has the same form as a schema.
- Informally, it is a directed, connected graph consisting of nodes and links with keywords imposed on them.
- Example: produce a list of diseases with their symptoms, evaluation procedures and treatments, starting from the web site at
- Web table: Diseases

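[Figure: query graph for the Diseases web table. Node variables: x (List of Diseases), y (Issues), z (Symptoms list), w (Evaluation), q (Treatment list). Link variables: e, f (Symptoms), g (Treatment), p (Evaluation).]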

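[Figure: example web tuple for the Diseases query graph. Instances: x0 (List of Diseases), y1 (Issues), z1 (Symptoms list), w1 (Evaluation), q1 (Treatment list); links e1, f1 (Symptoms), g1 (Treatment), p2 (Evaluation). Other labels shown: AIDS, Elisa Test.]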

Example 2
- Produce a list of drugs and their uses and side effects, starting from the web site at
- Web table: Drugs

[Figure: query graph for the Drugs web table. Node variables a, b, c, d and link variables r, s, k. Labels shown: List of Diseases, Drug list, Issues, Uses, Use, Side effects.]

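[Figure: example web tuple for the Drugs query graph. Instances a0, b1, c1, d1 and links r1, s1, k1. Labels shown: List of Diseases, Drug list, Issues, AIDS, Indavir, Uses of Indavir, Use, Side effects of Indavir.]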

WWW Data Mining
- Web structure mining: mining the structure and links of web documents.
- Web content mining: the automatic search of information resources available online.
- Web usage mining: mining data from server access logs, user registrations or profiles, user sessions or transactions, etc.

Web Structure Mining: Issues
- Measure the frequency of local links (links within the same server) in the web tuples of a web table.
  - Shows which web tuples carry more information about inter-related documents that exist on the same server.
  - Measures the completeness of a web site, in the sense that most closely related information is available at the same site (server).
  - For example, an airline's home page will have more local links connecting routing information with air fares and schedules.

- Measure the frequency of web tuples in a web table that contain interior links (links within the same file).
  - Measures a web document's ability to cross-reference related material within the same document.
  - Measures the flow of the web documents.

- Measure the frequency of web tuples in a web table that contain global links (links that span different web sites).
  - Measures the visibility of web documents and their ability to relate similar or related documents across different sites.
  - For example, research documents on "semi-structured data" are available at many sites, and such sites should be visible to other related sites through cross-references under popular phrases such as "more related links".

- Measure the frequency of identical web tuples that appear within a web table or across web tables.
  - Measures the replication of web documents and may help identify mirrored sites.
- What are the in-degree and out-degree of each node (web document)? What do high and low in- and out-degrees mean?
- Locate links to popular web sites in the web tuples of a table.
- Compare the number of web tuples returned for a query on a popular phrase such as "Bio-science" with the number returned for queries containing keywords like "earth-science".

- Discover the nature of the hyperlinks in the web sites of a particular domain.
  - What information do they provide, and how are they related conceptually?
  - Is it possible to extract conceptual hierarchical information for designing web sites of a particular domain?
- Generalize the flow of information in web sites representing a particular domain.
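The three link categories used in these issues (interior, local, global) can be told apart from the source and target URLs alone. The sketch below assumes that "same file" means the same URL once the fragment is removed, and that "same server" means the same network location. The function names and URLs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple
from urllib.parse import urldefrag, urlsplit


def classify_link(source_url: str, target_url: str) -> str:
    """Return 'interior', 'local' or 'global' for one link."""
    # Interior: both ends are the same document (ignoring #fragments).
    if urldefrag(source_url).url == urldefrag(target_url).url:
        return "interior"
    # Local: different documents on the same server.
    if urlsplit(source_url).netloc.lower() == urlsplit(target_url).netloc.lower():
        return "local"
    # Global: the link spans different web sites.
    return "global"


def link_type_frequencies(links: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Relative frequency of each link type over (source, target) pairs."""
    counts = Counter(classify_link(s, t) for s, t in links)
    total = sum(counts.values())
    kinds = ("interior", "local", "global")
    if total == 0:
        return {kind: 0.0 for kind in kinds}
    return {kind: counts[kind] / total for kind in kinds}


# Hypothetical links taken from the web tuples of one web table.
links = [
    ("http://a.example/p.html", "http://a.example/p.html#top"),
    ("http://a.example/p.html", "http://a.example/q.html"),
    ("http://a.example/p.html", "http://b.example/"),
    ("http://a.example/p.html", "http://b.example/x.html"),
]
frequencies = link_type_frequencies(links)
```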

Web Bags and Web Structure Mining
Most search engines fail to handle the following knowledge discovery goals:
- Locate the most visible web sites or documents for reference: those that can be reached by many paths (high fan-in).
- Locate the most luminous web sites or documents for reference: those with the largest number of outgoing links.
- Find the most traversed path for a particular query result, to identify the set of most popular interlinked web documents that have been traversed frequently to obtain the query result.
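As a rough sketch, visibility and luminosity can be approximated by counting incoming and outgoing links. Using direct in-degree as a stand-in for "many paths" is a simplifying assumption, and the document names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def degrees(links: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Return (in_degree, out_degree) counters for (source, target) links."""
    in_degree: Counter = Counter()
    out_degree: Counter = Counter()
    for source, target in links:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def top_documents(degree: Counter) -> List[str]:
    """All documents that share the highest degree, in sorted order."""
    if not degree:
        return []
    best = max(degree.values())
    return sorted(doc for doc, count in degree.items() if count == best)


# Hypothetical links collected from a query result.
links = [("a", "z1"), ("b", "z1"), ("c", "z1"), ("a", "z2"), ("a", "b")]
in_degree, out_degree = degrees(links)
most_visible = top_documents(in_degree)    # highest fan-in
most_luminous = top_documents(out_degree)  # most outgoing links
```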

Applications of Visibility
- Association rules
- E-commerce

- From the results returned, find the most visible pages.
- Assume Z1 is the most visible page at the given threshold. This gives an estimate of the different restaurants selling pizza.
- A lower threshold gives the set (Z1, Z2) as visible pages, covering restaurants that sell both pizza and pasta.
- Generalize to rules such as: of the 66% of restaurants that offer pizza to their customers, 33% also offer pasta.

E-commerce application
- "My web site's visibility is going down!"

Application - Luminosity
- Association rules such as: of the X% of all companies that make a product "A", Y% also make the set of products "B and C".
- Example: certain companies (33%), if they make product A, also make products B and C, while company C makes only product A. That is, of the 66% of companies that make product "A", 33% also make products B and C.
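Rules of this "X% make A, and Y% of them also make B and C" form can be evaluated with two ratios. The sketch below is a generic illustration, not the authors' algorithm. The company data is made up and is not meant to reproduce the percentages on the slide.

```python
from typing import Dict, Iterable, Set, Tuple


def rule_stats(
    itemsets: Iterable[Set[str]],
    antecedent: Set[str],
    consequent: Set[str],
) -> Tuple[float, float]:
    """Return (share of itemsets containing the antecedent,
    share of those that also contain the consequent)."""
    all_sets = [set(s) for s in itemsets]
    with_antecedent = [s for s in all_sets if antecedent <= s]
    with_both = [s for s in with_antecedent if consequent <= s]
    antecedent_share = len(with_antecedent) / len(all_sets) if all_sets else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return antecedent_share, confidence


# Hypothetical data: the products each company links to on its site.
products_by_company: Dict[str, Set[str]] = {
    "company1": {"A", "B", "C"},
    "company2": {"A"},
    "company3": {"D"},
}
share, confidence = rule_stats(products_by_company.values(), {"A"}, {"B", "C"})
```

With this data, two of the three companies make product A, and half of those also make B and C.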

Web Content Mining
- What does it mean to mine content from the web?
- Is extracting information from a very small subset of all HTML web pages also an instance of web data mining?
- Mining a subset of web pages stored in one or more web tables is a more feasible option.
- Similarities and differences between web content mining in a web warehouse context and conventional data mining.

- Selecting the type of WWW data on which to perform web content mining.
- Cleaning the selected data so that it can be mined effectively.
- Types of knowledge that can be discovered in a web warehouse context.
- Discovering the types of information hidden in a web warehouse that are useful for decision making.
- Specifying, measuring and justifying the interestingness of the discovered knowledge.
- Kinds of knowledge to be discovered: generalized relation, characteristic rule, discriminant rule, classification rule, association rule and deviation rule.

- Are data mining techniques applicable to web mining and, if so, how? For example, we are interested in generating rules of the following type: 40% of the web tuples (i.e., web pages) returned in response to a "travel information from Hong Kong to Macau" query suggest that the popular means of travel is by ferry.
- Deriving additional knowledge in a web warehouse for web content mining.
- Mining previously unknown knowledge in a web warehouse.
- Presenting discovered knowledge to users to expedite complex decision making.

Web Usage Mining
- Discovery of user access patterns from web servers: user profiles, access patterns for pages, etc.
- Used for efficient and effective web site management and for understanding user behavior.
- In WHOWEDA, the user initiates a coupling framework to collect related information.
- For example, coupling a query graph "to find hotel information" with a query graph "to find places of interest".
- From this coupled query graph we can generate user access patterns of the coupling framework, such as "50% of users who query 'hotel' also couple their query with 'places of interest'".

- Find coupled concepts from the coupling framework.
- This helps in organizing web sites.
- For example, web documents that provide information on "hotels" should also have hyperlinks to web pages providing information on "places of interest".
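One way to find coupled concepts is to count how often two query topics are coupled by the same user. In this sketch a user's coupling history is simplified to a set of topic names. That representation, the function name and the sample data are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, Set, Tuple


def coupled_concepts(
    sessions: Iterable[Set[str]], min_fraction: float = 0.5
) -> Dict[Tuple[str, str], float]:
    """For each ordered pair (a, b), the fraction of sessions containing a
    that also contain b, keeping only pairs at or above min_fraction."""
    topic_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for topics in sessions:
        unique = set(topics)
        topic_counts.update(unique)
        pair_counts.update(permutations(sorted(unique), 2))
    result = {}
    for (a, b), together in pair_counts.items():
        fraction = together / topic_counts[a]
        if fraction >= min_fraction:
            result[(a, b)] = fraction
    return result


# Hypothetical coupling histories, one set of topics per user.
sessions = [
    {"hotel", "places of interest"},
    {"hotel"},
    {"hotel", "places of interest"},
    {"hotel", "flights"},
]
patterns = coupled_concepts(sessions, min_fraction=0.5)
```

With this data, half of the users who query "hotel" also couple it with "places of interest", which mirrors the form of the rule on the slide. Pairs such as ("hotel", "flights") fall below the threshold and are dropped.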

Warehouse Concept Mart
- Knowledge discovery in web data becomes more and more complex due to the large amount of data on the WWW.
- Build concept hierarchies involving web data and use them in knowledge discovery.
- A collection of concept hierarchies is called a Warehouse Concept Mart (WCM).
- A concept mart is built by extracting and generalizing terms from web documents to represent the classification knowledge of a given class hierarchy.

- Unclassified words can be clustered based on their common properties. Once the clusters are decided, the keywords can be labeled with their corresponding clusters, and the common features of the terms are summarized to form the concept description.
- Associate a weight with each level of the concept mart to evaluate the importance of a term with respect to its concept level in the concept hierarchy.

Web Concept Mart Applications
Intelligent answering of web queries
- Supply a threshold for a given keyword in the warehouse concept mart; words whose value exceeds the threshold can be taken into consideration when answering the query.
- Use different levels of concepts in the warehouse concept mart, or provide approximate answers.
- Provide the user with some knowledge for framing the global coupling query graph. Example: DBMS and Oracle.
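The threshold idea can be sketched as a lookup over weighted terms. The dictionary layout, the weights and the extra terms are hypothetical. Only the DBMS and Oracle pairing comes from the slide.

```python
from typing import Dict, List


def related_terms(
    concept_mart: Dict[str, Dict[str, float]], keyword: str, threshold: float
) -> List[str]:
    """Terms filed under `keyword` whose weight exceeds `threshold`,
    ordered from highest to lowest weight."""
    terms = concept_mart.get(keyword, {})
    selected = [term for term, weight in terms.items() if weight > threshold]
    return sorted(selected, key=lambda term: -terms[term])


# Hypothetical concept mart: concept -> {term: weight}.
concept_mart = {
    "DBMS": {"Oracle": 0.9, "Sybase": 0.6, "spreadsheet": 0.2},
}
expanded = related_terms(concept_mart, "DBMS", threshold=0.5)
```

A query on "DBMS" could then also consider the terms in `expanded`. Raising the threshold narrows the answer, and lowering it gives a broader, more approximate one.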

Web Mining and Concept Mart
- Use association rule mining techniques to mine the associations between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query.
- Mining knowledge at multiple levels may help WWW users find interesting rules that are otherwise difficult to discover.
- A knowledge discovery process may climb up and step down through different concept levels in the warehouse concept mart, guided by the user's interactions and instructions, including different threshold values.
- Capture the flow of web sites of a particular domain; helpful in locating information.

Conclusions
- Presented web mining issues in the context of the web warehousing project WHOWEDA (Warehouse of Web Data).
- Discussed web mining issues with respect to web structure, web content and web usage.
- Our focus is on designing tools and techniques for web mining that generate useful knowledge from WWW data.
- We are working on formal algorithms to generate association rules and classification rules.