Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.

Outline
- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Review
- Exam Questions

Introduction
“The Web is huge, diverse, and dynamic... we are currently drowning in information and facing information overload.”
Web users encounter problems:
- Finding relevant information
- Creating new knowledge out of the information available on the Web
- Personalization of the information
- Learning about consumers or individual users

Web Mining
“Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.”
Web mining subtasks:
1. Resource finding
2. Information selection and pre-processing
3. Generalization
4. Analysis

Web Mining: Information Retrieval & Information Extraction
- Information Retrieval (IR)
  - the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as possible
- Information Extraction (IE)
  - transforming a collection of documents into information that is more readily digested and analyzed
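
The IR goal stated above (retrieve all relevant documents, and as few non-relevant ones as possible) is commonly quantified with recall and precision. The slide does not name these measures, so the sketch below is an illustration under that assumption; the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document ids the system returned
    relevant:  iterable of document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)  # precision 0.5, recall 2/3
```

Retrieving "all relevant documents" pushes recall up, while retrieving "as few non-relevant as possible" pushes precision up; the two usually trade off against each other.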

Live demo

Web Content Mining: Information Retrieval View
Unstructured documents
- Most approaches use a “bag of words” representation to generate document features
  - ignores the sequence in which the words occur
- Document features can be reduced with selection algorithms
  - e.g. information gain
- Possible alternative document feature representations:
  - word positions in the document
  - phrases/terms (e.g. “annual interest rate”)
Semi-structured documents
- Use additional structural information gleaned from the document
  - HTML markup (intra-document structure)
  - HTML links (inter-document structure)
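
A minimal sketch of the "bag of words" representation and of a phrase-based alternative, assuming a simple lowercase, letters-only tokenizer (real systems tokenize more carefully). The function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Map each word to its count; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(text, n):
    """Count contiguous n-word phrases, which keeps some local word order."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = bag_of_words("the annual interest rate rose")
b = bag_of_words("rose the interest rate annual")
print(a == b)  # True: both sentences have the same bag of words

phrases = ngrams("the annual interest rate rose", 3)
print(phrases["annual interest rate"])  # 1
```

The two sentences are indistinguishable as bags of words, which is exactly the information that phrase or position features try to recover.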

Web content mining, IR view: unstructured documents
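
Information gain, the feature selection criterion mentioned for unstructured documents, can be sketched as the drop in class entropy when documents are split on whether they contain a word. The toy documents and class labels below are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction from splitting documents on presence of `word`.

    docs:   list of sets of words (one set per document)
    labels: list of class labels, aligned with docs
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    # Weighted entropy of the two parts after the split.
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_word, without)
        if part
    )
    return entropy(labels) - remainder


# Made-up training data: two finance documents, two sport documents.
docs = [{"loan", "rate"}, {"rate", "bank"}, {"goal", "match"}, {"match", "team"}]
labels = ["finance", "finance", "sport", "sport"]

gain_rate = information_gain(docs, labels, "rate")  # separates the classes fully
gain_loan = information_gain(docs, labels, "loan")  # separates them only partly
print(gain_rate, gain_loan)
```

Keeping only the words with the highest gain shrinks the feature set while retaining the words that best separate the classes.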

Web content mining, IR view: semi-structured documents
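
For semi-structured documents, the extra structural information comes from HTML markup (intra-document) and HTML links (inter-document). The sketch below pulls both out of one page with Python's standard `html.parser`; the class name, the choice of tags, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collect the title, headings, and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._open = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # inter-document structure
        elif tag in ("title", "h1", "h2", "h3"):
            self._open = tag  # intra-document structure
            if tag != "title":
                self.headings.append("")

    def handle_data(self, data):
        if self._open == "title":
            self.title += data
        elif self._open:
            self.headings[-1] += data

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None


# Hypothetical page used only for this example.
page = """<html><head><title>Rates</title></head>
<body><h1>Annual interest rate</h1>
<p>See the <a href="/loans.html">loan page</a> and
<a href="http://example.org/banks">other banks</a>.</p></body></html>"""

parser = StructureExtractor()
parser.feed(page)
print(parser.title, parser.headings, parser.links)
```

The title and headings could then be weighted more heavily than body text, and the links can feed the link-based analysis covered under Web structure mining.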

Web Content Mining: Database View
“the Database view tries... to transform a Web site to become a database so that... querying on the Web become[s] possible.”
- Uses the Object Exchange Model (OEM)
  - represents semi-structured data as a labeled graph
- Database view algorithms typically start from manually selected Web sites
  - site-specific parsers
- Database view algorithms produce:
  - a document-level schema or DataGuide (a structural summary of semi-structured data)
  - frequent substructures (sub-schemas)
  - a multi-layered database, in which each layer is obtained by generalizing the layers below it
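
A rough sketch of OEM-style data as a labeled graph, together with a simplified structural summary that lists every distinct label path once. This is in the spirit of a DataGuide but is not the actual DataGuide construction algorithm; the object ids, labels, and values are made up, and `label_paths` is a hypothetical helper.

```python
# OEM-style data: each object id maps either to an atomic value (a string)
# or to a list of (label, child_id) edges. All ids and labels are made up.
oem = {
    "root": [("restaurant", "r1"), ("restaurant", "r2")],
    "r1": [("name", "n1"), ("address", "a1")],
    "r2": [("name", "n2"), ("phone", "p2")],
    "a1": [("street", "s1"), ("city", "c1")],
    "n1": "Cafe One",
    "n2": "Cafe Two",
    "p2": "555-0100",
    "s1": "Main Street",
    "c1": "Leuven",
}


def label_paths(graph, root):
    """Return every distinct label path reachable from `root`, each once."""
    paths = set()
    stack = [(root, (), frozenset([root]))]
    while stack:
        node, path, seen = stack.pop()
        edges = graph[node]
        if isinstance(edges, str):  # atomic object: no outgoing edges
            continue
        for label, child in edges:
            new_path = path + (label,)
            paths.add(new_path)
            if child not in seen:  # do not walk around cycles
                stack.append((child, new_path, seen | {child}))
    return paths


paths = label_paths(oem, "root")
for p in sorted(paths):
    print(".".join(p))
```

Although the two restaurant objects have different fields, the summary lists `restaurant.name` only once, which is what makes it usable as a schema for querying data that has no fixed schema.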

Web content mining, Database view

Web Structure Mining
“... we are interested in the structure of the hyperlinks within the Web itself”
- Inspired by the study of social networks and citation analysis
  - based on incoming and outgoing links, specific types of pages (such as hubs and authorities) can be discovered
- Some algorithms calculate the quality or relevance of each Web page
  - e.g. PageRank
- Others measure the completeness of a Web site
  - measuring the frequency of local links on the same server
  - interpreting the nature of the hierarchy of hyperlinks on one domain
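
As an illustration of scoring pages from link structure alone, here is a minimal PageRank sketch using power iteration. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions of this example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up link graph: C is linked to by A, B, and D; nothing links to D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up ranked highest because it has the most incoming links, and D lowest because it has none; no page content is inspected at any point.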

Web Usage Mining
“... focuses on techniques that could predict user behavior while the user interacts with the Web.”
- Web usage is mined by parsing Web server logs; either
  - the log data is mapped into relational tables, then data mining techniques are applied, or
  - the log data is used directly
- Server logs become less accurate when users connect through proxy servers or when users or their ISPs cache Web data
- Two applications:
  - personalized: user profiling or user modeling in adaptive interfaces
  - impersonalized: learning user navigation patterns
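
A small sketch of the "map the log into relational tables" route, assuming the server writes lines in the Common Log Format. The regular expression, the table layout, and the sample lines are assumptions of this example.

```python
import re
import sqlite3

# Assumed Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def load_log(lines):
    """Parse log lines into an in-memory SQLite table named `hits`."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
    for line in lines:
        m = LOG_LINE.match(line)
        if m:  # skip lines that do not fit the assumed format
            db.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?)",
                (m["host"], m["time"], m["path"], int(m["status"])),
            )
    return db


# Made-up log lines. Note that `host` is the connecting address, which may
# be a proxy shared by many users rather than one individual user.
sample = [
    '10.0.0.1 - - [23/Apr/2014:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [23/Apr/2014:10:00:09 +0000] "GET /loans.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [23/Apr/2014:10:01:30 +0000] "GET /index.html HTTP/1.1" 200 512',
    "not a log line",
]
db = load_log(sample)
rows = db.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(rows)  # [('/index.html', 2), ('/loans.html', 1)]
```

Once the hits are in a table, ordinary queries or data mining tools can be applied to them. Requests served from a cache never reach the server and so never appear in the table, which is the accuracy problem noted above.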

Review
- Web mining
  - 4 subtasks
  - IR & IE
- Web content mining
  - primarily intra-page analysis
  - IR view vs DB view
- Web structure mining
  - primarily inter-page analysis
- Web usage mining
  - primarily analysis of server activity logs

Web mining categories
Web Content Mining, IR View
- View of Data: Unstructured; Semi structured
- Main Data: Text documents; Hypertext documents
- Representation: Bag of words, n-grams; Terms, phrases; Concepts of ontology; Relational
- Method: TFIDF and variants; Machine learning; Statistical (incl. NLP)
- Application Categories: Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling
Web Content Mining, DB View
- View of Data: Semi structured; Web site as DB
- Main Data: Hypertext documents
- Representation: Edge-labeled graph (OEM); Relational
- Method: Proprietary algorithms; ILP; (modified) association rules
- Application Categories: Finding frequent sub-structures; Web site schema discovery
Web Structure Mining
- View of Data: Links structure
- Main Data: Links structure
- Representation: Graph
- Method: Proprietary algorithms
- Application Categories: Categorization; Clustering
Web Usage Mining
- View of Data: Interactivity
- Main Data: Server logs; Browser logs
- Representation: Relational table; Graphs
- Method: Machine learning; Statistical; (modified) association rules
- Application Categories: Site construction, adaptation, and management; Marketing; User modeling
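
The "TFIDF and variants" method listed for the IR view can be sketched as follows. This uses one common variant (relative term frequency times the natural log of N over document frequency); other weightings exist, and the three toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    docs: list of non-empty token lists.
    Weight = (term count / document length) * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


docs = [
    "web mining uses data mining".split(),
    "web search engines rank pages".split(),
    "server logs record web usage".split(),
]
w = tfidf(docs)
print(w[0]["web"])     # 0.0: "web" occurs in every document
print(w[0]["mining"])  # highest weight in the first document
```

A word that appears in every document gets weight zero, while a word that is frequent in one document and rare elsewhere gets a high weight, which is what makes the weights useful for categorization and clustering.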

Exam Question 1
Q: Of the following Web mining paradigms:
- Information Retrieval
- Information Extraction
which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevance to the search query.

Exam Question 2
Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.
A: Users connecting to a Web site through a proxy server, or users (or their ISPs) caching Web data, decrease server log accuracy. Accurate server logs are required for accurate Web usage mining.

Exam Question 3
Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?
A: “Bag of words” representation.