Web Mining Research: A Survey


Web Mining Research: A Survey
Raymond Kosala and Hendrik Blockeel
ACM SIGKDD, July 2000
Presented by Drew DeHaas

Outline
- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Motivation for Web Mining
- The World Wide Web is a popular, interactive medium for disseminating information
- It is also huge, diverse, and dynamic, which raises issues of scalability, multimedia data, and temporal information
- Both information users and information providers face problems due to the nature of the Web

Problems: Information Users
- Finding relevant information
  - Relevant search results are hard to come by
  - Inability to index all of the information on the Web
- Creating new knowledge out of the information available on the Web
  - Extract knowledge out of collected data
- Personalizing the information available
  - Catering to personal preferences in content and presentation

Problem: Information Providers
- The main problem that information providers face is learning about consumers/users
  - What does the customer do?
  - What does the customer want to do?
- Personalizing to individual users
- Using Web data to effectively market products and/or services

Other Approaches
- Web mining is not the only approach
- Database approach
- Information retrieval
- Natural language processing
  - In-depth syntactic and semantic analysis
- Web document community
  - Standards, manually appended meta-information, maintained directories, etc.

Direct vs Indirect Web Mining
Web mining techniques can be used to solve the information overload problems:
- Directly
  - Attack the problem with Web mining techniques
  - E.g. a newsgroup agent classifies news as relevant
- Indirectly
  - Used as part of a bigger application that addresses the problems
  - E.g. used to create index terms for a Web search service

The Research
- Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
- The paper focuses on research from the machine learning point of view

Web Mining: Definition
- “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
- Can be viewed as four subtasks
- Not the same as Information Retrieval
- Not the same as Information Extraction

Web Mining: Subtasks
- Resource finding
  - Retrieving intended documents
- Information selection/pre-processing
  - Select and pre-process specific information from selected documents
- Generalization
  - Discover general patterns within and across Web sites
- Analysis
  - Validation and/or interpretation of mined patterns

Web Mining: Not IR or IE
- Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as possible
- Web document classification, which is a Web mining task, could be part of an IR system (e.g. indexing for a search engine)
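The IR goal stated on this slide (retrieve all relevant documents, and as few non-relevant ones as possible) is usually quantified with recall and precision. The slides do not name these measures, so the sketch below is an illustration with made-up document identifiers, not something taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]
print(precision_recall(retrieved_docs, relevant_docs))
```

High recall corresponds to "all relevant documents", and high precision to "as few non-relevant as possible".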

Web Mining: Not IR or IE
- Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select the relevant documents
- IE systems for the general Web are not feasible
- Most focus on specific Web sites or content

Web Mining and Machine Learning
- Web mining is not the same as learning from the Web
  - Some applications of machine learning on the Web are not Web mining
  - Some methods other than machine learning are used for Web mining
- However, there is a close relationship between Web mining and machine learning

Web Mining Categories
- Web Content Mining
  - Discovering useful information from Web contents/data/documents
- Web Structure Mining
  - Discovering the model underlying link structures on the Web
- Web Usage Mining
  - Trying to make sense of data generated by Web surfers' sessions or behaviors

Web Mining: The Agent Paradigm
- User interface agents
  - Information retrieval agents, information filtering agents, and personal assistant agents
- Distributed agents
  - Distributed agents for knowledge discovery or data mining
  - Problem solving by a group of agents

Web Mining: The Agent Paradigm
- Content-based approach
  - The system searches for items that match, based on an analysis of the content using the user preferences
- Collaborative approach
  - The system tries to find users with similar interests
  - Recommendations are given based on what similar users did
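The collaborative approach on this slide can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: user interests are modelled as sets of visited pages, similarity is Jaccard overlap, and only the single most similar user is consulted. The user names, page paths, and the `recommend` helper are hypothetical, and the slides do not prescribe any particular similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, visits):
    """Suggest pages seen by the most similar other user but not by target."""
    others = [user for user in visits if user != target]
    if not others:
        return set()
    # Find the user whose visited pages overlap most with the target's.
    nearest = max(others, key=lambda user: jaccard(visits[target], visits[user]))
    return visits[nearest] - visits[target]


# Hypothetical usage data: user -> set of pages visited.
visits = {
    "alice": {"/news", "/sports", "/tech"},
    "bob": {"/news", "/tech", "/reviews"},
    "carol": {"/cooking", "/travel"},
}
print(recommend("alice", visits))
```

A content-based system would instead compare item content against a stored profile of the user's own preferences, without looking at other users.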

Web Content Mining: Intro
Motivations
- Most of the data on the Internet is accessible through the Web
- Digital libraries are becoming prevalent
- Businesses and services are moving “online”
- Applications are moving from the “desktop” to the Web

Web Content Mining: Intro
- Types of data dealt with
  - Textual, image, audio, video, metadata, hyperlinks
- Multimedia mining
  - Can be an instance of Web mining
- Hidden data
  - Dynamic or private
- Unstructured (free text), semi-structured (HTML, etc.), and structured (data in tables, or pages generated from a database)

Web Content Mining: IR View
Unstructured Documents
- Bag of words, or phrase-based feature representation
- Features can be boolean or frequency based
- Features can be reduced using different feature selection techniques
  - Word stemming, which combines morphological variations into one feature
- Possibly use n-gram representations (encodes some context)
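The feature representations listed on this slide can be illustrated with a short sketch. It assumes a naive lowercase, letters-only tokenizer and omits stemming, which would normally need a dedicated stemmer; the function names are hypothetical helpers rather than anything from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, boolean=False):
    """Map each word to its count (frequency) or to 1 (boolean presence)."""
    counts = Counter(tokenize(text))
    if boolean:
        return {word: 1 for word in counts}
    return dict(counts)


def ngrams(text, n=2):
    """Return word n-grams, which keep some local word order (context)."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "Web mining mines the Web"
print(bag_of_words(doc))                # frequency-based features
print(bag_of_words(doc, boolean=True))  # boolean features
print(ngrams(doc))                      # bigrams
```

Note that "mining" and "mines" stay separate features here; stemming would merge such morphological variants into one feature.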

Web Content Mining: IR View
Semi-Structured Documents
- Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
- Uses common data mining methods (whereas unstructured documents might use more text mining methods)

Web Content Mining: DB View
- Tries to infer the structure of a Web site or transform a Web site to become a database
  - Better information management
  - Better querying on the Web
- Can be achieved by:
  - Finding the schema of Web documents
  - Building a Web warehouse
  - Building a Web knowledge base
  - Building a virtual database

Web Content Mining: DB View
- Mainly uses the Object Exchange Model (OEM)
  - Represents semi-structured data (some structure, no rigid schema) by a labeled graph
- Process typically starts with manual selection of Web sites for content mining
- Main application: building a structural summary of semi-structured data (schema extraction or discovery)
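As a rough illustration of a structural summary, the sketch below collects the distinct label paths that occur in a piece of semi-structured data. It is a simplification under stated assumptions: the data is modelled as tree-shaped nested dicts and lists rather than a general OEM labeled graph, and the sample records and the `label_paths` helper are invented for the example.

```python
def label_paths(node, prefix=()):
    """Collect every distinct path of labels reachable from the root."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # List items share the label of their parent edge.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data: the two records do not share a schema.
data = {
    "restaurant": [
        {"name": "A", "address": {"city": "X"}},
        {"name": "B", "address": "Y", "phone": "000"},
    ]
}
for path in sorted(label_paths(data)):
    print(".".join(path))
```

The resulting path set acts as a loose schema: it shows which labels can appear where, even though no single record contains all of them.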

Web Structure Mining
- Interested in the structure between Web documents (not within a document)
- Inspired by the study of social networks and citation analysis
- Example: PageRank (Google)
- Applications:
  - Discovering micro-communities in the Web
  - Measuring the “completeness” of a Web site
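To make the PageRank example concrete, here is a minimal power-iteration sketch of a PageRank-style score on a tiny link graph. It is a textbook simplification, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank-style scores for a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items()):
    print(page, round(score, 4))
```

Page C is linked from three pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.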

Web Usage Mining
- Tries to predict user behavior from interaction with the Web
- Wide range of data (logs)
  - Web client data
  - Proxy server data
  - Web server data
- Two common approaches
  - Map usage data into relational tables and use adapted data mining techniques
  - Use log data directly by utilizing special pre-processing techniques

Web Usage Mining
- Typical problems:
  - Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
- Usage mining often uses some background or domain knowledge
  - E.g. site topology, Web content, etc.
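A common pre-processing step behind the session problem on this slide is to split each visitor's requests into sessions using an inactivity timeout. The sketch below assumes a simplified log of `(user_id, timestamp_in_seconds, url)` tuples and a 30-minute timeout; both are illustrative choices, and using an IP address as the user id is exactly the kind of approximation that caching and proxy servers make unreliable.

```python
from collections import defaultdict


def sessionize(records, timeout=1800):
    """Group (user_id, timestamp, url) records into per-user sessions.

    A new session starts when the gap between two consecutive requests
    from the same user exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, timestamp, url in records:
        by_user[user].append((timestamp, url))

    sessions = []
    for user in sorted(by_user):
        current, last_seen = [], None
        for timestamp, url in sorted(by_user[user]):
            if last_seen is not None and timestamp - last_seen > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_seen = timestamp
        sessions.append((user, current))
    return sessions


# Hypothetical, already-parsed server log entries.
log = [
    ("10.0.0.1", 0, "/index"),
    ("10.0.0.2", 10, "/index"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.1", 4000, "/index"),
]
for user, pages in sessionize(log):
    print(user, pages)
```

Background knowledge such as site topology could refine this further, for example by starting a new session when a request cannot be reached by a link from the previous page.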

Web Usage Mining
Two main categories:
- Learning a user profile (personalized)
  - Web users would be interested in techniques that learn their needs and preferences automatically
- Learning user navigation patterns (impersonalized)
  - Information providers would be interested in techniques that improve the effectiveness of their Web site or steer users towards the goals of the site

Conclusions
- Tried to resolve confusion with regard to the term Web mining
  - Differentiated it from IR and IE
- Suggested three Web mining categories: content, structure, and usage mining
- Briefly described approaches for the three categories
- Explored the connection with the agent paradigm

Exam Question #1
Question: Outline the main characteristics of Web information.
Answer: Web information is huge, diverse, and dynamic.

Exam Question #2
Question: How can data mining techniques be used in Web information analysis? Give at least two examples.
Answer:
- Classification: classifying server log data with a decision tree or naïve Bayes classifier to discover the profiles of users belonging to a particular class
- Clustering: clustering can be used to group users exhibiting similar browsing patterns
- Association analysis: association analysis can be used to relate pages that are most often referenced together in a single server session
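The association analysis example in this answer can be sketched by counting how often two pages occur in the same server session. The sessions below are made up, and the sketch computes only pairwise support (the fraction of sessions containing both pages); a full association rule miner such as Apriori would also handle larger itemsets and confidence thresholds.

```python
from collections import Counter
from itertools import combinations


def page_pair_support(sessions):
    """Return the fraction of sessions in which each pair of pages co-occurs."""
    counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical server sessions: each is the list of pages one visitor requested.
sessions = [
    ["/index", "/products", "/cart"],
    ["/index", "/products"],
    ["/index", "/about"],
    ["/products", "/cart"],
]
support = page_pair_support(sessions)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Pairs with high support are candidates for being linked more directly or presented together on the site.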

Exam Question #3
Question: What are the three main areas of interest for Web mining?
Answer: (1) Web Content (2) Web Structure (3) Web Usage

Questions?