Web Mining. By Uday Kumar, 8/12/10.


8/12/10 By Uday Kumar WEB MINING

Agenda
» World Wide Web: a brief history
» Introduction to Data Mining
» Data Mining Process & Techniques
» Web Mining
» Data Mining vs. Web Mining
» Classification of Web Mining
» Benefits & Application Areas of Web Mining
» Web Mining Software
» Summary

World Wide Web: a brief history
Characteristics of the Web:
» Billions of documents, authored by millions of diverse people
» Distributed over millions of computers, connected by a variety of media
» Large size, dynamic content, a time dimension, and multilingual content
» Different data types: text, images, hyperlinks, and user usage information
Who invented the World Wide Web?
» (Sir) Tim Berners-Lee invented the World Wide Web in 1989 while working at CERN, including the URL scheme and HTML. In 1990 he wrote the first server (httpd) and the first browser.

Mining Large Data Sets: Motivation
» There is often information "hidden" in the data that is not readily evident
» Human analysts may take weeks to discover useful information
» Much of the data is never analyzed at all

Data Mining

Data Mining: Definition
» Data mining is commonly defined as the process of extracting meaningful information from data sources such as databases, texts, images, and the web.
» It is the automated extraction of predictive information from large data banks, which helps us understand current market trends and take proactive measures to gain maximum benefit from them.

Data Mining Process

Data Mining Tasks
» Data mining uses various algorithms to perform a variety of tasks. These algorithms examine the sample data of a problem and determine a model that comes close to solving it.
» A predictive model lets you predict data values by using known results from a different set of sample data. The tasks that belong to the predictive model are:
  - Classification
  - Regression
  - Time series analysis

Data Mining Tasks (contd.)
» A descriptive model lets you determine the patterns and relationships in sample data. The tasks that belong to the descriptive model are:
  - Clustering
  - Summarization
  - Association rules
  - Sequence discovery

Data Mining Tasks (contd.)
» Classification: assigns data in a large data bank to a predefined set of classes. Example: people younger than 40 with a salary above 40k trade online.
» Regression: forecasts data values from present and past values. Example: helping an organization predict its need for new recruits and purchases from its past and current growth rate.
» Time series analysis: predicts future values when the current set of values is time dependent (monthly, yearly, ...).
» Summarization: condenses a large chunk of data, such as the content of a web page, into a summary.
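The regression task above can be sketched as a least-squares trend line fitted to past values and extended one step ahead. This is only an illustration under stated assumptions: the function names `fit_line` and `forecast` and the yearly headcount figures are hypothetical and not taken from the slides.

```python
def fit_line(ys):
    """Least-squares line y = a + b*t through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(ys, steps_ahead=1):
    """Extend the fitted trend line past the last observed value."""
    intercept, slope = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)


# Hypothetical yearly headcount of an organization (made-up numbers).
headcount = [100, 110, 120, 130]
print(forecast(headcount))  # 140.0
```

A straight line is the simplest possible model. Real growth data may call for a different curve or for the time series methods listed on the same slide.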

Data Mining Tasks (contd.)
» Clustering: creates new groups (clusters) by studying the patterns and relationships between data values in a data bank. It is similar to classification but does not require predefined groups (it is also called unsupervised learning). Example: users A and B access similar URLs.
» Association rules: derive rules of association between data items and then use those rules to establish relationships. Example: find the items that tend to be purchased together and specify their relationship.
» Sequence discovery: determines the sequential patterns that might exist in a large, unorganized data bank. Example: crime detection.
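As a rough illustration of association rules, the sketch below counts how often pairs of items occur together and reports rules "lhs -> rhs" with their support (the share of baskets that contain both items) and confidence (the share of baskets with lhs that also contain rhs). The function name `pair_rules`, the thresholds, and the baskets are assumptions made for this example. A production system would typically use an algorithm such as Apriori rather than this brute-force pair scan.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs that pass both thresholds."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def support(itemset):
        # Fraction of baskets that contain every item in itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    rules = []
    for a, b in combinations(items, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s_ab / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, s_ab, confidence))
    return rules


# Hypothetical shopping baskets (made-up data).
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
for lhs, rhs, s, c in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```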

Data Mining Techniques
» Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows on the surface. Any technique that helps extract more from your data is useful. Data mining techniques include:
  - Statistical techniques: statistics is the branch of mathematics that deals with the collection and analysis of numerical data.
  - Machine learning: the process of building a computer system that can acquire data and integrate it to generate useful knowledge.
  - Decision trees: a tree-shaped structure in which each branch represents a classification question and each leaf represents a partition of the classified data.

Data Mining Techniques (contd.)
» Hidden Markov models: predict future actions in a time series. Given the present and previous events, the model provides the probability of a future event.
» Neural networks: analyze a large set of historical data in order to predict the outcome of a particular future situation or problem.
» Genetic algorithms: given a set of sample data, a genetic algorithm (GA) searches a set of candidate models for the one that best represents the data.
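The idea of estimating the probability of a future event from the present one can be sketched with a plain first-order Markov chain over observed page visits. Note the simplification: a true hidden Markov model also has unobserved states and emission probabilities, which this sketch leaves out. The function name `transition_probabilities` and the sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next event | current event) from observed event sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical browsing sessions (made-up data).
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
]
probs = transition_probabilities(sessions)
print(probs["home"])      # estimated distribution over the page visited after "home"
print(probs["products"])  # estimated distribution over the page visited after "products"
```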

Data Mining vs. Web Mining
» Traditional data mining
  - Data is structured and relational
  - Well-defined tables, columns, rows, keys, and constraints
» Web data
  - Semi-structured (HTML documents) and unstructured (free text)
  - Readily available
  - Rich in features and patterns

Problems when interacting with the Web
» Finding relevant information
» Creating new knowledge out of the information available on the Web
» Personalization of the information
» Learning about consumers or individual users

Web Mining

Web Mining: Definition
» "Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data."
» The web mining process is similar to the data mining process. The difference usually lies in the data collection.
» In data mining, the data is often already collected and stored in a data warehouse.
» In web mining, data collection can be a substantial task, especially for web structure and content mining, which involve crawling a large number of target web pages.

Web Mining: Subtasks
» Resource finding
  - Retrieving the intended documents
» Information selection and pre-processing
  - Selecting and pre-processing specific information from the selected documents
» Generalization
  - Discovering general patterns at individual web sites as well as across multiple web sites
» Analysis
  - Validation and/or interpretation of the mined patterns

Web Mining (contd.)
» Web mining is not IR
  - Information retrieval (IR) is the automatic retrieval of all relevant documents while retrieving as few non-relevant documents as possible
» Web mining is not IE
  - Information extraction (IE) aims to extract the relevant facts from given documents
  - IE systems for the general Web are not feasible
  - Most focus on specific web sites or content

Classification of Web Mining

Web Usage Mining
» Web usage mining refers to the discovery of user access patterns from web usage logs, which record every click made by each user.
» The usage data records the user's behavior while browsing or making transactions on the web site. Mining it helps to better understand and serve the needs of users and Web-based applications.
» It involves the automatic discovery of patterns from one or more web servers.

Web Usage Mining (contd.)
» Organizations often generate and collect large volumes of data. Most of it is generated automatically by web servers and collected in server logs.
» Analyzing such data can help these organizations determine:
  - the value of particular customers
  - cross-marketing strategies across products
  - the effectiveness of promotional campaigns, etc.
» Typical sources of data:
  - automatically generated data stored in server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
  - user profiles
  - metadata: page attributes, content attributes, usage data

8/12/10 Web Usage Mining Contd..  The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as: the number of accesses to the server the times or time intervals of visits the domain names and the URLs of users of the Web server.  Two main categories: Learning a user profile (personalized) Web users would be interested in techniques that learn their needs and preferences automatically Learning user navigation patterns (impersonalized) Information providers would be interested in techniques that improve the effectiveness of their Web site or biasing the users towards the goals of the site

8/12/10 Web Usage Mining Contd..  Web servers, Web proxies, and client applications can quite easily capture Web Usage data. Web server log: Every visit to the pages, what and when files have been requested, the IP address of the request, the error code, the number of bytes sent to user, and the type of browser used…  By analyzing the Web usage data, web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests which has various applications: Personalization and Collaboration in Web-based systems Marketing Web site design and evaluation Decision support

Web Server Log: A Sample

Web Usage Mining (contd.)
» Web log mining is the technique of retrieving visitor information from web server log files and applying it to data analysis.
» The major types of log files are:
  - Access log: maintains a list of all the web pages that visitors have requested.
  - Agent log: records information about the browser used to explore the web pages.

Web Content Mining
» Web content mining (WCM) extracts or mines useful information or knowledge from the contents of web pages.
» Patterns are extracted from online sources such as:
  - HTML files
  - Text documents
  - Images
  - E-books or messages
  - Audio or video
» WCM is far broader than searching for a specific term, extracting keywords, or computing simple statistics of words and phrases in documents.
» A WCM tool can summarize a web page so that you need not read the complete document, saving time and energy.

Web Content Mining (contd.)
» The two basic approaches or models for implementing WCM are:
  - Local knowledge base model: abstract characterizations of several web pages are stored locally. That is, references to several web sites in each category are stored in a database, and the search is performed within the web sites of the selected category.
  - Agent-based model: this approach uses artificial intelligence systems known as web agents, which search on behalf of a particular user to discover and organize documents on the web. Some web agents can apply individual user profiles when searching the web, and can organize and interpret the information they discover.

Preprocessing Content
» Content preparation:
  - Extract text from HTML.
  - Perform stemming.
  - Remove stop words.
  - Calculate collection-wide word frequencies (DF).
  - Calculate per-document term frequencies (TF).
» Vector creation:
  - A common information retrieval technique.
  - Each document (HTML page) is represented by a sparse vector of term weights.
  - Typically, additional weight is given to terms appearing as keywords or in titles.
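The preparation steps above can be strung together as in the following sketch, which uses only the Python standard library. Several parts are stand-in assumptions rather than anything from the slides: the tiny stop-word list, the suffix-stripping `crude_stem` (not a real stemmer such as Porter's), and the choice of TF × log(N/DF) as the term weight. The extra weighting for titles and keywords is left out, and text inside script or style elements is not filtered.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on"}


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def crude_stem(word):
    """Strip a few common suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    words = re.findall(r"[a-z]+", html_to_text(html).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(pages):
    """Return one sparse {term: weight} vector per page."""
    term_counts = [Counter(tokenize(page)) for page in pages]  # per-document TF
    df = Counter()                                             # collection-wide DF
    for counts in term_counts:
        df.update(counts.keys())
    n = len(pages)
    # Terms that occur in every page get weight 0 and are dropped to keep vectors sparse.
    return [
        {term: tf * math.log(n / df[term]) for term, tf in counts.items() if df[term] < n}
        for counts in term_counts
    ]


# Made-up pages for illustration.
pages = [
    "<html><title>Web Mining</title><p>Mining the web logs</p></html>",
    "<p>Cooking pasta and mining salt</p>",
]
for vector in tfidf_vectors(pages):
    print(vector)
```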

Common Mining Techniques
» The more basic and popular data mining techniques include:
  - Classification: applied to server logs, using decision trees or a naive Bayes classifier, to discover the profiles of users belonging to a particular category.
  - Clustering: groups users who exhibit similar browsing patterns.
  - Associations: relate pages that are most often referenced together in a single server session.
» Other significant ideas are:
  - Topic identification, tracking, and drift analysis
  - Concept hierarchy creation
  - Relevance of content

Web Structure Mining
» Web structure mining discovers useful knowledge from hyperlinks, which represent the structure of the web.
» It can be divided into two kinds:
  - Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different location.
  - Mining the document structure: using the tree-like structure to analyze and describe the HTML or XML tags within a web page.
» It is also the process of using graph theory to analyze the node and connection structure of a web site.
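Both kinds of structure mining start from the same raw material: the tags of a page. As a small sketch, the class below walks an HTML document with the standard library's `html.parser`, collects the out-links (the `href` values of `a` tags), and tracks how deeply tags are nested as a crude view of the document tree. The class name `StructureScanner`, the short list of void tags, and the sample page are assumptions for illustration. Badly nested real-world HTML would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StructureScanner(HTMLParser):
    """Collect hyperlinks and the maximum tag nesting depth of a page."""

    def __init__(self):
        super().__init__()
        self.out_links = []
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.out_links.append(href)
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


# Made-up page for illustration.
page = (
    '<html><body><p>See <a href="/a.html">A</a> and '
    '<a href="http://example.org/b">B</a><br></p></body></html>'
)
scanner = StructureScanner()
scanner.feed(page)
print(scanner.out_links, scanner.max_depth)
```

Running the scanner over every page of a site and recording each page's out-links yields the link graph that the graph-theoretic analysis works on.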

Web Structure Mining (contd.)
» Web structure is a useful source for extracting information such as:
  - Web page classification: classifying web pages according to various topics
  - Quality of a web page: the authority of a page on a topic, and the ranking of web pages
  - Which pages to crawl: deciding which web pages to add to the collection of web pages
  - Finding related pages: given one relevant page, find all related pages

Web Structure Mining (contd.)
» Hyperlink-Induced Topic Search (HITS) is a common algorithm for knowledge discovery on the Web. The concept of HITS is

Web Structure Mining
» Identification of:
  - Authorities: authoritative, high-quality web pages on broad topics
  - Hubs: web pages that link to a collection of authorities
» A good authority is pointed to by many good hubs
» A good hub points to many good authorities
» Web structure mining has been largely influenced by research in:
  - Social network analysis
  - Citation analysis (bibliometrics)
» In-links: the hyperlinks pointing to a page
» Out-links: the hyperlinks found in a page
» Usually, the larger the number of in-links, the better a page is.
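The mutual reinforcement between hubs and authorities can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both score vectors rescaled after each round. This is a minimal sketch in the spirit of HITS, not a full implementation: it skips the query-specific step of building a focused subgraph, and the function name `hits`, the fixed iteration count, and the toy graph are assumptions.

```python
def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to.

    Returns (hub_scores, authority_scores) as dictionaries.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that point to n (its in-links).
        auth = {
            n: sum(hub[src] for src, targets in graph.items() if n in targets)
            for n in nodes
        }
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the pages that n points to (its out-links).
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Made-up link graph: h1..h3 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

In this toy graph, `a1` has three in-links and `a2` has two, so `a1` ends up with the higher authority score, while `h1` and `h2` outrank `h3` as hubs because they point to both authorities.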

Web Structure Mining (contd.)
» Each web page is a node of the web graph.
» The out-degree of a node is the number of distinct links originating at that node and pointing to other nodes.
» The probability, at any step, that the person browsing will continue following links is the damping factor, d = 0.85.
» N is the number of web pages.
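The quantities on this slide (out-degree, damping factor d = 0.85, and N) are the ingredients of the PageRank computation, although the formula itself is not preserved in the transcript. A commonly stated form is PR(p) = (1 - d)/N + d × Σ PR(q)/outdegree(q), summed over the pages q that link to p. The sketch below iterates that form. It should be read as one standard variant under stated assumptions, not necessarily the exact formula the slide showed: the function name `pagerank`, the uniform redistribution of rank from pages without out-links, and the toy graph are all assumptions.

```python
def pagerank(graph, d=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in nodes if not graph.get(p))
        new_rank = {}
        for p in nodes:
            # Each page q linking to p passes on an equal share of its rank
            # per distinct out-link.
            incoming = sum(
                rank[q] / len(set(graph[q])) for q in graph if p in graph[q]
            )
            new_rank[p] = (1 - d) / n + d * (incoming + dangling / n)
        rank = new_rank
    return rank


# Made-up three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

With this normalization the scores sum to 1, so each can be read as the long-run share of time a random surfer spends on that page.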

Application Areas of Web Mining
» Application areas:
  - E-commerce
  - Search engines
  - Personalization
  - Website design
» Web mining applications:
  - Amazon.com
  - Google
  - DoubleClick
  - AOL
  - eBay
  - MyYahoo
  - CiteSeer
  - i-mode
  - v-TAG Web Mining Server

Applications (contd.)
» Amazon: a host of web mining techniques, such as associations between pages visited and click-path analysis, are used to improve the customer's experience during a "store visit".
» Knowledge gained from web mining is the key intelligence behind Amazon features such as "instant recommendations", "purchase circles", and "wish lists".

Applications (contd.)
» Google: earlier search engines concentrated on web content to return the pages relevant to a query.
» Google was among the first search engines to stress the importance of link structure in mining information from the web.
» PageRank, which measures the importance of a page, is the underlying technology in all Google search products.
» PageRank makes use of the structural information of the web graph and is the key to returning quality results relevant to a query.

Benefits of Web Mining
» Match your available resources to visitor interests
» Increase the value of each visitor
» Improve the visitor's experience at the website
» Perform targeted resource management
» Collect information in new ways
» Test the relevance of content and web site architecture

Web Mining Software
» Web Miner
» Sinope Summarizer
» Teleport Pro
» ClickTracks

Summary
» Major limitations of web mining research:
  - It is difficult to collect web usage data across different web sites.
  - There is a lack of suitable test collections that researchers can reuse.
» Future research directions:
  - Multimedia data mining: a picture is worth a thousand words.
  - Multilingual knowledge extraction: web page translations.
  - The hidden Web: forms and dynamically generated web pages.
  - Semantic Web
  - Wireless Web: WML and HDML.
