Web Mining Web Mining is the use of the data mining techniques to automatically discover and extract information from web documents/services Discovering.

Slides:



Advertisements
Similar presentations
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Advertisements

Web Mining.
The Internet and the Web
Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
CS 345A Data Mining Lecture 1
Building an Intelligent Web: Theory and Practice Pawan Lingras Saint Mary’s University Rajendra Akerkar American University of Armenia and SIBER, India.
CS 345A Data Mining Lecture 1 Introduction to Web Mining.
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
The Web is perhaps the single largest data source in the world. Due to the heterogeneity and lack of structure, mining and integration are challenging.
Media Planning: Advertising and the Internet
CS 345 Data Mining Lecture 1 Introduction to Web Mining.
Web Usage Mining - W hat, W hy, ho W Presented by:Roopa Datla Jinguang Liu.
Overview of Web Data Mining and Applications Part I
Internet Research Search Engines & Subject Directories.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
CSE Data Mining, 2002Lecture 11.1 Data Mining - CSE5230 Web Mining CSE5230/DMS/2002/11.
Chapter 1 Introduction to Data Mining
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Data Mining By Dave Maung.
Data Mining for Web Intelligence Presentation by Julia Erdman.
1 Of Crawlers, Portals, Mice and Men: Is there more to Mining the Web? Jiawei Han Simon Fraser University, Canada ACM-SIGMOD’99 Web Mining Panel Presentation.
WEB MINING. In recent years the growth of the World Wide Web exceeded all expectations. Today there are several billions of HTML documents, pictures and.
Meet the web: First impressions How big is the web and how do you measure it? How many people use the web? How many use search engines? What is the shape.
Chapter Twelve Digital Interactive Media Arens|Schaefer|Weigold Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
User Modeling and Recommender Systems: Introduction to recommender systems Adolfo Ruiz Calleja 06/09/2014.
1 Data Mining at work Krithi Ramamritham. 2 Dynamics of Web Data Dynamically created Web Pages -- using scripting languages Ad Component Headline Component.
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services
Smart Web Search Agents Data Search Engines >> Information Search Agents - Traditional searching on the Web is done using one of the following three: -
Web-Content Mining -Akanksha Dombe. Specifies  The WWW is huge, widely distributed, global information service centre for  Information services: news,
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Data mining in web applications
22C:145 Artificial Intelligence
Data Mining – Intro.
DATA MINING Introductory and Advanced Topics Part III – Web Mining
CCT356: Online Advertising and Marketing
SEARCH ENGINES & WEB CRAWLER Akshay Ghadge Roll No: 107.
Web Mining Ref:
Introduction to Web Mining
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Adrian Tuhtan CS157A Section1
Text & Web Mining 9/22/2018.
Mining Dynamics of Data Streams in Multi-Dimensional Space
Search Engines & Subject Directories
Restrict Range of Data Collection for Topic Trend Detection
WIRED Week 2 Syllabus Update Readings Overview.
Data Warehousing and Data Mining
I don’t need a title slide for a lecture
Trust and Culture on the Web
What is a Search Engine EIT, Author Gay Robertson, 2017.
Data Mining: Concepts and Techniques
Data Mining Chapter 6 Search Engines
Supporting End-User Access
Data Mining: Concepts and Techniques
PROJECTS SUMMARY PRESNETED BY HARISH KUMAR JANUARY 10,2018.
Web Mining Department of Computer Science and Engg.
Data Mining: Introduction
Search Engines & Subject Directories
Search Engines & Subject Directories
CS 345A Data Mining Lecture 1
Data Mining: Concepts and Techniques
CS 345A Data Mining Lecture 1
Introduction to Web Mining
CS 345A Data Mining Lecture 1
CSE591: Data Mining by H. Liu
PolyAnalyst™ text mining tool Allstate Insurance example
Presentation transcript:

Web Mining Web Mining is the use of the data mining techniques to automatically discover and extract information from web documents/services Discovering useful information from the World-Wide Web and its usage patterns My Definition: Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web

Web Mining Data Mining Techniques Applications to the Web Association rules Sequential patterns Classification Clustering Outlier discovery Applications to the Web E-commerce Information retrieval (search) Network management

Examples of Discovered Patterns Association rules 98% of AOL users also have E-trade accounts Classification People with age less than 40 and salary > 40k trade on-line Clustering Users A and B access similar URLs Outlier Detection User A spends more than twice the average amount of time surfing on the Web

Web Mining The WWW is huge, widely distributed, global information service centre for Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. Hyper-link information Access and usage information WWW provides rich sources of data for data mining

Why Mine the Web? Enormous wealth of information on Web Financial information (e.g. stock quotes) Book/CD/Video stores (e.g. Amazon) Restaurant information (e.g. Zagats) Car prices (e.g. Carpoint) Lots of data on user access patterns Web logs contain sequence of URLs accessed by users Possible to mine interesting nuggets of information People who ski also travel frequently to Europe Tech stocks have corrections in the summer and rally from November until February

Why is Web Mining Different? The Web is a huge collection of documents except for Hyper-link information Access and usage information The Web is very dynamic New pages are constantly being generated Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms to Exploit hyper-links and access patterns Be incremental

Web Mining Applications E-commerce (Infrastructure) Generate user profiles Targetted advertizing Fraud Similar image retrieval Information retrieval (Search) on the Web Automated generation of topic hierarchies Web knowledge bases Extraction of schema for XML documents Network Management Performance management Fault management

User Profiling Important for improving customization Provide users with pages, advertisements of interest Example profiles: on-line trader, on-line shopper Generate user profiles based on their access patterns Cluster users based on frequently accessed URLs Use classifier to generate a profile for each cluster Engage technologies Tracks web traffic to create anonymous user profiles of Web surfers Has profiles for more than 35 million anonymous users

Internet Advertizing Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites Plenty of startups doing internet advertizing Doubleclick, AdForce, Flycast, AdKnowledge Internet advertizing is probably the “hottest” web mining application today

Fraud With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought) If buying pattern changes significantly, then signal fraud HNC software uses domain knowledge and neural networks for credit card fraud detection

Retrieval of Similar Images Given: A set of images Find: All images similar to a given image All pairs of similar images Sample applications: Medical diagnosis Weather predication Web search engine for images E-commerce 12

Problems with Web Search Today Today’s search engines are plagued by problems: the abundance problem (99% of info of no interest to 99% of people) limited coverage of the Web (internet sources hidden behind search interfaces) Largest crawlers cover < 18% of all web pages limited query interface based on keyword-oriented search limited customization to individual users

Problems with Web Search Today Today’s search engines are plagued by problems: Web is highly dynamic Lot of pages added, removed, and updated every day Very high dimensionality

Improve Search By Adding Structure to the Web Use Web directories (or topic hierarchies) Provide a hierarchical classification of documents (e.g., Yahoo!) Searches performed in the context of a topic restricts the search to only a subset of web pages related to the topic Yahoo home page Recreation Business Science News Travel Sports Companies Finance Jobs

Automatic Creation of Web Directories In the Clever project, hyper-links between Web pages are taken into account when categorizing them Use a bayesian classifier Exploit knowledge of the classes of immediate neighbors of document to be classified Show that simply taking text from neighbors and using standard document classifiers to classify page does not work Inktomi’s Directory Engine uses “Concept Induction” to automatically categorize millions of documents

Service Provider Network Network Management Objective: To deliver content to users quickly and reliably Traffic management Fault management Service Provider Network Router Server

Why is Traffic Management Important? While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three Result is frequent congestion at servers and on network links during a major event (e.g., princess diana’s death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world Olympic sites during the games NASA sites close to launch and landing of shuttles

Traffic Management Key Ideas Akamai, Inktomi Dynamically replicate/cache content at multiple sites within the network and closer to the user Multiple paths between any pair of sites Route user requests to server closest to the user or least loaded server Use path with least congested network links Akamai, Inktomi

Service Provider Network Traffic Management Congested link Congested server Request Router Server Service Provider Network

Traffic Management Need to mine network and Web traffic to determine What content to replicate? Which servers should store replicas? Which server to route a user request? What path to use to route packets? Network Design issues Where to place servers? Where to place routers? Which routers should be connected by links? One can use association rules, sequential pattern mining algorithms to cache/prefetch replicas at server

Fault Management Fault management involves Quickly identifying failed/congested servers and links in network Re-routing user requests and packets to avoid congested/down servers and links Need to analyze alarm and traffic data to carry out root cause analysis of faults Bayesian classifiers can be used to predict the root cause given a set of alarms

Web Mining Issues Size Diverse types of data Grows at about 1 million pages a day Google indexes 9 billion documents Number of web sites Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html) Diverse types of data Images Text Audio/video XML HTML

Total Sites Across All Domains August 1995 - October 2007 Number of Active Sites Total Sites Across All Domains August 1995 - October 2007

Systems Issues Web data sets can be very large 6/10/2018 Systems Issues Web data sets can be very large Tens to hundreds of terabytes Cannot mine on a single server! Need large farms of servers How to organize hardware/software to mine multi-terabye data sets Without breaking the bank! Dr. Navneet Goyal, BITS,Pilani