Web-Content Mining - Akanksha Dombe

The WWW is a huge, widely distributed, global information service centre for:  Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.  Hyper-link information  Access and usage information  The WWW therefore provides rich sources of data for data mining

The Web: Opportunities & Challenges 1. The amount of information on the Web is huge 2. The coverage of Web information is very wide and diverse 3. Information/data of almost all types exist on the Web 4. Much of the Web information is semi-structured 5. Much of the Web information is linked 6. Much of the Web information is redundant

The Web: Opportunities & Challenges 7. The Web is noisy 8. The Web is also about services 9. The Web is dynamic 10. Above all, the Web is a virtual society 11. The Web consists of the surface Web and the deep Web.  Surface Web: pages that can be browsed using a browser.  Deep Web: databases that can only be accessed through parameterized query interfaces

What is Web Data?  Web data includes: 1. Web content – text, images, records, etc. 2. Web structure – hyperlinks, tags, etc. 3. Web usage – HTTP logs, app server logs, etc. 4. Intra-page structures 5. Inter-page structures 6. Supplemental data: profiles, registration information, cookies

Web Mining  Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services  Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data  Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.

Web Mining  Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services  Discovering useful information from the World-Wide Web and its usage patterns  My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web

Why Mine the Web?  Enormous wealth of information on the Web  Financial information (e.g. stock quotes)  Book/CD/video stores (e.g. Amazon)  Restaurant information  Car prices  Lots of data on user access patterns  Web logs contain sequences of URLs accessed by users  Possible to mine interesting nuggets of information  People who ski also travel frequently to Europe  Tech stocks have corrections in the summer and rally from November until February

Why is Web Mining Different?  The Web is a huge collection of documents, plus:  Hyper-link information  Access and usage information  The Web is very dynamic  New pages are constantly being generated  Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to  Exploit hyper-links and access patterns  Be incremental

Web Mining: Subtasks  Resource finding  Retrieving intended documents  Information selection/pre-processing  Select and pre-process specific information from selected documents  Generalization  Discover general patterns within and across web sites  Analysis  Validation and/or interpretation of mined patterns

Web Mining Issues  Size  Grows at about 1 million pages a day  Google indexes 9 billion documents  Number of web sites  Netcraft survey says 72 million sites  Diverse types of data  Images  Text  Audio/video  XML  HTML

Web Mining Applications  E-commerce (infrastructure)  Generate user profiles  Targeted advertising  Fraud detection  Similar image retrieval  Information retrieval (search) on the Web  Automated generation of topic hierarchies  Web knowledge bases  Extraction of schema for XML documents  Network management  Performance management  Fault management

Web Mining Taxonomy

Web Data Mining  Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services.  Web mining may be divided into three categories: 1. Web content mining 2. Web structure mining 3. Web usage mining

What is "Web Content Mining"?

Web Content Mining  Discovery of useful information from web contents / data / documents  Web data contents: 1. text, 2. image, 3. audio, 4. video, 5. metadata and 6. hyperlinks

Web Content Mining  Examines the contents of web pages as well as the results of web searching  Can be thought of as extending the work performed by basic search engines  Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users  Web content mining is the process of extracting knowledge from web content

Web Content Mining  Basic search provides no information about the structure of the content we are searching for and no information about the various categories of documents that are found.  We therefore need more sophisticated tools for searching or discovering Web content.

Web Content Mining  Discovering useful information from the contents of Web pages.  Web content is very rich, consisting of text, images, audio, video, etc., plus metadata and hyperlinks.  The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured.

Web Content Data Structure  Unstructured – free text  Semi-structured – HTML  More structured – tables or database-generated HTML pages  Multimedia data – receives less attention than text or hypertext

Web Content Mining  Web content mining is related to data mining and text mining  It is related to data mining because many data mining techniques can be applied in Web content mining.  It is related to text mining because much of the web content is text.  Web data are mainly semi-structured and/or unstructured, while data mining deals mainly with structured data and text mining with unstructured text.

Web Content Data Structure  Web content consists of several types of data  Text, images, audio, video, hyperlinks.  Unstructured – free text  Semi-structured – HTML  More structured – data in tables or database-generated HTML pages  Note: much of the Web content data is unstructured text data.

Semi-structured Data  Content is, in general, semi-structured  Example:  Title  Author  Publication_Date  Length  Category  Abstract  Content

Web Content Mining: IR View  Unstructured Documents  Bag of words, or phrase-based feature representation  Features can be boolean or frequency based  Features can be reduced using different feature selection techniques  Word stemming, combining morphological variations into one feature
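To make the bag-of-words idea concrete, here is a minimal sketch of both frequency-based and boolean feature vectors. Python and scikit-learn are assumed choices (the slides do not prescribe a library), and the two tiny documents are invented.

```python
# Minimal bag-of-words sketch: frequency-based vs boolean term features.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "web mining applies data mining techniques to web data",
    "content mining discovers useful information from web content",
]

# Frequency-based features: each cell counts how often a term occurs.
freq_vectorizer = CountVectorizer()
freq_matrix = freq_vectorizer.fit_transform(docs)

# Boolean features: each cell only records presence/absence of a term.
bool_vectorizer = CountVectorizer(binary=True)
bool_matrix = bool_vectorizer.fit_transform(docs)

print(freq_vectorizer.get_feature_names_out())
print(freq_matrix.toarray())   # term counts per document
print(bool_matrix.toarray())   # 0/1 presence per document
```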

Web Content Mining: IR View  Semi-Structured Documents  Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)  Uses common data mining methods (whereas unstructured might use more text mining methods)

Web Content Mining: DB View  Tries to infer the structure of a Web site or transform a Web site to become a database  Better information management  Better querying on the Web  Can be achieved by:  Finding the schema of Web documents  Building a Web warehouse  Building a Web knowledge base  Building a virtual database

Web Content Mining: DB View  Mainly uses the Object Exchange Model (OEM)  Represents semi-structured data (some structure, no rigid schema) by a labeled graph  Process typically starts with manual selection of Web sites for content mining  Main application: building a structural summary of semi-structured data (schema extraction or discovery)
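As a rough illustration of the OEM idea (not any specific tool from the slides), the sketch below models semi-structured records as nested label/value maps with no rigid schema and derives a crude structural summary, in the spirit of schema discovery. The records and labels are invented.

```python
# Toy sketch of the Object Exchange Model idea: semi-structured records as
# nested label/value maps; different records may carry different labels.
record_a = {
    "title": "Some product page",
    "seller": {"name": "Example Shop", "city": "Pune"},
    "price": 59.99,              # present in this record...
}
record_b = {
    "title": "Another product page",
    "seller": {"name": "Other Shop"},
    "rating": 4.2,               # ...while this record has a rating instead
}

def label_paths(obj, prefix=""):
    """Collect the label paths of one record: a crude structural summary."""
    paths = []
    for label, value in obj.items():
        path = prefix + label
        paths.append(path)
        if isinstance(value, dict):
            paths.extend(label_paths(value, path + "."))
    return paths

# Schema discovery in miniature: which label paths do the records share?
print(sorted(set(label_paths(record_a)) & set(label_paths(record_b))))
print(sorted(set(label_paths(record_a)) ^ set(label_paths(record_b))))
```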

Techniques for Web Content Mining  Classification  Clustering  Association

Web Content Mining : Topics  Structured data extraction  Unstructured text extraction  Sentiment classification, analysis and summarization of consumer reviews  Information integration and schema matching  Knowledge synthesis  Template detection and page segmentation

Structured Data Extraction  The most widely studied research topic  A large amount of information on the Web is contained in regularly structured data objects (often retrieved from databases).  Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services

Structured Data Extraction  Applications: integrated and value-added services, e.g., comparison shopping, meta-search and query, etc.

Structured Data Extraction: Approaches 1. Wrapper Generation 2. Wrapper Induction or Wrapper Learning 3. Automatic Approach

Structured Data Extraction: Approaches  Wrapper Generation: write an extraction program for each website based on observed format patterns  Labor intensive and time consuming

 Automatic Approach  Structured data objects on the web are normally database records  Retrieved from databases and displayed in web pages with fixed templates  Find patterns/grammars from the web pages and then use them to extract data  e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.

 Wrapper Induction or Wrapper Learning  Currently the main technique  The user first manually labels a set of training pages  A learning system then generates rules from the training pages  The resulting rules are then applied to extract target items from web pages  e.g. WIEN, Stalker, BWI, WL, etc.
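The toy sketch below illustrates the wrapper induction idea in the spirit of WIEN-style delimiter learning; it is not the actual WIEN or Stalker algorithm, and the pages and values are invented.

```python
# Toy wrapper induction: learn left/right delimiters around a labelled target
# item on a training page, then apply them to a new page with the same template.
import re

def learn_delimiters(training_html, labelled_value):
    """Learn the text immediately before and after the labelled item."""
    start = training_html.index(labelled_value)
    left = training_html[:start][-10:]                         # left context
    right = training_html[start + len(labelled_value):][:10]   # right context
    return left, right

def apply_wrapper(html, left, right):
    """Extract every item appearing between the learned delimiters."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, html)

train_page = "<li>Price: $12.99</li>"
left, right = learn_delimiters(train_page, "12.99")

new_page = "<li>Price: $8.50</li><li>Price: $3.20</li>"
print(apply_wrapper(new_page, left, right))   # ['8.50', '3.20']
```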

 Supervised Learning  Supervised learning is a machine learning technique for creating a function from training data.  Training documents are categorized (labelled)  The learned function predicts a class label for an input object (classification).  Techniques used include  Nearest Neighbor Classifier  Feature Selection  Decision Tree
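A hedged sketch of supervised document classification with a nearest-neighbour classifier follows; scikit-learn is an assumed library choice, and the tiny corpus and labels are invented.

```python
# Supervised document classification sketch: TF-IDF features + 1-NN classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = [
    "stock prices and financial markets",
    "football scores and league tables",
    "quarterly earnings and dividends",
    "tennis championship results",
]
train_labels = ["finance", "sports", "finance", "sports"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, train_labels)

X_new = vectorizer.transform(["latest dividend announcement"])
print(clf.predict(X_new))   # predicted class label, e.g. ['finance']
```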

 Feature selection removes terms in the training documents that are statistically uncorrelated with the class labels  Simple heuristics:  Stop words like "a", "an", "the", etc.  Empirically chosen thresholds for ignoring "too frequent" or "too rare" terms  Discard "too frequent" and "too rare" terms
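A small sketch of these heuristics, using scikit-learn's document-frequency thresholds to drop stop words and "too frequent"/"too rare" terms; the thresholds and mini-corpus are illustrative, not from the slides.

```python
# Frequency-based feature selection: stop words plus document-frequency cutoffs.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the web is a huge collection of documents and data",
    "web mining extracts useful knowledge from web data",
    "the web is dynamic and many documents are noisy",
]

vectorizer = CountVectorizer(
    stop_words="english",  # drop "the", "a", "is", ...
    max_df=0.9,            # ignore terms occurring in more than 90% of documents
    min_df=2,              # ignore terms occurring in fewer than 2 documents
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # e.g. ['data' 'documents']
```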

Examples of Discovered Patterns  Association rules  98% of AOL users also have E-trade accounts  Classification  People with age less than 40 and salary > 40k trade online  Clustering  Users A and B access similar URLs  Outlier Detection  User A spends more than twice the average amount of time surfing on the Web

 Important for improving customization  Provide users with pages, advertisements of interest  Example profiles: on-line trader, on-line shopper  Generate user profiles based on their access patterns  Cluster users based on frequently accessed URLs  Use classifier to generate a profile for each cluster  Engage technologies  Tracks web traffic to create anonymous user profiles of Web surfers  Has profiles for more than 35 million anonymous users

 Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites  Plenty of startups doing Internet advertising  Doubleclick, AdForce, Flycast, AdKnowledge  Internet advertising is probably the "hottest" web mining application today

 Scheme 1:  Manually associate a set of ads with each user profile  For each user, display an ad from the set based on profile  Scheme 2:  Automate association between ads and users  Use ad click information to cluster users (each user is associated with a set of ads that he/she clicked on)  For each cluster, find ads that occur most frequently in the cluster and these become the ads for the set of users in the cluster
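A minimal sketch of Scheme 2, assuming a binary user-by-ad click matrix and k-means clustering; the click data and cluster count are invented for illustration.

```python
# Scheme 2 sketch: cluster users by clicked ads, then pick each cluster's
# most frequently clicked ads as the ads to show that cluster.
import numpy as np
from sklearn.cluster import KMeans

ads = ["A1", "A2", "A3", "A4"]
clicks = np.array([            # rows = users, columns = ads, 1 = clicked
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(clicks)

for cluster in range(2):
    counts = clicks[labels == cluster].sum(axis=0)
    top = [ads[i] for i in np.argsort(counts)[::-1][:2]]
    print(f"cluster {cluster}: show ads {top}")
```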

Internet Advertising  Use collaborative filtering (e.g. Likeminds, Firefly)  Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)  Rij = rating of user Ui for ad Aj  Problem: compute user Ui's rating for an unrated ad Aj (illustrated on the slide as a user-by-ad ratings table over ads A1, A2, A3 with a missing "?" entry)

Internet Advertising  Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui's  User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:  Consider each user Uk who has rated ad Aj  Compute Dik, the distance between Ui's and Uk's ratings on common ads  Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik  Display to Ui the ad Aj with the highest computed rating
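The nearest-neighbour rule above can be sketched directly in a few lines; the ratings below are invented for illustration.

```python
# Nearest-neighbour collaborative filtering sketch: predict Ui's rating for an
# unrated ad by copying the rating of the closest user who has rated it.
ratings = {                      # user -> {ad: rating}
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 4, "A2": 2, "A3": 5},
    "U3": {"A1": 1, "A2": 5, "A3": 2},
}

def distance(u, v):
    """Euclidean distance between two users over the ads both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    return sum((ratings[u][a] - ratings[v][a]) ** 2 for a in common) ** 0.5

def predict(user, ad):
    """Predicted rating: the rating given by the closest user who rated `ad`."""
    candidates = [v for v in ratings if v != user and ad in ratings[v]]
    nearest = min(candidates, key=lambda v: distance(user, v))
    return ratings[nearest][ad]

print(predict("U1", "A3"))   # U2 is closer to U1 than U3 is, so prediction is 5
```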

 With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important  Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)  If buying pattern changes significantly, then signal fraud  HNC software uses domain knowledge and neural networks for credit card fraud detection

 Given:  A set of images  Find:  All images similar to a given image  All pairs of similar images  Sample applications:  Medical diagnosis  Weather prediction  Web search engine for images  E-commerce

 QBIC, Virage, Photobook  Compute feature signature for each image  QBIC uses color histograms  WBIIS, WALRUS use wavelets  Use spatial index to retrieve database image whose signature is closest to the query’s signature  WALRUS decomposes an image into regions  A single signature is stored for each region  Two images are considered to be similar if they have enough similar region pairs
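A rough sketch of histogram-based retrieval in the spirit of QBIC's colour signatures (not its actual implementation); the "images" below are random arrays standing in for real pixel data.

```python
# Signature-based image retrieval sketch: compare colour-histogram signatures
# and return the database image closest to the query.
import numpy as np

def colour_signature(image, bins=8):
    """Per-channel colour histogram, concatenated into one signature vector."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    sig = np.concatenate(hists).astype(float)
    return sig / sig.sum()       # normalise so image size does not matter

rng = np.random.default_rng(0)
database = {f"img{i}": rng.integers(0, 256, (32, 32, 3)) for i in range(5)}
signatures = {name: colour_signature(img) for name, img in database.items()}

query = rng.integers(0, 256, (32, 32, 3))
query_sig = colour_signature(query)

best = min(signatures, key=lambda n: np.linalg.norm(signatures[n] - query_sig))
print("closest database image:", best)
```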

(Figure: query image)

 Today's search engines are plagued by problems:  the abundance problem (99% of the information is of no interest to 99% of the people)  limited coverage of the Web (Internet sources hidden behind search interfaces)  Largest crawlers cover < 18% of all web pages  limited query interfaces based on keyword-oriented search  limited customization to individual users

 Today's search engines are plagued by problems:  The Web is highly dynamic  Lots of pages are added, removed, and updated every day  Very high dimensionality

 Use Web directories (or topic hierarchies)  Provide a hierarchical classification of documents (e.g., Yahoo!)  Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic  (Diagram: Yahoo home page with top-level topics such as News, Business, Science, Recreation and subtopics such as Jobs, Finance, Companies, Travel, Sports)

 In the Clever project, hyper-links between Web pages are taken into account when categorizing them  Uses a Bayesian classifier  Exploits knowledge of the classes of the immediate neighbors of the document to be classified  Shows that simply taking text from neighbors and using standard document classifiers to classify the page does not work  Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents

 Objective: to deliver content to users quickly and reliably  Traffic management  Fault management  (Diagram: service provider network of routers and servers)

 While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three  The result is frequent congestion at servers and on network links  During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world  Olympic sites during the games  NASA sites close to launch and landing of shuttles

 Key Ideas  Dynamically replicate/cache content at multiple sites within the network and closer to the user  Multiple paths between any pair of sites  Route user requests to server closest to the user or least loaded server  Use path with least congested network links  Akamai, Inktomi

(Diagram: a user request routed through a service provider network of routers and servers, avoiding a congested server and a congested link)

 Need to mine network and Web traffic to determine  What content to replicate?  Which servers should store replicas?  Which server to route a user request to?  What path to use to route packets?  Network design issues  Where to place servers?  Where to place routers?  Which routers should be connected by links?  One can use association rule and sequential pattern mining algorithms to cache/prefetch replicas at servers
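As a toy example of mining access sequences for prefetching decisions (a stand-in for full sequential pattern mining; the sessions are invented):

```python
# Prefetching sketch: count page-to-page transitions in navigation sessions
# and prefetch the most likely successor of the page just served.
from collections import Counter, defaultdict

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/home", "/products", "/cart", "/checkout"],
]

transitions = defaultdict(Counter)
for session in sessions:
    for current_page, next_page in zip(session, session[1:]):
        transitions[current_page][next_page] += 1

def prefetch_candidate(page):
    """Most frequent successor of `page`, or None if it was never followed."""
    return transitions[page].most_common(1)[0][0] if transitions[page] else None

print(prefetch_candidate("/home"))      # '/products'
print(prefetch_candidate("/products"))  # '/cart'
```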

 Fault management involves  Quickly identifying failed/congested servers and links in network  Re-routing user requests and packets to avoid congested/down servers and links  Need to analyze alarm and traffic data to carry out root cause analysis of faults  Bayesian classifiers can be used to predict the root cause given a set of alarms

(Chart: Total Sites Across All Domains, ending October 2007)

 Web data sets can be very large  Tens to hundreds of terabytes  Cannot mine on a single server!  Need large farms of servers  How to organize hardware/software to mine multi-terabyte data sets  Without breaking the bank!

 Structured Data  Unstructured Data  OLE DB offers some solutions!

 Pages contain information  Links are 'roads'  How do people navigate the Internet?  Web Usage Mining (clickstream analysis)  Information on navigation paths is available in log files  Logs can be mined from a client or a server perspective

 Why analyze Website usage?  Knowledge about how visitors use a Website can  Provide guidelines for web site reorganization and help prevent disorientation  Help designers place important information where visitors look for it  Support pre-fetching and caching of web pages  Provide an adaptive Website (personalization)  Questions that could be answered  What are the differences in usage and access patterns among users?  Which user behaviors change over time?  How do usage patterns change with quality of service (slow/fast)?  What is the distribution of network traffic over time?

 Analog – a Web log file analyser  Gives basic statistics such as  number of hits  average hits per time period  which pages are popular on your site  who is visiting your site  what keywords users are searching for to get to you  what is being downloaded
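A minimal sketch of the kind of per-page hit counting such an analyser performs, using invented Common Log Format lines:

```python
# Log analysis sketch: parse request paths from log lines and count hits per page.
import re
from collections import Counter

log_lines = [
    '192.168.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.168.0.2 - - [10/Oct/2007:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 1042',
    '192.168.0.1 - - [10/Oct/2007:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
]

request_pattern = re.compile(r'"GET (?P<path>\S+) HTTP')

hits = Counter()
for line in log_lines:
    match = request_pattern.search(line)
    if match:
        hits[match.group("path")] += 1

print(hits.most_common())   # [('/index.html', 2), ('/products.html', 1)]
```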

 Content is, in general, semi-structured  Example:  Title  Author  Publication_Date  Length  Category  Abstract  Content

 Many methods are designed to analyze structured data  If we can represent documents by a set of attributes, we will be able to use existing data mining methods  How do we represent a document?  Vector-based representation (referred to as "bag of words", as it is invariant to permutations)  Use statistics to add a numerical dimension to unstructured text

 A document representation aims to capture what the document is about  One possible approach:  Each entry describes a document  Attributes describe whether or not a term appears in the document

 Another approach:  Each entry describes a document  Attributes represent the frequency with which a term appears in the document

 Stop word removal: many words are not informative and thus irrelevant for document representation: the, and, a, an, is, of, that, …  Stemming: reducing words to their root form (reduces dimensionality)  A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword "fishing"  Different words that share the same word stem should be represented by the stem (e.g. "fish") instead of the actual word
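A toy illustration of stop-word removal plus crude suffix-stripping stemming follows; the hand-rolled stemmer is for this example only, and real systems would typically use something like the Porter stemmer.

```python
# Preprocessing sketch: drop stop words, then collapse morphological variants
# of "fish" onto a common stem with a crude suffix stripper.
stop_words = {"the", "and", "a", "an", "is", "of", "that", "for", "may", "but"}

def crude_stem(word, suffixes=("ers", "er", "es", "ing", "s")):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "A document may contain fish fishes fisher and fishers but not fishing"
tokens = [t.lower() for t in text.split()]
terms = [crude_stem(t) for t in tokens if t not in stop_words]
print(terms)   # ['document', 'contain', 'fish', 'fish', 'fish', 'fish', 'not', 'fish']
```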