Overview of Web Mining 2000. 3. Doheon Lee School of Computer and Information Chonnam National University


Overview of Web Mining Doheon Lee School of Computer and Information Chonnam National University

Table of Contents
What is Web Mining?
Resource Discovery
Information Extraction
Categorization
Clustering
Web Usage Mining
Case Studies (IBM and Semio)
Concluding Remarks

What is Web Mining?
Web mining can be defined as the automated discovery of useful information from World Wide Web documents (and services).
Web Resource Discovery, Information Extraction, Categorization, Clustering, Web Usage Mining
Database Query Processing, Classification, Clustering, Association
Cf. Web Content Mining vs. Web Usage Mining

Resource Discovery
Search engine
– Automatic creation of searchable indices of Web documents
– Lycos, WebCrawler, Alta Vista, ALIWEB, etc.
Meta search engine
– It posts keyword queries to multiple searchable indices in parallel; it then collates and prunes the responses returned, aiming to provide users with a manageable amount of high-quality information
– MetaCrawler
Automatic text categorization technology
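The collate-and-prune step of a meta search engine can be sketched as follows. This is a minimal illustration, not MetaCrawler's actual method: the reciprocal-rank scoring, the function name `meta_search`, and the stand-in engines `engine_a` and `engine_b` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def meta_search(query, engines, limit=10):
    """Post the query to every engine in parallel, then collate and prune."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Collate: sum reciprocal-rank scores, so a URL returned by several
    # engines (or ranked highly by one) rises in the merged list.
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank

    # Prune: keep only the best `limit` URLs (ties broken alphabetically).
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return ranked[:limit]


# Hypothetical stand-ins for real searchable indices.
def engine_a(query):
    return ["a.example/1", "b.example/2", "c.example/3"]


def engine_b(query):
    return ["b.example/2", "d.example/4"]


print(meta_search("web mining", [engine_a, engine_b], limit=3))
```

Merging by summed scores also removes duplicates, because each URL appears once in the score table.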

Resource Discovery (Cont'd)
Personalized Web Agents
– Web agents learn user preferences and discover Web information sources based on their preferences, and those of other individuals with similar interests
– WebWatcher, PAINT, Syskill & Webert, GroupLens, Firefly, etc.
Web Query Systems
– W3QL: It combines structure queries based on the organization of hypertext documents, and content queries based on information retrieval techniques
– WebLog: Logic-based query language
– Lorel, UnQL: Query languages based on a labeled graph data model
– TSIMMIS: It generates an integrated database representation from Web information.

Information Extraction
From Web documents
– Harvest: It knows how to find author and title information in LaTeX documents, and how to strip position information from PostScript files
– FAQ-Finder: The user poses a question in natural language and the text of the question is used to search the FAQ files for a matching question
From Web services
– Internet Learning Agent (ILA): It extracts information such as phone numbers and addresses from the Internet Whois server and from the personnel directories of a dozen universities
– ShopBot: It takes as input the address of a store's home page as well as knowledge about a product domain, and learns how to shop at the store.

ShopBot
Domain-independent comparison-shopping agent
It autonomously learns how to shop at different vendors.
It does not use full-fledged NLP, but rather heuristic search, pattern matching, and inductive learning.
Phase 1: Learning phase
– Starting from the root page of a store, it finds forms for searchable indices.
– For each form, it applies test queries, and constructs vendor descriptions.
– To analyze query result pages, it applies heuristic rules.
Phase 2: Shopping phase
– Based on the vendor descriptions, it extracts product descriptions such as prices.
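The shopping phase relies on pattern matching over result pages. The sketch below shows the general idea on plain text, assuming one product per line and dollar prices; the regular expression and the function `extract_offers` are illustrative assumptions and not ShopBot's learned vendor descriptions.

```python
import re

# Assumed price format: a dollar sign followed by digits, with optional
# thousands separators and optional cents.
PRICE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def extract_offers(page_text):
    """Return (description, price) pairs from lines that contain a price."""
    offers = []
    for line in page_text.splitlines():
        match = PRICE.search(line)
        if match:
            # Text before the price is taken as the product description.
            description = line[:match.start()].strip(" -:\t")
            price = float(match.group(1).replace(",", ""))
            offers.append((description, price))
    return offers


page = "Acme Modem 56K - $89.99\nShipping info\nTurbo Mouse: $1,024.50"
print(extract_offers(page))
```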

Categorization
Conventional text categorization
– Support Vector Machines (SVM)
– k-Nearest Neighbor Classifier
– Neural Network Approaches
– Linear Least Square Fit (LLSF) Mapping
– Naïve Bayes Classifier
Limitations on applying to web categorization
– Diverse Vocabulary
– Hyperlinks
– (Intra) Structural Characteristics
– Cf. 87% accuracy on the Reuters data set is reduced to 32% accuracy on a Yahoo! document set.

Clustering
Grouping Web documents based on their semantic relationships (e.g. HyPursuit at MIT)
An algorithm starts with a set where each original document represents an independent cluster.
It iteratively reduces the number of clusters by merging the two most relevant clusters.
It uses pair-wise evaluation of component clusters to compute the relevance of two compound clusters.
The relevance of the compound clusters is the minimal relevance between any of these pairs.
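The merging procedure described on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions: the stopping criterion `target_clusters`, the function names, and the toy relevance measure (number of shared terms) are illustrative choices, not details taken from HyPursuit.

```python
from itertools import combinations


def cluster_relevance(c1, c2, relevance):
    """Relevance of two compound clusters: the minimum over all document pairs."""
    return min(relevance(a, b) for a in c1 for b in c2)


def agglomerate(documents, relevance, target_clusters):
    """Start with one cluster per document; repeatedly merge the most relevant pair."""
    clusters = [[doc] for doc in documents]
    while len(clusters) > max(1, target_clusters):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_relevance(clusters[p[0]], clusters[p[1]], relevance),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Toy data: relevance is the number of terms two documents share.
terms = {
    "d1": {"web", "mining", "log"},
    "d2": {"web", "mining", "usage"},
    "d3": {"price", "shop"},
    "d4": {"price", "shop", "vendor"},
}
shared_terms = lambda a, b: len(terms[a] & terms[b])
print(agglomerate(list(terms), shared_terms, target_clusters=2))
```

Taking the minimum over pairs means two clusters are merged only when even their least related documents are still relevant to each other.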

Clustering (Cont'd)
Relevance between two documents
– Content-based
  – The number of common terms
    – Term frequency
    – Document size factor
    – Document frequency (hard to compute)
– Link structure-based
  – The number of common ancestors
  – The number of common descendants
  – The number of direct paths between two documents
  – Cf. Shortest path between two documents
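Both kinds of relevance can be illustrated with small functions. The formulas below are assumptions chosen for the example: the content score counts shared term occurrences and divides by a document size factor, omitting document frequency (which the slide notes is hard to compute), and the link score simply adds the numbers of common ancestors and common descendants.

```python
import math
from collections import Counter


def content_relevance(terms_a, terms_b):
    """Shared term occurrences, normalised by a document size factor."""
    if not terms_a or not terms_b:
        return 0.0
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    shared = sum(min(counts_a[t], counts_b[t]) for t in counts_a.keys() & counts_b.keys())
    return shared / math.sqrt(len(terms_a) * len(terms_b))


def reachable(graph, start):
    """All pages reachable from `start` by following links in `graph`."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def link_relevance(graph, a, b):
    """Number of common ancestors plus number of common descendants."""
    reverse = {}
    for src, targets in graph.items():
        for dst in targets:
            reverse.setdefault(dst, set()).add(src)
    common_descendants = reachable(graph, a) & reachable(graph, b)
    common_ancestors = reachable(reverse, a) & reachable(reverse, b)
    return len(common_ancestors) + len(common_descendants)


# Hypothetical link graph: page -> set of pages it links to.
links = {"home": {"a", "b"}, "a": {"c"}, "b": {"c"}, "c": set()}
print(content_relevance(["web", "mining", "web"], ["web", "log", "web"]))
print(link_relevance(links, "a", "b"))
```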

Web Usage Mining
Analysis of Web access logs, referral logs, and user profiles to obtain Web usage information.
Preprocessing
– Data cleaning, user identification, actual path identification, transaction identification, session identification
– Local caches and proxy servers make these steps difficult.
Pattern discovery
– Association rules, sequential patterns, classification rules, clustering analysis
Analysis of discovered patterns
– Visualization (WebWiz), OLAP, query language (WEBMINER)
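Session identification, one of the preprocessing steps above, is often done with a timeout heuristic. The sketch below assumes log entries have already been cleaned into `(user, timestamp_in_seconds, url)` tuples and that a gap of more than 30 minutes starts a new session; both the tuple layout and the threshold are assumptions for illustration.

```python
def sessionize(entries, timeout=1800):
    """Group (user, timestamp, url) entries into per-user sessions."""
    sessions = []   # list of (user, [urls in visit order])
    last_seen = {}  # user -> (time of last request, index of open session)
    for user, timestamp, url in sorted(entries, key=lambda e: e[1]):
        previous = last_seen.get(user)
        if previous is None or timestamp - previous[0] > timeout:
            # First request from this user, or the gap is too long.
            sessions.append((user, [url]))
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)
        last_seen[user] = (timestamp, index)
    return sessions


log = [
    ("u1", 0, "/"),
    ("u1", 60, "/company/products"),
    ("u2", 100, "/"),
    ("u1", 4000, "/company/special"),
]
for user, urls in sessionize(log):
    print(user, urls)
```

Identifying users by a single key is itself a simplification: as the slide notes, local caches and proxy servers hide requests and merge distinct users.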

Patterns in Web Usage
Association rules
– 40% of clients who accessed the Web page with URL /company/product1 also accessed /company/product2.
– 30% of clients who accessed /company/special placed an online order in /company/product1.
Sequential patterns
– 30% of clients who visited /company/products had done a search in Yahoo within the past week on keyword w.
– 60% of clients who placed an online order in /company/product1 also placed an online order in /company/product4 within 15 days.
Classification rules
– Clients from state or government agencies who visit the site tend to be interested in the page /company/product1.
– 50% of clients who placed an online order in /company/product2, were in the age group and lived on the West Coast.
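The percentages in the association rules above are rule confidences: of the clients who accessed the first page, the fraction who also accessed the second. A minimal sketch follows, assuming each client's visits are available as a set of URLs; the function names and toy data are illustrative.

```python
def rule_support(sessions, antecedent, consequent):
    """Fraction of all sessions that contain both pages."""
    if not sessions:
        return 0.0
    both = sum(1 for s in sessions if antecedent in s and consequent in s)
    return both / len(sessions)


def rule_confidence(sessions, antecedent, consequent):
    """Of the sessions containing `antecedent`, the fraction also containing `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(1 for s in with_antecedent if consequent in s)
    return both / len(with_antecedent)


# Toy data: one set of visited URLs per client.
visits = [
    {"/company/product1", "/company/product2"},
    {"/company/product1"},
    {"/company/product1", "/company/product2", "/company/special"},
    {"/company/special"},
    {"/company/product1"},
]
print(rule_confidence(visits, "/company/product1", "/company/product2"))
print(rule_support(visits, "/company/product1", "/company/product2"))
```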

A General Architecture for Web Usage Mining
From R. Cooley, et al., “Web Mining: Information and Pattern Discovery on the World Wide Web”, ICTAI97

IBM Intelligent Miner for Text
Extract key information from text
– Language identification based on a set of training documents in the languages
– Feature extraction based on Information Quotient (IQ)
  – Names of people, organizations, places: linguistically motivated heuristics that exploit typography and other regularities of languages
  – Multiword terms: heuristics based on a dictionary containing part-of-speech information for English words, which do simple pattern matching to find expressions having noun phrase structures
  – Abbreviations
  – Dates, currency amounts
Organize documents by subject
– Hierarchical clustering based on lexical affinities
– Cf. Overlap of single words vs. semantic analysis
Find the predominant themes in a collection of documents
Search for relevant documents using flexible queries
– Boolean queries with wild cards, free text queries, hybrid queries
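The multiword-term heuristic (a part-of-speech dictionary plus simple pattern matching for noun phrase structures) can be sketched as below. The tiny dictionary `POS`, the pattern "adjectives and nouns ending in a noun", and the function `multiword_terms` are assumptions for illustration and are not Intelligent Miner's actual rules.

```python
# Hypothetical part-of-speech dictionary; a real one would cover the language.
POS = {
    "automatic": "ADJ", "hard": "ADJ",
    "text": "NOUN", "categorization": "NOUN", "problem": "NOUN",
    "is": "VERB", "a": "DET",
}


def multiword_terms(tokens, pos=POS):
    """Return runs of adjectives/nouns that end in a noun and span 2+ words."""
    terms, run = [], []
    for token in list(tokens) + [None]:  # None is a sentinel that flushes the last run
        tag = pos.get(token.lower()) if token is not None else None
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # The run has ended: drop trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= 2:
            terms.append(" ".join(word for word, _ in run))
        run = []
    return terms


print(multiword_terms("Automatic text categorization is a hard problem".split()))
```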

Semio's Automatic Taxonomy Building
Three groups of layers in Semio Taxonomy
– Ontology: The highest levels of the directory. These levels are primarily containers for other categories, not for specific documents. The topmost level is provided by the directory owner, while subsequent levels are provided from the Semio Topic Library.
– Taxonomy: Semio Builder automatically generates two levels of taxonomy structure using a patented technique based on computational semiotics.
– Thesaurus: It contains “related to” links between concepts in the collection. Semio Builder automatically generates “related to” links.

Concluding Remarks
Diverse types of Web Mining targets
Data preparation for Web Mining
Parallel and scalable Web Mining solutions
Capturing common operators