Overview of Web Mining 2000. 3. Doheon Lee School of Computer and Information Chonnam National University


1 Overview of Web Mining
2000. 3.
Doheon Lee
School of Computer and Information, Chonnam National University
mailto:dhlee@chonnam.ac.kr

2 Table of Contents
What is Web Mining?
Resource Discovery
Information Extraction
Categorization
Clustering
Web Usage Mining
Case Studies (IBM and Semio)
Concluding Remarks

3 What is Web Mining?
Web mining can be defined as the automated discovery of useful information from World Wide Web documents (and services).
Web Resource Discovery, Information Extraction, Categorization, Clustering, Web Usage Mining
Database Query Processing, Classification, Clustering, Association
Cf. Web Content Mining vs. Web Usage Mining

4 Resource Discovery
Search engine
  – Automatic creation of searchable indices of Web documents
  – Lycos, WebCrawler, Alta Vista, ALIWEB, etc.
Meta search engine
  – It posts keyword queries to multiple searchable indices in parallel; it then collates and prunes the responses returned, aiming to provide users with a manageable amount of high-quality information
  – MetaCrawler
Automatic text categorization technology
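The collate-and-prune step of a meta search engine can be sketched as follows. This is a minimal illustration, not MetaCrawler's actual method: the engine names, the example URLs, and the reciprocal-rank scoring rule are all assumptions made for the example, and the responses are taken as already gathered rather than fetched in parallel.

```python
from collections import defaultdict

def collate(results_by_engine, top_k=3):
    """Merge ranked URL lists from several engines and keep the top_k.

    Each URL scores the sum of 1/rank over the engines that returned it
    (an assumed scoring rule), so duplicates are merged and URLs found by
    several engines rise to the top.
    """
    scores = defaultdict(float)
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
    # Sort by descending score, then by URL for a deterministic order.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:top_k]  # pruning: keep only a manageable number

# Hypothetical responses from three engines for one keyword query.
responses = {
    "engine_a": ["http://a.example/1", "http://b.example/2", "http://c.example/3"],
    "engine_b": ["http://b.example/2", "http://a.example/1"],
    "engine_c": ["http://d.example/4", "http://b.example/2"],
}
merged = collate(responses, top_k=3)
print(merged)
```

A URL returned by all three engines ends up first, and the URL that only one engine ranked last is pruned.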

5 Resource Discovery (Cont'd)
Personalized Web Agents
  – Web agents learn user preferences and discover Web information sources based on these preferences and those of other individuals with similar interests
  – WebWatcher, PAINT, Syskill & Webert, GroupLens, Firefly, etc.
Web Query Systems
  – W3QL: It combines structure queries based on the organization of hypertext documents with content queries based on information retrieval techniques
  – WebLog: Logic-based query language
  – Lorel, UnQL: Query languages based on a labeled graph data model
  – TSIMMIS: It generates an integrated database representation from Web information.

6 Information Extraction
From Web documents
  – Harvest: It knows how to find author and title information in LaTeX documents, and how to strip position information from PostScript files
  – FAQ-Finder: The user poses a question in natural language, and the text of the question is used to search the FAQ files for a matching question
From Web services
  – Internet Learning Agent (ILA): It extracts information such as phone numbers and e-mail addresses from the Internet server Whois and from the personnel directories of a dozen universities
  – ShopBot: It takes as input the address of a store's home page as well as knowledge about a product domain, and learns how to shop at the store.

7 ShopBot
Domain-independent comparison-shopping agent
It autonomously learns how to shop at different vendors.
It does not use full-fledged NLP; instead it uses heuristic search, pattern matching, and inductive learning.
Phase 1: Learning phase
  – Starting from the root page of a store, it finds forms for searchable indices.
  – For each form, it applies test queries and constructs vendor descriptions.
  – To analyze query result pages, it applies heuristic rules.
Phase 2: Shopping phase
  – Based on the vendor descriptions, it extracts product descriptions such as prices.

8 Categorization
Conventional text categorization
  – Support Vector Machines (SVM)
  – k-Nearest Neighbor Classifier
  – Neural Network Approaches
  – Linear Least Square Fit (LLSF) Mapping
  – Naïve Bayes Classifier
Limitations on applying to web categorization
  – Diverse Vocabulary
  – Hyperlinks
  – (Intra) Structural Characteristics
  – Cf. 87% accuracy on the Reuters data set is reduced to 32% accuracy on a Yahoo! document set.
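One of the conventional classifiers listed above, naïve Bayes, can be sketched in a few lines. This is a minimal multinomial version with add-one smoothing; the training sentences, the class labels, and the whitespace tokenizer are assumptions chosen for illustration, not data from the presentation.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify_nb(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins final match", "sports"),
    ("player scores in match", "sports"),
]
model = train_nb(training)
label = classify_nb("interest rates and shares", model)
print(label)
```

Words outside the training vocabulary contribute nothing to the score, which hints at why a diverse Web vocabulary hurts classifiers trained on a narrower corpus.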

9 Clustering
Grouping Web documents based on their semantic relationships (e.g. HyPursuit at MIT)
An algorithm starts with a set where each original document represents an independent cluster.
It iteratively reduces the number of clusters by merging the two most relevant clusters.
It uses pair-wise evaluation of component clusters to compute the relevance of two compound clusters.
The relevance of the compound clusters is the minimal relevance between any of these pairs.
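The merging procedure described above can be sketched as bottom-up clustering in which the relevance of two clusters is the minimum relevance over all cross-cluster document pairs. The stopping rule (a target number of clusters), the toy documents, and the common-term relevance function are assumptions for illustration; the slides do not specify them.

```python
from itertools import combinations

def cluster_relevance(c1, c2, relevance):
    """Relevance of two clusters: the minimum over all cross-cluster pairs."""
    return min(relevance(a, b) for a in c1 for b in c2)

def agglomerate(docs, relevance, target):
    """Merge the two most relevant clusters until `target` clusters remain."""
    clusters = [frozenset([d]) for d in docs]  # one cluster per document
    while len(clusters) > target:
        best = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_relevance(pair[0], pair[1], relevance),
        )
        clusters = [c for c in clusters if c not in best]
        clusters.append(best[0] | best[1])
    return clusters

# Hypothetical documents, each reduced to a set of terms.
terms = {
    "d1": {"web", "mining", "log"},
    "d2": {"web", "mining", "usage"},
    "d3": {"soccer", "match"},
    "d4": {"soccer", "goal", "match"},
}

def common_terms(a, b):
    """Assumed relevance measure: number of terms two documents share."""
    return len(terms[a] & terms[b])

result = agglomerate(list(terms), common_terms, target=2)
print(result)
```

Taking the minimum over pairs means a merge is only attractive when every document in one cluster is relevant to every document in the other.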

10 Clustering (Cont'd)
Relevance between two documents
  – Content-Based
      The number of common terms
        – Term frequency
        – Document size factor
        – Document frequency (hard to compute)
  – Link Structure-Based
      The number of common ancestors
      The number of common descendants
      The number of direct paths between two documents
      Cf. Shortest path between two documents
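The link-structure quantities above can be computed from a directed link graph. The sketch below is illustrative only: the graph is hypothetical, "direct paths" is interpreted here as direct hyperlinks in either direction (an assumption), and how the counts are weighted into a single relevance score is left open because the slides do not say.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following edges (excluding start)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Build the reversed graph so ancestors can be found by reachability."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev

def link_features(graph, a, b):
    """Count common ancestors, common descendants, and direct links of a and b."""
    rev = reverse(graph)
    common_anc = (reachable(rev, a) & reachable(rev, b)) - {a, b}
    common_desc = (reachable(graph, a) & reachable(graph, b)) - {a, b}
    direct = int(b in graph.get(a, ())) + int(a in graph.get(b, ()))
    return {
        "common_ancestors": len(common_anc),
        "common_descendants": len(common_desc),
        "direct_links": direct,
    }

# Hypothetical link graph: page -> pages it links to.
links = {"home": {"a", "b"}, "a": {"b", "c"}, "b": {"c"}, "c": set()}
features = link_features(links, "a", "b")
print(features)
```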

11 Web Usage Mining
Analysis of Web access logs, referral logs, and user profiles to obtain Web usage information.
Preprocessing
  – Data cleaning, user identification, actual path identification, transaction identification, session identification
  – Local caches and proxy servers make these steps difficult.
Pattern discovery
  – Association rules, sequential patterns, classification rules, clustering analysis
Analysis of discovered patterns
  – Visualization (WebWiz), OLAP, query language (WEBMINER)
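Two of the preprocessing steps, data cleaning and session identification, can be sketched as follows. The log records, the list of image suffixes, the use of the client IP address as the user identifier, and the 30-minute inactivity timeout are all assumptions for illustration; they are not taken from the presentation.

```python
from datetime import datetime, timedelta

def sessionize(entries, timeout=timedelta(minutes=30)):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts when the gap since the user's previous request
    exceeds `timeout` (an assumed heuristic).
    """
    sessions = {}
    for user, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1][0] <= timeout:
            user_sessions[-1].append((ts, url))
        else:
            user_sessions.append([(ts, url)])
    return sessions

# Hypothetical access-log records: (client IP, time, requested URL).
log = [
    ("10.0.0.1", datetime(2000, 3, 1, 9, 0), "/company/products"),
    ("10.0.0.1", datetime(2000, 3, 1, 9, 1), "/images/logo.gif"),
    ("10.0.0.1", datetime(2000, 3, 1, 9, 5), "/company/product1"),
    ("10.0.0.1", datetime(2000, 3, 1, 11, 0), "/company/special"),
    ("10.0.0.2", datetime(2000, 3, 1, 9, 2), "/company/product2"),
]

# Data cleaning: drop requests for embedded images.
cleaned = [e for e in log if not e[2].endswith((".gif", ".jpg", ".png"))]
sessions = sessionize(cleaned)
print(sessions)
```

Identifying users by IP address is exactly what caches and proxy servers undermine: cached pages never reach the log, and many users behind one proxy share an address.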

12 Patterns in Web Usage
Association rules
  – 40% of clients who accessed the Web page with URL /company/product1 also accessed /company/product2.
  – 30% of clients who accessed /company/special placed an online order in /company/product1.
Sequential patterns
  – 30% of clients who visited /company/products had done a search in Yahoo within the past week on keyword w.
  – 60% of clients who placed an online order in /company/product1 also placed an online order in /company/product4 within 15 days.
Classification rules
  – Clients from state or government agencies who visit the site tend to be interested in the page /company/product1.
  – 50% of clients who placed an online order in /company/product2 were in the 20-25 age group and lived on the West Coast.
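The percentage in an association rule such as the first one above is the rule's confidence: among the sessions that contain the first page, the fraction that also contain the second. A minimal sketch, with hypothetical session data constructed so that the confidence comes out at 40%:

```python
def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_a = [s for s in sessions if antecedent in s]
    if not with_a:
        return 0.0
    return sum(consequent in s for s in with_a) / len(with_a)

# Hypothetical sessions, each the set of pages one client accessed.
visits = [
    {"/company/product1", "/company/product2"},
    {"/company/product1", "/company/product2", "/company/special"},
    {"/company/product1"},
    {"/company/product1", "/company/special"},
    {"/company/product1", "/company/products"},
    {"/company/product2"},
]
conf = confidence(visits, "/company/product1", "/company/product2")
print(f"{conf:.0%}")
```

Five sessions contain /company/product1 and two of those also contain /company/product2, giving 2/5.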

13 A General Architecture for Web Usage Mining
From R. Cooley et al., “Web Mining: Information and Pattern Discovery on the World Wide Web”, ICTAI97

14 IBM Intelligent Miner for Text
Extract key information from text
  – Language identification based on a set of training documents in the languages
  – Feature extraction based on Information Quotient (IQ)
      Names of people, organizations, places
        – Linguistically motivated heuristics that exploit typography and other regularities of languages
      Multiword terms
        – Heuristics based on a dictionary containing part-of-speech information for English words; they use simple pattern matching to find expressions with noun-phrase structure.
      Abbreviations
      Dates, currency amounts
Organize documents by subject
  – Hierarchical clustering based on lexical affinities
  – Cf. Overlap of single words vs. semantic analysis
Find the predominant themes in a collection of documents
Search for relevant documents using flexible queries
  – Boolean queries with wild cards, free text queries, hybrid queries

15 Semio's Automatic Taxonomy Building
Three groups of layers in Semio Taxonomy
  – Ontology: The highest levels of the directory. These levels are primarily containers for other categories, not for specific documents. The topmost level is provided by the directory owner, while subsequent levels are provided from the Semio Topic Library.
  – Taxonomy: Semio Builder automatically generates two levels of taxonomy structure using a patented technique based on computational semiotics.
  – Thesaurus: It contains “related to” links between concepts in the collection. Semio Builder automatically generates “related to” links.

16 Concluding Remarks
Diverse types of Web Mining targets
Data preparation for Web Mining
Parallel and scalable Web Mining solutions
Capturing common operators

