Advanced Topics in Data Mining: Web Mining


Advanced Topics in Data Mining: Web Mining

Web Mining

Applications are ported to the Web at a rapid pace
On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just “search” on the Web
How does Amazon do it?
Understanding Web user behavior is important
–It can improve Web page organization
–It can increase Web server performance
–It can exploit Web advertising
–It can increase business opportunity

Amazon Web Page Association Rules

More Information Desired
Collecting only statistical information (page hits) is insufficient, since:
–The hit frequency of a page depends not only on its content but also on its location
–The number of users accessing a page is not available
–Information on which pages are accessed together is not available
Data mining in the Web (Web Mining)
–Web Access Pattern Collection
–Web User Pattern Mining

Web Access Pattern Collection
Server-Based Data Collection
–Who is visiting a given Web site, and what are they doing?
Agent-Based Data Collection
–Which Web sites has a particular user visited?

Server-Based Data Collection
Examine the logs collected by HTTPd
–Access Log (IP, Time, Access Data), Referred Log (A → B), Error Log, …
–We can combine some of them for our use if necessary
Problems
–The use of proxy servers
–The effect of caching

Server-Based Data Collection

Access Log
IP/Domain Name | Time | Access Data
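As a rough sketch of how the three access-log fields (IP/domain name, time, access data) could be pulled out of a log line, the Python below parses one entry in the Common Log Format. The log format, the sample line, and the function name `parse_access_line` are assumptions made for illustration; the slides do not say which format the HTTPd server writes.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
#   host ident authuser [time] "request" status bytes
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_access_line(line):
    """Return (host, time, access_data), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), when, m.group("request")

# Hypothetical sample entry, not taken from the slides.
sample = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
host, when, request = parse_access_line(sample)
print(host, when.isoformat(), request)
```

Lines that do not match return `None`, so malformed entries can be skipped rather than aborting the whole scan.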

Referred Log (the caching problem is not considered)

Server-Based Data Collection
Has to be done in accordance with technology advances
–The use of Active Server Pages (Session ID available)
  The use of proxy servers
  The effect of caching
–HTTPd 1.1
Limitation
–Can only capture user behavior while users are within this site

Agent-Based Data Collection
Understanding individual Web behavior needs client-based data collection
Results are useful
–Better Personalized Service
–Improved Web Page Organization
–Better Pricing Policies
Methods
–Applets can only read/write files on their source servers, a big security constraint
–Using Active Components (ActiveX Control) and PlugIns
APCS (Access Pattern Collection Server)

APCS

Agent-Based Data Collection
Very difficult to do for non-registered users in the current Web environment
–It has to be conducted with users’ consent
Very dependent upon available Web technologies

Web User Pattern Mining
Web user pattern mining is to discover user access patterns in Web servers
Pattern discovery and analysis tools
–Some existing Web tools provide mechanisms for reporting user activity in the servers
–Web Trends
–Open Market
–Net.Genesis

Path Traversal Patterns Mining
Mining path traversal patterns in a distributed information providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
Solution procedure consists of three steps:
–Convert the original sequence of log data into a set of maximal forward references (MF)
  Filter out the effect of some backward references
  –Backward references are mainly made for ease of traveling; concentrate on mining meaningful user access sequences
  –Some objects are visited because of their locations rather than their content
–Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
–Determine the maximal reference sequences from large reference sequences (Trivial)

Step1: MF References
Suppose the traversal log contains the following traversal path for a user:
–A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
When a backward reference occurs, a forward reference path terminates.
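The conversion into maximal forward references can be sketched as follows. This is a minimal illustration, assuming single-letter page names as in the example; the function name `maximal_forward_references` is made up here and is not from the slides.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    current = []            # forward path built so far
    moving_forward = False  # True while the last move was a forward reference
    result = []
    for page in path:
        if page in current:
            # Backward reference: the forward path terminates here.
            if moving_forward:
                result.append("".join(current))
            # Keep the path only up to (and including) the revisited page.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    if moving_forward:
        result.append("".join(current))
    return result

path = list("ABCDCBEGHGWAOUOV")
print(maximal_forward_references(path))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

A run of consecutive backward moves (D to C to B above) emits only one reference, because the flag is cleared after the first backward step.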

Step1: Another Example

Step1: Arrange Database Encoding

Step1: Database Reduction

Step2: Find Frequent Reference Sequences
Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
–Full-Scan (FS) Algorithm
  FS utilizes key ideas of the DHP algorithm
–Selective-Scan (SS) Algorithm
  SS reduces the number of database scans

Full-Scan (FS) Algorithm Scan DB-1 Generate L1 & Hash Table

Scan DB-1 h(x,y) = [ ( order of x ) * 23 + ( order of y ) ] mod 17
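A small sketch of how this hash function could fill the hash table during the first scan. Two things are assumed here and are not stated on the slide: pages are single uppercase letters, and "order of x" means alphabetical position (A = 1, B = 2, ...). The input sequences are the maximal forward references from the Step 1 example.

```python
from collections import Counter

def order(page):
    # Assumption: alphabetical position, A=1, B=2, ...
    return ord(page) - ord("A") + 1

def h(x, y):
    # Hash function from the slide.
    return (order(x) * 23 + order(y)) % 17

refs = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]

# Hash every consecutive pair of references into a bucket and count it.
buckets = Counter()
for ref in refs:
    for x, y in zip(ref, ref[1:]):
        buckets[h(x, y)] += 1

print(h("A", "B"), buckets[h("A", "B")])
```

In the DHP approach, a candidate 2-sequence whose bucket count falls below the minimum support cannot be frequent, so it can be dropped before C2 is counted.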

Generate C2

Generate L2 & Reduce DB Scan DB-2

Generate L2 & Reduce DB Scan DB-2

Generate C3, L3 & Reduce DB Scan DB-3

Generate C4, L4 & Reduce DB Scan DB-4

Selective-Scan (SS) Algorithm Scan DB-3

Step 3: Generate Frequent Traversal Patterns Maximal Reference Sequences
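The last step can be sketched in a few lines: a large reference sequence is maximal when it is not contained, as a run of consecutive references, in any other large reference sequence. The input list below is hypothetical and only illustrates the filter; page names are assumed to be single characters so that substring containment matches consecutive containment.

```python
def maximal_sequences(large):
    """Keep the sequences that are not a consecutive part of another one."""
    return [s for s in large if not any(s != t and s in t for t in large)]

# Hypothetical large reference sequences, not taken from the slides.
large = ["AB", "BE", "ABE", "CG", "GH", "CGH", "AD"]
print(maximal_sequences(large))
# ['ABE', 'CGH', 'AD']
```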

WAP-Mine Algorithm
The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS

WAS
User ID | Web Access Sequence
100 | abdac
200 | eaebcac
300 | babfaec
400 | afbacfc

WAP-Mine Algorithm
(1) Scan WAS once, find all frequent-1 events
(2) Scan WAS again, construct a WAP-tree
(3) Recursively mine the WAP-tree using conditional search
Result: access patterns

Find All Frequent-1 Events
User ID | Web Access Sequence
100 | abdac
200 | eaebcac
300 | babfaec
400 | afbacfc

Min_Sup = 75%
Item | Support frequency
a | 4
b | 4
c | 4
d | 1
e | 2
f | 2

User ID | Web Access Sequence | Frequent Subsequence
100 | abdac | abac
200 | eaebcac | abcac
300 | babfaec | babac
400 | afbacfc | abacc
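The first scan can be sketched as below, using the WAS database and the 75% threshold from the slide. The function names are illustrative only. Support is counted once per sequence, no matter how often the event repeats inside it.

```python
from collections import Counter
import math

was = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}
min_sup = 0.75

def frequent_events(db, min_sup):
    """Return (set of frequent events, support of every event)."""
    support = Counter()
    for seq in db.values():
        support.update(set(seq))  # each event counts once per sequence
    threshold = math.ceil(min_sup * len(db))
    return {e for e, n in support.items() if n >= threshold}, support

def frequent_subsequences(db, frequent):
    """Drop the non-frequent events from every access sequence."""
    return {uid: "".join(e for e in seq if e in frequent)
            for uid, seq in db.items()}

frequent, support = frequent_events(was, min_sup)
filtered = frequent_subsequences(was, frequent)
print(sorted(frequent))  # ['a', 'b', 'c']
print(filtered)
# {100: 'abac', 200: 'abcac', 300: 'babac', 400: 'abacc'}
```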

WAP-Tree Construction
Using frequent events to register all count information for further mining
User ID | Frequent Subsequence
100 | abac
200 | abcac
300 | babac
400 | abacc
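A minimal sketch of the tree construction, assuming the WAP-tree is built like a prefix tree in which every node stores how many frequent subsequences pass through it. The class and function names are made up for illustration, and the header links that chain together nodes with the same label are omitted here.

```python
class Node:
    def __init__(self, event):
        self.event = event
        self.count = 0
        self.children = {}

def build_wap_tree(sequences):
    """Insert every frequent subsequence into a shared prefix tree."""
    root = Node(None)
    for seq in sequences:
        node = root
        for event in seq:
            if event not in node.children:
                node.children[event] = Node(event)
            node = node.children[event]
            node.count += 1  # one more sequence shares this prefix
    return root

tree = build_wap_tree(["abac", "abcac", "babac", "abacc"])
a = tree.children["a"]
print(a.count, a.children["b"].count)  # 3 3
```

Three of the four subsequences start with "ab", so they share the first two nodes; only "babac" opens a second branch under the root.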

Mining Web Access Patterns from WAP-Tree SequenceCount aba2 ab1 abca1 ab baba1 abac1 aba Conditional Sequence Based on c SequenceCount aba1 abca1 baba1 abac1 ItemSupFrequency a4 b4 c2 Generate Web Access Patterns: ac, bc

Mining Web Access Patterns from WAP-Tree Conditional Sequence Based on ac SequenceCount ab3 b1 bab1 b SequenceCount ab3 bab1 ItemSupFrequency a4 b4 Generate Web Access Patterns: aac, bac

Mining Web Access Patterns from WAP-Tree Conditional Sequence Based on bac SequenceCount a3 ba1 ItemSu p Frequent a4 b1 Generate Web Access Patterns: abac

Mining Web Access Patterns from WAP-Tree Conditional Sequence Based on abac No Web Access Patterns are Generated SequenceCount a4

Mining for Web Transactions
To capture Web customer buying behavior
–It is not just a market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)
–It is not just Web user travel patterns (Path Traversal Patterns)
–It is an extension of path traversal patterns
Exploring the relationship between traveling and buying

Mining for Web Transactions
Web Transaction → Algorithm WR (Web-transaction-Record) → Web Transaction Records → Algorithm WTM, MTS PJ, MTS PC → Frequent Transaction Patterns → Web Transaction Association Rules

Mining for Web Transactions
Web-transaction-Record (WR) Algorithm
–Extract meaningful Web transaction records from the given Web transaction
WTM (Web Transaction Mining) Algorithm
–Mining Web Transaction Patterns
MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM

Mining for Web Transactions

WTM Algorithm
It joins the purchased itemsets for generating candidate transaction patterns
WTM employs a two-level hash tree, called Web transaction tree, to store candidate transaction patterns
–WTM hashes not only each item but also each purchase in the path

WTM Algorithm
Web Transaction DATABASE
WT_ID | Path | Purchase
100 | ABCE | B{i1}, C{i2}, E{i4}
100 | ABFGH | B{i1}, H{i6}
100 | ASJL | S{i7}, L{i9}
200 | ABCE | B{i1}, C{i2}, E{i4}
200 | ASJLQ | S{i7}, Q{i10}
300 | ABCE | B{i1}, E{i4}
300 | ABFG | B{i1}, G{i5}
300 | ASJL | S{i7}, J{i8}, L{i9}
400 | ABD | D{i3}
400 | ABFG | G{i5}
400 | ASJLQ | S{i7}, J{i8}, Q{i10}

Support Count
WT_ID | Path | Purchase
100 | ABCE | B{i1}, C{i2}, E{i4}
100 | ABFGH | B{i1}, H{i6}
100 | ASJL | S{i7}, L{i9}
200 | ABCE | B{i1}, C{i2}, E{i4}
200 | ASJLQ | S{i7}, Q{i10}

Path | Purchase | Support Count
AB | B{i1} | 2
ABC | C{i2} | 2

WTM Algorithm
C1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABD | D{i3} | 1
ABCE | E{i4} | 3
ABFG | G{i5} | 2
ABFGH | H{i6} | 1
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2

Support Count >= 2

T1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABCE | E{i4} | 3
ABFG | G{i5} | 2
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2
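The support counts in C1 can be reproduced with a brute-force count over the Web transaction database. This sketch is not the WTM algorithm itself: it skips candidate joining and the Web transaction hash tree and simply enumerates every combination of k purchases on one path, recording the path prefix up to the last purchase. Each pattern is counted once per WT_ID. The data layout and the name `pattern_support` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# WT_ID -> list of (path, purchases); each purchase is (node, item).
db = {
    100: [("ABCE", [("B", "i1"), ("C", "i2"), ("E", "i4")]),
          ("ABFGH", [("B", "i1"), ("H", "i6")]),
          ("ASJL", [("S", "i7"), ("L", "i9")])],
    200: [("ABCE", [("B", "i1"), ("C", "i2"), ("E", "i4")]),
          ("ASJLQ", [("S", "i7"), ("Q", "i10")])],
    300: [("ABCE", [("B", "i1"), ("E", "i4")]),
          ("ABFG", [("B", "i1"), ("G", "i5")]),
          ("ASJL", [("S", "i7"), ("J", "i8"), ("L", "i9")])],
    400: [("ABD", [("D", "i3")]),
          ("ABFG", [("G", "i5")]),
          ("ASJLQ", [("S", "i7"), ("J", "i8"), ("Q", "i10")])],
}

def pattern_support(db, k):
    """Support count of every k-purchase pattern, counted once per WT_ID."""
    support = Counter()
    for records in db.values():
        seen = set()
        for path, purchases in records:
            for combo in combinations(purchases, k):
                last_node = combo[-1][0]
                prefix = path[: path.index(last_node) + 1]
                seen.add((prefix, combo))
        support.update(seen)
    return support

c1 = pattern_support(db, 1)
t1 = {p: n for p, n in c1.items() if n >= 2}
print(c1[("AB", (("B", "i1"),))], c1[("AS", (("S", "i7"),))], len(t1))
# 3 4 8
```

Calling `pattern_support(db, 2)` or `pattern_support(db, 3)` gives the support counts for longer patterns in the same way, at the cost of enumerating combinations that a candidate-based method would prune.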

WTM Algorithm
C2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
AS | B{i1} S{i7} | 0
ASJ | B{i1} J{i8} | 0
ASJLQ | J{i8} Q{i10} | 1
ASJLQ | L{i9} Q{i10} | 0

Support Count >= 2

T2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
ASJ | S{i7} J{i8} | 2
ASJL | S{i7} L{i9} | 2
ASJLQ | S{i7} Q{i10} | 2
ABCE | C{i2} E{i4} | 2

WTM Algorithm
C3
Path | Purchase | Sup.
ABCE | B{i1} C{i2} E{i4} | 2

Support Count >= 2

T3
Path | Purchase | Sup.
ABCE | B{i1} C{i2} E{i4} | 2

WTM Disadvantages
WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns
This will degrade the performance

MTS PJ Algorithm
Algorithm MTS PJ uses the maximal transaction segment, which contains the frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem
MTS PJ generates candidate transaction patterns only when the leaf node of the Web transaction tree is reached

MTS PJ Algorithm
Web Transaction DATABASE
WT_ID | Path | Purchase
100 | ABCE | B{i1}, C{i2}, E{i4}
100 | ABFGH | B{i1}, H{i6}
100 | ASJL | S{i7}, L{i9}
200 | ABCE | B{i1}, C{i2}, E{i4}
200 | ASJLQ | S{i7}, Q{i10}
300 | ABCE | B{i1}, E{i4}
300 | ABFG | B{i1}, G{i5}
300 | ASJL | S{i7}, J{i8}, L{i9}
400 | ABD | D{i3}
400 | ABFG | G{i5}
400 | ASJLQ | S{i7}, J{i8}, Q{i10}

[Figure: tree over the nodes A, B, C, D, E, F, G, H, S, J, L, Q]

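MTS PJ Algorithm
C1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABCD | D{i3} | 1
ABCE | E{i4} | 3
ABFG | G{i5} | 2
ABFGH | H{i6} | 1
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2

Support Count >= 2

T1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABCE | E{i4} | 3
ABFG | G{i5} | 2
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2

[Figure: tree over the nodes A, B, C, E, F, G, S, J, L, Q]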

MTS PJ Algorithm B{i1} C{i2} E{i4} ABCE Maximal Transaction Segment B{i1} G{i5}ABFG Sup.PurchasePath C2 S{i7} J{i8} L{i9} Q{i10} ASJLQ Maximal Transaction Segment C2 Sup.PurchasePath L{i9} Q{i10}ASJLQ J{i8} Q{i10}ASJLQ S{i7} Q{i10}ASJLQ J{i8} L{i9}ASJL S{i7} L{i9}ASJL S{i7} J{i8}ASJ C2 B{i1} C{i2}ABC C{i2} E{i4}ABCE B{i1} E{i4}ABCE Sup.PurchasePathB{i1} G{i5} ABFG Maximal Transaction Segment F G J L Q S E C A B

MTS PJ Algorithm
C2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
ABCE | C{i2} E{i4} | 2
ABFG | B{i1} G{i5} | 1
ASJ | S{i7} J{i8} | 2
ASJL | S{i7} L{i9} | 2
ASJL | J{i8} L{i9} | 1
ASJLQ | S{i7} Q{i10} | 2
ASJLQ | J{i8} Q{i10} | 1
ASJLQ | L{i9} Q{i10} | 0

T2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
ABCE | C{i2} E{i4} | 2
ASJ | S{i7} J{i8} | 2
ASJL | S{i7} L{i9} | 2
ASJLQ | S{i7} Q{i10} | 2

MTS PJ Algorithm
[Figure: tree over the nodes A, B, C, E, S, J, L, Q]
Maximal Transaction Segment: ABCE, B{i1} C{i2} E{i4}

C3
Path | Purchase | Sup.
ABCE | B{i1} C{i2} E{i4} | 2

MTS PC Algorithm
MTSPC utilizes the LC (Large Count) to Filter Candidates

C1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABCD | D{i3} | 1
ABCE | E{i4} | 3
ABFG | G{i5} | 2
ABFGH | H{i6} | 1
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2

Support Count >= 2

T1
Path | Purchase | Sup.
AB | B{i1} | 3
ABC | C{i2} | 2
ABCE | E{i4} | 3
ABFG | G{i5} | 2
AS | S{i7} | 4
ASJ | J{i8} | 2
ASJL | L{i9} | 2
ASJLQ | Q{i10} | 2

[Figure: tree over the nodes A, B, C, E, F, G, S, J, L, Q]

MTS PC Algorithm F G J L Q S E C A B 1E{i4} 1C{i2} 1B{i1} ABCE LCItemMaximal Path Maximal Transaction Segment K=1 |I| = 3 > 1 (K-1) C2 2B{i1} C{i2}ABC 2 C{i2} E{i4}ABCE 3 B{i1} E{i4}ABCE Sup.PurchasePath Maximal Transaction Segment Maximal PathItemLC ASJLQ S{i7}1 J{i8}1 L{i9}1 Q{i10}1 |I| = 4 > 1 C2 Sup.PurchasePath 0L{i9} Q{i10}ASJLQ 1J{i8} Q{i10}ASJLQ 2S{i7} Q{i10}ASJLQ 1J{i8} L{i9}ASJL 2S{i7} L{i9}ASJL 2S{i7} J{i8}ASJ Maximal Transaction Segment Maximal PathItemLC ABFG B{i1}1 G{i5}1 |I| = 2 > 1 1B{i1} G{i5}ABFG Sup.PurchasePath C2

MTS PC Algorithm
C2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
ABCE | C{i2} E{i4} | 2
ABFG | B{i1} G{i5} | 1
ASJ | S{i7} J{i8} | 2
ASJL | S{i7} L{i9} | 2
ASJL | J{i8} L{i9} | 1
ASJLQ | S{i7} Q{i10} | 2
ASJLQ | J{i8} Q{i10} | 1
ASJLQ | L{i9} Q{i10} | 0

T2
Path | Purchase | Sup.
ABC | B{i1} C{i2} | 2
ABCE | B{i1} E{i4} | 3
ABCE | C{i2} E{i4} | 2
ASJ | S{i7} J{i8} | 2
ASJL | S{i7} L{i9} | 2
ASJLQ | S{i7} Q{i10} | 2

MTS PC Algorithm Maximal Transaction Segment Maximal PathItemLC ASJLQ S{i7}3 J{i8}1 L{i9}1 Q{i10}1 |I| = 3 > 2 2E{i4} 2C{i2} 2B{i1} ABCE LCItemMaximal Path Maximal Transaction Segment K=2 |I| = 1 < 2 J L Q S E C A B B{i1} C{i2} E{i4} ABCE PurchasePath C3 No Generations PathPurchaseSup. ABCB{i1} C{i2}2 ABCEB{i1} E{i4}3 ABCEC{i2} E{i4}2 ASJS{i7} J{i8}2 ASJLS{i7} L{i9}2 ASJLQS{i7} Q{i10}2 T2

Mining for Web Transactions = 2 = 3 We can derive E{4}> –support_count( E{4}>) = 2 –confidence( E{4}>) =

Summary
Data mining in the Web is an area of growing importance
–In particular, with the emergence of electronic commerce (EC)
–More and more applications will benefit from the knowledge gained from data mining
Web Mining = Web Data Collection + Traditional Data Mining?
Important Issues
–Incremental Web Mining