Smart Miner: A New Framework for Mining Large Scale Web Usage Data

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Web Mining.
Web Usage Mining Web Usage Mining (Clickstream Analysis) Mark Levene (Follow the links to learn more!)
Google News Personalization: Scalable Online Collaborative Filtering
Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Data e Web Mining Paolo Gobbo
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Chapter 12: Web Usage Mining - An introduction
Xyleme A Dynamic Warehouse for XML Data of the Web.
Algorithms and Problem Solving-1 Algorithms and Problem Solving.
Algorithms and Problem Solving. Learn about problem solving skills Explore the algorithmic approach for problem solving Learn about algorithm development.
Chapter 2: Algorithm Discovery and Design
Aho-Corasick String Matching An Efficient String Matching.
Discovery of Aggregate Usage Profiles for Web Personalization
Web Usage Mining - W hat, W hy, ho W Presented by:Roopa Datla Jinguang Liu.
Chapter 2: Algorithm Discovery and Design
Chapter 2: Algorithm Discovery and Design
Query Log Analysis Naama Kraus Slides are based on the papers: Andrei Broder, A taxonomy of web search Ricardo Baeza-Yates, Graphs from Search Engine Queries.
Prof. Vishnuprasad Nagadevara Indian Institute of Management Bangalore
By Ravi Shankar Dubasi Sivani Kavuri A Popularity-Based Prediction Model for Web Prefetching.
FALL 2012 DSCI5240 Graduate Presentation By Xxxxxxx.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
A Web Crawler Design for Data Mining
Chapter 2: Algorithm Discovery and Design Invitation to Computer Science, C++ Version, Third Edition.
南台科技大學 資訊工程系 A web page usage prediction scheme using sequence indexing and clustering techniques Adviser: Yu-Chiang Li Speaker: Gung-Shian Lin Date:2010/10/15.
An Iterative Heuristic for State Justification in Sequential Automatic Test Pattern Generation Aiman H. El-MalehSadiq M. SaitSyed Z. Shazli Department.
Log files presented to : Sir Adnan presented by: SHAH RUKH.
Chapter 12: Web Usage Mining - An introduction Chapter written by Bamshad Mobasher Many slides are from a tutorial given by B. Berendt, B. Mobasher, M.
Srivastava J., Cooley R., Deshpande M, Tan P.N.
1 Murat Ali Bayır Middle East Technical University Department of Computer Engineering Ankara, Turkey A New Reactive Method for Processing Web Usage Data.
Java server pages. A JSP file basically contains HTML, but with embedded JSP tags with snippets of Java code inside them. A JSP file basically contains.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Web-Mining …searching for the knowledge on the Internet… Marko Grobelnik Institut Jožef Stefan.
1 Channel Coding (III) Channel Decoding. ECED of 15 Topics today u Viterbi decoding –trellis diagram –surviving path –ending the decoding u Soft.
Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Self-Organized Web Usage Regularities. Problems of foraging information on WWW Slow accession Difficulty in finding useful information is related to balkanization.
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
Effective Anomaly Detection with Scarce Training Data Presenter: 葉倚任 Author: W. Robertson, F. Maggi, C. Kruegel and G. Vigna NDSS
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
An Energy-Efficient Approach for Real-Time Tracking of Moving Objects in Multi-Level Sensor Networks Vincent S. Tseng, Eric H. C. Lu, & Kawuu W. Lin Institute.
Chapter 2: Algorithm Discovery and Design Invitation to Computer Science.
Improvement of Apriori Algorithm in Log mining Junghee Jaeho Information and Communications University,
Introduction to Machine Learning, its potential usage in network area,
Data mining in web applications
Hashing (part 2) CSE 2011 Winter March 2018.
OPERATING SYSTEMS CS 3502 Fall 2017
Algorithms and Problem Solving
PIWIK JUNIOR TIDAL ASSOCIATE PROF., WEB SERVICES & MULTIMEDIA LIBRARIAN NEW YORK CITY COLLEGE OF TECHNOLOGY, CUNY.
Chapter 7. Classification and Prediction
Effective Prediction of Web-user Accesses: A Data Mining Approach
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
A Study of Group-Tree Matching in Large Scale Group Communications
A paper on Join Synopses for Approximate Query Answering
COSC160: Data Structures Linked Lists
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
CIS 488/588 Bruce R. Maxim UM-Dearborn
Pei Fan*, Ji Wang, Zibin Zheng, Michael R. Lyu
Aiman H. El-Maleh Sadiq M. Sait Syed Z. Shazli
Algorithms and Problem Solving
Effective Prediction of Web-user Accesses: A Data Mining Approach
AGMLAB Information Technologies
Korea University of Technology and Education
Data Structures & Algorithms
Discovery of Significant Usage Patterns from Clickstream Data
M. Kezunovic (P.I.) S. S. Luo D. Ristanovic Texas A&M University
Computational Advertising and
Kyriakos Kritikos and Dimitris Plexousakis ICS-FORTH
Presentation transcript:

Smart Miner: A New Framework for Mining Large Scale Web Usage Data Murat Ali Bayır Ismail Hakkı Toroslu Ahmet Coşar Güven Fidan Sponsored BY

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Motivation and Contributions Why we design Smart Miner? Mining Web Usage Logs is important process for improving commercial web sites structurally and semantically. Due to security reasons and changes in the internal structure of the web sites, site owners may want to keep their own data and process their server logs offline. (installation code based systems are not attractive) For the web sites with large visitor population, the problem should be handled by distributed system due to size of incoming click through data everyday. Site owners may want to execute intelligent filters over usage data to detect specific type of user behavior or analyzing usage of specific structure in the web site. To solve the problems addressed above we have proposed Smart Miner framework.

Motivation and Contributions Implemented Scalable Framework for WUM Modeled Session Construction problem as Graph Problem and proposed Smart-SRA. Proposed Efficient Solution for Pattern Discovery Proposed new Web User Model for Simulation purposes Map/Reduce version of Session Construction and Pattern Discovery Show scalability framework over very large real data Decision Support System Application

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Smart Miner Framework Session Construction Pattern Discovery Pattern Analysis and Post Processing Server Logs Web User Sessions Rules and Patterns Interesting Knowledge Site Topology Optional Semantic Data Optional

Session Construction Most important phase of WUM since the later phases depends on the sessions generated by this phase. General Session (Definition): a sequence of web page requests by single user on particular web domain for a specific time period. Two Traditional Approaches for Session Construction from Server Logs (called Reactive Approaches) Time Oriented Approach Navigation Oriented Approach

Session Construction Time Oriented Approaches do not consider link information. Type-1: Duration of a discovered session is limited with a threshold 1 Type-2: Time spend on any page limited with threshold 2 Navigation Oriented Approaches considers link information. Adding page PN+1 to a session [P1, P2, …, PN] is performed as follows: If PN has a links towards or referrer of PN+1 (1) [P1, P2, …, PN, PN+1] Else Pk is the nearest page satisfying (1) for PN+1 add backward browser movement until Pk [P1, P2, …, PN, PN-1, PN-2, …, Pk, PN+1]

Drawbacks of Time and Navigation Oriented Heuristics Session Construction Drawbacks of Time and Navigation Oriented Heuristics Time Oriented Approach has no link information Backward Movements are major problem of Navigation Oriented Heuristics. Recent studies HCI area addressed that majority of backward movements occurs due to location of pages rather than content. Backward Movements cause performance problem in later application phases like recommender systems and page pre-fetching. We propose new session model called Link Based Session Model to solve problems mentioned here.

Link Based Session Model Session Construction Link Based Session Model A Link based Session contains a set of navigation sequences which corresponds to paths on the web graph. If S= {S1, S2,..…..Sn} is a session with n navigation sequences and Sx = <Px1, … Pxi, Pxi+1, … Pxn> must satisfy the following rules:   Timestamp Ordering Rule: i: 1 i<n, Txi < Txi+1 ∀i: 1 i<n, (Txi+1 - Txi ) r (page stay time) (Txn – Tx1 )  δ (session duration time) Topology Rule: = true i:1 i<n, Link[Pxi, Pxi+1] = true OR Ref[Pxi, Pxi+1] = true Maximality Rule: Sx S there exist no Sy such that Sy S and Sx  Sy = (subsequence relation similar to substring relation)

Smart-SRA P13 P1 P20 P23 P49 P34 Smart-SRA (Smart Session Reconstruction Algorithm) is the new algorithm that produce link based sessions from web usage logs Smart-SRA has 2 phases. In Phase-1 Candidate Sessions are constructed from web usage logs by applying page stay and session duration time criteria. In Phase-2 Candidate Sessions are partitioned into maximal navigation sequences satisfying properties of link based session model. (Main Phase). All navigation sequences generated by Smart-SRA may overlap but they are maximal. For the candidate session [P1, P13, P49, P34, P23] two navigation sequences are generated: [P1, P13, P49, P34] [P1, P13, P49, P23]

Complexity in Extreme Cases Smart-SRA Complexity in Extreme Cases The complexity of Smart-SRA is exponential on specific cases which contains crawler like sessions including Breath First Order of web pages.   Level-1 Pages L1 Level-2 L2 Level-k Lk For the candidate session [{Level-1}, {Level-2}, …. {Level-k}] the number of navigation sequences are L1 x L2 x … Lk. , and the complexity is also exponential in terms of average length candidate sessions (len) and out degree of pages in candidate sessions (out). ~ O(len)out For the candidate sessions including depth first order of web pages the complexity is linear. O(len)  

Complexity in Average Case Smart-SRA Complexity in Average Case The average complexity of Smart-SRA is polynomial O(len 2 out2) (Proof): Phase-2 of Smart-SRA removes all pages in candidate session at each step that does not have referrer with smaller time stamp. We define an expectation function which gives the number of web pages subtracted from candidate session at each iteration: Here p(x) is the probability of removing page P from the candidate session (depends on probability of no links before any page) and f(x) is contribution of each page to sum (clearly 1). By using uniform probability model p(x) is (1-(out)/len)k-1 for k-th page in session. Complexity becomes polynomial under this assumption. In fact out tends to be very small in real user sessions and it is efficient to run Smart-SRA as we shown in experiments sections.  

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Pattern Discovery We have implemented Sequential Apriori with topological constraint, the important feature of our approach is listed below: Each pattern satisfy topology constraint. Pk of Pattern P, (k<|P|), Link[Pk][Pk+1]=TRUE. String matching must be satisfied in support checks. [1, 2, 3] does not support pattern <1, 3> although 3 comes after 1 in both of them. However, it supports <1, 2>. While calculating the supports, pattern specific statistical attributes of web pages is calculated such as average visit time of Pk in pattern P.

Pattern Discovery Sequential Apriori Inputs: Sessions constructed by Smart-SRA Minimum support for frequent patterns At each iteration we join length-k patterns with length-1 patterns (by using topological constraint) If the support of the length-k+1 pattern is greater than the required support, it becomes a supported pattern. In addition, the new length-k+1 pattern becomes maximal. The extended length-k pattern and the appended length-1 pattern become non-maximal.

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Distributed Implementation of Smart Miner We have implemented Map/Reduce version of Session Construction and Pattern Discovery phases over the cluster of 100 computers. In Map/Reduce, large computations are represented by sequence of map and reduce operators. In order to employ map/reduce, the data should be stored in the format of key/value pairs.

Distributed Implementation of Smart Miner Session Construction Mapper: read usage data containing Url and IP fields, create key: <domain(Url),IP> Reducer: get all urls belonging to same domain and same IP on same machine, execute Smart-SRA 2 Phased Pattern Discovery First Phase: Mapper: Navigation Sequences are distributed to different machines by using random hash function with respect to number of idle machines N Reducer: Frequency of each candidate pattern on each node is calculated. Second Phase: Mapper: <candidate pattern, frequency> pair is emitted (N times) Reducer: N pairs of same candidate pattern reduced to same machine, frequencies are merged to calculate support.

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Experimental Results Accuracy Experiments shows the correlation between patterns generated from sessions of heuristics and patterns generated from the real Link Based sessions that is provided by evaluation framework. These experiments also analyzes effect navigation behaviors on sessions. How the backward movements and discarding link information affect the performance of NO and TO approaches in terms of link based session evaluation Simulated Data Real Data HPC Experiments shows the scalability of the framework on huge size data.

Experimental Results Agent Simulator Agent Simulator is implemented to model web user navigation, and analyze basic navigation behavior on the effects of sessions. Our agent simulator generates sessions wrt link based model Following user behaviors are implemented: Session Termination (terminating whole session) Link From Previous Page to next Page Link From Current Page New Initial Page (start of new navigation sequence, regardless of entrance type, write address bar or enter from search engine).

Experimental Results Evaluation Framework For Evaluation, agent simulator generates correct sessions (including correct navigation sequences with known referrer) and server logs without any referrer (worst case data, only <url, IP, page> and topology given outside). Each method, Smart-SRA, TO and NO heuristics are executed and patterns generated from sessions of each heuristics are compared with correct patterns. Agent Simulator Correct Sessions Pattern Discovery Patterns Session Construction Generated Candidate Evaluate RESULT Site Topology Server Logs (No Referrer)

Experimental Results Accuracy Metric We have two sets Maximal Patterns of Agent simulator MPA Maximal Patterns of Each Heuristics MPH Accuracy is defined by geometric mean of recall and precision Recall and Precision is calculated as:

Accuracy Results on Simulated Data Experimental Results Accuracy Results on Simulated Data We define two parameters for representing navigation behavior (LPP/LPC) Link from Previous Page Prob. over Link from Current Page Prob. Bigger values leads complex sessions with different overlapping paths (NIP/STP) New Initial page Prob. over Sessions Termination Prob. Bigger values leads complex sessions with new paths may overlap or not overlapped {0.001, 0.0025, 0.005, 0.0075,and 0.01} On the average, the patterns of Smart-SRA is 30% similar to real pattern than patterns of other approaches.

Experimental Results For Evaluation on real data, we have used small sized real web site, AGMLAB company site. We replace agent simulator with client side browser action tracker script in the evaluation framework. From the action tracker logs we obtained real sessions and input logs to heuristics. We have tracked several actions like ‘backward clicks’, ‘open new browser’ etc. On the average, the patterns of Smart-SRA is 35% similar to real patterns. We have also divided real data into two parts and used the patterns of training data to construct HMM for predicting next pages from server in the test set. In this experiment, Smart-SRA is at least 25% better than other approaches in terms of predicting next page.

Performance of Distributed Implementation Experimental Results Performance of Distributed Implementation Our data comes from customers of AGMLAB including web site of biggest telephone service operator having 35 billion number subscriber, their web site is daily visited by more than 200K web user.

Outline Motivation and Contributions Smart Miner Framework Session Construction Problem Pattern Discovery Distributed Implementation of Smart Miner Experimental Results Decision Support System Application

Decision Support System We illustrate the benefits of our framework on Decision Support System application for analyzing site specific navigation behavior of web users. For Decision support systems, we have defined Filter Language over frequent patterns that includes the following 5 primitives; A filter is defined as collection of primitives that has nested structure with AND, OR operations between groups or primitives like: ((P1 AND P2) OR P3 OR P4) A set of filters are executed over frequent patterns by using map reduce job and a set of patterns that evaluates each filter TRUE are listed separately.

Decision Support System By using our filter language one can design a filter that analyzes the followings: Navigation Behavior of web users provided via external traffic through advertisement. Navigation Behavior of web users after visiting given fixed page. Common exit pages of the web site. The patterns leads to common errors (like ending with 404 page error). The set of web pages that are less important in popular pattern in terms of elapsed time.

Conclusion Questions? Comments. 