1 Web Usage Mining Modelling: frequent-pattern mining I (sequence mining with WUM), classification, and clustering. Prof. Dr. Bettina Berendt, Humboldt Univ.


1 Web Usage Mining Modelling: frequent-pattern mining I (sequence mining with WUM), classification, and clustering. Prof. Dr. Bettina Berendt, Humboldt Univ. Berlin, Germany

2 Please note
These slides use and/or refer to a lot of material available on the Internet. To reduce clutter, credits and hyperlinks are given in the following ways:
- Slides adapted from other people's materials: at the bottom of the slide
- Pictures, screenshots, etc.: URL visible in the screenshot or given in the PPT „Comments" field
- Literature, software: on the accompanying Web site
Thanks to the Internet community! You are invited to re-use these materials, but please give the proper credit.

3 Stages of knowledge discovery discussed in this lecture Application understanding

4 An addendum to association rules: main interestingness measures (and a recommendation for postprocessing the result set)
- Support of a rule A → B = no. of instances with A and B / no. of all instances
- Confidence of a rule A → B = no. of instances with A and B / no. of instances with A = support(A & B) / support(A)
- Lift of a rule A → B = support(A & B) / [ support(A) * support(B) ]
  - What does this measure, and in what numerical interval can it be?
- Deleting redundant rules from the result set:
  - If you have A → B and A & C → B, the second rule is redundant.
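The three measures can be computed directly from transaction data. A minimal Python sketch (the toy transactions and item names are invented for illustration):

```python
# Toy transaction database (hypothetical; each transaction is a set of items).
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"A", "B"},
]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A & B) / support(A)"""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """support(A & B) / (support(A) * support(B))"""
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))
```

As to the question on the slide: lift lies in [0, ∞); lift = 1 means the antecedent and consequent occur independently, lift > 1 means they co-occur more often than expected under independence.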

5 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

6 Demonstration of WUM

7 The site
Business understanding / problem definition:
- How do users search in this online catalog?
- Which search criteria are popular?
- Which are efficient?
[Berendt & Spiliopoulou, VLDB Journal 2000]

8 The concept hierarchies / site ontology (excerpt): SEITE1-...LI (1st page of a list) or SEITEn-...LI (further page); LA („Land" = federal state), SA („Schulart" = school type), SU („Suche" = search)

9 Sequence mining – one result pattern: successful search for a school in Germany (pattern elements in the figure: a refinement, a repetition, a continuation). One example pattern:

select t from node a b, template a * b as t
where a.url startswith "SEITE1-"
and a.occurrence = 1
and b.url contains "1SCHULE"
and b.occurrence = 1
and (b.support / a.support) >= 0.2

(Berendt & Spiliopoulou, VLDB J. 2000)
Example URL: /liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=

10 Sequences

11 Generalized sequences, navigation patterns, hits in WUM

12 Aggregated Logs: The basic internal representation in WUM

13 The confidence measure for generalized sequences

14 Templates in the query language MINT, g-sequences, and navigation patterns

15 Interestingness measures: Support (hits) and confidence

16 Aggregated Logs, queries, and query results

17 The basic idea of the WUM algorithm

18 MINT can express 3 types of constraints (“predicates“)

19 The WUM gseqm algorithm (B predicates)

20 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

21 “What makes people happy?” – a corpus-based approach to finding happiness

22 Bayes' formula and its use for classification
1. Joint probabilities and conditional probabilities: basics
- P(A & B) = P(A|B) * P(B) = P(B|A) * P(A)
- ⇒ P(A|B) = ( P(B|A) * P(A) ) / P(B) (Bayes' formula)
- P(A): prior probability of A (a hypothesis, e.g. that an object belongs to a certain class)
- P(A|B): posterior probability of A (given the evidence B)
2. Estimation:
- Estimate P(A) by the frequency of A in the training set (i.e., the number of A instances divided by the total number of instances)
- Estimate P(B|A) by the frequency of B within the class-A instances (i.e., the number of A instances that have B divided by the total number of class-A instances)
3. Decision rule for classifying an instance:
- If there are two possible hypotheses/classes (A and ~A, where ~A is "not A"), choose the one that is more probable given the evidence
- If P(A|B) > P(~A|B), choose A
- The denominators are equal ⇒ If ( P(B|A) * P(A) ) > ( P(B|~A) * P(~A) ), choose A
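The estimation and decision steps can be sketched with hypothetical training-set counts (all numbers invented for illustration):

```python
# Hypothetical counts from a training set.
n_total = 100
n_A = 30           # instances of class A
n_B_and_A = 24     # class-A instances showing evidence B
n_B_and_notA = 14  # class-~A instances showing evidence B

# Step 2: estimate probabilities by relative frequencies.
p_A = n_A / n_total
p_notA = 1 - p_A
p_B_given_A = n_B_and_A / n_A
p_B_given_notA = n_B_and_notA / (n_total - n_A)

# Step 3: the denominators P(B) are equal, so it suffices to
# compare P(B|A) * P(A) with P(B|~A) * P(~A).
choose_A = p_B_given_A * p_A > p_B_given_notA * p_notA
```

With these counts, 0.8 * 0.3 = 0.24 exceeds 0.2 * 0.7 = 0.14, so the rule chooses A.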

23 Simplifications and Naive Bayes
4. Simplify by setting the priors equal (i.e., by using as many instances of class A as of class ~A)
- ⇒ If P(B|A) > P(B|~A), choose A
5. More than one kind of evidence
- General formula:
  P(A | B1 & B2) = P(A & B1 & B2) / P(B1 & B2)
  = P(B1 & B2 | A) * P(A) / P(B1 & B2)
  = P(B1 | B2 & A) * P(B2 | A) * P(A) / P(B1 & B2)
- Enter the "naive" assumption: B1 and B2 are independent given A
- ⇒ P(A | B1 & B2) = P(B1|A) * P(B2|A) * P(A) / P(B1 & B2)
- By reasoning as in 3. and 4. above, the last two terms can be omitted
- ⇒ If ( P(B1|A) * P(B2|A) ) > ( P(B1|~A) * P(B2|~A) ), choose A
- The generalization to n kinds of evidence is straightforward.
- These kinds of evidence are often called features in machine learning.

24 Example: Texts as bags of words
Common representations of texts:
- Set: can contain each element (word) at most once
- Bag (aka multiset): can contain each word multiple times (most common representation used in text mining)
Hypotheses and evidence:
- A = the blog is a happy blog, the e-mail is a spam e-mail, etc.
- ~A = the blog is a sad blog, the e-mail is a proper e-mail, etc.
- Bi refers to the i-th word occurring in the whole corpus of texts
Estimation for the bag-of-words representation:
- Example estimation of P(B1|A): number of occurrences of the first word in all happy blogs, divided by the total number of words in happy blogs (etc.)
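The estimation described above can be sketched as follows (the mini-corpus is invented; real happy/sad blog data would replace it):

```python
from collections import Counter

# Invented mini-corpus: each document is a bag (list) of words.
happy_docs = [["sun", "fun", "fun"], ["sun", "smile"]]
sad_docs = [["rain", "rain", "gloom"], ["rain", "tears"]]

def word_prob(word, docs):
    """Estimate P(word | class): occurrences of the word in the class's
    documents, divided by the total number of words in the class."""
    counts = Counter(w for d in docs for w in d)
    return counts[word] / sum(counts.values())

p_sun_given_happy = word_prob("sun", happy_docs)  # 2 occurrences / 5 words
p_rain_given_sad = word_prob("rain", sad_docs)    # 3 occurrences / 5 words
```

In practice one also smooths the counts (e.g. Laplace smoothing) so that words unseen in a class do not get probability zero.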

25 WEKA – NaiveBayes and NaiveBayesMultinomial
- The WEKA classifier learning scheme NaiveBayesMultinomial implements this model of „the probability that a word occurs in a document given that the document is in that class".
  - Its output is a table giving these probabilities
- The WEKA classifier learning scheme NaiveBayes assumes that the attributes are normally distributed.
  - Needed when the attributes are numerical and not necessarily 0 | 1
  - Its output describes the parameters of these normal distributions
- Explanation of the annotations of the attributes:
- Explanation of the error measures:

26 The „happiness factor" of Mihalcea & Liu (2006): "Starting with the features identified as important by the Naïve Bayes classifier (a threshold of 0.3 was used in the feature selection process), we selected all those features that had a total corpus frequency higher than 150, and consequently calculate the happiness factor of a word as the ratio between the number of occurrences in the happy blogposts and the total frequency in the corpus."
⇒ What is the relation to the Naïve Bayes estimators?

27 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

28 Clustering by information contained in the objects to be clustered (here: documents contain text) –

29 The basic idea of clustering: group similar things Group 1 Group 2 Attribute 1 Attribute 2 Based on

30 Idea and Applications
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
- It is also called unsupervised learning.
- It is a common and important task that finds many applications.
Applications in text analysis / Web content mining, e.g. for search engines:
- Structuring search results
- Suggesting related pages
- Automatic directory construction/update
- Finding near-identical/duplicate pages
Applications in Web usage mining:
- Customer/user segmentation
- User segmentation for recommender systems / personalization
Based on

31 Concepts in Clustering
- Defining distance between points:
  - Cosine distance (which you already know)
  - Overlap distance
- A good clustering is one where
  - (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
  - (Inter-cluster distance) while the distances between different clusters are maximized
  - Objective to minimize: F(Intra, Inter)
- Clusters can be evaluated with "internal" as well as "external" measures
  - Internal measures are related to the inter/intra-cluster distances
  - External measures are related to how representative the current clusters are of the "true" classes
    - See entropy and F-measure
Based on

32 Inter/Intra Cluster Distances
Intra-cluster distance: (Sum/Min/Max/Avg) the (absolute/squared) distance between
- all pairs of points in the cluster, OR
- the centroid and all points in the cluster, OR
- the "medoid" and all points in the cluster
Inter-cluster distance: sum the (squared) distance between all pairs of clusters, where the distance between two clusters is defined as
- the distance between their centroids/medoids (spherical clusters), or
- the distance between the closest pair of points belonging to the clusters (chain-shaped clusters)
From
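Two of the variants above, centroid-based intra-cluster distance and centroid-to-centroid inter-cluster distance, as a Python sketch (function names are my own):

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def intra_cluster(points):
    """Average distance between the centroid and all points in the cluster."""
    c = centroid(points)
    return sum(math.dist(c, p) for p in points) / len(points)

def inter_cluster(cluster1, cluster2):
    """Distance between the two centroids (the 'spherical clusters' variant)."""
    return math.dist(centroid(cluster1), centroid(cluster2))
```

The other variants (all pairs, medoid-based, closest pair) follow the same pattern with a different aggregation.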

33 How hard is clustering? One idea is to consider all possible clusterings, and pick the one that has the best inter- and intra-cluster distance properties. Suppose we are given n points, and would like to cluster them into k clusters.
- How many possible clusterings?
Too hard to do by brute force or optimally. Solution: iterative optimization algorithms
- Start with a clustering, iteratively improve it (e.g. k-means)
From
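The counting question has a classical answer: the number of ways to partition n points into k non-empty clusters is the Stirling number of the second kind S(n, k), which grows roughly like k^n / k!, so enumerating all clusterings is hopeless even for modest n. A small sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """S(n, k): number of ways to partition n points into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # The n-th point either forms a cluster of its own (S(n-1, k-1))
    # or joins one of the k clusters of a partition of the rest (k * S(n-1, k)).
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)
```

For example, S(10, 3) = 9330, and S(100, 5) already exceeds 10^67, which is why iterative algorithms such as k-means are used instead of exhaustive search.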

34 Classical clustering methods
Partitioning methods
- k-means (and EM), k-medoids
Hierarchical methods
- agglomerative, divisive, BIRCH
Model-based clustering methods
From

35 K-means
Works when we know k, the number of clusters we want to find
Idea:
- Randomly pick k points as the "centroids" of the k clusters
- Loop:
  - For each point, put the point in the cluster whose centroid it is closest to
  - Recompute the cluster centroids
  - Repeat the loop (until there is no change in clusters between two consecutive iterations)
Iterative improvement of the objective function: sum of the squared distance from each point to the centroid of its cluster
From
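The loop above, as a plain-Python sketch (no libraries; function and variable names are my own, and a fixed seed stands in for the random initialization):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly pick k points as centroids
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute the cluster centroids.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change between iterations -> converged
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that the result depends on the random initialization; k-means only finds a local optimum of the objective function.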

36 K-means example (k=2) For a more complex simulation, see
Steps shown in the figure: Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
Based on

37 A map of documents, grouped by their topics

38 DocumentAtlas: A two-step procedure
1. Latent semantic indexing: project documents into a semantic space (dimensionality reduction and identification of commonalities even if the vocabulary is different)
2. Multidimensional scaling: project that space into 2D, preserving the distances as well as possible
- Input: a set of documents
- Output: a „document map"
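A generic numpy sketch of the same two-step pipeline (this is not DocumentAtlas's actual code; the tiny term-document matrix is invented for illustration):

```python
import numpy as np

# Invented term-document matrix: rows = terms, columns = documents.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Step 1 -- latent semantic indexing: a truncated SVD projects the
# documents into a k-dimensional "semantic" space.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional row per document

# Step 2 -- classical multidimensional scaling: embed the pairwise
# distances of the semantic space into 2D, preserving them as well as possible.
D2 = np.square(np.linalg.norm(docs_k[:, None] - docs_k[None, :], axis=-1))
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n            # double-centering matrix
B = -0.5 * J @ D2 @ J
w, V = np.linalg.eigh(B)
top = np.argsort(w)[::-1][:2]
coords = V[:, top] * np.sqrt(np.maximum(w[top], 0))  # 2D "document map"
```

In this toy matrix, documents 1–2 and 3–4 share vocabulary, so they end up as two separate groups on the 2D map.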

39 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

40 Clustering by information contained in the objects to be clustered (here: documents contain text) –

41 Clustering by information associated with the objects to be clustered (here: photos are associated with tags) –

42 Clustering by information associated with the objects to be clustered (here: queries are associated with document texts) – (1)

43 Clustering by information associated with the objects to be clustered... (2) – Baeza-Yates, Query Mining, ECIR
1. Create instances of past (query – result set) combinations
2. Cluster them by the textual similarity of the (viewed) result documents
3. Use this to recommend a better / an additional query
(Figure: Result set 1 / Query 1, Result set 2 / Query 2; New user → recommend)

44 Ranking by similarity and popularity: Examples

45 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

46 Internet users are worried about their privacy... (results from a meta-study of 30 questionnaire-based studies [TK03])

47... but are they really? An online shop with a difference [Berendt, Günther, & Spiekermann, Communications of the ACM, 2005]

48 Privacy-related behaviour Shopping for jackets Shopping for cameras [Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2002]

49 Finding: People are willing to exchange privacy for personalization benefits
- Users would provide, in return for personalized content, information on their name (88%), education (88%), age (86%), hobbies (83%), salary (59%), or credit card number (13%).
- 27% of Internet users think tracking allows the site to provide information tailored to specific users.
- 73% of online users find it useful if a site remembers basic information such as name and address.
- People are willing to give information to receive a personalized online experience: 51% or 40%, depending on the study.
[TK03]

50 User-centric evaluation: an experimental investigation of the effect of explaining the personalization-privacy tradeoff
[KT05] compared the effects of traditional privacy statements with those of a contextualized explanation on users' willingness to answer questions about themselves and their (product) preferences. In the contextualized-explanation condition, participants
- answered 8.3% more questions (gave at least one answer) (p<0.001),
- gave 19.6% more answers (p<0.001),
- purchased 33% more often (p<0.07),
- stated that their data had helped the Web store to select better books (p<0.035) – even though the recommendations were static and identical for both groups.
(Screenshot from Teltzrow, M. & Kobsa, A. (2004). Communication of Privacy and Personalization in E-Business. In Proceedings of the Workshop "WHOLES: A Multiple View of Individual Privacy in a Networked World", Stockholm, Sweden.)

51 But what is privacy? Is it only about data protection?
Phillips, D.J. "Privacy Policy and PETs: The Influence of Policy Regimes on the Development and Social Implications of Privacy Enhancing Technologies." New Media & Society 6(6).
- freedom from intrusion
- construction of the public/private divide
- separation of identities
- protection from surveillance (the right to choose belonging)

52 Also: whose privacy? Stakeholders and privacy interests: a (partially) fictitious example
Users of the system:
- passengers
- system administrators
Other stakeholders:
- airport administration
- airport security
- airlines
- duty-free shop

53 Different privacy interests of the different stakeholders

54 Agenda
- Sequence mining: tool WUM (case study "school search")
- Classification: method Naïve Bayes (case study "happiness")
- Clustering: tool DocumentAtlas (case study "EU proposals")
- A very short note on other uses of clustering (e.g. in query mining)
- Some observations on privacy...
- Best-practice "design patterns" with open-source tools

55 In the preparation of a log file (recommendations for open-source tools are shown in green)
1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. Transform log file into ARFF (WUMprep4WEKA)
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.

56 In the case study:
1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. Transform log file into ARFF (WUMprep4WEKA)
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.
done

57 The preparation of texts (e.g., for an automatic version of step 2.2)
- Is quite involved when done properly (a good introduction to preprocessing for text mining can be found in Grobelnik, M., & Mladenic, D., Text Mining Tutorial)
- However, as a first step, you can also use the raw text of documents (generated with only a few of the tools in the TextGarden library).

58 Thank you!