KDD Overview Xintao Wu.


1 KDD Overview Xintao Wu

2 What is data mining?
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, web, images, etc.
Patterns must be: valid, novel, potentially useful, understandable.

3 Classic data mining tasks
Classification: mining patterns that can classify future (new) data into known classes.
Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
Clustering: identifying a set of similarity groups in the data.

4 Classic data mining tasks (contd)
Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Deviation detection: discovering the most significant changes in data.
Data visualization.
CS583, Bing Liu, UIC

5 Why is data mining important?
Huge amount of data How to make best use of data? Knowledge discovered from data can be used for competitive advantage. Many interesting things that one wants to find cannot be found using database queries, e.g., “find people likely to buy my products”

6

7 Related fields Data mining is a multi-disciplinary field:
Machine learning Statistics Databases Information retrieval Visualization Natural language processing etc.

8 Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit) Find: all rules that correlate the presence of one set of items with that of another set of items E.g., 98% of people who purchase tires and auto accessories also get automotive services done

9 Rule Measures: Support and Confidence
Find all the rules X → Y with minimum confidence and support:
support, s: probability that a transaction contains both X and Y, i.e., {X ∪ Y}
confidence, c: conditional probability that a transaction having X also contains Y
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]
With minimum support 50% and minimum confidence 50%, we have A → C (50%, 66.6%) and C → A (50%, 100%).
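The two measures can be sketched in a few lines of Python. The transaction table behind the A → C figures is not part of this transcript, so the four transactions below are hypothetical, chosen only so that the computed values match the percentages quoted on the slide. The function names are illustrative as well.

```python
# Hypothetical transactions (not from the slides), picked to reproduce
# A -> C (50%, 66.6%) and C -> A (50%, 100%).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666...
print(confidence({"C"}, {"A"}, transactions))  # 1.0
```

Note that both rules share the same support, because support depends only on the combined itemset, while confidence depends on the direction of the rule.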

10 Applications
Market basket analysis: tell me how I can improve my sales by attaching promotions to “best seller” itemsets.
Marketing: “people who bought this book also bought…”
Fraud detection: a claim for immunizations always comes with a claim for a doctor’s visit on the same day.
Shelf planning: given the “best sellers,” how do I organize my shelves?

11 Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

12 The Apriori Algorithm
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
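The pseudo-code can be turned into a short runnable sketch. This is one possible reading of it, not the slides' own implementation: `min_support` is taken as an absolute transaction count, the join simply unions pairs of frequent k-itemsets, and the four-transaction database is hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: unions of two frequent k-itemsets that have size k+1
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Scan the database and count each surviving candidate
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical database of four transactions
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(database, min_support=2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this database the only frequent 3-itemset is {2, 3, 5}. Candidates such as {1, 2, 3} are pruned before counting because {1, 2} is not frequent.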

13 The Apriori Algorithm — Example
[Figure: worked example on database D. Scan D to obtain C1 and L1, scan D again to obtain C2 and L2, and scan D a third time to obtain C3 and L3.]

14 Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
    abcd from abc and abd
    acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
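The join and prune steps of this example can be reproduced with a small sketch. It assumes each itemset is written as a string of single-character items in sorted order, as on the slide. The function name `generate_candidates` is illustrative.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Self-join frequent k-itemsets (sorted strings), then prune."""
    frequent = set(frequent_k)
    k = len(next(iter(frequent)))
    joined = set()
    for a in frequent:
        for b in frequent:
            # Join two itemsets that agree on their first k-1 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                joined.add(a + b[-1])
    # Prune: keep a candidate only if all of its k-subsets are frequent
    return {c for c in joined
            if all("".join(s) in frequent for s in combinations(c, k))}

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = generate_candidates(L3)
print(C4)  # {'abcd'}
```

The join produces abcd and acde. The prune step then drops acde because its subset ade (and also cde) is missing from L3.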

15 Criticism of Support and Confidence
Example 1 (Aggarwal & Yu, PODS98): among 5000 students,
    3000 play basketball
    3750 eat cereal
    2000 both play basketball and eat cereal
play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball → not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

16 Criticism of Support and Confidence (Cont.)
We need a measure of dependent or correlated events: Corr(A,B) = P(AB) / (P(A) P(B))
If Corr < 1, A is negatively correlated with B (A discourages B).
If Corr > 1, A and B are positively correlated.
P(AB) = P(A) P(B) if the itemsets are independent (Corr = 1).
P(B|A)/P(B), which equals Corr(A,B), is also called the lift of rule A => B (we want lift greater than 1).
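As a quick check, the lift of both basketball rules can be computed from the student counts quoted in the basketball and cereal example. The helper below is only a sketch; the function and variable names are illustrative.

```python
# Counts from the basketball / cereal example
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

def lift(n_a, n_b, n_ab, n):
    """P(A and B) / (P(A) * P(B)), estimated from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# play basketball -> eat cereal
lift_cereal = lift(n_basketball, n_cereal, n_both, n_students)
# play basketball -> not eat cereal
lift_no_cereal = lift(n_basketball,
                      n_students - n_cereal,   # students who do not eat cereal
                      n_basketball - n_both,   # players who do not eat cereal
                      n_students)

print(round(lift_cereal, 3))     # 0.889
print(round(lift_no_cereal, 3))  # 1.333
```

The first rule has lift below 1 despite its 66.7% confidence, while the second has lift above 1, which matches the criticism made in the example.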

17 Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction: training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
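The two steps can be illustrated with a deliberately tiny sketch. The samples are hypothetical, and the "model" is just one majority-class rule per attribute value, standing in for whatever classifier is actually used.

```python
from collections import Counter

# Hypothetical labeled samples: (attribute value, class label)
training_set = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
                ("rain", "yes"), ("rain", "yes"), ("overcast", "yes")]
test_set = [("sunny", "no"), ("overcast", "yes"),
            ("rain", "no"), ("rain", "yes")]

def build_model(samples):
    """Step 1, model construction: majority class per attribute value."""
    by_value = {}
    for value, label in samples:
        by_value.setdefault(value, Counter())[label] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}

def accuracy(model, samples):
    """Step 2, model usage: share of held-out samples labeled correctly."""
    correct = sum(model.get(value) == label for value, label in samples)
    return correct / len(samples)

model = build_model(training_set)
print(model)
print(accuracy(model, test_set))  # 0.75
```

The accuracy is measured only on `test_set`, which the model never saw during construction.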

18 Classification by Decision Tree Induction
A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
  Tree construction
    At start, all the training examples are at the root
    Partition examples recursively based on selected attributes
  Tree pruning
    Identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
  Test the attribute values of the sample against the decision tree
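Classifying an unknown sample amounts to walking from the root to a leaf. The sketch below assumes a hypothetical, hand-written tree stored as nested dictionaries; it does not cover tree construction or pruning.

```python
# Hypothetical tree: internal nodes test an attribute, leaves are class labels
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "don't play", False: "play"},
        },
        "overcast": "play",
        "rain": "don't play",
    },
}

def classify(node, sample):
    """Follow the branch for each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

print(classify(tree, {"outlook": "sunny", "windy": False}))    # play
print(classify(tree, {"outlook": "overcast", "windy": True}))  # play
```

A sample whose attribute value has no matching branch raises a `KeyError` here; a fuller implementation would fall back to a default class.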

19 Some probability...
Entropy: info(S) = - Σ_i (freq(Ci,S)/|S|) log2(freq(Ci,S)/|S|)
    S = set of cases
    freq(Ci,S) = number of cases in S that belong to class Ci
    Prob(“this case belongs to Ci”) = freq(Ci,S)/|S|
Gain: assume attribute A divides set T into subsets Ti, i = 1,…,m
    info(T_new) = Σ_i (|Ti|/|T|) info(Ti)
    gain(A) = info(T) - info(T_new)

20 Example
info(T) (9 play, 5 don’t): info(T) = -9/14 log(9/14) - 5/14 log(5/14) = 0.94 (bits)
Test Outlook: info_Outlook = 5/14 (-2/5 log(2/5) - 3/5 log(3/5)) + 4/14 (-4/4 log(4/4)) + 5/14 (-3/5 log(3/5) - 2/5 log(2/5)) ≈ 0.694 (bits)
Test Windy: info_Windy = 7/14 (-4/7 log(4/7) - 3/7 log(3/7)) + 7/14 (-5/7 log(5/7) - 2/7 log(2/7)) ≈ 0.924 (bits)
gain_Outlook = info(T) - info_Outlook ≈ 0.247 (bits)
gain_Windy = info(T) - info_Windy ≈ 0.016 (bits)
Outlook is the better test.
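The entropy and gain formulas are easy to evaluate directly, which is a useful check on hand arithmetic. In the sketch below the per-branch class counts (play, don't play) are read off the fractions in the expressions for the two tests. The function names mirror the slide notation but are otherwise illustrative.

```python
from math import log2

def info(class_counts):
    """Entropy in bits of a set with the given per-class case counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def gain(parent_counts, partitions):
    """info(T) minus the size-weighted info of the subsets T_i."""
    total = sum(parent_counts)
    info_new = sum(sum(p) / total * info(p) for p in partitions)
    return info(parent_counts) - info_new

T = [9, 5]                          # 9 play, 5 don't
outlook = [[2, 3], [4, 0], [3, 2]]  # class counts in each Outlook branch
windy = [[4, 3], [5, 2]]            # class counts in each Windy branch

print(round(info(T), 3))
print(round(gain(T, outlook), 3))
print(round(gain(T, windy), 3))
```

The attribute with the larger gain is the better test at this node.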

21 Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

22 Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h)
Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost

23 Naïve Bayes Classifier (I)
A simplifying assumption: attributes are conditionally independent given the class: P(Cj | v1,…,vn) ∝ P(Cj) · Π_i P(vi | Cj)
This greatly reduces the computation cost: only the class distribution needs to be counted.

24 Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities

25 Example
E = {outlook = sunny, temp = [64,70], humidity = [65,70], windy = y} = {E1, E2, E3, E4}
Pr[“Play”/E] = (Pr[E1/Play] x Pr[E2/Play] x Pr[E3/Play] x Pr[E4/Play] x Pr[Play]) / Pr[E] = (2/9 x 3/9 x 3/9 x 4/9 x 9/14) / Pr[E] = 0.007 / Pr[E]
Pr[“Don’t”/E] = (3/5 x 2/5 x 1/5 x 3/5 x 5/14) / Pr[E] = 0.010 / Pr[E]
With E: Pr[“Play”/E] = 41%, Pr[“Don’t”/E] = 59%
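The final percentages follow from normalizing the two unnormalized scores, since Pr[E] is simply their sum. A minimal sketch using the conditional probabilities quoted in the example:

```python
from math import prod  # Python 3.8+

# Values taken from the example: class priors and Pr[Ei / class]
prior = {"play": 9 / 14, "don't": 5 / 14}
likelihoods = {
    "play": [2 / 9, 3 / 9, 3 / 9, 4 / 9],
    "don't": [3 / 5, 2 / 5, 1 / 5, 3 / 5],
}

# Naive Bayes score: prior times the product of per-attribute likelihoods
scores = {c: prior[c] * prod(likelihoods[c]) for c in prior}
evidence = sum(scores.values())  # Pr[E]
posterior = {c: s / evidence for c, s in scores.items()}

for c in posterior:
    print(c, round(scores[c], 3), f"{posterior[c]:.0%}")
# play 0.007 41%
# don't 0.01 59%
```

Because Pr[E] is the same for both classes, the predicted class can be read off the unnormalized scores alone; normalizing is only needed to report probabilities.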

26 Bayesian Belief Networks (I)
[Figure: Bayesian belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.]
The conditional probability table for the variable LungCancer:
        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

27 What is Cluster Analysis?
Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping a set of data objects into clusters Clustering is unsupervised classification: no predefined classes Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms

28 What Is Good Clustering?
A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity.
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

29 Requirements of Clustering in Data Mining
Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability

30 Major Clustering Approaches
Partitioning algorithms: construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model

31 Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means (MacQueen’67): Each cluster is represented by the center of the cluster k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

32 The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
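The four steps map almost line by line onto code. The sketch below is a plain-Python illustration rather than a tuned implementation. It assumes points are tuples of numbers, uses squared Euclidean distance, and picks a round-robin initial partition purely so the run is deterministic. The sample points are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of numbers); returns (assignments, centroids)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Step 1: arbitrary initial partition into k nonempty subsets
    assign = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each cluster in the partition
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
        # Step 3: assign each object to the nearest centroid
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes, else repeat from Step 2
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids

# Hypothetical 2-D points forming two visible groups
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
assign, centroids = kmeans(points, k=2)
print(assign)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

A different initial partition can lead to a different final clustering, since the procedure only guarantees a local optimum.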

33 The K-Means Clustering Method
Example

34 Comments on the K-Means Method
Strength
  Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Comment
  Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness
  Applicable only when the mean is defined; what about categorical data?
  Need to specify k, the number of clusters, in advance
  Unable to handle noisy data and outliers
  Not suitable for discovering clusters with non-convex shapes

35 Hierarchical Clustering
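Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a, b, c, d, e are merged from Step 0 to Step 4 into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e} by agglomerative clustering (AGNES); divisive clustering (DIANA) runs the same steps in the reverse direction.]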
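A bottom-up (AGNES-style) pass can be sketched as repeated merging of the two closest clusters. The version below is only an illustration under several assumptions: objects are 1-D numbers, cluster distance is single link (closest pair of members), distances are recomputed on the fly instead of being read from a stored distance matrix, and the termination condition is a target number of clusters.

```python
def agglomerative(points, n_clusters):
    """Single-link agglomerative clustering of 1-D points."""
    clusters = [[p] for p in points]   # start: each object is its own cluster
    while len(clusters) > n_clusters:  # termination condition
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members
                d = min(abs(a - b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)       # j > i, so index i stays valid
        clusters[i].extend(merged)     # merge the two closest clusters
    return clusters

# Hypothetical objects
data = [1, 2, 9, 10, 25]
print(agglomerative(data, 3))  # [[1, 2], [9, 10], [25]]
print(agglomerative(data, 2))  # [[1, 2, 9, 10], [25]]
```

Once two clusters are merged they are never split again, so an early poor merge cannot be corrected later.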

36 More on Hierarchical Clustering Methods
Major weaknesses of agglomerative clustering methods
  They do not scale well: time complexity of at least O(n^2), where n is the total number of objects
  They can never undo what was done previously
Integration of hierarchical with distance-based clustering
  BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
  CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  CHAMELEON (1999): hierarchical clustering using dynamic modeling

37 Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points Major features: Discover clusters of arbitrary shape Handle noise One scan Need density parameters as termination condition Several interesting studies: DBSCAN: Ester, et al. (KDD’96) OPTICS: Ankerst, et al (SIGMOD’99). DENCLUE: Hinneburg & D. Keim (KDD’98) CLIQUE: Agrawal, et al. (SIGMOD’98)

38 Grid-Based Clustering Method
Using multi-resolution grid data structure Several methods STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) CLIQUE: Agrawal, et al. (SIGMOD’98) Self-Similar Clustering Barbará & Chen (2000)

39 Model-Based Clustering Methods
Attempt to optimize the fit between the data and some mathematical model
Statistical and AI approach
  Conceptual clustering
    A form of clustering in machine learning
    Produces a classification scheme for a set of unlabeled objects
    Finds characteristic description for each concept (class)
  COBWEB (Fisher’87)
    A popular and simple method of incremental conceptual learning
    Creates a hierarchical clustering in the form of a classification tree
    Each node refers to a concept and contains a probabilistic description of that concept

40 COBWEB Clustering Method
A classification tree

41 Summary
Association rule and frequent set mining
Classification: decision tree, Bayesian network, SVM, etc.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Other data mining tasks

