Chapter 20 Data Analysis and Mining
Decision Support Systems
- Decision support systems: obtain high-level information out of the detailed information stored in (database) transaction-processing systems, for decision making
- Data analysis and OLAP (OnLine Analytical Processing): develop tools and techniques for generating summarized data from very large databases for data analysis
- Data warehousing: integrate data from multiple sources under a unified schema at a single site, where materialized views are created
- Data mining: apply AI and statistical-analysis techniques for knowledge discovery in very large databases
Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to charge?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)
Data Mining
- Data mining (DM) is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction (a DM technique) based on past history:
  - Predict whether a credit-card applicant poses a good credit risk, using attributes such as income, job type, and age, together with past payment history
  - Predict whether a pattern of phone calling-card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification: given a new item whose class is unknown, predict the class to which it belongs
  - Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value
Classification Rules
- Classification rules help assign new objects to classes
  - e.g., given a new automobile insurance applicant, should (s)he be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = "masters" and P.income > 100,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = "bachelors" and (P.income ≥ 75,000 and P.income ≤ 100,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree
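As a rough illustration (not part of the original slides), the two rules above can be written directly as a tiny rule-based classifier; the function name and the fallback class label are invented for this sketch.

```python
def classify_credit(degree: str, income: float) -> str:
    """Toy classifier mirroring the two rules above; the fallback label
    "unknown" is an assumption (the slides give no default class)."""
    if degree == "masters" and income > 100_000:
        return "excellent"
    if degree == "bachelors" and 75_000 <= income <= 100_000:
        return "good"
    return "unknown"

print(classify_credit("masters", 120_000))   # excellent
print(classify_credit("bachelors", 80_000))  # good
```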
Decision Tree
- (Figure: an example decision tree; the image is not reproduced in this text version.)
Construction of Decision Trees
- Training set: a data sample in which the classification is already known
- Greedy, top-down generation of decision trees (sketched in code below):
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition on the samples at the node
  - Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have already been considered and no further partitioning is possible
- Decision trees can also be represented as sets of IF-THEN rules
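A minimal sketch of the greedy, top-down procedure just described, assuming samples are Python dictionaries with a "class" key; the attribute-selection function is passed in by the caller (e.g., the information-gain measure introduced on the following slides). All names here are illustrative.

```python
from collections import Counter

def build_tree(samples, attributes, best_attribute):
    """Greedy, top-down decision-tree construction (sketch).

    samples: list of dicts mapping attribute name -> value, plus a "class" key.
    attributes: attribute names still available for partitioning.
    best_attribute: callable(samples, attributes) -> chosen partitioning attribute.
    """
    classes = Counter(s["class"] for s in samples)
    # Leaf node: all samples in one class, or no attributes left to split on;
    # label it with the majority class.
    if len(classes) == 1 or not attributes:
        return {"leaf": classes.most_common(1)[0][0]}

    attr = best_attribute(samples, attributes)
    node = {"attribute": attr, "branches": {}}
    remaining = [a for a in attributes if a != attr]
    # One branch per distinct value of the partitioning attribute.
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        node["branches"][value] = build_tree(subset, remaining, best_attribute)
    return node
```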
Construction of Decision Trees
- Example: a decision tree for the concept Play_Tennis (figure not reproduced here)
- The instance (Outlook = Sunny, Humidity = High, Temperature = Hot, Wind = Strong) is classified as a negative instance (i.e., predicting PlayTennis = no)
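Since the figure is not reproduced, the sketch below assumes the usual Play_Tennis tree from the literature (Outlook at the root, Humidity tested under Sunny, Wind tested under Rain, Overcast always yes) and walks the quoted instance down to its "No" leaf.

```python
# Assumed structure of the Play_Tennis tree (the slide's figure is not
# reproduced here, so this layout is an assumption based on the standard example).
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

instance = {"Outlook": "Sunny", "Humidity": "High",
            "Temperature": "Hot", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))  # "No" -> a negative instance
```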
Decision-Tree Induction: Attribute Selection
- Information-gain (Gain) measure: used to select the test attribute at each node N of the decision tree T
  - The attribute A with the highest Gain (equivalently, the greatest entropy reduction) is chosen as the test/partitioning attribute of N
  - Such an A reflects the least randomness, or impurity, in the resulting partition
- Expected information needed (I) to classify a given sample of the entire data set S:

      I(s_1, s_2, ..., s_m) = - Σ_{i=1}^{m} p_i log2(p_i)

  where
  - s_i (1 ≤ i ≤ m) is the number of samples of S (which has s samples in total) that belong to class C_i
  - m is the number of distinct classes C_1, C_2, ..., C_m
  - p_i is the probability that an arbitrary sample belongs to C_i, i.e., p_i = s_i / s
  - log2 is used because information is encoded in bits
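The formula transcribes directly into code; a small sketch (the function name is invented), with I(9, 5) anticipating the worked example that follows.

```python
from math import log2

def expected_information(*counts):
    """I(s_1, ..., s_m) = - sum_i p_i * log2(p_i), with p_i = s_i / s."""
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

# 9 "yes" and 5 "no" samples, as in the worked example on the following slides.
print(round(expected_information(9, 5), 3))  # 0.94
```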
Decision-Tree Induction: Entropy
- Use entropy to determine the expected information of an attribute
- Entropy (or expected information) E of an attribute A:
  - Let v be the number of distinct values {a_1, a_2, ..., a_v} of A
  - A partitions S into v subsets {S_1, S_2, ..., S_v}, where S_j (1 ≤ j ≤ v) contains those samples of S that have value a_j of A
  - If A is selected as the node label, i.e., the best attribute for splitting, then {S_1, ..., S_v} correspond to the labeled branches grown from A

      E(A) = Σ_{j=1}^{v} ((s_1j + s_2j + ... + s_mj) / s) · I(s_1j, s_2j, ..., s_mj)

  where
  - (s_1j + s_2j + ... + s_mj) / s is the weight of subset S_j (the samples having value a_j of A) within S
  - s_ij (1 ≤ i ≤ m) is the number of samples of class C_i in S_j
- The smaller the value of E(A), the greater the purity of the subset partition {S_1, S_2, ..., S_v}
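E(A) can be transcribed the same way. This sketch (names invented) takes the per-value class counts (s_1j, ..., s_mj) of an attribute directly; the check uses the Age counts from the worked example that follows.

```python
from math import log2

def expected_information(*counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def entropy(partition):
    """E(A) for a partition {value a_j: (s_1j, ..., s_mj)} of the s samples."""
    s = sum(sum(counts) for counts in partition.values())
    return sum((sum(counts) / s) * expected_information(*counts)
               for counts in partition.values())

# Age partition from the worked example below: (yes, no) counts per value.
print(round(entropy({"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}), 3))  # 0.694
```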
Decision-Tree Induction: Information Gain
- Given the entropy of A:

      E(A) = Σ_{j=1}^{v} ((s_1j + s_2j + ... + s_mj) / s) · I(s_1j, s_2j, ..., s_mj)

  the term I(s_1j, s_2j, ..., s_mj) is defined as

      I(s_1j, s_2j, ..., s_mj) = - Σ_{i=1}^{m} p_ij log2(p_ij)

  where p_ij = s_ij / |S_j| is the probability that a sample in S_j belongs to C_i
- The information gain of attribute A, i.e., the information gained by using A as a node label, is defined as

      Gain(A) = I(s_1, s_2, ..., s_m) - E(A)

  - Gain(A) is the expected reduction in entropy obtained by partitioning on the values of A
  - The attribute with the highest information gain is chosen for the given set of samples S (applied recursively at each node)
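Putting the pieces together gives Gain(A). The sketch below repeats the two small helpers so it runs on its own; the Age partition is the one from the worked example that follows (exact arithmetic gives 0.247, which the slides report as 0.246 after rounding intermediate values).

```python
from math import log2

def expected_information(*counts):
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def entropy(partition):
    s = sum(sum(counts) for counts in partition.values())
    return sum((sum(counts) / s) * expected_information(*counts)
               for counts in partition.values())

def information_gain(class_counts, partition):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return expected_information(*class_counts) - entropy(partition)

# Gain(Age) for the example on the following slides: I(9, 5) - E(Age).
age = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
print(round(information_gain((9, 5), age), 3))  # 0.247 (reported as 0.246 on the slides)
```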
Decision-Tree Induction
- Example: given a training (sample) data set (the table itself is not reproduced here), build a decision tree to predict which customers buy computers
  - There are two classes of samples, {yes, no}: class C_1 = "yes" with s_1 = 9 samples, class C_2 = "no" with s_2 = 5 samples, so m = 2 and s = 14
  - Since I(s_1, s_2, ..., s_m) = - Σ_{i=1}^{m} p_i log2(p_i):

      I(s_1, s_2) = I(9, 5) = - 9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

    which is the expected information needed to classify a sample of the whole set
  - Next, compute the entropy of each attribute (Age, Income, Student, Credit_Rating), starting with Age, which has 3 distinct values: "≤ 30", "31..40", "> 40"
Example: Consider Age
- Since E(A) = Σ_{j=1}^{v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj) and I(s_1j, ..., s_mj) = - Σ_{i=1}^{m} p_ij log2(p_ij):
  - Age = "≤ 30": s_11 = 2, s_21 = 3,
      I(s_11, s_21) = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
  - Age = "31..40": s_12 = 4, s_22 = 0,
      I(s_12, s_22) = - 4/4 log2(4/4) - 0/4 log2(0/4) = 0   (taking 0 · log2(0) = 0)
  - Age = "> 40": s_13 = 3, s_23 = 2,
      I(s_13, s_23) = - 3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
- E(Age) = 5/14 · I(s_11, s_21) + 4/14 · I(s_12, s_22) + 5/14 · I(s_13, s_23) = 0.694
- Gain(Age) = I(s_1, s_2) - E(Age) = 0.94 - 0.694 = 0.246
Example: Consider Credit_Rating
- The two classes of samples are {yes, no}: C_1 = "yes", C_2 = "no", and I(s_1, s_2) = 0.94 as before
- Since E(A) = Σ_{j=1}^{v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj) and I(s_1j, ..., s_mj) = - Σ_{i=1}^{m} p_ij log2(p_ij):
  - Credit_Rating = "Fair": s_11 = 6 and s_21 = 2,
      I(s_11, s_21) = - 6/8 log2(6/8) - 2/8 log2(2/8) = 0.81
  - Credit_Rating = "Excellent": s_12 = 3 and s_22 = 3,
      I(s_12, s_22) = - 3/6 log2(3/6) - 3/6 log2(3/6) = 1
- E(Credit_Rating) = 8/14 · I(s_11, s_21) + 6/14 · I(s_12, s_22) = 0.89
- Gain(Credit_Rating) = I(s_1, s_2) - E(Credit_Rating) = 0.94 - 0.89 = 0.05
Example: Consider Income
- The two classes of samples are {yes, no}: C_1 = "yes", C_2 = "no", and I(s_1, s_2) = 0.94 as before
- Since E(A) = Σ_{j=1}^{v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj), with I(s_1j, ..., s_mj) = - Σ_{i=1}^{m} p_ij log2(p_ij):
  - Income = "High": s_11 = 2 and s_21 = 2,
      I(s_11, s_21) = - 2/4 log2(2/4) - 2/4 log2(2/4) = 1
  - Income = "Low": s_12 = 3 and s_22 = 1,
      I(s_12, s_22) = - 3/4 log2(3/4) - 1/4 log2(1/4) = 0.81
  - Income = "Medium": s_13 = 4 and s_23 = 2,
      I(s_13, s_23) = - 4/6 log2(4/6) - 2/6 log2(2/6) = 0.92
- E(Income) = 4/14 · 1 + 4/14 · 0.81 + 6/14 · 0.92 = 0.91
- Gain(Income) = I(s_1, s_2) - E(Income) = 0.94 - 0.91 = 0.03
Example: Consider Student
- The two classes of samples are {yes, no}: C_1 = "yes", C_2 = "no", and I(s_1, s_2) = 0.94 as before
- Since E(A) = Σ_{j=1}^{v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj), with I(s_1j, ..., s_mj) = - Σ_{i=1}^{m} p_ij log2(p_ij):
  - Student = "Yes": s_11 = 6 and s_21 = 1,
      I(s_11, s_21) = - 6/7 log2(6/7) - 1/7 log2(1/7) = 0.59
  - Student = "No": s_12 = 3 and s_22 = 4,
      I(s_12, s_22) = - 3/7 log2(3/7) - 4/7 log2(4/7) = 0.98
- E(Student) = 7/14 · 0.59 + 7/14 · 0.98 = 0.78
- Gain(Student) = I(s_1, s_2) - E(Student) = 0.94 - 0.78 = 0.15
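Using the information_gain helper from the earlier sketch and only the per-value (yes, no) counts quoted in the preceding worked example, the four gains can be checked in a few lines; exact values differ slightly from the slides, which round intermediate results.

```python
# (yes, no) counts per attribute value, as quoted on the preceding slides;
# information_gain() is the helper defined in the earlier sketch.
partitions = {
    "Age":           {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)},
    "Income":        {"High": (2, 2), "Low": (3, 1), "Medium": (4, 2)},
    "Student":       {"Yes": (6, 1), "No": (3, 4)},
    "Credit_Rating": {"Fair": (6, 2), "Excellent": (3, 3)},
}
for name, part in partitions.items():
    print(name, round(information_gain((9, 5), part), 3))
# Age 0.247, Income 0.029, Student 0.152, Credit_Rating 0.048
# (the slides report 0.246, 0.03, 0.15, and 0.05 after rounding intermediates)
```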
Decision-Tree Induction: Example
- Since Gain(Income) = 0.03, Gain(Student) = 0.15, Gain(Credit_Rating) = 0.05, and Gain(Age) = 0.246, Age is the attribute chosen for the split
- A node is created and labeled with Age, and one branch is grown for each of the attribute's values
- The "31..40" branch is already a leaf node (all of its samples are in the "Yes" class)
Decision-Tree Induction: Example
- Consider the leftmost branch (Age = "≤ 30"), whose five samples are:

  | Income | Student | Rating    | Class |
  |--------|---------|-----------|-------|
  | High   | No      | Fair      | No    |
  | High   | No      | Excellent | No    |
  | Medium | No      | Fair      | No    |
  | Low    | Yes     | Fair      | Yes   |
  | Medium | Yes     | Excellent | Yes   |

- Since Gain(Student) = 0.971, Gain(Income) = 0.571, and Gain(Rating) = 0.02, a node is created and labeled with Student, and one branch is grown for each of the attribute's values (see the sketch below)
- Student = "No" branch:

  | Income | Rating    | Class |
  |--------|-----------|-------|
  | High   | Fair      | No    |
  | High   | Excellent | No    |
  | Medium | Fair      | No    |

  a leaf node (all samples in the "No" class)
- Student = "Yes" branch:

  | Income | Rating    | Class |
  |--------|-----------|-------|
  | Low    | Fair      | Yes   |
  | Medium | Excellent | Yes   |

  a leaf node (all samples in the "Yes" class)
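The same check works for this subtree, again reusing the information_gain helper from the earlier sketch together with the five-row subset shown above.

```python
# The five Age <= 30 samples from the table above, as (Income, Student, Rating, Class);
# information_gain() is the helper defined in the earlier sketch.
rows = [
    ("High",   "No",  "Fair",      "No"),
    ("High",   "No",  "Excellent", "No"),
    ("Medium", "No",  "Fair",      "No"),
    ("Low",    "Yes", "Fair",      "Yes"),
    ("Medium", "Yes", "Excellent", "Yes"),
]
columns = {"Student": 1, "Income": 0, "Rating": 2}
class_counts = (2, 3)  # 2 "yes" and 3 "no" samples in this subset

for name, col in columns.items():
    partition = {}
    for value in {r[col] for r in rows}:
        yes = sum(1 for r in rows if r[col] == value and r[3] == "Yes")
        no = sum(1 for r in rows if r[col] == value and r[3] == "No")
        partition[value] = (yes, no)
    print(name, round(information_gain(class_counts, partition), 3))
# Student 0.971, Income 0.571, Rating 0.02
```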
Decision-Tree Induction: Example
- Example: given the training (sample) data, the final decision tree constructed to predict which customers buy computers is shown in the figure (not reproduced in this text version); its root is the Age attribute chosen above
Association Rules
- Retail shops are often interested in associations between the different items that people buy
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts
- Association information can be used in several ways
  - e.g., when a customer buys a particular book, an online shop may suggest associated books
- Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks
  - Left-hand side: antecedent; right-hand side: consequent
  - An association rule must have an associated population; the population consists of a set of instances
  - e.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule
  - e.g., suppose only 0.001% of all purchases include both milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is then low
- Confidence is a measure of how often the consequent is true when the antecedent is true
  - e.g., the rule bread ⇒ milk has a confidence of 80% if 80% of the purchases that include bread also include milk
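A minimal sketch of support and confidence over a toy population of transactions; the transactions themselves are invented for illustration, as are the function names.

```python
# Toy population: each transaction is the set of items in one sale (invented data).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "screwdriver"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of all transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent holds when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.6
print(confidence({"bread"}, {"milk"}))   # 0.75
```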
Finding Association Rules
- We are generally interested only in association rules with reasonably high support (e.g., support ≥ 2%)
- Naïve algorithm (see the sketch below):
  1. Consider all possible sets of relevant items
  2. For each set, find its support (i.e., count how many transactions purchase all items in the set)
     - Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules
     - From itemset S, generate the rule (S - s) ⇒ s for each s ⊂ S
     - Support of the rule = support(S)
     - Confidence of the rule = support(S) / support(S - s)
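A sketch of this naïve algorithm, under the simplifying assumption that all transactions fit in memory as Python sets; the function and parameter names are illustrative.

```python
from itertools import combinations

def naive_association_rules(transactions, items, min_support, min_confidence):
    """Naïve mining sketch: enumerate every itemset, keep the large ones,
    then emit (S - s) => s rules that meet the confidence threshold."""
    n = len(transactions)
    support = {}
    # Steps 1-2: consider all non-empty itemsets and record sufficient supports.
    for k in range(1, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            count = sum(set(itemset) <= t for t in transactions)
            if count / n >= min_support:
                support[frozenset(itemset)] = count / n
    # Step 3: generate rules from each large itemset S.
    rules = []
    for S, s_supp in support.items():
        for k in range(1, len(S)):
            for s in map(frozenset, combinations(S, k)):
                conf = s_supp / support[S - s]   # support(S) / support(S - s)
                if conf >= min_confidence:
                    rules.append((set(S - s), set(s), s_supp, conf))
    return rules
```

On the toy transactions from the previous sketch, for instance, naive_association_rules(transactions, {x for t in transactions for x in t}, 0.4, 0.7) would include bread ⇒ milk among the returned rules.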
Finding Support
- Determine the support of itemsets via a single pass over the set of transactions
  - Large itemsets: sets with a sufficiently high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered
- The a priori technique to find large itemsets (sketched below):
  - Pass 1: count the support of all sets with just 1 item; eliminate those items with low support
  - Pass i: candidates include every set of i items such that all of its (i - 1)-item subsets are large
  - Count the support of all candidates; stop if there are no candidates
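A sketch of the a priori passes described above, again assuming in-memory transactions represented as sets; names are illustrative, and candidate generation uses a straightforward "join then prune" form.

```python
from itertools import combinations

def apriori_large_itemsets(transactions, min_support):
    """A priori sketch: pass i builds candidate i-itemsets whose (i-1)-subsets
    are all large, then keeps those meeting the support threshold."""
    n = len(transactions)

    def is_large(itemset):
        return sum(itemset <= t for t in transactions) / n >= min_support

    # Pass 1: single items with sufficient support.
    items = {x for t in transactions for x in t}
    large = {frozenset([x]) for x in items if is_large(frozenset([x]))}
    all_large, i = set(large), 2
    while large:
        # Join: unions of large (i-1)-itemsets that have exactly i items.
        candidates = {a | b for a in large for b in large if len(a | b) == i}
        # Prune: keep a candidate only if every (i-1)-subset is large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, i - 1))}
        large = {c for c in candidates if is_large(c)}
        all_large |= large
        i += 1
    return all_large
```

On the same toy transactions with min_support = 0.4, apriori_large_itemsets would return {bread}, {milk}, {butter}, {bread, milk}, and {bread, butter}.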