Section 5 Data Mining.

Section Content
  5.1 Introduction
  5.2 Knowledge Discovery
  5.3 Association Rules
  5.4 Sequential Patterns
  5.5 Classification and Regression
  5.6 Other Forms of Data Mining
  5.7 Applications of Data Mining

CA306 Data Mining

5.1 Data Mining Introduction
- Data mining is the discovery of new information, in terms of patterns or rules, from huge amounts of data.
- Mining tools should identify these patterns, rules and trends with minimal user input.
- Data mining is related to
  - statistics: exploratory data analysis
  - artificial intelligence: knowledge discovery and machine learning
- Techniques from machine learning, statistics, neural networks and genetic algorithms are used.
- Because of the vast amount of data, the efficiency and scalability of data mining algorithms is a key issue.

Data Mining and Data Warehousing
- The goal of data warehousing is to support decision making with data.
- In conjunction with a data warehouse, data mining can help with certain types of decisions.
- Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data.
- Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases.
- The potential use of data mining needs to be considered early in the design of a data warehouse.

5.2 Knowledge Discovery
Data mining is part of the knowledge discovery process:
- data selection
- data cleansing
- enrichment
- data transformation / encoding
- data mining
- reporting and display
Example:
- Database: transaction database for a goods retailer
- Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount

Knowledge Discovery - Example
New knowledge can be discovered from the client data.
- data selection:
  - data about specific items or categories of items
  - items from stores in specific regions
- data cleansing:
  - correct incorrect zip codes
  - eliminate records with incorrect phone numbers
- enrichment: add additional information
  - age, income, credit rating of client
- data transformation: reduce the amount of data
  - group items into product categories
  - group zip codes into regions

Data Mining - Knowledge Discovery
Data mining might discover
- co-occurrences: items that are typically bought together
- association rules: when a customer buys video equipment, he/she also buys another electronic gadget
- sequential patterns: when a customer buys a camera, then within 3 months he/she buys photographic supplies
- classification trees: customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes
This information can then be used to, for example,
- optimise store locations
- run promotions
- plan seasonal marketing strategies
Note: talk about real applications such as supermarkets, credit cards, etc.

Goals of Data Mining: Prediction and Identification
- Prediction: show how certain attributes within the data will behave in the future
  - example: predict what customers will buy under certain discounts
  - example: predict sales volume for some period
- Identification: data patterns can be used to identify the existence of an item, an event, or an activity
  - example: detecting intruders by the commands they execute
Typical Q: What are the four goals of data mining?

Goals of Data Mining: Classification and Optimisation
- Classification: partition data such that different classes or categories can be identified
  - example: customers can be categorised into regular and infrequent shoppers, into discount-seeking customers, etc.
  - categorisation (e.g. into food categories) can reduce the complexity of data mining
- Optimisation: optimise the use of limited resources (time, space, money, etc.)
  - example: what are the best products to spend our money on over the next three months?

Types of Knowledge Discovered
- Co-occurrences: collections of items/actions/events that occur together
  - example: items that are bought together by a consumer in a shop
- Association rules: correlation of a set of items with a range of values for another set of variables
  - example: when someone buys bread, he/she is likely to buy cheese
- Classification hierarchies: create a hierarchy of classes from an existing set of events or transactions
  - example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions
Exam Q: What are the types of knowledge discovered? Provide examples.

Types of Knowledge Discovered (continued)
- Sequential patterns: search for a sequence of events or actions
  - example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems
- Patterns within time series: detection of similarities within positions of the time series
  - example: a pattern in a time series of stock market prices may be used to predict employment rates
- Categorisation and segmentation: partition a set of events or items into segments/categories/classes
  - example: treatment data on a disease can be partitioned into groups based on the side effects that are caused

Counting Co-occurrences
- The problem is to count co-occurring itemsets, motivated by market basket analysis.
- A database of consumer transactions forms the basis.
  - transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog
  - a transaction consists of a transaction ID, customer ID, date, item and quantity
- The goal is to identify items that are typically purchased together.
- This can be used to improve the layout of shops or catalogs.

Frequent Itemsets (1)
Consider the following transaction table:

  Transaction  Customer  Date   Items bought
  101          12        11/09  milk, bread, juice
  792          13        12/09  milk, juice
  1130         14        14/09  milk, eggs
  1735         13        14/09  bread, coffee, biscuits

- Items bought in one visit are already grouped together into itemsets.
- Support of an itemset: the fraction of transactions that contain all items in the itemset.
- Examples:
  - {milk, juice} has a support of 50%
  - {bread, coffee} has a support of 25%

Frequent Itemsets (2)
- Large itemsets are itemsets that have a certain minimum support, i.e. itemsets that occur frequently.
- Example: for a minimum support of 40%, the large itemsets are {milk, juice}, {milk}, {juice}, {bread}.
- Proposition: every subset of a large itemset is also a large itemset.
- Algorithm: large itemsets can be computed incrementally
  - start with itemsets of cardinality 1 that have the required support
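The support figures on these two slides can be checked with a short Python sketch. It is only an illustration, not part of the original slides: the function names `support` and `large_itemsets_brute_force` are made up here, and the brute-force enumeration is practical only for a handful of items.

```python
from itertools import combinations

# The four transactions from the table in "Frequent Itemsets (1)" (items only).
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def large_itemsets_brute_force(transactions, min_support):
    """Try every non-empty itemset and keep those that reach min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

print(support({"milk", "juice"}, transactions))    # 0.5
print(support({"bread", "coffee"}, transactions))  # 0.25
print(large_itemsets_brute_force(transactions, 0.40))
```

With a minimum support of 40%, the sketch returns the same four large itemsets that the slide lists.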

5.3 Association Rules
- A database can be regarded as a collection of transactions.
- Each transaction involves a set of items.
- Example: the items in a basket that a shopper uses in a supermarket

  Transaction  Time  Items bought
  101          6:35  milk, bread, juice
  792          7:38  milk, juice
  1130         8:05  milk, eggs
  1735         8:40  bread, coffee, biscuits

Association Rules: Support
- An association rule is of the form X => Y, where X and Y are two disjoint sets of items.
- Example: for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y.
  - if the customer buys milk, he/she is also likely to buy juice
- The support for a rule X => Y is the percentage of transactions that hold all of the items in the union X ∪ Y.
- Examples:
  - Milk => Juice has 50% support
  - Bread => Juice has 25% support

Association Rules: Confidence
- The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y.
- Example: the rule Milk => Juice has confidence 66.7%
  - this means that 2/3 of all transactions with milk also include juice
- Note that support and confidence might be different.
- The goal is to discover rules with a certain minimum support and confidence.
- These rules can be used for prediction: for a rule Pen => Ink, offer discounts on pens and you might increase ink sales.

Association Rules: Computing the Rules
How to compute these rules?
1. Generate large itemsets (itemsets with a certain minimum support).
2. For each large itemset X, generate all rules with a certain minimum confidence (mconf):
   - for X and Y ⊂ X, let Z = X - Y (divide X into Y and Z)
   - if support(X) / support(Y) > mconf, then Y => Z is a valid rule
   - the confidence of rule Y => Z is defined as support(X) / support(Y)
Example:
- for X = {milk, juice} and Y = {milk} ⊂ {milk, juice}, let Z = {juice}
- X, Y, Z have support 50%, 75% and 50%, respectively
- for mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (50/75)
Note that the X, Y, Z calculations are based on the support for itemsets (see slide 5.14, Frequent Itemsets (1)).
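The rule-generation step above can be sketched in Python as follows. This is an illustrative sketch only: `rules_from_itemset` is a hypothetical helper name, the transactions are those of the running example, and the code assumes the itemset X occurs in at least one transaction so that support(Y) is never zero.

```python
from itertools import combinations

# Transactions of the running example (items only).
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def rules_from_itemset(x, mconf):
    """Split large itemset X into Y and Z = X - Y; keep rules Y => Z
    whose confidence support(X) / support(Y) exceeds mconf."""
    x = frozenset(x)
    rules = []
    for k in range(1, len(x)):               # non-empty proper subsets Y of X
        for y in combinations(sorted(x), k):
            y = frozenset(y)
            z = x - y
            confidence = support(x) / support(y)
            if confidence > mconf:
                rules.append((y, z, confidence))
    return rules

for y, z, conf in rules_from_itemset({"milk", "juice"}, mconf=0.40):
    print(set(y), "=>", set(z), f"confidence {conf:.1%}")
```

For X = {milk, juice} the sketch also reports the reverse rule {juice} => {milk}, because every transaction that contains juice also contains milk.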

Generating Association Rules
- In principle, generating rules based on large itemsets and their support is straightforward.
- Computing all large itemsets and their support creates an efficiency problem if the number of items is very high.
- If m is the number of items, then 2^m is the number of different itemsets.
  - Example: a typical supermarket might have several thousand items. Computing the support of all itemsets might take a long time.
  - Example: if m = 3, there are 2^3 = 8 itemsets, one for each binary membership vector from {0,0,0} to {1,1,1}.
- Reducing the combinatorial search space is therefore important. The following properties can be used:
  - subsets of large itemsets are large
  - extensions of small itemsets are small

Association Rules - Algorithms
Outline of an algorithm that finds large itemsets:
- Step 1: test the support for itemsets of length 1, called 1-itemsets, by scanning the database; discard those that do not meet the minimum requirement.
- Step 2: extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all itemsets of length two); test the support and eliminate all 2-itemsets that do not meet the minimum support.
- Step 3: repeat the above steps: extend (k-1)-itemsets into k-itemsets.
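The three steps can be sketched in Python roughly as below. This is a minimal illustration under stated assumptions rather than the algorithm as given in the course: the function name `large_itemsets_levelwise` is made up, support is recomputed by rescanning the transaction list, and candidates are extended only with items that are themselves large (using the property that extensions of small itemsets are small).

```python
def large_itemsets_levelwise(transactions, min_support):
    """Find all large itemsets level by level (1-itemsets, 2-itemsets, ...)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: keep the 1-itemsets that meet the minimum support.
    all_items = set().union(*transactions)
    current = {frozenset([i]) for i in all_items
               if support(frozenset([i])) >= min_support}
    large_items = {i for s in current for i in s}

    result = {}
    while current:
        for s in current:
            result[s] = support(s)
        # Steps 2 and 3: extend each large (k-1)-itemset by one large item,
        # then discard the candidates that miss the minimum support.
        candidates = {s | {i} for s in current for i in large_items if i not in s}
        current = {c for c in candidates if support(c) >= min_support}
    return result

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]
print(large_itemsets_levelwise(transactions, 0.40))
```

On the running example only three 2-itemset candidates are tested, instead of all 2^6 - 1 = 63 non-empty itemsets over the six items.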

Association Rules among Hierarchies
- Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk.
- Associations might occur among the hierarchies of items.
  - Example: healthy frozen yoghurt => bottled water
- Particularly interesting are associations across hierarchies.
  - this kind of information can be used to arrange different kinds of items in a supermarket

Negative Associations
- Negative associations are more difficult to detect than positive associations.
  - Example: 60% of customers who buy crisps do not buy bottled water.
- There are usually more negative associations than positive ones.
- The majority of itemset combinations do not occur in databases.
- Finding interesting negative associations can be difficult.

Association Rules - Additional Considerations
Sampling:
- For very large databases, sampling improves efficiency.
- Truly representative samples can help to find most of the rules.
- The danger is that false positives might be discovered (itemsets that are large in the sample but not truly large), and that truly large itemsets might be missed.
Other problems:
- Cardinality of itemsets and volume of transactions can be very high.
- Variability of transactions (geographical, seasonal) makes sampling difficult.
- Multiple classifications along different dimensions.

5.4 Sequential Patterns
- Sequential patterns are based on sequences of itemsets.
- Assume transactions to be ordered by time.
- Example: the supermarket transactions {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} may be based on three visits of a customer.
- A subsequence of a sequence is obtained by deleting one or more itemsets.
  - let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original sequence
  - {milk, bread, juice} ; {bread, eggs} is a subsequence
  - {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence

Support for Sequences
- A sequence a1 ; ... ; am of itemsets is contained in another sequence S if S has a subsequence b1 ; ... ; bm such that ai ⊆ bi for 1 <= i <= m.
- Example: {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits}.
- The support of a sequence S is the percentage of a given set of sequences in which S is contained.
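The containment test and the support of a sequence can be sketched in Python as below. The names `contains` and `sequence_support` are illustrative. The first customer sequence is the one from the slide; the other two are hypothetical data added only to make the support calculation non-trivial.

```python
def contains(sequence, pattern):
    """True if every itemset of `pattern` is a subset of some itemset of
    `sequence`, with the matching itemsets appearing in the same order."""
    pos = 0
    for itemset in pattern:
        # Skip itemsets of the sequence that do not cover this pattern itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later itemset
    return True

def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain `pattern`."""
    return sum(1 for s in sequences if contains(s, pattern)) / len(sequences)

customer_sequences = [
    # sequence from the slide
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}],
    # hypothetical customers, for illustration only
    [{"milk", "bread"}, {"juice"}, {"coffee", "biscuits", "eggs"}],
    [{"coffee", "biscuits"}, {"milk", "bread"}],
]
pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]

print(contains(customer_sequences[0], pattern))       # True
print(sequence_support(pattern, customer_sequences))
```

The third sequence holds both itemsets but in the wrong order, so it does not contain the pattern.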

Discovery of Patterns in Time Series
- Time series are sequences of events.
- An event might be a fixed type of transaction.
  - Example: closing price of a stock or fund each day.
- Analysis of time series:
  - find a period of time in which the stock did not fluctuate more than 1%
  - find the period (week/month/quarter) with the greatest loss
  - identify stocks with similar behaviour
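As one small illustration of such an analysis, the Python sketch below looks for the window of consecutive trading days with the greatest loss. The function name `greatest_loss_period`, the fixed-length sliding window, and the closing prices are all assumptions made for this example; the slide does not prescribe a method.

```python
def greatest_loss_period(prices, length):
    """Return (start_index, change) for the window of `length` consecutive
    closing prices whose last price is furthest below its first price."""
    best_start, best_change = None, 0.0
    for start in range(len(prices) - length + 1):
        change = prices[start + length - 1] - prices[start]
        if best_start is None or change < best_change:
            best_start, best_change = start, change
    return best_start, best_change

# Hypothetical daily closing prices.
closing = [100.0, 101.0, 99.0, 95.0, 96.0, 97.0, 94.0]
print(greatest_loss_period(closing, 3))  # (1, -6.0)
```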

5.5 Classification and Regression
- Classification Rules
- Regression
- Tree-structured Rules

Discovery of Classification Rules
- Classification means defining/identifying a function that maps an object into one of many possible classes.
- Example: a bank wants to classify loan applicants into "loanworthy" and "not loanworthy"
  - a classification rule could define the classification
    - not loanworthy: current monthly debt obligation exceeds 25% of monthly net income
    - loanworthy: otherwise
  - loanworthiness is a dependent, categorical attribute
- In general there is one rule (set) per class:
  (var1 in range1) and ... and (varn in rangen) => object O in class C1
  - var1, ..., varn are the predictor attributes
- Create a class C; define one rule for that class; determine the extent of that class, i.e. which items belong to it.
- Categorical means we are classifying or categorising.
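The loan example can be written as a small Python function. This is a sketch only: the field names `monthly_debt` and `monthly_net_income` and the function names are assumptions chosen for illustration.

```python
def not_loanworthy(applicant):
    """Rule from the slide: monthly debt obligation exceeds 25% of net income."""
    return applicant["monthly_debt"] > 0.25 * applicant["monthly_net_income"]

def classify(applicant):
    """Map an applicant (the predictor attributes) to one of the two classes."""
    return "not loanworthy" if not_loanworthy(applicant) else "loanworthy"

print(classify({"monthly_debt": 1000, "monthly_net_income": 3000}))  # not loanworthy
print(classify({"monthly_debt": 500, "monthly_net_income": 3000}))   # loanworthy
```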

Support and Confidence
- Again we can define support and confidence for these rules.
- The support for a classification condition C is the percentage of tuples that satisfy C.
- The support for a rule C1 => C2 is the support for the condition C1 ∧ C2, where ∧ means logical AND (C1 ∧ C2 holds for the set of objects in both C1 and C2).
- Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2.
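These definitions can be sketched in Python with conditions written as predicates over tuples. The helper names and the five tuples below are hypothetical and serve only to illustrate the calculation.

```python
def support(condition, tuples):
    """Fraction of tuples that satisfy `condition`."""
    return sum(1 for t in tuples if condition(t)) / len(tuples)

def rule_support(c1, c2, tuples):
    """Support of the rule C1 => C2: support of the condition C1 AND C2."""
    return support(lambda t: c1(t) and c2(t), tuples)

def rule_confidence(c1, c2, tuples):
    """Among the tuples that satisfy C1, the fraction that also satisfy C2."""
    matching = [t for t in tuples if c1(t)]
    return sum(1 for t in matching if c2(t)) / len(matching)

# Hypothetical tuples, for illustration only.
tuples = [
    {"debt_ratio": 0.30, "defaulted": True},
    {"debt_ratio": 0.40, "defaulted": False},
    {"debt_ratio": 0.10, "defaulted": False},
    {"debt_ratio": 0.20, "defaulted": False},
    {"debt_ratio": 0.35, "defaulted": True},
]
high_debt = lambda t: t["debt_ratio"] > 0.25   # condition C1
defaulted = lambda t: t["defaulted"]           # condition C2

print(rule_support(high_debt, defaulted, tuples))     # 0.4
print(rule_confidence(high_debt, defaulted, tuples))
```

Here two of the five tuples satisfy both conditions (support 40%), and two of the three tuples with a high debt ratio defaulted (confidence 2/3).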

Regression
- Regression is similar to classification, except that the dependent variable is numerical (and not categorical).
- Rules (such as classification rules) can be regarded as functions.
- A regression rule is a function that maps variables into a target class variable.
- Example: LabTest(patientID, test1, ..., testn)
  - the values in that relation result from a series of lab tests
  - the target variable P is the probability of survival, a numerical variable
  - the regression rule: (test1 in range1) and ... and (testn in rangen) => P = x
  - the regression function is P = f(test1, ..., testn)
  - test1 to testn must together provide an overall value P

Regression (2)
- If P appears as a function y = f(x1, ..., xn) and f is linear in the domain variables, then the process of deriving f from a given set of tuples <x1, ..., xn, y> is called linear regression.
- Linear regression is a common statistical technique.
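A minimal Python sketch of linear regression for the simplest case, a single domain variable x with y = a + b * x, fitted by ordinary least squares. The function name `fit_line` and the sample tuples are assumptions for illustration; the slide's general case has n domain variables.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x from tuples <x, y>."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical tuples lying exactly on y = 1 + 2x.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```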

Tree-Structured Rules
- Specific classification and regression rules shall now be examined.
- These are rules that can be represented as trees, called classification trees or decision trees.
- These trees are typically the output of the data mining activity.
- Each path from the root to a leaf node represents one classification rule.
- Example: insurance risk determination for motor insurance

              Age
            /     \
        <= 25     > 25
          |         |
       Car Type     NO
        /     \
    sports   family
      |         |
     YES        NO

- If the age is greater than 25, there is no risk. If the age is 25 or less, there is a risk only if the car is a sports car.

Decision Trees
- A decision tree is a graphical representation of a collection of classification rules.
- Each internal node in the tree is labelled with a predictor or splitting attribute.
- Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute.
- Each leaf node is labelled with a value of the dependent attribute.
- A classification rule can be associated with each leaf node, constructed as the conjunction of the predicates on its path:
  Age <= 25 and Car Type = sports for the YES-leaf
- Decision trees are constructed in two phases:
  - growth phase: create the tree based on specialised rules from an input database (relation)
  - pruning phase: reduce the tree size by generalising rules
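The insurance tree can be represented and used in Python as sketched below. The nested-dictionary layout, the attribute names `age` and `car_type`, and the function names `classify` and `rules` are illustrative choices, not something the slides specify. The sketch covers only reading rules off an existing tree, not the growth or pruning phases.

```python
# Internal node: splitting attribute plus (predicate, edge label, child) branches.
# Leaf node: a value of the dependent attribute ("YES" or "NO" risk).
tree = {
    "attribute": "age",
    "branches": [
        (lambda v: v <= 25, "Age <= 25", {
            "attribute": "car_type",
            "branches": [
                (lambda v: v == "sports", "Car Type = sports", "YES"),
                (lambda v: v == "family", "Car Type = family", "NO"),
            ],
        }),
        (lambda v: v > 25, "Age > 25", "NO"),
    ],
}

def classify(node, obj):
    """Follow the edges whose predicates hold until a leaf is reached."""
    while isinstance(node, dict):
        value = obj[node["attribute"]]
        for test, _label, child in node["branches"]:
            if test(value):
                node = child
                break
        else:
            raise ValueError(f"no branch for {node['attribute']}={value!r}")
    return node

def rules(node, path=()):
    """One rule per leaf: the conjunction of edge predicates on its path."""
    if not isinstance(node, dict):
        return [(" and ".join(path), node)]
    found = []
    for _test, label, child in node["branches"]:
        found.extend(rules(child, path + (label,)))
    return found

print(classify(tree, {"age": 22, "car_type": "sports"}))  # YES
for condition, risk in rules(tree):
    print(condition, "=>", risk)
```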

5.6 Other Forms of Data Mining
- Neural Networks
- Genetic Algorithms
- Clustering and Segmentation

Neural Networks
- Techniques from artificial intelligence can be used to generalise regression.
- Neural networks provide an iterative method to carry out this generalised regression.
- Neural networks use a curve-fitting approach to infer a function from a set of samples.
- This process is based on learning: a test sample is the initial input; the system then incrementally infers functions based on more samples.
- Neural networks can be applied to classification problems.
- Modelling time series with neural networks is difficult.

Genetic Algorithms (1)
- Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies.
- Principle: genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A, C, T, G).
- Construction: devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet.
- Usage: study the cutting and combination of strings (compare natural reproduction and evolution). New generations of individuals (solutions) are generated and assessed: survival of the fittest.

Genetic Algorithms (2)
Generation of solutions: comparison with other techniques.
- GA search uses a set of solutions during each generation rather than a single solution.
- The search in the string space represents a much larger parallel search in the space of encoded solutions.
- The memory of the search completed is represented solely by the set of solutions available for generation.
- A GA is a randomised algorithm, since its search mechanisms use probabilistic operators.
- While progressing from one generation to the next, a GA finds a near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions.
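A toy genetic algorithm in Python illustrates the ideas of encoding, cutting and combining strings, and survival of the fittest. Everything specific here is an assumption chosen for the sketch: the two-letter alphabet, the fitness function (the number of 1s in the string), the parameter values, and the rule that the fitter half of each generation survives.

```python
import random

ALPHABET = "01"  # assumed encoding alphabet for this toy problem

def fitness(s):
    """Toy objective: the more 1s in the string, the fitter the solution."""
    return s.count("1")

def crossover(a, b, rng):
    """Cut both parents at the same random point and combine the pieces."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(s, rate, rng):
    """Replace each letter by a random letter with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else c for c in s)

def evolve(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Survival of the fittest: the better half is kept unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng),
                   rate, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.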

Clustering and Segmentation
- Clustering is about identification and classification.
- Clustering tries to identify categories (or clusters) to which a data object can be mapped.
- The categories can be disjoint or might overlap; they might be organised into trees.
- A related problem: multivariate probability density functions.

5.7 Applications of Data Mining
Decision-making contexts:
- marketing: analysis of customer behaviour based on buying patterns; determination of marketing strategies (store locations, advertising campaigns, etc.); segmentation of customers, stores, products.
- finance: analysis of creditworthiness of clients; performance analysis of financial investments; evaluation of financing options; fraud detection.

Applications (continued)
- manufacturing: optimisation of resources (machines, manpower, material); optimal design of manufacturing processes, shop-floor layout, etc.
- health care: analysis of effectiveness of certain treatments; optimisation of processes in a hospital; analysing side effects of drugs; relating patient wellness and doctor qualifications.