Introduction to Data Mining with Case Studies


1 Introduction to Data Mining with Case Studies
Introduction. Introduction to Data Mining with Case Studies. Author: G. K. Gupta. Prentice Hall India, 2006.

2 Objectives
What is data mining? Why data mining? What applications? What techniques? What process? What software? 27 November 2008 ©GKGupta

3 Definition
Data mining may be defined as follows: data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so they may be used in an enterprise’s decision making.

4 What is Data Mining?
Efficient automated discovery of previously unknown patterns in large volumes of data. Patterns must be valid, novel, useful and understandable. Businesses are mostly interested in discovering past patterns to predict future behaviour. A data warehouse, to be discussed later, can be an enterprise’s memory. Data mining can provide intelligence using that memory.

5 Examples
Amazon.com uses associations: recommendations to customers are based on their past purchases and on what other customers are purchasing. “Just for Feet”, a chain in the USA, has about 200 stores, each carrying up to 6000 shoe styles, each style in several sizes. Data mining is used to find the right shoes to stock in the right store. More examples appear in the case studies to be discussed later.

6 Data Mining
We assume we are dealing with large data, perhaps Gigabytes, perhaps Terabytes. Although data mining is possible with smaller amounts of data, the bigger the data, the higher the confidence in any unknown pattern that is discovered. There is considerable hype about data mining at the present time, and Gartner Group has listed data mining as one of the top ten technologies to watch. Question: How many books could one store in one Terabyte of memory?
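The closing question can be approached with a back-of-the-envelope calculation. The sketch below assumes a decimal Terabyte (10^12 bytes) and a hypothetical average of 1 MB of plain text per book. Neither figure comes from the slides, so both should be adjusted to suit.

```python
# Rough estimate of how many books fit in one Terabyte.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, and a book
# is about 1 MB (10**6 bytes) of plain text.
TERABYTE_BYTES = 10**12


def books_per_terabyte(bytes_per_book: int) -> int:
    """Return how many whole books of the given size fit in one Terabyte."""
    return TERABYTE_BYTES // bytes_per_book


if __name__ == "__main__":
    print(books_per_terabyte(10**6))
```

Under these assumptions the answer is about one million books. Books with many images would be far larger, and the count would fall accordingly.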

7 Why Data Mining Now?
Growth in the generation and storage of corporate data: the information explosion. Need for sophisticated decision making: current database systems are Online Transaction Processing (OLTP) systems, and OLTP data is difficult to use for such applications. Why? Evolution of technology: much cheaper storage, easier data collection, better database management, and on to data analysis and understanding.

8 Information explosion
Database systems have been in use since the 1960s in Western countries (perhaps since the 1980s in India). These systems have generated mountains of data. Point of sale terminals and bar codes on many products, railway bookings, educational institutions, huge numbers of mobile phones and electronic commerce all generate data. Government is now collecting a lot of information.

9 Information explosion
Internet banking via networked computers and ATMs. Credit and debit cards. Medical data, doctors, hospitals. Transportation: Indian railways, automatic toll collection on toll roads, growing air travel. Passports, NRI visas, other visas, NRI money transfers. Question: Can you think of other examples of data collection?

10 Information explosion
Many adults in India generate: Mobile phone transactions. More than 300 million phones in India, reportedly growing at the rate of 10,000 new ones every hour! Mobile companies must save information about calls. Growing middle class with a growing number of credit and debit card transactions. About 25m credit cards and 70m debit cards in Annual growth rate about 30% and 40% respectively. Could be 55m credit cards and 200m debit cards in 2010, resulting in perhaps 500m transactions annually.

11 Information explosion
India has some huge enterprises, for example Indian Railways, perhaps the busiest network in the world, with 2.5m employees, 10,000 locomotives, 10,000 passenger trains daily, 10,000 freight trains daily and 20m passengers daily. Growing airline traffic with more than ten airlines, perhaps 30m passengers annually. Growing number of motor vehicles: registration, insurance, driver licences. Internet surfing records.

12 OLTP
As noted earlier, most enterprise database systems were designed in the 1970s or 1980s, mainly to automate office procedures such as order entry, student enrolment, patient registration and airline reservations. These are well-structured, repetitive operations that are easily automated.

13 Decision Making
Need for business memory and intelligence. Need to serve customers better by learning from past interactions. OLTP data is not a good basis for maintaining an enterprise memory. The intelligence hidden in data could be the secret weapon in a competitive business world, but given the information explosion not even a small fraction of it could be looked at by the human eye. Question: Why is OLTP not good for maintaining an enterprise memory?

14 OLTP vs Decision Making
The clerical view of data focuses on the details required for the day-to-day running of an enterprise. The management view of data focuses on summary data to identify trends, challenges and opportunities. The detailed view is the operational view, while the management view is the decision-support view. A comparison of the two views follows.

15 Operational vs Management View
Operational | Decision-Support
Users: admin staff | Users: management
Day-to-day work | Decision support
Application oriented | Subject oriented
Current data | Historical data
Detailed | Overall view (summaries)
Simple queries | Complex queries
Predetermined queries | Ad hoc queries
Update/Select | Only Select
Real-time | Not real-time
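The contrast between the two views can be illustrated with two queries over the same data. This is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and the sample rows are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical sales records: (order_id, region, year, amount).
ROWS = [
    (1, "North", 2007, 100.0),
    (2, "North", 2007, 50.0),
    (3, "South", 2007, 80.0),
    (4, "North", 2008, 120.0),
]


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sample sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, "
        "region TEXT, year INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)
    return conn


def operational_lookup(conn, order_id):
    """Operational view: a simple, predetermined query on one detailed record."""
    return conn.execute(
        "SELECT amount FROM sales WHERE order_id = ?", (order_id,)
    ).fetchone()


def management_summary(conn):
    """Decision-support view: a select-only summary over historical data."""
    return conn.execute(
        "SELECT region, year, SUM(amount) FROM sales "
        "GROUP BY region, year ORDER BY region, year"
    ).fetchall()
```

The first query touches a single row, as an order-entry clerk would. The second aggregates every row by region and year, which is closer to what a manager looking for trends would ask.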

16 Evolution of Technology
Corporate data growth has been accompanied by a decline in the cost of storage and processing. PC motherboard performance, measured in MHz/$, is currently doubling every 27 ± 2 months. The next slide, using a logarithmic scale, shows that disk storage now costs about one US dollar per 10GB, and the following slide shows that sales of disk storage are growing exponentially. Look at computing trends at Question: How much does a 100GB disk cost? What is the cost of a PC and what is its CPU performance?
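The doubling figure quoted above can be turned into a growth factor over any time span. The sketch below simply assumes steady exponential growth with a 27-month doubling period. The function name and the example spans are illustrative.

```python
DOUBLING_PERIOD_MONTHS = 27  # figure quoted on the slide (27 ± 2 months)


def growth_factor(months: float, doubling_period: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Return how many times performance per dollar multiplies over `months`,
    assuming it doubles once every `doubling_period` months."""
    return 2 ** (months / doubling_period)


if __name__ == "__main__":
    for span in (27, 54, 120):
        print(span, "months ->", round(growth_factor(span), 1), "x")
```

Under this assumption, ten years (120 months) corresponds to a little over a twentyfold improvement in performance per dollar.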

17 Decline in Hard Drive cost
[Figure: decline in hard drive cost, logarithmic scale]

18 Growth in Worldwide Disk Capacity
[Figure: growth in worldwide disk capacity]

19 Evolution of Technology
Question: What do the graphs in the last two slides tell us? What scales are used in them? What was the pink line in the first graph?

20 Evolution of Technology
Database technology has improved over the years. Data collection is often much better and cheaper now. The need for analysing and synthesising information is growing in today’s fiercely competitive business environment.

21 New applications
Sophisticated applications of modern enterprises include: sales forecasting and analysis; marketing and promotion planning; business modelling. OLTP is not designed for such applications. Also, large enterprises operate a number of database systems, so information must be integrated for decision-making applications. Question: Why can OLTP not be used for sales forecasting and analysis?

22 Why Data Mining Now?
As noted earlier, the reasons may be summarized as: accumulation of large amounts of data; increased affordable computing power enabling data mining processing; statistical and learning algorithms; availability of software; strong business competition.

23 Large amount of data
As already discussed, many enterprises have large amounts of data accumulated over 30+ years. As noted earlier, some enterprises collect information specifically for analysis. For example, supermarkets in the USA offer loyalty cards in exchange for shopper information. Loyalty cards in Australia also collect information using a reward system.

24 Growth of cards
A recent survey in the USA found that the percentages of US adults using the following types of cards were: credit cards 88%; ATM cards 60%; membership cards 58%; debit cards 35%; prepaid cards 35%; loyalty cards 29%. Question: What kind of data do these cards generate?

25 Affordable computing power
Data mining is usually computationally intensive. The dramatic reduction in the price of computer systems, as noted earlier, is making it possible to carry out data mining without investing huge amounts in hardware and software. Even with affordable computing power, data mining can still be resource intensive.

26 Algorithms
A variety of statistical and learning algorithms from fields like statistics and artificial intelligence have been adapted for data mining. With the new focus on data mining, new algorithms are being developed.

27 Availability of Software
A large variety of DM software is now available. Some of the more widely used software is: IBM - Intelligent Miner and more; SAS - Enterprise Miner; Silicon Graphics - MineSet; Oracle - Thinking Machines - Darwin; Angoss - knowledgeSEEKER.

28 Strong Business Competition
Growth in service economies. Almost every business is a service business. Service economies are information rich and very competitive. Consider the telecommunications environment in Australia. About 20 years ago, Telstra was a monopoly. The field is now very competitive. The mobile phone market in India is also very competitive.

29 Applications
In finance, telecom, insurance and retail: loan/credit card approval; market segmentation; fraud detection; better marketing; trend analysis; market basket analysis; customer churn; Web site design and promotion.

30 Loan/Credit card approvals
In a modern society, a bank does not know its customers personally. The only knowledge a bank has is the information about them stored in its computers. Credit agencies and banks collect a lot of behavioural data about customers from many sources. This information is used to predict the chances of a customer paying back a loan.

31 Market Segmentation
Large amounts of data about customers contain valuable information. The market may be segmented into many subgroups according to variables that are good discriminators. It is not always easy to find variables that will help in market segmentation.

32 Fraud Detection
Very challenging, since it is difficult to define the characteristics of fraud. Detection is often based on finding changes from the norm. In statistics it is common to throw out outliers, but in data mining it may be useful to identify them, since they could be due either to errors or perhaps to fraud.
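One simple way to flag values that deviate from the norm is a z-score test. The sketch below is only an illustration: the slides do not prescribe a method, and the threshold of 2 standard deviations and the sample transaction amounts are assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean (an assumed, illustrative criterion)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates from the norm
    return [v for v in values if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical card transaction amounts; one is far larger than the rest.
    amounts = [20, 22, 19, 21, 23, 20, 500]
    print(find_outliers(amounts))
```

A flagged value is only a candidate for review. As the slide notes, it may turn out to be a data error rather than fraud.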

33 Better Marketing
When customers buy new products, other products may be suggested to them when they are ready. As noted earlier, in mail order marketing, for example, one wants to know: will the customer respond? Will the customer buy, and how much? Will the customer return the purchase? Will the customer pay for the purchase?

34 Better Marketing
It has been reported that more than 1000 variable values on each customer are held by some mail order marketing companies. The aim is to “lift” the response rate.
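One common way to quantify this is to compare the response rate of a targeted group with the response rate of the whole mailing list. The sketch below uses that ratio as the definition of lift. The slides do not give a formula, and the campaign numbers are hypothetical.

```python
def response_rate(responders: int, contacted: int) -> float:
    """Fraction of contacted customers who responded."""
    return responders / contacted


def lift(target_responders, target_contacted, all_responders, all_contacted):
    """Ratio of the targeted group's response rate to the overall rate.
    A value above 1 means targeting did better than mailing everyone."""
    baseline = response_rate(all_responders, all_contacted)
    targeted = response_rate(target_responders, target_contacted)
    return targeted / baseline


if __name__ == "__main__":
    # Hypothetical campaign: 200 of 10,000 respond overall (2%),
    # while 30 of 500 model-selected customers respond (6%).
    print(round(lift(30, 500, 200, 10000), 2))
```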

35 Trend analysis
In a large company, not all trends are always visible to the management. It is then useful to use data mining software that will identify trends. Trends may be long-term, cyclic or seasonal.

36 Market Basket Analysis
Aims to find what customers buy and what they buy together. This may be useful in designing store layouts or in deciding which items to put on sale. Basket analysis can also be used for applications other than analysing what items customers buy together.

37 Customer Churn
In businesses like telecommunications, companies try very hard to keep their good customers and perhaps to persuade good customers of their competitors to switch to them. In such an environment, businesses want to find out which customers are good, why customers switch and what makes customers loyal. It is cheaper to develop a retention plan and retain an old customer than to bring in a new one.

38 Customer Churn
The aim is to get to know the customers better so that you can keep them longer. Given the competitive nature of business, customers will move if not looked after. Also, some businesses may wish to get rid of customers that cost more than they are worth, e.g. credit card holders who don’t use the card, or bank customers with very small amounts of money in their accounts.

39 Web site design
A Web site is effective only if visitors easily find what they are looking for. Data mining can help discover the affinity of visitors to pages, and the site layout may be modified based on this information.

40 Data Mining Process
Successful data mining involves carefully determining the aims and selecting appropriate data. The following steps should normally be followed: requirements analysis; data selection and collection; cleaning and preparing data; data mining exploration and validation; implementing, evaluating and monitoring; results visualisation.

41 Requirements Analysis
The enterprise decision makers need to formulate goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for. If objectives have been clearly defined, it is easier to evaluate the results of the project.

42 Data Selection and Collection
Find the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. Otherwise the source OLTP systems need to be identified, and the required information extracted and stored in some temporary system. In some cases, only a sample of the available data may be required.

43 Cleaning and Preparing Data
This may not be an onerous task if a data warehouse containing the required data exists, since most of the work must already have been done when the data was loaded into the warehouse. Otherwise this task can be very resource intensive; perhaps more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.

44 Exploration and Validation
Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise’s needs. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is likely to be an iterative process which should lead to the selection of one or more techniques that are suitable for further exploration, testing and validation.

45 Implementing, Evaluating and Monitoring
Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports or for results visualisation and explanation for managers. If more than one technique is available for the given data mining task, it is necessary to evaluate the results and choose the best. This may involve checking the accuracy and effectiveness of each technique.
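Checking accuracy can be as simple as comparing each technique's predictions with known outcomes on held-back data. The sketch below is a minimal illustration. The technique names and the labels are hypothetical, and accuracy is only one of several possible measures of effectiveness.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def best_technique(predictions_by_technique, actual):
    """Return the name of the technique with the highest accuracy."""
    return max(
        predictions_by_technique,
        key=lambda name: accuracy(predictions_by_technique[name], actual),
    )


if __name__ == "__main__":
    actual = ["yes", "no", "no", "yes"]
    candidates = {
        "technique_a": ["yes", "no", "yes", "yes"],  # 3 of 4 correct
        "technique_b": ["yes", "no", "no", "yes"],   # 4 of 4 correct
    }
    print(best_technique(candidates, actual))
```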

46 Implementing, Evaluating and Monitoring
Regular monitoring of the performance of the techniques that have been implemented is required. Every enterprise evolves with time, and so must the data mining system. Monitoring may from time to time lead to the refinement of the tools and techniques that have been implemented.

47 Results Visualisation
Explaining the results of data mining to the decision makers is an important step. Most DM software includes data visualisation modules, which should be used in communicating data mining results to the managers. Clever data visualisation tools are being developed to display results that deal with more than two dimensions. The visualisation tools available should be tried, and used if found effective for the given problem.

48 Data Mining Process – Another Approach
The last few slides presented one approach. Another approach, which also has six steps, is CRISP–DM (Cross–Industry Standard Process for Data Mining), developed by an industry consortium. The six steps are:

49 CRISP–DM Steps
The six CRISP–DM steps are: business understanding; data understanding; data preparation; modelling; evaluation; deployment.

50 CRISP–DM Steps
The six steps proposed in CRISP–DM are similar to the six steps proposed earlier. The CRISP–DM steps are shown in the following figure. Question: Compare the two sets of steps, the one given in the previous few slides and the CRISP–DM approach. Which approach is better?

51 CRISP Data Mining Model
[Figure: the CRISP–DM data mining model]

52 Data Mining Techniques
Although data mining is a new field, it uses many techniques developed years ago in other fields such as machine learning, statistics and artificial intelligence. These techniques are in some cases modified to deal with large amounts of data.

53 Data Mining Techniques
Data mining includes a large number of techniques, including concept/class description, association analysis, classification and prediction, cluster analysis, outlier analysis, etc. Expression and visualization of data mining results is a challenging task. Privacy issues also need to be considered.

54 Data Mining Tasks
Association analysis; classification and prediction; cluster analysis; Web data mining; search engines; data warehouse and OLAP. Others, for example sequential patterns and time-series analysis, are not covered in this book.

55 Association Analysis
Association analysis involves the discovery of relationships or correlations among a set of items. An example is discovering that personal loans are repaid with 80% confidence when the person owns their home. The classical example is the one where a store discovered that people buying nappies also tend to buy beer.

56 Associations
Association rules are often written as X → Y, meaning that whenever X appears Y also tends to appear. X and Y may be collections of attributes. A supermarket like Woolworths may have several thousand items and many millions of transactions a week (i.e. Gigabytes of data each week). Note that the quantities of items bought are ignored.
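The strength of a rule X → Y is usually summarised by its confidence: the fraction of transactions containing X that also contain Y. The sketch below also computes support, a standard companion measure that the slides do not name. The baskets are hypothetical, and only the presence of an item is recorded, never its quantity.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return support_count(transactions, itemset) / len(transactions)


def confidence(transactions, x, y):
    """Confidence of the rule X -> Y: of the transactions containing X,
    the fraction that also contain Y."""
    x_count = support_count(transactions, x)
    if x_count == 0:
        raise ValueError("X does not appear in any transaction")
    return support_count(transactions, set(x) | set(y)) / x_count


# Hypothetical baskets, each a set of item names.
BASKETS = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"milk", "bread"},
    {"beer"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"nappies", "beer"}))
    print(confidence(BASKETS, {"nappies"}, {"beer"}))
```

Here nappies and beer appear together in two of five baskets, and two of the three baskets containing nappies also contain beer.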

57 Classification and Prediction
A set of training objects, each with a number of attribute values, is given to the classifier. The classifier formulates rules for each class in the training set so that the rules may be used to classify new objects. Some techniques do not require training data. Classification may be used for predicting the class label of data objects. A number of techniques are available, including decision trees and neural networks.
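The idea of learning rules from training objects and applying them to new objects can be shown with a deliberately tiny classifier. The sketch below learns one rule per value of a single attribute (predict the majority class seen for that value). It is a stand-in for illustration, not the decision tree or neural network techniques named above, and the loan data is hypothetical.

```python
from collections import Counter, defaultdict


def learn_rules(training, attribute):
    """Learn one rule per value of `attribute`: value -> majority class."""
    counts = defaultdict(Counter)
    for attributes, label in training:
        counts[attributes[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def classify(rules, obj, attribute, default=None):
    """Apply the learned rules to a new, unlabelled object."""
    return rules.get(obj[attribute], default)


# Hypothetical training set: (attribute values, class label).
TRAINING = [
    ({"owns_home": "yes"}, "repaid"),
    ({"owns_home": "yes"}, "repaid"),
    ({"owns_home": "yes"}, "defaulted"),
    ({"owns_home": "no"}, "defaulted"),
    ({"owns_home": "no"}, "defaulted"),
    ({"owns_home": "no"}, "repaid"),
]

if __name__ == "__main__":
    rules = learn_rules(TRAINING, "owns_home")
    print(rules)
    print(classify(rules, {"owns_home": "yes"}, "owns_home"))
```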

58 Cluster Analysis
Similar to classification in that objects are placed into groups, but here the aim is to build clusters such that each cluster is similar within itself and dissimilar to the others. Clustering does not rely on class-labeled data objects. It is based on the principle of maximizing intracluster similarity and minimizing intercluster similarity.
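The principle can be illustrated with k-means, a widely used clustering method that the slides do not name. The sketch below works on one-dimensional points and takes the starting centroids as an argument so the result is deterministic. The data and the starting values are assumptions.

```python
def kmeans_1d(points, centroids, max_iterations=100):
    """Cluster 1-D points around the given starting centroids.
    Returns (final_centroids, clusters)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(kmeans_1d(data, centroids=[1.0, 12.0]))
```

No class labels are supplied. The two groups emerge only from the distances between points.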

59 Web data mining
The Web revolution has had a profound impact on the way we search and find information at home and at work. From its beginning in the early 1990s, the Web has grown to more than ten billion pages in 2008 (estimates vary), perhaps even more by the time you are looking at this slide. Web usage, Web content and Web structure are discussed in Chapter 5.

60 Search engines
Normally the search engine databases of Web pages are built and updated automatically by Web crawlers. When one searches the Web using one of the search engines, one is not searching the entire Web. Instead one is only searching the database that has been compiled by the search engine. There are a number of challenging problems related to search engines that are discussed in Chapter 6, including how to assign a ranking to each Web page that is retrieved in response to a user query.

61 Data Warehousing and OLAP
Data warehousing is a process by which an enterprise collects data from the whole enterprise to build a single version of the truth. This information is useful for decision makers and may also be used for data mining. A data warehouse can be of real help in data mining, since data cleaning and other problems of collecting data would have already been overcome. OLAP (Online Analytical Processing) tools are decision support tools that are often built on top of a data warehouse or another database. OLAP goes further than traditional query and report tools in that a decision maker already has a hypothesis which he/she is trying to test.

62 Data Warehousing and OLAP
Data mining is somewhat different from OLAP, since in data mining a hypothesis is not being tested. Instead, data mining is used to uncover novel patterns in the data.

63 Before Data Mining
To define a data mining task, one needs to answer the following: What data set do I want to mine? What kind of knowledge do I want to mine? What background knowledge could be useful? How do I measure if the results are interesting? How do I display what I have discovered?

64 Task-relevant Data
The whole database may not be required, since we may only want to study something specific, e.g. trends in postgraduate students: the countries they come from; the degree program they are doing; their age; the time they take to finish the degree; the scholarships they have been awarded. A database subset may need to be built before data mining can be done.

65 Task-relevant Data
Data collection is non-trivial. OLTP data is not useful since it is changing all the time. In some cases, data from more than one database may be needed.

66 Preprocessing
A data mining process would normally involve preprocessing. Often data mining applications use data warehousing. One approach is to pre-mine the data, warehouse it, then carry out data mining. The process is usually iterative and can take years of effort for a large project.

67 Data Preprocessing
Preprocessing is very important, although it is often considered too mundane to be taken seriously. Preprocessing may also be needed after the data warehouse phase. Data reduction may be needed to transform very high-dimensional data into lower-dimensional data.

68 Data Preprocessing
Feature selection; use sampling?; normalization; smoothing; dealing with duplicates and missing data; dealing with time-dependent data.
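Two of these steps are easy to sketch: removing duplicate or incomplete records, and normalizing a numeric attribute. The code below assumes records are tuples with `None` marking a missing value, and uses min-max scaling to the range 0 to 1. Both choices are illustrative, and a real project might instead fill in missing values or use another scaling.

```python
def drop_missing_and_duplicates(records):
    """Keep the first copy of each complete record, preserving order.
    Records are tuples; None marks a missing value (an assumption)."""
    seen = set()
    cleaned = []
    for record in records:
        if None in record or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def min_max_normalize(values):
    """Rescale numeric values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread, so every value maps to 0
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    raw = [(1, "a"), (1, "a"), (2, None), (3, "c")]
    print(drop_missing_and_duplicates(raw))
    print(min_max_normalize([10, 20, 30]))
```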

69 Background knowledge
Background information may be useful in the discovery process. For example, concept hierarchies or relationships between data may be useful in data mining. For postgraduate degrees, we may wish to look at all Masters degrees and all doctorate degrees separately.

70 Measuring interest
The data mining process may generate many patterns. We cannot look at all of them and so need some way to separate the interesting results from the uninteresting ones. This may be based on simplicity of pattern, rule length, or level of confidence.
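A filter along these lines might keep only short rules with high confidence. The sketch below represents each rule as a tuple of (antecedent items, consequent items, confidence). The representation, the thresholds and the sample rules are all hypothetical.

```python
def interesting_rules(rules, min_confidence=0.8, max_length=3):
    """Keep rules that are confident enough and short enough, best first.
    Rule length is the total number of items on both sides of the rule."""
    selected = [
        rule
        for rule in rules
        if rule[2] >= min_confidence and len(rule[0]) + len(rule[1]) <= max_length
    ]
    return sorted(selected, key=lambda rule: rule[2], reverse=True)


if __name__ == "__main__":
    candidate_rules = [
        (("nappies",), ("beer",), 0.85),
        (("bread",), ("milk",), 0.40),                          # low confidence
        (("bread", "butter", "jam"), ("milk", "tea"), 0.95),    # too long
        (("owns_home",), ("repaid",), 0.90),
    ]
    for rule in interesting_rules(candidate_rules):
        print(rule)
```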

71 Visualization
We must be able to display results so that they are easy to understand. The display may be a graph, pie chart, table, etc. Some displays are better than others for a given kind of knowledge.

72 Guidelines for Successful Data Mining
• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of ordinary query or OLAP tools
• The results must be actionable

73 Guidelines for Successful Data Mining
Use a small team with strong internal integration and a loose management style. Carry out a small pilot project before a major data mining project. Identify a clear problem owner responsible for the project. This could be someone in sales or marketing. This will benefit external integration. Question: Why is each of the above guidelines important for success?

74 Guidelines for Successful Data Mining
Try to realise a positive return on investment within 6 to 12 months. The whole data mining project should have the support of the top management of the company. Question: Why is each of the above guidelines important for success?

75 Data Mining Software
As noted earlier, a large variety of DM software is now available. Some of the more widely used software is: IBM - Intelligent Miner and more; SAS - Enterprise Miner; Silicon Graphics - MineSet; Oracle - Thinking Machines - Darwin; Angoss - knowledgeSEEKER.

76 Choosing Data Mining Software
Many factors need to be considered when purchasing significant software: product and vendor information; total cost of ownership; performance; functionality and modularity; training and support; reporting facilities and visualization; usability. Question: Which one of the above is the most important? Why?


