Data Mining
Lecture 1: Introduction to Data Mining
Manuel Penaloza, PhD
2 Introduction to Data Mining
Society produces huge amounts of data daily:
- Retail stores: POS data on customer purchases
- Banks: Collections of customer service calls
- Telecommunications: Phone call records (mobile and landline calls)
- Medicine: Genomic data collected on the structure of genes
- Government: Law enforcement data, income tax data
- Others: (Transactional) data from sports, schools, research, search engines, etc.
3 What is Data Mining (DM)?
It is the process of discovering hidden relationships and patterns in large data sets
- It can also predict the outcome of a future observation
Data mining is an interdisciplinary field
- It is an extension of statistical analysis
- It uses techniques from:
  - Statistics
  - Machine learning
  - Pattern recognition
  - Database technology
  - Visualization
  - High-performance computing
4 Questions answered by DM
DM extracts useful information from a dataset to answer questions such as:
- Which credit card customers are most profitable?
- Which loan applicants are high-risk?
- Which customers will respond to a planned promotion?
- How do we detect phone card fraud?
- How do customer profiles change over time?
- Which customers prefer product A over product B?
- What is the revenue prediction for next year?
- Which students are more likely to transfer than others?
- Which taxpayers may be cheating the system?
- Who is most likely to violate a probation sentence?
- What is the predicted outcome of a given treatment?
5 Data sources
Relational databases
- Transactional data with many tables
Data warehouses
- Historical data, aggregated and updated periodically
Files
- In a special text format (e.g., CSV) or a proprietary binary format
Internet or electronic mail
- HTML, XML, web search results, e-mails
Scientific, research
- Seismology, remote sensing, etc.
6 Example: Health System
Characteristics of the Health System:
- Personal medical records (GP, specialists, etc.)
- Billing records
- Hospital data (surgery, admission, etc.)
Questions:
- Are MDs following the procedures?
- Which patients may have adverse drug reactions?
- Are people committing fraud?
- Which patients are most likely to get cancer?
7 Case study: E-commerce
A person buys a book from Amazon.com
Objective: Recommend other books this person is likely to buy
Amazon may do clustering or sequential pattern analysis based on books bought by other people
Data analyzed:
- Customers who bought "Data Mining: Practical Machine Learning Tools and Techniques" also bought "Introduction to Data Mining"
Recommendations have been successful for Amazon
- Increasing buyers' satisfaction and purchases
8 What motivated data mining?
- Growth in data collection
- Presence of data warehouses with reliable data
- Competitive pressure to increase sales
- The development of commercial off-the-shelf (COTS) data mining software
  - Examples: XLMiner, Insightful Miner, SAS, SPSS
- Growth of computing power and storage capacity
- High dimensionality of the data
- Heterogeneous and complex data
- Limitations of human analysts
9 Insightful Miner™ 7: GUI
[Figure: Insightful Miner 7 graphical user interface]
* Figures taken from the Insightful Miner 7 Guide
10 Creating Models
Create a network of pipelined components
- By dragging and dropping components
11 Choosing a data mining system
Data mining systems differ in functionality and methodology
Selection is determined by:
- Type of operating system used in your organization
- The data sources handled by the tool:
  - ASCII text files, relational databases, XML data
- The data mining functions and methods offered
- Scalability of the system
  - Row and column scalability
- Visualization tools available
- Graphical user interface that guides the execution of the methods
- Integration with other information systems
- Cost and performance
12 Data Mining in Databases
Many current applications include data mining modules
Examples:
- Database management systems such as Oracle and MS SQL Server
- CRM (Customer Relationship Management) systems
Advantages for database systems:
- One-stop shopping
- Minimal data movement and conversion
Disadvantages for database systems:
- Limited to the DM methods available in the system
- Data extractions and transformations may not be powerful enough
13 Standard data mining life cycle
CRISP-DM (Cross-Industry Standard Process for Data Mining)
- It is an iterative process with phase dependencies
- It consists of six (6) phases: see for more information
14 CRISP-DM
Cross-industry standard developed in 1996
- Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
- Funding from the European Commission
Important characteristics:
- Non-proprietary
- Application/industry neutral
- Tool neutral
- General problem-solving process
- Process with six phases, but missing:
  - Saving results and updating the model
15 CRISP-DM Phases (1)
Business Understanding
- Understand project objectives and requirements
- Formulate a data mining problem definition
Data Understanding
- Collect the data
- Evaluate the quality of the data
- Perform exploratory data analysis
Data Preparation
- Clean, prepare, integrate, and transform the data
- Select appropriate attributes and variables
16 CRISP-DM Phases (2)
Modeling
- Select and apply appropriate modeling techniques
- Calibrate model parameters to optimize results
- If necessary, return to the data preparation phase to meet the model's data format requirements
Evaluation
- Determine whether the model satisfies the objectives set in phase 1
- Identify business issues that have not been addressed
Deployment
- Organize and present the model to the "user"
- Put the model into practice
- Set up for continuous mining of the data
17 Data mining tasks (1)
Classification
- Predict the categorical value of a target (dependent) variable based on the values of other attributes
- The target variable is partitioned into classes
- It predicts the class membership of a new observation
- Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
Estimation
- Similar to classification, except that the target variable is numeric
  - That is, it predicts a numeric value
- Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
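The estimation task can be illustrated with a minimal sketch: a one-predictor least-squares line that estimates blood pressure from age alone. The function name `fit_line` and the five data points are made up for illustration; a real study would use more attributes (gender, body mass index, etc.) and real measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one predictor: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (assumption): age in years, systolic pressure in mmHg.
ages = [25, 35, 45, 55, 65]
pressures = [115, 120, 125, 130, 135]

intercept, slope = fit_line(ages, pressures)

# Estimate the numeric target for a new observation (a 50-year-old).
estimate = intercept + slope * 50
print(f"estimated pressure at age 50: {estimate:.1f}")
```

The target here is numeric, which is what separates estimation from classification.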
18 Data mining tasks (2)
Prediction
- Similar to estimation, except that the results lie in the future
- Example: Predict the price of a stock 3 months into the future
Clustering
- Group similar records together
- Example: Find patients with similar profiles
Associations
- Uncover rules that indicate an association between two or more attributes
- Example: Find out which items are purchased together
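As a rough sketch of the association task, the two usual rule measures, support and confidence, can be computed directly from a list of transactions. The baskets and the helper names (`count`, `support`, `confidence`) are assumptions made up for this example, not data from the lecture.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain the whole itemset."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent (rule: antecedent -> consequent)."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)


# Hypothetical market baskets (assumption, for illustration only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.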
19 Task: Classification
Build a model that learns to predict the class from pre-labeled instances or observations
- Many approaches: Regression, Decision Trees, Neural Networks
Given a set of points from known classes, what is the class of a new point?
* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
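One way to make the "class of a new point" question concrete is a k-nearest-neighbor vote, a technique chosen here for brevity rather than one of the approaches named on this slide. The sketch below assumes two-dimensional points and made-up labels "A" and "B"; `knn_predict` is an illustrative name.

```python
import math
from collections import Counter


def knn_predict(training, point, k=3):
    """Return the majority label among the k training points nearest to `point`.

    `training` is a list of ((x, y), label) pairs with known labels.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical pre-labeled observations (assumption, for illustration only).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_predict(training, (2.0, 2.0)))
print(knn_predict(training, (6.0, 7.0)))
```

The labels on the training points are what make this a classification task: the model learns from examples whose class is already known.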
20 Task: Clustering
Find groupings of instances given unlabeled data
* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
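A minimal sketch of grouping unlabeled points, using k-means as one possible clustering technique (the slide does not name a specific algorithm). The points are made up, and taking the first k points as initial centroids is a simplifying assumption that keeps the run deterministic; practical implementations usually pick starting centroids at random.

```python
import math


def kmeans(points, k, iterations=10):
    """Basic k-means: alternate assignment and centroid update until stable."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabeled observations (assumption, for illustration only).
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

centroids, clusters = kmeans(points, k=2)
for members in clusters:
    print(members)
```

Unlike classification, no labels are supplied: the groups emerge only from the distances between the points.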
21 DM looks easy
Data -> Data Mining Method (Regression, Decision Tree, Neural Network, …, Association Rules) -> Model
- But it is not easy
- The real world is complicated
22 Methods and Techniques
- Cluster analysis (task: clustering)
- Association rules (task: association)
- Decision trees (tasks: prediction, classification)
- Neural networks (tasks: prediction, classification)
- K-nearest neighbor (tasks: prediction, classification, clustering)
- Regression analysis (tasks: estimation, prediction)
- Confidence interval estimation (task: estimation)
23 Fallacies of Data Mining (1)
Fallacy 1: There are data mining tools that automatically find the answers to our problems
- Reality: There are no automatic tools that will solve your problems "while you wait"
Fallacy 2: The DM process requires little human intervention
- Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
Fallacy 3: Data mining has a quick ROI
- Reality: It depends on the startup costs, personnel costs, data source costs, and so on
24 Fallacies of Data Mining (2)
Fallacy 4: DM tools are easy to use
- Reality: Analysts must be familiar with the model
Fallacy 5: DM will identify the causes of the business problem
- Reality: DM tools only identify patterns in your data; analysts must identify the causes
Fallacy 6: Data mining will clean up a data repository automatically
- Reality: The sequence of transformation tasks must be defined by an analyst during the early DM phases
* Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.
25 In summary
Problems suitable for data mining:
- Require discovering knowledge to make the right decisions
- Current solutions are not adequate
- A high payoff is expected for the right decisions
- Have accessible, sufficient, and relevant data
- Have a changing environment
IMPORTANT:
- ENSURE privacy if personal data is used!
- Not every data mining application is successful!
26 Main References
- Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
- Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
- Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
- Online data mining course offered by KDnuggets™ at www.kdnuggets.com/data_mining_course/index.html
- Engineering Statistics Handbook (available online)
27 Exercise #1
CRISP-DM is not the only DM process. Do a quick search on the Internet for another process, and describe its similarities to and differences from CRISP-DM.
Determine how data mining could help a web search engine company like Google in its operations.
- Identify one or more objectives.
- Which data mining task(s) could help this company?