Data Mining Tools Overview
  Business Intelligence for Managers
Data Mining Definition Revisited
  Analysis of large quantities of data
  Knowledge discovery in databases
  Extracting implicit, previously unknown information from large volumes of raw data
Instances and Features
  Typically, the database will be a collection of instances
  Each instance will have values for a given set of features
  From database theory: instances are rows, features are columns
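As a rough illustration of the rows-and-columns view, the sketch below stores a tiny database as a list of rows. The feature names and values are invented for this example and do not come from any real dataset.

```python
# Hypothetical loan-application features (illustrative names only).
feature_names = ["income", "loan_amount", "years_employed"]

# Each inner list is one instance (a row); each position is one feature (a column).
instances = [
    [52000, 10000, 4],
    [31000, 15000, 1],
    [78000, 20000, 9],
]


def feature_column(rows, names, feature):
    """Return the values of one feature (a column) across all instances (rows)."""
    idx = names.index(feature)
    return [row[idx] for row in rows]


# One instance viewed as a mapping from feature name to value.
first_instance = dict(zip(feature_names, instances[0]))

# One feature viewed across every instance.
incomes = feature_column(instances, feature_names, "income")
```

Reading across a row gives everything known about one instance; reading down a column gives one feature for the whole population.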
Classification
  Supervised learning
  Suppose instances have been categorized into classes and the database includes this categorization
  Goal: using the “knowledge” in the database, classify a given instance
Classifiers
  (Diagram: a classifier takes feature values X1, X2, X3, …, Xn as input and outputs a category Y; it draws on a DB, a collection of instances with known categories)
Classifier intelligence
  A classifier’s intelligence will be based on a dataset consisting of instances with known categories
  Typical goal of a classifier: predict, for a new instance, a category that is rationally consistent with the dataset
BI Examples
  A loans officer in a bank uses a system that automatically approves or disapproves a loan application based on previous loan applications and decisions
  An admissions officer in a university uses a system that automatically makes an admission decision (accept, reject, wait-list), based on previous applicants’ data and the decisions made on them
Data mining method example: k-nearest neighbors
  For a given instance T, get the top k database instances that are “nearest” to T
    Select a reasonable distance measure
  Inspect the categories of these k instances and choose the category C that represents the most of them
  Conclude that T belongs to category C
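The k-nearest neighbors steps can be sketched in a few lines of Python. This is a minimal sketch, not a production classifier: it assumes numeric features, uses Euclidean distance as the "reasonable distance measure", and the function names and sample data are invented for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(database, target, k=3, distance=euclidean):
    """Predict a category for `target`.

    database: list of (feature_values, category) pairs with known categories.
    target:   feature values of the instance T to classify.
    """
    # Step 1: get the k database instances nearest to T.
    nearest = sorted(database, key=lambda inst: distance(inst[0], target))[:k]
    # Step 2: count the categories among those k instances.
    votes = Counter(category for _, category in nearest)
    # Step 3: conclude that T belongs to the most common category.
    return votes.most_common(1)[0][0]


# Hypothetical, made-up instances: two numeric features and a known decision.
database = [
    ([1.0, 1.0], "approve"),
    ([1.2, 0.8], "approve"),
    ([0.9, 1.3], "approve"),
    ([5.0, 5.2], "reject"),
    ([4.8, 5.5], "reject"),
]

near_approved = knn_classify(database, [1.1, 1.0], k=3)
near_rejected = knn_classify(database, [5.0, 5.0], k=3)
```

In this sketch a tied vote goes to whichever tied category appears first among the nearest instances; choosing an odd k for two categories avoids such ties. Features on very different scales would normally be rescaled before measuring distance.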
Clustering (Chapter 5 of text)
  Unsupervised learning
  Classes/categories are not known, but unexpected groupings (clusters) are discovered
  Clustering provides insight into the population segments
Clustering
  (Figure: plot of instances with axes Feature 1 and Feature 2)
Goal of Clustering
  Input: the database of instances, and possibly some predetermined number of clusters
  Output: the same database of instances partitioned into clusters
BI Examples
  After clustering the current university student population, it was discovered that there is a large group of female marketing majors from a particular exclusive school who tend to get high grades
    Business response: focus recruitment on that school; push the university’s marketing program
  Customer segment characteristics and spending patterns can direct business strategies
Data mining method example: k-means
  Guess the number of clusters (k)
  Guess cluster centers from the samples (these will be called centroids)
  Determine cluster membership based on the distance from the centroids
  Repeatedly refine the centroids by taking the average (mean) of the members of each cluster, updating cluster membership after each refinement
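The k-means steps can be sketched as follows. This is a minimal sketch under stated assumptions: instances are tuples of numbers, distance is Euclidean, the initial centroids are drawn at random from the samples, and the loop stops when the centroids no longer change. The function names, the `iterations` and `seed` parameters, and the sample points are all invented for illustration.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise average (mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Partition `points` into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Guess cluster centers from the samples.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Determine cluster membership from the distance to each centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Refine each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centroids stopped moving, so membership is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical, made-up instances with two features each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, k=2)
```

The result depends on the initial guess, so in practice the procedure is often run several times with different starting centroids and the best partition kept. `math.dist` requires Python 3.8 or later.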
Summary
  Two sub-areas of data mining: supervised (classification) and unsupervised (clustering) learning methods
  For both types of methods, intelligent systems can be created to support business decision making