Data mining and statistical learning, lecture 1b

Outline
- The five pillars of data mining
- Supervised and unsupervised learning
Data mining: the process of selecting, exploring, modifying, modeling, and assessing large amounts of data to uncover previously unknown patterns
SEMMA
- Sample the data by creating one or more data tables
- Explore the data by searching for: (i) anticipated relationships and trends; (ii) unanticipated relationships and trends; (iii) anomalies
- Modify the data by transforming variables and combining existing variables into new variables
- Model the data by searching for a combination of the data that reliably predicts a desired outcome
- Assess the data by evaluating the usefulness and reliability of the findings from the data mining process
Sample the data and create data tables
- Cases and variables
- Objects and attributes
Examine anticipated relationships: electricity consumption and temperature
Examine the presence of outliers: total nitrogen concentrations in Swedish rivers determined by two different methods
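One way to screen paired measurements for outliers can be sketched as follows. This is a minimal illustration, not the analysis behind the slide: the concentrations are hypothetical (they are not the Swedish river data), and the rule used here, flagging pairs whose difference lies more than 3 scaled median absolute deviations (MAD) from the median difference, is an assumed choice. The function name `flag_outliers` and the threshold are likewise illustrative.

```python
from statistics import median

def flag_outliers(method_a, method_b, threshold=3.0):
    """Return indices of pairs whose difference is far from the median difference."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    center = median(diffs)
    mad = median(abs(d - center) for d in diffs)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data
    scale = 1.4826 * mad
    if scale == 0:
        return [i for i, d in enumerate(diffs) if d != center]
    return [i for i, d in enumerate(diffs) if abs(d - center) > threshold * scale]

# Hypothetical total nitrogen concentrations (mg/l) for the same water samples
method_a = [0.52, 0.61, 0.48, 0.75, 0.66, 0.58, 0.70, 0.55]
method_b = [0.50, 0.63, 0.47, 0.74, 1.40, 0.57, 0.71, 0.53]

outlier_indices = flag_outliers(method_a, method_b)
print(outlier_indices)
```

A robust rule is used because a single gross error would inflate an ordinary standard deviation and could mask itself.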
Modifying inputs
- Transforming inputs or outputs
- Combining existing variables into new variables:
  - Aggregating inputs
  - Reducing the dimension of the inputs
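The modifications listed above can be illustrated with a short sketch. All records, variable names (`log_income`, `debt_ratio`) and the helper `aggregate_means` are hypothetical and chosen only for illustration; dimension reduction is not shown.

```python
import math

# Hypothetical records: income and loans in the same currency unit
records = [
    {"income": 30000.0, "loans": 15000.0},
    {"income": 60000.0, "loans": 15000.0},
    {"income": 120000.0, "loans": 240000.0},
]

for r in records:
    # Transforming an input: the logarithm compresses a right-skewed scale
    r["log_income"] = math.log(r["income"])
    # Combining existing variables into a new variable
    r["debt_ratio"] = r["loans"] / r["income"]

def aggregate_means(values, block_size):
    """Aggregate a sequence into means over consecutive blocks of block_size values."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [sum(block) / len(block) for block in blocks]

# Aggregating inputs: six hypothetical hourly temperatures become two 3-hour means
hourly_temperature = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
block_means = aggregate_means(hourly_temperature, 3)

debt_ratios = [r["debt_ratio"] for r in records]
print(debt_ratios, block_means)
```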
Model selection: credit scoring
- Candidate predictors: age, sex, income, marital status, education, savings, loans, payment records, house owner
- Subset selection aims to produce a model that is interpretable and possibly has a lower prediction error
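A minimal sketch of best subset selection, under several assumptions that the slide does not state: the output is a quantitative score, the model is linear and fitted by least squares, and subsets are compared by their mean squared error on a separate validation sample. The data are simulated, only three of the candidate predictors are used, and all function names are hypothetical.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(rows, y, subset):
    """Least-squares coefficients (intercept first) using only the columns in subset."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    k = len(subset) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def mse(rows, y, subset, coef):
    """Mean squared prediction error of a fitted subset model."""
    preds = [coef[0] + sum(c * row[j] for c, j in zip(coef[1:], subset)) for row in rows]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)

random.seed(0)
names = ["income", "savings", "age"]  # hypothetical inputs, scaled to [0, 1]

def simulate(n):
    """Simulated data: the output depends on income and savings but not on age."""
    rows = [[random.random() for _ in names] for _ in range(n)]
    y = [3.0 * r[0] + 2.0 * r[1] + random.gauss(0.0, 0.1) for r in rows]
    return rows, y

train_rows, train_y = simulate(40)
valid_rows, valid_y = simulate(40)

# Fit every subset on the training sample and score it on the validation sample
errors = {}
for size in range(len(names) + 1):
    for subset in combinations(range(len(names)), size):
        coef = fit(train_rows, train_y, subset)
        errors[subset] = mse(valid_rows, valid_y, subset, coef)

best_subset = min(errors, key=errors.get)
print([names[j] for j in best_subset], errors[best_subset])
```

Exhaustive search is feasible here because three candidates give only eight subsets; with the nine predictors on the slide there would be 512.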
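Bias, variance and model complexity
[Figure: prediction error (vertical axis) plotted against model complexity (horizontal axis, from low to high), with one curve for the training sample and one for the test sample. Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.]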
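The relationship between model complexity and the two error curves can be reproduced with a small simulation. The sketch below assumes a k-nearest-neighbour regression as the learner, where a small k gives a flexible (complex) model and a large k a rigid (simple) one; the sine curve, the noise level and the sample sizes are hypothetical choices.

```python
import math
import random

random.seed(1)

def simulate(n):
    """Hypothetical data: a sine curve observed with Gaussian noise."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN predictions at the points (xs, ys)."""
    total = sum((knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys))
    return total / len(xs)

train_x, train_y = simulate(50)
test_x, test_y = simulate(200)

# Model complexity increases as k decreases
results = {}
for k in (50, 20, 10, 5, 1):
    train_err = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_err = mean_squared_error(train_x, train_y, test_x, test_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:2d}  train error={train_err:.3f}  test error={test_err:.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the training error is zero, whereas the test error stays positive and is typically smallest at an intermediate k.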
Supervised learning (prediction, classification)
- We have a training set of data in which we observe the outcome and feature measurements for a set of objects
- Using these data, we build a prediction model, or learner, which enables us to predict the outcome for new, unseen objects
Unsupervised learning (association analysis, clustering)
- We observe only the features and have no measurements of the outcome
- Our task is to describe how the data are organized and clustered
Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning
Statistical learning problems: some examples
Supervised learning (prediction, classification)
- Predict tomorrow’s electricity consumption from weather forecasts and calendar records (season, weekday, holiday)
- Identify the numbers in a handwritten ZIP code from a digitized image
Unsupervised learning (association analysis)
- Identify buying patterns that can be used to design sales promotions
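For the unsupervised example, buying patterns are often summarized by the support and confidence of item pairs. The sketch below assumes these two standard measures and uses a handful of hypothetical shopping baskets; the function names `support` and `confidence` are illustrative.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sorting makes each pair a canonical (alphabetical) tuple
    pair_counts.update(combinations(sorted(basket), 2))

def support(a, b):
    """Share of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Share of the baskets containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("bread", "butter"), confidence("bread", "butter"))
```

Here bread and butter appear together in 3 of the 5 baskets, and 3 of the 4 baskets with bread also contain butter. No outcome variable is involved, which is what makes the task unsupervised.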
Supervised learning: statistical terminology
- Prediction of one or more outputs using observations of one or more inputs
- Inputs = predictors, independent variables, explanatory variables
- Outputs = responses, dependent variables
Naming convention
- Regression: prediction of quantitative outputs using observations of one or more inputs
- Classification: prediction of qualitative outputs using observations of one or more inputs
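The distinction can be made concrete by applying the same kind of learner to both types of output. The sketch below assumes a k-nearest-neighbour rule with a single input: neighbours' outcomes are averaged for a quantitative output and put to a majority vote for a qualitative one. The temperature and consumption values and the demand classes are hypothetical.

```python
from collections import Counter

def nearest_indices(train_x, x, k):
    """Indices of the k training inputs closest to x."""
    return sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]

def knn_regress(train_x, train_y, x, k=3):
    """Regression: average the neighbours' quantitative outcomes."""
    idx = nearest_indices(train_x, x, k)
    return sum(train_y[i] for i in idx) / k

def knn_classify(train_x, train_labels, x, k=3):
    """Classification: majority vote among the neighbours' qualitative outcomes."""
    idx = nearest_indices(train_x, x, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]

# Hypothetical training data with daily mean temperature (degrees C) as the input
temperature = [-10.0, -5.0, 0.0, 5.0, 10.0, 15.0, 20.0]
consumption = [95.0, 88.0, 80.0, 71.0, 64.0, 58.0, 55.0]  # quantitative output
demand_class = ["high", "high", "high", "medium", "medium", "low", "low"]  # qualitative output

predicted_consumption = knn_regress(temperature, consumption, 12.0)
predicted_class = knn_classify(temperature, demand_class, 12.0)
print(predicted_consumption, predicted_class)
```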
Prediction by learning from data
- Assume that we have a data set that shows the outcome (response) y for a set of investigated objects with features x1, …, xp
- Prediction by learning from data means that we derive a function that can be used to predict the outcome for new objects (with known or observed features)
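As a minimal sketch of deriving such a function, assume a single feature x and a straight-line model fitted by least squares. The training values are hypothetical, and `fit_line` and `predict` are illustrative names rather than part of the lecture.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept and slope for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training set: feature x and observed outcome y
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(train_x, train_y)

def predict(x):
    """The learned function, applied to a new object with known feature x."""
    return intercept + slope * x

prediction = predict(6.0)
print(intercept, slope, prediction)
```

The training set is used only to estimate the intercept and slope; afterwards `predict` needs nothing but the features of the new object.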
Some major types of quantitative prediction models
- Linear or nonlinear regression models with i.i.d. error terms
- Time series regression models with stochastic noise
- Transfer function models
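The three model types can be written in generic textbook form as below. The notation is an assumption, since the slide gives no formulas: f is a regression function with parameters beta, eta_t is a noise process that may be serially correlated (an AR(1) process is shown only as one example), and B is the backshift operator.

```latex
% Regression with i.i.d. errors (f may be linear or nonlinear in \beta)
y_i = f(x_{i1}, \dots, x_{ip}; \beta) + \varepsilon_i,
\qquad \varepsilon_i \ \text{i.i.d.},\quad
\mathrm{E}[\varepsilon_i] = 0,\quad \mathrm{Var}(\varepsilon_i) = \sigma^2

% Time series regression with stochastic noise, here AR(1) as an example
y_t = \beta_0 + \sum_{j=1}^{p} \beta_j x_{tj} + \eta_t,
\qquad \eta_t = \phi\,\eta_{t-1} + \varepsilon_t

% Transfer function model: the output responds to current and lagged inputs
y_t = \sum_{k=0}^{\infty} \nu_k\, x_{t-k} + \eta_t = \nu(B)\, x_t + \eta_t,
\qquad B\,x_t = x_{t-1}
```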