Data mining and statistical learning, lecture 1b

Outline
The five pillars of data mining
Supervised and unsupervised learning

The process of Selecting, Exploring, Modifying, Modeling, and Assessing large amounts of data to uncover previously unknown patterns

SEMMA
Sample the data by creating one or more data tables
Explore the data by searching for: (i) anticipated relationships and trends; (ii) unanticipated relationships and trends; (iii) anomalies
Modify the data by transforming variables and combining existing variables into new variables
Model the data by searching for a combination of the data that reliably predicts a desired outcome
Assess the data by evaluating the usefulness and reliability of the findings from the data mining process

Sample the data and create data tables
Cases and variables
Objects and attributes

Examine anticipated relationships: electricity consumption and temperature

Examine the presence of outliers: total nitrogen concentrations in Swedish rivers determined by two different methods
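One way to screen paired measurements like these for outliers is to look at the difference between the two methods for each sample and flag differences that lie far from the typical one. The sketch below is a minimal illustration, not the method used in the lecture: the concentrations are made up, `flag_outliers` is a hypothetical helper, and the modified z-score with a 3.5 cutoff is a common convention assumed here.

```python
from statistics import median

def flag_outliers(method_a, method_b, threshold=3.5):
    """Return indices of pairs whose difference is unusually large.

    Uses the modified z-score based on the median absolute deviation (MAD).
    The 3.5 cutoff is a widely used convention, not taken from the lecture.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    center = median(diffs)
    mad = median(abs(d - center) for d in diffs)
    if mad == 0:
        return []  # no spread in the differences, nothing to flag
    return [i for i, d in enumerate(diffs)
            if 0.6745 * abs(d - center) / mad > threshold]

# Hypothetical total nitrogen concentrations for the same water samples
method_a = [410, 520, 630, 480, 390, 700, 555, 610]
method_b = [400, 515, 640, 470, 395, 1400, 560, 600]
outlier_indices = flag_outliers(method_a, method_b)
print(outlier_indices)
```

A robust center and spread (median and MAD) are used so that the outlier itself does not inflate the yardstick it is measured against.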

Modifying inputs
Transforming inputs or outputs
Combining existing variables into new variables:
  Aggregating inputs
  Reducing the dimension of the inputs
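The modifications listed above can be sketched on a tiny, made-up data table. All field names (`income`, `loans`, `hourly_temp`) and values are assumptions for illustration. The example covers a transformation, a combined variable, and an aggregation; dimension reduction is not shown.

```python
import math
from statistics import mean

# Hypothetical records: income, loans, and hourly temperatures per case
records = [
    {"income": 30000.0, "loans": 6000.0, "hourly_temp": [1.0, 3.0, 5.0, 3.0]},
    {"income": 90000.0, "loans": 45000.0, "hourly_temp": [10.0, 14.0, 16.0, 12.0]},
]

def modify(record):
    """Return a new record with transformed, combined, and aggregated inputs."""
    return {
        # Transforming an input: the log compresses a right-skewed scale
        "log_income": math.log(record["income"]),
        # Combining existing variables into a new variable
        "loan_to_income": record["loans"] / record["income"],
        # Aggregating inputs: several hourly readings become one mean value
        "mean_temp": mean(record["hourly_temp"]),
    }

modified = [modify(r) for r in records]
for row in modified:
    print(row)
```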

Model selection: credit scoring
Candidate predictors: Age, Sex, Income, Marital status, Education, Savings, Loans, Payment records, Houseowner
Subset selection aims to produce a model that is interpretable and may have lower prediction error than the full model
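The slide does not say how the subset is chosen. One simple possibility, assumed here, is an exhaustive search: fit a least-squares model for every subset of predictors and keep the subset with the lowest error on held-out data. The sketch uses synthetic data with four anonymous predictors, of which only the first and third influence the outcome. All function names are hypothetical, and the search is only feasible for a small number of predictors because the number of subsets doubles with each one.

```python
import itertools
import random

def fit_least_squares(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(col + 1, p):
            factor = a[k][col] / a[col][col]
            for j in range(col, p + 1):
                a[k][j] -= factor * a[col][j]
    coef = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, p))
        coef[i] = (a[i][p] - tail) / a[i][i]
    return coef

def design(x, subset):
    """Intercept column plus the selected predictor columns."""
    return [[1.0] + [row[j] for j in subset] for row in x]

def mse(rows, y, coef):
    """Mean squared prediction error of a fitted linear model."""
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, y)) / len(y)

def best_subset(x_train, y_train, x_valid, y_valid):
    """Try every subset of predictors; keep the lowest validation error."""
    p = len(x_train[0])
    best = None
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            coef = fit_least_squares(design(x_train, subset), y_train)
            err = mse(design(x_valid, subset), y_valid, coef)
            if best is None or err < best[0]:
                best = (err, subset)
    return best

# Synthetic data: only predictors 0 and 2 affect the outcome
rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(80)]
y = [2.0 * r[0] - 3.0 * r[2] + rng.gauss(0, 0.1) for r in x]

valid_error, chosen = best_subset(x[:50], y[:50], x[50:], y[50:])
print("chosen predictors:", chosen, "validation MSE:", round(valid_error, 4))
```

Judging subsets on held-out data rather than on the training data matters: training error can only decrease as predictors are added, so it would always favor the full model.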

Bias, Variance and Model Complexity
[Figure: prediction error versus model complexity (low to high), with one curve for the training sample and one for the test sample. Low complexity: high bias, low variance. High complexity: low bias, high variance.]
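The behavior described by this figure can be reproduced with a small simulation. The sketch below swaps in k-nearest-neighbor regression as the model family, which is an assumption made for brevity and not taken from the lecture. A small k gives a complex model (low bias, high variance) and a large k gives a simple one. The data are synthetic, and all names are hypothetical.

```python
import math
import random

def knn_predict(x_train, y_train, x0, k):
    """Average the outcomes of the k training points closest to x0."""
    nearest = sorted(range(len(x_train)),
                     key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def mean_squared_error(x_train, y_train, x_eval, y_eval, k):
    """Prediction error of the k-NN model on an evaluation sample."""
    return sum((y - knn_predict(x_train, y_train, x, k)) ** 2
               for x, y in zip(x_eval, y_eval)) / len(y_eval)

def simulate(n, rng):
    """Synthetic data: y = sin(x) + noise (an assumption for illustration)."""
    x = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    y = [math.sin(v) + rng.gauss(0.0, 0.3) for v in x]
    return x, y

rng = random.Random(1)
x_train, y_train = simulate(40, rng)
x_test, y_test = simulate(200, rng)

# Small k = high complexity; k equal to the sample size = global mean
errors = {}
for k in (1, 5, 40):
    errors[k] = (mean_squared_error(x_train, y_train, x_train, y_train, k),
                 mean_squared_error(x_train, y_train, x_test, y_test, k))
    print(f"k={k:2d}  train MSE={errors[k][0]:.3f}  test MSE={errors[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training error is zero while the test error is not. That gap is the reason the training sample alone cannot be used to choose model complexity.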

Supervised learning (prediction, classification)
We have a training set of data, in which we observe the outcome and feature measurements for a set of objects. Using these data we build a prediction model, or learner, which will enable us to predict the outcome for new, unseen objects.
Unsupervised learning (association analysis, clustering)
We observe only the features and have no measurements of the outcome. Our task is to describe how the data are organized and clustered.
Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning
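The contrast between the two settings can be sketched on one small, made-up set of feature values. In the supervised case the outcome labels are used to train a nearest-centroid classifier. In the unsupervised case the labels are ignored and the features are grouped with a two-cluster k-means procedure. Both techniques, the data, and all names are assumptions chosen for illustration.

```python
from statistics import mean

features = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
# Observed outcome, available only in the supervised setting
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

def train_nearest_centroid(x, y):
    """Supervised: use the observed outcomes to learn one centroid per class."""
    return {c: mean(v for v, t in zip(x, y) if t == c) for c in set(y)}

def predict(centroids, x0):
    """Predict the class whose centroid is closest to the new object."""
    return min(centroids, key=lambda c: abs(centroids[c] - x0))

def two_means(x, iterations=10):
    """Unsupervised: group the features without using any outcome."""
    centers = [min(x), max(x)]  # simple deterministic starting point
    for _ in range(iterations):
        groups = [[], []]
        for v in x:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centroids = train_nearest_centroid(features, labels)
predicted = predict(centroids, 4.0)    # outcome for a new, unseen object
centers, groups = two_means(features)  # structure found without outcomes
print(predicted, centers, groups)
```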

Statistical learning problems – some examples
Supervised learning (prediction, classification)
Predict tomorrow’s electricity consumption from weather forecasts and calendar records (season, weekday, holiday)
Identify the numbers in a handwritten ZIP code from a digitized image
Unsupervised learning (association analysis)
Identify buying patterns that can be used to design sales promotions

Supervised learning: statistical terminology
Prediction of one or more outputs using observations of one or more inputs
Statistical terminology
Inputs = Predictors, Independent variables, Explanatory variables
Outputs = Responses, Dependent variables

Naming convention
Regression: prediction of quantitative outputs using observations of one or more inputs
Classification: prediction of qualitative outputs using observations of one or more inputs

Prediction by learning from data
Assume that we have a data set which shows the outcome (response) y for a set of investigated objects with features x1, …, xp. Prediction by learning from data implies that we derive a function that can be used to foresee the outcome for new objects (with known or observed features).
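As a minimal sketch of this idea, the example below derives a prediction function from a training set with a single feature (p = 1) by ordinary least squares, then applies it to a new object. The choice of a straight-line model, the temperature and consumption values, and the name `learn_linear_predictor` are all assumptions for illustration.

```python
def learn_linear_predictor(x, y):
    """Fit y ≈ a + b*x by least squares and return the fitted function."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))
             / sum((u - x_bar) ** 2 for u in x))
    intercept = y_bar - slope * x_bar
    return lambda x_new: intercept + slope * x_new

# Hypothetical training set: daily mean temperature and electricity consumption
temperature = [-5.0, 0.0, 5.0, 10.0, 15.0]
consumption = [110.0, 100.0, 90.0, 80.0, 70.0]

f = learn_linear_predictor(temperature, consumption)
prediction = f(20.0)  # outcome foreseen for a new object with feature x = 20
print(prediction)
```

The learned object is the function `f` itself: once derived from the training data, it can be applied to any new object whose features are known.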

Some major types of quantitative prediction models
Linear or nonlinear regression models with i.i.d. error terms
Time series regression models with stochastic noise
Transfer function models
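The slide only names these model types. For orientation, typical textbook forms are sketched below. The notation is an assumption and not taken from the lecture: f is a linear or nonlinear regression function with parameters β, η_t is an autocorrelated noise process (an AR(1) process is used as one example), B is the backshift operator, ω(B) and δ(B) are polynomials in B, and b is a delay.

```latex
\begin{align}
  % Regression with i.i.d. error terms
  y_i &= f(x_{i1}, \dots, x_{ip}; \beta) + \varepsilon_i,
      & \varepsilon_i &\ \text{i.i.d.},\ \mathrm{E}[\varepsilon_i] = 0 \\
  % Time series regression with stochastic noise, here AR(1) as an example
  y_t &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{tj} + \eta_t,
      & \eta_t &= \phi\, \eta_{t-1} + \varepsilon_t \\
  % Transfer function model in backshift notation
  y_t &= \frac{\omega(B)}{\delta(B)}\, x_{t-b} + \eta_t
\end{align}
```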
