
Special Topic: Missing Values

Missing Values Common in Real Data

• Pneumonia:
  – 6.3% of attribute values are missing
  – one attribute is missing in 61% of cases
• C-Section:
  – only about 0.5% of attribute values are missing
  – but 27.9% of cases have at least one missing value
• UCI machine learning repository:
  – 31 of 68 data sets reported to have missing values

“Missing” Can Mean Many Things

• Randomly missing:
  – usually the best case
  – usually not true in practice
• Non-randomly missing
• Presumed normal, so not measured
• Causally missing:
  – the attribute value is missing because of other attribute values (or because of the outcome value!)

Dealing With Missing Data

• Throw away cases with missing values
  – in some data sets, most cases get thrown away
  – if missingness is not random, throwing away cases can bias the sample towards certain kinds of cases (see the sketch below)
• Treat “missing” as a new attribute value
  – what value should we use to code for missing with continuous or ordinal attributes?
  – what if missingness is causally related to what is being predicted?
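The bias risk in the first option is easy to demonstrate. Below is a minimal sketch (Python with numpy/pandas; the column name and all numbers are invented for illustration) in which age is causally missing, so dropping incomplete cases skews the sample, while the second option keeps every case by coding missingness explicitly.

```python
# Why throwing away cases can bias the sample when values are causally
# missing: here the oldest patients are the ones whose age goes unrecorded.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true_age = rng.integers(20, 80, size=1000).astype(float)
df = pd.DataFrame({"age": true_age})

# Causally missing: 70% of patients over 60 have no recorded age.
df.loc[(df["age"] > 60) & (rng.random(1000) < 0.7), "age"] = np.nan

print(true_age.mean())            # true mean over all patients
print(df.dropna()["age"].mean())  # option 1: biased low -- old patients dropped

# Option 2: keep every case; code "missing" explicitly with an indicator.
df["age_missing"] = df["age"].isna().astype(int)
```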

Dealing With Missing Values

• Marginalize over missing values
  – some learning methods handle missing data directly
  – most don’t (including neural nets)
• Impute (fill in) missing values
  – once filled in, the data set is easy to use
  – if missing values are poorly predicted, imputation may hurt the performance of subsequent uses of the data set
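As one concrete instance of a learner that handles missing data directly (not strict marginalization, but no imputation step required): scikit-learn's HistGradientBoostingClassifier accepts NaN inputs and learns, at each tree split, which branch missing values should follow. A minimal sketch on synthetic data:

```python
# Train directly on data containing NaNs; no imputation needed.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan      # knock out 20% of the values

clf = HistGradientBoostingClassifier().fit(X, y)
print(clf.score(X, y))
```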

Imputing Missing Values

• Fill in with the mean, median, or most common value
• Predict missing values using machine learning
• Expectation Maximization (EM):
  1. Build a model of the data values (ignoring missing values)
  2. Use the model to estimate the missing values
  3. Build a new model of the data values (including the estimated values from the previous step)
  4. Use the new model to re-estimate the missing values
  5. Re-estimate the model
  6. Repeat until convergence
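A sketch of these options using scikit-learn (synthetic data, illustrative only): SimpleImputer covers the mean/median/most-common fill, and IterativeImputer approximates the EM-style loop above by repeatedly modeling each attribute from the others and re-estimating the missing entries until the estimates stabilize.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[:, 2] += X[:, 0]                     # correlated column, so it is predictable
X[rng.random(X.shape) < 0.1] = np.nan

# Simple fill with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# EM-style loop: model each column from the others, re-estimate, repeat.
X_em = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```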

Potential Problems

• Imputed values may be inappropriate:
  – in medical databases, if missing values are not imputed separately for male and female patients, you may end up with male patients with 1.3 prior pregnancies and female patients with low sperm counts
  – many of these situations will not be so obvious
• If some attributes are difficult to predict, the filled-in values may be essentially random (or worse)
• Some of the best-performing machine learning methods are impractical to use for filling in missing values (e.g., neural nets)

Research in Handling Missing Values

• Lazy learning:
  – don’t train a model until you see the test case
  – missing values in the test case may “shadow” missing values in the training set
• Better algorithms:
  – expectation maximization (EM)
  – non-parametric methods (since parametric methods often work poorly when their assumptions are violated)
• Faster algorithms:
  – that scale to very large data sets

Special Topic: Feature Selection

Anti-Motivation

• Most learning methods implicitly do feature selection:
  – decision trees: use information gain or gain ratio to decide which attributes to use as tests; many features never get used
  – neural nets: backprop learns strong connections to some inputs and near-zero connections to others
  – kNN, MBL: the weights in a weighted Euclidean distance determine how important each feature is; weights near zero mean a feature is effectively unused
  – Bayes nets: the statistics in the tables allow some features to have little or no effect on the model
• So why do we need feature selection?

Motivation

Brute-Force Approach

• Try all possible combinations of features
• Given N features, there are 2^N subsets of features
  – usually too many to try
  – danger of overfitting
• Train on the training set, evaluate on a test set (or use cross-validation)
• Use the set of features that performs best on the test set(s)
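A minimal sketch of the brute-force search (workable only for small N; the dataset and learner are arbitrary choices for illustration): score every non-empty subset of features by cross-validation and keep the best.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, n + 1):
    for subset in combinations(range(n), k):     # 2^N - 1 non-empty subsets
        score = cross_val_score(
            DecisionTreeClassifier(random_state=0), X[:, list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```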

Two Basic Approaches

• Wrapper methods:
  – give different sets of features to the learning algorithm and see which works better
  – algorithm dependent
• Proxy methods (relevance determination methods):
  – determine which features are important for the prediction problem without knowing/using the learning algorithm that will be employed
  – algorithm independent

Wrapper Methods

• Wrapper methods find the features that work best with one particular learning algorithm:
  – the best features for kNN and neural nets may not be the best features for decision trees
  – can eliminate features the learning algorithm “has trouble with”
• Forward stepwise selection
• Backwards elimination
• Bi-directional stepwise selection and elimination
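A sketch of forward stepwise selection as a wrapper method, using scikit-learn's SequentialFeatureSelector: it greedily adds whichever feature most improves the wrapped learner's cross-validated score. The learner, dataset, and number of features kept are arbitrary choices here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(),        # wrap any estimator here
    n_features_to_select=5,
    direction="forward",           # "backward" gives backwards elimination
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected features
```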

Relevance Determination Methods

• Rank features by information gain
  – information gain = reduction in entropy due to the attribute
• Try the first 10, 20, 30, …, N features with the learner
• Evaluate on a test set (or use cross-validation)
• May be the only practical method when there are thousands of attributes
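A sketch of this ranking approach, with scikit-learn's mutual_info_classif standing in for information gain (for a discrete class, mutual information estimates the same reduction-in-entropy quantity): rank once, then evaluate growing prefixes of the ranked list.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Rank all features by mutual information with the class, highest first.
ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

for k in (5, 10, 20, X.shape[1]):            # first 5, 10, 20, ..., N features
    cols = ranking[:k]
    score = cross_val_score(
        DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=5
    ).mean()
    print(k, round(score, 3))
```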

Advantages of Feature Selection

• Improved accuracy!
• Less complex models:
  – run faster
  – easier to understand, verify, and explain
• Feature selection points you to the most important features
• No need to collect or process features not used in the models

Limitations of Feature Selection

• Given many features, feature selection can overfit
  – consider 10 relevant features and 10^9 random irrelevant features
• Wrapper methods require running the base learning algorithm many times, which can be expensive!
• Just because feature selection doesn’t select a feature doesn’t mean that feature isn’t a strong predictor
  – redundant features
• May throw away features domain experts want in the model
• Most feature selection methods are greedy and won’t find the optimal feature set

Current Research in Feature Selection

• Speeding up feature selection (thousands of features)
• Preventing overfitting (thousands of features)
• Better proxy methods:
  – it would be nice to know what the good/relevant features are, independent of the learning algorithm
• Irrelevance detection:
  – truly irrelevant attributes can be ignored
  – better algorithms
  – better definition(s)

Bottom Line

• Feature selection almost always improves accuracy on real problems
• Plus:
  – simpler, more intelligible models
  – the features selected can tell you about the problem
  – fewer features to collect when using the model in the future

Feature selection usually is a win.