School of Computer Science & Engineering


1 School of Computer Science & Engineering
Artificial Intelligence: Missing Value Handling
Dae-Won Kim
School of Computer Science & Engineering
Chung-Ang University

2 What is the missing data problem?

3 Feature values are unobserved for some of the cases.

4 The problem of missing values has various causes:

5 e.g., application form data where the responses to some questions depend on the answers to others.

6 e.g., dust, corruption, or scratches introduced by the observation device or the image acquisition process.

7 From now on, we will learn simple methods to handle the missing value problem.

8 Quiz: In the preprocessing.

9 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, ??kg, Sea bass
How do we handle the missing values (??, NaN, blank)?

10 Method 1. The simplest method is to remove the patterns with missing values.

11 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, ??kg, Sea bass
We remove the patterns with missing values (here, Fish2 and Fish6).

12 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, ??kg, Sea bass
We remove the features with missing values (here, length and weight). This is called the whole data strategy.
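Method 1 can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the use of None for "??", and the helper names drop_patterns and drop_features are assumptions made for the example, not part of the lecture.

```python
# Training data from the slides; None marks a missing value (??).
# Columns: length (cm), width (cm), weight (kg), class label.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def drop_patterns(rows):
    """Keep only the rows (patterns) that have no missing value."""
    return [r for r in rows if None not in r]


def drop_features(rows, n_features=3):
    """Keep only the feature columns with no missing value; the label stays."""
    keep = [j for j in range(n_features)
            if all(r[j] is not None for r in rows)]
    return [tuple(r[j] for j in keep) + (r[-1],) for r in rows]


complete_rows = drop_patterns(data)   # Fish1, Fish3, Fish4, Fish5 remain
complete_cols = drop_features(data)   # only the width column remains
```

Both variants throw information away: removing patterns leaves four of the six fish, and removing features leaves only the width.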

13 Method 2. Missing values are replaced by zeros.

14 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = 0cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, 0kg, Sea bass
This is called the zero-filling method.
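A minimal sketch of zero-filling, assuming the training data is held as tuples with None for "??" (the representation and the name zero_fill are illustrative choices):

```python
# Columns: length (cm), width (cm), weight (kg), class label; None = missing.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def zero_fill(rows):
    """Replace every missing value with 0."""
    return [tuple(0 if v is None else v for v in r) for r in rows]


filled = zero_fill(data)
```

Zero-filling is trivial to apply, but a 0kg fish is far from any observed weight, so the filled value can distort later distance calculations.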

15 Method 3. Missing values are estimated by the average.

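16 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = 15cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, 18kg, Sea bass
The average is taken over the observed values of the same class: (20 + 16) / 2 = 18kg for Fish6 and (10 + 20) / 2 = 15cm for Fish2. This is called the average-filling method.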
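The 18kg filled in for Fish6 equals the mean of the other Sea bass weights, (20 + 16) / 2, so the sketch below assumes the average is taken within the class. That reading, the None encoding, and the name class_mean_fill are assumptions of this example.

```python
from statistics import mean

# Columns: length (cm), width (cm), weight (kg), class label; None = missing.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def class_mean_fill(rows, n_features=3):
    """Replace each missing value with the mean of that feature,
    computed over the observed values of the same class."""
    filled = []
    for r in rows:
        new = list(r)
        for j in range(n_features):
            if r[j] is None:
                same_class = [o[j] for o in rows
                              if o[-1] == r[-1] and o[j] is not None]
                # Raises StatisticsError if the class has no observed value.
                new[j] = mean(same_class)
        filled.append(tuple(new))
    return filled


filled = class_mean_fill(data)
```

With this data the weight of Fish6 becomes 18 and the length of Fish2 becomes 15. Averaging over all fish instead of one class would be an equally simple variant.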

17 Method 4. Missing values are estimated by the nearest-neighbor (NN) approach.

18 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, ??kg, Sea bass
We find the nearest neighbors of Fish2 and Fish6.

19 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Who is the NN of Fish2?

20 Using the absolute distance measure,
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
D(Fish2, Fish1) = |5-5| + |12-10| = 2
D(Fish2, Fish3) = |5-7| + |12-10| = 4

21 Using the absolute distance measure,
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = 10cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish1 is the NN of Fish2, so the missing length of Fish2 is copied from Fish1 (10cm). This is called NN-Imputation.

22 Using the absolute distance measure,
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, 16kg, Sea bass
Fish5 is the NN of Fish6, so the missing weight of Fish6 is copied from Fish5 (16kg).
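The NN-Imputation steps on the last few slides can be sketched as follows. The sketch assumes, as in the slide examples, that neighbors are searched only among fish of the same class and that the distance uses only the features observed in both fish. The names observed_distance and nn_impute are hypothetical.

```python
# Columns: length (cm), width (cm), weight (kg), class label; None = missing.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def observed_distance(a, b, n_features=3):
    """Absolute (L1) distance over the features observed in both patterns."""
    return sum(abs(a[j] - b[j]) for j in range(n_features)
               if a[j] is not None and b[j] is not None)


def nn_impute(rows, n_features=3):
    """Copy each missing value from the nearest same-class neighbor."""
    filled = []
    for i, r in enumerate(rows):
        new = list(r)
        for j in range(n_features):
            if r[j] is None:
                # Donors: another row of the same class with feature j observed.
                donors = [o for k, o in enumerate(rows)
                          if k != i and o[-1] == r[-1] and o[j] is not None]
                nearest = min(donors, key=lambda o: observed_distance(r, o))
                new[j] = nearest[j]
        filled.append(tuple(new))
    return filled


filled = nn_impute(data)
```

For Fish2 the distances to Fish1 and Fish3 are 2 and 4, so the length 10cm is copied from Fish1. For Fish6 the distances to Fish4 and Fish5 are 14 and 1, so the weight 16kg is copied from Fish5.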

23 Issues with NN-Imputation.
The choice of distance measure; using K neighbors instead of one (KNN-Impute).

24 Quiz: On-the-fly.

25 Training data
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
Fish3 = 20cm, 7cm, 10kg, Salmon
Fish4 = 50cm, 5cm, 20kg, Sea bass
Fish5 = 60cm, 8cm, 16kg, Sea bass
Fish6 = 60cm, 9cm, ??kg, Sea bass
Without any preprocessing, predict the class of a new fish x = [30cm, 7cm, 15kg].

26 The first idea is to use the Bayesian classification algorithm.

27 Technically, Bayesian classifiers are free from the missing value problem.

28 Bayesian classifiers can also be used with preprocessed input data.
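One way to see why a Bayesian classifier can work without preprocessing is a naive Bayes sketch in which each per-class feature distribution is estimated from the observed values only. The Gaussian likelihood, the sample variance, and the names fit, log_gaussian, and predict are assumptions of this example; the slides do not specify the model.

```python
import math
from statistics import mean, variance

# Columns: length (cm), width (cm), weight (kg), class label; None = missing.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def fit(rows, n_features=3):
    """Estimate a prior and per-feature (mean, variance) for each class,
    skipping missing values instead of filling them in."""
    model = {}
    for label in sorted({r[-1] for r in rows}):
        members = [r for r in rows if r[-1] == label]
        stats = []
        for j in range(n_features):
            observed = [r[j] for r in members if r[j] is not None]
            # Needs at least two distinct observed values per feature.
            stats.append((mean(observed), variance(observed)))
        model[label] = (len(members) / len(rows), stats)
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, x):
    """Return the class with the highest log posterior; features that are
    missing in x (None) are simply left out of the product."""
    def score(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, mu, var)
            for v, (mu, var) in zip(x, stats) if v is not None)
    return max(model, key=score)


model = fit(data)
label = predict(model, (30, 7, 15))
```

Because every feature contributes its own factor, a missing value only removes one term from the estimate or from the score, and no pattern or feature has to be discarded.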

29 The second idea is to use K-NN without preprocessing.

30 We modify the distance measure so that the missing features do not participate in the calculation.

31 This is called the partial distance strategy.

32 Using the absolute distance measure,
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
x = 30cm, 7cm, 15kg
Dp(x, Fish1) = |30-10| + |7-5| + |15-10| = 27

33 Using the absolute distance measure,
Fish1 = 10cm, 5cm, 10kg, Salmon
Fish2 = ??cm, 5cm, 12kg, Salmon
x = 30cm, 7cm, 15kg
Dp(x, Fish2) = 3/(3-1) × (|7-5| + |15-12|) = 7.5

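34 Let s be the number of features, and let x_j and y_j be the j-th features of two patterns x and y. Consistent with the absolute-distance examples on the previous two slides, with O the set of features observed in both patterns and m the number of features missing in either one, the partial distance between x and y is defined as:
\[ D_p(x, y) = \frac{s}{s - m} \sum_{j \in O} |x_j - y_j| \]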
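The partial distance strategy can be sketched as a 1-NN classifier that works directly on the incomplete training data. The scaling factor s / (s - m) follows the Dp(x, Fish2) example; the tuple layout and the names partial_distance and nn_classify are illustrative assumptions.

```python
# Columns: length (cm), width (cm), weight (kg), class label; None = missing.
data = [
    (10, 5, 10, "Salmon"),      # Fish1
    (None, 5, 12, "Salmon"),    # Fish2
    (20, 7, 10, "Salmon"),      # Fish3
    (50, 5, 20, "Sea bass"),    # Fish4
    (60, 8, 16, "Sea bass"),    # Fish5
    (60, 9, None, "Sea bass"),  # Fish6
]


def partial_distance(x, y):
    """Absolute distance over the features observed in both patterns,
    rescaled by s / (s - m) to compensate for the m skipped features."""
    s = len(x)
    observed = [(a, b) for a, b in zip(x, y)
                if a is not None and b is not None]
    if not observed:
        raise ValueError("no feature is observed in both patterns")
    return s / len(observed) * sum(abs(a - b) for a, b in observed)


def nn_classify(rows, x):
    """Return the label of the training pattern nearest to x."""
    nearest = min(rows, key=lambda r: partial_distance(x, r[:-1]))
    return nearest[-1]


x = (30, 7, 15)
d_fish1 = partial_distance(x, data[0][:-1])   # 27.0
d_fish2 = partial_distance(x, data[1][:-1])   # 7.5
label = nn_classify(data, x)
```

With this data Fish2 is the nearest pattern to x, so the sketch assigns x to the class Salmon even though Fish2 has no length value.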

35 Quiz: Categorical Features

36 Training data
Fish1 = long, white, low, Salmon
Fish2 = ??, white, high, Salmon
Fish3 = short, white, low, Salmon
Fish4 = long, gray, high, Salmon
Fish5 = short, gray, low, Sea bass
Fish6 = short, white, low, Sea bass
Fish7 = long, white, ??, Sea bass
Predict the class of a new fish x = [long, gray, low].

37 We can use various approaches.

38 When average-filling is used, we must decide how to represent the average of a categorical feature.

39 The common method is to use the mode, i.e., the most frequent value of the feature.

40 Training data
Fish1 = long, white, low, Salmon
Fish2 = long, white, high, Salmon
Fish3 = short, white, low, Salmon
Fish4 = long, gray, high, Salmon
Fish5 = short, gray, low, Sea bass
Fish6 = short, white, low, Sea bass
Fish7 = long, white, low, Sea bass
The missing values are estimated by the mode in the preprocessing step.
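Mode filling for categorical features can be sketched as below. The sketch assumes the mode is taken within the class of the incomplete fish, which reproduces the filled table above. Taking it over all fish would leave a tie between long and short for the first feature. The feature order and the name class_mode_fill are assumptions of this example.

```python
from collections import Counter

# Three categorical features followed by the class label; None = missing.
data = [
    ("long", "white", "low", "Salmon"),     # Fish1
    (None, "white", "high", "Salmon"),      # Fish2
    ("short", "white", "low", "Salmon"),    # Fish3
    ("long", "gray", "high", "Salmon"),     # Fish4
    ("short", "gray", "low", "Sea bass"),   # Fish5
    ("short", "white", "low", "Sea bass"),  # Fish6
    ("long", "white", None, "Sea bass"),    # Fish7
]


def class_mode_fill(rows, n_features=3):
    """Replace each missing value with the most frequent observed value
    of that feature among the patterns of the same class."""
    filled = []
    for r in rows:
        new = list(r)
        for j in range(n_features):
            if r[j] is None:
                counts = Counter(o[j] for o in rows
                                 if o[-1] == r[-1] and o[j] is not None)
                new[j] = counts.most_common(1)[0][0]
        filled.append(tuple(new))
    return filled


filled = class_mode_fill(data)
```

Among the Salmon the first feature is long twice and short once, so Fish2 gets long. Both observed Sea bass have low for the third feature, so Fish7 gets low.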

41 Besides, all of the methods described for numerical features can also be used.

42 Missing value handling is still an open problem.

