DATA MINING: Handling Missing Attribute Values and Knowledge Discovery
Shahzeb Kamal (Amsterdam), Olov Junker (Uppsala)
HASCO 2014
DATA MINING
The process of extracting information from a database and transforming it into an understandable structure.
Why? Because data in the real world is ugly:
- Incomplete
- Contains errors
- Inconsistent
Handling Missing Attribute Values
- Techniques
- Consistency
- Various algorithms
KDD (Knowledge Discovery in Databases)
- Exploring patterns in data sets
- The core of KDD is DM
Missing Data
Data is not always available:
- Machine malfunction
- Inconsistent with other recorded data
- Data not entered due to a misunderstanding
- Certain data may not have been considered important at the time of entry
- Data mistakenly changed or erased
How to Handle Missing Data
Goal: rule induction (extracting rules by observing the data).
Sequential methods: preprocess the data, i.e. fill in the missing attribute values before the main process (e.g. rule induction).
Parallel methods: induce rules directly from the original, incomplete data set.
Sequential Methods: Case-Wise Deletion
Delete every case (row) that contains at least one missing value.

Before:
Case  Temp       Headache  Nausea  Flu
1     high       ?         no      yes
2     very high  yes       yes     yes
3     ?          no        no      no
4     high       yes       yes     yes
5     high       ?         yes     no
6     normal     yes       no      no
7     normal     no        yes     no
8     ?          yes       ?       yes

After (only the complete cases remain, renumbered):
Case  Temp       Headache  Nausea  Flu
1     very high  yes       yes     yes
2     high       yes       yes     yes
3     normal     yes       no      no
4     normal     no        yes     no
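Case-wise deletion is simple to state precisely. A minimal Python sketch (not from the slides; the dictionary layout and the None-for-"?" convention are my own):

```python
# Flu data set from the slides; None stands for a missing ("?") value.
# Row layout: (Temp, Headache, Nausea, Flu), keyed by case number.
data = {
    1: ("high", None, "no", "yes"),
    2: ("very high", "yes", "yes", "yes"),
    3: (None, "no", "no", "no"),
    4: ("high", "yes", "yes", "yes"),
    5: ("high", None, "yes", "no"),
    6: ("normal", "yes", "no", "no"),
    7: ("normal", "no", "yes", "no"),
    8: (None, "yes", None, "yes"),
}

# Case-wise deletion: drop every case that has at least one missing value.
complete = {c: row for c, row in data.items() if None not in row}
print(sorted(complete))  # [2, 4, 6, 7] -- the four complete cases
```

Only cases 2, 4, 6 and 7 survive, which is exactly the reduced table on the slide (there renumbered 1–4).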
Most Common Value of an Attribute
Replace each missing value with the most frequent known value of that attribute (using the same input table as on the previous slide).

After:
Case  Temp       Headache  Nausea  Flu
1     high       yes       no      yes
2     very high  yes       yes     yes
3     high       no        no      no
4     high       yes       yes     yes
5     high       yes       yes     no
6     normal     yes       no      no
7     normal     no        yes     no
8     high       yes       yes     yes

Variant: most common value of an attribute restricted to a concept (a concept is the set of all cases with the same decision). E.g. case 1 belongs to the concept {1, 2, 4, 8}, so Headache = yes.
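The global most-common-value fill can be sketched as follows (my own minimal illustration; the concept-restricted variant would simply compute each Counter over the cases with the same decision only):

```python
from collections import Counter

# Symbolic flu table; None marks a missing value.
# Columns 0-2 are Temp, Headache, Nausea; column 3 is the decision (Flu).
data = {
    1: ["high", None, "no", "yes"],
    2: ["very high", "yes", "yes", "yes"],
    3: [None, "no", "no", "no"],
    4: ["high", "yes", "yes", "yes"],
    5: ["high", None, "yes", "no"],
    6: ["normal", "yes", "no", "no"],
    7: ["normal", "no", "yes", "no"],
    8: [None, "yes", None, "yes"],
}

for col in range(3):  # never impute the decision column
    # Most common known value of this attribute.
    counts = Counter(r[col] for r in data.values() if r[col] is not None)
    fill = counts.most_common(1)[0][0]
    for r in data.values():
        if r[col] is None:
            r[col] = fill

print(data[3][0], data[8])  # Temp of case 3 becomes "high", as on the slide
```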
Assigning All Possible Values to a Missing Attribute Value
Each case with a missing value is replaced by one copy per possible value of that attribute; a case with several missing values expands into all combinations (same input table as before).

After:
Case  Temp       Headache  Nausea  Flu
1a    high       yes       no      yes
1b    high       no        no      yes
2     very high  yes       yes     yes
3a    high       no        no      no
3b    very high  no        no      no
3c    normal     no        no      no
4     high       yes       yes     yes
5a    high       yes       yes     no
5b    high       no        yes     no
6     normal     yes       no      no
7     normal     no        yes     no
8a    high       yes       yes     yes
8b    high       yes       no      yes
8c    very high  yes       yes     yes
8d    very high  yes       no      yes
8e    normal     yes       yes     yes
8f    normal     yes       no      yes

Variant: all possible values of an attribute restricted to a concept.
Assigning the Mean Value
Replace a missing numerical value with the mean of the known values of that attribute; for symbolic attributes, fall back to the most common value.

Before:
Case  Temp   Headache  Nausea  Flu
1     100.2  ?         no      yes
2     102.6  yes       yes     yes
3     ?      no        no      no
4     99.6   yes       yes     yes
5     99.8   ?         yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
8     ?      yes       ?       yes

After:
Case  Temp   Headache  Nausea  Flu
1     100.2  yes       no      yes
2     102.6  yes       yes     yes
3     99.2   no        no      no
4     99.6   yes       yes     yes
5     99.8   yes       yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
8     99.2   yes       yes     yes
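The 99.2 on the slide is just the mean of the six known temperatures; a quick check (my own sketch, again using None for missing values):

```python
# Known and missing (None) temperatures from the slide.
temps = {1: 100.2, 2: 102.6, 3: None, 4: 99.6, 5: 99.8, 6: 96.4, 7: 96.6, 8: None}

known = [t for t in temps.values() if t is not None]
mean = round(sum(known) / len(known), 1)
filled = {c: (t if t is not None else mean) for c, t in temps.items()}
print(mean, filled[3], filled[8])  # 99.2 99.2 99.2
```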
Assigning the Mean Value Restricted to a Concept
Same as above, but the mean is computed only over the cases with the same decision; symbolic attributes again fall back to the most common value within the concept.

After:
Case  Temp   Headache  Nausea  Flu
1     100.2  yes       no      yes
2     102.6  yes       yes     yes
3     97.6   no        no      no
4     99.6   yes       yes     yes
5     99.8   no        yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
8     100.8  yes       yes     yes
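Restricting the mean to a concept can be sketched the same way (my own code; None marks a missing value):

```python
# Temperatures and decisions from the slide.
temps = {1: 100.2, 2: 102.6, 3: None, 4: 99.6, 5: 99.8, 6: 96.4, 7: 96.6, 8: None}
flu = {1: "yes", 2: "yes", 3: "no", 4: "yes", 5: "no", 6: "no", 7: "no", 8: "yes"}

def concept_mean(decision):
    # Mean of the known temperatures among cases with this decision.
    vals = [temps[c] for c in temps if flu[c] == decision and temps[c] is not None]
    return round(sum(vals) / len(vals), 1)

filled = {c: (t if t is not None else concept_mean(flu[c]))
          for c, t in temps.items()}
print(filled[3], filled[8])  # 97.6 and 100.8, matching the slide
```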
Global Closest Fit
Replace the missing attribute value with the known value from another case that resembles the case with the missing value as closely as possible.

We compute a distance; the smallest distance gives the closest case:

    distance(x, y) = Σ_i d(x_i, y_i)

summed over all attributes i, where

    d(x_i, y_i) = 0                 if x_i = y_i
                = 1                 if x_i and y_i are symbolic and x_i ≠ y_i, or x_i = ? or y_i = ?
                = |x_i − y_i| / r   if x_i and y_i are numerical and x_i ≠ y_i

and r is the range (max − min) of the known values of the attribute.
For example:

    distance(1, 2) = |100.2 − 102.6| / |102.6 − 96.4| + 1 + 1 = 2.39

After (global closest fit):
Case  Temp   Headache  Nausea  Flu
1     100.2  yes       no      yes
2     102.6  yes       yes     yes
3     100.2  no        no      no
4     99.6   yes       yes     yes
5     99.8   yes       yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
8     102.6  yes       yes     yes
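The distance and the closest-fit lookup can be sketched as follows (my own code, not from the slides; ties are broken by taking the first case with the minimal distance, which reproduces the slide's fills):

```python
# Flu table split by attribute; None marks a missing value.
temps    = {1: 100.2, 2: 102.6, 3: None, 4: 99.6,
            5: 99.8, 6: 96.4, 7: 96.6, 8: None}
headache = {1: None, 2: "yes", 3: "no", 4: "yes",
            5: None, 6: "yes", 7: "no", 8: "yes"}
nausea   = {1: "no", 2: "yes", 3: "no", 4: "yes",
            5: "yes", 6: "no", 7: "yes", 8: None}

known = [t for t in temps.values() if t is not None]
r = max(known) - min(known)  # range of Temp: 102.6 - 96.4 = 6.2

def d(a, b, numeric=False):
    if a is None or b is None:   # any missing value contributes 1
        return 1.0
    if a == b:
        return 0.0
    return abs(a - b) / r if numeric else 1.0

def distance(x, y):
    return (d(temps[x], temps[y], numeric=True)
            + d(headache[x], headache[y]) + d(nausea[x], nausea[y]))

print(round(distance(1, 2), 2))  # 2.39, as computed on the slide
# Closest case to case 8 -> case 2, so its Temp is filled with 102.6:
nearest = min((c for c in temps if c != 8), key=lambda c: distance(8, c))
print(nearest, temps[nearest])
```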
Concept Closest Fit
1. Split the data set into subsets of cases with the same concept (decision).
2. Within each subset, replace the missing attribute value with the known value from the case that most closely resembles the case with the missing value.
3. Merge the subsets.

Subset Flu = yes:
Case  Temp   Headache  Nausea  Flu
1     100.2  ?         no      yes
2     102.6  yes       yes     yes
4     99.6   yes       yes     yes
8     ?      yes       ?       yes

Subset Flu = no:
Case  Temp   Headache  Nausea  Flu
3     ?      no        no      no
5     99.8   ?         yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
After merging (concept closest fit):
Case  Temp   Headache  Nausea  Flu
1     100.2  yes       no      yes
2     102.6  yes       yes     yes
3     96.4   no        no      no
4     99.6   yes       yes     yes
5     99.8   yes       yes     no
6     96.4   yes       no      no
7     96.6   no        yes     no
8     102.6  yes       yes     yes
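Concept closest fit only changes the candidate pool: the nearest case is searched among cases with the same decision. A sketch with the same conventions and first-minimum tie-breaking as before (my own code; for simplicity the Temp range r is computed globally rather than per subset, which does not affect these two fills):

```python
# Flu table split by attribute; None marks a missing value.
temps    = {1: 100.2, 2: 102.6, 3: None, 4: 99.6,
            5: 99.8, 6: 96.4, 7: 96.6, 8: None}
headache = {1: None, 2: "yes", 3: "no", 4: "yes",
            5: None, 6: "yes", 7: "no", 8: "yes"}
nausea   = {1: "no", 2: "yes", 3: "no", 4: "yes",
            5: "yes", 6: "no", 7: "yes", 8: None}
flu      = {1: "yes", 2: "yes", 3: "no", 4: "yes",
            5: "no", 6: "no", 7: "no", 8: "yes"}

known = [t for t in temps.values() if t is not None]
r = max(known) - min(known)  # Temp range (global, for simplicity)

def d(a, b, numeric=False):
    if a is None or b is None:
        return 1.0
    if a == b:
        return 0.0
    return abs(a - b) / r if numeric else 1.0

def distance(x, y):
    return (d(temps[x], temps[y], numeric=True)
            + d(headache[x], headache[y]) + d(nausea[x], nausea[y]))

def nearest_in_concept(x):
    # Candidates: other cases with the same decision as x.
    pool = (c for c in temps if c != x and flu[c] == flu[x])
    return min(pool, key=lambda c: distance(x, c))

# Case 3 (Flu = no) is closest to case 6, so its Temp is filled with 96.4;
# case 8 (Flu = yes) is closest to case 2, so its Temp is filled with 102.6.
print(nearest_in_concept(3), nearest_in_concept(8))
```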
Other Methods of Filling In Missing Values
A number of methods handle missing attribute values based on the dependence between known and missing values:
- Chase algorithm: for each case with missing data, a new data subset is created in which the missing attribute plays the role of the decision; the induced value replaces the missing one, and the data sets are then merged.
- Maximum likelihood estimation.
- Monte Carlo methods: missing values are replaced by many possible values; each completed data set is analyzed and the results are combined.
Parallel Methods
Work on subsets of the incomplete data, then induce rules. Two types of missing values:
- "Lost": needed but gone
- "Do not care": irrelevant
Concepts
A concept is the set of all cases with the same decision value:
    C1 = {1, 2, 4, 8}  (Flu = yes)
    C2 = {3, 5, 6, 7}  (Flu = no)

Case  Temp       Headache  Nausea  Flu
1     high       ?         no      yes
2     very high  yes       yes     yes
3     ?          no        no      no
4     high       yes       yes     yes
5     high       ?         yes     no
6     normal     yes       no      no
7     normal     no        yes     no
8     ?          yes       ?       yes
Parallel Method: "Lost" Values
A "lost" value (?) does not belong to any block.
Blocks: sets of cases sharing the same value of an attribute, e.g.
    [(Temp, high)] = {1, 4, 5}
    [(Nausea, yes)] = {2, 4, 5, 7}
Characteristic sets: the intersection of the blocks containing a given case (attributes with lost values are skipped), e.g.
    K(4) = {4},  K(5) = {4, 5}
These are used to build lower and upper approximations of the concepts:
    Lower({1, 2, 4, 8}) = {1, 2, 4}
    Upper({1, 2, 4, 8}) = {1, 2, 4, 6, 8}
→ rule induction
Parallel Method: "Do Not Care" Values
A "do not care" value (*) belongs to every block of its attribute (the table is the same as before, with * in place of ?), e.g.
    [(Temp, high)] = {1, 3, 4, 5, 8}
    [(Nausea, yes)] = {2, 4, 5, 7, 8}
Characteristic sets, e.g.
    K(4) = {4, 5, 8},  K(5) = {4, 5, 8}
Lower and upper approximations of the concepts:
    Lower({1, 2, 4, 8}) = {2, 8}
    Upper({1, 2, 4, 8}) = {1, 2, 3, 4, 5, 6, 8}
→ rule induction
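Both interpretations — "lost" values belonging to no block and "do not care" values belonging to every block — can be sketched with plain set operations (my own code; the subset-style lower and upper approximations reproduce the sets on the two slides above):

```python
from functools import reduce

# Flu table; None marks a missing value, read either as "lost" (?)
# or as "do not care" (*). Columns: Temp, Headache, Nausea.
rows = {
    1: ("high", None, "no"),      2: ("very high", "yes", "yes"),
    3: (None, "no", "no"),        4: ("high", "yes", "yes"),
    5: ("high", None, "yes"),     6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),   8: (None, "yes", None),
}
U = set(rows)
C = {1, 2, 4, 8}  # concept Flu = yes

def block(col, val, dont_care):
    # "Lost" values belong to no block; "do not care" values to every block.
    return {c for c, row in rows.items()
            if row[col] == val or (dont_care and row[col] is None)}

def K(c, dont_care):
    # Characteristic set: intersection of the blocks of c's known values.
    sets = [block(col, v, dont_care)
            for col, v in enumerate(rows[c]) if v is not None]
    return reduce(set.intersection, sets, set(U))

def lower(concept, dont_care):
    return set().union(*(K(c, dont_care) for c in concept
                         if K(c, dont_care) <= concept))

def upper(concept, dont_care):
    return set().union(*(K(c, dont_care) for c in concept))

print(K(4, False), K(5, False))          # {4} and {4, 5} ("lost")
print(lower(C, False), upper(C, False))  # {1, 2, 4} and {1, 2, 4, 6, 8}
print(K(4, True))                        # {4, 5, 8} ("do not care")
print(lower(C, True), upper(C, True))    # {2, 8} and {1, 2, 3, 4, 5, 6, 8}
```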
Rule Induction: the MLEM2 Algorithm
Rules describing the cases are induced from the decision table:
- Possible rules, from the upper approximation of a concept
- Certain rules, from the lower approximation of a concept
With missing values interpreted as lost:
Possible rules:
    (Temp, normal) → (Flu, no)
    (Headache, no) → (Flu, no)
Certain rules:
    (Temp, high) & (Nausea, no) → (Flu, yes)
    (Headache, yes) & (Nausea, yes) → (Flu, yes)
KDD
An organized and automated process of exploring patterns in large data sets.
More general than data mining: the core of KDD is DM.
The KDD Process: 9 Steps
Iterative and interactive. There is not yet a single best solution for each kind of problem at each step.
Step 1: Understand and specify the goals of the end user.
Preprocessing:
Step 2: Select and create the data set.
Step 3: Preprocessing and cleaning, to enhance reliability.
Step 4: Data transformation, to produce better data for DM.
Data mining:
Step 5: Choose the appropriate DM task.
Step 6: Choose the DM algorithm (precision vs. understandability).
Step 7: Employ the DM algorithm.
Step 8: Evaluate and interpret the mined patterns.
Step 9: Use the discovered knowledge.
The success of the entire KDD process is determined by this step.
Challenges, e.g. losing lab conditions.
END