Published by Miles Jefferson. Modified over 9 years ago.
1
Why preprocessing?
Learning method needs a particular data type: numerical, nominal, ...
Learning method cannot deal well enough with noisy / incomplete data
Too much data (memory, time)
–Examples
–Attributes
–Values
Data violate assumption of method
–Correlated attributes
2
Bias in learning method
–E.g. linearity
3
Preprocessing is part of DM/ML
Ideally, methods should include transformations
In practice, preprocessing takes most of the time of the DM process
Transformations × learning methods: large search space
Some preprocessing is useful for all learning methods, some is specific
Main types:
–Remove bad examples / features
–Discretisation
4
Attribute selection
Aka “feature subset selection”
Features that cannot contribute at all to prediction/classification cause problems for (some) learners
Redundant attributes can also be harmful
“Wrapper approach”: evaluate feature subsets by learning with them
“Filter approach”: try to identify bad attributes without learning, e.g. from association with the target class and association between attributes
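The filter approach can be sketched in a few lines. This is only an illustration under stated assumptions: Pearson correlation stands in for "association", the target is coded numerically, and the helper names (`pearson`, `filter_attributes`), the thresholds and the toy data are all hypothetical, not taken from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant attribute: no linear association
    return cov / math.sqrt(vx * vy)

def filter_attributes(columns, target, min_relevance=0.3, max_redundancy=0.95):
    """Keep attributes that are associated with the target and are not
    near-duplicates of an attribute that was already kept.
    Both thresholds are illustrative assumptions."""
    ranked = sorted(columns, key=lambda name: -abs(pearson(columns[name], target)))
    kept = []
    for name in ranked:
        if abs(pearson(columns[name], target)) < min_relevance:
            continue  # cannot contribute (linearly) to the prediction
        if any(abs(pearson(columns[name], columns[k])) > max_redundancy for k in kept):
            continue  # redundant with an attribute we already have
        kept.append(name)
    return kept

# Toy data (made up): two redundant temperature readings and one irrelevant column.
columns = {
    "temp_c": [10, 12, 15, 20, 25, 30],
    "temp_f": [50.0, 53.6, 59.0, 68.0, 77.0, 86.0],
    "noise": [3, 1, 2, 2, 1, 3],
}
target = [0, 0, 0, 1, 1, 1]
selected = filter_attributes(columns, target)
```

Here `selected` ends up with a single temperature column: the irrelevant attribute is dropped for low relevance and the second temperature reading for redundancy. A wrapper approach would instead train the learner on candidate subsets and compare their evaluation scores.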
5
Many combinations …
Optimal attribute subset depends on learner
Redundant: combine, not remove
–E.g. “thermometer value”, “subjective temperature”: average value is more reliable than either one of these!
6
Discretisation
Supervised / unsupervised
Fixed size “bins” / fixed number of “bins” / flexible
Supervised ~ 1-attribute learning with intervals
So: information gain, MDL (!?); maybe chi-square for stopping
Recursive splitting = 1-pass splitting
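The difference between the unsupervised and the supervised variants can be shown on a small example. The sketch below assumes equal-width and equal-frequency binning for the unsupervised case and a single information-gain cut for the supervised case; the function names and data are hypothetical, and stopping criteria such as MDL or chi-square are left out.

```python
import math
from collections import Counter

def equal_width_bins(values, k):
    """Unsupervised: k bins of fixed size over the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Unsupervised: k bins holding roughly the same number of examples.
    Ties are not treated specially, so equal values may land in adjacent bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def entropy(labels):
    """Class entropy in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Supervised: the single cut point with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain = base - remainder
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_gain

values = [1, 2, 3, 4, 10, 11, 12, 100]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
width_bins = equal_width_bins(values, 3)      # the extreme value 100 dominates the range
freq_bins = equal_frequency_bins(values, 4)   # two examples per bin
cut, gain = best_split(values, labels)        # cut between 4 and 10
```

Applying `best_split` again to each side of the cut, until a stopping criterion fires, gives recursive splitting.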
7
Discrete to numerical
Each attribute-value combination as separate binary attribute 1/0
Or: “scaling”: red 10, yellow 7, red 9, green 5, 3, yellow 6
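The binary-attribute encoding is commonly called one-hot encoding. A minimal sketch, with a hypothetical helper name and made-up colour values:

```python
def one_hot(values):
    """Turn one nominal attribute into one 0/1 attribute per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colours = ["red", "yellow", "red", "green"]
categories, rows = one_hot(colours)
# categories -> ['green', 'red', 'yellow']
# rows       -> [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Unlike mapping each colour to a single number, this encoding does not impose an order or a distance between the values, at the cost of one extra attribute per value.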
8
More transformations
Principal component analysis
–Find principal components (~ correlated attributes)
–Remove components with little variance
–Use components as attributes for learning
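The three PCA steps can be illustrated without any library. The sketch below is an assumption-laden toy: it finds only the first principal component, uses power iteration on the covariance matrix (in practice one would use an eigendecomposition or SVD routine), and the function names and data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Centre the data and return (centred rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centred, cov

def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance,
    provided the start vector is not orthogonal to that direction."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero covariance matrix: no direction to find
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance

# Two strongly correlated attributes (made-up data, second is about twice the first).
rows = [[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.8]]
centred, cov = covariance_matrix(rows)
direction, variance = first_component(cov)
explained = variance / (cov[0][0] + cov[1][1])  # share of total variance

# Use the component as a new attribute: one value per example instead of two.
scores = [sum(c[j] * direction[j] for j in range(len(direction))) for c in centred]
```

In this toy data the first component carries almost all of the variance, so the second component would be removed and `scores` would replace the two original attributes.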
9
Data cleansing
Impossible values
Outliers (from distribution, median/mean)
Outliers (from predictions)
Risk: throwing away unexpected but correct data (anomalies)
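Two of these checks are easy to sketch. The example below assumes a domain rule for impossible values (an age must lie between 0 and 130, a made-up range) and uses the modified z-score based on the median and the median absolute deviation as one possible distribution-based outlier test; the function name and the data are hypothetical.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median / MAD based) exceeds threshold.
    The constant 0.6745 and the cut-off 3.5 are the conventional choices."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # more than half the values are identical: test not usable
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

ages = [23, 25, 31, 29, 27, 24, 250, 26, -4]

# Impossible values: violate an assumed domain rule, regardless of distribution.
impossible = [v for v in ages if not 0 <= v <= 130]

# Outliers: far from the bulk of the distribution.
outliers = mad_outliers(ages)
```

Flagged values are candidates for inspection rather than automatic deletion, since an unexpected value may still be correct.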