Interval-valued Fuzzy-Rough Feature Selection in Datasets with Missing Values
Dr. Richard Jensen and Prof. Qiang Shen, Aberystwyth University, UK
FUZZ-IEEE 2009
Outline
  The importance of feature selection
  Rough set theory
  Fuzzy-rough feature selection (FRFS)
  Interval-valued FRFS
  Experimentation
  Conclusion
Why dimensionality reduction/feature selection?
  Growth of information: need to manage this effectively
  Curse of dimensionality: a problem for machine learning
  Data visualisation: graphing data
  [Diagram: High dimensional data -> Dimensionality Reduction (feature selection) -> Low dimensional data -> Processing System; passing high dimensional data directly to the Processing System is labelled Intractable]
Feature selection
  Feature selection (FS) is a dimensionality reduction (DR) technique that preserves data semantics (the meaning of the data)
  Subset generation: forwards, backwards, random…
  Evaluation function: determines the 'goodness' of subsets
  Stopping criterion: decides when to stop the subset search
  [Diagram: Feature set -> Generation -> (Subset) -> Evaluation -> (Subset suitability) -> Stopping Criterion; Continue returns to Generation, Stop leads to Validation]
Rough set theory
  Rx is the set of all points that are indiscernible from point x in terms of feature subset B
  [Diagram labels: Upper Approximation, Set A, Lower Approximation, Equivalence class Rx]
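The idea on this slide can be illustrated with a minimal sketch. It assumes crisp, discrete feature values and a hypothetical toy dataset; the function names are illustrative and are not taken from the authors' implementation. Objects that agree on every selected feature form an equivalence class, and the approximations of a set A are built from those classes.

```python
from collections import defaultdict

def equivalence_classes(objects, features):
    """Partition object indices by their values on the chosen features."""
    classes = defaultdict(set)
    for idx, obj in enumerate(objects):
        key = tuple(obj[f] for f in features)
        classes[key].add(idx)
    return list(classes.values())

def approximations(objects, features, concept):
    """Return (lower, upper) approximations of `concept`, a set of indices."""
    lower, upper = set(), set()
    for eq_class in equivalence_classes(objects, features):
        if eq_class <= concept:   # class lies wholly inside the concept
            lower |= eq_class
        if eq_class & concept:    # class overlaps the concept
            upper |= eq_class
    return lower, upper

# Hypothetical toy data: four objects described by features "a" and "b".
data = [
    {"a": 0, "b": 1},
    {"a": 0, "b": 1},
    {"a": 1, "b": 0},
    {"a": 1, "b": 1},
]
concept = {0, 2}  # the set A to approximate
lower, upper = approximations(data, ["a", "b"], concept)
print(lower, upper)
```

Objects 0 and 1 are indiscernible, so object 0 cannot be placed in the lower approximation even though it belongs to A; it appears only in the upper approximation, together with object 1.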
Rough set feature selection
  Attempts to remove unnecessary or redundant features
  Evaluation: function based on the rough set concept of lower approximation
  Generation: greedy hill-climbing algorithm employed
  Stopping criterion: when the maximum evaluation value is reached
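The three components above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the evaluation function is taken to be the usual rough set dependency degree (the fraction of objects in the positive region), the search follows the QuickReduct style of adding one feature at a time, and the data and names are hypothetical.

```python
from collections import defaultdict

def dependency(rows, features, decision):
    """Fraction of rows whose equivalence class (on `features`) has one decision value."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in features)].append(row[decision])
    consistent = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return consistent / len(rows)

def greedy_select(rows, candidates, decision):
    """Add the best feature each round until the full-set dependency is reached."""
    target = dependency(rows, candidates, decision)  # maximum evaluation value
    subset, best = [], dependency(rows, [], decision)
    while best < target:
        scores = {f: dependency(rows, subset + [f], decision)
                  for f in candidates if f not in subset}
        feature = max(scores, key=scores.get)        # hill-climbing step
        subset.append(feature)
        best = scores[feature]
    return subset, best

# Hypothetical toy data in which feature "b" alone determines the decision "d".
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
subset, best = greedy_select(rows, ["a", "b", "c"], "d")
print(subset, best)
```

Because the search is greedy, it finds a subset with maximal dependency but not necessarily the smallest such subset.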
Fuzzy-rough sets
  [Equations: fuzzy-rough set; fuzzy similarity]
Fuzzy-rough sets
  Fuzzy-rough feature selection
    Evaluation: function based on the fuzzy-rough lower approximation
    Generation: greedy hill-climbing
    Stopping criterion: when maximal 'goodness' is reached (or to degree α)
  Problem #1: how to choose fuzzy similarity?
  Problem #2: how to handle missing values?
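A minimal sketch of an evaluation function based on the fuzzy-rough lower approximation is given below. The slides do not state the similarity relation, t-norm or implicator, so the choices here are assumptions: similarity is 1 - |x - y| / range, attributes are combined with min, and the Lukasiewicz implicator is used. With crisp decision classes, the positive-region membership of an object reduces to its lower-approximation membership in its own class, which is what the loop computes. All names and data are hypothetical.

```python
def similarity(x, y, value_range):
    """Assumed fuzzy similarity: 1 - |x - y| / range, clipped at 0."""
    return max(0.0, 1.0 - abs(x - y) / value_range)

def implicator(a, b):
    """Lukasiewicz implicator, min(1, 1 - a + b) (an assumed choice)."""
    return min(1.0, 1.0 - a + b)

def fuzzy_dependency(X, labels, features):
    """Mean lower-approximation membership of each object to its own class."""
    n = len(X)
    # Range of each feature; fall back to 1.0 to avoid dividing by zero.
    ranges = {f: (max(r[f] for r in X) - min(r[f] for r in X)) or 1.0
              for f in features}
    total = 0.0
    for i in range(n):
        lower = 1.0
        for j in range(n):
            # Similarity over the subset: min across the selected features.
            sim = min((similarity(X[i][f], X[j][f], ranges[f]) for f in features),
                      default=1.0)
            same_class = 1.0 if labels[i] == labels[j] else 0.0
            lower = min(lower, implicator(sim, same_class))  # infimum over j
        total += lower
    return total / n

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.9, 5.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
dep_f0 = fuzzy_dependency(X, labels, [0])
dep_f1 = fuzzy_dependency(X, labels, [1])
print(dep_f0, dep_f1)
```

A greedy search would prefer feature 0 here, since objects from different classes are dissimilar on it, whereas feature 1 gives identical values to objects of different classes.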
Interval-valued FRFS
  [Equations: IV fuzzy-rough set; IV fuzzy similarity]
  Answer #1: Model uncertainty in fuzzy similarity by interval-valued similarity
Interval-valued FRFS
  Missing values
    When comparing two object values for a given attribute, what should be done if at least one is missing?
  Answer #2: Model missing values via the unit interval
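Answer #2 can be illustrated with a small sketch. It assumes that a missing value is encoded as None, that known values produce a degenerate interval [s, s] from the similarity 1 - |x - y| / range, and that per-attribute intervals are combined by applying min to each bound. These are illustrative assumptions; the exact interval construction used by the authors is not given on the slides.

```python
def iv_similarity(x, y, value_range):
    """Interval-valued similarity of two attribute values; None marks a missing value."""
    if x is None or y is None:
        return (0.0, 1.0)  # unknown similarity: anywhere in the unit interval
    s = max(0.0, 1.0 - abs(x - y) / value_range)
    return (s, s)          # assumed: known values give a degenerate interval

def iv_min(intervals):
    """Combine per-attribute intervals with min applied to each bound."""
    lows, highs = zip(*intervals)
    return (min(lows), min(highs))

# Hypothetical objects with three attributes; the second value of `a` is missing.
a = [0.2, None, 0.5]
b = [0.4, 0.7, 0.5]
per_attribute = [iv_similarity(x, y, 1.0) for x, y in zip(a, b)]
overall = iv_min(per_attribute)
print(per_attribute, overall)
```

The missing value widens the overall similarity to an interval from 0 up to the smallest similarity among the known attributes, so the uncertainty is carried forward rather than imputed away.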
Other measures
  Boundary region
  Discernibility function
Experimentation
  Datasets corrupted with noise
  10-fold cross validation with JRip
Results: lower
Results: boundary
Results: discernibility
Conclusion
  New approaches to fuzzy-rough feature selection based on IVFS
    Can handle missing values effectively
    Allows greater flexibility w.r.t. similarity relations
  Future work
    Further investigations
    Development and extension of other fuzzy-rough methods to handle missing values (classifiers, clusterers, etc.)
WEKA implementations of all fuzzy-rough feature selectors and classifiers can be downloaded from:
RSAR approximations
  Approximating a concept X using knowledge in P
    Lower approximation: contains objects that definitely belong to X
    Upper approximation: contains objects that possibly belong to X
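In the standard rough set notation, with [x]_P denoting the equivalence class of x under the features in P and U the set of all objects (symbols assumed here, since the slide's equations were not preserved), the two approximations can be written as:

```latex
\underline{P}X = \{\, x \in U \mid [x]_P \subseteq X \,\}, \qquad
\overline{P}X = \{\, x \in U \mid [x]_P \cap X \neq \emptyset \,\}
```

An object is in the lower approximation when its whole equivalence class lies inside X, and in the upper approximation when its class overlaps X at all.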
FRFS
  Based on fuzzy similarity
  Lower/upper approximations
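A common formulation of these approximations, given here as an assumed reconstruction because the slide's equations were not preserved, uses a fuzzy implicator I and a t-norm T over the fuzzy similarity relation R_P induced by the feature subset P:

```latex
\mu_{R_P \downarrow X}(x) = \inf_{y \in U} I\bigl(\mu_{R_P}(x, y),\, \mu_X(y)\bigr), \qquad
\mu_{R_P \uparrow X}(x) = \sup_{y \in U} T\bigl(\mu_{R_P}(x, y),\, \mu_X(y)\bigr)
```

Here \mu_{R_P}(x, y) is typically obtained by combining the per-attribute similarities \mu_{R_a}(x, y), for a in P, with a t-norm.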