1
Advanced Artificial Intelligence Feature Selection
Chung-Ang University, Jinhyeong Park. Welcome to the Advanced Artificial Intelligence class. This time, I will introduce feature selection.
2
What is a feature? A "feature" is a component of data.
Before talking about feature selection, you should know what a feature is. A feature is an individual measurable characteristic of the data. Features are usually numeric, but their meaning depends on the data.
3
What is an important feature?
What is a feature? Representing natural text. Text: "I am a puppy and I am unhappy." Features of the text: word counts (I: 2, unhappy: 1, puppy: 1; you, upset, bear: 0). Label: puppy or not? YES. For example, this is an instance of text data. Text data consists of several words, so in text data the features can be the individual words. Each feature can be relevant or irrelevant to our purpose. Which features are relevant when deciding whether the data is about a puppy or not? Surely the word "puppy" will be more relevant than the word "unhappy".
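As a minimal sketch, the word-count features on this slide could be computed as follows, assuming a small fixed vocabulary (the vocabulary list and function name are illustrative, not from the slides):

```python
# Minimal sketch: turn a text into word-count features over a fixed
# vocabulary. The vocabulary below mirrors the slide's example words.
from collections import Counter

vocabulary = ["i", "you", "upset", "unhappy", "puppy", "bear"]

def text_to_features(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[word] for word in vocabulary]  # Counter gives 0 for absent words

features = text_to_features("I am a puppy and I am unhappy.")
print(dict(zip(vocabulary, features)))
# {'i': 2, 'you': 0, 'upset': 0, 'unhappy': 1, 'puppy': 1, 'bear': 0}
```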
4
What is an important feature?
What is a feature? Representing an image. Instance: an image. Features of the instance: numeric values (3.29, -15, 48.3, 25.1, 3.82, ...). Label: person or not? YES. Next is an instance of image data. Image data consists of many pixels with millions of RGB values, so image data also has many features that can describe it. Some features can be redundant given other features. In this example, which features can be called redundant to the others?
5
What is an important feature?
Maybe it is the eyes. One eye is similar to the other; if you do not need the information from both "eye" features, they are redundant to each other. When preparing data for machine learning, we need to select only the important features out of the many available.
6
Why select features? So why do we need to select features? To increase prediction accuracy and to reduce learning time. Data consists of one or more features, and each feature has a feature space that can take various values. If the data has one feature, it has a one-dimensional feature space.
7
Why select features? If it has two features, it has a two-dimensional feature space; if it has N features, it has an N-dimensional feature space. Suppose you have data that can take 10 different values in each feature dimension.
8
The curse of dimensionality
10 positions. With a one-dimensional feature space there are only 10 possible positions. Therefore, 10 data points are required to create a representative sample that covers the problem space.
9
The curse of dimensionality
1 dimension: 10 positions; 2 dimensions: 100 positions. With two dimensions there are 10² = 100 possible positions. Therefore, 100 data points are required to create a representative sample that covers the problem space.
10
The curse of dimensionality
3 dimensions: 1000 positions. With just three dimensions there are now 10³ = 1000 possible positions, so 1000 data points are required to create a representative sample that covers the problem space. The required number of data points continues to grow exponentially with the number of dimensions.
11
The curse of dimensionality
This means that the higher the dimension, the smaller the fraction of the whole space the data occupies. As the data becomes sparse, new data is likely to be farther away from the training data, so prediction requires much more work and is less accurate than in lower dimensions. This phenomenon is called the curse of dimensionality.
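A small sketch can make this concrete: with a fixed number of random points in the unit hypercube, the average distance to the nearest neighbour grows as the dimension increases (the point count and dimensions below are arbitrary illustrative choices):

```python
# Sketch of the curse of dimensionality: the same number of points
# becomes sparser, so nearest neighbours drift farther away.
import numpy as np

rng = np.random.default_rng(0)
n_points = 200

for dim in (1, 2, 3, 10, 100):
    x = rng.random((n_points, dim))                    # points in [0, 1]^dim
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                        # ignore self-distances
    print(f"dim={dim:3d}  mean nearest-neighbour distance = {d.min(axis=1).mean():.3f}")
```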
12
Dimensionality Reduction
Feature Selection (Filter, Wrapper) · Feature Extraction (PCA). To avoid the curse of dimensionality, numerous dimensionality reduction techniques have been developed. These techniques aim to reduce the number of dimensions in a data set without significant loss of information.
13
Dimensionality Reduction
Feature Selection (Filter, Wrapper) · Feature Extraction (PCA). Dimensionality reduction can be divided into two subcategories, called feature selection and feature extraction. Feature selection includes wrapper and filter methods, and feature extraction includes Principal Component Analysis (PCA).
14
Feature Selection & Extraction
𝑎+𝑏+𝑐+𝑑=𝑒. So how exactly do feature selection and feature extraction reduce dimensions? To see how this works, think of a simple algebraic equation: 𝑎+𝑏+𝑐+𝑑=𝑒.
15
Feature Selection & Extraction
𝑎+𝑏+𝑐+𝑑=𝑒; with 𝑐=0: 𝑎+𝑏+𝑑=𝑒 (Feature Selection). Consider if 𝑐 were equal to 0, or a negligibly small number. It wouldn't really be relevant, so it could be taken out of the equation. By doing so, you'd be using feature selection, because you'd be selecting only the relevant variables and leaving out the irrelevant one.
16
Feature Selection & Extraction
𝑎+𝑏+𝑐+𝑑=𝑒; 𝑐=0 gives 𝑎+𝑏+𝑑=𝑒 (Feature Selection); 𝑎𝑏=𝑎+𝑏 gives 𝑎𝑏+𝑐+𝑑=𝑒 (Feature Extraction). Now, if you define a new variable 𝑎𝑏=𝑎+𝑏, you turn a representation of two variables into one. Then you're using feature extraction to reduce the number of variables.
17
Feature Selection & Extraction
𝑎+𝑏+𝑐+𝑑=𝑒: feature selection finds a feature subset (𝑐=0 → 𝑎+𝑏+𝑑=𝑒), while feature extraction creates a new feature (𝑎𝑏=𝑎+𝑏 → 𝑎𝑏+𝑐+𝑑=𝑒). Feature selection techniques should be distinguished from feature extraction: feature extraction creates new features as functions of the original features, whereas feature selection returns a subset of the original features.
18
Feature Selection: Full Feature Set → Select Useful Features → Feature Subset
The focus of feature selection is to select a good subset of features from the input data. This can preserve good predictive accuracy while reducing noise from irrelevant features. There are two strategies for doing this.
19
Selection Strategy • Wrapper • Filter
Concerning the different selection strategies, feature selection methods can be broadly categorized into wrapper methods and filter methods.
20
Selection Strategy • Wrapper Step (1) Search for a subset of features.
Step (2) Evaluate the selected features. Wrapper methods rely on the predictive performance of a predefined learning algorithm to evaluate the quality of the selected features. Given a specific learning algorithm, a typical wrapper method performs these two steps, repeating (1) and (2) until some stopping criterion is satisfied.
21
Selection Strategy • Wrapper Selecting the Best Subset
Set of all Features → Generate a Subset → Learning Algorithm → Performance. The feature-set search component first generates a subset of features. Then the learning algorithm acts as a black box that evaluates the quality of this subset based on its learning performance. The whole process runs iteratively until the best subset is found or the desired number of selected features is obtained.
22
Selection Strategy • Wrapper Selecting the Best Subset
Set of all Features → Generate a Subset → Learning Algorithm → Performance. The feature subset that gives the highest learning performance is then returned as the selected feature subset.
23
Subset Selection • Wrapper. Full feature set 𝑓₁, 𝑓₂, 𝑓₃, ..., 𝑓ₙ: 2ⁿ possible subsets.
Unfortunately, if we have 𝑛 features, the number of possible subsets is 2ⁿ. It is impossible to enumerate each of these subsets and check how good it is. Therefore, wrapper methods usually use a heuristic search algorithm or a sequential selection algorithm to obtain the final subset within a reasonable time, as in the sketch below.
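For example, sequential forward selection starts from an empty set and greedily adds the feature whose addition helps the learner most. A hedged sketch, assuming a scikit-learn style learner and scikit-learn's bundled breast-cancer dataset (both chosen only for illustration):

```python
# Sketch of sequential forward selection as a wrapper: step (1) generates
# candidate subsets by adding one feature, step (2) evaluates each candidate
# with the learning algorithm, repeated until k features are selected.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

selected, remaining, k = [], list(range(X.shape[1])), 5
while len(selected) < k:
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy = {scores[best]:.3f}")
```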
24
Selection Strategy • Genetic Algorithm for Feature Selection
𝑓₁ 𝑓₂ 𝑓₃ 𝑓₄ 𝑓₅ 𝑓₆ 𝑓₇ 𝑓₈ 𝑓₉ → Feature Subset (GA chromosome of 0/1 bits). Let's see how a heuristic search algorithm is used for feature selection. A genetic algorithm (GA) can be used to find the subset of features, where the chromosome bits represent whether each feature is included or not. Searching for the global maximum of the objective function yields the best subset the search can find, though it may be suboptimal.
25
Subset Selection • Genetic Algorithm for Feature Selection
𝑓₁ 𝑓₂ 𝑓₃ 𝑓₄ 𝑓₅ 𝑓₆ 𝑓₇ 𝑓₈ 𝑓₉ → Feature Subset (GA chromosome). The GA parameters and operators can be modified to suit the data or the application and obtain the best performance. A modified version is called CHCGA, which differs from the traditional GA in the following ways.
26
Selection Strategy • Genetic Algorithm for Feature Selection
If 𝐸(𝑃₁) > 𝐸(𝐶₁), Parent 1 survives; if 𝐸(𝑃₂) < 𝐸(𝐶₂), Child 2 survives. First, the best N individuals are chosen from the pooled set of parents and offspring; that is, better offspring replace less fit parents.
27
Selection Strategy • Genetic Algorithm for Feature Selection
Second, a highly disruptive half-uniform crossover (HUX) operator is used: only half of the bits that differ between the two parents are exchanged. For this purpose, the number of differing bits, called the Hamming distance between the parents, is calculated. Half of this number is the number of bits exchanged between the parents to form the offspring, as in the sketch below.
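A minimal sketch of the HUX operator, assuming chromosomes are plain 0/1 lists (the function name is illustrative):

```python
# Half-uniform crossover (HUX): exchange exactly half of the bits in
# which the two parents differ, chosen at random.
import random

def hux(parent1, parent2):
    differing = [i for i in range(len(parent1)) if parent1[i] != parent2[i]]
    to_swap = random.sample(differing, len(differing) // 2)  # Hamming distance / 2
    child1, child2 = parent1[:], parent2[:]
    for i in to_swap:
        child1[i], child2[i] = parent2[i], parent1[i]
    return child1, child2

c1, c2 = hux([1, 0, 1, 1, 0, 0], [0, 0, 1, 0, 1, 0])
print(c1, c2)
```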
28
Selection Strategy • Genetic Algorithm for Feature Selection
Parent 1 vs. Parent 2: Hamming distance/2 = 2; Parent 1 vs. Parent 3: Hamming distance/2 = 1.5; threshold d = 1.75. Third, during the reproduction step, each member of the parent population is randomly selected and paired for mating. However, a pair will not mate if half of their Hamming distance does not exceed the threshold d. The threshold is usually initialized to L/4, where L is the chromosome length.
29
Selection Strategy • Genetic Algorithm for Feature Selection
If no offspring are obtained in a generation, the threshold is decremented by one. Because of this criterion of mating only diverse parents, the population converges as the threshold decreases.
30
Selection Strategy • Genetic Algorithm for Feature Selection
The CHCGA converges on a solution faster and provides a more effective search by maintaining diversity and avoiding stagnation of the population.
31
Crossover (subset reproduction)
Selection Strategy: the wrapper algorithm. Crossover (subset reproduction) generates new subsets (e.g., 𝑓₁ 𝑓₅ 𝑓₇ 𝑓₈ 𝑓₁₀ 𝑓₁₂; 𝑓₂ 𝑓₆ 𝑓₇ 𝑓₁₀ 𝑓₁₁ 𝑓₂₀; 𝑓₈ 𝑓₉ 𝑓₁₀ 𝑓₁₂ 𝑓₁₃ 𝑓₁₅; ...), the learning algorithm evaluates each subset, and the search iterates until a good subset is found, here the best subset 𝑓₆ 𝑓₇ 𝑓₁₂ 𝑓₁₇ 𝑓₂₀ 𝑓₂₁. To sum up, the GA repeatedly reproduces subsets using the crossover function and searches for the best set. In this process, each subset is evaluated by the learning algorithm. The algorithm runs iteratively until it reaches the stopping criteria or finds the best subset; a compact sketch follows.
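A compact sketch of this GA wrapper loop, assuming the hux() operator from the earlier sketch (repeated here so the block is self-contained) and a scikit-learn learner as the black box; the CHC incest-prevention threshold and restart mechanism are omitted for brevity:

```python
# GA-based wrapper sketch: chromosomes are feature bitmasks, fitness is the
# cross-validated accuracy of the learner on the encoded feature subset.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

def fitness(mask):
    cols = [i for i, bit in enumerate(mask) if bit]
    return cross_val_score(model, X[:, cols], y, cv=3).mean() if cols else 0.0

def hux(p1, p2):
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    c1, c2 = p1[:], p2[:]
    for i in random.sample(diff, len(diff) // 2):
        c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

pop_size = 10
pop = [[random.randint(0, 1) for _ in range(X.shape[1])] for _ in range(pop_size)]
for gen in range(5):
    random.shuffle(pop)
    children = [c for p1, p2 in zip(pop[0::2], pop[1::2]) for c in hux(p1, p2)]
    # Elitist survival: keep the best N of parents and offspring combined.
    pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    print(f"generation {gen}: best CV accuracy = {fitness(pop[0]):.3f}")

print("selected features:", [i for i, bit in enumerate(pop[0]) if bit])
```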
32
Selecting the Best Subset
Selection Strategy • Filter. Set of all Features → Selecting the Best Subset → Learning Algorithm → Performance. Filter methods are independent of any learning algorithm. They rely on statistical measures of the data to evaluate the importance of each feature. Therefore, filter methods are more computationally efficient than wrapper methods.
33
Selecting the Best Subset
Selection Strategy • Filter. Set of all Features → Selecting the Best Subset → Learning Algorithm → Performance (no feedback). But because no learning algorithm guides the feature selection phase, the selected features may not be optimal for the target learning algorithm.
34
Selection Strategy • Filter
Step (1) Rank feature importance according to some evaluation criterion. Step (2) Delete the low-ranked features, or select the high-ranked features. A typical filter method consists of these two steps.
35
Selection Strategy • Filter method: Correlation criteria
Mutual Information. Examples of such statistical measures include correlation criteria and mutual information. We will look into the correlation criteria, which will help us understand the relevance of a feature.
36
Selection Strategy • Correlation criteria (1)
One of the simplest criteria is the Pearson correlation coefficient, defined as (1): 𝑅(𝑖) = cov(𝑥ᵢ, 𝑌) / √(var(𝑥ᵢ) · var(𝑌)), where 𝑥ᵢ is the 𝑖-th variable, 𝑌 is the class label, cov() is the covariance, and var() the variance. Correlation ranking can only detect linear dependencies between a variable and the target.
37
Selection Strategy. Input 𝑓₁ 𝑓₂ 𝑓₃ 𝑓₄ 𝑓₅ 𝑓₆ 𝑓₇ ... 𝑓𝑑 → Filter algorithm → feature ranking by score 𝐸(𝑓) (e.g., 0.8, 0.7, 0.3, 0.2, 0.1, ...), best first: 𝑓₆ 𝑓₁₂ 𝑓₇ 𝑓₂₆ 𝑓₁₅ 𝑓₂ 𝑓₉ ... 𝑓₄ → select the top n → feature subset 𝑓₆ 𝑓₁₂ 𝑓₇ 𝑓₂₆ 𝑓₁₅ 𝑓₂. To sum up, we put the data into the filter algorithm; the filter algorithm uses a statistical measure, such as the correlation criteria, to evaluate each feature and returns a feature ranking. We can then delete the low-ranked features or select the high-ranked features to obtain a feature subset, as in the sketch below.
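A sketch of such a filter using the Pearson criterion from (1), computed with plain NumPy (the dataset is again the illustrative breast-cancer set):

```python
# Correlation-criteria filter: score every feature by |R(i)|, rank them,
# and keep the top n. No learning algorithm is involved.
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def pearson_scores(X, y):
    """|cov(x_i, Y)| / sqrt(var(x_i) * var(Y)) for every feature i."""
    cov = (X - X.mean(axis=0)).T @ (y - y.mean()) / len(y)
    return np.abs(cov / np.sqrt(X.var(axis=0) * y.var()))

scores = pearson_scores(X, y)
ranking = np.argsort(scores)[::-1]   # best-scoring feature first
top_n = ranking[:5]                  # select the 5 highest-ranked features
print("selected subset:", top_n, "scores:", scores[top_n].round(3))
```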
38
Summary • Wrapper • Filter
Wrapper methods generally achieve better performance than filter methods, since they are tuned to the specific interaction between the learning algorithm and the data. However, exploiting this interaction comes at a large computational cost and can be slow.
39
Summary • Wrapper • Filter
Filter methods are independent of the learning algorithm and are more computationally efficient than wrapper methods. But we cannot be sure that the resulting subset is well suited to the learning algorithm.
40
Summary: Wrapper selecting the best subset, guided by a filter
Set of all Features → Generate a Subset → Learning Algorithm → Performance. Recent research shows that it is effective to apply a filter method when using wrapper methods. We can use the filter method in the wrapper's initialization phase or reproduction phase. This allows the wrapper to focus on promising features and increases performance; a hybrid sketch follows.
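A hedged sketch of this hybrid idea, combining the correlation filter and the greedy wrapper from the earlier sketches (all dataset and learner choices are illustrative):

```python
# Hybrid sketch: a correlation filter pre-selects a pool of promising
# features, then a greedy wrapper searches only within that pool.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Filter phase: keep the 10 features most correlated with the label.
cov = (X - X.mean(axis=0)).T @ (y - y.mean()) / len(y)
scores = np.abs(cov / np.sqrt(X.var(axis=0) * y.var()))
pool = list(np.argsort(scores)[::-1][:10])

# Wrapper phase: forward selection restricted to the filtered pool.
selected = []
while len(selected) < 5:
    cv = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
          for f in pool if f not in selected}
    best = max(cv, key=cv.get)
    selected.append(best)
    print(f"added feature {best}, CV accuracy = {cv[best]:.3f}")
```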