Published by Ralph Bryan. Modified over 9 years ago.
1
Analysis - Improving GP with Statistics (Chap. 8)
Presenter: 김정집
2
0 Introduction
- The gap between theory and practice is wide
- Measurement matters for
  - dynamic run control
  - data preprocessing
  - the significance or meaning of a run
- Online analysis tools
- Offline analysis tools
3
1 Statistical Tools for GP
1.1 Basic Statistics Concepts
- Statistical population: the entire group of instances (measured and unmeasured)
- Sample: a subset of a statistical population
- Statistical significance level: a percentage value chosen for judging the value of a measurement
4
1.2 Basic Tools for GP
- Confidence interval: a range around a measured occurrence of an event within which the statistician estimates that a specified portion of future measurements of that same event will fall
- Example
  - 3000 programs per population
  - the average individual has 200 nodes
  - 3000 * 200 = 600,000 nodes
  - suppose that 600 out of 1000 sampled nodes were introns
    - at the 95% confidence level the interval is 57%~63%
  - 2000 sampled nodes: 59%~61%
  - 30 sampled nodes: 40%~78%
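The intron example can be checked with a few lines of Python. This is a minimal sketch that assumes the normal-approximation (Wald) interval for a proportion; the slide does not say which method produced its figures, and `proportion_ci` is a hypothetical helper name. For 1000 sampled nodes it gives roughly 57% to 63%. For the other sample sizes this approximation gives about 58% to 62% (2000 nodes) and about 42% to 78% (30 nodes), so the slide's figures may come from a different method or rounding.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z=1.96 corresponds to the 95% confidence level.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Same observed intron share (60%) at three different sample sizes.
for n in (1000, 2000, 30):
    low, high = proportion_ci(round(0.6 * n), n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
```

The interval narrows with the square root of the sample size, which is why 30 sampled nodes say very little about a population of 600,000 nodes.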
5
Correlation Measures
- Correlation coefficient
  - a coefficient of r = 0.8 means that r^2 = 0.64, i.e. about 64% of the variation in one variable may be explained by variations in the other variable
  - a positive sign (+): increasing values of the first variable are related to increasing values of the second variable
- Student's t-test
  - t >= 2: the two variables are related at the 95% confidence level or better
- Use of correlation analysis in GP
  - e.g. the relationship between mutation rate and performance
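A small sketch of how the correlation coefficient and its t statistic could be computed for a GP experiment. The mutation rates and performance values below are made-up illustration data, not results from the presentation, and the function names are hypothetical. The t statistic uses the standard formula t = r * sqrt((n - 2) / (1 - r^2)).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing whether a correlation r from n pairs is nonzero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical data: mutation rate per run and the best fitness reached.
mutation_rate = [0.01, 0.05, 0.10, 0.20, 0.40, 0.60]
performance = [0.52, 0.58, 0.61, 0.70, 0.66, 0.75]

r = pearson_r(mutation_rate, performance)
t = t_statistic(r, len(mutation_rate))
print(f"r={r:.2f}, r^2={r * r:.2f}, t={t:.2f}")
```

With so few runs the rule of thumb "t >= 2" is only a rough guide; the exact threshold depends on the number of data pairs.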
6
Testing Propositions
- Multiple regression
  - a more sophisticated technique than the simple correlation coefficient
- Testing propositions
  - F-test
    - tests whether a proposition is not false
  - Caveat
    - statistical tests rest on a number of assumptions
    - in practice, the tests work fairly well even when those assumptions are not met
7
2. Offline Preprocessing and Analysis
- Task of the researcher
  - select data series and data instances
  - determine transformations on the data
- Preprocessing and analysis
  - preprocessing to meet input representation constraints
  - preprocessing to extract useful information from the data so that the machine learning system can learn
  - analyzing the data to select a training set
8
2.1 Feature Representation Constraints
- Representation of features
  - ANN: inputs in the range [-1, 1]
  - Boolean system: 0 or 1, true or false
  - GP
    - great freedom in the representation of the features
    - can accept any input that the computer language can handle
9
2.2 Feature Extraction
- Feature extraction
  - extract useful types of information from the raw data
  - filter out noise
- Principal Components Analysis (PCA)
  - purpose: reducing redundancy in raw data
  - extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
  - not an automatic process
10
Extraction of Periodic Information in Time Series Data
- Simple techniques
  - simple or exponential moving averages (SMAs or EMAs)
  - an SMA serves as a sort of low-pass filter, at the cost of lag
- Discrete Fourier transform
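The two moving averages can be sketched as follows. The function names, the window length, and the smoothing factor `alpha` are illustrative choices, not values from the presentation. Feeding a step series through both filters makes the lag visible: the smoothed output only gradually follows the jump.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values; the first output is at index window-1."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_moving_average(series, alpha):
    """EMA with smoothing factor alpha in (0, 1]; seeded with the first value."""
    ema = [series[0]]
    for x in series[1:]:
        ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema

# A step from 0 to 1: both filters smooth the jump but react with a delay.
step = [0, 0, 0, 0, 1, 1, 1, 1]
print(simple_moving_average(step, window=4))
print(exponential_moving_average(step, alpha=0.5))
```

A longer window (or a smaller `alpha`) removes more noise but increases the lag, which is the trade-off the slide points at.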
11
2.3 Analysis of Input Data
- Selecting a training set
  - how to choose among input series
  - how to choose training instances
- Choosing among input series
  - meta-learning approach
  - correlation coefficients between each potential input and the output -> narrow down which inputs to use
  - correlation coefficients between the potential inputs -> group the inputs
  - try different runs with different combinations of variables -> select the variables that are associated with good runs
12
Choosing Training Instances
- Data mining
  - many more training instances are available than a GP system could possibly digest
  - approach
    - select a random sample of training instances
    - calculate the approximate sample size
    - check that the random sample picked closely matches the distribution it was sampled from
    - the GP system is programmed to pick a new small training set
13
3 Offline Postprocessing
3.1 Measurement of Processing Effort
- Processing effort: the number of individuals that have to be processed in order to find a solution
- Instantaneous probability: the probability that a certain run with M individuals generates a solution in generation i
- Success probability: the probability that one obtains a solution for the given problem if one performs a run over i generations
14
- The probability of finding a solution by generation i, using R runs
- How many runs do we need to solve a problem with a certain probability z?
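The formulas on this slide did not survive extraction. The sketch below assumes they follow the commonly used formulation attributed to Koza: with P(M, i) the success probability by generation i, R independent runs succeed with probability z = 1 - (1 - P(M, i))^R, so R(z) = ceil(log(1 - z) / log(1 - P(M, i))), and the effort is M * (i + 1) * R(z). The function names and the example numbers are hypothetical.

```python
import math

def cumulative_success(successes_by_generation, total_runs):
    """Estimate P(M, i): share of runs that found a solution by generation i."""
    cumulative, found = [], 0
    for count in successes_by_generation:
        found += count
        cumulative.append(found / total_runs)
    return cumulative

def runs_required(p_success, z=0.99):
    """Number of independent runs needed to succeed with probability z."""
    if p_success <= 0:
        return math.inf
    if p_success >= 1:
        return 1
    return math.ceil(math.log(1 - z) / math.log(1 - p_success))

def processing_effort(pop_size, generation, p_success, z=0.99):
    """Individuals processed: M * (i + 1) * R(z), generations counted from 0."""
    return pop_size * (generation + 1) * runs_required(p_success, z)

# Hypothetical experiment: 10 runs, M = 500, first successes per generation.
p = cumulative_success([0, 2, 3], total_runs=10)
for i, p_i in enumerate(p):
    print(i, p_i, processing_effort(500, i, p_i))
```

Under this formulation the reported processing effort is usually the minimum of the effort over all generations i.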
15
3.2 Trait Mining
- Code generated by GP
  - contains problem-relevant but redundant code, and code that is irrelevant to the problem at hand
  - avoiding useless code
    - restrict the size and complexity of the programs
    - gene banking
      - keeps book on all expressions evolved so far during a GP run
16
4 Analysis and Measurement of Online Data
4.1 Online Data Analysis
- monitor the transition from randomness to stability
- highlight how the transition takes place
- raise the possibility of being able to control GP runs through feedback from the online measurements
17
4.2 Measurement of Online Data
- Generational online measurement
  - the population as a whole
    - average fitness, percentage of the population that consists of introns, ...
- Steady-state online measurement
18
4.3 Survey of Available Online Tools
- Fitness
  - best fitness
  - average fitness
  - variance of the fitness
19
Diversity
- Diversity
  - genotypic diversity: structural difference
  - phenotypic diversity: behavioral difference
- Measuring genotypic diversity
  - contains no quality (fitness) information
  - based only on a comparison of the structure of the individuals
20
Edit distance
- Edit distance: the number of elementary substitution operations necessary to traverse the search space from one program to another
  - fixed-length genomes
  - tree genomes
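A rough sketch of both cases. For fixed-length genomes the substitution count is the Hamming distance. For tree genomes the code below uses a simplified overlay distance (differing node labels plus the size of unmatched subtrees); this is an assumption made for illustration, not necessarily the measure used in the presentation, and it is not a minimal tree edit distance. The tree encoding `(label, [children])` is also hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two fixed-length genomes differ."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_size(tree):
    """Number of nodes in a tree given as (label, [children])."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)

def tree_distance(t1, t2):
    """Overlay two trees from the root and count the nodes that differ."""
    (label1, kids1), (label2, kids2) = t1, t2
    dist = int(label1 != label2)
    for c1, c2 in zip(kids1, kids2):
        dist += tree_distance(c1, c2)
    # Subtrees without a counterpart in the other tree count in full.
    shared = min(len(kids1), len(kids2))
    longer = kids1 if len(kids1) > len(kids2) else kids2
    dist += sum(tree_size(child) for child in longer[shared:])
    return dist

print(hamming_distance("10110", "10011"))
a = ("+", [("x", []), ("y", [])])   # x + y
b = ("*", [("x", []), ("1", [])])   # x * 1
print(tree_distance(a, b))
```

Averaging such pairwise distances over a population gives one possible genotypic diversity figure.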
21
Phenotypic Diversity
- Fitness variance
- Fitness histograms
- Entropy
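The three measures can be sketched on a list of fitness values. The entropy here is the Shannon entropy over histogram bins, which is one common reading of "entropy" as a phenotypic diversity measure; the bin width and function names are illustrative assumptions.

```python
import math
from collections import Counter
from statistics import pvariance

def fitness_histogram(fitnesses, bin_width):
    """Count how many individuals fall into each fitness bin."""
    return Counter(math.floor(f / bin_width) for f in fitnesses)

def fitness_entropy(fitnesses, bin_width):
    """Shannon entropy (in bits) of the distribution over fitness bins."""
    counts = fitness_histogram(fitnesses, bin_width)
    total = len(fitnesses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical populations: one spread out, one fully converged.
diverse = [0.5, 1.5, 2.5, 3.5]
converged = [2.0, 2.0, 2.0, 2.0]
for population in (diverse, converged):
    print(pvariance(population), fitness_entropy(population, bin_width=1.0))
```

Both variance and entropy drop toward zero as the population converges on a single fitness value.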
22
Measuring Operator Effects
- Crossover effects
  - compare the average fitness of both parents with the average fitness of both offspring
  - alternatively, compare the fitness of children and parents one by one
  - formal way
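The first comparison (average of both parents against average of both offspring) might be implemented like this. The labels "constructive", "neutral", and "destructive", the `tolerance` parameter, and the assumption that higher fitness is better are illustrative choices.

```python
def classify_crossover(parent_fitnesses, child_fitnesses, tolerance=0.0):
    """Label one crossover event by the change in average fitness.

    Assumes higher fitness is better.
    """
    parent_avg = sum(parent_fitnesses) / len(parent_fitnesses)
    child_avg = sum(child_fitnesses) / len(child_fitnesses)
    change = child_avg - parent_avg
    if change > tolerance:
        return "constructive"
    if change < -tolerance:
        return "destructive"
    return "neutral"

def destructive_share(events):
    """Fraction of crossover events that lowered the average fitness."""
    labels = [classify_crossover(parents, children)
              for parents, children in events]
    return labels.count("destructive") / len(labels)

# Hypothetical events: ((parent fitnesses), (offspring fitnesses)).
events = [((10, 20), (5, 15)), ((10, 20), (30, 40)), ((10, 20), (10, 20))]
print([classify_crossover(p, c) for p, c in events])
print(destructive_share(events))
```

Tracking the destructive share over time yields the kind of online signal that can later be used for run control.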
23
Intron Measurements
- Intron counting: the process of ascertaining the number of introns
  - intron counting in tree structures
    - flag: indicates whether a node has the same output as one of its inputs
  - intron counting in linear genomes
    - replace an instruction with a NOP instruction; if this changes nothing for any of the fitness cases, classify the instruction as an intron
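The NOP-replacement test for linear genomes can be sketched with a toy register machine. The instruction format `(op, dst, src)`, the three-register layout, and the convention that the input and the output both live in register 0 are assumptions made for this example only. "No change" is checked on the program output for every fitness case.

```python
NOP = ("nop", 0, 0)

def run(program, x, n_registers=3):
    """Execute a linear program; register 0 holds the input and the output."""
    regs = [x] + [0.0] * (n_registers - 1)
    for op, dst, src in program:
        if op == "add":
            regs[dst] += regs[src]
        elif op == "sub":
            regs[dst] -= regs[src]
        elif op == "mul":
            regs[dst] *= regs[src]
        # "nop" leaves all registers untouched
    return regs[0]

def count_introns(program, fitness_cases):
    """Indices of instructions whose replacement by NOP changes no output."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for i in range(len(program)):
        variant = program[:i] + [NOP] + program[i + 1:]
        if [run(variant, x) for x in fitness_cases] == baseline:
            introns.append(i)
    return introns

program = [
    ("add", 1, 0),  # r1 = x
    ("mul", 2, 2),  # r2 is never read afterwards
    ("add", 0, 1),  # r0 = x + x
    ("sub", 1, 1),  # r1 is cleared after its last use
]
introns = count_introns(program, fitness_cases=[1.0, 2.0, 3.0])
print(introns, "effective length:", len(program) - len(introns))
```

Because each instruction is tested on its own, this only finds introns that are removable individually; groups of instructions that cancel each other out would need a separate check.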
24
Compression Measurement
- Compression
  - the effective length of an individual is often reduced during evolution
  - effective length = total length - intron length
25
Node Use and Block Activation
- Node use
  - how often individual nodes are used in the present generation
  - measured by incrementing a counter associated with a node
- Block activation: the number of times the root node of a block is executed
- Salient block: code that influences fitness evaluation
26
Real Time Run Control Using Online Measurements
- PADO
  - a meta-learning module changed the crossover operator itself during the run
- AIMGP
  - termination condition: stop when destructive crossover fell below 10% of total events
    - doubled speed
27
5. Generalization and Induction
- Generalization problem
  - sampling error
  - overfitting
    - complexity of the learned solution
    - amount of time spent training
    - size of the training set
28
5.1 An Example of Overfitting and Poor Generalization
- Training function
- Error function
32
5.2 Dealing with Generalization Issues
- Training set and test set
  - the best individual from the training set is run on the test set
  - which individual is the best one from the training set?
    - the one with the highest fitness may be overfitted to the training data
- Adding a new test set
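A toy illustration of why the individual with the best training fitness is not necessarily the one to keep. The data, the two candidate "individuals", and the helper names are all made up for this sketch: one candidate memorizes the training points, the other is a simple linear model. A second, fresh set is then used only to report the error of the candidate that was chosen.

```python
def mean_squared_error(model, data):
    """Average squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def pick_by_error(models, data):
    """Return the model with the lowest error on the given data set."""
    return min(models, key=lambda m: mean_squared_error(m, data))

# Hypothetical noisy samples of y = 2x, split into three disjoint sets.
training = [(1, 2.1), (2, 3.9), (3, 6.2)]
test = [(4, 8.0), (5, 10.1)]
new_test = [(6, 11.9), (7, 14.2)]

lookup = dict(training)

def memorizer(x):
    """Reproduces the training points exactly, outputs 0 elsewhere."""
    return lookup.get(x, 0.0)

def linear(x):
    """A simple candidate that captures the underlying trend."""
    return 2.0 * x

candidates = [memorizer, linear]
best_on_training = pick_by_error(candidates, training)
best_on_test = pick_by_error(candidates, test)
print(best_on_training.__name__, best_on_test.__name__)
print(mean_squared_error(best_on_test, new_test))
```

Once the test set has been used to choose between individuals, it no longer gives an unbiased error estimate, which is one reason to add a further set that played no part in training or selection.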
33
6 Conclusion
- Observe, measure, test
- Tools to estimate the predictive value of GP models
- Tools to improve GP systems or to test theories about