1 Analysis - Improving GP with Statistics, Chap. 8. Presenter: 김정집

2 0 Introduction
- The gap between theory and practice is wide
- Measurement is significant for:
  - dynamic run control
  - data preprocessing
  - judging the significance or meaning of a run
- Online analysis tools
- Offline analysis tools

3 1 Statistical Tools for GP
1.1 Basic Statistics Concepts
- Statistical population
  - the entire group of instances (measured and unmeasured)
- Sample
  - a subset of a statistical population
- Statistical significance level
  - a percentage value chosen for judging the value of a measurement

4 1.2 Basic Tools for GP
- Confidence interval
  - the range around a measured occurrence of an event in which the statistician estimates that a specified portion of future measurements of that same event will fall
  - example
    - 3000 programs per population
    - the average individual has 200 nodes
    - 3000 * 200 = 600,000 nodes
    - suppose that 600 out of 1000 sampled nodes were introns
      - the 95% confidence interval is 57%~63%
    - 2000 sampled nodes: 59%~61%
    - 30 sampled nodes: 40%~78%
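
The 1000-node example can be checked with a short sketch. It assumes the usual normal approximation for a sample proportion; the slide does not say which method produced its figures, so the bounds for other sample sizes may differ slightly from those listed. The function name and the value z = 1.96 (two-sided 95%) are choices made for this illustration.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a sample proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level (assumption:
    the sample is random and n is large enough for the approximation).
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 600 of 1000 sampled nodes were introns (the slide's example)
low, high = proportion_confidence_interval(600, 1000)
print(f"n=1000: {low:.1%} to {high:.1%}")

# Same observed proportion from only 30 sampled nodes: a much wider interval
low_small, high_small = proportion_confidence_interval(18, 30)
print(f"n=30:   {low_small:.1%} to {high_small:.1%}")
```

The point of the comparison is that the interval narrows as the sample grows, so a small sample of nodes says little about the intron share of the whole population.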

5 Correlation Measures
- Correlation coefficient
  - measures how strongly two variables are linearly related; for a coefficient of 0.8, the squared value (0.64) is the share of the variation in one variable that may be explained by variation in the other
  - a positive sign means that increasing values of the first variable are related to increasing values of the second variable
- Student's t-test
  - t >= 2: the two variables are related at roughly the 95% confidence level or better
- Use of correlation analysis in GP
  - e.g. the relationship between mutation rate and performance
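
A minimal sketch of both measures, using the mutation rate versus performance example. The data values are invented for illustration only, and the function names are hypothetical. The t value uses the standard formula for testing whether a correlation from n pairs differs from zero; note that the "t >= 2" rule of thumb is only approximate for small samples.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t value for a correlation r measured on n pairs (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical data: mutation rate per run and the best fitness reached
mutation_rate = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
best_fitness = [0.52, 0.55, 0.61, 0.70, 0.74, 0.69, 0.80, 0.78]

r = pearson_r(mutation_rate, best_fitness)
t = t_statistic(r, len(mutation_rate))
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, t = {t:.2f}")
```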

6 Testing Propositions
- Multiple regression
  - a more sophisticated technique than the simple correlation coefficient
- Testing propositions
  - F-test
    - tests whether a proposition is not false
  - Caveat
    - statistical tests rest on a number of assumptions
    - in practice, the tests work fairly well even when the assumptions are not fully met

7 2. Offline Preprocessing and Analysis
- Task of the researcher
  - select data series and data instances
  - determine transformations on the data
- Preprocessing and analysis
  - preprocessing to meet input representation constraints
  - preprocessing to extract useful information from the data so that the machine learning system can learn
  - analyzing the data to select a training set

8 2.1 Feature Representation Constraints
- Representation of features
  - ANN: inputs in the range [-1, 1]
  - Boolean system: 0 or 1, true or false
  - GP
    - great freedom in the representation of the features
    - can accept any input that the computer language can handle

9 2.2 Feature Extraction
- Feature extraction
  - extract useful types of information from the raw data
  - filter out noise
- Principal Components Analysis (PCA)
  - purpose: reducing redundancy in raw data
  - extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
  - not an automatic process
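
To make the "correlated in, uncorrelated out" idea concrete, here is a minimal sketch of PCA restricted to two data series, where the eigenvalues of the 2x2 covariance matrix have a closed form. The function name and the sample data are invented for this illustration; a real application would normally use a linear algebra library and more than two series.

```python
import math

def pca_two_series(xs, ys):
    """Project two series onto their principal axes.

    Returns (pc1, pc2, explained_share), where explained_share is the
    fraction of total variance carried by the first component.
    Assumes the series are not both constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Entries of the symmetric sample covariance matrix [[a, b], [b, c]]
    a = sum(u * u for u in dx) / (n - 1)
    c = sum(v * v for v in dy) / (n - 1)
    b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = mid + spread, mid - spread
    # Unit eigenvector belonging to the larger eigenvalue
    if b != 0.0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Rotate the centered data onto the two orthogonal axes
    pc1 = [u * vx + v * vy for u, v in zip(dx, dy)]
    pc2 = [-u * vy + v * vx for u, v in zip(dx, dy)]
    return pc1, pc2, lam1 / (lam1 + lam2)

# Two partially redundant input series (made-up values)
series_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
series_b = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
pc1, pc2, share = pca_two_series(series_a, series_b)
print(f"first component carries {share:.1%} of the variance")
```

When the first component carries nearly all of the variance, the second can be dropped, which is the redundancy reduction the slide refers to. Deciding how many components to keep is the part that is not automatic.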

10 Extraction of Periodic Information in Time Series Data
- Simple techniques
  - simple or exponential moving averages (SMAs or EMAs)
  - an SMA serves as a sort of low-pass filter, at the cost of lag
- Discrete Fourier transform
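
A minimal sketch of the two moving averages, applied to a step signal so that the lag is visible. The function names, the window of 3, and the smoothing factor of 0.5 are choices made for this illustration.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values; defined from index window-1 onward."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_moving_average(series, alpha):
    """EMA with smoothing factor alpha in (0, 1], seeded with the first value."""
    ema = [series[0]]
    for value in series[1:]:
        ema.append(alpha * value + (1.0 - alpha) * ema[-1])
    return ema

# A step from 0 to 1: both filters smooth the jump and react with a delay
step = [0.0] * 5 + [1.0] * 5
sma = simple_moving_average(step, 3)
ema = exponential_moving_average(step, 0.5)
print("SMA:", [round(v, 3) for v in sma])
print("EMA:", [round(v, 3) for v in ema])
```

The SMA needs a full window of post-step values before it reaches 1, while the EMA approaches 1 gradually; both illustrate the lag that comes with low-pass filtering.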

11 2.3 Analysis of Input Data
- Selecting a training set
  - how to choose among input series
  - how to choose training instances
- Choosing among input series
  - meta-learning approach
  - correlation coefficients between each potential input and the output: narrow down which inputs to use
  - correlation coefficients between the potential inputs: group related inputs
  - try different runs with different combinations of variables, then select the variables that are associated with good runs

12 Choosing Training Instances
- Data mining
  - many more training instances are available than a GP system could possibly digest
  - approach
    - select a random sample of training instances
    - calculate the approximate sample size
    - check that the random sample that is picked closely matches the sampled distribution
    - the GP system is programmed to pick a new small training set

13 3 Offline Postprocessing
3.1 Measurement of Processing Effort
- Processing effort
  - the number of individuals that have to be processed in order to find a solution
- Instantaneous probability
  - the probability that a certain run with M individuals generates a solution in generation i
- Success probability
  - the probability that one obtains a solution for the given problem if one performs a run over i generations

14
- The probability of finding a solution by generation i, using R runs
- How many runs do we need to solve a problem with a certain probability z?
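
The formulas for this slide did not survive in the transcript. A common way to answer both questions, assuming the runs are independent and each succeeds by generation i with the same probability P, is z = 1 - (1 - P)^R, which solves to R = ceil(log(1 - z) / log(1 - P)). The sketch below implements that relation; the function names and the example probabilities are illustrative, not taken from the slides.

```python
import math

def success_probability(p_single, runs):
    """Probability that at least one of `runs` independent runs succeeds,
    given that a single run succeeds with probability p_single."""
    return 1.0 - (1.0 - p_single) ** runs

def runs_required(p_single, z):
    """Smallest number of independent runs that reaches probability z.

    Requires 0 < p_single < 1 and 0 < z < 1.
    """
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - p_single))

# Hypothetical case: one run finds a solution by generation i half the time
runs = runs_required(0.5, 0.99)
print(runs, success_probability(0.5, runs))
```

Multiplying the required number of runs by the individuals processed per run gives an estimate of the processing effort defined on the previous slide.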

15 3.2 Trait Mining
- Code generated by GP
  - contains problem-relevant but redundant code, as well as code that is irrelevant to the problem at hand
  - avoiding useless code
    - restrict the size and complexity of the programs
    - gene banking
      - keeps a record of all expressions evolved so far during a GP run

16 4 Analysis and Measurement of Online Data
- 4.1 Online Data Analysis
  - monitor the transition from randomness to stability
  - highlight how the transition takes place
  - raise the possibility of controlling GP runs through feedback from the online measurements

17 4.2 Measurement of Online Data
- Generational online measurement
  - measures the population as a whole
    - e.g. average fitness, percentage of the population that consists of introns, ...
- Steady-state online measurement

18 4.3 Survey of Available Online Tools
- Fitness
  - best fitness
  - average fitness
  - variance of the fitness

19 Diversity
- Diversity
  - genotypic diversity: structural difference
  - phenotypic diversity: behavioral difference
- Measuring genotypic diversity
  - contains no quality (fitness) information
  - based on a comparison of the structure of the individuals only

20 Edit distance
- Edit distance
  - the number of elementary substitution operations necessary to traverse the search space from one program to another
  - fixed-length genomes
  - tree genomes
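
For fixed-length genomes, counting substitutions position by position gives the Hamming distance. The sketch below uses that as the edit distance and averages it over all pairs as one possible genotypic diversity measure; the averaging step and the function names are assumptions made for this illustration, and tree genomes need a more involved comparison that is not shown.

```python
from itertools import combinations

def hamming_distance(genome_a, genome_b):
    """Number of positions at which two equally long genomes differ."""
    if len(genome_a) != len(genome_b):
        raise ValueError("genomes must have the same length")
    return sum(1 for a, b in zip(genome_a, genome_b) if a != b)

def mean_pairwise_distance(population):
    """Average Hamming distance over all pairs of individuals."""
    pairs = list(combinations(population, 2))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)

# Toy population of fixed-length instruction sequences (made-up opcodes)
population = [
    ["ADD", "MUL", "SUB", "NOP"],
    ["ADD", "MUL", "ADD", "NOP"],
    ["SUB", "MUL", "ADD", "ADD"],
]
diversity = mean_pairwise_distance(population)
print(f"mean pairwise edit distance: {diversity:.2f}")
```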

21 Phenotypic Diversity
- Fitness variance
- Fitness histograms
- Entropy
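
One way to combine the last two measures is to bin the fitness values into a histogram and take the Shannon entropy of the bin frequencies. This is a sketch under that assumption; the bin width and the function name are illustrative choices, and the slides do not specify how the entropy is computed.

```python
import math
from collections import Counter

def fitness_entropy(fitness_values, bin_width=1.0):
    """Shannon entropy (in bits) of the fitness histogram.

    Fitness values are grouped into bins of width bin_width; the entropy
    is 0 when all individuals share a bin and grows as the population
    spreads evenly over more bins.
    """
    bins = Counter(math.floor(f / bin_width) for f in fitness_values)
    total = len(fitness_values)
    return -sum((count / total) * math.log2(count / total)
                for count in bins.values())

converged = [3.2, 3.4, 3.1, 3.9]   # all in one bin: no phenotypic diversity
spread = [0.5, 1.5, 2.5, 3.5]      # one individual per bin
print(fitness_entropy(converged), fitness_entropy(spread))
```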

22 Measuring Operator Effects
- Crossover effects
  - compare the average fitness of both parents with the average fitness of both offspring
  - alternatively, compare the fitness of children and parents one by one
  - formal way
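
The first comparison can be sketched as follows. The labels "constructive", "neutral" and "destructive", the tolerance parameter, and the assumption that higher fitness is better are all choices made for this illustration rather than definitions from the slides.

```python
def classify_crossover(parent_fitnesses, offspring_fitnesses, tolerance=0.0):
    """Compare the average fitness of the parents with that of the offspring.

    Assumes higher fitness is better. A difference within `tolerance`
    counts as neutral.
    """
    parent_avg = sum(parent_fitnesses) / len(parent_fitnesses)
    offspring_avg = sum(offspring_fitnesses) / len(offspring_fitnesses)
    difference = offspring_avg - parent_avg
    if difference > tolerance:
        return "constructive"
    if difference < -tolerance:
        return "destructive"
    return "neutral"

def destructive_share(events):
    """Fraction of crossover events classified as destructive."""
    labels = [classify_crossover(parents, children) for parents, children in events]
    return labels.count("destructive") / len(labels)

# Hypothetical log of crossover events: (parent fitnesses, offspring fitnesses)
events = [
    ((4.0, 6.0), (7.0, 5.0)),   # average rises from 5 to 6
    ((4.0, 6.0), (2.0, 4.0)),   # average falls from 5 to 3
    ((4.0, 6.0), (5.0, 5.0)),   # average unchanged
    ((8.0, 2.0), (1.0, 3.0)),   # average falls from 5 to 2
]
print(destructive_share(events))
```

Tracking the destructive share over the course of a run is the kind of online measurement that can feed a run-control rule.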

23 Intron Measurements
- Intron counting
  - the process of ascertaining the number of introns
  - intron counting in tree structures
    - a flag on each node indicates whether the node has the same output as one of its inputs
  - intron counting in linear genomes
    - replace an instruction with a NOP instruction; if the output does not change for any of the fitness cases, classify the instruction as an intron
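
The NOP-replacement test for linear genomes can be sketched on a toy register machine. The instruction format, the three-register machine, and the example program are all invented for this illustration. Note that the test checks one instruction at a time, so it can miss groups of instructions that only matter together.

```python
NOP = ("nop", 0, 0)

def run(program, x, n_registers=3):
    """Execute a linear program; the input is in register 0, so is the output."""
    regs = [x] + [0.0] * (n_registers - 1)
    for op, dst, src in program:
        if op == "add":
            regs[dst] += regs[src]
        elif op == "sub":
            regs[dst] -= regs[src]
        elif op == "mul":
            regs[dst] *= regs[src]
        # "nop" leaves all registers unchanged
    return regs[0]

def find_introns(program, fitness_cases):
    """Indices of instructions whose replacement by NOP changes no output."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for i in range(len(program)):
        variant = program[:i] + [NOP] + program[i + 1:]
        if [run(variant, x) for x in fitness_cases] == baseline:
            introns.append(i)
    return introns

program = [
    ("add", 1, 0),   # r1 = x
    ("mul", 2, 2),   # r2 = 0 * 0, has no effect
    ("add", 0, 1),   # r0 = 2x
    ("add", 0, 2),   # adds zero to the output
    ("sub", 1, 1),   # clears r1 after it was last read
]
introns = find_introns(program, [1.0, 2.0, 3.0])
effective_length = len(program) - len(introns)
print(introns, effective_length)
```

The last line applies the relation effective length = total length - intron length to the toy program.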

24 Compression Measurement
- Compression
  - the effective length of an individual is often reduced during evolution
  - effective length = total length - intron length

25 Node Use and Block Activation
- Node use
  - how often individual nodes are used in the present generation
  - measured by incrementing a counter associated with each node
- Block activation
  - the number of times the root node of a block is executed
- Salient block
  - code that influences the fitness evaluation

26 Real Time Run Control Using Online Measurements
- PADO
  - a meta-learning module changed the crossover operator itself during the run
- AIMGP
  - termination condition: the run stops when destructive crossover falls below 10% of total events
    - doubled speed

27 5. Generalization and Induction
- Generalization problem
  - sampling error
  - overfitting
    - complexity of the learned solution
    - amount of time spent training
    - size of the training set

28 5.1 An Example of Overfitting and Poor Generalization
- Training function
- Error function

32 5.2 Dealing with Generalization Issues
- Training set and test set
  - the best individual from the training set is run on the test set
  - which individual is the best on the training set?
    - the one with the highest fitness may have overfitted the training data
- Adding a new test set

33 6 Conclusion
- Observe, measure, test
- Tools to estimate the predictive value of GP models
- Tools to improve GP system or to test theories about
