Computing & Information Sciences, Kansas State University
CIS 530 / 730: Artificial Intelligence
Lecture 34 of 42
Wednesday, 19 November 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
KSOL course page: Course web site: Instructor home page:
Reading for Next Class: Sections 22.1, , Russell & Norvig 2nd edition
Genetic and Evolutionary Computation
Discussion: GA, GP
Learning Hidden Layer Representations
  Hidden Units and Feature Extraction
    Training procedure: hidden unit representations that minimize error E
    Sometimes backprop will define new hidden features that are not explicit in the input representation x, but which capture properties of the input instances that are most relevant to learning the target function t(x)
    Hidden units express newly constructed features
    Change of representation to linearly separable D'
  A Target Function (Sparse, aka 1-of-C, Coding)
    Can this be learned? (Why or why not?)
Training: Evolution of Error and Hidden Unit Encoding
  [Plots: output error $error_D(o_k)$ versus training epochs; hidden unit values $h_j$, $1 \le j \le 3$, versus training epochs]
Training: Weight Evolution
  Input-to-Hidden Unit Weights and Feature Extraction
    Changes in first weight layer values correspond to changes in hidden layer encoding and consequent output squared errors
    $w_0$ (bias weight, analogue of threshold in LTU) converges to a value near 0
    Several changes in first 1000 epochs (different encodings)
  [Plot: input-to-hidden weights $u_{i1}$, $1 \le i \le 8$, versus training epochs]
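A minimal sketch of the kind of experiment these plots summarize, assuming an 8 x 3 x 8 identity network (eight 1-of-C inputs, three hidden units, eight outputs) with sigmoid units trained by stochastic backprop. The function name `train_encoder`, the learning rate, and the initial weight range are illustrative assumptions, not values taken from the lecture.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_encoder(n_in=8, n_hidden=3, epochs=1000, eta=0.3, seed=0):
    """Train an n_in x n_hidden x n_in identity network with stochastic backprop.

    Returns (hidden weights, output weights, per-epoch sum of squared errors).
    Index 0 of every weight row is the bias weight w0.
    """
    rng = random.Random(seed)
    w_hid = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
             for _ in range(n_in)]
    # 1-of-C training set: the target equals the input.
    data = [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_in)]
    history = []
    for _ in range(epochs):
        sse = 0.0
        for x in data:
            t = x
            xb = [1.0] + x
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
            hb = [1.0] + h
            o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
            sse += sum((tk - ok) ** 2 for tk, ok in zip(t, o))
            # Error terms for output and hidden units (sigmoid derivative o(1 - o)).
            d_out = [ok * (1 - ok) * (tk - ok) for tk, ok in zip(t, o)]
            d_hid = [h[j] * (1 - h[j]) *
                     sum(d_out[k] * w_out[k][j + 1] for k in range(n_in))
                     for j in range(n_hidden)]
            for k in range(n_in):
                for j in range(n_hidden + 1):
                    w_out[k][j] += eta * d_out[k] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hid[j][i] += eta * d_hid[j] * xb[i]
        history.append(sse)
    return w_hid, w_out, history
```

Plotting `history` against the epoch index, or the hidden activations for one fixed input, would give curves of the same kind as the error and encoding plots.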
Convergence of Backpropagation
  No Guarantee of Convergence to Global Optimum Solution
    Compare: perceptron convergence (to best $h \in H$, provided $h \in H$; i.e., the data are linearly separable, LS)
    Gradient descent to some local error minimum (perhaps not global minimum…)
    Possible improvements on backprop (BP)
      Momentum term (BP variant with slightly different weight update rule)
      Stochastic gradient descent (BP algorithm variant)
      Train multiple nets with different initial weights; find a good mixture
    Improvements on feedforward networks
      Bayesian learning for ANNs (e.g., simulated annealing) - later
      Other global optimization methods that integrate over multiple networks
  Nature of Convergence
    Initialize weights near zero
    Therefore, initial network near-linear
    Increasingly non-linear functions possible as training progresses
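For reference, the momentum variant is usually written as follows. This is the standard textbook form (as in Mitchell, Chapter 4), not a formula taken from the slide; $\eta$ is the learning rate, $\alpha$ the momentum constant, $\delta_j$ the error term of unit $j$, and $x_{ji}$ the $i$-th input to unit $j$.

```latex
\Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n-1),
\qquad 0 \le \alpha < 1
```

The second term carries part of the previous update forward, which can help the search roll through small local minima and flat regions of the error surface.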
Overtraining in ANNs
  Recall: Definition of Overfitting
    h overfits if some alternative h' is worse than h on $D_{train}$ but better on $D_{test}$
  Overtraining: A Type of Overfitting
    Due to excessive iterations
    Avoidance: stopping criterion (cross-validation: holdout, k-fold)
    Avoidance: weight decay
  [Figures: Error versus epochs (Example 1); Error versus epochs (Example 2)]
Overfitting in ANNs
  Other Causes of Overfitting Possible
    Number of hidden units sometimes set in advance
    Too few hidden units ("underfitting")
      ANNs with no growth
      Analogy: underdetermined linear system of equations (more unknowns than equations)
    Too many hidden units
      ANNs with no pruning
      Analogy: fitting a quadratic polynomial with an approximator of degree >> 2
  Solution Approaches
    Prevention: attribute subset selection (using pre-filter or wrapper)
    Avoidance
      Hold out cross-validation (CV) set or split k ways (when to stop?)
      Weight decay: decrease each weight by some factor on each epoch
    Detection/recovery: random restarts, addition and deletion of weights, units
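The two avoidance strategies can be sketched as follows. This is a hedged illustration only: `step` and `validation_error` are hypothetical callbacks supplied by the caller (one training epoch, and the error on the held-out CV set), and the `patience` rule is one common stopping criterion among several.

```python
def train_with_early_stopping(step, validation_error, max_epochs=1000, patience=10):
    """Run step() once per epoch and stop when the holdout error has not
    improved for `patience` consecutive epochs.

    Returns (best_epoch, best_error). In practice one would also snapshot
    the weights at best_epoch and restore them afterwards.
    """
    best_err = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        step()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err


def decay_weights(weights, factor=0.999):
    """Weight decay: shrink every weight by a constant factor (apply once per epoch)."""
    return [w * factor for w in weights]
```

With a k-way split, the same loop would be run once per fold and the resulting stopping epochs averaged.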
Example: Neural Nets for Face Recognition
  90% accurate learning head pose, recognizing 1-of-20 faces
  [Figure: network with 30 x 32 inputs and outputs Left, Straight, Right, Up; panels: hidden layer weights after 1 epoch; hidden layer weights after 25 epochs; output layer weights (including $w_0$) after 1 epoch]
Example: NetTalk
  Sejnowski and Rosenberg, 1987
  Early Large-Scale Application of Backprop
    Learning to convert text to speech
      Acquired model: a mapping from letters to phonemes and stress marks
      Output passed to a speech synthesizer
    Good performance after training on a vocabulary of ~1000 words
  Very Sophisticated Input-Output Encoding
    Input: 7-letter window; determines the phoneme for the center letter and context on each side; distributed (i.e., sparse) representation: 200 bits
    Output: units for articulatory modifiers (e.g., "voiced"), stress, closest phoneme; distributed representation
    40 hidden units; weights total
  Experimental Results
    Vocabulary: trained on 1024 of 1463 (informal) and 1000 of (dictionary)
    78% on informal, ~60% on dictionary
NeuroSolutions Demo
PAC Learning: Definition and Rationale
  Intuition
    Can't expect a learner to learn exactly
      Multiple consistent concepts
      Unseen examples: could have any label ("OK" to mislabel if "rare")
    Can't always approximate c closely (probability of D not being representative)
  Terms Considered
    Class C of possible concepts, learner L, hypothesis space H
    Instances X, each of length n attributes
    Error parameter $\epsilon$, confidence parameter $\delta$, true error $error_D(h)$
    size(c) = the encoding length of c, assuming some representation
  Definition
    C is PAC-learnable by L using H if for all $c \in C$, distributions D over X, $\epsilon$ such that $0 < \epsilon < 1/2$, and $\delta$ such that $0 < \delta < 1/2$, learner L will, with probability at least $(1 - \delta)$, output a hypothesis $h \in H$ such that $error_D(h) \le \epsilon$
    Efficiently PAC-learnable: L runs in time polynomial in $1/\epsilon$, $1/\delta$, n, size(c)
PAC Learning: Results for Two Hypothesis Languages
  Unbiased Learner
    Recall: sample complexity bound $m \ge \frac{1}{\epsilon} (\ln |H| + \ln (1/\delta))$
    Sample complexity not always polynomial
    Example: for unbiased learner, $|H| = 2^{|X|}$
    Suppose X consists of n booleans (binary-valued attributes)
      $|X| = 2^n$, $|H| = 2^{2^n}$
      $m \ge \frac{1}{\epsilon} (2^n \ln 2 + \ln (1/\delta))$
      Sample complexity for this H is exponential in n
  Monotone Conjunctions
    Target function: a conjunction of positive literals
    Active learning protocol (learner gives query instances): n examples needed
    Passive learning with a helpful teacher: k examples (k literals in true concept)
    Passive learning with randomly selected examples (proof to follow): $m \ge \frac{1}{\epsilon} (\ln |H| + \ln (1/\delta)) = \frac{1}{\epsilon} (\ln n + \ln (1/\delta))$
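To make the exponential growth concrete, the bound $m \ge \frac{1}{\epsilon}(\ln |H| + \ln(1/\delta))$ can be evaluated directly. A small sketch follows; the function names are illustrative, and the bound is rounded up to a whole number of examples.

```python
import math


def sample_bound(ln_h, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((ln_h + math.log(1.0 / delta)) / epsilon)


def unbiased_ln_h(n):
    """ln|H| for the unbiased learner over n boolean attributes: |H| = 2^(2^n)."""
    return (2 ** n) * math.log(2.0)


# The bound roughly doubles with every added attribute.
for n in (5, 10, 20):
    print(n, sample_bound(unbiased_ln_h(n), epsilon=0.1, delta=0.05))
```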
PAC Learning: Monotone Conjunctions [1]
  Monotone Conjunctive Concepts
    Suppose $c \in C$ (and $h \in H$) is of the form $x_1 \wedge x_2 \wedge \ldots \wedge x_m$
    n possible variables: either omitted or included (i.e., positive literals only)
  Errors of Omission (False Negatives)
    Claim: the only possible errors are false negatives (h(x) = -, c(x) = +)
    Mistake iff $(z \in h) \wedge (z \notin c) \wedge (\exists x \in D_{test} .\ x(z) = false)$: then h(x) = -, c(x) = +
  Probability of False Negatives
    Let z be a literal; let Pr(Z) be the probability that z is false in a positive $x \in D$
    z in target concept (correct conjunction $c = x_1 \wedge x_2 \wedge \ldots \wedge x_m$) $\Rightarrow$ Pr(Z) = 0
    Pr(Z) is the probability that a randomly chosen positive example has z = false (inducing a potential mistake, or deleting z from h if training is still in progress)
    $error(h) \le \sum_{z \in h} Pr(Z)$
  [Figure: instance space X containing regions c and h]
PAC Learning: Monotone Conjunctions [2]
  Bad Literals
    Call a literal z bad if $Pr(Z) > \epsilon = \epsilon'/n$
    z does not belong in h, and is likely to be dropped (by appearing with value false in a positive $x \in D$), but has not yet appeared in such an example
  Case of No Bad Literals
    Lemma: if there are no bad literals, then $error(h) \le \epsilon'$
    Proof: $error(h) \le \sum_{z \in h} Pr(Z) \le \sum_{z \in h} \epsilon'/n \le \epsilon'$ (worst case: all n literals z are in h but not in c)
  Case of Some Bad Literals
    Let z be a bad literal
    Survival probability (probability that it will not be eliminated by a given example): $1 - Pr(Z) < 1 - \epsilon'/n$
    Survival probability over m examples: $(1 - Pr(Z))^m < (1 - \epsilon'/n)^m$
    Worst case survival probability over m examples (n bad literals) = $n (1 - \epsilon'/n)^m$
    Intuition: more chance of a mistake = greater chance to learn
PAC Learning: Monotone Conjunctions [3]
  Goal: Achieve An Upper Bound for Worst-Case Survival Probability
    Choose m large enough so that probability of a bad literal z surviving across m examples is less than $\delta$
    Pr(z survives m examples) = $n (1 - \epsilon'/n)^m < \delta$
    Solve for m using inequality $1 - x < e^{-x}$
      $n e^{-m \epsilon'/n} < \delta$
      $m > \frac{n}{\epsilon'} (\ln n + \ln (1/\delta))$ examples needed to guarantee the bounds
    This completes the proof of the PAC result for monotone conjunctions
    Nota Bene: a specialization of $m \ge \frac{1}{\epsilon} (\ln |H| + \ln (1/\delta))$; $n/\epsilon' = 1/\epsilon$
  Practical Ramifications
    Suppose $\delta = 0.1$, $\epsilon' = 0.1$, n = 100: we need 6907 examples
    Suppose $\delta = 0.1$, $\epsilon' = 0.1$, n = 10: we need only 460 examples
    Suppose $\delta = 0.01$, $\epsilon' = 0.1$, n = 10: we need only 690 examples
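The three figures under Practical Ramifications can be checked by evaluating the right-hand side of $m > \frac{n}{\epsilon'}(\ln n + \ln(1/\delta))$. A short sketch, with an illustrative function name; the slide's numbers correspond to the truncated value, so the smallest sufficient m is one larger.

```python
import math


def conj_examples_needed(n, eps_prime, delta):
    """Right-hand side of m > (n / eps') * (ln n + ln(1/delta))."""
    return (n / eps_prime) * (math.log(n) + math.log(1.0 / delta))


for n, eps_prime, delta in [(100, 0.1, 0.1), (10, 0.1, 0.1), (10, 0.1, 0.01)]:
    bound = conj_examples_needed(n, eps_prime, delta)
    # int() truncates, which reproduces the figures quoted on the slide.
    print(f"n={n}, eps'={eps_prime}, delta={delta}: {int(bound)}")
```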
PAC Learning: k-CNF, k-Clause-CNF, k-DNF, k-Term-DNF
  k-CNF (Conjunctive Normal Form) Concepts: Efficiently PAC-Learnable
    Conjunctions of any number of disjunctive clauses, each with at most k literals
    $c = C_1 \wedge C_2 \wedge \ldots \wedge C_m$; $C_i = l_1 \vee l_2 \vee \ldots \vee l_k$; $\ln (|k\text{-CNF}|) = \ln (2^{(2n)^k}) = \Theta(n^k)$
    Algorithm: reduce to learning monotone conjunctions over $n^k$ pseudo-literals $C_i$
  k-Clause-CNF
    $c = C_1 \wedge C_2 \wedge \ldots \wedge C_k$; $C_i = l_1 \vee l_2 \vee \ldots \vee l_m$; $\ln (|k\text{-Clause-CNF}|) = \ln (3^{kn}) = \Theta(kn)$
    Efficiently PAC learnable? See below (k-Clause-CNF, k-Term-DNF are duals)
  k-DNF (Disjunctive Normal Form)
    Disjunctions of any number of conjunctive terms, each with at most k literals
    $c = T_1 \vee T_2 \vee \ldots \vee T_m$; $T_i = l_1 \wedge l_2 \wedge \ldots \wedge l_k$
  k-Term-DNF: "Not" Efficiently PAC-Learnable (Kind Of, Sort Of…)
    $c = T_1 \vee T_2 \vee \ldots \vee T_k$; $T_i = l_1 \wedge l_2 \wedge \ldots \wedge l_m$; $\ln (|k\text{-Term-DNF}|) = \ln (k 3^n) = \Theta(n + \ln k)$
    Polynomial sample complexity, not computational complexity (unless RP = NP)
    Solution: Don't use H = C! $k\text{-Term-DNF} \subseteq k\text{-CNF}$ (so let H = k-CNF)
Consistent Learners
  General Scheme for Learning
    Follows immediately from definition of consistent hypothesis
    Given: a sample D of m examples
    Find: some $h \in H$ that is consistent with all m examples
    PAC: show that if m is large enough, a consistent hypothesis must be close enough to c
    Efficient PAC (and other COLT formalisms): show that you can compute the consistent hypothesis efficiently
  Monotone Conjunctions
    Used an Elimination algorithm (compare: Find-S) to find a hypothesis h that is consistent with the training set (easy to compute)
    Showed that with sufficiently many examples (polynomial in the parameters), h is close to c
    Sample complexity gives an assurance of "convergence to criterion" for specified m, and a necessary condition (polynomial in n) for tractability
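A minimal sketch of an Elimination-style learner for monotone conjunctions, assuming examples are given as tuples of n boolean values with a boolean label. The names `eliminate` and `predict` are illustrative.

```python
def eliminate(n, examples):
    """Learn a monotone conjunction over variables 0..n-1.

    Start with the conjunction of all n variables; for each positive example,
    drop every variable that is false in it. Negative examples are ignored.
    Returns the set of variable indices kept in the hypothesis h.
    """
    h = set(range(n))
    for x, label in examples:
        if label:
            h -= {i for i in h if not x[i]}
    return h


def predict(h, x):
    """h(x) is positive iff every variable kept in h is true in x."""
    return all(x[i] for i in h)
```

Because h only ever shrinks toward the target's variable set, it always contains every variable of c, so it can err only by predicting negative on a positive instance.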
VC Dimension: Framework
  Infinite Hypothesis Space?
    Preceding analyses were restricted to finite hypothesis spaces
    Some infinite hypothesis spaces are more expressive than others, e.g.,
      rectangles vs. 17-sided convex polygons vs. general convex polygons
      linear threshold (LT) function vs. a conjunction of LT units
    Need a measure of the expressiveness of an infinite H other than its size
  Vapnik-Chervonenkis Dimension: VC(H)
    Provides such a measure
    Analogous to | H |: there are bounds for sample complexity using VC(H)
VC Dimension: Shattering A Set of Instances
  Dichotomies
    Recall: a partition of a set S is a collection of disjoint sets $S_i$ whose union is S
    Definition: a dichotomy of a set S is a partition of S into two subsets $S_1$ and $S_2$
  Shattering
    A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S, there exists a hypothesis in H consistent with this dichotomy
    Intuition: a rich set of functions shatters a larger instance space
  The "Shattering Game" (An Adversarial Interpretation)
    Your client selects an S (an instance space X)
    You select an H
    Your adversary labels S (i.e., chooses a point c from concept space $C = 2^X$)
    You must then find some $h \in H$ that "covers" (is consistent with) c
    If you can do this for any c your adversary comes up with, H shatters S
VC Dimension: Examples of Shattered Sets
  [Figure: Three Instances Shattered, in instance space X]
  Intervals
    Left-bounded intervals on the real axis: [0, a), for $a \in R$, $a \ge 0$
      Sets of 2 points cannot be shattered
      Given 2 points, can label so that no hypothesis will be consistent
    Intervals on the real axis ([a, b], with $a, b \in R$ and $b > a$): can shatter 1 or 2 points, not 3
    Half-spaces in the plane (non-collinear): 1? 2? 3? 4?
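The interval claims can be checked by brute force over all dichotomies. The sketch below assumes distinct points (nonnegative for the left-bounded case) and uses illustrative function names; each `*_consistent` function decides whether some hypothesis in the class labels exactly the `True` points positive.

```python
from itertools import product


def interval_consistent(points, labels):
    """True if some interval [a, b] contains exactly the points labeled True."""
    pos = [p for p, y in zip(points, labels) if y]
    if not pos:
        return True  # an interval placed away from every point labels all negative
    lo, hi = min(pos), max(pos)
    # Any consistent interval must cover [lo, hi], so it fails iff a negative lies inside.
    return not any(lo <= p <= hi for p, y in zip(points, labels) if not y)


def left_bounded_consistent(points, labels):
    """True if some [0, a), a >= 0, contains exactly the points labeled True."""
    pos = [p for p, y in zip(points, labels) if y]
    neg = [p for p, y in zip(points, labels) if not y]
    return not pos or not neg or max(pos) < min(neg)


def shattered(points, consistent):
    """H shatters the points iff every dichotomy is realized by some hypothesis."""
    return all(consistent(points, labels)
               for labels in product([False, True], repeat=len(points)))


print(shattered([1.0, 2.0], interval_consistent))
print(shattered([1.0, 2.0, 3.0], interval_consistent))
print(shattered([1.0, 2.0], left_bounded_consistent))
```

The labeling (+, -, +) defeats intervals on three points, and the labeling (-, +) defeats left-bounded intervals on two points.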
Lecture Outline
  Readings for Friday
    Finish Chapter 20, Russell and Norvig 2e
    Suggested: Chapter 1, , Goldberg; 9.1 – 9.4, Mitchell
  Evolutionary Computation
    Biological motivation: process of natural selection
    Framework for search, optimization, and learning
  Prototypical (Simple) Genetic Algorithm
    Components: selection, crossover, mutation
    Representing hypotheses as individuals in GAs
  An Example: GA-Based Inductive Learning (GABIL)
  GA Building Blocks (aka Schemas)
  Taking Stock (Course Review)
Simple Genetic Algorithm (SGA)
  Algorithm Simple-Genetic-Algorithm (Fitness, Fitness-Threshold, p, r, m)
    // p: population size; r: replacement rate (aka generation gap width); m: mutation rate
    P ← p random hypotheses                          // initialize population
    FOR each h in P DO f[h] ← Fitness(h)             // evaluate Fitness: hypothesis → R
    WHILE (Max(f) < Fitness-Threshold) DO
      1. Select: Probabilistically select (1 - r)p members of P to add to P_S
      2. Crossover: Probabilistically select (r · p)/2 pairs of hypotheses from P
           FOR each pair DO P_S += Crossover(pair)   // P_S[t+1] = P_S[t] + offspring of the pair
      3. Mutate: Invert a randomly selected bit in m · p random members of P_S
      4. Update: P ← P_S
      5. Evaluate: FOR each h in P DO f[h] ← Fitness(h)
    RETURN the hypothesis h in P that has maximum fitness f[h]
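The pseudocode above can be sketched in Python roughly as follows. Assumptions not in the slide: hypotheses are lists of bits, selection is fitness-proportionate, crossover is single-point, m is read as a mutation rate, and the extra parameters `length`, `max_gens`, and `seed` are added so the sketch is self-contained and terminates.

```python
import random


def simple_ga(fitness, threshold, p, r, m, length, max_gens=200, seed=0):
    """Simple GA over bit lists: p = population size, r = fraction replaced
    by crossover, m = mutation rate, length = bits per individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(p)]
    n_pairs = int(r * p / 2)
    for _ in range(max_gens):
        f = [fitness(h) for h in pop]
        if max(f) >= threshold:
            break
        # Small offset keeps the weights valid even if every fitness is 0.
        weights = [fi + 1e-9 for fi in f]
        # 1. Select: copy (1 - r)p members, chosen in proportion to fitness.
        new_pop = [h[:] for h in rng.choices(pop, weights=weights, k=p - 2 * n_pairs)]
        # 2. Crossover: each selected pair contributes two offspring.
        for _ in range(n_pairs):
            a, b = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, length)
            new_pop.append(a[:cut] + b[cut:])
            new_pop.append(b[:cut] + a[cut:])
        # 3. Mutate: flip one random bit in m * p random members.
        for h in rng.sample(new_pop, int(m * p)):
            i = rng.randrange(length)
            h[i] = 1 - h[i]
        # 4. Update (5. Evaluate happens at the top of the next iteration).
        pop = new_pop
    return max(pop, key=fitness)


# Usage: maximize the number of 1 bits ("OneMax") in 20-bit strings.
best = simple_ga(sum, threshold=20, p=40, r=0.6, m=0.1, length=20)
print(sum(best), best)
```

Note that this version has no elitism, so the best individual of one generation is not guaranteed to survive into the next.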
GA-Based Inductive Learning (GABIL)
  GABIL System [DeJong et al., 1993]
    Given: concept learning problem and examples
    Learn: disjunctive set of propositional rules
    Goal: results competitive with those for current decision tree learning algorithms (e.g., C4.5)
  Fitness Function: $Fitness(h) = (Correct(h))^2$
  Representation
    Rules: IF $a_1 = T \wedge a_2 = F$ THEN c = T; IF $a_2 = T$ THEN c = F
    Bit string encoding: a_1 [10] . a_2 [01] . c [1] . a_1 [11] . a_2 [10] . c [0] = 10011 11100
  Genetic Operators
    Want variable-length rule sets
    Want only well-formed bit string hypotheses
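A sketch of how such a bit string could be decoded and scored. It assumes, consistently with the encoding shown, that each attribute uses two bits ordered (allows T, allows F), that the fifth bit of each rule is the class, that the first matching rule decides the class, and that Correct(h) is the percentage of examples classified correctly. The default class for unmatched instances is also an assumption.

```python
RULE_LEN = 5  # 2 bits for a1, 2 bits for a2, 1 bit for the class c


def rule_matches(rule, a1, a2):
    """rule is a 5-character bit string; each bit pair is (allows T, allows F)."""
    ok1 = (rule[0] == "1") if a1 else (rule[1] == "1")
    ok2 = (rule[2] == "1") if a2 else (rule[3] == "1")
    return ok1 and ok2


def classify(h, a1, a2, default=False):
    """Class assigned by the first rule of h that matches, else the default."""
    for k in range(0, len(h), RULE_LEN):
        rule = h[k:k + RULE_LEN]
        if rule_matches(rule, a1, a2):
            return rule[4] == "1"
    return default


def fitness(h, examples):
    """Fitness(h) = (Correct(h))^2 with Correct(h) taken as percent correct."""
    correct = sum(classify(h, a1, a2) == c for a1, a2, c in examples)
    return (100.0 * correct / len(examples)) ** 2


h = "1001111100"  # IF a1=T and a2=F THEN c=T; IF a2=T THEN c=F
print(classify(h, True, False), classify(h, True, True))
```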
Crossover: Variable-Length Bit Strings
  Basic Representation
    Start with two hypotheses over the fields a_1 a_2 c a_1 a_2 c
      h_1: 1[ ]00
      h_2: 0[1 1]
    Idea: allow crossover to produce variable-length offspring
  Procedure
    1. Choose crossover points for h_1, e.g., after bits 1, 8
    2. Now restrict crossover points in h_2 to those that produce bitstrings with well-defined semantics
  Example
    Suppose we choose one such pair of crossover points in h_2; the result is two offspring of different lengths
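One way to read step 2: a cut in h_2 is allowed only if it falls at the same offset within a rule as the corresponding cut in h_1, so that every offspring is still a whole number of well-formed rules. The sketch below assumes 5-bit rules as in the GABIL encoding; `h1` is the rule set encoded earlier (10011 11100), while `h2` is a hypothetical second parent chosen purely for illustration.

```python
def allowed_points(d1, d2, len_h2, rule_len=5):
    """Pairs (p, q), p < q, of cut positions in h2 whose offsets within a rule
    equal those of the cut positions (d1, d2) chosen in h1."""
    o1, o2 = d1 % rule_len, d2 % rule_len
    return [(p, q)
            for p in range(len_h2 + 1) if p % rule_len == o1
            for q in range(len_h2 + 1) if q % rule_len == o2 and p < q]


def crossover(h1, cuts1, h2, cuts2):
    """Two-point crossover: swap the segments between the cut positions."""
    (a, b), (c, d) = cuts1, cuts2
    return h1[:a] + h2[c:d] + h1[b:], h2[:c] + h1[a:b] + h2[d:]


h1 = "1001111100"
h2 = "0111010010"  # hypothetical second parent
print(allowed_points(1, 8, len(h2)))
print(crossover(h1, (1, 8), h2, (1, 3)))
```

With these inputs the offspring have 5 and 15 bits, i.e., one rule and three rules.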
GABIL Extensions
  New Genetic Operators
    Applied probabilistically
    1. AddAlternative: generalize constraint on $a_i$ by changing a 0 to a 1
    2. DropCondition: generalize constraint on $a_i$ by changing every 0 to a 1
  New Field
    Add fields to bit string to decide whether to allow above operators
      a_1 a_2 c a_1 a_2 c AA DC
    So now learning strategy also evolves!
    aka genetic wrapper
GABIL Results
  Classification Accuracy
    Compared to symbolic rule/tree learning methods
      C4.5 [Quinlan, 1993]
      ID5R
      AQ14 [Michalski, 1986]
    Performance of GABIL comparable
      Average performance on a set of 12 synthetic problems: 92.1% test accuracy
      Symbolic learning methods ranged from 91.2% to 96.6%
  Effect of Generalization Operators
    Result above is for GABIL without AA and DC
    Average test set accuracy on 12 synthetic problems with AA and DC: 95.2%
Building Blocks (Schemas)
  Problem
    How to characterize evolution of population in GA?
  Goal
    Identify basic building block of GAs
    Describe family of individuals
  Definition: Schema
    String containing 0, 1, * ("don't care")
    Typical schema: 10**0*
    Instances of above schema: , , …
  Solution Approach
    Characterize population by number of instances representing each schema
    m(s, t): number of instances of schema s in population at time t
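Schema membership and the count m(s, t) are easy to state in code. A small sketch, assuming individuals and schemas are strings of equal length; the function names and the sample population are illustrative.

```python
def matches(schema, s):
    """True if bit string s is an instance of the schema ('*' matches either bit)."""
    return len(schema) == len(s) and all(c == "*" or c == b
                                         for c, b in zip(schema, s))


def count_instances(schema, population):
    """m(s, t): number of individuals in the current population matching schema s."""
    return sum(matches(schema, h) for h in population)


population = ["101101", "100000", "010101", "111101"]  # illustrative population
print(count_instances("10**0*", population))
```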
Selection and Building Blocks
  Restricted Case: Selection Only
    Average fitness of population at time t
    m(s, t): number of instances of schema s in population at time t
    Average fitness of instances of schema s at time t
  Quantities of Interest
    Probability of selecting h in one selection step
    Probability of selecting an instance of s in one selection step
    Expected number of instances of s after n selections
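Under fitness-proportionate selection these three quantities take the following standard form (as in Mitchell, Section 9.4). The notation is an assumption made here: $\bar{f}(t)$ is the average fitness of the population at time t, $\hat{u}(s, t)$ the average fitness of instances of schema s at time t, $p_t$ the population at time t, and p the population size.

```latex
\Pr(h) = \frac{f(h)}{\sum_{i=1}^{p} f(h_i)} = \frac{f(h)}{p \, \bar{f}(t)}

\Pr(h \in s) = \sum_{h \in s \cap p_t} \frac{f(h)}{p \, \bar{f}(t)}
             = \frac{\hat{u}(s, t)}{p \, \bar{f}(t)} \, m(s, t)

E[m(s, t+1)] = n \cdot \Pr(h \in s)
             = \frac{\hat{u}(s, t)}{\bar{f}(t)} \, m(s, t) \quad \text{when } n = p
```

Schemas whose instances are fitter than the population average are therefore expected to gain instances under selection alone.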
Schema Theorem
  Theorem
    m(s, t): number of instances of schema s in population at time t
    Average fitness of population at time t
    Average fitness of instances of schema s at time t
    $p_c$: probability of single point crossover operator
    $p_m$: probability of mutation operator
    l: length of individual bit strings
    o(s): number of defined (non "*") bits in s
    d(s): distance between rightmost, leftmost defined bits in s
  Intuitive Meaning
    "The expected number of instances of a schema in the population tends toward its relative fitness"
    A fundamental theorem of GA analysis and design
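In terms of the quantities listed, the theorem is usually stated as the lower bound below (Holland's schema theorem in the form given by Mitchell, Section 9.4). The symbols $\bar{f}(t)$ for the average population fitness and $\hat{u}(s, t)$ for the average fitness of instances of s are notation assumed here.

```latex
E[m(s, t+1)] \;\ge\; \frac{\hat{u}(s, t)}{\bar{f}(t)} \, m(s, t)
\left( 1 - p_c \, \frac{d(s)}{l - 1} \right) (1 - p_m)^{o(s)}
```

The first factor is the effect of selection; the second bounds the chance that single-point crossover cuts between the defined bits of s; the third is the chance that mutation leaves all o(s) defined bits unchanged. Short, low-order, above-average schemas are thus the ones expected to spread.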
Genetic Programming
  Readings / Viewings
    View GP videos 1-3
      GP1 – Genetic Programming: The Video
      GP2 – Genetic Programming: The Next Generation
      GP3 – Genetic Programming: Invention
      GP4 – Genetic Programming: Human-Competitive
    Suggested: Chapters 1-5, Koza
  Previously
    Genetic and evolutionary computation (GEC)
    Generational vs. steady-state GAs; relation to simulated annealing, MCMC
    Schema theory and GA engineering overview
  Today: GP Discussions
    Code bloat and potential mitigants: types, OOP, parsimony, optimization, reuse
    Genetic programming vs. human programming: similarities, differences
GP Flow Graph
  Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez
Structural Crossover
  Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez
Structural Mutation
  Adapted from The Genetic Programming Notebook © 2002 Jaime J. Fernandez
Terminology
  Evolutionary Computation (EC): Models Based on Natural Selection
  Genetic Algorithm (GA) Concepts
    Individual: single entity of model (corresponds to hypothesis)
    Population: collection of entities in competition for survival
    Generation: single application of selection and crossover operations
    Schema aka building block: descriptor of GA population (e.g., 10**0*)
    Schema theorem: representation of schema proportional to its relative fitness
  Simple Genetic Algorithm (SGA) Steps
    Selection
      Proportionate (aka roulette wheel): $P(individual) \propto f(individual)$
      Tournament: let individuals compete in pairs or tuples; eliminate unfit ones
    Crossover
      Single-point
      Two-point
      Uniform
    Mutation: single-point ("bit flip"), multi-point
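The three crossover variants differ only in the mask that says which parent each bit comes from. A sketch under that view, with individuals as bit strings; all function names are illustrative.

```python
import random


def crossover_with_mask(a, b, mask):
    """First offspring takes bits from a where mask is '1' and from b where it is '0';
    the second offspring takes the complementary choice."""
    c1 = "".join(x if m == "1" else y for x, y, m in zip(a, b, mask))
    c2 = "".join(y if m == "1" else x for x, y, m in zip(a, b, mask))
    return c1, c2


def single_point_mask(n, cut):
    """Ones up to the cut position, zeros after it."""
    return "1" * cut + "0" * (n - cut)


def two_point_mask(n, i, j):
    """Ones only between positions i and j."""
    return "0" * i + "1" * (j - i) + "0" * (n - j)


def uniform_mask(n, rng=random):
    """Each bit of the mask chosen independently and uniformly."""
    return "".join(rng.choice("01") for _ in range(n))


a, b = "11111111", "00000000"
print(crossover_with_mask(a, b, single_point_mask(8, 3)))
print(crossover_with_mask(a, b, two_point_mask(8, 2, 5)))
print(crossover_with_mask(a, b, uniform_mask(8, random.Random(0))))
```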
Summary Points
  Evolutionary Computation
    Motivation: process of natural selection
      Limited population; individuals compete for membership
      Method for parallelizing and stochastic search
    Framework for problem solving: search, optimization, learning
  Prototypical (Simple) Genetic Algorithm (GA)
    Steps
      Selection: reproduce individuals probabilistically, in proportion to fitness
      Crossover: generate new individuals probabilistically, from pairs of "parents"
      Mutation: modify structure of individual randomly
    How to represent hypotheses as individuals in GAs
  An Example: GA-Based Inductive Learning (GABIL)
  Schema Theorem: Propagation of Building Blocks
  Next Lecture: Genetic Programming, The Movie