Evolutionary Search Artificial Intelligence CMSC 25000 January 25, 2007.


2 Agenda Motivation: –Evolving a solution Genetic Algorithms –Modelling search as evolution Mutation Crossover Survival of the fittest Survival of the most diverse Conclusions

3 Motivation: Evolution Evolution through natural selection –Individuals pass on traits to offspring –Individuals have different traits –Fittest individuals survive to produce more offspring –Over time, variation can accumulate Leading to new species

4 Simulated Evolution Evolving a solution Begin with population of individuals –Individuals = candidate solutions ~chromosomes Produce offspring with variation –Mutation: change features –Crossover: exchange features between individuals Apply natural selection –Select “best” individuals to go on to next generation Continue until satisfied with solution

5 Genetic Algorithms Applications Search parameter space for optimal assignment –Not guaranteed to find optimal, but can approach Classic optimization problems: –E.g. Travelling Salesman Problem Program design (“Genetic Programming”) Aircraft carrier landings

6 Genetic Algorithm Example Cookie recipes (Winston, AI, 1993) As evolving populations Individual = batch of cookies Quality: 0-9 –Chromosomes = 2 genes: 1 chromosome each Flour Quantity, Sugar Quantity: 1-9 Mutation: –Randomly select Flour/Sugar: +/- 1 [1-9] Crossover: –Split 2 chromosomes & rejoin; keeping both
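A minimal Python sketch of the two operators described on the slide above, assuming a chromosome is a (flour, sugar) tuple with each gene in 1-9. The function names and the boundary handling (only in-range steps are offered at 1 or 9) are illustrative assumptions, not taken from the slides.

```python
import random

GENE_MIN, GENE_MAX = 1, 9  # flour and sugar quantities both range over 1-9

def mutate(chromosome, rng=random):
    """Pick one gene at random and change it by +1 or -1, staying within 1-9."""
    genes = list(chromosome)
    i = rng.randrange(len(genes))
    # Assumption: at a boundary value only the in-range step is allowed.
    steps = [d for d in (-1, 1) if GENE_MIN <= genes[i] + d <= GENE_MAX]
    genes[i] += rng.choice(steps)
    return tuple(genes)

def crossover(a, b):
    """Split two 2-gene chromosomes between the genes and rejoin, keeping both."""
    return (a[0], b[1]), (b[0], a[1])

print(crossover((1, 4), (3, 1)))
```

With only two genes there is a single possible split point, so crossover simply swaps the sugar genes of the two parents.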

7 Fitness Natural selection: Most fit survive Fitness = Probability of survival to next gen Question: How do we measure fitness? –“Standard method”: Relate fitness (range 0-1) to quality (range 1-9)
Chromosome  Quality  Fitness
1-4         4        0.4
3-1         3        0.3
1-2         2        0.2
1-1         1        0.1
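The fitness values on this slide (0.4, 0.3, 0.2, 0.1 for qualities 4, 3, 2, 1) are consistent with dividing each quality by the total quality of the population. The sketch below assumes that reading of the standard method; the function name is illustrative.

```python
def standard_fitness(qualities):
    """Fitness of each candidate = its quality divided by the total quality."""
    total = sum(qualities)
    return [q / total for q in qualities]

for quality, fitness in zip([4, 3, 2, 1], standard_fitness([4, 3, 2, 1])):
    print(quality, round(fitness, 2))
```

Under this definition a candidate with quality 0 always gets fitness 0, so it can never be selected.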

8 GA Design Issues Genetic design: –Identify sets of features = genes; Constraints? Population: How many chromosomes? –Too few => inbreeding; Too many=>too slow Mutation: How frequent? –Too few=>slow change; Too many=> wild Crossover: Allowed? How selected? Duplicates?

9 GA Design: Basic Cookie GA Genetic design: –Identify sets of features: 2 genes: flour+sugar;1-9 Population: How many chromosomes? –1 initial, 4 max Mutation: How frequent? –1 gene randomly selected, randomly mutated Crossover: Allowed? No Duplicates? No Survival: Standard method

10 Basic Cookie GA Results Results are for 1000 random trials –Initial state: 1 chromosome, 1-1, quality 1 On average, reaches max quality (9) in 16 generations Best: max quality in 8 generations Conclusion: –Low dimensionality search Successful even without crossover

11 Basic Cookie GA+Crossover Results Results are for 1000 random trials –Initial state: 1 chromosome, 1-1, quality 1 On average, reaches max quality (9) in 14 generations Conclusion: –Faster with crossover: combine good in each gene –Key: Global max achievable by maximizing each dimension independently - reduce dimensionality

12 Solving the Moat Problem Problem: –No single step mutation can reach optimal values using standard fitness (quality=0 => probability=0) Solution A: –Crossover can combine fit parents in EACH gene However, still slow: 155 generations on average
Quality grid (one gene per axis, values 1-9):
1 2 3 4 5 4 3 2 1
2 0 0 0 0 0 0 0 2
3 0 0 0 0 0 0 0 3
4 0 0 7 8 7 0 0 4
5 0 0 8 9 8 0 0 5
4 0 0 7 8 7 0 0 4
3 0 0 0 0 0 0 0 3
2 0 0 0 0 0 0 0 2
1 2 3 4 5 4 3 2 1
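The moat can be checked mechanically. The sketch below encodes the quality grid from this slide and confirms that every single-gene +/-1 mutation from a rim chromosome lands either on the rim again or on a quality-0 cell. Which axis is flour and which is sugar is an assumption (the grid is symmetric, so it does not affect the result), and the helper names are illustrative.

```python
MOAT = [
    [1, 2, 3, 4, 5, 4, 3, 2, 1],
    [2, 0, 0, 0, 0, 0, 0, 0, 2],
    [3, 0, 0, 0, 0, 0, 0, 0, 3],
    [4, 0, 0, 7, 8, 7, 0, 0, 4],
    [5, 0, 0, 8, 9, 8, 0, 0, 5],
    [4, 0, 0, 7, 8, 7, 0, 0, 4],
    [3, 0, 0, 0, 0, 0, 0, 0, 3],
    [2, 0, 0, 0, 0, 0, 0, 0, 2],
    [1, 2, 3, 4, 5, 4, 3, 2, 1],
]

def quality(flour, sugar):
    """Quality of a chromosome with genes in 1-9."""
    return MOAT[sugar - 1][flour - 1]

def neighbors(flour, sugar):
    """Chromosomes reachable by one +/-1 mutation of a single gene."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(flour + df, sugar + ds) for df, ds in steps
            if 1 <= flour + df <= 9 and 1 <= sugar + ds <= 9]

# The rim is the outer ring of non-zero cells where the search starts.
rim = {(f, s) for f in range(1, 10) for s in range(1, 10)
       if f in (1, 9) or s in (1, 9)}

# True if every mutation from the rim stays on the rim or falls into the moat.
blocked = all(n in rim or quality(*n) == 0
              for cell in rim for n in neighbors(*cell))
print(blocked)
```

Because quality-0 chromosomes have survival probability 0 under the standard method, mutation alone cannot cross the two-cell-wide moat to the central peak.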

13 Questions How can we avoid the 0 quality problem? How can we avoid local maxima?

14 Rethinking Fitness Goal: Explicit bias to best – Remove implicit biases based on quality scale Solution: Rank method –Ignore actual quality values except for ranking Step 1: Rank candidates by quality Step 2: Probability of selecting the ith candidate, given that the first i-1 candidates were not selected, is a constant p. –Step 2b: Last candidate is selected if no other has been Step 3: Select candidates using the probabilities

15 Rank Method
Chromosome  Quality  Rank  Std. Fitness  Rank Fitness
1-4         4        1     0.4           0.667
1-3         3        2     0.3           0.222
1-2         2        3     0.2           0.074
5-2         1        4     0.1           0.025
7-5         0        5     0.0           0.012
Results: Average over 1000 random runs on Moat problem - 75 Generations (vs 155 for standard method) No 0 probability entries: Based on rank not absolute quality
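The rank fitness column can be reproduced from the rank method's selection rule. The sketch below assumes p = 2/3, a value inferred from the 0.667 entry rather than stated on the slide; the function name is illustrative.

```python
def rank_fitness(n, p=2 / 3):
    """Selection probabilities for n candidates ranked best-first.

    Candidate i is chosen with probability p given that no better-ranked
    candidate was chosen; the last candidate takes whatever is left.
    """
    probs = []
    remaining = 1.0  # probability that no candidate has been chosen yet
    for _ in range(n - 1):
        probs.append(remaining * p)
        remaining *= 1 - p
    probs.append(remaining)
    return probs

print([round(x, 3) for x in rank_fitness(5)])
```

The probabilities depend only on rank, so the quality-0 candidate still receives a small non-zero chance of selection.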

16 Diversity Diversity: –Degree to which chromosomes exhibit different genes –Rank & Standard methods look only at quality –Need diversity: escape local min, variety for crossover –“As good to be different as to be fit”

17 Rank-Space Method Combines diversity and quality in fitness Diversity measure: –Sum of inverse squared distances in genes Diversity rank: Avoids inadvertent bias Rank-space: –Sort on sum of diversity AND quality ranks –Best: lower left: high diversity & quality

18 Rank-Space Method
Chromosome  Q  D      D Rank  Q Rank  Comb Rank  R-S Fitness
1-4         4  0.04   1       1       1          0.667
3-1         3  0.25   5       2       4          0.025
1-2         2  0.059  3       3       2          0.222
1-1         1  0.062  4       4       5          0.012
7-5         0  0.05   2       5       3          0.074
D is computed w.r.t. the highest ranked chromosome, 5-1 Diversity rank breaks ties After selecting others, sum distances to both Results: Average (Moat) 15 generations
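The combined ranking on this slide can be recomputed as a check. The sketch below assumes chromosomes are (gene, gene) tuples, that 5-1 is the only chromosome selected so far, and that no candidate duplicates a selected chromosome (which would give a zero distance). Function and variable names are illustrative.

```python
def rank_space_order(quality, selected):
    """Order candidates best-first by the sum of quality rank and diversity rank."""
    candidates = list(quality)

    def crowding(c):
        # Sum of inverse squared distances to the chromosomes already selected.
        # A smaller value means the candidate is more diverse.
        return sum(1 / sum((a - b) ** 2 for a, b in zip(c, s)) for s in selected)

    by_quality = sorted(candidates, key=lambda c: -quality[c])
    by_diversity = sorted(candidates, key=crowding)
    q_rank = {c: i for i, c in enumerate(by_quality, start=1)}
    d_rank = {c: i for i, c in enumerate(by_diversity, start=1)}
    # Ties in the summed rank are broken by diversity rank.
    return sorted(candidates, key=lambda c: (q_rank[c] + d_rank[c], d_rank[c]))

quality = {(1, 4): 4, (3, 1): 3, (1, 2): 2, (1, 1): 1, (7, 5): 0}
print(rank_space_order(quality, selected=[(5, 1)]))
```

The resulting order is then fed to the rank method, so the best combined rank receives the largest selection probability.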

19 GA’s and Local Maxima Quality metrics only: –Susceptible to local max problems Quality + Diversity: –Can populate all local maxima Including global max –Key: Population must be large enough

20 GA Discussion Similar to stochastic local beam search –Beam: Population size –Stochastic: selection & mutation –Local: Each generation from single previous –Key difference: Crossover – 2 sources! Why crossover? –Schema: Partial local subsolutions E.g. 2 halves of TSP tour

21 Question Traveling Salesman Problem –CSP-style Iterative refinement –Genetic Algorithm N-Queens –CSP-style Iterative refinement –Genetic Algorithm

22 Iterative Improvement Example TSP –Start with some valid tour E.g. find greedy solution –Make incremental change to tour E.g. hill-climbing - take change that produces greatest improvement –Problem: Local minima –Solution: Randomize to search other parts of space Other methods: Simulated annealing, Genetic alg’s

23 Machine Learning: Nearest Neighbor & Information Retrieval Search Artificial Intelligence CMSC 25000 January 25, 2007

24 Agenda Machine learning: Introduction Nearest neighbor techniques –Applications: Credit rating Text Classification –K-nn –Issues: Distance, dimensions, & irrelevant attributes Efficiency: –k-d trees, parallelism

25 Machine Learning Learning: Acquiring a function from inputs to values, based on past inputs and their values, that can be applied to new inputs. Learn concepts, classifications, values –Identify regularities in data

26 Machine Learning Examples Pronunciation: –Spelling of word => sounds Speech recognition: –Acoustic signals => sentences Robot arm manipulation: –Target => torques Credit rating: –Financial data => loan qualification

27 Machine Learning Characterization Distinctions: –Are output values known for any inputs? Supervised vs unsupervised learning –Supervised: training consists of inputs + true output value »E.g. letters+pronunciation –Unsupervised: training consists only of inputs »E.g. letters only Course studies supervised methods

28 Machine Learning Characteristics Many machine learning techniques –Supervised vs Unsupervised Supervised: Input + true labels Unsupervised: Input ONLY –Classification vs Regression Classification: Output is from finite label set Regression: Output is continuous valued –Decision Boundary What function is learned? “Inductive Bias” –Linear, Rectangular, Voronoi diagram –Input features Discrete? Continuous? Which ones? Scaling?

29 Machine Learning Characterization Distinctions: –Are output values discrete or continuous? Discrete: “Classification” –E.g. Qualified/Unqualified for a loan application Continuous: “Regression” –E.g. Torques for robot arm motion Characteristic of task

30 Machine Learning Characterization Distinctions: –What form of function is learned? Also called “inductive bias” Graphically, decision boundary E.g. Single, linear separator –Rectangular boundaries - ID trees –Voronoi spaces…etc… [Figure: + and - examples]

31 Machine Learning Functions Problem: Can the representation effectively model the class to be learned? Motivates selection of learning algorithm [Figure: + and - examples] For this function, Linear discriminant is GREAT! Rectangular boundaries (e.g. ID trees) TERRIBLE! Pick the right representation!

32 Machine Learning Features Inputs: –E.g. words, acoustic measurements, financial data –Vectors of features: E.g. word: letters –‘cat’: L1 = c; L2 = a; L3 = t Financial data: F1 = # late payments/yr : Integer F2 = Ratio of income to expense: Real

33 Machine Learning Features Question: –Which features should be used? –How should they relate to each other? Issue 1: How do we define relation in feature space if features have different scales? –Solution: Scaling/normalization Issue 2: Which ones are important? –If differ in irrelevant feature, should ignore

34 Complexity & Generalization Goal: Predict values accurately on new inputs Problem: –Train on sample data –Can make arbitrarily complex model to fit –BUT, will probably perform badly on NEW data Strategy: –Limit complexity of model (e.g. degree of equ’n) –Split training and validation sets Hold out data to check for overfitting

35 Nearest Neighbor Memory- or case- based learning Supervised method: Training –Record labeled instances and feature-value vectors For each new, unlabeled instance –Identify “nearest” labeled instance –Assign same label Consistency heuristic: Assume that a property is the same as that of the nearest reference case.

36 Nearest Neighbor Example Problem: Robot arm motion –Difficult to model analytically Kinematic equations –Relate joint angles and manipulator positions Dynamics equations –Relate motor torques to joint angles –Difficult to achieve good results modeling robotic arms or human arm Many factors & measurements

37 Nearest Neighbor Example Solution: –Move robot arm around –Record parameters and trajectory segment Table: torques, positions,velocities, squared velocities, velocity products, accelerations –To follow a new path: Break into segments Find closest segments in table Get those torques (interpolate as necessary)

38 Nearest Neighbor Example Issue: Big table –First time with new trajectory “Closest” isn’t close Table is sparse - few entries Solution: Practice –As attempt trajectory, fill in more of table After few attempts, very close

39 Nearest Neighbor Example Credit Rating: –Classifier: Good / Poor –Features: L = # late payments/yr; R = Income/Expenses
Name  L   R     G/P
A     0   1.2   G
B     25  0.4   P
C     5   0.7   G
D     20  0.8   P
E     30  0.85  P
F     11  1.2   G
G     7   1.15  G
H     15  0.8   P

40 Nearest Neighbor Example [Figure: scatter plot of the credit-rating instances A-H, with L and R as axes]

41 Nearest Neighbor Example [Figure: scatter plot of instances A-H with new instances I, J, K, with L and R as axes]
Name  L   R     G/P
I     6   1.15  G
J     22  0.45  P
K     15  1.2   ??
Distance Measure: Sqrt((L1-L2)^2 + [sqrt(10)*(R1-R2)]^2) - Scaled distance
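The scaled distance can be applied directly to the credit-rating data. The sketch below copies the labeled instances A-H and the distance formula from the slides; the dictionary layout and function names are illustrative assumptions.

```python
import math

# name: (L = late payments/yr, R = income/expenses, label), G = good, P = poor
TRAINING = {
    "A": (0, 1.2, "G"), "B": (25, 0.4, "P"), "C": (5, 0.7, "G"),
    "D": (20, 0.8, "P"), "E": (30, 0.85, "P"), "F": (11, 1.2, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.8, "P"),
}

def scaled_distance(l1, r1, l2, r2):
    """Euclidean distance with R differences scaled by sqrt(10)."""
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def nearest_neighbor(l, r):
    """Return (name, label) of the closest labeled instance."""
    name = min(TRAINING, key=lambda n: scaled_distance(l, r, *TRAINING[n][:2]))
    return name, TRAINING[name][2]

for query in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(query[0], nearest_neighbor(query[1], query[2]))
```

Under this particular scaling the closest instance to K is H (Poor), even though F (Good) is also nearby, which is one way to read the "??" on the slide: the answer depends on how the features are weighted.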

42 Nearest Neighbor Analysis Problem: –Ambiguous labeling, Training Noise Solution: –K-nearest neighbors Not just single nearest instance Compare to K nearest neighbors –Label according to majority of K What should K be? –Often 3, can train as well
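A minimal k-nearest-neighbor sketch using majority vote, assuming the credit-rating instances and the scaled distance from the earlier example. The function names and the default K = 3 are illustrative choices.

```python
import math
from collections import Counter

# (L, R, label) rows from the credit-rating example: G = good, P = poor.
TRAINING = [
    (0, 1.2, "G"), (25, 0.4, "P"), (5, 0.7, "G"), (20, 0.8, "P"),
    (30, 0.85, "P"), (11, 1.2, "G"), (7, 1.15, "G"), (15, 0.8, "P"),
]

def distance(a, b):
    """Scaled distance: squared R differences are weighted by 10."""
    return math.sqrt((a[0] - b[0]) ** 2 + 10 * (a[1] - b[1]) ** 2)

def knn_classify(query, k=3):
    """Label a (L, R) query by majority vote among its k nearest instances."""
    nearest = sorted(TRAINING, key=lambda row: distance(query, row))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((15, 1.2)))
```

With two classes, an odd K avoids tied votes, which is one practical reason small odd values such as 3 are common.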

43 Text Classification

44 Matching Topics and Documents Two main perspectives: –Pre-defined, fixed, finite topics: “Text Classification” –Arbitrary topics, typically defined by statement of information need (aka query) “Information Retrieval”

45 Vector Space Information Retrieval Task: –Document collection –Query specifies information need: free text –Relevance judgments: 0/1 for all docs Word evidence: Bag of words –No ordering information

46 Vector Space Model [Figure: term axes Computer, Tv, Program] Two documents: “computer program”, “tv program” Query “computer program”: matches 1st doc exactly: distance = 0, vs 2 for the 2nd doc Query “educational program”: matches both equally: distance = 1

47 Vector Space Model Represent documents and queries as –Vectors of term-based features Features: tied to occurrence of terms in collection –E.g. Solution 1: Binary features: t=1 if present, 0 otherwise –Similarity: number of terms in common Dot product
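With binary features, the dot product counts the terms a query and a document share. The sketch below uses the two documents from the earlier vector space slide; the vocabulary list and function names are illustrative assumptions, and terms outside the vocabulary (such as "educational") are simply ignored.

```python
def binary_vector(text, vocabulary):
    """1 if the term occurs in the text, 0 otherwise (bag of words, no order)."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocabulary = ["computer", "tv", "program"]
docs = ["computer program", "tv program"]
doc_vectors = [binary_vector(d, vocabulary) for d in docs]

for query in ["computer program", "educational program"]:
    q = binary_vector(query, vocabulary)
    print(query, [dot(q, d) for d in doc_vectors])
```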

48 Vector Space Model II Problem: Not all terms equally interesting –E.g. the vs dog vs Levow Solution: Replace binary term features with weights –Document collection: term-by-document matrix –View as vector in multidimensional space Nearby vectors are related –Normalize for vector length

49 Vector Similarity Computation Similarity = Dot product Normalization: –Normalize weights in advance –Normalize post-hoc
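The two normalization options give the same similarity value. The sketch below assumes length normalization means dividing by the Euclidean length, which yields the usual cosine similarity; the function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Normalize in advance: scale v to unit length (zero vectors unchanged)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v] if length else list(v)

def cosine(u, v):
    """Normalize post-hoc: divide the raw dot product by both lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u, v = [1, 0, 1], [0, 1, 1]
print(dot(normalize(u), normalize(v)), cosine(u, v))
```

Normalizing in advance costs one pass over the collection and makes each later comparison a plain dot product; post-hoc normalization avoids storing a second copy of the weights.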

50 Term Weighting “Aboutness” –To what degree is this term what document is about? –Within document measure –Term frequency (tf): # occurrences of t in doc j “Specificity” –How surprised are you to see this term? –Collection frequency –Inverse document frequency (idf):
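The idf formula did not survive on this slide. The sketch below therefore assumes one common definition, idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing t, and weights each term by tf * idf. Other variants exist, and the function name and input format are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = [["the", "dog", "barks"], ["the", "cat"], ["the", "dog", "dog"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document, such as "the" here, gets weight 0 under this definition, which matches the intuition that it carries no specificity.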

51 Term Selection & Formation Selection: –Some terms are truly useless Too frequent, no content –E.g. the, a, and,… –Stop words: ignore such terms altogether Creation: –Too many surface forms for same concepts E.g. inflections of words: verb conjugations, plural –Stem terms: treat all forms as same underlying

52 Efficient Implementations Classification cost: –Find nearest neighbor: O(n) Compute distance between unknown and all instances Compare distances –Problematic for large data sets Alternative: –Use a binary, tree-structured search to reduce to O(log n)

53 Efficient Implementation: K-D Trees Divide instances into sets based on features –Binary branching: E.g. feature > value –A tree whose paths have d splits has 2^d leaves; with 2^d = n, d = O(log n) –To split cases into sets: If there is one element in the set, stop Otherwise pick a feature to split on –Find average position of two middle objects on that dimension »Split remaining objects based on average position »Recursively split subsets

54 K-D Trees: Classification
R > 0.825?
  No: L > 17.5?
    No: R > 0.75?
      No: Good
      Yes: Poor
    Yes: R > 0.6?
      No: Poor
      Yes: Poor
  Yes: L > 9?
    No: R > 1.175?
      No: Good
      Yes: Good
    Yes: R > 1.025?
      No: Poor
      Yes: Good
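The splitting procedure can be sketched in a few lines and run on the credit-rating instances, where it yields the same thresholds as this slide. The sketch assumes the split feature alternates R, L, R, ... by depth (the order this tree uses) and, for odd-sized sets, takes the lower of the two possible middle pairs. The data layout and function names are illustrative.

```python
def build_kd_tree(instances, features=("R", "L"), depth=0):
    """Recursively split instances; a leaf is just the instance's label."""
    if len(instances) == 1:
        return instances[0]["label"]
    feature = features[depth % len(features)]
    ordered = sorted(instances, key=lambda x: x[feature])
    mid = len(ordered) // 2
    # Threshold = average position of the two middle objects on this feature.
    threshold = (ordered[mid - 1][feature] + ordered[mid][feature]) / 2
    return {
        "feature": feature,
        "threshold": threshold,
        "no": build_kd_tree(ordered[:mid], features, depth + 1),
        "yes": build_kd_tree(ordered[mid:], features, depth + 1),
    }

def classify(tree, query):
    """Follow 'feature > threshold?' tests down to a leaf label."""
    while isinstance(tree, dict):
        branch = "yes" if query[tree["feature"]] > tree["threshold"] else "no"
        tree = tree[branch]
    return tree

rows = [(0, 1.2, "G"), (25, 0.4, "P"), (5, 0.7, "G"), (20, 0.8, "P"),
        (30, 0.85, "P"), (11, 1.2, "G"), (7, 1.15, "G"), (15, 0.8, "P")]
data = [{"L": l, "R": r, "label": label} for l, r, label in rows]
tree = build_kd_tree(data)
print(classify(tree, {"L": 6, "R": 1.15}))
```

Descending the tree returns the label of the region the query falls into, which is not always the label of the true nearest neighbor under a given distance measure; that is the price of the O(log n) lookup in this simple form.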

55 Efficient Implementation: Parallel Hardware Classification cost: –# distance computations Const time if O(n) processors –Cost of finding closest Compute pairwise minimum, successively O(log n) time

56 Nearest Neighbor: Issues Prediction can be expensive if many features Affected by classification, feature noise –One entry can change prediction Definition of distance metric –How to combine different features Different types, ranges of values Sensitive to feature selection

57 Nearest Neighbor: Analysis Issue: –What is a good distance metric? –How should features be combined? Strategy: –(Typically weighted) Euclidean distance –Feature scaling: Normalization Good starting point: –(Feature - Feature_mean)/Feature_standard_deviation –Rescales all values - Centered on 0 with std_dev 1
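A small sketch of the suggested starting point, (feature - mean) / standard deviation, applied to the late-payment values from the credit-rating example. It assumes the population standard deviation; using the sample standard deviation instead is an equally reasonable choice. The function name is illustrative.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]

late_payments = [0, 25, 5, 20, 30, 11, 7, 15]
print([round(z, 2) for z in standardize(late_payments)])
```

Applying the same rescaling to every feature puts L (range 0-30) and R (range roughly 0.4-1.2) on comparable scales before distances are computed.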

58 Nearest Neighbor: Analysis Issue: –What features should we use? E.g. Credit rating: Many possible features –Tax bracket, debt burden, retirement savings, etc.. –Nearest neighbor uses ALL –Irrelevant feature(s) could mislead Fundamental problem with nearest neighbor

59 Nearest Neighbor: Advantages Fast training: –Just record feature vector - output value set Can model wide variety of functions –Complex decision boundaries –Weak inductive bias Very generally applicable

60 Summary Machine learning: –Acquire function from input features to value Based on prior training instances –Supervised vs Unsupervised learning Classification and Regression –Inductive bias: Representation of function to learn Complexity, Generalization, & Validation

