Follow-the-moon optimization with Condor-enabled genetic algorithms
Condor Week Presentation, April 2006
Brooklin J. Gore
Senior Fellow, Advanced Computing
©2004 Micron Technology, Inc. All rights reserved. Information is subject to change without notice.

Condor Week 2006
12/3/2018

Agenda
- Introductions
- A GA Optimization Problem
- A ‘perfect’ Grid app
- Follow-the-moon computing
- Q/A

Overview of Micron’s Grid
- 11k+ processors in 11 “pools”
- Linux, Solaris, Windows
- ~65th Top500 Rank
- 5.7 TeraFLOPS
- Built in-house
- Open Source Grid
- Centralized governance
- Distributed management
- 20+ applications
- Self developed
[Figure: Micron’s Global Grid]

Example Grid Application at Micron
Probe Card Optimization
Given:
- A wafer map
- A probe card
- Max DUTs¹ on card
Find:
- A probe card configuration
- Probe vector
That minimizes:
- DUTs on card
- Touchdowns
- Reprobed die
- Maximum die contacts
¹ DUT = Device Under Test

Example Grid Application at Micron
Probe Card Optimization is a HARD Problem
Given:
- A wafer W of size Wx, Wy
- A card C of size Cx, Cy and V probe vectors
$$\text{Configurations} = \sum_{k=0}^{C} \frac{C!}{k!\,(C-k)!} = 2^{C}$$
$$\text{Vectors} = \frac{W!}{V!\,(W-V)!} \sim W^{V}$$
$$\text{Total Complexity} = 2^{C}\,W^{V}$$
$$\text{Search Space}(100, 500, 5) = 2^{100} \cdot 500^{5} \approx 4 \times 10^{43}$$
So, maybe a Genetic Algorithm would be good.
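
The search-space arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the second function uses the exact binomial coefficient rather than the slide's W^V approximation (W^V is an upper bound on the binomial, so the exact count is smaller).

```python
import math

def search_space_estimate(c, w, v):
    """Slide's estimate: 2^C card configurations times roughly W^V probe vectors."""
    return 2**c * w**v

def search_space_exact(c, w, v):
    """Same product, but with the exact binomial W! / (V! (W-V)!) for the vectors."""
    return 2**c * math.comb(w, v)

if __name__ == "__main__":
    estimate = search_space_estimate(100, 500, 5)
    exact = search_space_exact(100, 500, 5)
    print(f"estimate: {estimate:.2e}")
    print(f"exact:    {exact:.2e}")
```

Either way the space is far too large to enumerate, which is the point the slide makes.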

Follow-the-Moon Computing
For a single execution of the GA:
- 1-5 ‘tries’ to find optimal solution (minimize fitness function)
- Shoot for 20-40 minute run time (thorough yet responsive)
- Only output ‘good’ solutions (to avoid clutter) with ‘fitness’ filename
Since GAs are probabilistic with variable run-times:
- The more ‘copies’ we run, the better and faster we explore the solution space
- So, let’s run a bunch (600-1200) overnight
- 3 tries/job * 600 jobs * 2/hour * 12 hours ≈ 40,000 tries
So, submit 600 jobs:
- Leave in queue (run over and over), remove after 12 hours
- Good solutions just ‘show up’ in submit directory -- view ‘Top 10’
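
One way to picture the ‘Top 10’ view and the tries estimate is the sketch below. The filename pattern `fitness_<value>.sol` is purely hypothetical (the slides only say the fitness is in the filename), and lower fitness is treated as better because the GA minimizes its fitness function.

```python
import os
import re

# Hypothetical naming scheme: the slides do not give the real pattern.
FITNESS_NAME = re.compile(r"^fitness_(\d+(?:\.\d+)?)\.sol$")

def top_solutions(directory, n=10):
    """Return up to n (fitness, filename) pairs, lowest fitness first."""
    scored = []
    for name in os.listdir(directory):
        match = FITNESS_NAME.match(name)
        if match:  # ignore anything that is not a solution file
            scored.append((float(match.group(1)), name))
    scored.sort()
    return scored[:n]

def expected_tries(tries_per_job, jobs, runs_per_hour, hours):
    """Total GA tries for an overnight run, as estimated on the slide."""
    return tries_per_job * jobs * runs_per_hour * hours

if __name__ == "__main__":
    print(expected_tries(3, 600, 2, 12))  # the slide rounds this to ~40,000
    for fitness, name in top_solutions("."):
        print(fitness, name)
```

Because the ranking lives in the filenames, no job has to coordinate with any other: each one simply drops a file when it finds something good.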

Follow-the-Moon Computing
Let’s take a look at an actual run…

Example Grid Application at Micron
Probe Card Optimization with CGA Genetic Algorithm
m49a:
- 29x14 wafer map
- 23x14 probe card
- 126 DUTs on card
- 2 Re-probes
- 3 Touchdowns

Follow-the-Moon Computing
‘Perfect’ Grid app:
- Low data in/out
- Compute bound
So:
- Direct jobs to sites where the workers aren’t
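
The idea of sending jobs "where the workers aren't" can be sketched as a night-time filter over sites. Everything named here is an assumption for illustration: the site names and UTC offsets are hypothetical, and the 6pm to 6am window is borrowed from the flocking schedule comments later in the deck.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sites with fixed UTC offsets in hours (not from the slides).
SITE_UTC_OFFSETS = {"site-west": -7, "site-east": -5, "site-europe": 1, "site-asia": 9}

def is_night(local_hour, start=18, end=6):
    """True when the local hour falls in the overnight window [start, end)."""
    return local_hour >= start or local_hour < end

def night_sites(now_utc, offsets=SITE_UTC_OFFSETS):
    """Return, sorted by name, the sites where it is currently night."""
    chosen = []
    for site, offset in offsets.items():
        local = now_utc.astimezone(timezone(timedelta(hours=offset)))
        if is_night(local.hour):
            chosen.append(site)
    return sorted(chosen)

if __name__ == "__main__":
    print(night_sites(datetime.now(timezone.utc)))
```

A real deployment would also have to account for daylight saving time and site holidays; fixed offsets keep the sketch short.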

Follow-the-Moon Computing

    STARTD_CRON_JOBS = $(STARTD_CRON_JOBS) flockmgr:MU_:$(MODULES)/MU_FlockMgr:30m

The STARTD_CRON_JOBS above run on ALL systems, but the existence of certain files triggers specific behavior, like the dynamic flocking below:

    -rw-r--r-- 1 condor condor  201 Jul 13  2004 bgore2-lnx.ClassAds
    -rw-r--r-- 1 condor condor 1810 Dec  9 14:59 bgore2-lnx.Flocking
    -rw-r--r-- 1 condor condor 2601 Feb 21 07:42 bgore2-lnx.local

Follow-the-Moon Computing

    # Flocking schedule for bgore2-lnx
    # 12/09/2005/BJGore
    #
    # First column is seconds after midnight. Seconds after midnight key:
    # Midnight:  0   1am:  3600   2am:  7200   3am: 10800   4am: 14400   5am: 18000
    # 6am:   21600   7am: 25200   8am: 28800   9am: 32400  10am: 36000  11am: 39600
    # Noon:  43200   1pm: 46800   2pm: 50400   3pm: 54000   4pm: 57600   5pm: 61200
    # 6pm:   64800   7pm: 68400   8pm: 72000   9pm: 75600  10pm: 79200  11pm: 82800
    # Subsequent columns are comma-separated list of pools to flock to
    # This is a good flocking schedule for Boise-based submitters.
    # We flock to pools where it's between 6pm and 6am.
    # Midnight
    0 condor-mava.mava, condor-mndc, condor-backend, condor-is, condor-rnd, condor-lehi.lehi
    condor-mava.mava, condor-mndc, condor-backend, condor-is, condor-rnd, condor-lehi.lehi, condor-nijp.nijp
    : (etc)
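
A schedule file in this shape could be read roughly as follows. This is a sketch, not the MU_FlockMgr module itself: it assumes each data line starts with a seconds-after-midnight key, that an entry stays in effect until the next key, and that lines without a numeric key are skipped. The function names are hypothetical.

```python
def parse_schedule(text):
    """Parse 'seconds pool, pool, ...' lines into a list sorted by start time."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if not parts[0].isdigit():
            continue  # no seconds-after-midnight key on this line
        rest = parts[1] if len(parts) > 1 else ""
        pools = [p.strip() for p in rest.split(",") if p.strip()]
        entries.append((int(parts[0]), pools))
    entries.sort(key=lambda entry: entry[0])
    return entries

def pools_for(entries, seconds_after_midnight):
    """Return the pool list of the latest entry at or before the given time."""
    current = []
    for start, pools in entries:
        if start <= seconds_after_midnight:
            current = pools
    return current

if __name__ == "__main__":
    sample = "# demo\n0 pool-a, pool-b\n21600 pool-c\n"
    # Join with ", " to get a value suitable for FLOCK_TO.
    print(", ".join(pools_for(parse_schedule(sample), 3600)))
```

With a midnight entry present, every time of day maps to some pool list; without one, times before the first key yield an empty list.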

Follow-the-Moon Computing

    if ( $current_flist ne $new_flist ) {
        # Need to update FLOCK_TO
        $cmd_output = `$ccv_cmd -rset 'FLOCK_TO=$new_flist'`;
        if ( $cmd_output =~ m/Successfully/ ) {
            $cmd_output = `$cr_cmd`;    # and reconfig the schedd to take the change
            print "$ca_prefix = \"FLOCK_TO updated\"\n";
        } else {
            print "$ca_prefix = \"FLOCK_TO update failed\"\n";
        }
    } else {
        print "$ca_prefix = \"FLOCK_TO is current\"\n";
    }
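
The same update-only-when-changed logic can be sketched in Python with the command runner injected, so it can be exercised without a Condor installation. Several things here are assumptions: the slide never defines `$ccv_cmd`, `$cr_cmd`, or `$ca_prefix`, so `condor_config_val`, `condor_reconfig`, and the attribute name `MU_FlockStatus` are guesses used for illustration only.

```python
import subprocess

def run_command(argv):
    """Run a command and return its standard output as text."""
    return subprocess.run(argv, capture_output=True, text=True).stdout

def update_flock_to(current_flist, new_flist, run=run_command, attr="MU_FlockStatus"):
    """Update FLOCK_TO only if it changed; return a ClassAd-style status line."""
    if current_flist == new_flist:
        return f'{attr} = "FLOCK_TO is current"'
    # Assumed command: set the value at runtime on the local daemons.
    output = run(["condor_config_val", "-rset", f"FLOCK_TO={new_flist}"])
    if "Successfully" in output:
        run(["condor_reconfig"])  # assumed command: make the change take effect
        return f'{attr} = "FLOCK_TO updated"'
    return f'{attr} = "FLOCK_TO update failed"'
```

Passing a fake `run` function makes it easy to check all three outcomes (current, updated, failed) before pointing the code at a live pool.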

Example Grid Application at Micron
Solving Probe Card Optimization
What about the ‘GA Knobs’?

    Parameter                      Reasonable Values
    Population, P                  200, 400, 800
    Tournament Size, Ts            2, 4, 8
    Probability of Crossover, Pc   0.00, 0.33, 0.66, 1.00
    Probability of Mutation, Pm

Parameter Sweep:
- 133 unique combinations (Pc=Pm=0 degenerate case)
- Did 40 six-hour runs for each combination
- 31,920 hours – 3.6 years on one CPU
- Ran in 7 days on Grid!
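
The sweep cost follows directly from the figures on the slide. The sketch below only redoes that arithmetic; the average-concurrency figure is derived here and does not appear in the slides, and the function name is hypothetical.

```python
def sweep_cost(combinations, runs_per_combination, hours_per_run, wall_clock_days):
    """Return (cpu_hours, cpu_years, average CPUs busy over the wall-clock period)."""
    cpu_hours = combinations * runs_per_combination * hours_per_run
    cpu_years = cpu_hours / (24 * 365)
    avg_cpus = cpu_hours / (wall_clock_days * 24)
    return cpu_hours, cpu_years, avg_cpus

if __name__ == "__main__":
    hours, years, cpus = sweep_cost(133, 40, 6, 7)
    print(f"{hours} CPU-hours, {years:.1f} CPU-years, ~{cpus:.0f} CPUs on average")
```

Finishing in 7 days therefore implies an average of about 190 CPUs kept busy for the whole week, a small fraction of the 11k+ processors in the Grid.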

Example Grid Application at Micron
Solving Probe Card Optimization
Derived Defaults:
- P = 400
- Ts = 4
- Pc = 0.66
- Pm = 1.00 *
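
To show where these four knobs enter a genetic algorithm, here is a minimal generational GA on bit strings that takes them as defaults. It is a generic textbook sketch, not the CGA used for the probe card problem (whose implementation is not shown in the slides). It assumes Pm is the chance that a child gets one random bit flipped, one-point crossover, and a fitness function that is minimized.

```python
import random

def tournament(population, fitness, ts, rng):
    """Pick ts random individuals and return the fittest (lowest fitness)."""
    return min(rng.sample(population, ts), key=fitness)

def run_ga(fitness, genome_len, p=400, ts=4, pc=0.66, pm=1.00, generations=50, seed=0):
    """Minimize fitness over bit lists of length genome_len (needs genome_len >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(p)]
    best = min(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < p:
            a = tournament(population, fitness, ts, rng)
            b = tournament(population, fitness, ts, rng)
            if rng.random() < pc:
                cut = rng.randrange(1, genome_len)  # one-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]  # copy the first parent unchanged
            if rng.random() < pm:
                child[rng.randrange(genome_len)] ^= 1  # flip one random bit
            children.append(child)
        population = children
        generation_best = min(population, key=fitness)
        if fitness(generation_best) < fitness(best):
            best = generation_best  # remember the best individual seen so far
    return best, fitness(best)

if __name__ == "__main__":
    # Toy problem: minimize the number of 1 bits.
    solution, score = run_ga(sum, genome_len=32)
    print(score, solution)
```

Each parameter combination in a sweep would simply call `run_ga` with different `p`, `ts`, `pc`, and `pm` values and different seeds.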

Thank you!
Questions?
Micron and the Micron logo are trademarks and/or service marks of Micron Technology, Inc. All other trademarks are the property of their respective owners.