Partial Digest Problem
Lecture 12
© Jeff Parker, 2015
"There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact." - Mark Twain
Outline
Backtracking provides a systematic way to perform an efficient exhaustive search for some problems.
The material on the Partial Digest Problem is from the presentations by Pevzner and Jones. The backtracking slides and the Turnpike example are mine.
Outline
Restriction Enzymes
The Partial Digest Problem
Approaches: Brute Force
8 Queens
Backtracking Skeleton and finding all Solutions
Data Structures to Improve our Algorithm
Return to Partial Digest Problem
Some thoughts on the 10 minute presentation
Stockbridge to Boston
Exit      Description                      From NYS   From Exit25
1         West Stockbridge, Route 41            2.9          47.4
2         Lee, US 20/Route 102                 10.6          55.1
3         Westfield, Route 10/US 202           40.4          84.9
4         West Springfield, I-91/US 5          45.7          90.2
5         Chicopee, Route 33                   49.0          93.5
6         Springfield, I-291                   51.3          95.8
7         Ludlow, Route 21                     54.9          99.4
8         Palmer, Route 32                     62.8         107.3
9         Sturbridge, I-84                     78.5         123.0
10        Auburn, I-290/I-395                  90.2         134.7
10A       Millbury, Route 146/US 20            94.1         138.6
11        Millbury, Route 122                  96.5         141.0
11A       Hopkinton, I-495                    106.2         150.7
12        Framingham, Route 9                 111.4         155.9
13        Framingham/Natick                   116.8         161.3
14/15     Weston, I-95/Route 128              123.3         167.8
16        West Newton, Route 16               125.2         169.7
17        Newton, Washington/Galen            127.7         172.2
18/20     Allston/Brighton                    130.9         175.4
21        Back Bay, Mass Ave                  132.9         177.4
22        Copley Square, MA 9                 133.4         177.9
23        Theater District                    133.9         178.4
24A-B-C   South Station                       134.6         179.1
25        South Boston, Local streets         135.3         179.8
26        Airport, Logan Airport              137.3         181.8
So was the turnpike from Stockbridge to Boston
The Berkshires seemed dream-like on account of that frosting
With ten miles behind me and ten thousand more to go.
Discovering Restriction Enzymes
HindII, the first restriction enzyme, was discovered accidentally in 1970 while studying the bacterium Haemophilus influenzae.
It recognizes and cuts DNA at sequences: GTCGAC, GTTAAC
What are the a priori odds of hitting one of these sequences?
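One way to answer the question on this slide is sketched below. It assumes that the four bases are independent and equally likely at every position, which is a simplification (real genomes are not uniform), and it only counts the two sequences listed on the slide. The names `SITES` and `site_probability` are made up for this illustration.

```python
from fractions import Fraction

# Assumption: bases A, C, G, T are independent and equally likely.
SITES = ["GTCGAC", "GTTAAC"]  # the two sequences listed on the slide

def site_probability(site):
    """Probability that a fixed position starts an exact copy of `site`."""
    return Fraction(1, 4) ** len(site)

# The sites are different strings of the same length, so at a single
# position the two events cannot both happen and the probabilities add.
p_either = sum(site_probability(s) for s in SITES)

print(p_either)              # 1/2048
print(float(1 / p_either))   # 2048.0
```

Under these assumptions a cut site starts at roughly one position in every 2048.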
Molecular Scissors: EcoRI Molecular Cell Biology, 4th edition
Discovering Restriction Enzymes
"My father has discovered a servant who serves as a pair of scissors. If a foreign king invades a bacterium, this servant can cut him in small fragments, but he does not do any harm to his own king. Clever people use the servant with the scissors to find out the secrets of the kings. For this reason my father received the Nobel Prize for the discovery of the servant with the scissors."
Daniel Nathans' daughter (from Nobel lecture)
Werner Arber: discovered restriction enzymes
Daniel Nathans: pioneered the application of restriction enzymes for the construction of genetic maps
Hamilton Smith: showed that a restriction enzyme cuts DNA in the middle of a specific sequence
Intro: http://www.dnatube.com/video/955/Restriction-Enzymes
In action: http://www.youtube.com/watch?v=aA5fyWJh5S0
Many Restriction Enzymes http://en.wikipedia.org/wiki/Nucleic_acid_notation
Uses of Restriction Enzymes
Recombinant DNA technology
Cloning
cDNA/genomic library construction
cDNA = DNA copy of mRNA, made by reverse transcription: introns gone (see next slide)
DNA mapping: our application tonight
Adenovirus amazes at Cold Spring Harbor exon = protein-coding intron = non-coding The genes themselves are structured in coding bits, that is the stuff that becomes amino acids, called exons, and non-coding stretches of sequence in between, called introns. When the gene is transcribed the whole thing becomes an RNA molecule, including the garbage in between the exons, and then these introns are cut out in a process called splicing. The resulting bits are glued together and translated into a protein. The audience at the symposium was amazed, fascinated, and not a little bewildered to learn that late adenovirus mRNAs are mosaic molecules consisting of a sequence of sequences complementary to several non-contiguous segments of the viral genome. June 1977
Adenovirus amazes at Cold Spring Harbor
exon = protein-coding, intron = non-coding
Genes are structured in coding bits, the stuff that becomes amino acids, called exons, and non-coding stretches called introns. When the gene is transcribed the whole thing becomes an RNA molecule, including the garbage in between the exons. The introns are cut out in a process called splicing. The resulting bits are glued together and translated into a protein.
Gene Structure
http://www.dnalc.org/resources/3d/rna-splicing.html
Diagram labels: exon1, intron1, exon2, intron2, exon3; transcription, splicing, translation
Codon: a triplet of nucleotides that is converted to one amino acid
exon = protein-coding, intron = non-coding
Batzoglou
Alternative Splicing
Full Restriction Digest Cutting DNA at each restriction site creates multiple restriction fragments: Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3, 5, 5, 9} ?
Full Restriction Digest: Multiple Solutions Alternative ordering of restriction fragments: vs
Partial Restriction Digest
The sample of DNA is exposed to the restriction enzyme for only a limited amount of time, to prevent it from being cut at all restriction sites.
This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts.
This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence.
Partial Digest Example Partial Digest uses all restriction fragments:
Measuring Length of Restriction Fragments
Restriction enzymes break DNA into restriction fragments.
Gel electrophoresis is a process for separating DNA by size and measuring the sizes of restriction fragments.
It can separate DNA fragments that differ in length by only 1 nucleotide, for fragments up to 500 nucleotides long.
Gel Electrophoresis: Example Direction of DNA movement Smaller fragments travel farther Molecular Cell Biology, 4th edition
Gel Electrophoresis
DNA fragments are injected into a gel positioned in an electric field.
DNA is negatively charged near neutral pH: the phosphate groups in the sugar-phosphate backbone are acidic, so DNA has an overall negative charge.
DNA molecules move towards the positive electrode.
Gel Electrophoresis
DNA fragments of different lengths are separated according to size.
Small molecules move through the gel matrix more readily than large molecules.
The gel matrix restricts random diffusion, so molecules of different lengths separate into different bands.
Detecting DNA: Autoradiography
One way to visualize separated DNA bands on a gel is autoradiography:
The DNA is radioactively labeled.
The gel is laid against a sheet of photographic film in the dark, exposing the film at the positions where the DNA is present.
Detecting DNA: Fluorescence
Another way to visualize DNA bands in a gel is fluorescence:
The gel is incubated with a solution containing the fluorescent dye ethidium.
Ethidium binds to the DNA.
The DNA lights up when the gel is exposed to ultraviolet light.
Multiset of Restriction Fragments We assume that multiplicity of a fragment can be detected, i.e., the number of restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence intensity for a double fragment than for a single fragment) Multiset: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}
Partial Digest Fundamentals
X: the set of n integers representing the locations of all cuts in the restriction map, including the start and end
n: the total number of cuts
DX: the multiset of integers representing the lengths of each of the fragments produced from a partial digest
Partial Digest Problem: Formulation
Goal: Given all pairwise distances between points on a line, reconstruct the positions of those points
Input: The multiset of pairwise distances L, containing n(n-1)/2 integers
Output: A set X of n integers such that DX = L
Partial Digest Example Input is DX = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} Search for X = {0, ?, ?, ?, 10}
Partial Digest Example
Representation of DX = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} as a table, with X = {0, 2, 4, 7, 10} along the top and left side.
The element at (i, j) in the table is xj - xi for 1 ≤ i < j ≤ n.
        0    2    4    7   10
  0          2    4    7   10
  2               2    5    8
  4                    3    6
  7                         3
 10
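The table above can be reproduced with a few lines of code. This is a minimal Python 3 sketch (the slides themselves use Python 2); the function name `digest` is an assumption made for this example, and `collections.Counter` is used here as a stand-in for a multiset.

```python
from collections import Counter
from itertools import combinations

def digest(points):
    """Return DX, the multiset of pairwise distances of `points`, as a Counter."""
    return Counter(abs(b - a) for a, b in combinations(points, 2))

X = [0, 2, 4, 7, 10]
DX = digest(X)
print(sorted(DX.elements()))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```

With n = 5 points the digest has n(n-1)/2 = 10 entries, matching the formulation of the problem.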
Partial Digest: Multiple Solutions It is not always possible to uniquely reconstruct a set X based only on DX. For example, the set X = {0, 2, 5} and (X + 10) = {10, 12, 15} both produce DX={2, 3, 5} as their partial digest set. We can also reverse the set (drive from Boston to Stockbridge) X = {0, 3, 5}
Partial Digest: Multiple Solutions The sets {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12} present a less trivial example of non-uniqueness. They both digest into: {1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12} These are called Homometric Sets
Homometric Sets
(Figure: difference tables for X = {0, 1, 2, 5, 7, 9, 12} and X = {0, 1, 5, 7, 8, 10, 12})
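The claim that these two sets are homometric can be checked mechanically. The sketch below is in Python 3 and uses a hypothetical helper `digest`; it also confirms that the second set is not simply the first one reversed.

```python
from collections import Counter
from itertools import combinations

def digest(points):
    """Multiset of pairwise distances, as a Counter."""
    return Counter(abs(b - a) for a, b in combinations(points, 2))

A = [0, 1, 2, 5, 7, 9, 12]
B = [0, 1, 5, 7, 8, 10, 12]

print(digest(A) == digest(B))   # True: same partial digest

# Reversing A (driving the other way) gives a trivially equivalent map.
mirror = sorted(max(A) - x for x in A)
print(mirror)                   # [0, 3, 5, 7, 10, 11, 12]
print(B == A or B == mirror)    # False: B is a genuinely different map
```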
Brute Force Algorithms Also known as exhaustive search Examine every variant to find a solution Efficient in rare cases; usually impractical
Partial Digest: Brute Force 1) Find the restriction fragment of maximum length M. M is the length of the DNA sequence. 2) For every possible set X={0, x2, … ,xn-1, M} compute the corresponding DX 3) If DX is equal to the experimental partial digest L, then X is the correct restriction map
BruteForcePDP
BruteForcePDP(L, n):
    M <- maximum element in L
    for every set of n - 2 integers 0 < x2 < … < xn-1 < M
        X <- {0, x2, …, xn-1, M}
        Form DX from X
        if DX = L
            return X
    output "no solution"
Efficiency of BruteForcePDP
BruteForcePDP takes O(M^(n-2)) time since it must examine all possible sets of positions.
One way to improve the algorithm is to limit the values of xi to only those values which occur in L.
Before: BruteForcePDP
BruteForcePDP(L, n)
    M <- maximum element in L
    for every set of n - 2 integers 0 < x2 < … < xn-1 < M
        X <- {0, x2, …, xn-1, M}
        Form DX from X
        if DX = L
            return X
    output "no solution"
Another BruteForcePDP
AnotherBruteForcePDP(L, n)
    M <- maximum element in L
    for every set of n - 2 integers 0 < x2 < … < xn-1 < M from L
        X <- {0, x2, …, xn-1, M}
        Form DX from X
        if DX = L
            return X
    output "no solution"
Efficiency of AnotherBruteForcePDP
It's more efficient, but still slow.
If L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast.
Fewer sets are examined, but runtime is still exponential: O(n^(2n-4))
If k is in the set, we need to check k, M-k
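The pseudocode for AnotherBruteForcePDP can be turned into a short runnable sketch. The version below is Python 3 and makes a few choices that are not in the slides: it derives n from the size of L instead of taking it as a parameter, it uses `collections.Counter` as the multiset, and it returns `None` rather than printing "no solution".

```python
from collections import Counter
from itertools import combinations

def digest(points):
    """Multiset of pairwise distances of a sorted list of points."""
    return Counter(b - a for a, b in combinations(sorted(points), 2))

def another_brute_force_pdp(L):
    """Try every choice of n - 2 interior points drawn from the values in L."""
    target = Counter(L)
    M = max(L)
    # Recover n from len(L) = n(n-1)/2.
    n = 2
    while n * (n - 1) // 2 < len(L):
        n += 1
    if n * (n - 1) // 2 != len(L):
        return None                      # L cannot be a complete digest
    candidates = sorted(set(L) - {M})    # only values that occur in L
    for interior in combinations(candidates, n - 2):
        X = [0, *interior, M]
        if digest(X) == target:
            return X
    return None

print(another_brute_force_pdp([2, 998, 1000]))   # [0, 2, 1000]
```

For L = {2, 998, 1000} only two candidate positions are examined, instead of the 999 that the first brute force version would try.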
The 8 Queens problem
Problem: place 8 queens on a chess board so that none attack each other.
This belongs to a general class of problems that can be solved by backtracking.
Decreases size of search space
If we were to examine all placements of 8 queens, we would look at C(64,8) = 4,426,165,368 boards.
Restricting queen i to column i means we only look at 8^8 positions: 8 * 8 * 8 * 8 * 8 * 8 * 8 * 8 = (2^3)^8 = 2^24 ~ 16,000,000
Restricting queen i to column i and avoiding rows already in use gives fewer still: 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 40,320
Backtracking can reduce this far more. When we place the first queen, at least one of the 7 remaining rows for the second queen lies on a diagonal and is eliminated. We don't need to consider 2 * 6 * 5 * 4 * 3 * 2 * 1 = 1440 extensions.
In fact, for an 8x8 board we look at 113 partial boards.
The only complete board we look at is a solution.
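The three search space sizes quoted on this slide can be recomputed directly. This is a small Python 3 sketch (it needs Python 3.8 or later for `math.comb`); the variable names are chosen for this example only.

```python
import math

all_placements = math.comb(64, 8)    # any 8 of the 64 squares
one_per_column = 8 ** 8              # queen i restricted to column i
distinct_rows = math.factorial(8)    # ...and no two queens share a row

print(all_placements)   # 4426165368
print(one_per_column)   # 16777216
print(distinct_rows)    # 40320
```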
Backtracking Demo
for (r0 = 0; r0 < 4; r0++)
    if safe(r0, 0)
        place queen at (r0, 0)
        for (r1 = 0; r1 < 4; r1++)
            if safe(r1, 1)
                place queen at (r1, 1)
                for (r2 = 0; r2 < 4; r2++)
In the slides to follow, read from left to right, top to bottom.
A solution to the 4 queens problem
We skipped many dead ends.
Backtracking Evaluation
Backtracking finds a solution, if it exists.
It can be modified to find all solutions.
It works if we can tell that we have hit a roadblock: "No extension of this position is going to work…"
Stack Tracing
We place the partial solution on the system's procedure stack.
We extend in a new stack frame. If we fail, we remove the frame and return to our position.
If we find a safe row, we try to extend.
If no row works out, we pop the stack and retry in the previous column.
Recursion hides the stack: it is simply the procedure stack.
Why use Recursion?
Why not use nested for loops, as suggested?
for (r0 = 0; r0 < 4; r0++)
    if safe (r0, 0)
        place queen at (r0, 0)
        for (r1 = 0; r1 < 4; r1++)
            if safe (r1, 1)
                place queen at (r1, 1)
                for (r2 = 0; r2 < 4; r2++)
                    ...
1) Does not scale to different sized boards
2) You must duplicate identical code (place and remove). An error in one spot is hard to find.
Backtracking Skeleton
// Not every backtracking example has all points below: useful template
boolean backtracking(some parameters)
    if (we have a solution)
        report results
        return true;
    if (there is no hope)
        return false;
    for (first position to last position)
        if (this looks like a legal position)
            Record position
            // Can I build on this step? (recursive call)
            if (backtracking(parameters modified to reflect my new position))
                // Success! Record position and return
                remember my position
                return true;
            else
                remove any changes made in this call
        // try next iteration of loop
    return false;    // I was not successful: backtrack
Backtracking for 8 Queens
# Try to extend a solution into col.
def check(lst, col, rowfree, upfree, downfree):
    print col, lst                      # Debugging: 3 [0, 3, 1, -1]
    if (col == len(lst)):               # Every column holds a queen
        printBoard(lst)
        return True
    for row in xrange(len(lst)):
        if (safe(lst, row, col, rowfree, upfree, downfree)):
            place(lst, row, col, rowfree, upfree, downfree)
            if (check(lst, col+1, rowfree, upfree, downfree)):
                return True
            remove(lst, row, col, rowfree, upfree, downfree)
    return False                        # backtrack
Skeleton compared with check()
The skeleton has the step "if (there is no hope) return false;". check() has no separate test of this kind: the call to safe() inside the loop does that job.
def check(lst, col, rowfree, upfree, downfree):
    if (col == len(lst)):
        print "Success!"
        printBoard(lst)
        return True
    for row in xrange(len(lst)):
        if (safe(lst, row, col, rowfree, upfree, downfree)):
            place(lst, row, col, rowfree, upfree, downfree)
            if (check(lst, col+1, rowfree, upfree, downfree)):
                return True
            remove(lst, row, col, rowfree, upfree, downfree)
    return False                        # Backtrack
4 Queens Results
Check 0 [-1, -1, -1, -1]
Check 1 [0, -1, -1, -1]
Check 2 [0, 2, -1, -1]
Backtrack
Check 2 [0, 3, -1, -1]
Check 3 [0, 3, 1, -1]
Check 1 [1, -1, -1, -1]
Check 2 [1, 3, -1, -1]
Check 3 [1, 3, 0, -1]
Check 4 [1, 3, 0, 2]
Success!
(Figure: board for the solution [1, 3, 0, 2])
4 Queens Results, matched to the board diagrams
Check 0 [-1, -1, -1, -1] (not shown)
Check 1 [0, -1, -1, -1]
Backtrack
Check 2 [0, 3, -1, -1]
Check 3 [0, 3, 1, -1]
Check 1 [1, -1, -1, -1]
Check 2 [1, 3, -1, -1]
Check 3 [1, 3, 0, -1]
Check 4 [1, 3, 0, 2]
3 Queens Results
Check 0 [-1, -1, -1]
Check 1 [0, -1, -1]
Backtrack
Check 1 [1, -1, -1]
Check 1 [2, -1, -1]
Check 2 [2, 0, -1]
Could not solve for a board with side 3
8 Queens Results
(Figure: board showing one solution to the 8 queens problem)
Finding All Solutions
// No longer returns after success: keeps trying
void backtracking(some parameters) {
    if (we have a solution)
        report the position and return
    if (there is no hope)
        return
    for (first position to last position)
        if (this looks like a legal position)
            // if (backtracking(parameters modified))
            //     return true;
            backtracking(parameters modified)
            remove traces of this call
        // . . . and try the next iteration
}
Find all Solutions of 8 Queens
Not in handout, but in online solution
def check(lst, col, rowfree, upfree, downfree):
    if (col == len(lst)):
        print "Success!"
        printBoard(lst)
        return True
    for row in xrange(len(lst)):
        if (safe(lst, row, col, rowfree, upfree, downfree)):
            place(lst, row, col, rowfree, upfree, downfree)
            #if (check(lst, col+1, rowfree, upfree, downfree)):
            #    return True
            check(lst, col+1, rowfree, upfree, downfree)
            remove(lst, row, col, rowfree, upfree, downfree)
    return False    # Backtrack
Find all Solutions of 8 Queens
def check(lst, col, rowfree, upfree, downfree):
    if (col == len(lst)):
        print "Success!"
        printBoard(lst)
        return True
    for row in xrange(len(lst)):
        if (safe(lst, row, col, rowfree, upfree, downfree)):
            place(lst, row, col, rowfree, upfree, downfree)
            check(lst, col+1, rowfree, upfree, downfree)
            remove(lst, row, col, rowfree, upfree, downfree)
    return False    # Backtrack
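The "keep trying after success" idea can be tried out with a compact, self-contained variant. The sketch below is Python 3 rather than the Python 2 of the slides, and it departs from the lecture code in two labeled ways: it collects solutions in a list instead of printing boards, and it tracks occupied rows and diagonals in sets. The name `all_solutions` is invented for this example.

```python
def all_solutions(n):
    """Return every n-queens solution as a list of row indices, one per column."""
    solutions = []
    board = []                               # board[col] = row of that column's queen
    rows, up, down = set(), set(), set()     # occupied rows and diagonals

    def check(col):
        if col == n:
            solutions.append(board[:])       # record it, then keep searching
            return
        for row in range(n):
            if row in rows or row + col in up or row - col in down:
                continue                     # square is attacked
            board.append(row)
            rows.add(row)
            up.add(row + col)
            down.add(row - col)
            check(col + 1)                   # no early return on success
            board.pop()                      # remove traces of this call
            rows.remove(row)
            up.remove(row + col)
            down.remove(row - col)

    check(0)
    return solutions

print(all_solutions(4))        # [[1, 3, 0, 2], [2, 0, 3, 1]]
print(len(all_solutions(8)))   # 92
```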
Data Structures
Smart Data Structures make our programs faster.
How do we store the board? As a 2 dimensional array?
We could use a two dimensional array of booleans.
def place(row, column):
    board[row][column] = True
Operations needed for backtracking:
safe - Check to see if queen is safe
place - Place a queen
remove - Remove Queen
It takes a fixed amount of time to set or remove.
But which of the three operations above do we do most often?
if (safe(...)):
    place(...)
    if (check()):
        return True
    remove(...)
Python Profiler counts the calls
if (safe(...)):
    place(...)
    if (check(...)):
        return True
    remove(...)
2166 function calls (2053 primitive calls) in 0.045 seconds
Ordered by: standard name
ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
    38     0.000    0.000    0.000    0.000  :0(insert)
   696     0.006    0.000    0.006    0.000  :0(len)
     1     0.001    0.001    0.001    0.001  :0(setprofile)
     1     0.000    0.000    0.044    0.044  <string>:1(<module>)
   876     0.009    0.000    0.011    0.000  Queens.py:19(safe)
   113     0.002    0.000    0.003    0.000  Queens.py:29(place)
   105     0.002    0.000    0.003    0.000  Queens.py:36(remove)
   219     0.003    0.000    0.003    0.000  Queens.py:43(prettyPrint)
 114/1     0.021    0.000    0.043    0.043  Queens.py:49(check)
     1     0.000    0.000    0.000    0.000  Queens.py:7(printBoard)
     1     0.000    0.000    0.043    0.043  Queens.py:81(solve)
     0     0.000             0.000           profile:0(profiler)
     1     0.000    0.000    0.045    0.045  profile:0(solve([-1, -1, -1, -1, -1, -1, -1, -1]))
Testing a new position
When we wish to test position (i, j), the queen there attacks over 20 squares.
We don't need to test this column, nor the spots to our right.
The worst case is 14 squares: (x, j), (i, y), (i+x, j+x), (i+x, j-x)
Still, there is a good deal of work to be done here.
Alternative
Store the board as an array of row positions.
This board is [5, 3, 0, 4, 1,.....]
Nothing new yet... Ideas?
Smarter Storage
Store 3 arrays of booleans: rowfree, upfree, downfree
Is this row under attack? Is this diagonal attacked?
Placement and removal are now more expensive: we must touch 4 arrays.
However, when testing we examine at most 3 array locations (col is implicit).
Is this spot safe?
# When testing we only examine 3 array locations (col is implicit)
# Is it safe to place a queen at (row, col)?
def safe(lst, row, col, rowfree, upfree, downfree):
    if (not rowfree[row]):
        return False
    if (not upfree[row+col]):
        return False
    if (not downfree[row-col+len(lst)-1]):
        return False
    return True
Place and Remove
# Put a queen in pos (row, col).
def place(lst, row, col, rowfree, upfree, downfree):
    lst[col] = row
    rowfree[row] = False
    upfree[row+col] = False
    downfree[row-col+len(lst)-1] = False

# Remove a queen from pos (row, col).
def remove(lst, row, col, rowfree, upfree, downfree):
    lst[col] = -1
    rowfree[row] = True
    upfree[row+col] = True
    downfree[row-col+len(lst)-1] = True
Initialize the arrays
# Set things up for a run.
def solve(lst):
    rowfree = [ ]
    upfree = [ ]
    downfree = [ ]
    for x in xrange(len(lst)):
        rowfree.insert(x, True)
    for x in xrange(2*len(lst) - 1):
        upfree.insert(x, True)
        downfree.insert(x, True)
    if (not check(lst, 0, rowfree, upfree, downfree)):
        print "Can not solve for a board with side", len(lst)

solve([-1, -1, -1, -1])
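Putting the pieces from the last few slides together gives a small program that can be run as is. The sketch below is a Python 3 condensation, not the lecture's Queens.py: it takes the board size instead of a list, folds place and remove into one hypothetical helper `mark`, and returns the solution instead of printing it. The three boolean arrays and their index formulas follow the slides.

```python
def solve(n):
    """Return the first n-queens solution found (row index per column), or None."""
    lst = [-1] * n
    rowfree = [True] * n
    upfree = [True] * (2 * n - 1)       # indexed by row + col
    downfree = [True] * (2 * n - 1)     # indexed by row - col + n - 1

    def safe(row, col):
        # Three array lookups, however large the board is.
        return rowfree[row] and upfree[row + col] and downfree[row - col + n - 1]

    def mark(row, col, free):
        # free=False places a queen at (row, col); free=True removes it.
        lst[col] = -1 if free else row
        rowfree[row] = free
        upfree[row + col] = free
        downfree[row - col + n - 1] = free

    def check(col):
        if col == n:
            return True
        for row in range(n):
            if safe(row, col):
                mark(row, col, False)
                if check(col + 1):
                    return True
                mark(row, col, True)    # undo and try the next row
        return False

    return lst if check(0) else None

print(solve(4))   # [1, 3, 0, 2]
print(solve(3))   # None
print(solve(8))   # [0, 4, 7, 5, 2, 6, 1, 3]
```

The results for sides 4 and 3 agree with the traces shown earlier: [1, 3, 0, 2] for the 4x4 board and no solution for the 3x3 board.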
Backtracking vs…
What will work on the following?
Making Change
Sudoku
Maximization: find the best motif
Egyptian fractions
Express numbers as the sum of terms 1/n
Cannot reuse n: 1/3 + 1/3 is not legal
Limited sequences of choices
Some way to recognize dead ends
Return to Turnpike Problem
Partial Digest Problem L = {2, 8, 10}
(Always) Start with set X = {0}
Look at members of L
Try extending by adding 2 to X to get {0, 2}
Digest for X = {0, 2} is {2}
Since {2} is a subset of {2, 8, 10}, we continue
Try extending by adding 8 to get X = {0, 2, 8}
This gives us a Digest of {2, 6, 8}, which is not a subset of L (6 is not in L)
Try extending by adding 10 to get {0, 2, 10}
The Digest of {0, 2, 10} is {2, 8, 10}
Success!!
extend()
# List 'cand' generates a subset of list 'lst'. Try to extend
def extend(cand, lst, depth):
    # Are we done? We know diff(cand) is a subset of lst.
    if (subset(lst, differences(cand))):
        print "Success!! ", cand, lst
        return True
    for x in lst:
        if ((x > 0) and (not (x in cand))):
            nwLst = cand[:]
            insrt(nwLst, x)
            if (subset(differences(nwLst), lst)):
                if (extend(nwLst, lst, depth+1)):
                    return True
    return False
Skeleton compared with extend()
The skeleton has the step "if (there is no hope) return false;". extend() has no separate test of this kind: the subset() test inside the loop does that job.
boolean backtracking(some parameters)
    if (we have a solution)
        report results
        return true;
    if (there is no hope)
        return false;
    for (first position to last position)
        if (this looks like a legal position)
            Record position
            // Can I build on this step?
            if (backtracking(parameters modified to reflect my new position))
                // Success! Record position and return
                remember my position
                return true;
            else
                remove any changes made in this call
        // try next iteration of loop
    return false;    // I was not successful: Backtrack
Sample Run
What is actually printed by the fragment on the next slide:
Extend [0] [] vs [2, 8, 10]
Try extending by adding 2 to get [0, 2]
Extend [0, 2] [2] vs [2, 8, 10]
Try extending by adding 8 to get [0, 2, 8]
Try extending by adding 10 to get [0, 2, 10]
Extend [0, 2, 10] [2, 8, 10] vs [2, 8, 10]
Success!! [0, 2, 10] [2, 8, 10]
Turnpike
Still edited to fit on one slide
# List 'cand' generates a subset of list 'lst'. Try to extend
def extend(cand, lst, depth):
    # Debug printing
    prettyPrint(depth, "Extend")
    print cand, differences(cand), "vs", lst
    # Are we done? We know diff(cand) is a subset of lst.
    if (subset(lst, differences(cand))):
        print "Success!! ", cand, lst
        return True
    for x in lst:
        if ((x > 0) and (not (x in cand))):
            nwLst = cand[:]
            insrt(nwLst, x)
            prettyPrint(depth, "Try extending by adding")
            print x, "to get", nwLst
            if (subset(differences(nwLst), lst)):
                if (extend(nwLst, lst, depth+1)):
                    return True
    return False
Utility
# Insert item itm into sorted list lst
def insrt(lst, itm):
    spot = -1
    for x in xrange(len(lst)):
        if (lst[x] < itm):
            spot = x
        else:
            break
    lst.insert(spot+1, itm)
    return lst

# Take {0, 2, 4, 7, 10} generate {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}
def differences(a):
    lst = []
    for x in xrange(len(a)):
        for y in xrange(x+1, len(a)):
            insrt(lst, a[y] - a[x])
    return lst

        0    2    4    7   10
  0          2    4    7   10
  2               2    5    8
  4                    3    6
  7                         3
 10
subset()
# Assume a and b are sorted. Is a subset of b?
def subset(a, b):
    if (len(a) > len(b)):
        return False
    if (0 == len(a)):
        return True
    x = 0
    for y in xrange(len(b)):
        if (a[x] < b[y]):
            return False        # a[x] can no longer appear in b
        if (a[x] == b[y]):
            x = x + 1
            if (x == len(a)):
                return True     # every element of a was matched
    return False
I don't have to bump y: it is the loop variable.
Test cases: {2, 5, 7}, {2, 5} is False (a is longer than b). {}, {2, 5} is True (a is empty). {2, 7}, {2, 5, 7, 9} is traced below.
Iterate over xrange(4); x is the index into set a, y is the index into set b.
x  y
0  0  a[0] == b[0]
1  1  a[1] > b[1]
1  2  a[1] == b[2]
2 == len(a): return True
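For readers who want to run the Turnpike search end to end, here is a compact Python 3 rendering of the same idea. It is a sketch rather than the lecture's Turnpike.py: it replaces the sorted-list helpers with `collections.Counter`, drops the depth parameter and debug printing, tries each distinct value of the list only once per level, and returns the solution instead of printing it. The name `is_submultiset` is invented here.

```python
from collections import Counter
from itertools import combinations

def differences(points):
    """Multiset of pairwise differences of the (sorted) points."""
    return Counter(b - a for a, b in combinations(sorted(points), 2))

def is_submultiset(small, big):
    """True if every value occurs in `big` at least as often as in `small`."""
    return all(big[value] >= count for value, count in small.items())

def extend(cand, lst):
    """cand's differences fit inside lst; try to grow cand into a full solution."""
    target = Counter(lst)
    if differences(cand) == target:
        return cand                           # every distance is accounted for
    for x in sorted(set(lst)):
        if x in cand:
            continue
        new_cand = sorted(cand + [x])
        if is_submultiset(differences(new_cand), target):
            result = extend(new_cand, lst)
            if result is not None:
                return result
    return None                               # dead end: backtrack

print(extend([0], [2, 8, 10]))   # [0, 2, 10]
```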
Pretty Print
# Cand generates a subset of list. Try to add elt
def extend(cand, lst, depth):
    prettyPrint(depth, "Extend")
    print cand, differences(cand), "vs", lst
    ...
    if (extend(nwLst, lst, depth+1)):
Extend [0] [] vs [2, 8, 10]
Try extending by adding 2 to get [0, 2]
Extend [0, 2] [2] vs [2, 8, 10]
Try extending by adding 8 to get [0, 2, 8]
Try extending by adding 10 to get [0, 2, 10]
Extend [0, 2, 10] [2, 8, 10] vs [2, 8, 10]
Success!! [0, 2, 10] [2, 8, 10]
Current Version
Extend [0] [] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 2 to get [0, 2]
Extend [0, 2] [2] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 3 to get [0, 2, 3]
Try extending by adding 4 to get [0, 2, 4]
Extend [0, 2, 4] [2, 2, 4] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 3 to get [0, 2, 3, 4]
Try extending by adding 5 to get [0, 2, 4, 5]
Try extending by adding 6 to get [0, 2, 4, 6]
Try extending by adding 7 to get [0, 2, 4, 7]
Extend [0, 2, 4, 7] [2, 2, 3, 4, 5, 7] vs [2, 2, 2, 3, 3, 4, 5, 5, 6,...
Try extending by adding 3 to get [0, 2, 3, 4, 7]
Try extending by adding 5 to get [0, 2, 4, 5, 7]
Try extending by adding 6 to get [0, 2, 4, 6, 7]
Try extending by adding 8 to get [0, 2, 4, 7, 8]
Try extending by adding 10 to get [0, 2, 4, 7, 10]
Improvement 1
Extend [0] [] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 2 to get [0, 2]
Extend [0, 2] [2] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 3 to get [0, 2, 3]
Try extending by adding 4 to get [0, 2, 4]
Extend [0, 2, 4] [2, 2, 4] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 3 to get [0, 2, 3, 4]
Try extending by adding 5 to get [0, 2, 4, 5]
Only try each new item once.
Don't try something that didn't work in the past.
Why does this happen?
Extend [0] [] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 2 to get [0, 2]
Extend [0, 2] [2] vs [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
Try extending by adding 3 to get [0, 2, 3]
# Cand generates a subset of list. Try to add elt.
# We assume that subset(differences(cand), lst)
def extend(cand, lst, depth):
    # Are we done?
    if (subset(lst, differences(cand))):
        print "Success!! ", cand, lst
        return True
    for x in lst:
        if ((x > 0) and (not (x in cand))):
Improvements
Rather than start small, start large. We will see an example.
Careful: the first two cases are special.
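One well-known way to "start large" is to always place the largest distance that is still unexplained: it must be measured from one of the two ends, so there are only two positions to try. The Python 3 sketch below follows that idea; it is offered as an illustration under that assumption and is not necessarily the version the lecture goes on to show. The two special first cases are the end points 0 and M, which are placed before the search begins. All function names are invented for this example.

```python
from collections import Counter

def partial_digest(L):
    """Backtracking that always places the largest remaining distance."""
    remaining = Counter(L)
    width = max(L)
    remaining[width] -= 1          # special cases: 0 and width are placed first
    X = [0, width]

    def distances_if_available(y):
        """Distances from y to every placed point, or None if some are not left."""
        needed = Counter(abs(y - x) for x in X)
        if all(remaining[d] >= c for d, c in needed.items()):
            return needed
        return None

    def place():
        if sum(remaining.values()) == 0:
            return sorted(X)       # every distance has been explained
        y = max(d for d, c in remaining.items() if c > 0)
        for candidate in (y, width - y):   # measured from the left or right end
            needed = distances_if_available(candidate)
            if needed is not None:
                remaining.subtract(needed)
                X.append(candidate)
                result = place()
                if result is not None:
                    return result
                X.pop()                    # undo and try the other end
                remaining.update(needed)
        return None

    return place()

print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))   # [0, 3, 6, 8, 10]
```

The answer is the mirror image of {0, 2, 4, 7, 10}, which has the same digest. Because the remaining distances shrink with every placement, this version also avoids recomputing the full set of differences at each step.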
Profile Look at where you spend time. May be possible to cut down on number of calls, or to speed up commonly called routines len() is called often: insrt() is taking more time Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 342 0.003 0.000 0.003 0.000 :0(insert) 717 0.006 0.000 0.006 0.000 :0(len) 1 0.001 0.001 0.001 0.001 :0(setprofile) 1 0.000 0.000 0.027 0.027 <string>:1(<module>) 38 0.005 0.000 0.020 0.001 Turnpike.py:18(differences) 32 0.002 0.000 0.003 0.000 Turnpike.py:26(subset) 32 0.000 0.000 0.000 0.000 Turnpike.py:42(prettyPrint) 6/1 0.002 0.000 0.027 0.027 Turnpike.py:48(extend) 342 0.009 0.000 0.014 0.000 Turnpike.py:7(insrt) 1 0.000 0.000 0.029 0.029 profile:0(extend([0], [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12], 0)) 0 0.000 0.000 profile:0(profiler) 24
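A table like the one above can be produced with the standard library profiler. The sketch below uses Python 3's `cProfile` and `pstats` (the slide output appears to come from the older `profile` module), and profiles a small stand-in for the Turnpike helpers rather than the real Turnpike.py. Timings will differ from run to run, so no expected output is shown.

```python
import cProfile
import io
import pstats

def insrt(lst, itm):
    """Insert itm into the sorted list lst using a linear scan."""
    spot = 0
    while spot < len(lst) and lst[spot] < itm:
        spot += 1
    lst.insert(spot, itm)

def differences(a):
    """Sorted list of pairwise differences of the sorted list a."""
    out = []
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            insrt(out, a[j] - a[i])
    return out

profiler = cProfile.Profile()
profiler.enable()
differences([0, 2, 4, 7, 10, 12])
profiler.disable()

# Collect the report in a string, ordered by standard name as on the slide.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("stdname").print_stats()
print(report.getvalue())
```

The ncalls column is the first thing to read: here insrt() is called once per pair of points, so 15 times for 6 points.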
Don't work so hard
We compute the full set of differences each time, rather than building the new differences from the old. In the profile, differences() accounts for 0.020 of the 0.027 seconds of cumulative time spent in extend().
We also keep the full list, rather than remembering just the remaining items.
Tracking only the remaining items would give smaller sets, which would speed up subset().
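Both ideas can be combined in one step: when a point is added, only its distances to the existing points are new, and those can be subtracted from a running multiset of unexplained distances. This is a sketch under assumptions: new_distances() and try_add() are hypothetical helpers, and collections.Counter stands in for the multiset.

```python
from collections import Counter


def new_distances(cand, x):
    """Distances from a new point x to each point already in cand."""
    return [abs(x - p) for p in cand]


def try_add(cand, remaining, x):
    """Return (new_cand, new_remaining) if x fits, otherwise None.

    remaining is a Counter of the distances not yet explained by cand.
    """
    needed = Counter(new_distances(cand, x))
    if needed - remaining:  # some needed distance is not available
        return None
    return sorted(cand + [x]), remaining - needed


remaining = Counter([2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12])
cand, rest = try_add([0], remaining, 2)
print(cand, sum(rest.values()))  # [0, 2] 14
print(try_add(cand, rest, 3))    # None, because the distance 1 is not available
```

Each call does work proportional to the size of cand, instead of recomputing every pairwise difference.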
Sets
Python supports sets. It has no built-in multiset type, but sets are still useful.
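The distinction matters here because distances repeat. One common stand-in for a multiset, not mentioned on the slide, is collections.Counter from the standard library. The sketch below compares the two tests on the candidate [0, 2, 4, 6], whose pairwise differences are written out by hand.

```python
from collections import Counter

lst = [2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 8, 10, 10, 12]
diffs = [2, 2, 2, 4, 4, 6]  # pairwise differences of [0, 2, 4, 6]

# A set ignores how often a value occurs, so this test passes.
as_sets = set(diffs) <= set(lst)

# A Counter keeps counts: 4 is needed twice but lst contains it only once.
as_multisets = not (Counter(diffs) - Counter(lst))

print(as_sets, as_multisets)  # True False
```

A plain set test would therefore accept a candidate that the multiset test correctly rejects.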
Summary
Backtracking allows us to search a large solution space.
It organizes the search and avoids wasting time.
It is not always applicable. It assumes:
  We can enumerate choices
  We can recognize a dead end
Clever modification of algorithms can yield large savings.
Profile your algorithm to see what you should tune.
Representation should match the requirements.
Clever choice of data structure can save a great deal of time: rather than checking dozens of squares, we tested 3.
References
Most data structures texts discuss the 8 Queens problem as an example of backtracking. Computer Algorithms, by Sara Baase and Allen Van Gelder, uses it as an example when discussing graph traversal.
Steven S. Skiena, Warren D. Smith, and Paul Lemke, "Reconstructing sets from interpoint distances", Proceedings of the Sixth Annual Symposium on Computational Geometry (SCG), 1990. (See the papers section.)
PCA is not something to try by hand. Use a tool such as Matlab.
Here is a discussion of PCA in Excel: http://www.quepublishing.com/articles/article.aspx?p=1912076
Thoughts on Teaching
Take a look at the presentation posted to Canvas.
To get ideas across in a limited time, pick one good example and spend time on it.
8 Queens is a great example of backtracking.
http://blog.richmond.edu/wross/2008/03/26/how-to-give-a-good-20-minute-math-talk/
http://www.ams.org/notices/200709/tx070901136p.pdf
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0 }
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0 } Remove 10 from L and insert it into X. We know this must be the length of the DNA sequence because it is the largest fragment.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } This is the first case. We are left with { 2, 2, 3, 3, 4, 5, 6, 7, 8 }.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } Take 8 from L and make y = 2 or 8. But since the two cases are symmetric, we can assume y = 2.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } We find that the distances from y=2 to other elements in X are D(y, X) = {8, 2}, so we remove {8, 2} from L and add 2 to X.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } This is the second case. We are left with { 2, 3, 3, 4, 5, 6, 7 }.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } Take 7 from L and make y = 7 or y = 10 – 7 = 3. We will explore y = 7 first, so D(y, X ) = {7, 5, 3}.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } For y = 7, D(y, X) = {7, 5, 3} = {|7 – 0|, |7 – 2|, |7 – 10|}. Therefore we remove {7, 5, 3} from L and add 7 to X.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } We are left with 2, 3, 4, and 6
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } This time make y = 4. D(y, X) = {4, 2, 3, 6}, which is a subset of L, so we will explore this branch. We remove {4, 2, 3, 6} from L and add 4 to X.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 4, 7, 10 }
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 4, 7, 10 } L is now empty, so we have a solution, which is X.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } To find other solutions, we backtrack.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } We backtrack again.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 10 } This time we will explore y = 3. D(y, X) = {3, 1, 7}, which is not a subset of L, so we won’t explore this branch.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 2, 7, 10 } Take 6 from L and make y = 6. Unfortunately D(y, X) = {6, 4, 1, 4}, which is not a subset of L. Therefore we won’t explore this branch.
An Example L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 } X = { 0, 10 } We backtracked back to the root. Therefore we have found all the solutions.
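The walkthrough above can be sketched in Python 3 as follows. This is an illustration under assumptions, not the lecture's code: the names partial_digest() and place() are hypothetical, and collections.Counter stands in for the multiset L. Unlike the slides, which use symmetry to fix y = 2, the sketch tries both y and width - y at every step, so it also reports the mirror-image solution.

```python
from collections import Counter


def partial_digest(distances):
    """Return every point set X (as a sorted list) whose pairwise distances equal distances."""
    remaining = Counter(distances)
    width = max(remaining)          # the largest distance is the total length
    remaining -= Counter([width])   # remove it from L and place 0 and width
    found = set()
    place([0, width], remaining, width, found)
    return [list(x) for x in sorted(found)]


def place(points, remaining, width, found):
    """Place the point that explains the largest remaining distance, then recurse."""
    if not remaining:               # L is empty: points is a solution
        found.add(tuple(sorted(points)))
        return
    largest = max(remaining)
    # The largest remaining distance must be measured from one end or the other.
    for y in sorted({largest, width - largest}, reverse=True):
        needed = Counter(abs(y - p) for p in points)   # D(y, X)
        if not (needed - remaining):                   # D(y, X) is a sub-multiset of L
            place(points + [y], remaining - needed, width, found)
        # Otherwise this branch is a dead end and we fall through (backtrack).


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
# [[0, 2, 4, 7, 10], [0, 3, 6, 8, 10]]
```

The second solution is the first reflected about the midpoint (each point x replaced by 10 - x), which is why the slides can restrict attention to y = 2.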
Analyzing PartialDigest Algorithm
Still exponential in the worst case, but very fast on average.
Informally, let T(n) be the time PartialDigest takes to place n cuts.
No branching case: T(n) < T(n-1) + O(n), which is quadratic.
Branching case: T(n) < 2T(n-1) + O(n), which is exponential.
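Both bounds can be checked by unrolling the recurrences. The sketch below assumes the O(n) term is at most cn for some constant c; the symbols c and T(0) are introduced here for the derivation and do not appear on the slide.

```latex
\begin{align*}
\text{No branching:}\quad T(n) &\le T(n-1) + cn \\
  &\le T(0) + c\sum_{k=1}^{n} k
   = T(0) + \frac{c\,n(n+1)}{2} = O(n^2) \\[4pt]
\text{Branching:}\quad T(n) &\le 2T(n-1) + cn \\
  &\le 2^{n}\,T(0) + c\sum_{k=0}^{n-1} 2^{k}(n-k) \\
  &= 2^{n}\,T(0) + c\,\bigl(2^{n+1} - n - 2\bigr) = O(2^{n})
\end{align*}
```

The doubling at every level is what separates the two cases: a single branch per step costs only the sum of the linear terms, while two branches per step multiply the number of subproblems by two each time.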