Computational Biology Lecture #3: Mapping

Bud Mishra, Professor of Computer Science and Mathematics
9/24/2002
9/20/2018 ©Bud Mishra, 2001

Restriction Enzyme
Type II sequence-specific restriction endonuclease: an enzyme that can “cut” a double-stranded DNA by breaking the phosphodiester bonds at specific “target or restriction sites” on the DNA.
Restriction sites:
  Completely determined by their base-pair composition
  Sequences of 4 to 8 base pairs
[Figure: restriction pattern]

Restriction Enzymes
Bacterial immune systems against viral DNA:
  Bacteria use restriction enzymes to cleave invading foreign DNA.
  Bacteria protect their own DNA against cleaving by a methylation process.
Restriction enzymes are very useful in biotechnology as:
  Biochemical scissors
  Biochemical markers

Applications of Restriction Enzyme
RFLP (Restriction Fragment Length Polymorphisms)
  Polymorphisms $\equiv$ sequence variation within a population
Restriction Maps
  Fingerprints
  Double Digestion Maps
  Multiple Complete Digestion Maps
  Ordered Restriction Maps
Clone Library (with Partial Digestion)
DNA Probes (Small Restriction Fragments)

Digression: Brun’s Sieve and the Poisson Approximation
Theorem: Let $W$ be a nonnegative integer-valued random variable such that $E\left[\binom{W}{i}\right] = \lambda^i/i!$. Then $\Pr[W = M] \approx e^{-\lambda} \lambda^M/M!$.
Proof: Let the indicator variable $I_{W=j}$ be
$I_{W=j} = 1$ if $W = j$, and $0$ otherwise.
Show that $I_{W=j} = \sum_{k=0}^{\infty} \binom{W}{j+k} \binom{j+k}{k} (-1)^k$.

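Brun’s Sieve
$I_{W=j} = \binom{W}{j} \sum_{k=0}^{W-j} \binom{W-j}{k} (-1)^k$
$= \sum_{k=0}^{W-j} \binom{W}{j} \binom{W-j}{k} (-1)^k$
$= \sum_{k=0}^{\infty} \binom{W}{j+k} \binom{j+k}{k} (-1)^k$
By convention, $\binom{W}{j} = 0$ if $j > W$.
$\Pr[W = M] = E[I_{W=M}] = \sum_{k=0}^{\infty} E\left[\binom{W}{M+k} \binom{M+k}{k} (-1)^k\right]$
$\approx \sum_{k=0}^{\infty} \frac{\lambda^{M+k}}{(M+k)!} \binom{M+k}{k} (-1)^k$
$= \frac{\lambda^M}{M!} \sum_{k=0}^{\infty} \frac{(-\lambda)^k}{k!} = e^{-\lambda} \lambda^M/M!$ $\square$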
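The indicator identity used in the proof can be checked numerically for small values. The following is a minimal sketch, not part of the original slides; the function name `indicator_by_sieve` is made up for illustration.

```python
from math import comb


def indicator_by_sieve(w, j):
    """Evaluate sum_k C(w, j+k) * C(j+k, k) * (-1)^k.

    math.comb(w, j + k) is 0 whenever j + k > w, so the infinite sum
    can be truncated at k = w - j (or at k = 0 when j > w).
    """
    upper = max(w - j, 0)
    return sum(
        comb(w, j + k) * comb(j + k, k) * (-1) ** k
        for k in range(upper + 1)
    )


# The alternating sum should be 1 exactly when w == j and 0 otherwise.
for w in range(6):
    print(w, [indicator_by_sieve(w, j) for j in range(6)])
```

Each printed row should contain a single 1 on the diagonal, which is the statement $I_{W=j} = 1$ if and only if $W = j$.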

Restriction Map: Resolution
$G$ = length of a genomic DNA
$p_k$ = probability that an arbitrary site is a restriction site for a $k$-cutter enzyme
$k$ = 4, 6 or 8 = cutting frequency
Uniform i.i.d. assumption: “All base pairs occur at any given position with equal probability and independently”:
$p_k = 1/4^k$

Numerical Values
Cut probabilities: $p_k = 1/4^k$
Cut numbers: $\lambda_k = G p_k$

Cut Probability    Value
$p_4$              1/256
$p_6$              1/4,096
$p_8$              1/65,536
$p_{10}$           1/1,048,576
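The table entries and the corresponding cut numbers can be reproduced with a few lines of Python. This is a sketch under the uniform i.i.d. assumption; the function names are hypothetical, and the genome length of 3,300 Mb is borrowed from the BAC library example later in the lecture.

```python
def cut_probability(k):
    """p_k = 1 / 4^k for a k-cutter under the uniform i.i.d. assumption."""
    return 1.0 / 4 ** k


def expected_cuts(genome_length, k):
    """lambda_k = G * p_k, the expected number of restriction sites."""
    return genome_length * cut_probability(k)


G = 3_300_000_000  # human genome length in bp, as used in the BAC example
for k in (4, 6, 8, 10):
    print(f"k={k:2d}  p_k=1/{4 ** k:,}  lambda_k={expected_cuts(G, k):,.0f}")
```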

Statistics of Restriction Sites
$X_j$ = Bernoulli r.v. = event that “there is a restriction site beginning at $j$.”
$W = \sum_{j=1}^{G} X_j$ = total # restriction sites in the genome = # successes in $G$ independent trials
$\binom{W}{i}$ = # of “$i$-successful trials.”
Consider the set of all “$i$-trials”:
  There are $\binom{G}{i}$ of these.
  Each “$i$-successful trial” occurs with probability $p_k^i$.

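Applying Brun’s Sieve
$E\left[\binom{W}{i}\right] = \binom{G}{i} p_k^i = (G^{(i)}/i!)\, p_k^i \approx (G p_k)^i/i! = \lambda_k^i/i!$, where $G^{(i)} = G(G-1)\cdots(G-i+1)$ and $i \ll G$
$\Pr[\#\,\text{restriction sites} = M \mid G, \lambda_k] = \Pr[W = M] \approx e^{-\lambda_k} \lambda_k^M/M!$ $\square$
$E[W] = \lambda_k = G p_k = G/4^k$
$\sigma^2[W] = \lambda_k \Rightarrow \text{S.D.}[W] = G^{1/2}/2^k$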
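To see how close the Poisson approximation is, one can compare it with the exact binomial distribution of $W$ (the number of successes in $G$ independent trials). The sketch below is illustrative only; the values $G = 100{,}000$ and $k = 6$ are assumptions chosen to keep the numbers small, and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_pmf(m, n, p):
    """Exact Pr[W = m] for W ~ Binomial(n, p), computed in log space."""
    log_pmf = (
        lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
        + m * log(p) + (n - m) * log1p(-p)
    )
    return exp(log_pmf)


def poisson_pmf(m, lam):
    """Poisson approximation e^{-lam} * lam^m / m!, computed in log space."""
    return exp(-lam + m * log(lam) - lgamma(m + 1))


G = 100_000          # assumed toy genome length
p_k = 1.0 / 4 ** 6   # 6-cutter
lam = G * p_k        # expected number of sites, about 24.4

for m in (10, 20, 24, 30, 40):
    print(m, binomial_pmf(m, G, p_k), poisson_pmf(m, lam))
```

The two columns should agree to several decimal places, since the approximation error shrinks with $p_k$.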

Statistics of Restriction Fragments
$\Pr[\text{a restriction fragment is of length } l] = (1 - p_k)^l p_k \approx e^{-l/\mu_k}/\mu_k$, where $\mu_k^{-1} = \log(1/(1 - p_k))$
$W$ = r.v. with exponential distribution with mean $\mu_k$: $f_W(w) = e^{-w/\mu_k}/\mu_k$, $w > 0$
$Z = \lfloor W \rfloor$ = r.v. giving the length of a restriction fragment in base pairs
$E[W] = \mu_k \approx 1/p_k = 4^k$
$\sigma^2[W] = \mu_k^2 \Rightarrow \text{S.D.}[W] = \mu_k \approx 4^k$
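The quality of the exponential approximation to the geometric fragment-length distribution can be checked directly. This is a small sketch with hypothetical function names; it only evaluates the two formulas given above.

```python
from math import exp, log1p


def mean_fragment_length(k):
    """mu_k defined by 1/mu_k = log(1 / (1 - p_k)), with p_k = 1/4^k."""
    p_k = 1.0 / 4 ** k
    return -1.0 / log1p(-p_k)


def geometric_pmf(length, k):
    """Exact probability (1 - p_k)^l * p_k of a fragment of length l."""
    p_k = 1.0 / 4 ** k
    return (1.0 - p_k) ** length * p_k


def exponential_density(length, k):
    """Approximating density e^{-l/mu_k} / mu_k."""
    mu_k = mean_fragment_length(k)
    return exp(-length / mu_k) / mu_k


for k in (4, 6, 8):
    print(k, mean_fragment_length(k), geometric_pmf(1000, k), exponential_density(1000, k))
```

Because $(1 - p_k)^l = e^{-l/\mu_k}$ exactly, the only discrepancy is the factor $p_k \mu_k$, which is very close to 1 for the values of $k$ considered here.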

Matching Rules for Restriction Fragments
Given two restriction fragments without any identifying markers, when can they be said to be the same?
We must account for small measurement errors: $\beta$ = relative sizing error.
Matching Rule (II): Two restriction fragments are said to match if their lengths $x$ and $y$ differ by less than a fraction $\beta$ (i.e., by less than $100\beta\%$):
$-\beta \le 1 - y/x \le \beta$
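The matching rule translates into a one-line predicate. The sketch below follows the inequality as written, so the error is measured relative to $x$; the function name and the sample lengths are illustrative assumptions.

```python
def fragments_match(x, y, beta):
    """Return True if -beta <= 1 - y/x <= beta for fragment lengths x, y > 0."""
    if x <= 0 or y <= 0:
        raise ValueError("fragment lengths must be positive")
    return abs(1.0 - y / x) <= beta


# With a 5% relative sizing error, 1,000 bp and 1,040 bp match,
# while 1,000 bp and 1,100 bp do not.
print(fragments_match(1000, 1040, 0.05))
print(fragments_match(1000, 1100, 0.05))
```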

False Positive Match Probability
Given: Two randomly chosen distinct restriction fragments obtained by cleaving a large genomic DNA with the same restriction enzyme.
What is the probability that the matching rule accidentally identifies the fragments as the same?

False Positive Probability
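$\mu_k$ = expected length of a restriction fragment
$x, y \sim \text{Exponential}(1/\mu_k)$, $f_X(x) = e^{-x/\mu_k}/\mu_k$
False Positive Probability
$= \int_0^{\infty} \left( \int_{x(1-\beta)}^{x(1+\beta)} \frac{e^{-y/\mu_k}}{\mu_k}\, dy \right) \frac{e^{-x/\mu_k}}{\mu_k}\, dx$
$= \int_0^{\infty} \left( \int_{v(1-\beta)}^{v(1+\beta)} e^{-u}\, du \right) e^{-v}\, dv$
$= \int_0^{\infty} \left( e^{-v(1-\beta)} - e^{-v(1+\beta)} \right) e^{-v}\, dv$
$= \frac{1}{2-\beta} - \frac{1}{2+\beta} = \frac{2\beta}{4-\beta^2} \approx \beta/2$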
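The closed form $2\beta/(4-\beta^2)$ can be sanity-checked by simulation. The sketch below draws pairs of unit-mean exponential lengths (the mean $\mu_k$ cancels after the change of variables, as in the derivation above) and counts how often they match. The function names, trial count and seed are assumptions made for illustration.

```python
import random


def false_positive_exact(beta):
    """Closed form 2*beta / (4 - beta^2) from the derivation above."""
    return 2.0 * beta / (4.0 - beta ** 2)


def false_positive_simulated(beta, trials=200_000, seed=0):
    """Monte Carlo estimate using unit-mean exponential fragment lengths."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.expovariate(1.0)
        y = rng.expovariate(1.0)
        if x * (1.0 - beta) <= y <= x * (1.0 + beta):
            hits += 1
    return hits / trials


beta = 0.1
print(false_positive_exact(beta), false_positive_simulated(beta), beta / 2)
```

For a 10% sizing error the three printed values should all be close to 0.05.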

Maps using Clones
Clone:
  A large fragment of genomic DNA that has been preselected.
  One can make faithful copies of a clone a large number of times from a small number of initial clones.
  All location information for a clone is assumed to be lost. For instance, it is not known:
    Which chromosome a clone belongs to…
    Whether two clones overlap…
    What base-pair sequence the clone has… etc.

Clone Libraries Commonly Used
Clones                                    Insert Size
Lambda ($\lambda$)                        2-20 Kb
Cosmid (Artificial Plasmid)               20-45 Kb
BAC (Bacterial Artificial Chromosome)     100-200 Kb
YAC (Yeast Artificial Chromosome)         1-2 Mb

Clone Library
A preselected set of clones $\equiv$ Clone Library
Locations of the clones are assumed to be uniformly random and i.i.d.
All clones are roughly the same size.
$G$ = genome length, $L$ = clone length, $N$ = # clones in a library
Coverage $= NL/G = c$
(The number of times the clones would cover the genome if they were concatenated end-to-end. Also, the expected number of clones covering any location of the genome.)

Example
A BAC library for human:
  $G$ = 3,300 Mb, $L$ = 180 Kb, $N$ = 96,000
  $c = NL/G = (96 \times 10^3)(180 \times 10^3)/(3.3 \times 10^9) \approx 5.2$, i.e. roughly $6\times$
  96,000 randomly chosen BACs from the human genome provide a nominally $6\times$ library.
  Certain regions of the genome may be difficult to clone and hence may not be represented in the library.
A Tiling Path = a subset of clones that minimally covers the genome:
  Removal of any clone from the tiling path will leave some location of the genome uncovered.
  Every location of the genome is covered by no more than two clones.
  Every clone is overlapped by at most two other clones.
  The coverage for a tiling path: $1 \le c_{TP} \le 2$
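The coverage arithmetic for the BAC example can be reproduced as follows. This is a minimal sketch using the slide's values of $G$, $L$ and $N$; the function names are hypothetical.

```python
def coverage(num_clones, clone_length, genome_length):
    """c = N * L / G."""
    return num_clones * clone_length / genome_length


def clones_for_coverage(target_coverage, clone_length, genome_length):
    """Number of clones N needed to reach a given coverage c: N = c * G / L."""
    return target_coverage * genome_length / clone_length


G = 3_300_000_000  # genome length in bp
L = 180_000        # BAC insert length in bp
N = 96_000         # clones in the library

print(f"coverage with N = {N:,}: {coverage(N, L, G):.2f}")
print(f"clones needed for 6x: {clones_for_coverage(6, L, G):,.0f}")
print(f"clones on a tiling path (1 <= c <= 2): "
      f"{clones_for_coverage(1, L, G):,.0f} to {clones_for_coverage(2, L, G):,.0f}")
```

The last line gives the range of clone counts consistent with a tiling-path coverage between 1 and 2 for these values of $G$ and $L$.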

Clone Library
[Figure: a genome, the clone library covering it, and a minimal tiling path]

Mapping A Single Clone
Decorate a clone with additional information, e.g.:
  Restriction Pattern (Ordered Restriction Map, Fingerprints)
  End Sequencing (500 base pairs on each end)
  Probes (PCR products, hybridization probes, etc.)
Restriction Pattern:
  Take a clone and completely digest it into small pieces (restriction fragments) with a restriction enzyme.
  The restriction fragments and their order are always the same for that clone.

Restriction Maps of a Clone
[Figure: a clone with restriction sites dividing it into fragments 1-6]
Ordered Restriction Map of the clone: the ordered set of restriction fragments (1 2 3 4 5 6)
Fingerprint, or Unordered Restriction Map, of the clone: the unordered collection of restriction fragments
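The difference between an ordered restriction map and a fingerprint can be illustrated on a toy sequence. The sketch below is a simplification: it cuts at the first base of every occurrence of the recognition site, whereas real enzymes cut at a fixed offset within the site. The function names and the example sequence are made up for illustration.

```python
def ordered_restriction_map(sequence, site):
    """Fragment lengths, in order along the clone, after a complete digest.

    Simplifying assumption: the cut is placed at the first base of each
    occurrence of `site`.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    # Skip empty fragments (e.g. a site at position 0).
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def fingerprint(sequence, site):
    """Unordered restriction map: the same lengths with the order discarded."""
    return sorted(ordered_restriction_map(sequence, site))


clone = "AAGAATTCCCCCGAATTCTT"
print(ordered_restriction_map(clone, "GAATTC"))
print(fingerprint(clone, "GAATTC"))
```

Sorting is used here only as a canonical way to forget the order; any multiset representation would serve equally well.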

A Clone Map
Key Question:
  Given two clones, can we tell whether they overlap simply by examining their fingerprints or maps?
Issues:
  False positives and false negatives in overlap detection
  Ordering all the clones using the overlap properties
  Computing the tiling path
  Subcloning and sequencing (divide-and-conquer)

Amplification by Molecular Cloning
In vivo approach:
Ingredients:
  Host organism: E. coli bacteria or yeast replicates a suitably modified foreign DNA.
  Cloning vector
  Insert DNA: the cell will not replicate any foreign DNA in the absence of a suitable cloning vector.
Vector and insert are combined to create a circular recombinant DNA, the “replicon.”
[Figure: vector + insert = “replicon”]

Cloning
Step 1:
  Inserts and vectors with the same “sticky ends” are mixed together with ligase enzyme.
  This produces a circular replicon.
Step 2:
  Transform the host cell by exposing a population of hosts to the ligase mixture containing the replicon.
  The replicons are inserted into the host cell.
  Transformed host cells are transferred to culture dishes containing a solid growth medium.
  Cells divide, making a colony containing $2^{30} \approx 10^9$ inserts in 10 hours.
Step 3:
  Identify the colonies of clones containing the copies of the inserts.
  Pick these colonies.
  Isolate and linearize the replicons.

Sequencing A Genome
A “divide-and-conquer” approach:
Step 1: Divide… Create a “high coverage” clone library by choosing many randomly located clones. (E.g., 96,000 BAC clones, each of length 180 Kb, from a human genome of length 3,300 Mb: a $6\times$ coverage BAC library.)
Step 2: Contig… Use the clone overlap information to create the contigs. (E.g., a $6\times$ coverage BAC library would yield $96{,}000 \times e^{-6} \approx 240$ contigs: about 10 contigs per chromosome, each of size about 10 Mb.)
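The contig estimate in Step 2 is the product $N e^{-c}$. The sketch below simply evaluates that expression for the slide's numbers; the function name is hypothetical, and the formula is taken from the slide as given (it does not account for a minimum detectable overlap between clones).

```python
from math import exp


def expected_contigs(num_clones, coverage):
    """Slide's estimate of the number of contigs: N * e^{-c}."""
    return num_clones * exp(-coverage)


N = 96_000
for c in (4, 5, 6, 8):
    print(f"c = {c}: about {expected_contigs(N, c):,.0f} contigs")
```

The estimate falls off exponentially with coverage, which is why a modest increase in library depth sharply reduces the number of gaps to close.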

Sequencing A Genome
Step 3: Prune… Remove “non-essential” clones from the contigs to form a “minimal tiling path.” (E.g., a minimal tiling path would consist of about 32,000 BAC clones.)
Step 4: Shotgun Sequencing… Subclone a BAC on the minimal tiling path into M13s. Generate sequence reads from the M13 subclones. Sequence reads = 300 to 1,000 bps, 95% accuracy.
Step 5: Contig the sequence reads…
Step 6: Assemble the sequences and close the gaps…

Finishing Phase
Filling the gaps between the contigs:
  Synthesize a primer from the end of the contig sequence.
  Generate a new read from the M13 subclone that starts with the synthesized primer.
If there is no such M13 subclone:
  Synthesize a pair of primers from the sequence at the ends of a “gap.”
  Amplify the DNA across the gap by performing PCR on the clone DNA.
  Sequence the PCR product.

Sequence Assembly
Idealized Assembly: assuming no error in the read sequence.
Shortest Common Superstring Problem:
  Given: A set $\{s_i\}$, where $s_i$ is a string over some alphabet.
  Find: The shortest string $S$ which contains each $s_i$ as a contiguous substring.
(SCSP, the Shortest Common Superstring Problem, is NP-complete.)

Greedy Algorithm for Sequence Assembly
Find overlaps between pairs of sequence reads. (Only consider overlaps that span at least 15 bps.)
Sort overlaps by decreasing length.
Merge reads into contigs according to the sorted list.
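The three steps above can be sketched as follows. This is one possible reading of the greedy procedure, not the lecture's own implementation: it assumes error-free reads on a single strand, no read contained inside another, and exact suffix-prefix overlaps. All function and variable names are made up for illustration.

```python
def overlap_length(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b, or 0 if shorter than min_overlap."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if n > 0 and a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=15):
    """Greedy merge of reads into contigs using sorted pairwise overlaps."""
    # Step 1: find overlaps between all ordered pairs of distinct reads.
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_overlap)
                if n > 0:
                    overlaps.append((n, i, j))

    # Step 2: sort overlaps by decreasing length.
    overlaps.sort(key=lambda item: -item[0])

    # Step 3: accept an overlap if both read ends are still free and the
    # two reads are not already in the same chain (which would form a cycle).
    successor, predecessor = {}, {}
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for n, i, j in overlaps:
        if i in successor or j in predecessor or find(i) == find(j):
            continue
        successor[i] = (j, n)
        predecessor[j] = i
        parent[find(i)] = find(j)

    # Spell out each chain, starting from reads that have no predecessor.
    contigs = []
    for start in range(len(reads)):
        if start in predecessor:
            continue
        contig, current = reads[start], start
        while current in successor:
            nxt, n = successor[current]
            contig += reads[nxt][n:]
            current = nxt
        contigs.append(contig)
    return contigs


genome = "ACGTTGCAAGGCTTAACCGGTATCGATCCA"
reads = [genome[16:30], genome[0:14], genome[8:22]]
print(greedy_assemble(reads, min_overlap=5))
```

The toy example uses a 5 bp threshold only because the reads are short; the slide's 15 bp threshold is the default. Since the Shortest Common Superstring Problem is NP-complete, a greedy merge of this kind is a heuristic and is not guaranteed to find the shortest superstring.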
