Published by Dustin Small. Modified over 9 years ago.
1
Dynamic Programming Algorithms for Haplotype Block Partitioning: Applications to Human Chromosome 21 Haplotype Data
Speaker: Yao-Ting Huang
Advisor: Kuan-Mao Chao
National Taiwan University, Department of Computer Science and Information Engineering, Algorithms and Computational Biology Lab.
2
References
Zhang, K., Sun, F., Waterman, M.S., Chen, T. Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. RECOMB, 2003.
Patil, N., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723, 2001.
Waterman, M.S., Eggert, M., Lander, E.S. Parametric sequence comparisons. Proceedings of the National Academy of Sciences of the United States of America, 1992.
Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F. A dynamic programming algorithm for haplotype block partitioning. Proceedings of the National Academy of Sciences of the United States of America, 2002.
Garey, M.R., Johnson, D.S. Computers and Intractability. New York, 1979.
3
Outline
Related biological background
Related works
The haplotype block partition problem
Three dynamic programming algorithms
Results
4
Introduction to Nucleic Acids
[Figure: a cell, with labels Cytoplasm, Cell Membrane, Nucleus, DNA, Genes, Chromosomes]
5
Chromosomes and DNA
A human cell has 46 chromosomes, arranged in 23 pairs; one chromosome of each pair comes from each parent.
Genetic information is encoded in DNA (deoxyribonucleic acid) and organized on the chromosomes.
6
Structure of DNA
DNA is a chain of nucleotides, each of which has the structure phosphate-sugar-base.
[Figure: a nucleotide, with labels Phosphate, Sugar, Base]
7
Structure of DNA
DNA has four bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T).
A and G are purines, whereas C and T are pyrimidines.
Purines are double-ring bases.
Pyrimidines are single-ring bases.
[Figure: ring structures of Cytosine, Thymine, Adenine, and Guanine]
8
Mutation
A mutation is caused by chemicals or by a malfunction of DNA replication, and exchanges a single nucleotide for another, e.g., C → T or A → G.
[Figure: gene replication from Parent 1 and Parent 2, with variation (mutation) or recombination]
9
Single Nucleotide Polymorphism
A Single Nucleotide Polymorphism (SNP) arises from mutation.
A mutated nucleotide becomes a SNP when its observed frequency exceeds 1% in a population.
SNP: a single-base DNA variation found in more than 1% of the population.
Mutation: a single-base DNA variation found in less than 1% of the population.
Example in the general population:
SNP: ACTTAGCTT in 94%, ACTTAGCTC in 6%
Mutation: ACTTAGCTT in 99.9%, ACTTAGCTC in 0.1%
10
Single Nucleotide Polymorphism
All humans share 99.9% of the same DNA sequence.
SNPs occur about every 600 base pairs.
90% of human genome variation comes from SNPs.
The human genome contains about 3 million SNPs.
Because of the A-T/C-G complement, a SNP can have only two variants: (AT) or (CG).
A SNP is a variable with two states:
Major allele: the allele (i.e., (AT) or (CG)) with frequency > 50%.
Minor allele: the allele with frequency < 50%.
11
Haplotype
A set of closely linked SNPs located on one chromosome, which tend to be inherited together (not easily separated by recombination).
DNA sequence        Haplotype (SNP 1, SNP 2, SNP 3)   Phenotype
GATATTCGTACGGA-T    AG-                               Black eye
GATGTTCGTACTGAAT    GTA                               Brown eye
GATATTCGTACGGA-T    AG-                               Black eye
GATATTCGTACGGAAT    AGA                               Blue eye
GATGTTCGTACTGAAT    GTA                               Brown eye
Haplotype frequencies over sequences 1 to 6: AG- 2/6, GTA 3/6, AGA 1/6.
12
An Example
The haplotype patterns for 20 independent chromosomes (columns) defined by 147 SNPs (rows) spanning 106 kb of genomic sequence.
Blue box = major allele
Yellow box = minor allele
The expanded box on the right is a SNP block of 26 SNPs over 19 kb.
The 4 most common of the 7 different haplotypes account for 80% of the chromosomes and can be distinguished with 2 SNPs.
13
Related Works
Patil et al. proposed a greedy algorithm and applied it to 20 haplotypes of 24,047 SNPs spanning 32.4 Mbp of human chromosome 21.
The haplotypes are partitioned into 4,135 blocks with 4,563 tag SNPs.
Zhang et al. reduced the number of haplotype blocks and tag SNPs to 2,575 and 3,582, respectively, by using a dynamic programming algorithm.
14
Zhang’s Algorithms for Haplotype Block Partitioning
Zhang et al. propose two dynamic programming algorithms to prioritize the SNPs and the corresponding chromosome regions.
They maximize the fraction of the genome covered by a fixed number of tag SNPs.
A further algorithm searches for the local maximal haplotypes that are shared by at least two haplotype samples.
Local maximal haplotype: a haplotype of maximal length that is shared by a given number of samples.
Local maximal haplotypes may correspond to important historical events during the evolution of the species.
15
Definition
Given K haplotype samples comprised of n consecutive SNPs.
Let h_i be a K-dimensional vector, where i = 1, 2, …, n.
E.g., h_1 = (0, 0, 1, 0) and h_2 = (0, 0, 1, 1) when K = 4 and n = 2.
h_i(k) = 0, 1, or 2:
0: missing data
1: major allele
2: minor allele
16
Definition
Two haplotypes are said to be compatible if their alleles are the same at every locus where neither has missing data.
A haplotype in the block is ambiguous if it is compatible with two other haplotypes that are themselves incompatible.
E.g., h_1 = (1, 0, 0, 2), h_2 = (1, 1, 2, 0), h_3 = (1, 1, 1, 2).
h_1 is compatible with h_2 and h_3, but h_2 is incompatible with h_3.
h_1 is ambiguous, whereas h_2 and h_3 are unambiguous.
Only unambiguous haplotypes are discussed in this paper.
Compatible haplotypes are treated as identical.
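The compatibility and ambiguity definitions can be checked mechanically. The following is a minimal Python sketch, assuming the 0/1/2 encoding from the previous slide (0 = missing data); the function names `compatible` and `is_ambiguous` are illustrative and do not come from the paper.

```python
MISSING = 0  # encoding assumed from the slides: 0 = missing, 1 = major, 2 = minor


def compatible(a, b):
    """True if a and b agree at every locus where neither value is missing."""
    return all(x == y for x, y in zip(a, b) if x != MISSING and y != MISSING)


def is_ambiguous(h, others):
    """True if h is compatible with two haplotypes that are mutually incompatible."""
    mates = [o for o in others if compatible(h, o)]
    return any(
        not compatible(mates[p], mates[q])
        for p in range(len(mates))
        for q in range(p + 1, len(mates))
    )


# The example from the slide.
h1, h2, h3 = (1, 0, 0, 2), (1, 1, 2, 0), (1, 1, 1, 2)
print(compatible(h1, h2), compatible(h1, h3), compatible(h2, h3))
print(is_ambiguous(h1, [h2, h3]), is_ambiguous(h2, [h1, h3]))
```

On the slide's example this reports h_1 as compatible with both h_2 and h_3, h_2 and h_3 as incompatible, and only h_1 as ambiguous.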
17
Definition
Haplotype block: a segment of consecutive SNPs can form a haplotype block if at least α percent of the unambiguous haplotypes are represented more than once in the samples.
The α value is set to 80 in both Zhang's and Patil's experiments.
Tag SNPs: the minimum number of SNPs that can distinguish at least α percent of the haplotypes.
18
Predefined Functions
block(i, …, j) is a boolean function.
block(i, …, j) = 1 iff at least αM of the unambiguous haplotypes defined by SNPs i, …, j are represented more than once, where M ≤ K is the total number of defined haplotypes.
f(·) is the number of tag SNPs within a block.
Let B = {B_1, B_2, …, B_l} be a set of disjoint blocks.
L(i, …, j) is the length of a block.
L(i, …, j) = j - i + 1
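One possible reading of the block(·) predicate is sketched below in Python. It assumes the 0/1/2 encoding from the earlier definition, treats α as a fraction (0.8 for 80%), and counts a haplotype as "represented more than once" when at least one other unambiguous haplotype is compatible with it. The names `is_block` and `unambiguous` are illustrative, not from the paper.

```python
MISSING = 0  # 0 = missing data, 1 = major allele, 2 = minor allele


def compatible(a, b):
    """True if a and b agree wherever neither value is missing."""
    return all(x == y for x, y in zip(a, b) if x != MISSING and y != MISSING)


def unambiguous(haps):
    """Keep haplotypes that are not compatible with two mutually incompatible ones."""
    keep = []
    for idx, h in enumerate(haps):
        mates = [o for pos, o in enumerate(haps) if pos != idx and compatible(h, o)]
        clash = any(
            not compatible(a, b)
            for p, a in enumerate(mates)
            for b in mates[p + 1:]
        )
        if not clash:
            keep.append(h)
    return keep


def is_block(samples, i, j, alpha=0.8):
    """block(i, ..., j) for 1-based, inclusive SNP positions i..j."""
    haps = [tuple(s[i - 1:j]) for s in samples]
    defined = unambiguous(haps)
    if not defined:
        return False
    # A haplotype is "represented more than once" if it is compatible with
    # at least one other unambiguous haplotype (the count includes itself).
    repeated = sum(
        1 for h in defined if sum(compatible(h, o) for o in defined) >= 2
    )
    return repeated >= alpha * len(defined)


# Toy data (made up for illustration): four identical samples and one outlier.
toy_samples = [(1, 1, 2), (1, 1, 2), (1, 1, 2), (1, 1, 2), (2, 2, 1)]
print(is_block(toy_samples, 1, 3))             # 4 of 5 repeated, threshold 4.0
print(is_block(toy_samples, 1, 3, alpha=0.9))  # threshold 4.5
```

With these toy samples the segment qualifies as a block at α = 0.8 but not at α = 0.9.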
19
Problem 1
Block partition with a fixed number of tag SNPs: given K haplotypes consisting of n consecutive SNPs, and an integer m, find a set of disjoint blocks B = {B_1, B_2, …, B_l} with f(B) ≤ m such that L(B) is maximized.
2D dynamic programming algorithm for Problem 1
Let S(j, k) be the maximum length of the genome covered by at most k tag SNPs for the optimal block partition of the first j SNPs, j = 1, 2, …, n.
S(0, k) = 0 for any k ≥ 0
S(j, k) = -∞ for any k < 0
20
2D Dynamic Programming Algorithm for Problem 1
Case 1: the last block ends before j.
S(j, k) = S(j-1, k)
Case 2: the last block ends exactly at j and starts at i, where block(i, …, j) = 1.
S(j, k) = S(i-1, k - f(i, …, j)) + L(i, …, j)
S(j, k) is the maximum of Case 1 and of Case 2 taken over all admissible start positions i ≤ j.
The optimal block partition can be found by backtracking through the elements of S that contribute to S(n, m).
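The two cases translate directly into a table-filling routine. The sketch below is a minimal Python illustration, assuming block, f, and L are supplied as callables on 1-based positions (i, j). The toy predicates at the bottom are made up for the example and are not the chromosome 21 data.

```python
def max_coverage(n, m, block, f, length):
    """Fill S[j][k]: the maximum length covered in the first j SNPs with at most k tag SNPs."""
    S = [[0] * (m + 1) for _ in range(n + 1)]  # S[0][k] = 0 for all k >= 0
    for j in range(1, n + 1):
        for k in range(m + 1):
            best = S[j - 1][k]  # Case 1: the last block ends before j
            for i in range(1, j + 1):  # Case 2: the last block is i..j
                if block(i, j):
                    cost = f(i, j)
                    if cost <= k:  # k - cost < 0 plays the role of -infinity
                        best = max(best, S[i - 1][k - cost] + length(i, j))
            S[j][k] = best
    return S


def backtrack(S, n, m, block, f, length):
    """Recover one optimal list of blocks (i, j) from the filled table."""
    blocks, j, k = [], n, m
    while j > 0:
        if S[j][k] == S[j - 1][k]:
            j -= 1
            continue
        for i in range(1, j + 1):
            if (block(i, j) and f(i, j) <= k
                    and S[j][k] == S[i - 1][k - f(i, j)] + length(i, j)):
                blocks.append((i, j))
                j, k = i - 1, k - f(i, j)
                break
    return blocks[::-1]


# Toy setting: any run of at most 2 SNPs is a block and needs 1 tag SNP.
toy_block = lambda i, j: j - i + 1 <= 2
toy_f = lambda i, j: 1
toy_len = lambda i, j: j - i + 1

table = max_coverage(4, 2, toy_block, toy_f, toy_len)
print(table[4][2])                                        # 4
print(backtrack(table, 4, 2, toy_block, toy_f, toy_len))  # [(1, 2), (3, 4)]
```

In the toy setting two tag SNPs cover all four SNPs with the blocks (1, 2) and (3, 4), while one tag SNP covers only two.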
21
Time Complexity to Compute S(n, m)
If the block(·), f(·), and L(·) functions are computed in advance, then computing S(n, m) has
space complexity O(m*n), and
time complexity O(N*m*n), where N is the number of SNPs contained in the largest block.
The time complexity to compute L(·) is O(1).
The time complexity to compute block(i, …, i+k-1) is O(K^2 * k), because every pair of haplotypes must be checked for compatibility at these k SNPs.
22
Time Complexity to Compute S(n, m)
Computing f(·) is an NP-complete problem: it is equivalent to the Minimum Test Set problem.
E.g., the simplest way to compute f(i, …, i+N-1) is to enumerate subsets of the SNPs in the block.
Overall time complexity: O(2^N^K * N * n) + O(K^2 * N^2 * n) + O(N*m*n)
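A brute-force search makes the exponential cost of f(·) concrete. The Python sketch below is a simplification: it assumes no missing data and requires all distinct haplotypes in the block to be told apart, rather than only α percent of them. The name `min_tag_snps` is illustrative.

```python
from itertools import combinations


def min_tag_snps(haplotypes):
    """Smallest set of column indices that still separates all distinct haplotypes.

    Exhaustive search over subsets in order of increasing size, so the running
    time grows exponentially with the block width. Assumes a non-empty input.
    """
    distinct = set(map(tuple, haplotypes))
    width = len(next(iter(distinct)))
    for size in range(width + 1):
        for cols in combinations(range(width), size):
            projected = {tuple(h[c] for c in cols) for h in distinct}
            if len(projected) == len(distinct):
                return list(cols)


# Three distinct haplotypes over three SNPs (1 = major, 2 = minor).
print(min_tag_snps([(1, 1, 1), (2, 2, 2), (1, 1, 2)]))  # [0, 2]
```

Here the first and third SNPs suffice, so f would be 2 for this block under the stated simplification.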
23
Problem 2
Block partition with a fixed genome coverage: given a chromosome of length L, K haplotypes consisting of n consecutive SNPs, and β ≤ 1, find a set of disjoint blocks B = {B_1, B_2, …, B_l} with L(B) ≥ βL such that f(B) is minimized.
Parametric dynamic programming algorithm
Define the positive score for SNPs i, …, j to be the number of tag SNPs, f(i, …, j), if block(i, …, j) = 1 and this block is included in the partition.
Define the penalty for SNPs i, …, j to be λL(i, …, j) if they are excluded from the partition.
24
Parametric Dynamic Programming
Let S(j, λ) be the minimum score for the optimal block partition of the first j SNPs with respect to the deletion parameter λ.
S(0, λ) = 0
S(n, ∞) = the minimum number of tag SNPs for the entire chromosome, because all SNPs are included in the block partition.
The scoring function S(j, λ) is the sum of
the total number of tag SNPs for included blocks, and
the penalty for excluded intervals.
25
Properties of the Scoring Function
S(j, λ) is an increasing, piecewise-linear, and convex function of λ; on each linear segment, S(j, λ) = a + b*λ.
The right-most linear segment of S(j, λ) is constant.
The intercept a of each linear segment is the total number of tag SNPs.
The slope b of each linear segment is the total length of excluded intervals.
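For a single fixed λ, the score and the (intercept, slope) pair of the optimal partition can be computed with a one-dimensional recurrence. The slides do not show that recurrence, so the Python sketch below is an assumption: it takes L(·) to be additive with unit length per SNP, and breaks ties in favour of the shorter excluded length. The function name and toy predicates are illustrative.

```python
def parametric_score(n, lam, block, f):
    """Return (score, tags, excluded) for the first n SNPs at deletion parameter lam.

    tags is the intercept and excluded is the slope of the linear segment
    that is optimal at lam, so score == tags + lam * excluded.
    """
    best = [(0.0, 0, 0)]  # S(0, lam) = 0
    for j in range(1, n + 1):
        s, t, e = best[j - 1]
        candidates = [(s + lam, t, e + 1)]  # SNP j is excluded: pay lam
        for i in range(1, j + 1):  # block i..j is included: pay f(i, j)
            if block(i, j):
                s, t, e = best[i - 1]
                candidates.append((s + f(i, j), t + f(i, j), e))
        # Lowest score first; on ties prefer the smaller excluded length.
        best.append(min(candidates, key=lambda c: (c[0], c[2])))
    return best[n]


# Toy setting: any run of at most 2 SNPs is a block and needs 1 tag SNP.
toy_block = lambda i, j: j - i + 1 <= 2
toy_f = lambda i, j: 1

print(parametric_score(4, 0, toy_block, toy_f))     # everything excluded
print(parametric_score(4, 0.25, toy_block, toy_f))  # still cheaper to exclude
print(parametric_score(4, 1, toy_block, toy_f))     # two blocks, nothing excluded
```

In the toy setting, small λ yields the segment with intercept 0 and slope 4, and larger λ yields the constant segment with 2 tag SNPs, in line with the properties listed above.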
26
Compute S(n, λ)
The algorithm starts with S(n, 0) and S(n, ∞); let L_0 and L_∞ be the corresponding linear segments, intersecting at (x, y).
Case 1: if S(n, x) = y, then L_0 and L_∞ together define the entire function S(n, λ).
Case 2: if S(n, x) < y, divide the range of λ into two regions, [0, x] and [x, ∞], and repeat the above procedure for these two regions.
[Figure: S(n, λ) plotted against λ, showing the intersection point (x, y) and the point (x, S(n, x))]
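The divide-and-conquer search over λ can be sketched independently of the block partition itself. In the Python sketch below, `line_at` is a hypothetical callable that returns the (intercept, slope) of the segment optimal at a given λ, and a large finite `hi` stands in for λ = ∞. Both are assumptions made for illustration.

```python
def envelope_lines(line_at, lo, hi, eps=1e-9):
    """Collect the (intercept, slope) segments of S(n, lam) on [lo, hi]."""
    a0, b0 = line_at(lo)
    a1, b1 = line_at(hi)
    if b0 == b1:  # both ends lie on the same segment
        return [(a0, b0)]
    x = (a1 - a0) / (b0 - b1)  # intersection of the two end segments
    y = a0 + b0 * x
    a, b = line_at(x)
    if a + b * x >= y - eps:  # Case 1: the two segments describe the region
        return [(a0, b0), (a1, b1)]
    # Case 2: a lower segment exists at x, so split the region there.
    left = envelope_lines(line_at, lo, x, eps)
    right = envelope_lines(line_at, x, hi, eps)
    return left + [seg for seg in right if seg not in left]


# Toy stand-in for the dynamic program: the minimum over three fixed lines.
toy_lines = [(0, 4), (1, 1), (3, 0)]


def toy_line_at(lam):
    # Lowest value wins; on ties prefer the smaller slope.
    return min(toy_lines, key=lambda ab: (ab[0] + ab[1] * lam, ab[1]))


print(envelope_lines(toy_line_at, 0, 10))  # [(0, 4), (1, 1), (3, 0)]
```

Each recursive call either confirms that two known segments meet on the function or discovers a new segment, so the number of evaluations of `line_at` grows with the number of segments.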
27
Time Complexity of S(n, λ)
If L(·) is additive, the computational time for a fixed λ can be reduced to O(n) by using a recursion.
The time complexity for finding S(n, λ) is O(2^N^K * N * n) + O(K^2 * N^2 * n) + O(N*S*n), where S is the number of segments in S(n, λ).
When different block partitions contribute to the same score, the algorithm chooses the right-most segment, with the maximum number of tag SNPs and the minimum length of excluded intervals.
28
Problem 3
Finding local maximal haplotypes for a subset of samples: given K haplotypes consisting of n consecutive SNPs, and two integers, k ≤ K and m ≤ n, find all local maximal haplotypes that are shared by at least k samples and contain at least m SNPs.
29
Algorithm 3
Step 1. Let S be a super set containing the single set {all K samples}, so that |S| = 1, and let j = i.
Step 2. For every set S_w in S, split S_w into two sets if there exist two samples in S_w that disagree at the jth SNP.
Step 3. Report one local maximal haplotype if |j-i+1| ≥ m, |S_w| ≥ k, and there exist two samples that disagree at the (i-1)th SNP and two that disagree at the (j+1)th SNP.
Step 4. Stop if |S| = k; otherwise, let j = j+1 and go to Step 2.
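The steps above can be sketched as follows. This is a minimal Python interpretation, not the authors' implementation: it assumes no missing data, tries every start position i, discards sample sets that fall below k members, and stops extending once no set remains. The function name and the toy samples are illustrative.

```python
def local_maximal_haplotypes(samples, k, m):
    """Return (start, end, sample_indices) triples with 1-based, inclusive positions."""
    K, n = len(samples), len(samples[0])
    found = []
    for i in range(1, n + 1):
        groups = [list(range(K))]  # Step 1: one set holding all samples
        for j in range(i, n + 1):
            # Step 2: split every set by the allele at SNP j, keeping sets of size >= k.
            split = []
            for g in groups:
                by_allele = {}
                for s in g:
                    by_allele.setdefault(samples[s][j - 1], []).append(s)
                split.extend(v for v in by_allele.values() if len(v) >= k)
            groups = split
            if not groups:  # nothing is shared by k samples any more
                break
            # Step 3: report sets that cannot be extended to the left or right.
            for g in groups:
                left_max = i == 1 or len({samples[s][i - 2] for s in g}) > 1
                right_max = j == n or len({samples[s][j] for s in g}) > 1
                if j - i + 1 >= m and left_max and right_max:
                    found.append((i, j, tuple(g)))
    return found


# Toy data: three samples over four SNPs (1 = major, 2 = minor).
toy = [(1, 1, 2, 1), (1, 1, 2, 2), (2, 1, 2, 2)]
print(local_maximal_haplotypes(toy, k=2, m=2))
# [(1, 3, (0, 1)), (2, 3, (0, 1, 2)), (2, 4, (1, 2))]
```

Each reported triple gives the SNP interval and the indices of the samples that share the haplotype over it.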
30
Time Complexity of Algorithm 3
Let N be the length of the local maximal haplotypes shared by at most k samples.
The overall time complexity is O(K*N*n).
Example alleles of six samples at SNPs i to i+3:
SNP i+3: 1 1 1 1 1 1
SNP i+2: 1 1 2 2 1 1
SNP i+1: 1 1 2 1 1 2
SNP i:   1 1 1 1 2 2
[Figure: the sample sets being split step by step from SNP i to SNP i+3]
31
Results of Algorithm 1
The data set is the published haplotype data of human chromosome 21 from Patil et al.
It includes 20 haplotypes of 24,047 SNPs (at least 10% minor allele frequency) spanning about 32.4 Mb.
The parameter α for the block() function is set to 80%.
Figure a shows the relationship between the number of tag SNPs and the percentage of the total SNPs covered, whereas Figure b shows the same relationship with respect to actual genome length.
32
Results of Algorithm 2
Figure a shows the relationship between the percentage of the total number of SNPs included and the deletion parameter λ.
Figure b shows the relationship between the percentage of the total number of SNPs included and the number of tag SNPs.
33
Results of Algorithm 3
Local maximal haplotypes are defined here as those that are shared by at least 2 samples and contain at least 100 consecutive SNPs.