Published by Cornelius Johnston. Modified over 9 years ago.
1 TreeDT: Gene Mapping by Tree Disequilibrium Test
Authors: Petteri Sevon (Dept. of Computer Science & Finnish Genome Center, Univ. of Helsinki); Hannu T.T. Toivonen (Nokia Research Center, Univ. of Helsinki); Vesa Ollikainen (Finnish Genome Center, Univ. of Helsinki)
Advisor: Dr. Hsu
Graduate: Cheng-Wen Hong
2 Outline
1. Motivation
2. Objective
3. Introduction
4. Problem Background
5. Method
6. Algorithms
7. Related Work
8. Experiment
9. Conclusions
10. Personal Opinion
3 Motivation
The USA and England will finish the human gene mapping in 2003. In the long term, geneticists will study human gene sequence variation, the inheritance of complex traits, and the discovery of new disease susceptibility genes. This work is immensely important for human health.
4 Objective
We present a novel gene mapping method, TreeDT, that is effective at locating a disease-susceptibility gene for a given disease. The gene and its proteins can then be analyzed to understand the disease-causing mechanisms and to design new medicines.
5 Introduction
(1). Gene mapping aims at discovering a statistical connection from a given disease to a narrow region in the genome (chromosomes).
(2). Genetic markers along chromosomes provide data that can be used to discover associations between patient phenotypes (diseased vs. healthy) and chromosomal regions (i.e. potential disease gene loci).
(3). We introduce TreeDT, a novel method for gene mapping. It analyses the observed strings of markers using tree patterns that reflect the possible genetic history of a disease susceptibility (DS) gene, and it locates the DS gene loci effectively.
6 Introduction
(4). The contributions of TreeDT are:
(a) A novel approach to gene mapping using tree patterns.
(b) An efficient algorithm for generating and testing tree patterns.
(c) A method for estimating the statistical significance of findings.
7 Problem Background
(1). Marker Data: A genetic marker is a short polymorphic region in the DNA, denoted here by M1, M2, ... The different variants of DNA that different people have at a marker are alleles, denoted in our examples by 1, 2, 3, ... The collection of markers is a marker map, and the alleles of a chromosome at those markers constitute its haplotype (Figure 1). The input data consists of haplotypes of diseased and control persons.
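As a minimal sketch of the input data just described, the snippet below represents a haplotype as a tuple of alleles in marker-map order plus a disease-association flag. The names `Haplotype`, `marker_map`, `sample`, and `allele_counts`, and all values, are invented for illustration and do not come from the slides.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Haplotype(NamedTuple):
    alleles: Tuple[int, ...]  # one allele per marker, in marker-map order
    affected: bool            # True = disease-associated, False = control


# Hypothetical marker map and sample (illustrative values only).
marker_map = ["M1", "M2", "M3", "M4"]
sample = [
    Haplotype((1, 2, 1, 3), True),
    Haplotype((1, 2, 2, 3), True),
    Haplotype((2, 1, 1, 3), False),
    Haplotype((2, 2, 1, 1), False),
]


def allele_counts(haplotypes, marker_index):
    """Count how often each allele occurs at one marker."""
    return Counter(h.alleles[marker_index] for h in haplotypes)


counts_m1 = allele_counts(sample, 0)
```

Keeping the alleles in map order means that "the haplotype to the right of a location" is simply a slice of the tuple.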
8 Problem Background
(2). Linkage disequilibrium: All current carriers of a DS gene have inherited it from a founder who introduced the gene mutation to the population (Figure 2). The carriers therefore tend to share a haplotype that stays linked with the mutation locus. This is linkage disequilibrium (LD), a non-random association between nearby markers.
(3). Gene Mapping: Using linkage analysis to determine the relative position between two genes on a chromosome.
9 Problem Background
(4). Summary of Background and Problem: Markers located near the gene can be very informative: given an ancestor with a mutated gene, the descendants that inherit the gene are also likely to inherit alleles of nearby markers. The LD-based gene mapping problem can now be stated. The input consists of a marker map, a set of disease-associated haplotypes, and a set of control haplotypes on the given map. The task is to predict the location of a disease susceptibility gene on the map.
10 Method
Based on the observed haplotypes, TreeDT estimates the most likely coalescence tree at a number of locations along the analyzed chromosome and then assesses the subtree clustering of disease-associated haplotypes in these trees, using the tree disequilibrium test, which is intended for predicting the DS gene location.
11 Method
(1). Haplotype Prefix Trees: Given a location (potential gene locus) in the chromosome, the haplotypes to the right (or to the left) of the location can be organized into a prefix tree (Figures 3 and 4). TreeDT builds two prefix trees, one to the left and one to the right, between each pair of consecutive markers and tests their disequilibrium.
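The prefix-tree construction can be sketched as a simple trie insertion. This is an illustrative sketch, not the authors' implementation: the `Node` class, the `build_prefix_tree` function, and the convention that `location = i` means the gap just before the marker at 0-based index `i` are all assumptions made here.

```python
class Node:
    """One node of a haplotype prefix tree."""

    def __init__(self):
        self.children = {}  # allele -> child Node
        self.n = 0          # haplotypes passing through this node
        self.a = 0          # disease-associated haplotypes among them


def build_prefix_tree(haplotypes, affected, location, side):
    """Build the prefix tree on one side of a candidate location.

    side == "right": read markers location, location + 1, ...
    side == "left":  read markers location - 1, location - 2, ..., 0
    """
    root = Node()
    for alleles, is_affected in zip(haplotypes, affected):
        if side == "right":
            path = alleles[location:]
        else:
            path = alleles[:location][::-1]
        node = root
        node.n += 1
        node.a += int(is_affected)
        for allele in path:
            node = node.children.setdefault(allele, Node())
            node.n += 1
            node.a += int(is_affected)
    return root


# Hypothetical sample: four haplotypes over markers M1..M4.
haplotypes = [(1, 2, 1, 3), (1, 2, 2, 3), (2, 1, 1, 3), (2, 2, 1, 1)]
affected = [True, True, False, False]

# Candidate location between M2 and M3.
right_tree = build_prefix_tree(haplotypes, affected, 2, "right")
left_tree = build_prefix_tree(haplotypes, affected, 2, "left")
```

Each node stores the counts `n` and `a` of its subtree, which are the quantities the disequilibrium test needs.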
12 Method
(2). Tree Disequilibrium Test (for a haplotype prefix tree T)
H0: The disease-association statuses are randomly distributed in the leaves of T.
H1: The distribution of the disease-association statuses deviates in some subtrees of T from the overall distribution of statuses.
For measuring the disequilibrium: the test statistic Z_k is defined for a tree with a set S of k deviant subtrees T_1, ..., T_k, where a_i is the number of disease-associated haplotypes and n_i the total number of haplotypes in subtree T_i ∈ S, and p is the proportion of disease-associated haplotypes in the sample.
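The formula for Z_k itself is not reproduced in the text above. The sketch below assumes a chi-square-like form, Z_k = sum over i = 1..k of (a_i - n_i p)^2 / (n_i p (1 - p)), which is consistent with the symbols a_i, n_i, and p defined on this slide but should be checked against the original paper. The function name `z_statistic` is hypothetical.

```python
def z_statistic(subtrees, p):
    """Assumed form of Z_k for a chosen set of deviant subtrees.

    subtrees: list of (a_i, n_i) pairs, one per deviant subtree
    p: proportion of disease-associated haplotypes in the whole sample
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return sum((a - n * p) ** 2 / (n * p * (1.0 - p)) for a, n in subtrees)


# Hypothetical counts: half of the sample is disease-associated.
z1 = z_statistic([(2, 2)], p=0.5)          # one subtree, both leaves affected
z2 = z_statistic([(2, 2), (0, 1)], p=0.5)  # add a subtree with one control leaf
```

Under this assumed form, each term is the squared standardized deviation of one subtree's affected count from its expected value n_i p, so a subtree whose composition matches the overall sample contributes nothing.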
13 Method
(3). Significance Test
(a) Z_k is a measure of the disequilibrium of a given tree, at a certain location in the chromosome, with k given deviant subtrees.
(b) TreeDT finds for each k the set S of subtrees that maximizes Z_k. (Z_k can be efficiently maximized simultaneously for all k using a recursive algorithm.)
(c) Since the Z_k values for different degrees of freedom k are not comparable and the distribution of the maximized Z_k is very complex, TreeDT estimates the p value for each maximized Z_k (under H0). The p values are estimated by a permutation test.
(d) In order to get a single p value for the disequilibrium at a given location, a combined measure is used: the product of the lowest p values over all k from each side.
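Steps (c) and (d) can be illustrated with a generic permutation test. This is a simplified sketch under stated assumptions: the statistic here uses one fixed subtree rather than the maximized Z_k, the chi-square-like term is an assumed form, and the names `permutation_p_value`, `combined_measure`, and `toy_statistic` are hypothetical.

```python
import random


def permutation_p_value(statistic, labels, n_permutations=1000, seed=0):
    """Estimate P(statistic >= observed) under random relabelling (H0)."""
    observed = statistic(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute disease statuses over the leaves
        if statistic(shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_permutations + 1)


def combined_measure(p_left_by_k, p_right_by_k):
    """Product of the lowest p value over all k from each side."""
    return min(p_left_by_k.values()) * min(p_right_by_k.values())


def toy_statistic(labels):
    """Deviation of one fixed subtree (the first four leaves)."""
    n = 4
    a = sum(labels[:n])
    p = sum(labels) / len(labels)
    return (a - n * p) ** 2 / (n * p * (1 - p))


# Hypothetical leaf statuses: 1 = disease-associated, 0 = control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p_hat = permutation_p_value(toy_statistic, labels, n_permutations=2000)
combined = combined_measure({1: 0.2, 2: 0.05}, {1: 0.5, 2: 0.4})
```

Because the combined measure is a product of minima, it is not itself a p value, which is why the slides describe a further permutation level on top of it.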
14 Method
(e) The output of TreeDT is essentially a list of locations ranked by p value. A point prediction for the DS gene location is obtained by taking the best location; a (potentially fragmented) region of length L is obtained by taking the best locations until a length of L is covered.
(f) All three of these nested p value tests (for each tree and k, for each location, for the best location) can be carried out efficiently.
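The ranking and region-selection step in (e) might look like the following sketch. It assumes that each candidate location stands for a marker interval of equal length; the names `rank_locations` and `predicted_region` and the p values are hypothetical.

```python
def rank_locations(p_by_location):
    """Return locations ordered from lowest to highest p value."""
    return sorted(p_by_location, key=p_by_location.get)


def predicted_region(p_by_location, interval_length, target_length):
    """Take the best locations until target_length is covered.

    The result may be fragmented, i.e. the chosen intervals need not
    be adjacent on the map.
    """
    chosen = []
    covered = 0.0
    for location in rank_locations(p_by_location):
        if covered >= target_length:
            break
        chosen.append(location)
        covered += interval_length
    return sorted(chosen)


# Hypothetical p values for five marker intervals, each 1 cM long.
p_by_location = {0: 0.40, 1: 0.01, 2: 0.03, 3: 0.50, 4: 0.02}

ranking = rank_locations(p_by_location)
point_prediction = ranking[0]
region = predicted_region(p_by_location, interval_length=1.0, target_length=2.0)
```

In this toy input the 2 cM region consists of intervals 1 and 4, which shows how the predicted region can be fragmented.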
15 Algorithms
(1). Constructing Haplotype Prefix Trees: The haplotype prefix trees to the left and right of each analyzed location can be efficiently identified using a string-sorting algorithm.
(2). An Algorithm for Maximizing the Tree Disequilibrium Statistic Z_k: It is essential that the time complexity of the algorithm for maximizing Z_k is as low as possible, because it must be executed for each tree, location, and permutation in turn.
(3). INPUT: A haplotype prefix tree T. OUTPUT: Maximum values of Z_k in the tree T for each k. The time complexity of the algorithm is O(n^2), where n is the number of leaves (haplotypes) in the tree.
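One way to maximize Z_k for all k at once is a bottom-up recursion that merges the children's best values, in the style of a tree knapsack. The sketch below is not the authors' algorithm: it is a standard dynamic program, it assumes the chi-square-like term (a - n p)^2 / (n p (1 - p)) for each subtree, and the names `Node`, `term`, `merge`, and `max_z` are hypothetical.

```python
NEG = float("-inf")


class Node:
    def __init__(self, a, n, children=()):
        self.a = a                      # disease-associated leaves below
        self.n = n                      # all leaves below
        self.children = list(children)


def term(a, n, p):
    """Assumed contribution of one deviant subtree to Z_k."""
    return (a - n * p) ** 2 / (n * p * (1.0 - p))


def merge(x, y, k_max):
    """Combine best sums of two disjoint parts; index = subtrees used."""
    out = [NEG] * min(k_max + 1, len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if i + j >= len(out):
                break
            if xi > NEG and yj > NEG:
                out[i + j] = max(out[i + j], xi + yj)
    return out


def max_z(node, p, k_max):
    """best[k] = max sum over exactly k disjoint subtrees under node.

    Assumes k_max >= 1. Choosing the node itself excludes all of its
    descendants, so it only competes for k = 1.
    """
    best = [0.0]
    for child in node.children:
        best = merge(best, max_z(child, p, k_max), k_max)
    own = term(node.a, node.n, p)
    if len(best) == 1:
        best.append(own)                # leaf: the node is the only choice
    else:
        best[1] = max(best[1], own)
    return best


# Hypothetical tree: two affected leaves on one branch, two controls on the other.
branch_a = Node(2, 2, [Node(1, 1), Node(1, 1)])
branch_b = Node(0, 2, [Node(0, 1), Node(0, 1)])
root = Node(2, 4, [branch_a, branch_b])

best = max_z(root, p=0.5, k_max=4)
```

This sketch is written for clarity and has not been tuned to match the complexity bound quoted in the slide.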
16 Algorithms
(4). Multiple Nested Permutation Tests: The straightforward algorithm for a three-level nested permutation test using nested loops would have time complexity proportional to n^3, where n is the number of permutations at each level.
17 Related Work
(1). Several statistical methods exist to detect LD around a DS gene, but these methods are computationally heavy.
(2). Haplotype Pattern Mining (HPM) is based on analyzing the LD of sets of haplotype patterns.
(3). Transmission/Disequilibrium Tests (TDT) are an established way of testing association and linkage in a sample where linkage disequilibrium exists between the mutation locus and nearby marker loci.
(4). m-TDT is a multipoint variant of TDT that detects LD using haplotypes of several alleles.
18 Experiments
We compare TreeDT empirically to TDT, to m-TDT, and to HPM. We evaluate the methods on simulated data, generated to resemble a realistic population isolate. We used 100 data sets; each data set consisted of 200 disease-associated and 200 control chromosomes. The length of the analyzed region was 100 cM, with a map of 101 equidistantly spaced markers, each having 5 alleles.
19 Analysis of TreeDT
(1). First we assess the prediction accuracy (power) of TreeDT with different values of A, the proportion of disease-associated chromosomes that actually carry the mutation. For A=20% or 15% the accuracy is very good. With lower values of A the accuracy decreases, until with A=5% (challenging) the gene can be localized within a reasonable accuracy of 10-20 cM in only 20-30% of the data sets.
20 Analysis of TreeDT
(2). We evaluate the effect of the only parameter of TreeDT, the number of deviant subtrees (founders) that are searched for in each tree (Figure 5B). As we increase the number of founders (deviant subtrees), evidence about the gene location becomes more fragmented, but the upper limit of 6 subtrees gives consistently competitive results.
21 Analysis of TreeDT
Figure 5C shows the experimental relationship between power (ratio true positives / all positives) and overall p (ratio false positives / all negatives). For higher values of A the classification accuracy is extremely good, but with A=5% (challenging) the classification is no better than random guessing.
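The two ratios defined on this slide can be computed as in the sketch below. It assumes that each data set receives one overall p value and is called positive when that value is at or below a threshold; the function name `power_and_false_positive_rate` and all numbers are invented for illustration and are not results from the experiments.

```python
def power_and_false_positive_rate(p_values, has_gene, threshold):
    """Return (true positives / all positives, false positives / all negatives).

    p_values: overall p value reported for each data set
    has_gene: True if the data set really contains a DS gene
    """
    positives = sum(1 for g in has_gene if g)
    negatives = len(has_gene) - positives
    true_pos = sum(1 for p, g in zip(p_values, has_gene) if g and p <= threshold)
    false_pos = sum(1 for p, g in zip(p_values, has_gene) if not g and p <= threshold)
    return true_pos / positives, false_pos / negatives


# Hypothetical p values for six data sets, three with and three without a gene.
p_values = [0.01, 0.03, 0.20, 0.04, 0.50, 0.60]
has_gene = [True, True, True, False, False, False]

power, false_positive_rate = power_and_false_positive_rate(p_values, has_gene, 0.05)
```

Sweeping the threshold from 0 to 1 traces out a curve of the same kind as the one the slide describes; a classifier no better than random guessing gives points near the diagonal where the two ratios are equal.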
22 Comparison to other methods
(1). TreeDT, HPM, and m-TDT have practically identical performance in localizing the DS gene in the baseline setting (Figure 6A). TDT is clearly inferior to the other methods.
23 Comparison to other methods
(2). In a test setting with three founders who introduced the mutation to the population (Figure 6B), TreeDT has an edge over HPM, which in turn has an edge over m-TDT. TDT barely beats random guessing.
24 Comparison to other methods
(3). We compare the methods with a large amount of missing data (Figure 6C). HPM is the most robust with respect to missing data, but TreeDT is not much weaker than HPM. The performance of m-TDT degrades much more clearly.
Taken together, comparisons (1), (2), and (3) show that TreeDT is very competitive.
25 Conclusions
(1). TreeDT is a novel method for gene mapping, and our experiments show that TreeDT is effective in extreme conditions for gene mapping problems: with lots of noise (only 10%-20% of affected chromosomes carry the mutation, lots of missing data) and with small sample sizes (200 affected and 200 control chromosomes).
(2). TreeDT is competitive with other recent data mining methods.
26 Personal Opinion
We could look for a better statistic for the Tree Disequilibrium Test, one for which:
(1). The distribution of the maximized statistic is very simple, so that p values can be computed with low time complexity.
(2). The maximized statistics are comparable across different degrees of freedom.
(3). We could also look for other methods that do not use the tree approach.