
Slide 1: Bayesian Networks for Genetic Linkage Analysis (Lecture #7)

Slide 2: Bayesian Network for Recombination

[Figure: a two-locus, three-person pedigree network. For each locus i and person j there are maternal and paternal allele variables $L_{ij,m}$ and $L_{ij,f}$ and observed marker variables $X_{ij}$; the child's inheritance is governed by selector variables $S_{i3,m}$ and $S_{i3,f}$, and phenotype nodes $y_1, y_2, y_3$ are attached.]

$\theta_2$ is the recombination fraction between loci 1 and 2.

Slide 3: The Likelihood Function

The joint probability is a product over all local probability tables:

$P(l_{11m}, l_{11f}, x_{11}, l_{12m}, l_{12f}, x_{12}, l_{13m}, l_{13f}, x_{13}, l_{21m}, l_{21f}, x_{21}, l_{22m}, l_{22f}, x_{22}, l_{23m}, l_{23f}, x_{23}, s_{13m}, s_{13f}, s_{23m}, s_{23f} \mid \theta_2)$
$= P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{13m})\, P(s_{13f})\, P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2)$

The probability of the data is obtained by summing over all states of all hidden variables:

$\mathrm{Prob}(\text{data} \mid \theta_2) = P(x_{11}, x_{12}, x_{13}, x_{21}, x_{22}, x_{23}) = \sum_{l_{11m}, l_{11f}, \ldots, s_{23f}} \Bigl[ P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2) \Bigr]$

The result is a function of the recombination fraction. The ML estimate is the $\theta_2$ value that maximizes this function.
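To make the sum-product structure concrete, here is a minimal brute-force sketch in Python. It is illustrative only: the toy chain H1 → H2 → X and all probability values stand in for the pedigree's founder, selector, and marker variables, and are assumptions rather than the model above.

```python
import itertools

# A minimal sketch, assuming a hypothetical toy chain H1 -> H2 -> X.
def p_h1(h1):                      # founder prior, analogous to P(l_11m)
    return 0.5

def p_h2_given_h1(h2, h1, theta):  # "transition", analogous to P(s_23m | s_13m, theta)
    return 1 - theta if h2 == h1 else theta

def p_x_given_h2(x, h2):           # "emission", analogous to P(x_11 | l_11m, l_11f)
    return 0.9 if x == h2 else 0.1

def likelihood(x_obs, theta):
    """Prob(data | theta): sum over all states of all hidden variables."""
    total = 0.0
    for h1, h2 in itertools.product([0, 1], repeat=2):
        total += p_h1(h1) * p_h2_given_h1(h2, h1, theta) * p_x_given_h2(x_obs, h2)
    return total

# ML estimate: the theta value maximizing the likelihood (grid search).
thetas = [i / 100 for i in range(51)]     # recombination fraction in [0, 0.5]
theta_hat = max(thetas, key=lambda t: likelihood(1, t))
print(theta_hat)
```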

Slide 4: Locus-by-Locus Summation Order

- Sum over the variables of locus i before summing over the variables of locus i+1.
- Within a locus, sum over the allele variables $L_{ijt}$ (shown in orange in the figure) before summing over the selector variables $S_{ijt}$.
- This order yields a Hidden Markov Model (HMM).

Slide 5: Recall the Resulting HMM

[Figure: an HMM chain $S_1 \to S_2 \to S_3 \to \cdots \to S_{i-1} \to S_i \to S_{i+1} \to \cdots$, where each $S_i$ emits the observed locus data $X_i$; the disease-locus data $Y_{i-1}$ appears at its position in the chain.]

The compounded variable $S_i = (S_{i,1,m}, \ldots, S_{i,2n,f})$ is the inheritance vector, with $2^{2n}$ states, where n is the number of persons in the pedigree that have parents (non-founders). The compounded variable $X_i = (X_{i,1,m}, \ldots, X_{i,2n,f})$ is the data regarding locus i; similarly, we use $Y_i$ for the disease locus. To specify the HMM we explicate the transition matrices from $S_{i-1}$ to $S_i$ and the emission matrices $P(x_i \mid S_i)$. Note that these quantities have already been implicitly defined.
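For comparison, this HMM view lets the same likelihood be computed locus by locus with the forward algorithm. The sketch below assumes generic, hypothetical transition and emission matrices; the real matrices over inheritance vectors are the ones defined above.

```python
import numpy as np

def hmm_likelihood(prior, transitions, emissions):
    """Forward algorithm: P(x_1..x_L) for an HMM.

    prior:       (K,) initial distribution over the K states of S_1
    transitions: list of (K, K) matrices, transitions[i][s, t] = P(S_{i+1}=t | S_i=s)
    emissions:   list of (K,) vectors, emissions[i][s] = P(x_i | S_i = s)
    """
    alpha = prior * emissions[0]                  # alpha_1(s) = P(S_1=s) P(x_1 | s)
    for A, e in zip(transitions, emissions[1:]):  # one step per locus
        alpha = (alpha @ A) * e                   # sum out S_i, fold in next emission
    return alpha.sum()                            # sum over states of the last locus

# Toy usage with K = 4 states (a real inheritance vector has 2^{2n} states):
K = 4
prior = np.full(K, 1.0 / K)
A = np.full((K, K), 1.0 / K)   # placeholder transition matrix
e = np.full(K, 0.5)            # placeholder emission probabilities
print(hmm_likelihood(prior, [A, A], [e, e, e]))
```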

Slide 6: The Computational Task at Hand

Multidimensional multiplication/summation: compute a sum of products of local tables, as in the likelihood above. [The slide's equations were shown as images.]

Example: matrix multiplication. Computing a chain product such as ABC as (AB)C versus A(BC) gives the same result, but the two orders can have very different costs.
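A concrete illustration of how the order of operations changes the cost (the matrix sizes are illustrative assumptions, not from the slide):

```python
import numpy as np

# Chain product ABC: identical result either way, very different cost.
A = np.random.rand(10, 1000)
B = np.random.rand(1000, 10)
C = np.random.rand(10, 1000)

# (AB)C: 10*1000*10 + 10*10*1000     =     200,000 scalar multiplications
# A(BC): 1000*10*1000 + 10*1000*1000 =  20,000,000 scalar multiplications
left = (A @ B) @ C
right = A @ (B @ C)
assert np.allclose(left, right)   # same answer, a 100x difference in work
```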

Slide 7: An Example

The "Asia" network:

[Figure: the Asia Bayesian network. Edges: Visit to Asia → Tuberculosis; Smoking → Lung Cancer; Smoking → Bronchitis; Tuberculosis and Lung Cancer → Abnormality in Chest; Abnormality in Chest → X-Ray; Abnormality in Chest and Bronchitis → Dyspnea.]

Slide 8: Eliminating V

We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.

Initial factors: $P(v)\,P(s)\,P(t \mid v)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Eliminate v. Compute: $f_v(t) = \sum_v P(v)\,P(t \mid v)$, leaving the factors $f_v(t)\,P(s)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Note: $f_v(t) = P(t)$. In general, the result of elimination is not necessarily a probability term.

Slide 9: Eliminating S

We want to compute P(d). Need to eliminate: s, x, t, l, a, b.

Current factors: $f_v(t)\,P(s)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Eliminate s. Compute: $f_s(b,l) = \sum_s P(s)\,P(l \mid s)\,P(b \mid s)$, leaving $f_v(t)\,f_s(b,l)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Summing on s results in a factor with two arguments, $f_s(b,l)$. In general, the result of elimination may be a function of several variables.

Slide 10: Eliminating X

We want to compute P(d). Need to eliminate: x, t, l, a, b.

Current factors: $f_v(t)\,f_s(b,l)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Eliminate x. Compute: $f_x(a) = \sum_x P(x \mid a)$, leaving $f_v(t)\,f_s(b,l)\,f_x(a)\,P(a \mid t,l)\,P(d \mid a,b)$.

Note: $f_x(a) = 1$ for all values of a, since there is no evidence on X!

Slide 11: Eliminating T

We want to compute P(d). Need to eliminate: t, l, a, b.

Current factors: $f_v(t)\,f_s(b,l)\,f_x(a)\,P(a \mid t,l)\,P(d \mid a,b)$.

Eliminate t. Compute: $f_t(a,l) = \sum_t f_v(t)\,P(a \mid t,l)$, leaving $f_s(b,l)\,f_x(a)\,f_t(a,l)\,P(d \mid a,b)$.

Slide 12: Eliminating L

We want to compute P(d). Need to eliminate: l, a, b.

Current factors: $f_s(b,l)\,f_x(a)\,f_t(a,l)\,P(d \mid a,b)$.

Eliminate l. Compute: $f_l(a,b) = \sum_l f_s(b,l)\,f_t(a,l)$, leaving $f_x(a)\,f_l(a,b)\,P(d \mid a,b)$.

Slide 13: Eliminating A and B

We want to compute P(d). Need to eliminate: a, b.

Current factors: $f_x(a)\,f_l(a,b)\,P(d \mid a,b)$.

Eliminate a. Compute: $f_a(b,d) = \sum_a f_x(a)\,f_l(a,b)\,P(d \mid a,b)$. Then eliminate b: $f_b(d) = \sum_b f_a(b,d)$, which equals the desired $P(d)$.

Slide 14: Variable Elimination

- This process is called variable elimination.
- The actual computation is done in the elimination steps.
- The amount of computation depends on the order of elimination.
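The following is a compact, illustrative sketch of variable elimination over table factors, assuming binary variables and a dictionary-based factor representation; it is not the lecture's implementation.

```python
import itertools
from functools import reduce

# A factor is (vars_tuple, table), where table maps an assignment tuple
# of the factor's variables to a number.

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    variables = tuple(dict.fromkeys(fv + gv))          # union, order-preserving
    table = {}
    for assignment in itertools.product([0, 1], repeat=len(variables)):
        env = dict(zip(variables, assignment))
        table[assignment] = (ft[tuple(env[v] for v in fv)] *
                             gt[tuple(env[v] for v in gv)])
    return variables, table

def sum_out(var, f):
    """Sum a variable out of a factor: f'(rest) = sum_var f(var, rest)."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assignment, value in ft.items():
        key = tuple(a for v, a in zip(fv, assignment) if v != var)
        table[key] = table.get(key, 0.0) + value
    return keep, table

def eliminate(factors, order):
    """Eliminate variables in the given order; return the final factor."""
    for var in order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        factors = rest + [sum_out(var, reduce(multiply, involved))]
    return reduce(multiply, factors)
```

With the Asia factors encoded this way, for instance (('v',), {(0,): 0.99, (1,): 0.01}) as a hypothetical P(v), calling eliminate(factors, ['v', 's', 'x', 't', 'l', 'a', 'b']) leaves a factor over d alone, i.e., the table of P(d).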

Slide 15: Dealing with Evidence

How do we deal with evidence? Suppose we get the evidence V = t, S = f, D = t. We want to compute P(L, V = t, S = f, D = t).

Slide 16: Dealing with Evidence (cont.)

We start by writing the factors: $P(v)\,P(s)\,P(t \mid v)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$.

Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T|V) with the restricted factors P(V = t) and P(T | V = t). These "select" the appropriate parts of the original factors given the evidence. We now continue to eliminate as before.
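Evidence can be added to the hypothetical factor representation above with one small helper that restricts a factor to the observed value:

```python
def restrict(f, var, value):
    """Fix var = value in factor f, dropping var from its scope."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {tuple(a for v, a in zip(fv, assignment) if v != var): val
             for assignment, val in ft.items()
             if assignment[fv.index(var)] == value}
    return keep, table

# Usage sketch: restrict every factor mentioning V to its observed value,
# then run eliminate() on the remaining variables as before.
```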

Slide 17: Complexity of Variable Elimination

Space complexity is exponential in the number of variables in the largest intermediate factor; more precisely, it is the size of the largest intermediate factor (taking into account the number of values of each variable). Time complexity is the sum of the sizes of the intermediate tables.

Slide 18: Some Options for Improving Efficiency

1. Multiplying special probability matrices efficiently.
2. Grouping alleles together and removing inconsistent alleles.
3. Optimizing the elimination order of the variables in the Bayesian network.
4. Performing approximate calculations of the likelihood.

Slide 19: Sometimes Conditioning Is Needed

When intermediate tables become too large for the given RAM, even under the optimal elimination order, one can fix the values of some variables and iterate over those values. This decreases the table sizes, reducing space at the cost of extra time.
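In terms of the hypothetical eliminate() and restrict() helpers sketched earlier, conditioning replaces one elimination by an explicit loop over the variable's values; this is a sketch, not the implementation discussed later:

```python
def eliminate_with_conditioning(factors, cond_var, order):
    """Sum over cond_var by iterating its values instead of building the
    large intermediate table that eliminating it directly would create."""
    total = None
    for value in [0, 1]:                 # one full elimination pass per value
        restricted = [restrict(f, cond_var, value) if cond_var in f[0] else f
                      for f in factors]
        fv, ft = eliminate(restricted, order)
        total = ft if total is None else {k: total[k] + ft[k] for k in ft}
    return fv, total
```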

Slide 20: The Constrained Elimination Problem

- We define an optimization problem called the "constrained elimination problem".
- The solution of this problem optimizes variable elimination, with or without memory constraints.
- We start with the unconstrained version.

Slide 21: Two Operations on a Graph

Let $N_G(v)$ denote the set of vertices adjacent to v in G.

1. Eliminating vertex v from a (weighted) undirected graph G: make $N_G(v)$ a clique, then remove v and its incident edges from G.
2. Conditioning on vertex v in a (weighted) undirected graph G: remove v and its incident edges from G.
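A direct transcription of the two operations, with the graph stored as a dictionary mapping each vertex to its set of neighbors (names are illustrative):

```python
def eliminate_vertex(graph, v):
    """Make N_G(v) a clique, then remove v and its incident edges."""
    neighbors = graph[v]
    for u in neighbors:
        graph[u] |= neighbors - {u}   # connect every pair of v's neighbors
        graph[u].discard(v)
    del graph[v]

def condition_on_vertex(graph, v):
    """Remove v and its incident edges; no fill-in edges are added."""
    for u in graph[v]:
        graph[u].discard(v)
    del graph[v]

# Usage: graph = {'V': {'T'}, 'T': {'V', 'A'}, ...}, sets of neighbors.
```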

Slide 22: Example

[Figure: the original Bayes network over V, S, T, L, A, B, X, D and its undirected graph representation. Vertex weights: yellow nodes have w = 2, blue nodes have w = 4.]

$P(v,s,\ldots,x) = P(v)\,P(s)\,P(t \mid v)\,P(l \mid s)\,P(b \mid s)\,P(a \mid t,l)\,P(x \mid a)\,P(d \mid a,b)$

Slide 23: Elimination Sequence

An elimination sequence of G is an ordering of the vertices of G, written $X_\alpha = (X_{\alpha(1)}, \ldots, X_{\alpha(n)})$, where $\alpha$ is a permutation on {1, ..., n}. The residual graph $G_i$ is the graph obtained from $G_{i-1}$ by eliminating vertex $X_{\alpha(i-1)}$ ($G_1 \equiv G$). The cost of eliminating vertex v from a graph $G_i$ is the product of the weights of the vertices in $N_{G_i}(v)$. The cost of an elimination sequence $X_\alpha$ is the sum, over all i, of the cost of eliminating $X_{\alpha(i)}$ from $G_i$.

Slide 24: Example

Suppose the elimination sequence is $X_\alpha = (V, B, S, \ldots)$:

[Figure: the residual graphs $G_1$ (the original graph over V, S, T, L, A, B, D, X), $G_2$ (after eliminating V), and $G_3$ (after eliminating B).]

Slide 25: Relation to Treewidth

An optimal elimination sequence is one with minimal cost. The unconstrained elimination problem reduces to finding the treewidth if the weight of each vertex is constant and the cost of a sequence is taken to be the maximum, rather than the sum, of the sizes of the neighbor sets $N_{G_i}(X_{\alpha(i)})$ encountered during elimination. Finding the treewidth of a graph is known to be NP-complete (Arnborg et al., 1987). When no edges are added during elimination, the elimination sequence is perfect and the graph is chordal.
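In symbols, the reduction uses the classical elimination-order characterization of treewidth (a standard fact, stated here for completeness rather than taken from the slide):

```latex
\operatorname{tw}(G) \;=\; \min_{\alpha}\; \max_{1 \le i \le n}\; \bigl|\, N_{G_i}\bigl(X_{\alpha(i)}\bigr) \,\bigr|
```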

Slide 26: Constrained Elimination Sequence

A constrained elimination sequence is a pair $X_{\alpha,\beta} = ((X_{\alpha(1)}, \ldots, X_{\alpha(n)}), \beta)$, where $\beta$ is a binary vector of length n indicating, for each i, whether $X_{\alpha(i)}$ is eliminated from $G_i$ ($\beta(i) = 0$) or conditioned on, i.e., fixed ($\beta(i) = 1$). Optimal constrained elimination sequences yield optimal variable elimination under memory constraints.

Slide 27: Example

Suppose the constrained elimination sequence is $X_{\alpha,\beta} = ((V, B, S, \ldots), (0, 1, 0, \ldots))$, i.e., V is eliminated, B is conditioned on, and S is eliminated:

[Figure: the residual graphs $G_1$, $G_2$, and $G_3$ over V, S, T, L, A, B, D, X.]

Slide 28: Cost of a Constrained Elimination Sequence

The elimination cost of a constrained elimination sequence $X_{\alpha,\beta}$ is: [formula shown as an image on the original slide]. Intuitively, conditioning on a vertex multiplies the cost of all subsequent steps by that vertex's weight, since the remaining computation is repeated once for each of its values.

Slide 29: The Constrained Elimination Problem

Input: G(V, E, w) and a threshold T. Find a constrained elimination sequence $X_{\alpha,\beta}$ that satisfies:
1. Its elimination cost is minimal.
2. The elimination cost of each $X_{\alpha(i)}$ is lower than T.

Slide 30: Deterministic Greedy Algorithm

- Iteration i: a vertex $X_i$ whose elimination cost is minimal is chosen and eliminated.
- If the elimination cost of every vertex in $G_i$ is above T, then a vertex $X_i$ is chosen to be fixed (conditioned on) instead of eliminated.
- $n_i(X)$ is the number of cliques in $G_i$ that include X.

A code sketch covering both this and the stochastic variant appears after the next slide.

Slide 31: Stochastic Greedy Algorithm

- Iteration i: the three vertices with minimal elimination cost are found, and a coin is flipped to choose among them; the coin is biased according to the elimination costs of these vertices.
- If the elimination cost of every vertex in $G_i$ is above T, then a vertex $X_i$ is chosen to be fixed instead of eliminated.
- The whole procedure is repeated many times (say, 100), stopping early if the cost becomes low enough.
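A sketch of the stochastic greedy ordering under the definitions above (set k = 1 to recover the deterministic variant; the tie-breaking and the choice of which vertex to fix are assumptions, and this is not the SUPERLINK code):

```python
import math
import random

def elimination_cost(graph, weights, v):
    """Cost of eliminating v: product of the weights of its neighbors."""
    return math.prod(weights[u] for u in graph[v])

def stochastic_greedy_order(graph, weights, threshold, k=3):
    """Return (sequence, beta): beta[i] = 0 for eliminate, 1 for condition."""
    graph = {v: set(nbrs) for v, nbrs in graph.items()}   # work on a copy
    sequence, beta = [], []
    while graph:
        ranked = sorted(graph, key=lambda v: elimination_cost(graph, weights, v))
        if elimination_cost(graph, weights, ranked[0]) > threshold:
            v = ranked[0]                 # every vertex too costly: fix one
            beta.append(1)
            for u in graph[v]:            # conditioning: just drop the vertex
                graph[u].discard(v)
        else:
            top = ranked[:k]              # choose among the k cheapest vertices
            costs = [elimination_cost(graph, weights, u) for u in top]
            inv = [1.0 / c for c in costs]        # bias toward cheaper vertices
            v = random.choices(top, weights=inv)[0]
            beta.append(0)
            nbrs = graph[v]               # elimination: make N(v) a clique
            for u in nbrs:
                graph[u] |= nbrs - {u}
                graph[u].discard(v)
        del graph[v]
        sequence.append(v)
    return sequence, beta
```

Repeating stochastic_greedy_order many times and keeping the cheapest result implements the restart loop described on the slide.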

Slide 32: Stochastic Greedy vs. Deterministic Greedy

[Figure: experimental comparison of the two heuristics, shown as an image on the original slide.]

Slide 33: Standard Usage of Linkage

There are usually 5-15 markers. In large pedigrees, 20-30% of the persons are genotyped (namely, their $x_{ij}$ is measured). For each genotyped person, about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps). The user adds a locus called the "disease locus" and places it between two markers i and i+1. The recombination fractions $\theta'$ (between the disease locus and marker i) and $\theta''$ (between the disease locus and marker i+1) are the unknown parameters, estimated using the likelihood function. This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single such gene exists).

Slide 34: SUPERLINK

- Stage 1: each pedigree is translated into a Bayesian network.
- Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).
- Stage 3: an elimination order for the variables is determined, according to some heuristic.
- Stage 4: the likelihood of the pedigrees given the $\theta$ values is calculated using variable elimination, according to the elimination order determined in stage 3.
- Allele recoding and special matrix multiplication are used.

Slide 35: Experiment A (V1.0)

Same pedigree topology (57 people, no loops); increasing number of loci (each with 4-5 alleles). Run times are in seconds.

[Table: run-time results by pedigree size, shown as an image. Notable annotations: "over 100 hours", "out-of-memory", and pedigrees too big for Genehunter.]

Slide 36: Experiment B (V1.0)

Same pedigree topology (100 people, with loops); increasing number of loci (each with 5-10 alleles). Run times are in seconds.

[Table: run-time results by pedigree size, shown as an image. Notable annotations: "out-of-memory", pedigrees too big for Genehunter; Vitesse doesn't handle looped pedigrees.]

Slide 37: Experiment C (V1.0)

Same pedigree topology (5 people, no loops); increasing number of loci (each with 3-6 alleles). Run times are in seconds.

[Table: run-time results, shown as an image. Notable annotations: "out-of-memory" and "bus error".]

