1 Basic Model For Genetic Linkage Analysis. Prepared by Dan Geiger.
2 Using the Maximum Likelihood Approach. The probability of pedigree data Pr(data | θ) is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This is the ML estimate.
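The maximization itself can be carried out numerically once the likelihood function is in hand. Below is a minimal sketch, assuming a likelihood function `pedigree_likelihood(theta)` has already been constructed (the construction is the subject of the following slides); the particular function used here is only a hypothetical stand-in.

```python
# Minimal sketch: ML estimation of the recombination fraction by grid search.
# `pedigree_likelihood` is a hypothetical stand-in for Pr(data | theta).

def pedigree_likelihood(theta):
    # Hypothetical likelihood that happens to peak at theta = 0.1.
    return (theta ** 3) * ((1 - theta) ** 27)

def ml_estimate(likelihood, grid_size=1000):
    """Return the theta in [0, 0.5] that maximizes the likelihood."""
    grid = [0.5 * i / grid_size for i in range(grid_size + 1)]
    return max(grid, key=likelihood)

print(ml_estimate(pedigree_likelihood))  # 0.1 with this stand-in
```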
3 Constructing the Likelihood function. First, we need to determine the variables that describe the problem. There are many possible choices; some variables we can observe and some we cannot. L_ijm = maternal allele at locus i of person j; its values are the possible alleles l_i at locus i. L_ijf = paternal allele at locus i of person j; its values are the possible alleles l_i at locus i (same as for L_ijm). X_ij = unordered allele pair at locus i of person j; its values are pairs of i-th-locus alleles (l_i, l'_i). As a starting point, we assume that the data consists of an assignment to a subset of the variables {X_ij}; in other words, some (or all) persons are genotyped at some (or all) loci.
4 What are the relationships among the variables for a specific individual? [Diagram: L_11m (maternal allele at locus 1 of person 1) and L_11f (paternal allele at locus 1 of person 1) are parents of X_11 (unordered allele pair at locus 1 of person 1 = data).] P(L_11m = a) is the frequency of allele a. We use lower-case letters for states, writing, in short, P(l_11m). P(x_11 | l_11m, l_11f) = 0 or 1, depending on whether the unordered pair x_11 is consistent with the parental alleles (l_11m, l_11f).
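As a small illustration, here is a sketch of these two local probability tables in Python; the allele names and frequencies are assumptions made up for the example, not values taken from the slides.

```python
# Sketch of the two local tables above. The allele frequencies are assumed
# values for a hypothetical two-allele locus.

ALLELE_FREQ = {"A": 0.7, "a": 0.3}   # assumed population frequencies

def p_founder_allele(allele):
    """P(L_11m = a): the population frequency of allele a."""
    return ALLELE_FREQ[allele]

def unordered(pair):
    """Canonical form of an unordered allele pair."""
    return tuple(sorted(pair))

def p_genotype_given_alleles(x, l_m, l_f):
    """P(X_11 = x | L_11m = l_m, L_11f = l_f): 1 if the unordered pair x is
    consistent with the parental alleles (l_m, l_f), else 0."""
    return 1.0 if unordered(x) == unordered((l_m, l_f)) else 0.0
```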
5 What are the relationships among the variables across individuals? First attempt (correct, but not efficient, as we shall see): connect the offspring's alleles directly to the parents' alleles. [Diagram: mother's alleles L_11m, L_11f with genotype X_11; father's alleles L_12m, L_12f with genotype X_12; offspring's alleles L_13m, L_13f with genotype X_13.] P(l_13m | l_11m, l_11f) = 1/2 if l_13m = l_11m or l_13m = l_11f; P(l_13m | l_11m, l_11f) = 0 otherwise.
6 Probabilistic model for two loci. [Diagram: the locus-1 network (L_11m, L_11f, X_11; L_12m, L_12f, X_12; L_13m, L_13f, X_13) alongside the analogous network for locus 2 (L_21m, L_21f, X_21; L_22m, L_22f, X_22; L_23m, L_23f, X_23).] L_23m depends on whether L_13m got its value from L_11m or from L_11f, on whether a recombination occurred, and on the values of L_21m and L_21f. This is quite complex.
7 Adding a selector variable. [Diagram: S_13m, the selector of the maternal allele at locus 1 of person 3, is an additional parent of L_13m, the maternal allele at locus 1 of person 3 (offspring).] Selector variables S_ijm are 0 or 1 depending on which of the mother's two alleles is transmitted to person j at locus i. P(s_13m) = 1/2. P(l_13m | l_11m, l_11f, S_13m = 0) = 1 if l_13m = l_11m; P(l_13m | l_11m, l_11f, S_13m = 1) = 1 if l_13m = l_11f; P(l_13m | l_11m, l_11f, s_13m) = 0 otherwise.
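A sketch of this transmission table with the selector variable, continuing the hypothetical Python helpers above:

```python
# Sketch of the selector prior and the deterministic transmission table:
# the offspring's maternal allele copies one of the mother's two alleles,
# chosen by the selector S.

def p_selector(s):
    """P(S_13m = s): each value 0 or 1 has probability 1/2."""
    return 0.5 if s in (0, 1) else 0.0

def p_offspring_allele(l_child, l_m, l_f, s):
    """P(L_13m = l_child | L_11m = l_m, L_11f = l_f, S_13m = s).
    s = 0 selects the grandmaternal allele l_m, s = 1 the grandpaternal l_f."""
    chosen = l_m if s == 0 else l_f
    return 1.0 if l_child == chosen else 0.0
```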
8 Probabilistic model for two loci. [Diagram: the locus-1 network augmented with selectors S_13m and S_13f, alongside the locus-2 network augmented with S_23m and S_23f.]
9 Probabilistic Model for Recombination. [Diagram: the two-locus network, with edges from S_13m to S_23m and from S_13f to S_23f.] θ is the recombination fraction between loci 1 & 2.
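The edges from the locus-1 selectors to the locus-2 selectors carry the recombination model: with probability θ the grandparental origin flips between adjacent loci. A minimal sketch of that table:

```python
# Sketch of the selector transition between adjacent loci: a recombination
# (a flip of the selector) occurs with probability theta, otherwise the same
# grandparental allele is transmitted.

def p_selector_transition(s_next, s_prev, theta):
    """P(S_23m = s_next | S_13m = s_prev, theta)."""
    return theta if s_next != s_prev else 1.0 - theta
```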
10 Constructing the likelihood function I. [Diagram: the fragment S_13m, L_11m, L_11f, L_13m, X_11; X_11 is the observed variable, all other variables are not observed (hidden).] Joint probability: P(l_11m, l_11f, x_11, s_13m, l_13m) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) P(s_13m) P(l_13m | s_13m, l_11m, l_11f). Probability of data (sum over all states of all hidden variables): Prob(data) = P(x_11) = Σ_{l_11m} Σ_{l_11f} Σ_{s_13m} Σ_{l_13m} P(l_11m, l_11f, x_11, s_13m, l_13m).
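Here is a brute-force sketch of this sum, reusing the hypothetical helper functions from the earlier sketches (allele frequencies, genotype consistency, selector, and transmission tables):

```python
# Sketch: P(x_11) for the small fragment above, obtained by summing the product
# of the local tables over all states of the hidden variables. Reuses the
# assumed helpers defined in the earlier sketches.

from itertools import product

ALLELES = ["A", "a"]

def prob_of_data(x11):
    total = 0.0
    for l11m, l11f, l13m, s13m in product(ALLELES, ALLELES, ALLELES, (0, 1)):
        total += (p_founder_allele(l11m)
                  * p_founder_allele(l11f)
                  * p_genotype_given_alleles(x11, l11m, l11f)
                  * p_selector(s13m)
                  * p_offspring_allele(l13m, l11m, l11f, s13m))
    return total

print(prob_of_data(("A", "a")))  # 2 * 0.7 * 0.3 = 0.42 with the assumed frequencies
```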
11 Constructing the likelihood function II. The joint probability is the product over all local probability tables: P(l_11m, l_11f, x_11, l_12m, l_12f, x_12, l_13m, l_13f, x_13, l_21m, l_21f, x_21, l_22m, l_22f, x_22, l_23m, l_23f, x_23, s_13m, s_13f, s_23m, s_23f) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ) P(s_23f | s_13f, θ). The probability of the data is the sum over all states of all hidden variables: Prob(data | θ) = P(x_11, x_12, x_13, x_21, x_22, x_23) = Σ_{l_11m, l_11f, …, s_23f} [ P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ) P(s_23f | s_13f, θ) ]. The result is a function of the recombination fraction θ. The ML estimate is the value of θ that maximizes this function.
12 The Disease Locus I. [Diagram: the fragment from slide 10 with a phenotype variable Y_11 added as a child of X_11.] Phenotype variables Y_ij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfect recessive disease yields the penetrance probabilities: P(y_11 = sick | X_11 = (a,a)) = 1; P(y_11 = sick | X_11 = (A,a)) = 0; P(y_11 = sick | X_11 = (A,A)) = 0.
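A sketch of this penetrance table as a function; the 'sick'/'healthy' labels and allele names follow the slide, and the helper itself is hypothetical:

```python
# Sketch of the penetrance table for a fully penetrant recessive disease:
# the phenotype depends only on the unordered genotype at the disease locus.

def p_phenotype(y, x):
    """P(Y = y | X = x), y in {'sick', 'healthy'}, x an unordered allele pair."""
    p_sick = 1.0 if tuple(sorted(x)) == ("a", "a") else 0.0
    return p_sick if y == "sick" else 1.0 - p_sick

print(p_phenotype("sick", ("a", "a")))   # 1.0
print(p_phenotype("sick", ("A", "a")))   # 0.0
```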
13 The Disease Locus II. [Diagram: same fragment as the previous slide.] Note that in this model we assume the phenotype/disease depends only on the alleles of one locus. Also, we did not model levels of sickness.
14 Introducing a tentative disease locus. [Diagram: the two-locus network, with locus 1 as the marker locus and locus 2 as the disease locus; phenotype variables Y_21, Y_22, Y_23 are attached to X_21, X_22, X_23.] Disease locus: assume sick means x_ij = (a,a). The recombination fraction θ is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.
15 Locus-by-Locus Summation Order. Sum over the locus-i variables before summing over the locus-(i+1) variables, and within each locus sum over the allele variables (L_ijt) before summing over the selector variables (S_ijt). This order yields a Hidden Markov Model (HMM).
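In HMM terms, this summation order is the forward recursion over loci, with the selector variables as the hidden state and the genotype data at each locus as the emission. A generic sketch, where the transition and emission tables are placeholders for the matrices defined on the following slides:

```python
# Sketch of the forward (locus-by-locus) recursion. Hidden states are
# inheritance vectors; emissions are the observed data at each locus.

import numpy as np

def forward_likelihood(prior, transitions, emissions):
    """Pr(data | theta) for an HMM.
    prior:       (K,) initial distribution over hidden states
    transitions: list of (K, K) matrices, transitions[i][s_prev, s_next]
    emissions:   list of (K,) vectors, emissions[i][s] = P(x_i | s)
    """
    alpha = prior * emissions[0]
    for T, e in zip(transitions, emissions[1:]):
        alpha = (alpha @ T) * e
    return float(alpha.sum())
```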
16 Hidden Markov Models in General. Application in communication: the message sent is (s_1,…,s_m) but we receive (r_1,…,r_m); compute the most likely message sent. Application in speech recognition: the word said is (s_1,…,s_m) but we recorded (r_1,…,r_m); compute the most likely word said. Application in genetic linkage analysis: to be discussed now. [Diagram: a Markov chain of hidden variables, each emitting one observed variable.] This depicts the factorization P(s_1,…,s_m, r_1,…,r_m) = P(s_1) P(r_1 | s_1) ∏_{i=2}^{m} P(s_i | s_{i-1}) P(r_i | s_i).
17 Hidden Markov Model in Our Case. [Diagram: the chain of inheritance vectors S_1, S_2, …; the marker data X_i (and, at the disease locus, Y_i) is the observed child of each S_i.] The compound variable S_i = (S_{i,1,m}, S_{i,1,f}, …, S_{i,n,m}, S_{i,n,f}) is called the inheritance vector. It has 2^{2n} states, where n is the number of persons that have parents in the pedigree (non-founders). The compound variable X_i is the data regarding locus i; similarly, for the disease locus we use Y_i. To specify the HMM we need to write down the transition matrices from S_{i-1} to S_i and the matrices P(x_i | S_i). Note that these quantities have already been implicitly defined.
18 The Transition Matrix. Recall that θ is the recombination fraction between the two adjacent loci; θ in fact depends on i (which pair of loci), but this dependence is omitted from the notation. For a single meiosis, the selector keeps its value with probability 1 − θ and flips with probability θ, giving the 2 × 2 matrix [[1 − θ, θ], [θ, 1 − θ]]. In our example, with one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^{2n} × 2^{2n}, encoding the four options of recombination/non-recombination for the two parental meioses; it is the Kronecker product of two copies of the 2 × 2 matrix. For n non-founders, the transition matrix is the Kronecker product of 2n copies of the single-meiosis matrix, one factor per meiosis.
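A numpy sketch of this construction; the function names are made up for the example:

```python
# Sketch of the transition matrix between inheritance vectors at adjacent loci,
# built as a repeated Kronecker product of the 2x2 single-meiosis matrix.

import numpy as np

def single_meiosis(theta):
    """Same grandparental origin with prob 1 - theta, recombination with theta."""
    return np.array([[1 - theta, theta],
                     [theta, 1 - theta]])

def transition_matrix(theta, n_nonfounders):
    """2^(2n) x 2^(2n) transition matrix: one 2x2 factor per meiosis."""
    T = np.array([[1.0]])
    for _ in range(2 * n_nonfounders):
        T = np.kron(T, single_meiosis(theta))
    return T

print(transition_matrix(0.1, 1))  # the 4x4 matrix of the one-non-founder example
```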
19 Efficient Product. The Kronecker structure can be exploited: if we start with a vector of size 2^{2n}, applying a single 2 × 2 factor costs about 2^{2n} multiplications. Continuing recursively over all 2n factors yields a complexity of O(2n · 2^{2n}), far less than the O(2^{4n}) needed for multiplication by the full, explicitly formed matrix. With n = 10 non-founders, we drop from the non-feasible region to the feasible one.
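One way to realize this, as a sketch; the axis bookkeeping follows numpy's Kronecker ordering, and the check at the end refers to the transition_matrix sketch above:

```python
# Sketch of the efficient product: multiply a vector of length 2^(2n) by the
# Kronecker-structured transition matrix one 2x2 factor at a time, without
# ever forming the full matrix. Total work is O(2n * 2^(2n)).

import numpy as np

def kron_matvec(theta, v, n_meioses):
    """Compute v @ T, where T is the Kronecker product of n_meioses copies of
    the 2x2 single-meiosis matrix."""
    M = np.array([[1 - theta, theta],
                  [theta, 1 - theta]])
    v = np.asarray(v, dtype=float).reshape([2] * n_meioses)
    for axis in range(n_meioses):
        v = np.tensordot(v, M, axes=([axis], [0]))  # apply the factor on one bit
        v = np.moveaxis(v, -1, axis)                # restore the axis order
    return v.reshape(-1)

# Check against the explicit matrix from the previous sketch:
# np.allclose(kron_matvec(0.1, v, 2), v @ transition_matrix(0.1, 1))
```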
20 Probability of data at one marker locus given an inheritance vector. [Diagram: the model for locus 2.] P(x_21, x_22, x_23 | s_23m, s_23f) = Σ_{l_21m, l_21f, l_22m, l_22f, l_23m, l_23f} P(l_21m) P(l_21f) P(l_22m) P(l_22f) P(x_21 | l_21m, l_21f) P(x_22 | l_22m, l_22f) P(x_23 | l_23m, l_23f) P(l_23m | l_21m, l_21f, s_23m) P(l_23f | l_22m, l_22f, s_23f). The last five terms are always zero or one, namely, indicator functions.
21 Efficient computation. [Diagram: the model for locus 2, with the edges selected by the inheritance vector highlighted.] Assume only individual 3 is genotyped, with genotype {A_1, A_2}. For the inheritance vector with s_23m = 1 and s_23f = 0, the founder alleles L_21m and L_22f are not restricted by the data, while (L_21f, L_22m) have only two possible joint assignments, (A_1, A_2) or (A_2, A_1): p(x_21, x_22, x_23 | s_23m = 1, s_23f = 0) = p(A_1) p(A_2) + p(A_2) p(A_1). In general, every inheritance vector defines a subgraph of the Bayesian network above; from it we build a founder graph.
22 Efficient computation (continued). [Diagram: the model for locus 2; the black edges indicate the subgraph selected by the inheritance vector. The founder graph for this example has vertices L_21m, L_21f, L_22m, L_22f and an edge between L_21f and L_22m labelled {A_1, A_2}.] In general, every inheritance vector defines a subgraph, as indicated by the black lines above. Construct a founder graph whose vertices are the founder variables and in which there is an edge between two vertices if they have a common typed descendant; the label of an edge is the constraint dictated by the common typed descendant. Now find all consistent assignments for every connected component.
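A small sketch of this construction in Python; the data layout (who is typed, and which founder alleles the inheritance vector routes to them) is assumed for the example:

```python
# Sketch of founder-graph construction. For each typed individual we are given
# the pair of founder-allele variables that the inheritance vector routes to
# that individual (found by following the selectors), plus the observed
# genotype; each such individual contributes one labelled edge.

from collections import defaultdict

def build_founder_graph(typed_people):
    """typed_people: dict person -> ((founder_u, founder_v), genotype).
    Returns dict frozenset({founder_u, founder_v}) -> list of genotype labels."""
    edges = defaultdict(list)
    for person, ((u, v), genotype) in typed_people.items():
        edges[frozenset((u, v))].append(tuple(sorted(genotype)))
    return dict(edges)

# Slide 21's example: only individual 3 is typed, with genotype {A1, A2}, and
# the inheritance vector routes its alleles to L_21f and L_22m.
print(build_founder_graph({3: (("L_21f", "L_22m"), ("A1", "A2"))}))
```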
23 A Larger Example. [Figure: a descent graph over individuals 1–8, some of whom are typed with genotypes a,b; a,c; and b,d, and the corresponding founder graph over the founder alleles, with edges labelled {a,b}, {a,c}, and {b,d}; an example of a constraint-satisfaction graph.] Connect two nodes if they have a common typed descendant.
24 The Constraint Satisfaction Problem. [Figure: the founder graph from the previous slide.] The number of possible consistent alleles per non-isolated node is 0, 1 or 2, namely the size of the intersection of its adjacent edges' labels. For example, node 2 can take all possible alleles, node 6 can only be b, and node 3 can be assigned either a or b. For each non-singleton connected component: start with an arbitrary node and pick one of its values; this dictates all other values in the component; repeat with the other value if the node has one. So each non-singleton component yields at most two solutions.
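A sketch of this propagation for a single connected component; it assumes at most one labelled constraint per pair of nodes (with several common typed descendants one would intersect their labels first):

```python
# Sketch: enumerate the consistent assignments of one connected component of
# the founder graph. Fixing one node's value and propagating along the labelled
# edges determines the rest, so at most two assignments survive.

def component_assignments(nodes, edges):
    """nodes: set of founder-allele variables in the component.
    edges: dict frozenset({u, v}) -> unordered allele pair (the constraint).
    Returns a list of consistent assignments (dicts: node -> allele)."""
    start = next(iter(nodes))
    candidates = set()
    for pair, label in edges.items():          # candidate values for the start node
        if start in pair:
            candidates |= set(label)
    solutions = []
    for value in candidates:
        assignment, stack, ok = {start: value}, [start], True
        while stack and ok:
            u = stack.pop()
            for pair, label in edges.items():
                if u not in pair:
                    continue
                if assignment[u] not in label:
                    ok = False
                    break
                v = next(iter(pair - {u}), u)   # the other endpoint (u for a self-loop)
                forced = next(iter(set(label) - {assignment[u]}), assignment[u])
                if v in assignment:
                    if assignment[v] != forced:
                        ok = False
                        break
                else:
                    assignment[v] = forced
                    stack.append(v)
        if ok and len(assignment) == len(nodes):
            solutions.append(assignment)
    return solutions

# Hypothetical two-node component constrained by one {a, b} descendant:
# yields the two assignments {u: a, v: b} and {u: b, v: a}.
print(component_assignments({"u", "v"}, {frozenset({"u", "v"}): ("a", "b")}))
```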
25 Solution of the CSP. Since each non-singleton component yields at most two solutions, the likelihood is simply a product of sums, each sum having at most two terms; each component contributes one factor, and singleton components contribute the factor 1. In our example: 1 * [ p(a)p(b)p(a) + p(b)p(a)p(b) ] * p(d)p(b)p(a)p(c). Complexity: building the founder graph takes O(f² + n); note that solving general CSPs is NP-hard, but this special structure keeps the computation easy.
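Putting the pieces together, reusing component_assignments from the sketch above; the frequency table is an assumed example:

```python
# Sketch: the locus probability given an inheritance vector is a product over
# connected components of the founder graph; each non-singleton component
# contributes the sum, over its (at most two) consistent assignments, of the
# product of the population frequencies of the assigned alleles. Singleton
# components contribute a factor of 1 and are simply omitted.

from math import prod

def locus_probability(components, freq):
    """components: list of assignment lists, one per non-singleton component,
    as returned by component_assignments. freq: allele -> population frequency."""
    result = 1.0
    for assignments in components:
        result *= sum(prod(freq[a] for a in assignment.values())
                      for assignment in assignments)
    return result

# Hypothetical frequencies for the slide's example alleles:
freq = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
two_node = component_assignments({"u", "v"}, {frozenset({"u", "v"}): ("a", "b")})
print(locus_probability([two_node], freq))   # p(a)p(b) + p(b)p(a) = 0.24
```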
26 Summary
27 Road Map For Graphical Models.
Foundations:
- Probability theory: subjective versus objective
- Other formalisms for uncertainty (fuzzy, possibilistic, belief functions)
- Types of graphical models: directed, undirected, chain graphs, dynamic networks, factored HMMs, etc.
- Discrete versus continuous distributions
- Causality versus correlation
Inference:
- Exact inference: variable elimination, clique trees, message passing; using internal structure like determinism or zeroes; queries: MLE, MAP, belief update, sensitivity
- Approximate inference: sampling methods; loopy propagation (minimizing some energy function); variational methods
28 Road Map For Graphical Models.
Learning:
- Complete data versus incomplete data
- Observed variables versus hidden variables
- Learning parameters versus learning structure
- Scoring methods versus conditional-independence-test methods
- Exact scores versus asymptotic scores
- Search strategies versus optimal learning of trees/polytrees/TANs
Applications:
- Diagnostic tools: from printer problems to airplane failures
- Medical diagnosis
- Error-correcting codes: turbo codes
- Image processing
- Applications in bioinformatics: gene mapping; regulatory, metabolic, and other network learning