DATA ANALYSIS Module Code: CA660 Lecture Block 7.

Presentation on theme: "DATA ANALYSIS Module Code: CA660 Lecture Block 7."— Presentation transcript:

1 DATA ANALYSIS Module Code: CA660 Lecture Block 7

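2 Examples in Genomics and Trait Models
Genetic traits may be controlled by a number of genes, usually unknown. Taking the "genetic effect" as one genotypic term, a simple model for the trait is
$y_{ij} = \mu + G_i + \varepsilon_{ij}$,
where $y_{ij}$ is the trait value for genotype $i$ in replication $j$, $\mu$ is the mean, $G_i$ the genetic effect for genotype $i$ and $\varepsilon_{ij}$ the errors.
If we assume Normality (and want random effects) and assume zero covariance between genetic effects and error, then $G_i \sim N(0, \sigma_G^2)$, $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$ and the phenotypic variance is $\sigma_p^2 = \sigma_G^2 + \sigma_\varepsilon^2$.
Note: if the same genotype is replicated $b$ times in an experiment and phenotypic means are used, the error variance is averaged over $b$: $\sigma_{\bar{p}}^2 = \sigma_G^2 + \sigma_\varepsilon^2 / b$.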

3 Example - Trait Models contd.
What about Environment and G × E interactions? Extension to the simple model:
$y_{ijk} = \mu + E_i + B_{j(i)} + G_k + (GE)_{ik} + \varepsilon_{ijk}$
ANOVA table: randomized blocks within environment; number of sets/blocks within each environment = $b$ = replications. Focus is on the genotype effect.

Source         dof            Expected MSQ
Environment    e-1            (know there are differences)
Blocks         (b-1)e         (again, know there are differences)
Genotypes      g-1            $\sigma_\varepsilon^2 + b\sigma_{GE}^2 + be\sigma_G^2$
G × E          (g-1)(e-1)     $\sigma_\varepsilon^2 + b\sigma_{GE}^2$
Error          (b-1)(g-1)e    $\sigma_\varepsilon^2$

Note: individuals are blocked within each of multiple environments, so the environmental effect is intrinsic to the error. The model form is standard, but the only meaningful comparisons are within environment; hence the random error takes the form of the population variance, $\sigma_\varepsilon^2$, and the random effects of interest come from the additional variances and their ratios. Genotypic effects are measured within blocks.

4 Example contd.
HERITABILITY = ratio of genotypic to phenotypic variance: $H^2 = \sigma_G^2 / \sigma_p^2$.
Depending on the relationship among genotypes, the interpretation of the genotypic variance differs. It may contain additive, dominance and other interaction variances (the ratio above is heritability in the broad sense).
For some experimental or mating schemes, an additive genetic variance $\sigma_A^2$ may be calculated. Narrow (specific) sense heritability is then $h^2 = \sigma_A^2 / \sigma_p^2$.
Again, if phenotypic means are used, a mean-based heritability is obtained for $b$ replications, with the phenotypic variance of the genotype means in the denominator.
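
As a rough illustration of the heritability ratio, the sketch below estimates variance components from a balanced one-way layout under the simple model $y_{ij} = \mu + G_i + \varepsilon_{ij}$ (no environment term), using the usual moment relation E(MS genotypes) = $\sigma_\varepsilon^2 + b\sigma_G^2$. The trait values and the function name `variance_components` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from statistics import mean


def variance_components(data):
    """data[i][j] = trait value for genotype i, replication j (balanced design)."""
    g, b = len(data), len(data[0])
    grand = mean(v for row in data for v in row)
    geno_means = [mean(row) for row in data]
    # Mean square between genotypes and mean square for error (within genotypes)
    ms_geno = b * sum((m - grand) ** 2 for m in geno_means) / (g - 1)
    ms_error = sum((v - m) ** 2
                   for row, m in zip(data, geno_means) for v in row) / (g * (b - 1))
    var_g = (ms_geno - ms_error) / b  # from E(MS_geno) = var_e + b * var_g
    return var_g, ms_error, b


# Hypothetical trait values: 3 genotypes, 2 replications each
trait = [[10.0, 12.0], [14.0, 16.0], [18.0, 20.0]]
var_g, var_e, b = variance_components(trait)

h2_single = var_g / (var_g + var_e)      # broad-sense, single-observation basis
h2_mean = var_g / (var_g + var_e / b)    # mean-based, error variance averaged over b
print(var_g, var_e, round(h2_single, 4), h2_mean)
```

With these made-up numbers the genotypic variance estimate is 15 and the error variance 2, so the mean-based heritability (15/16) exceeds the single-observation value (15/17), as expected when the error variance is averaged over replications.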

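5 Extended Example - Two related traits
Have
$y_{1ij} = \mu_1 + G_{1i} + \varepsilon_{1ij}$, $y_{2ij} = \mu_2 + G_{2i} + \varepsilon_{2ij}$,
where 1 and 2 denote traits, $i$ the gene and $j$ an individual in the population. Then $y$ is the trait value, $\mu$ the overall mean, $G$ the genetic effect and $\varepsilon$ the random error.
To quantify the relationship between the two traits, use the variance-covariance matrices for phenotypic, genetic and environmental effects, $\Sigma_p$, $\Sigma_g$ and $\Sigma_e$:
$\Sigma_x = \begin{pmatrix} \sigma_{x1}^2 & \sigma_{x12} \\ \sigma_{x12} & \sigma_{x2}^2 \end{pmatrix}$, $x \in \{p, g, e\}$.
So the correlations between traits in terms of phenotypic, genetic and environmental effects are
$\rho_x = \sigma_{x12} / \sqrt{\sigma_{x1}^2 \sigma_{x2}^2}$, $x \in \{p, g, e\}$.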

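6 MAXIMUM LIKELIHOOD ESTIMATION
Recall general points: estimation, and the definition of the Likelihood function for a vector of parameters $\theta$ and a set of (independent) values $x$: $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$. Finding the most likely value of $\theta$ means maximising the Likelihood function.
Also defined: the Log-likelihood (Support function) $S(\theta) = \log L(\theta)$ and its derivative, the Score, $\partial S(\theta) / \partial\theta$, together with the Information content per observation, which for a single-parameter likelihood is given by
$I(\theta) = E\left[\left(\frac{\partial \log f(x;\theta)}{\partial\theta}\right)^2\right] = -E\left[\frac{\partial^2 \log f(x;\theta)}{\partial\theta^2}\right]$.
Why MLE? (Need to know the underlying distribution.) Properties: consistency; sufficiency; asymptotic efficiency (linked to variance); unique maximum; invariance and, hence, most convenient parameterisation; usually MVUE; amenable to conventional optimisation methods.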

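7 VARIANCE, BIAS & CONFIDENCE
Variance of an estimator, usual form: $\mathrm{Var}(\hat\theta) = E[(\hat\theta - E(\hat\theta))^2]$, or for $k$ independent estimates $\hat\theta_1, \dots, \hat\theta_k$ with mean $\bar\theta$: $\widehat{\mathrm{Var}}(\hat\theta) = \frac{1}{k-1}\sum_{i=1}^{k} (\hat\theta_i - \bar\theta)^2$.
For a large sample, the variance of the MLE can be approximated by $\mathrm{Var}(\hat\theta) \approx 1 / (n I(\hat\theta))$; it can also be estimated empirically, using re-sampling techniques.
Variance of a linear function of several estimates (a common need in genomics analysis, e.g. heritability): $\mathrm{Var}\left(\sum_i a_i \hat\theta_i\right) = \sum_i a_i^2 \mathrm{Var}(\hat\theta_i) + 2\sum_{i<j} a_i a_j \mathrm{Cov}(\hat\theta_i, \hat\theta_j)$.
Recall the Bias of the estimator, $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$; then the Mean Square Error is defined to be $\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2]$, which expands to $\mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2$, so we have the basis for C.I. and tests of hypothesis.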

8 COMMONLY-USED METHODS of obtaining MLE
- Analytical: solving $dS(\theta)/d\theta = 0$ (or $\partial S / \partial\theta_i = 0$ for several parameters) when simple solutions exist
- Grid search or likelihood profile approach
- Newton-Raphson iteration methods
- EM (expectation and maximisation) algorithm
N.B. Work with the Log-likelihood, because:
- its maximum is at the same value of $\theta$ as that of the Likelihood
- it is easier to compute
- there is a close relationship between the statistical properties of the MLE and the Log-likelihood

9 METHODS in brief
Analytical: recall the Binomial example earlier.
Example: for the Normal, the MLEs of mean and variance (taking derivatives w.r.t. mean and variance separately) are $\hat\mu = \frac{1}{N}\sum_i x_i$ and $\hat\sigma^2 = \frac{1}{N}\sum_i (x_i - \hat\mu)^2$, equivalent to the sample mean and the actual variance (i.e. divisor N). The variance estimate is unbiased if the mean is known, biased if not.
Invariance: one-to-one relationships are preserved.
Used: when the MLE has a simple solution.
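
A small numerical check of the Normal case may help: the sketch below computes the MLE of the variance (divisor N) and the usual bias-corrected version (divisor N-1) for a made-up sample, and compares them with Python's `statistics.pvariance` and `statistics.variance`. The sample values are hypothetical.

```python
from statistics import fmean, pvariance, variance

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
n = len(sample)

mu_hat = fmean(sample)                                    # MLE of the mean
sigma2_mle = sum((x - mu_hat) ** 2 for x in sample) / n   # MLE of variance, divisor N
sigma2_unbiased = sigma2_mle * n / (n - 1)                # divisor N-1 correction

print(mu_hat, sigma2_mle, round(sigma2_unbiased, 4))
# pvariance uses divisor N, variance uses divisor N-1
print(pvariance(sample), round(variance(sample), 4))
```

Here the MLE of the variance is 4.0 while the divisor N-1 estimate is 32/7, which shows the downward bias of the MLE when the mean is itself estimated.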

10 Methods for MLE's contd.
Grid Search (computational): plot the likelihood or log-likelihood vs the parameter. Various features:
Relative Likelihood = Likelihood / Max. Likelihood (ML set = 1). The peak of the R.L. can be identified visually or sought algorithmically.
e.g. plotting the likelihood over the parameter-space range gives 2 peaks, symmetrical around $\theta = 0.5$: the likelihood profile for the well-known mixed linkage phase problem in linkage analysis. If we e.g. constrain $0 \le \theta \le 0.5$, the MLE = R.F. between genes (possible mixed linkage phase).

11 contd.
Graphic/numerical implementation: take an initial estimate of $\theta$; the direction of search is determined by evaluating the likelihood on both sides of $\theta$. The search takes the direction giving an increase. Initial search increments are large, e.g. 0.1; when the likelihood change starts to decrease or becomes negative, stop and refine the increment.
Multiple peaks: the search can miss the global maximum, and is computationally intensive.
Multiple parameters: grid search. Interpretation of likelihood profiles can be difficult.
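
The search procedure described above can be sketched as follows. This is a minimal illustration, not the lecture's own code: it assumes a single-parameter binomial-type log-likelihood $S(\theta) = b\log\theta + a\log(1-\theta)$ with hypothetical counts a = 8, b = 2 (which has a single peak), and the function names are made up.

```python
import math


def log_lik(theta, a=8, b=2):
    """Assumed log-likelihood: b*log(theta) + a*log(1 - theta)."""
    return b * math.log(theta) + a * math.log(1.0 - theta)


def grid_search(loglik, theta=0.5, step=0.1, tol=1e-6):
    """Move in whichever direction increases the log-likelihood;
    when neither side improves, refine the increment by a factor of 10."""
    best = loglik(theta)
    while step > tol:
        moved = False
        for candidate in (theta - step, theta + step):
            if 0.0 < candidate < 1.0:          # stay inside the parameter space
                value = loglik(candidate)
                if value > best:
                    theta, best, moved = candidate, value, True
                    break
        if not moved:
            step /= 10.0                       # refine the increment
    return theta


theta_hat = grid_search(log_lik)
print(round(theta_hat, 5))
```

For this single-peaked example the search settles at about 0.2 (= b/(a+b)). With several peaks the result would depend on the starting value, which is the weakness noted on the slide.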

12 Example
Recall Exs 2, Q. 8. Data are used to show a linkage relationship between a marker and a "rust-resistant" gene.
Escapes = individuals who are susceptible, but show no disease (rust) phenotype under experimental conditions. So define two parameters: the proportion of escapes and the R.F. One minus the proportion of escapes is the penetrance for the disease trait, i.e. P{individual with susceptible genotype has disease phenotype}.
Purpose of the experiment: typically to estimate the R.F. between marker and gene.
Use: Support function = Log-Likelihood.

13 Example contd.
Set the 1st derivatives (Scores) w.r.t. each parameter = 0. The expected value of the Score w.r.t. the R.F. is zero (see analogies in classical sampling/hypothesis testing). Similarly for the proportion of escapes. Here, however, there is no simple analytical solution, so we cannot solve directly for either; a grid search is used to find where the likelihood reaches its maximum.
In general, this type of experiment tests H0: independence between marker and gene, and H0: no escapes. It uses Likelihood Ratio Test statistics (the MLE equivalent of $\chi^2$).
N.B. Moment estimates solve a slightly different problem, because there is no information on expected frequencies (not the same as MLE).

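14 MLE Estimation Methods contd.
Newton-Raphson Iteration
Have Score($\theta$) = 0 from previously. N-R consists of replacing the Score by the linear terms of its Taylor expansion, so if $\theta''$ is a solution and $\theta'$ the 1st guess:
$\mathrm{Score}(\theta'') \approx \mathrm{Score}(\theta') + (\theta'' - \theta')\,\frac{d\,\mathrm{Score}(\theta')}{d\theta} = 0 \;\Rightarrow\; \theta'' = \theta' - \mathrm{Score}(\theta') \Big/ \frac{d\,\mathrm{Score}(\theta')}{d\theta}$
Repeat with $\theta''$ replacing $\theta'$. Each iteration fits a parabola to the Log-likelihood function.
Problems: multiple peaks, zero Information, extreme estimates.
Multiple parameters: need matrix notation, where the S matrix e.g. has elements = derivatives of $S(\theta_1, \theta_2)$ w.r.t. $\theta_1$ and $\theta_2$ respectively. Similarly, the Information matrix has terms of the form $-E[\partial^2 S / \partial\theta_1 \partial\theta_2]$. Estimates are $\boldsymbol\theta'' = \boldsymbol\theta' + \mathbf{I}^{-1}(\boldsymbol\theta')\,\mathbf{S}(\boldsymbol\theta')$.
[Figure: L.F. plotted against $\theta$, with the 1st and 2nd iterates marked. Annotation: Variance of Log-L, i.e. S($\theta$).]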
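
A single-parameter Newton-Raphson iteration can be sketched as below. The score used here is an assumption for illustration: it comes from a binomial-type log-likelihood $S(\theta) = b\log\theta + a\log(1-\theta)$ with hypothetical counts a = 8, b = 2, whose MLE is known analytically (b/(a+b) = 0.2), so the iteration can be checked.

```python
def newton_raphson(score, d_score, theta, tol=1e-10, max_iter=50):
    """Solve score(theta) = 0 by repeated linearisation of the score."""
    for _ in range(max_iter):
        step = score(theta) / d_score(theta)
        theta -= step                 # theta'' = theta' - Score / dScore
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


a, b = 8, 2  # hypothetical counts


def score(t):
    # First derivative of b*log(t) + a*log(1 - t)
    return b / t - a / (1 - t)


def d_score(t):
    # Second derivative of the log-likelihood (always negative here)
    return -b / t ** 2 - a / (1 - t) ** 2


theta_hat = newton_raphson(score, d_score, theta=0.1)
print(round(theta_hat, 6))
```

Starting from 0.1 the iterates increase steadily towards 0.2. A poor starting value, or a likelihood with several peaks, could instead send the iteration outside (0, 1) or to a local maximum, which is why the slide lists these as problems.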

15 Methods contd.
Expectation-Maximisation Algorithm: iterative, for incomplete data. (Much genomic data fits this situation, e.g. linkage analysis with marker genotypes of F2 progeny. Usually 9 categories are observed for a 2-locus, 2-allele model, but 16 = complete information, while 14 give information on linkage. Some are hidden, but if the linkage parameter is known, expected frequencies can be predicted, as you know, and the complete data restored using expectation.)
Steps:
(1) Expectation: estimates statistics of the complete data, given the observed incomplete data.
(2) Maximisation: uses the estimated complete data to give the MLE.
Iterate until convergence (no further change).

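16 E-M contd.
Implementation: an initial guess, $\theta'$, is chosen (e.g. = 0.25, say, = R.F.). Taking this as "true", the complete data are estimated by distributional statements, e.g. P(individual is recombinant, given observed genotype) for R.F. estimation. The MLE estimate $\theta''$ is then computed. This, for the R.F., is the sum of recombinants / N. Thus the MLE, for $f_i$ the observed count in genotype class $i$, is
$\theta'' = \frac{1}{N}\sum_i f_i\, P(R \mid G_i, \theta')$
Convergence: $\theta'' = \theta'$, or $|\theta'' - \theta'|$ smaller than a chosen tolerance.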
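
The E and M steps can be sketched for one concrete case. The example below is an assumption-laden illustration, not the lecture's own worked example: it takes F2 progeny from a coupling-phase cross with two codominant markers (9 observed genotype classes), the same recombination fraction in both parents, and counts recombinant gametes, so the divisor is 2N rather than N. To avoid inventing data, the "observed" counts are simply the expected counts for $\theta = 0.2$ and N = 200, so the EM estimate should return to 0.2.

```python
def f2_probs(theta):
    """Class probabilities for F2 progeny (coupling phase, codominant markers)."""
    nr, r = (1 - theta) ** 2, theta ** 2
    single = theta * (1 - theta) / 2
    return {
        "AABB": nr / 4, "aabb": nr / 4,          # two non-recombinant gametes
        "AAbb": r / 4, "aaBB": r / 4,            # two recombinant gametes
        "AABb": single, "AaBB": single,          # exactly one recombinant gamete
        "Aabb": single, "aaBb": single,
        "AaBb": (nr + r) / 2,                    # ambiguous: 0 or 2 recombinants
    }


def expected_recombinants(genotype, theta):
    """E-step: expected number of recombinant gametes given the genotype."""
    if genotype in ("AABB", "aabb"):
        return 0.0
    if genotype in ("AAbb", "aaBB"):
        return 2.0
    if genotype == "AaBb":
        return 2 * theta ** 2 / (theta ** 2 + (1 - theta) ** 2)
    return 1.0


def em_rf(counts, theta=0.25, tol=1e-10, max_iter=500):
    n_gametes = 2 * sum(counts.values())
    for _ in range(max_iter):
        # E-step: restore the expected complete data under the current theta
        total = sum(f * expected_recombinants(g, theta) for g, f in counts.items())
        # M-step: MLE from the completed data = recombinant gametes / all gametes
        new_theta = total / n_gametes
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


counts = {g: 200 * p for g, p in f2_probs(0.2).items()}  # expected counts, theta = 0.2
theta_hat = em_rf(counts)
print(round(theta_hat, 6))
```

Only the double heterozygote class AaBb is ambiguous here; all other classes contribute a fixed number of recombinant gametes, so the E-step changes between iterations only through that class.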

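17 LIKELIHOOD: C.I. and H.T.
Likelihood Ratio Test, c.f. with $\chi^2$. The principal advantage of G is power, as unknown parameters are involved in the hypothesis test.
Have: the likelihood of $\theta$ taking the value $\theta_A$ which maximises it, i.e. its MLE, and the likelihood of $\theta$ under $H_0: \theta = \theta_N$ (e.g. $\theta_N = 0.5$).
Form of the L.R. test statistic: $G = -2\log[L(\theta_N) / L(\theta_A)]$ or, conventionally, $G = 2\log[L(\theta_A) / L(\theta_N)]$; choose this form, as it is easier to interpret.
Distribution: G ~ approx. $\chi^2$ (d.o.f. = difference in dimension of the parameter spaces for $L(\theta_A)$, $L(\theta_N)$).
Goodness of fit (notation as for $\chi^2$): $G = 2\sum_{i=1}^{n} O_i \log(O_i / E_i) \sim \chi^2_{n-1}$.
Independence (notation again as for $\chi^2$): $G = 2\sum_i \sum_j O_{ij} \log(O_{ij} / E_{ij})$.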

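18 Power - Example extended
Under $H_0: \theta = 0.5$, at level of significance $\alpha = 0.05$, suppose the true $\theta = \theta_1 = 0.2$, so if n = 25, $E\{G\} = 2n[\theta_1 \log(\theta_1 / 0.5) + (1 - \theta_1)\log((1 - \theta_1) / 0.5)] \approx 9.6$ (e.g. in genomics this might apply where R.F. = 0.2 between two genes, as opposed to 0.5).
(Natural logs are used, though either is possible in practice; hence the generic form "Log" rather than "Ln" here. Assume Ln throughout for genetic/genomic examples unless otherwise indicated.)
The rejection region at the 0.05 level is $G \ge 3.84$.
If we sketch the curves, P{LRTS falls in the acceptance region} = 0.13 = probability of a false negative when the actual value of $\theta$ = 0.2.
If the sample size is increased, e.g. n = 50, E{G} = 19 and it is easy to show that P{false negative} = 0.01.
Generally, power for these tests is given by $\mathrm{Power} = P\{\chi^2_{df}(\lambda) \ge \chi^2_{\alpha, df}\}$, where $\chi^2_{df}(\lambda)$ is non-central chi-squared with non-centrality $\lambda = E\{G\}$.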
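
The power figures quoted above can be approximately reproduced with a short calculation. The sketch assumes that the expected LRTS is used as the non-centrality parameter of a 1 d.o.f. non-central chi-squared, which can be written as $(Z + \sqrt{\lambda})^2$ with $Z$ standard Normal; the function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def expected_g(n, theta_true, theta_null=0.5):
    """Expected LRTS for testing theta_null when theta_true holds (natural logs)."""
    return 2 * n * (theta_true * log(theta_true / theta_null)
                    + (1 - theta_true) * log((1 - theta_true) / (1 - theta_null)))


def false_negative_1df(ncp, alpha=0.05):
    """P{non-central chi-squared(1, ncp) falls below the critical value}."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # square root of the chi-squared(1) critical value
    root = sqrt(ncp)
    return z.cdf(crit - root) - z.cdf(-crit - root)


for n in (25, 50):
    eg = expected_g(n, 0.2)
    print(n, round(eg, 2), round(false_negative_1df(eg), 3))
```

Under these assumptions the expected LRTS is about 9.6 for n = 25 and about 19.3 for n = 50, with false negative probabilities close to 0.13 and 0.01 respectively.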

19 Likelihood C.I.'s - method
Example: consider the following Likelihood function, $L(\theta) = \theta^{b}(1 - \theta)^{a}$, where $\theta$ is the unknown parameter and a, b are observed counts.
For 4 data sets observed, A: (a,b) = (8,2), B: (a,b) = (16,4), C: (a,b) = (80,20), D: (a,b) = (400,100).
Likelihood values can be plotted vs possible parameter values, with MLE = peak value, e.g. MLE = 0.2, $L_{max}$ = 0.0067 for A, and $L_{max}$ = 0.000045 for B, etc.
Set A: Log $L_{max}$ - Log L = Log(0.0067) - Log(0.00091) = 2 gives an approximate 95% C.I., so $\theta$ = (0.035, 0.496), corresponding to L = 0.00091, is the approximate 95% C.I. for A.
Similarly, manipulating this expression, the likelihood value corresponding to the approximate 95% confidence interval is given as $L = L_{max} / 7.389$ (since $e^2 = 7.389$).
Note: usually plot the Log-likelihood vs the parameter, rather than the Likelihood. As sample size increases, the C.I. becomes narrower and approximately symmetric.
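
The likelihood-interval construction can be sketched numerically. The form $L(\theta) = \theta^{b}(1-\theta)^{a}$ used below is an assumption, chosen because it reproduces the quoted MLE of 0.2 and $L_{max}$ of 0.0067 for data set A; the interval is read off a fine grid as the set of $\theta$ whose log-likelihood lies within 2 units of the maximum.

```python
import math


def lik(theta, a, b):
    """Assumed likelihood form: theta**b * (1 - theta)**a."""
    return theta ** b * (1 - theta) ** a


def likelihood_interval(a, b, drop=2.0, grid=100000):
    """Return (MLE, L_max, lower, upper) with log L_max - log L <= drop."""
    thetas = [i / grid for i in range(1, grid)]
    values = [lik(t, a, b) for t in thetas]
    l_max = max(values)
    cutoff = l_max * math.exp(-drop)          # L = L_max / e**drop
    inside = [t for t, v in zip(thetas, values) if v >= cutoff]
    return thetas[values.index(l_max)], l_max, inside[0], inside[-1]


mle, l_max, lower, upper = likelihood_interval(8, 2)   # data set A
print(mle, round(l_max, 4), round(lower, 3), round(upper, 3))
```

For data set A this gives an MLE of 0.2, $L_{max}$ of about 0.0067, and an interval of roughly (0.035, 0.51), close to the limits quoted on the slide. Calling `likelihood_interval(80, 20)` shows the interval narrowing and becoming more symmetric as the sample size grows.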

20 Multiple Populations: Extensions to G - Example
Recall Mendel's data (earlier) and the extensions to $\chi^2$ for the same. In brief:

                 Round             Wrinkled
Plant            O      E          O      E          G       dof    p-value
1                45     42.75      12     14.25      0.49    1      0.49
2                                                    0.09    1      0.77
3                                                    0.10    1      0.75
4                                                    1.30    1      0.26
5                                                    0.01    1      0.93
6                                                    0.71    1      0.40
7                                                    0.79    1      0.38
8                                                    0.63    1      0.43
9                                                    1.06    1      0.30
10                                                   0.17    1      0.68
Total            336               101               5.34    10
Pooled           336    327.75     101    109.25     0.85    1      0.36
Heterogeneity                                        4.50    9      0.88
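
The G values in the table can be checked from the observed and expected counts that are shown (plant 1 and the pooled row), using $G = 2\sum O \log(O/E)$ with natural logs. The helper names are hypothetical; the 1 d.o.f. chi-squared tail probability is computed from the complementary error function, since $P\{\chi^2_1 > x\} = \mathrm{erfc}(\sqrt{x/2})$.

```python
from math import erfc, log, sqrt


def g_statistic(observed, expected):
    """Likelihood ratio (G) goodness-of-fit statistic, natural logs."""
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected))


def chi2_sf_1df(x):
    """Upper tail probability of chi-squared with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


g_plant1 = g_statistic([45, 12], [42.75, 14.25])      # round, wrinkled for plant 1
g_pooled = g_statistic([336, 101], [327.75, 109.25])  # pooled over the 10 plants
p_pooled = chi2_sf_1df(g_pooled)

print(round(g_plant1, 2), round(g_pooled, 2), round(p_pooled, 2))
```

This reproduces G = 0.49 for plant 1 and G = 0.85 with p = 0.36 for the pooled counts. The heterogeneity entry then follows by subtraction of the pooled G from the total G.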

21 Multiple Populations - summary
Parallels the partitions for $\chi^2$; therefore $G_{total} = \sum_{k=1}^{p} G_k$ with $p(n-1)$ d.o.f., $G_{pooled}$ with $n-1$ d.o.f., and $G_{heterogeneity} = G_{total} - G_{pooled}$ with $(p-1)(n-1)$ d.o.f. (n = no. classes, p = no. populations).
Example: recall the Backcross (AaBb x aabb) goodness of fit (2-locus model). For each of 4 crosses, a Total GoF statistic can be calculated according to the expected segregation ratio 1:1:1:1 (this assumes no segregation distortion for both loci and no linkage between loci). For each locus, a GoF statistic is calculated using marginal counts, assuming each genotype segregates 1:1. The difference between the Total and the 2 individual-locus GoF statistics is the Log-LRTS (or chi-squared statistic) contributed by association/linkage between the 2 loci.

22 Example: Marker Screening
Screening for polymorphism (different detectable alleles): look at the stages involved. A genomic map is based on genome variation at locations (from molecular assay or traditional trait observations).
(1) Screening polymorphic genetic markers is experimental step 1: usually a large number of possible genetic markers are assayed in a small progeny set = random sample of the mapping population. If a marker does not show polymorphism for the set of progeny, then the marker is non-informative (and will not be used for data analysis).

23 Example contd.
(2) Progeny size for screening: based on power, convenience etc.
False positive = monomorphic marker determined to be polymorphic. Rare, since a monomorphic marker cannot produce segregating genotypes if these are determined accurately.
False negatives: high, particularly for a small sample, e.g. for markers segregating 1:1 in (i) backcross, recombinant inbred lines or doubled haploid lines, or 1:2:1 in (ii) F2 with codominant markers.
So, e.g. (i) P{sampling all individuals with same genotype} = $2(0.5)^n$; (ii) P{false negative for single marker, n = 5} = $2(0.25)^5 + 0.5^5 = 0.0332$.
Hence power curves as before.
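
The false negative probabilities above are the chance that all n sampled progeny fall in the same genotype class. A minimal sketch, with a hypothetical helper name, assuming independent sampling from the stated segregation ratios:

```python
def p_all_same(n, class_probs):
    """P{all n sampled individuals share one genotype class}."""
    return sum(p ** n for p in class_probs)


backcross = p_all_same(5, [0.5, 0.5])                # 2 * 0.5**5
f2_codominant = p_all_same(5, [0.25, 0.5, 0.25])     # 2 * 0.25**5 + 0.5**5

print(backcross, round(f2_codominant, 4))

# Smallest progeny size giving a false negative rate below 1% for a 1:1 marker
n = 1
while p_all_same(n, [0.5, 0.5]) >= 0.01:
    n += 1
print(n)
```

For n = 5 this gives 0.0625 for a 1:1 marker and 0.0332 for an F2 codominant marker; the loop shows that 8 progeny are enough to bring the 1:1 false negative rate below 1%.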

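24 Example contd. S.R. 1:1 vs 3:1 - use LRTS
Detection of departure from a S.R. of 1:1: $G = 2[O_1 \log(O_1 / 0.5n) + O_2 \log(O_2 / 0.5n)]$, where n = sample size and $O_1$, $O_2$ are the observed counts of the 2 genotypic classes.
For a true S.R. of 3:1, with $O_1$ the count of the dominant genotype, the T.S. parametric value is approx. $E\{G\} \approx 2n[0.75\log(0.75/0.5) + 0.25\log(0.25/0.5)] = 0.2616n$.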

25 Example contd.
To reject a S.R. of 1:1 at the 0.05 significance level, a LogLRTS of at least 3.84 (critical value for rejection) is required.
Statistical power: for n = 15, $E\{G\} \approx 2(15)[0.75\log 1.5 + 0.25\log 0.5] \approx 3.92$, so the power is $P\{\chi^2_1(\lambda = 3.92) \ge 3.84\} \approx 0.5$. For a power of 90%, n ≈ 40 is needed.
If the problem is expressed the other way, i.e. calculating the expected LRTS for rejecting a 3:1 S.R. when the true value is S.R. 1:1, this is 0.2877n and n ≈ 35 is needed.
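
The per-individual expected LRTS values and the approximate sample sizes can be checked as follows. This is a sketch under stated assumptions: the expected LRTS is treated as the non-centrality of a 1 d.o.f. chi-squared, and the required non-centrality for a given power uses the Normal approximation $(z_{1-\alpha/2} + z_{power})^2$, which ignores the far tail. Function names are hypothetical.

```python
from math import log
from statistics import NormalDist


def expected_g_per_obs(true_probs, null_probs):
    """Expected LRTS contributed by one individual (natural logs)."""
    return 2 * sum(p * log(p / q) for p, q in zip(true_probs, null_probs))


def n_for_power(per_obs, power=0.90, alpha=0.05):
    """Approximate sample size so that the 1 d.o.f. test reaches the given power."""
    z = NormalDist()
    ncp = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ncp / per_obs


g_true_3to1 = expected_g_per_obs([0.75, 0.25], [0.5, 0.5])   # testing 1:1, truth 3:1
g_true_1to1 = expected_g_per_obs([0.5, 0.5], [0.75, 0.25])   # testing 3:1, truth 1:1

print(round(g_true_3to1, 4), round(n_for_power(g_true_3to1), 1))
print(round(g_true_1to1, 4), round(n_for_power(g_true_1to1), 1))
```

The coefficients come out as 0.2616 and 0.2877, matching the slide. The approximate sample sizes are about 40 and 37, of the same order as the rounded values quoted.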

26 Maximum Likelihood Benefits
- Good confidence intervals: coverage probability realised and interval biologically meaningful
- MLE is a good estimator of a CI
- MSE consistent
- Absence of bias does not "stand alone": minimum variance is important
- Asymptotically Normal
- Precise for large samples
- Biological inference valid
- Biological range realistic

