
Review of main points from last week: Medical costs are escalating, largely due to new technology. This is an ethical/social problem with major consequences. Many new technologies provide only marginal benefits, and cost-effectiveness is frequently not evaluated (FDA asks: is it “safe and effective”? CMS asks: is it “necessary and reasonable”?). Consider genetic testing and “personalized” medicine as an example of a new technology needing evaluation: What benefits does/will it provide? At what cost? Are there potential harms? What ethical/legal/social issues does it raise?

Some concepts needed to understand this technology: What does DNA “do”? (genes, proteins) What is its structure? What is a DNA sequence? What is a SNP (single nucleotide polymorphism)? How is DNA passed from parents to offspring? What are mutations, genetic variants? How can they be associated with traits, diseases, disease risks, sensitivity to particular drugs? Examples of tests offered “direct-to-consumer”.

Today’s subject: gene-disease risk associations & GWAS. Understand how GWAS have been done, in order to better evaluate disease risk predictions from companies like 23andMe. Understand the strengths and limitations of GWAS. Go over some basic ideas in statistics needed to evaluate GWAS (and other applications in engineering!). Think about how technical complexity affects your ability to evaluate the utility of this (and, by example, other) new technologies.

Genome-wide association studies (GWAS) = source of data for SNP associations with particular diseases. Basic idea: search for chromosomal regions (SNPs) with different allele frequencies in cases vs controls. If such SNPs are found, it could be that: (1) the SNP allele causes (or contributes to) the disease; (2) the SNP allele is close enough on a chromosome to a disease-causing mutation that they have been inherited together in most people since the mutation arose (founder effect); (3) the SNP allele and the disease both occur at higher frequency in some ethnic group, but not for genetic reasons, e.g. malaria and skin pigment variants.

The last possibility would be a false positive result for GWAS! So GWAS first go to great lengths to select genetically homogeneous cases and controls and exclude genetically heterogeneous individuals. How can you do this? Use multi-dimensional scaling, a data visualization tool that groups similar objects in complex data sets. Idea: imagine an n-dimensional space where each axis represents a SNP locus, with AA = 0, Aa = 0.5, aa = 1 along each axis; represent each individual as a point in this space; genetic distance between people = Euclidean distance between their points.
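
The genotype coding and distance described on this slide can be sketched in a few lines of Python. This is only an illustration: the three individuals and their genotypes are made up, and the function names are hypothetical.

```python
import math

# Coding from the slide: AA = 0, Aa = 0.5, aa = 1 at each SNP locus.
GENOTYPE_CODE = {"AA": 0.0, "Aa": 0.5, "aa": 1.0}

def encode(genotypes):
    """Turn a list of genotype strings (one per SNP locus) into a point."""
    return [GENOTYPE_CODE[g] for g in genotypes]

def genetic_distance(person1, person2):
    """Euclidean distance between two individuals in SNP space."""
    p, q = encode(person1), encode(person2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical genotypes at 4 loci (invented for illustration).
alice = ["AA", "Aa", "aa", "AA"]
bob = ["AA", "aa", "aa", "Aa"]
carol = ["aa", "aa", "AA", "aa"]

d_ab = genetic_distance(alice, bob)    # differ by 0.5 at two loci
d_ac = genetic_distance(alice, carol)  # differ at every locus
```

With these made-up genotypes, alice is closer to bob than to carol, which is the sense in which nearby points are "genetically similar".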

It is hard to “see” data in n-dimensional space when n is large. So make a 2-d plot of individuals such that the distance between pairs in 2-d “best” reflects their Euclidean distance in n dimensions (imagine moving each point in the 2-d map randomly to minimize the discrepancy between 2-d and n-d distances, summed over all other individuals, then repeating for each individual until the map positions converge). Genetically closely related people cluster in such a map. Eliminate all outliers from the GWAS study population.
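
A toy version of the "move each point randomly and keep the move if the discrepancy shrinks" procedure might look like the sketch below. It is not how production MDS software works (real tools use more efficient optimizers). The sketch also makes three assumptions of its own: the data are invented, a fixed number of sweeps stands in for a real convergence check, and the total squared discrepancy is used as the score. The total changes only through the moved point's terms, so accepting on it is equivalent to accepting on that point's own sum.

```python
import math
import random

def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def stress(target, layout):
    """Sum over pairs of squared (n-d distance - 2-d distance)."""
    current = pairwise(layout)
    n = len(layout)
    return sum((target[i][j] - current[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds_random_moves(points_nd, sweeps=200, step=0.1, seed=0):
    """Return (layout, initial_stress, final_stress) for a 2-d map."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    target = pairwise(points_nd)
    layout = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in points_nd]
    initial = best = stress(target, layout)
    for _ in range(sweeps):
        for i in range(len(layout)):
            old = layout[i]
            # Propose a small random move for individual i.
            layout[i] = [old[0] + rng.gauss(0, step),
                         old[1] + rng.gauss(0, step)]
            new = stress(target, layout)
            if new < best:
                best = new       # keep the move: discrepancy shrank
            else:
                layout[i] = old  # undo the move
    return layout, initial, best

# Hypothetical individuals at 5 SNP loci (coded AA=0, Aa=0.5, aa=1).
people = [
    [0, 0, 0.5, 0, 0], [0, 0.5, 0, 0, 0], [0.5, 0, 0, 0, 0],  # group 1
    [1, 1, 1, 0.5, 1], [1, 1, 0.5, 1, 1], [1, 0.5, 1, 1, 1],  # group 2
]
layout, initial_stress, final_stress = mds_random_moves(people)
```

Plotting `layout` should show the two invented groups as separate clusters; an individual far from every cluster would be the kind of outlier the slide says to eliminate.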

Implication: any positive GWAS findings are initially only “true” for a particular homogeneous group (e.g. CEU = N. Europeans) and must be retested in other populations before they can be accepted generally.

Next problem: if genotype frequencies (genotype = pattern of alleles at some locus, e.g. AA vs Aa vs aa) differ between cases and controls, how much do they have to differ to be statistically significant? Basic idea in statistics: see if the data are reasonably likely given the “null” hypothesis (H0) that the groups (e.g., cases and controls) do not differ (in genotype frequencies). If the groups are not really different, you could pool the data and calculate the mean and standard deviation for the pool, then ask: if you randomly chose 2 groups (of the size of the cases and controls) from this one population, how likely is it that the means of the 2 groups would differ by as much as you observe? A “t” test gives you this probability. If it is very low, you may have reason to reject the null hypothesis.
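
The "pool the data and randomly choose 2 groups" thought experiment can be run literally on a computer. Note that the sketch below is a permutation (resampling) test, not the t test itself: the t test arrives at a comparable probability analytically. The genotype scores are invented and the function name is hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group1, group2, trials=10000, seed=0):
    """Estimate how often two randomly chosen groups from the pooled
    data differ in mean by at least as much as the observed groups."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # random re-assignment to the 2 groups
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed:
            hits += 1
    return hits / trials

# Hypothetical genotype scores (AA=0, Aa=0.5, aa=1) at one SNP.
cases = [0, 0.5, 0.5, 1, 1, 1, 0.5, 1]
controls = [0, 0, 0.5, 0, 0.5, 0, 0, 0.5]
p = permutation_p_value(cases, controls)
```

A small `p` means random splits of the pool rarely differ as much as the observed groups do, which is the reason one might reject H0.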

The chi sq test is very like the “t” test. $\chi^2 = \sum (\mathrm{Exp} - \mathrm{Obs})^2 / \mathrm{Exp}$. Its probability distribution is known for randomly selected groups from a single population. If p(chi sq) < some small number, e.g. $\alpha = 0.05$, you might want to conclude the groups are different. Traditionally, and completely arbitrarily, $\alpha = 0.05$ is often taken as a cut-off. This means that if the groups are really not different, you’ll make a mistake and call them different 5% of the time. You pick the cut-off for whatever error rate you feel is appropriate.

Complication: if one tests for association with 20 (or n) independent things, expect ~1 to have p(chi sq) < 0.05 (< 1/n) even when no association exists (false positive, FP). Testing for association with any of ~$10^6$ SNPs, one needs a much stricter criterion than $\alpha = 0.05$ in order to avoid lots of FPs. Simplest correction, Bonferroni: divide $\alpha$ by n = number of SNPs tested; e.g. require p(chi sq) < $0.05/10^6 = 5 \times 10^{-8}$ in order that the probability of any FP be < 0.05.
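
The multiple-testing arithmetic on this slide can be checked directly. The sketch below assumes the tests are independent, as the slide does; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing the cut-off when H0 is true everywhere."""
    return n_tests * alpha

def prob_at_least_one_fp(n_tests, alpha):
    """Probability of one or more false positives among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cut-off that keeps the chance of any false positive below alpha."""
    return alpha / n_tests

fp_20 = expected_false_positives(20, 0.05)         # the slide's 20-test case
any_fp_20 = prob_at_least_one_fp(20, 0.05)         # chance of at least one FP
gwas_cutoff = bonferroni_threshold(10**6)          # the slide's 10^6 SNPs
any_fp_gwas = prob_at_least_one_fp(10**6, gwas_cutoff)
```

With 20 tests at 0.05 the chance of at least one false positive is already about 64%; with the Bonferroni cut-off applied to 10^6 tests it drops back to just under 5%.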

Example chi sq calculation (hypothetical #'s with each genotype)

Observed #     aa     aA     AA     sum
dis. cases     45      …      …    2000
controls      120      …      …    3000
totals        165   1470   3365    5000

If H0 is true, can pool the groups for the best estimate of probabilities: p(aa) = 165/5000; p(aA) = 1470/5000; p(AA) = 3365/5000.

Then the expected # of aa among dis. cases = p(aa)*2000 = 66, and the expected # of aA among dis. cases = p(aA)*2000 = 588. Compute the remaining expected #’s the same way or from the totals:

Expected #     aa     aA     AA     sum
dis. cases     66    588   1346    2000
controls       99    882   2019    3000
totals        165   1470   3365    5000

$\chi^2 = \sum (\mathrm{exp} - \mathrm{obs})^2 / \mathrm{exp} = (66 - 45)^2/66 + \dots$; p(chi sq, 2 df) = $1.59 \times 10^{-9}$ (from table, or web) < $\alpha$, so H0 (no association) is rejected; association is likely. For confirmation, repeat the study in independent groups.
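
The whole calculation can be reproduced in a short script. One caveat: apart from the 45 aa cases, the observed counts did not survive in this transcript, so the counts below are assumed values, chosen to agree with the totals the slide does give (165, 1470, 3365; 2000 cases out of 5000). The p-value formula is exact only for 2 degrees of freedom, where the chi-square tail probability is exp(-chi2/2).

```python
import math

def chi_square_table(observed):
    """observed: one row of genotype counts per group.
    Returns (chi-square statistic, table of expected counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    expected = []
    for row_total, row in zip(row_totals, observed):
        # Under H0, expected count = pooled genotype frequency * group size.
        exp_row = [row_total * c / grand for c in col_totals]
        expected.append(exp_row)
        chi2 += sum((e - o) ** 2 / e for e, o in zip(exp_row, row))
    return chi2, expected

def p_value_2df(chi2):
    """Tail probability of a chi-square statistic with exactly 2 df."""
    return math.exp(-chi2 / 2)

# Columns: aa, aA, AA. ASSUMED counts consistent with the slide's totals.
observed = [
    [45, 510, 1445],   # disease cases (sum 2000)
    [120, 960, 1920],  # controls (sum 3000)
]
chi2, expected = chi_square_table(observed)
p = p_value_2df(chi2)
```

With these assumed counts the expected table matches the slide (66, 588, 1346 for cases) and `p` comes out close to the quoted 1.59 x 10^-9, far below even a Bonferroni cut-off of 5 x 10^-8.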

Next problem: we are not really interested in p(data|hypothesis); we want p(hypothesis|data). E.g., you observe that the frequency of some SNP allele is higher in the disease group than in controls; you want to know p(disease|genotype), not p(genotype|disease). Bayesian statistics allows you to infer p(disease|genotype) from p(genotype|disease). Basic idea: there are 2 ways to calculate the probability of disease and genotype AA together: p(D|AA) p(AA) = p(AA|D) p(D), so p(D|AA) = p(AA|D) p(D) / p(AA). You have to know p(D), p(AA), and p(AA|D) to get p(D|AA).
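
As a quick numerical illustration of the inversion p(D|AA) = p(AA|D) p(D) / p(AA), here is a sketch with invented probabilities (none of these numbers come from the lecture or from any study):

```python
def p_disease_given_genotype(p_genotype_given_disease, p_disease, p_genotype):
    """Bayes' rule: p(D|AA) = p(AA|D) * p(D) / p(AA)."""
    return p_genotype_given_disease * p_disease / p_genotype

# Hypothetical inputs:
p_AA_given_D = 0.50  # half of patients have genotype AA
p_D = 0.05           # 5% of the population gets the disease
p_AA = 0.40          # 40% of the population has genotype AA

p_D_given_AA = p_disease_given_genotype(p_AA_given_D, p_D, p_AA)
```

Here AA is noticeably over-represented among patients (50% vs 40%), yet the risk for an AA carrier only rises from 5% to 6.25%, because the disease itself is uncommon.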

Relative risk might be measured as p(D|AA)/p(D), but it is frequently expressed in terms of the “odds ratio”. Odds = p(event)/[1 - p(event)], e.g. “2:1” if p(event) = 0.67. Odds ratio = odds(D|AA)/odds(D) (assume A is the high-risk allele) = {p(D|AA)/[1 - p(D|AA)]} / {p(D)/[1 - p(D)]}. Note odds ratio > relative risk, since [1 - p(D)]/[1 - p(D|AA)] > 1.
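
The gap between odds ratio and relative risk is easy to see numerically. The sketch below uses two illustrative pairs of probabilities, one for a small effect and one for a large effect; both pairs are chosen only for illustration and the function names are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p, e.g. 2/3 -> 2 (i.e. "2:1")."""
    return p / (1 - p)

def relative_risk(p_d_given_g, p_d):
    """p(D|genotype) / p(D)."""
    return p_d_given_g / p_d

def odds_ratio(p_d_given_g, p_d):
    """odds(D|genotype) / odds(D)."""
    return odds(p_d_given_g) / odds(p_d)

# Small effect, uncommon disease: risk rises from 5% to 6.25%.
rr_small = relative_risk(0.0625, 0.05)
or_small = odds_ratio(0.0625, 0.05)

# Large effect: risk rises from 8% to 80%.
rr_large = relative_risk(0.80, 0.08)
or_large = odds_ratio(0.80, 0.08)
```

For the small effect the two measures nearly agree (relative risk 1.25, odds ratio about 1.27). For the large effect they diverge sharply (relative risk 10, odds ratio 46), so it matters which one a report is quoting.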

Look at the data in the GWAS paper, Nature 447:661 (2007); appreciate the magnitude, expense, complexity, and limitations: ~100 authors, ~10^6 SNPs tested in each of 17,000 samples (… $1000). Could the study have been done if each test cost $1?

Note that most disease risks are measured by OR only. Do you understand most of the columns in this table? [Table from the paper; labeled columns: disease, raw p(chi sq), ORs]

Note that many SNPs in the region are associated with disease. [Figure: example of hit region]

Summary of all hits, all diseases, all chromosomes. [Figure; axis label: Ln 10 (p)]

Limitations: Do most SNP associations identify causative mutations? No, because there are many associated SNPs in each region, and they can’t all be causative. If not causative, why the association? Likely explanation: the causative mutation arose at some time, not very long ago, on some chromosome in a “founder” individual; he/she passed on the mutation to offspring along with adjacent chromosomal regions. Recombination between the causative mutation and these regions has not yet occurred on most chromosomes bearing the mutation, so SNPs near the mutation in the founder remain associated with it in offspring = linkage disequilibrium (LD).

Implications: (1) Associated SNPs reflect fairly recent mutations, and therefore may be restricted to particular ethnic groups (not enough time to spread throughout the world by migration and interbreeding); in other groups the same SNPs may be unassociated with disease. (2) It is hard to find very old mutations causing disease (no LD). (3) Hits provide locational clues to causative mutations; the latter could provide leads for new treatments (Rx) and reduce imprecision in risk assessments. (4) Most SNP associations are now confirmed in independent disease group studies (see 23andMe white paper on “vetting” disease associations).

Note that most relative risks (odds ratios) are small, < 2-fold. Does this make most results practically insignificant? Odds ratios are much smaller than expected from estimates of heritability from family studies. Example: height is said to be 80% inherited, but the maximum combined effect of all associated SNPs is only ~5%. How is “% heritability” estimated? Old way: for height, plot children’s height vs the mean height of their parents; if children with tall parents tend to be tall, height could be genetic.

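More mathematically: find the (least-squares) best-fit line $h_c = a\,h_p + b$ of children’s height $h_c$ (h children) against mean parental height $h_p$, with slope $a$ and intercept $b$, and compare the variance from the best-fit line to the variance from the global mean line. Fraction of variance explained by parents’ height $= 1 - \sum [h_c - (a\,h_p + b)]^2 / \sum (h_c - \bar{h}_c)^2$, where $\bar{h}_c$ is the global mean of the children’s heights.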
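
The "fraction of variance explained" can be computed by hand with an ordinary least-squares fit. The sketch below uses invented heights (in cm) purely for illustration; the names `slope` and `intercept` stand for the two coefficients of the best-fit line.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for y as a function of x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def fraction_explained(x, y):
    """1 - (squared deviations from best-fit line) /
           (squared deviations from global mean of y)."""
    slope, intercept = fit_line(x, y)
    my = mean(y)
    from_line = sum((yi - (slope * xi + intercept)) ** 2
                    for xi, yi in zip(x, y))
    from_mean = sum((yi - my) ** 2 for yi in y)
    return 1 - from_line / from_mean

# Hypothetical data: mean parental height vs child's adult height (cm).
parents = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
children = [163.0, 164.0, 172.0, 171.0, 181.0, 180.0]
r2 = fraction_explained(parents, children)
```

A value near 1 means the children's heights sit close to the best-fit line; a value near 0 means parental height tells you little beyond the overall mean.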

Does the child-parent height correlation prove height is genetic? No: it may confound environmental and genetic effects (tall parents may eat better and provide a better diet). Clever way to tease out genetic from environmental effects within families: use SNP genotypes to measure genetic relatedness between siblings, and plot height differences between sibling pairs vs. genetic relatedness. Genetic relatedness = % of genes that are identical in siblings due to inheritance from the same grandparent (e.g. they both get their mother’s maternal (or paternal) alleles, vs one gets the maternal and the other the paternal allele); call this % identity by descent (IBD).

Plot height difference between sibs (h diff) vs % IBD. Now 1 - (variance from red line / variance from blue line) provides an estimate of the effect explained by genes, controlled for environment (sib pairs are expected to share environments to the same extent, unaffected by their % IBD).

Can generalize to disease incidence … (don’t worry about details). Plot disease status (disease = 1, no disease = 0) vs genotype (hh, Hh, HH) and find the least-squares best-fit line. Fraction explained by H = 1 - (variance from red line / variance from blue line); use this to say what fraction of disease incidence is “explained” by SNPs.

“Missing heritability” = big embarrassment for GWAS. Possible explanations: (1) family studies overestimate heritability by confounding environmental effects; (2) disease caused by changes in gene expression not detectable by SNPs (epistasis); (3) disease caused by a very old mutation (association lost due to genetic recombination over time); (4) disease caused by rare alleles (SNPs analyzed were chosen to have minor allele frequencies > 5%). Some want to push on with GWAS, testing rarer SNPs or sequencing genomes to look for alleles associated with disease. The reward may be in understanding how particular genes contribute to diseases, not in the utility of risk prediction.

Next problem: how to combine risks from unlinked SNPs. 23andMe multiplies relative risks (see 23andMe white paper). This assumes effects are independent, i.e. no gene interactions. Is this accurate? Counterexample: a gene that raises expression of fetal hemoglobin decreases the severity of sickle cell disease => some genes interact “non-linearly”. How could one verify whether predicted disease risks are accurate? 1. Prospective studies: think about feasibility (how many subjects needed, how much time, etc.). 2. Compare different companies’ risk predictions.
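
Multiplying relative risks, as described on this slide, amounts to the calculation below. This is only a sketch of the independence assumption, not a reproduction of any company's actual procedure; the per-SNP relative risks and the average risk are invented.

```python
from math import prod

def combined_relative_risk(relative_risks):
    """Product of per-SNP relative risks; valid only if effects are independent."""
    return prod(relative_risks)

def predicted_risk(average_risk, relative_risks):
    """Population-average risk scaled by the combined relative risk."""
    return average_risk * combined_relative_risk(relative_risks)

# Hypothetical person: three unlinked SNPs, two raising risk and one lowering it.
snp_relative_risks = [1.2, 0.9, 1.3]
combined = combined_relative_risk(snp_relative_risks)
risk = predicted_risk(0.05, snp_relative_risks)  # assumed 5% average risk
```

Nothing in the product can capture an interaction such as the fetal hemoglobin example, where the effect of one gene depends on another; that is exactly the limitation the slide raises.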

Venter et al. compared the risk predictions of Navigenics and 23andMe for 13 diseases for 5 individuals. Results: qualitative discrepancies for half of the people in half of the tests. Explanation: the companies used different sets of SNPs. Does this restore confidence in the clinical validity of the test?

Smallness of effects limits clinical utility: most effects are comparable to risks conferred by a positive family history. But for some diseases, where the disease mutation has been identified, the predicted risk increase can be large, e.g. CF ~100%, though severity can vary; BRCA1: some mutations elevate lifetime risk from ~8% to 80% (> 20x risk for early onset). Next problem: when the relative risk increase is large, is there something one can do about it? Will come back to this for BRCA in the unit on screening for breast cancer.

At what point, if any, should FDA regulation be required? Should it depend on the magnitude of relative risk, or absolute risk? On whether test results are likely to result in life-altering action (surgery, drug treatment, life-long screening, abortion)?

Different measures of test validity and utility. Scientific validity: does it detect the SNPs it says it does, and with what error rate? Clinical validity: does it produce valid diagnoses? Clinical utility: is the information useful in a medical setting? How do tests for BRCA mutation, CF carrier status, and warfarin sensitivity rate by these criteria?

Main points: (1) basic idea of GWAS: different allele frequencies in disease vs control groups; (2) many DNA regions found to affect the chance of getting several common diseases; (3) most effects are small, and possibly limited to specific ethnic groups; (4) hits provide leads for finding causative genes -> understanding disease mechanism; (5) possible applications in drug therapy (“personalized medicine”).

Homework: look over the GWAS paper and try to get the big ideas; don’t worry about unintelligible jargon. Divide and conquer papers/topics (pick one): Math exercise on odds ratios and chi sq (2 items); Venter on comparing 23andMe and Navigenics results (what are the ethics of his conclusions?); NYT on behavioral effect of DTC genetic testing; NEJM on risk prediction from GWAS (2 items); 2 views of the utility of the warfarin genetic test (pick one): Am Coll Cardiol. (it reduces hospitalizations); Ann Int. Med. (it is not cost-effective).