Genetic Epidemiology Michèle Sale, Ph.D. Center for Public Health Genomics
Genetic epidemiology “A science which deals with the etiology, distribution, and control of disease in groups of relatives and with inherited causes of disease in populations.” Newton E. Morton, 1982
Model for Complex Diseases + = Disease Susceptibility
Genetics of a Complex Disease [Figure: diagram linking Genes 1-4, Environments 1-2, Traits 1-3, and Disease]
“Monogenic” vs “Complex” Disease
Mendelian | Complex
1 or small # of genes | Many genes
Often etiologic (severe phenotype) | Susceptibility / molecular pathology (?)
Highly penetrant | Modest penetrance
High odds ratio | Modest/low odds ratio
Strong selection => low frequency/rare | Weak/no selection => high frequency/common
Coding sequence | Non-coding/regulation (?)
Overall steps for disease gene identification Is there a genetic component? Study design Measurement of phenotype Molecular analysis Functional analysis
Is there a genetic component? Twin studies Familial aggregation Segregation analysis Race/ethnicity differences
Twin studies
Comparison of monozygotic (MZ) pairs (who share all of their genome) with dizygotic (DZ) twins (who share half of their genome on average, the same as siblings)
The greater similarity of MZ twins than DZ twins is considered evidence of genetic factors
The pairwise concordance (Pr) is the proportion of affected pairs that are concordant for the disease.
–The proportion of twin pairs with both twins affected, out of all ascertained twin pairs with at least one affected twin
–Pr = C/(C+D), where C is the number of concordant pairs and D is the number of discordant pairs.
The probandwise concordance (Cc) is the proportion of affected individuals among the co-twins of previously ascertained index cases.
–Allows for double counting of doubly ascertained twin pairs and is interpretable as the recurrence risk in a co-twin of an affected individual
–Cc = 2C/(2C+D)
In theory, complete genetic determination of a disease would equate to MZ twins having 100% concordance and DZ twins having 50% concordance
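The two concordance formulas can be sketched in a few lines of Python. This is only an illustration: the function names and the pair counts (20 concordant, 80 discordant) are hypothetical and do not come from any study cited in these slides.

```python
def pairwise_concordance(concordant_pairs, discordant_pairs):
    """Pr = C / (C + D): share of ascertained pairs in which both twins are affected."""
    return concordant_pairs / (concordant_pairs + discordant_pairs)


def probandwise_concordance(concordant_pairs, discordant_pairs):
    """Cc = 2C / (2C + D): concordant pairs count twice, once per proband."""
    return 2 * concordant_pairs / (2 * concordant_pairs + discordant_pairs)


# Hypothetical counts, for illustration only.
C, D = 20, 80
print(f"Pr = {pairwise_concordance(C, D):.3f}")
print(f"Cc = {probandwise_concordance(C, D):.3f}")
```

For the same counts, Cc is never smaller than Pr, because each concordant pair contributes two affected individuals to the numerator and the denominator.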
Twin studies - assumptions Random mating No interactions between genes and environment Equivalent environments for MZ and DZ twins
Concordance rates for some traits
Other types of twin studies Twins discordant for disease have been used to examine possible environmental causes. Adoption studies also permit the separation of childhood rearing effects from genetic effects by studying the similarity of adopted children with their biological and adoptive parents
Bouchard et al. Science Oct 12;250(4978):223-8.
Familial aggregation Sibling relative risk ratio: $\lambda_s = \dfrac{\text{risk to a sibling of a person with the disease of interest}}{\text{population risk}}$
SNP Disease Variant & $\lambda_s$ in Diabetes
 | Type 1 | Type 2
Prob of Disease (Sibling) | 6% | 30-40%
Prob of Disease (Unrelated) | 0.4% | 7%
$\lambda_s$ = Prob of Disease (Sibling) / Prob of Disease (Unrelated)
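As a rough check, the sibling relative risk ratio can be computed directly from the probabilities on this slide. The function name below is hypothetical, and the results are simple arithmetic on the slide's figures, not published estimates.

```python
def sibling_relative_risk(sibling_risk, population_risk):
    """lambda_s = risk to a sibling of an affected person / population risk."""
    if population_risk <= 0:
        raise ValueError("population risk must be positive")
    return sibling_risk / population_risk


# Probabilities taken from the slide (sibling vs. unrelated individual).
type1 = sibling_relative_risk(0.06, 0.004)
type2_low = sibling_relative_risk(0.30, 0.07)
type2_high = sibling_relative_risk(0.40, 0.07)

print(f"Type 1 diabetes: lambda_s is about {type1:.0f}")
print(f"Type 2 diabetes: lambda_s is about {type2_low:.1f} to {type2_high:.1f}")
```

On these figures, familial clustering relative to the population risk is much stronger for type 1 than for type 2 diabetes, even though the absolute sibling risk is higher for type 2.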
Recurrence risks for multiple sclerosis in families Compston and Coles. Lancet. 2002;359(9313):
Segregation analysis Determines which specific model (genetic or environmental) best fits the familial aggregation E.g. –Major gene or many genes (polygenic)? –Dominant, additive, recessive inheritance?
Differences in prevalence across race/ethnicities
Time trends in the percentage of African American adolescents and adults who were overweight,
Adiposity However, rates of type 2 diabetes in African Americans are still higher than in Caucasian Americans after controlling for age, adiposity, and socio-economic status. Other factors must be involved.
Epidemiological study designs
Study designs Case series: what clinicians see Case-control: compare people with and without a disease Cohort: follow people over time to see who gets the disease Randomized controlled trial (RCT)
Other terms Retrospective vs. prospective Cross-sectional vs. longitudinal
Measurement of Phenotype
What is the phenotype?
–e.g., diabetes, fasting glucose, oral glucose tolerance test
How is it diagnosed?
–Physician’s diagnosis, clinical measurements, questionnaire
How objective are the phenotypes?
–Physician’s diagnosis – somewhat variable
–Clinical measurements – most are pretty good
*The more defined the phenotype, the easier it is to find the gene(s) that control it
Additional consideration in genetic studies: families or unrelated individuals?
Patient ascertainment: sib-pairs, families, or case-control [Figure: pedigree diagrams for each design]
Molecular/Analytical approaches Linkage –Families Association –Candidate gene –Genome wide –Generally case-control –There are family-based approaches
Effect and frequency of risk alleles dictate strategies [Figure: magnitude of effect plotted against frequency in population, with regions labeled "Linkage studies", "Association studies", "Unlikely to exist", and "Unlikely to be found"]
Linkage analysis Linkage = The proximity of two or more markers on a chromosome Linkage analysis is a statistical method for detecting linkage between a disease and markers of known location by following their inheritance in families Uses recombination to define genomic interval likely to contain gene/s Single large pedigree or multiple small pedigrees
Linkage analysis
Works well for Mendelian traits and more highly penetrant diseases
Low resolution = fewer markers needed and resilient to allelic heterogeneity
Apparently high Type I error rate for complex/non-Mendelian diseases: more loci, common variants, high phenocopy rate, and lower penetrance
Large pedigrees are better for rare alleles: more likely to segregate the allele
Large pedigrees increase the probability of parental heterozygosity for frequent alleles
Most likely to detect intermediate-frequency alleles
Strong pedigree signal may reflect rare Mendelian forms of complex disease
–e.g., BRCA1 & BRCA2 mutations in breast and ovarian cancer
Genotype markers across the genome Illumina's Linkage IVb Panel: >6,000 SNPs
LOD score
LOD score = log of the odds of linkage:
$\mathrm{LOD} = \log_{10} \dfrac{\text{likelihood of linkage}}{\text{likelihood of not being linked}} = \log_{10} \dfrac{L(\theta < 0.5)}{L(\theta = 0.5)}$, where $\theta$ is the recombination fraction
The closer two markers are to each other, the lower the odds of a recombination (crossing-over event) occurring between them in meiosis.
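A minimal sketch of the LOD calculation, assuming the simplest textbook case: phase-known, fully informative meioses, so that the likelihood is $L(\theta) = \theta^{R}(1-\theta)^{N-R}$ for $R$ recombinants in $N$ meioses. That likelihood model, the function names, and the counts are assumptions made for illustration. Real linkage software handles unknown phase, missing genotypes, and penetrance models.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD = log10( L(theta) / L(0.5) ) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    lik_linked = theta ** recombinants * (1.0 - theta) ** non_recombinants
    lik_unlinked = 0.5 ** meioses
    if lik_linked == 0.0:
        # theta = 0 is impossible once any recombinant has been observed
        return float("-inf")
    return math.log10(lik_linked / lik_unlinked)


def max_lod(recombinants, meioses, steps=50):
    """Scan theta on an evenly spaced grid from 0 to 0.5; return (best LOD, best theta)."""
    grid = [i * 0.5 / steps for i in range(steps + 1)]
    return max((lod_score(recombinants, meioses, t), t) for t in grid)


# Hypothetical family data: 1 recombinant among 10 informative meioses.
best_lod, best_theta = max_lod(recombinants=1, meioses=10)
print(f"maximum LOD = {best_lod:.2f} at theta = {best_theta:.2f}")
```

Under this model the LOD is maximized at the observed recombinant fraction R/N, and a LOD of 0 at θ = 0.5 corresponds to no evidence either way.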
Linkage analysis Is there cosegregation of a chromosomal region with the phenotype?
Linkage analysis: Is there cosegregation of a chromosomal region with the phenotype? Add additional markers to the region. Add additional families to the study.
Association study: best power for common variants of modest to low effect size. Search for specific genetic differences distinguishing cases from controls. [Figure: cases vs. controls]
Case-control is the most popular study design for complex disease genetics:
Cross-sectional - no follow-up
More efficient recruitment than families - easy to ascertain and recruit
Easy to analyze
Greater statistical power than family-based linkage for common variants of modest effect
Cases and controls must be well matched
- Drawn from the same population
- Randomize non-genetic confounding factors
At risk of type 1 errors if matching is incomplete (stratification)
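One common way to compare cases and controls at a single SNP is an allelic 2x2 test. The sketch below uses a Pearson chi-square test with 1 degree of freedom on allele counts. That choice of test, the function name, and the counts are assumptions for illustration and are not taken from the slides.

```python
import math


def allelic_association(case_risk, case_other, control_risk, control_other):
    """Odds ratio, Pearson chi-square (1 df), and p-value for a 2x2 allele-count table.

    Assumes every cell count is non-zero.
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical allele counts: 200 cases and 200 controls, two alleles each.
odds_ratio, chi2, p_value = allelic_association(240, 160, 200, 200)
print(f"OR = {odds_ratio:.2f}, chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

This test does nothing to correct for population stratification, so poorly matched cases and controls can still produce the type 1 errors the slide warns about.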
Genome-wide association studies (GWAS): A paradigm shift in human genetics
How can we use SNPs to find diabetes genes? Genome-Wide Association Study (GWAS) –Examination of variation across the entire human genome to identify genetic correlations with the presence or absence of diabetes Two groups: cases (have diabetes) vs. controls (don’t have diabetes) Each participant’s genome is surveyed for markers of genetic variation (SNPs) Groups compared to determine specific genetic differences between the two groups
GWAS approach Does not assume knowledge of genes/biology Investigate markers evenly spaced along genome Investigate association: Joint occurrence of two alleles (e.g. disease allele and marker allele) in a population > expected frequency
Why are GWAS now feasible? SNP identification efforts => more SNPs in databases. Understanding of linkage disequilibrium in the human genome (HapMap project) => fewer “tagSNPs” to genotype. Lower cost of genotyping platforms.
Products now use >1 million SNPs!
Pairwise tagging [Figure: haplotypes across six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C), with SNPs in high r² grouped together] Tags: SNP 1, SNP 3, SNP 6 (3 in total). Test for association: SNP 1, SNP 3, SNP 6. After Carlson et al. (2004) AJHG 74:106
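Pairwise tagging can be sketched as a greedy selection on r². The version below is a simplified illustration, not the published Carlson et al. algorithm. The haplotypes are hypothetical, coded 0/1 per allele, and chosen so that SNPs 1+2, 3+5, and 4+6 are perfectly correlated. The 0.8 threshold and the function names are also assumptions.

```python
def r_squared(haplotypes, i, j):
    """r^2 between SNPs i and j: D^2 / (p_i (1 - p_i) p_j (1 - p_j))."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else d * d / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Repeatedly pick the SNP that captures the most still-untagged SNPs."""
    m = len(haplotypes[0])
    captures = {
        i: {j for j in range(m)
            if j == i or r_squared(haplotypes, i, j) >= threshold}
        for i in range(m)
    }
    untagged = set(range(m))
    tags = []
    while untagged:
        # Ties are broken in favour of the lowest SNP index.
        best = max(range(m), key=lambda i: (len(captures[i] & untagged), -i))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical haplotypes (rows) over six SNPs (columns), two copies of each.
haplotypes = 2 * [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 1, 0),
    (0, 0, 1, 1, 1, 1),
]
tag_snps = [t + 1 for t in greedy_tags(haplotypes)]  # report 1-based SNP numbers
print("tag SNPs:", tag_snps)
```

With these data the sketch selects three tags. It picks SNP 4 where the slide lists SNP 6, which is equivalent here because the two are in perfect LD.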
Use of haplotypes can improve genotyping efficiency [Figure: haplotypes across six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C)] Tags: SNP 1, SNP 3 (2 in total). Test for association: SNP 1 captures 1+2; SNP 3 captures 3+5; “AG” haplotype captures SNP 4+6
Efficiency and power [Figure: relative power (%) vs. average marker density (per kb), for tag SNPs and random SNPs; P.I.W. de Bakker et al. (2005) Nat Genet] ~300,000 tag SNPs needed to cover common variation in the whole genome in CEU
Genotyping platform: Illumina [Cost table, including the 370 Duo; prices not preserved]
Genotyping platform: Affymetrix
Completeness of dbSNP: the vast majority of common SNPs are contained in dbSNP or are highly correlated with a SNP in dbSNP [Figure panels: YRI, CEU, CHN+JPT] Nature 437,
Comparison of coverage Paul de Bakker, pers. comm.
Association Studies Detect genes/genomic regions associated with a disease through allelic associations in case-control studies –Causal variants are associated with disease phenotype –Linked neutral variants are associated with the disease phenotype through LD with the causal variant Younger disease variants (rarer variant) –LD around the variant is stronger = better power –Associated region containing variant is broad (low genome resolution) Older disease variants (common variant) –Weaker LD = worse power –Better association map resolution
Phenotype-genotype association Marker associated with disease could be: 1. False positive result (type 1 error) 2. Co-inherited with a true causative (functional) variant 3. A true functional or causative variant
Replication and follow-up Many analytical tests – high probability of false positives Replicate in additional studies (often requires cross- study collaboration) Map the causal variant –Denser marker map –Evidence for other variants in the same gene with (perhaps smaller) independent effects (allelic heterogeneity) –Haplotype analysis –Resequencing –Sequence / genome mapping bioinformatics to identify or predict genome features in the linkage disequilibrium region of the map SNP –Expression or reporter assays
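The multiple-testing problem can be made concrete with a Bonferroni correction. This is one standard approach, named here as an illustration rather than as the method the slides prescribe. The test count of 1,000,000 echoes the ">1 million SNPs" platforms mentioned earlier and is treated as a round, hypothetical number. The function names are also hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of null tests passing an uncorrected threshold alpha."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_snps = 1_000_000  # hypothetical number of SNPs tested
print(f"expected false positives at p < 0.05: {expected_false_positives(0.05, n_snps):,.0f}")
print(f"Bonferroni per-SNP threshold: {bonferroni_threshold(0.05, n_snps):.1e}")
```

Bonferroni treats the tests as independent, which is conservative when SNPs are in linkage disequilibrium. Replication in additional studies remains necessary whatever threshold is used.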
Common Disease Common Variant Hypothesis (CDCV) Genetic risk for common diseases (diabetes, CHD, hypertension, schizophrenia, asthma, ...) results from common variants/polymorphisms in multiple genes. The effect of each gene variant must be smaller than in monogenic disorders; otherwise the prevalence of the diseases would be very high. Since SNPs are a common mode of variation in the human genome, and coding SNPs lead to monogenic diseases, SNPs may be the variants that are associated with risk for common diseases.
Do the Common Disease Variants Code? Not necessarily (or usually?). Protein-coding SNPs (cSNPs) may disrupt protein fold, structure, or activity. Variants in Mendelian diseases with high penetrance (50-100% penetrance) often disrupt proteins. But common diseases are not penetrant to the same level - genetic odds ratios are modest: Prob(Disease | risk allele) ~ Prob(Not Disease | risk allele)
Early successes
Klein R et al. Complement factor H polymorphism in age-related macular degeneration. Science Apr 15; 308:
–96 cases and 50 controls; 116,204 SNPs
Maraganore et al. High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet Nov; 77:
–198,345 SNPs in 443 sibling pairs discordant for PD
–1,793 PD-associated SNPs (P<.01 in tier 1) and 300 genomic control SNPs in 332 matched case-unrelated control pairs
The future Identify additional genes in diverse populations. Identify causal variant/s in these genes. Determine function of novel genes, and function of causative variants. Explore gene x gene interactions (epistasis) and gene x environment interactions (e.g. physical activity, diet). Other technological advances: –Animal models of disease –Innovative imaging of target tissues –Functional approaches to gene expression profiling –Whole genome sequencing? Era of “personalized medicine” and/or prevention
END Questions?