Genome Variations & GWAS

I519 Introduction to Bioinformatics, 2012

Genome variations
- Underlie phenotypic differences
- Cause inherited diseases
- Sequence variations can be used for gene mapping, defining population structure, and functional studies

1000 Genomes Project
An international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies. The genomes of about 2,500 unidentified people from about 25 populations around the world will be sequenced using next-generation sequencing technologies. Results of the pilot phase of the project were published in Nature 467, 1061–1073 (2010).

How do we find sequence variations?
- Look at multiple sequences from the same genome region
- Use base quality values to decide whether mismatches are true polymorphisms or sequencing errors

Automated polymorphism discovery
- PolyBayes (Ref: A general approach to single-nucleotide polymorphism discovery, Marth et al., Nature Genetics, 1999): determines whether a sequence difference is due to sequencing error or is a real SNP by using base quality values in a rigorous Bayesian scheme, so that sequences of arbitrary quality can be compared.
- PyroBayes (Ref: An improved base-caller for SNP discovery in pyrosequences, Nature Methods 2008;5:179-81)
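
To make the idea of weighing mismatches by base quality concrete, here is a minimal sketch. It is not the PolyBayes algorithm: it is a toy two-hypothesis comparison (homozygous reference vs. heterozygous) for a single site. It assumes the standard Phred convention that a quality Q corresponds to an error probability of 10^(-Q/10); the function names and the prior value of 0.001 are hypothetical choices made for illustration.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_likelihood(base: str, q: float, allele: str) -> float:
    """P(observed base | true allele), spreading errors evenly over the 3 other bases."""
    e = phred_to_error_prob(q)
    return 1 - e if base == allele else e / 3


def posterior_het(reads, ref: str, alt: str, prior_het: float = 0.001) -> float:
    """Posterior probability that a site is heterozygous ref/alt rather than homozygous ref.

    reads: list of (base, phred_quality) pairs covering one site.
    prior_het: assumed prior probability of a heterozygous site (illustrative value).
    """
    like_hom = 1.0
    like_het = 1.0
    for base, q in reads:
        p_ref = read_likelihood(base, q, ref)
        p_alt = read_likelihood(base, q, alt)
        like_hom *= p_ref                      # every read came from the ref allele
        like_het *= 0.5 * p_ref + 0.5 * p_alt  # read came from either allele
    numerator = prior_het * like_het
    return numerator / (numerator + (1 - prior_het) * like_hom)


# Five high-quality reads of each base: strong evidence for a real polymorphism.
balanced = [("A", 30)] * 5 + [("G", 30)] * 5
# A single low-quality mismatch among nine good reference reads: likely an error.
one_poor_mismatch = [("A", 30)] * 9 + [("G", 5)]

print(posterior_het(balanced, ref="A", alt="G"))
print(posterior_het(one_poor_mismatch, ref="A", alt="G"))
```

The same mismatch counts for much less when its quality value is low, which is the behaviour the slide describes.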

SNP functional categories
- Coding nonsynonymous: missense, nonsense, frameshift
- Coding synonymous
- Intronic
- Splice site
- mRNA UTR: 5' UTR or 3' UTR
- (Gene) locus region: 5' or 3' to the gene; 'near gene' usually means within ~2000 bp of the gene
- Genomic/extragenic: distant from any gene

SNP nomenclature
The Human Genome Variation Society (http://www.hgvs.org/mutnomen/recs.html) has proposed guidelines for SNP nomenclature, but at the moment there is minimal consistency: different sources refer to the same SNP in different ways. While dbSNP identifiers (rs#12345678) are becoming common, they are not required of publishing authors and are not used in all cases.

SNPs at base-pair level
The base-pair change is given in various forms: A/C, T→G, C>T, 432G>C, T73C
The HGVS nomenclature recommendations:
- "c." for a coding DNA sequence (like c.76A>T)
- "g." for a genomic sequence (like g.476A>T)
- "m." for a mitochondrial sequence (like m.8993T>C)
- "r." for an RNA sequence (like r.76a>u)
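
As an illustration of the prefix convention, the sketch below parses only the simplest HGVS-style names: a single-base substitution at a plain numeric position, as in the four examples on this slide. The function name and the regular expression are assumptions made for this example; real HGVS names cover many more cases (intronic offsets, indels, protein-level changes) that this sketch deliberately rejects.

```python
import re

# Sequence types named on the slide, keyed by their HGVS prefix letter.
HGVS_PREFIXES = {
    "c": "coding DNA",
    "g": "genomic",
    "m": "mitochondrial",
    "r": "RNA",
}

# prefix, ".", position, reference base, ">", new base (RNA uses lowercase a/c/g/u)
_SUBSTITUTION = re.compile(r"^([cgmr])\.(\d+)([ACGTacgu])>([ACGTacgu])$")


def parse_simple_substitution(name: str):
    """Return (sequence type, position, ref base, alt base), or None if not matched."""
    match = _SUBSTITUTION.match(name.strip())
    if match is None:
        return None
    prefix, position, ref, alt = match.groups()
    return HGVS_PREFIXES[prefix], int(position), ref, alt


for example in ["c.76A>T", "g.476A>T", "m.8993T>C", "r.76a>u", "T73C"]:
    print(example, "->", parse_simple_substitution(example))
```

Informal forms such as T73C do not carry a sequence-type prefix, so the parser returns None for them rather than guessing.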

dbSNP
SNP database from NCBI; build 130 contains 63,751,769 RefSNP clusters (19,576,037 validated).
dbSNP contains:
- Single nucleotide substitutions
- Small insertion/deletion polymorphisms
- Microsatellite repeats

dbSNP content
The SNP database has two major classes of content:
- Submitted data, i.e., original observations of sequence variation: submitted SNPs (ss) with ss# (e.g., ss5586300)
- Computed/curated data: reference SNP clusters (RefSNP) with rs# (e.g., rs25)
RefSNP clusters define a non-redundant set of SNPs, and a RefSNP cluster may contain multiple submitted SNPs.

Reference SNP clusters
- RefSNP clusters are computer-generated and curated by NCBI staff.
- RefSNP clusters define a non-redundant set of SNPs.
- Every SNP submitted by a researcher is given a submitter SNP number (ss#); redundant (repeated) submitter SNPs are then combined into a RefSNP cluster record with a unique rs#.
- A RefSNP cluster may contain multiple submitted SNPs.
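
The relationship between submissions and clusters can be pictured as a simple grouping step. The sketch below is a toy model, not dbSNP's actual clustering pipeline: it assumes that two submissions are redundant when they map to the same chromosome and position, and all identifiers and coordinates in it are made up for illustration.

```python
from collections import defaultdict


def cluster_submissions(submissions):
    """Group submitted SNPs by mapped location.

    submissions: iterable of (ss_id, chromosome, position, alleles) tuples.
    Returns a dict mapping (chromosome, position) to the ss IDs and alleles seen there.
    """
    clusters = defaultdict(lambda: {"ss_ids": [], "alleles": set()})
    for ss_id, chromosome, position, alleles in submissions:
        cluster = clusters[(chromosome, position)]
        cluster["ss_ids"].append(ss_id)
        cluster["alleles"].update(alleles)
    return dict(clusters)


# Hypothetical submissions: the first two describe the same site.
submissions = [
    ("ss_a", "chr1", 1000, ("A", "G")),
    ("ss_b", "chr1", 1000, ("A", "G")),
    ("ss_c", "chr2", 500, ("C", "T")),
]

clusters = cluster_submissions(submissions)
for location, cluster in clusters.items():
    print(location, cluster["ss_ids"], sorted(cluster["alleles"]))
```

Three submissions collapse into two non-redundant records, mirroring how several ss# entries can sit under one rs#.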

An example

Promises of SNPs
- Each person's SNP pattern is unique.
- Most SNPs are not responsible for a disease state, but they can be located near a gene associated with a certain disease, so SNPs may serve as biological markers for pinpointing a disease on the human genome map.
- Association studies can detect differences between the SNP patterns of two groups (cases and controls), thereby indicating which pattern is most likely associated with the disease-causing gene.
- Using SNPs to study the genetics of drug response will help in the creation of "personalized" medicine.

Annotation of SNPs
- PolyPhen: a straightforward and reliable method based on physical and comparative considerations that estimates the impact of an amino acid replacement on the three-dimensional structure and function of the protein (~20% of common human non-synonymous SNPs are predicted to be deleterious). Ref: Human Molecular Genetics, 2001, Vol. 10, No. 6, 591-597
- SIFT: predicts amino acid changes that affect protein function (used to distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and in human polymorphisms). Ref: Nucleic Acids Research, 2003, Vol. 31, No. 13, 3812-3814
- Review: Next generation tools for the annotation of human SNPs, by Rachel Karchin, Briefings in Bioinformatics 2009 10(1):35-52

Genome-wide association study (GWAS)
A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease (http://www.genome.gov/20019523). If certain genetic variations are significantly more frequent in people with the disease than in people without it, the variations are said to be "associated" with the disease.
Refs:
- Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls (Nature 447, 661-678)
- Validating, augmenting and refining genome-wide association signals (Nature Reviews Genetics 10, 318-329, 2009)

The underlying rationale for GWAS
- The 'common disease, common variant' hypothesis posits that common diseases are attributable in part to allelic variants present in more than 1–5% of the population.
- SNP genotyping chips target common variants.
- Most common variants, individually or in combination, confer relatively small increments in risk (1.1–1.5-fold) and explain only a small proportion of heritability. E.g., at least 40 loci have been associated with human height (with an estimated heritability of about 80%), yet they explain only about 5% of phenotypic variance.
Ref: Finding the missing heritability of complex diseases; Nature 461, 747-753, 2009
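
Effect sizes of this kind are usually reported as odds ratios. The sketch below computes an allelic odds ratio from a 2x2 table of allele counts, together with an approximate 95% confidence interval based on the standard error of the log odds ratio (Woolf's method). The counts are invented for illustration and do not come from any study cited here.

```python
import math


def odds_ratio_ci(case_risk, case_other, control_risk, control_other, z=1.96):
    """Allelic odds ratio with an approximate confidence interval.

    Arguments are allele counts: risk and non-risk alleles in cases and in controls.
    z=1.96 gives an approximate 95% interval.
    """
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical allele counts: the risk allele is at 60% in cases and 50% in controls.
odds_ratio, lower, upper = odds_ratio_ci(600, 400, 500, 500)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these made-up counts the odds ratio is 1.5, at the upper end of the range quoted on the slide, and the interval shows how much uncertainty remains even with a thousand alleles per group.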

Explaining missing heritability
- Low-frequency and rare variants: variants of low minor allele frequency (MAF; 0.5% < MAF < 5%) or rare variants (MAF < 0.5%)
- Rare variants are not sufficiently frequent to be captured by current GWA genotyping arrays, and they do not carry sufficiently large effect sizes to be detected by classical linkage analysis in family studies.
- The primary technology for detecting rare SNPs is sequencing, which may target regions of interest or examine the whole genome.
- Structural variation, including copy number variants (CNVs, such as insertions and deletions) and copy-neutral variation (such as inversions and translocations)
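
The frequency classes above translate directly into a small helper. In this sketch the thresholds (0.5% and 5%) are the ones on the slide; how the exact boundary values are assigned and the function names are assumptions made for illustration.

```python
def minor_allele_frequency(count_a: int, count_b: int) -> float:
    """MAF of a biallelic variant from the observed counts of its two alleles."""
    return min(count_a, count_b) / (count_a + count_b)


def classify_maf(maf: float) -> str:
    """Label a variant by MAF: rare (< 0.5%), low-frequency (0.5% to < 5%), else common."""
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


# 10 copies of the minor allele among 1,000 chromosomes gives MAF = 1%.
maf = minor_allele_frequency(990, 10)
print(maf, classify_maf(maf))
```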

Figure: Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio). TA Manolio et al. Nature 461, 747-753 (2009) doi:10.1038/nature08494

Genome-wide significance
- Associations identified from a single GWA data set rarely have definitive statistical support.
- p values of < 10^-7 are required for genome-wide significance. A p value of approximately 10^-7 in the GWA setting corresponds to a p value of approximately 0.05 in a classical epidemiological study in which only one hypothesis is being tested.
- q value (Ref: Statistical significance for genomewide studies, PNAS 100(16):9440-9445, 2003): similar to the well-known p value, except that it is a measure of significance in terms of the false discovery rate rather than the false positive rate.
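
One way to see where a threshold near 10^-7 comes from is a Bonferroni correction: dividing 0.05 by roughly 500,000 tests gives 10^-7. The figure of 500,000 tests is an assumption used here for illustration, not a number stated on the slide. The second function is a Benjamini-Hochberg adjustment, shown as a simple stand-in for false-discovery-rate thinking; it is related to, but not the same as, the q-value procedure of the PNAS paper, which additionally estimates the proportion of true null hypotheses.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(bonferroni_threshold(0.05, 500_000))
print(benjamini_hochberg([0.001, 0.5]))
```

The Bonferroni line controls the chance of any false positive, while the adjusted values in the second line control the expected fraction of false positives among the reported hits.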

Analysis methods (a) is a baseline analysis; (b)-(e) apply further prior hypotheses

Chi-square statistic tests

Observed:
          AA      AB      BB      total
case      a       b       c       n_case
control   d       e       f       n_control
total     n_AA    n_AB    n_BB    n

Expected:
          AA                  AB                  BB
case      n_AA*n_case/n       n_AB*n_case/n       n_BB*n_case/n
control   n_AA*n_control/n    n_AB*n_control/n    n_BB*n_control/n

This 2x3 contingency table can be analyzed directly using an observed-expected test statistic, the sum over the six cells of (O_i - E_i)^2 / E_i, which under the null hypothesis of no association has a chi-squared distribution on two degrees of freedom. Here O_1 = a, E_1 = n_AA*n_case/n, and so on.
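
The genotype test can be written out in a few lines. This sketch uses only the standard library: for two degrees of freedom the chi-squared upper-tail probability has the closed form exp(-x/2), so no statistics package is needed. The genotype counts are invented for illustration, and the function assumes every genotype column has a non-zero total.

```python
import math


def chi_square_2x3(case_counts, control_counts):
    """Genotype association test on a 2x3 table of (AA, AB, BB) counts.

    Returns the chi-squared statistic and its p-value on two degrees of freedom.
    """
    rows = [case_counts, control_counts]
    row_totals = [sum(row) for row in rows]
    col_totals = [case_counts[j] + control_counts[j] for j in range(3)]
    n = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (rows[i][j] - expected) ** 2 / expected

    p_value = math.exp(-statistic / 2)  # exact upper tail for 2 degrees of freedom
    return statistic, p_value


# Hypothetical genotype counts (AA, AB, BB) for 100 cases and 100 controls.
statistic, p_value = chi_square_2x3([30, 50, 20], [50, 40, 10])
print(f"chi-squared = {statistic:.3f}, p = {p_value:.4f}")
```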

Population stratification
- Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry.
- Case-control association studies assume that any difference in SNP genotypes between cases and controls is due solely to their difference in disease status, not to differences in their genetic background.
- Potential population stratification needs to be corrected for in association studies.
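
A small numerical example shows why this matters. In the sketch below, all counts are invented: two subpopulations carry an allele at different frequencies (80% and 20%), and cases are drawn mostly from the first while controls are drawn mostly from the second. Within each subpopulation the allele has no relation to disease, yet pooling the samples produces a strong spurious association.

```python
def odds_ratio(case_a, case_b, control_a, control_b):
    """Allelic odds ratio from counts of allele A and allele B in cases and controls."""
    return (case_a * control_b) / (case_b * control_a)


# Hypothetical allele counts as (case_A, case_B, control_A, control_B).
# Subpopulation 1: allele A at 80% in both cases and controls; most cases come from here.
subpop_1 = (640, 160, 160, 40)
# Subpopulation 2: allele A at 20% in both cases and controls; most controls come from here.
subpop_2 = (40, 160, 160, 640)

# Pooling ignores ancestry and simply adds the counts cell by cell.
pooled = tuple(x + y for x, y in zip(subpop_1, subpop_2))

or_1 = odds_ratio(*subpop_1)
or_2 = odds_ratio(*subpop_2)
or_pooled = odds_ratio(*pooled)

print("within subpopulation 1:", or_1)
print("within subpopulation 2:", or_2)
print("pooled:", or_pooled)
```

The within-group odds ratios are both 1, while the pooled value is above 4, so the apparent signal reflects ancestry rather than disease status.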

GWAS vs genetic linkage method
- Genetic linkage combined with positional cloning has led to the discovery of gene mutations involved in monogenic diseases, such as cystic fibrosis and Huntington's disease. These mutations most likely alter the amino acid sequence of a protein.
- Most loci discovered through genome-wide association analysis do not map to amino acid changes in proteins (with a few important exceptions). Instead, they are predicted to affect gene expression.

Readings
- Bioinformatics challenges for genome-wide association studies
- Finding the missing heritability of complex diseases, Nature (2009) 461, 747-753