Gene expression array and SNP array

Presentation transcript:

Gene expression array and SNP array
Microarrays: Introduction to gene expression arrays; Data pre-processing
SNPs: Copy number aberration; Loss of heterozygosity; Association studies

Basics of microarrays
An “old” technology: some predict microarrays will be replaced by deep sequencing. Currently much cheaper and faster than sequencing, and widely used.
Timeline of DNA microarray developments (http://www.microarraystation.com/dna-microarray-timeline/):
1991: Photolithographic printing (Affymetrix)
1994: First cDNA collections are developed at Stanford
1995: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
1996: Commercialization of arrays (Affymetrix)
1997: Genome-wide expression monitoring in S. cerevisiae (yeast)
2000: Portraits/signatures of cancer
2003: Introduction into clinical practices
2004: Whole human genome on one microarray
2005: First next-generation sequencing machine
2006: All exons measured on one microarray

Basics of microarrays Microarrays rely on complementary base pairing between the four nucleotides: A pairs with T, and C pairs with G. The DNA double helix is held together by the same pairing: http://content.answers.com/main/content/wp/en/f/f0/DNA_Overview.png

Basics of microarrays
TTAAGTCGTACCCGTGTACGGGCGC
AATTCAGCATGGGCACATGCCCGCG
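
The two 25-base sequences on this slide pair base by base. A minimal Python sketch of that check follows; the names `PAIRING` and `complement` are illustrative, and the comparison is position by position (strand orientation is ignored, as on the slide).

```python
# Pairing rule from the previous slide: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(PAIRING[base] for base in seq)


target = "TTAAGTCGTACCCGTGTACGGGCGC"
probe = "AATTCAGCATGGGCACATGCCCGCG"

# True when every base of the probe pairs with the base of the target above it.
print(complement(target) == probe)
```

A probe built this way hybridizes to its target, which is the property the array exploits.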

Basics of microarrays
Amplified DNA segments -> fluorescence labeling -> hybridization on the array -> reading by photo scanner -> digitizing into fluorescence values -> quantifying the amount of each target sequence
Two strategies:
(1) One sample on each array. The amount is calculated from the spot intensity.
(2) Two samples, differentially labeled, on each array. The relative amount is given by the ratio between the two fluorescence intensities.

Gene expression arrays
Figure: a gene on the DNA (exons, introns, start codon; 2 copies) is transcribed into mRNA (with poly-A tail; multiple copies), which is translated into protein (multiple copies).
Protein: the amount of these guys matters! But they are hard to measure.
mRNA: the amount of these guys is easy to measure. And it is positively correlated with the protein amount!

Gene expression array --- affymetrix The Affymetrix platform is one of the most widely used. http://www.affymetrix.com/

Gene expression arrays -- Affy
Here we use the U133 system for illustration.
Some 20 probes per gene, selected from the 3’ end of the gene sequence.
Not necessarily evenly spaced: sequence properties matter.
The probes are located at random locations on the chip.
TTAAGTCGTACCCGTGTACGGGCGC   Target sequence
AATTCAGCATGGGCACATGCCCGCG   Perfect match (PM) probe
AATTCAGCATGGACACATGCCCGCG   Mis-match (MM) probe
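
To see how the MM probe differs from the PM probe, the two sequences from this slide can be compared position by position. This is a small sketch; `mismatch_positions` is an illustrative helper, not part of any Affymetrix software.

```python
pm = "AATTCAGCATGGGCACATGCCCGCG"  # perfect match probe from the slide
mm = "AATTCAGCATGGACACATGCCCGCG"  # mis-match probe from the slide


def mismatch_positions(a: str, b: str) -> list:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(len(pm), mismatch_positions(pm, mm))  # prints: 25 [13]
```

The single difference sits at position 13, the middle base of the 25-mer.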

Gene expression array - affy The hope was that mismatch probes would not bind the target sequence, so that the MM signal would measure only non-specific binding. http://www.affymetrix.com/

Gene expression array --- affy http://www.affymetrix.com/

Microarray data We are going to focus on pre-processing for now. Downstream analyses are more in the realm of traditional statistics (multiple testing, clustering, classification, ...) and are common across different high-throughput techniques.

Microarray data
Issues:
(1) Background level variation, caused by variations in overall RNA concentration in the sample, the image reader, etc.
(2) Within every probeset, each probe has a different sensitivity/specificity, caused by cross-hybridization, different chemical properties, etc.
(3) Across chips, the fluorescence intensity-concentration response curve can be different, caused by variations in sample processing, the image reader, etc.

Affy data --- general strategy
(1) Background correction (within chip)
(2) Normalization (across chips)
(3) Probe-set level expression value (within chip)
(4) Presence/absence call (within chip)
(5) Probeset-level statistical analysis (combining chips)

Affy data --- general strategy
There are many processing methods. The most popular include:
MAS 5.0 (Affymetrix): Flawed, but it comes with the Affymetrix software and is thus widely used by non-experts.
dChip (Cheng Li & Wing Wong): Good performance and versatile. Stand-alone Windows application. Can handle arrays other than expression arrays.
RMA (Rafael Irizarry et al.): Good performance. Easily used in R/Bioconductor.

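Affy data --- RMA Background correction
For each array, assumes that the observed PM signal is the sum of a true signal, exponentially distributed with rate lambda, and a background term, normally distributed with mean miu and standard deviation sigma.
Figure: example densities for lambda=1, miu=1, sigma=1 and for lambda=5, miu=1, sigma=1.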

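Affy data --- RMA Background correction
For each array, estimate the parameters from the PM signal distribution:
(1) Find the overall mode by kernel density estimation.
(2) Find miu and sigma from the PM values lower than the overall mode (sample mean and sd).
(3) Find lambda from the PM values higher than the overall mode: 1/(sample mean minus the overall mode).
Then adjust the PM readings (s is the PM signal; lambda is written as alpha in this expression). The standard RMA form of the adjustment is
$$ E[\text{signal} \mid s] = a + b\,\frac{\phi(a/b) - \phi((s-a)/b)}{\Phi(a/b) + \Phi((s-a)/b) - 1}, \qquad a = s - \mu - \sigma^2 \alpha, \quad b = \sigma, $$
where $\mu$ and $\sigma$ are miu and sigma, and $\phi$ and $\Phi$ are the standard normal density and distribution function.
See the derivation here: http://www.biochem.ucl.ac.uk/~harry/MAD/rma_bg.pdf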
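
The estimation steps and the adjustment can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the Bioconductor implementation: the function names, the grid-based kernel density estimate and its rule-of-thumb bandwidth are all choices made here, and `adjust` uses the standard RMA closed form with a = s - miu - sigma^2 * alpha and b = sigma. The simulated PM values (normal background plus exponential signal) are made up for the demonstration.

```python
import math
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def kde_mode(values, grid_size=200):
    """Location of the peak of a Gaussian kernel density estimate on a grid."""
    n = len(values)
    # Rule-of-thumb bandwidth: an assumption of this sketch, not from the slides.
    bandwidth = 1.06 * stdev(values) * n ** (-0.2)
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for i in range(grid_size):
        x = lo + (hi - lo) * i / (grid_size - 1)
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def estimate_parameters(pm):
    """Estimate (miu, sigma, alpha) following the three steps on the slide."""
    mode = kde_mode(pm)
    lower = [v for v in pm if v < mode]
    upper = [v for v in pm if v > mode]
    mu = mean(lower)                    # background mean from values below the mode
    sigma = stdev(lower)                # background sd from values below the mode
    alpha = 1.0 / (mean(upper) - mode)  # exponential rate from values above the mode
    return mu, sigma, alpha


def adjust(s, mu, sigma, alpha):
    """Expected true signal given the observed PM signal s."""
    a = s - mu - sigma ** 2 * alpha
    b = sigma
    numerator = STD_NORMAL.pdf(a / b) - STD_NORMAL.pdf((s - a) / b)
    denominator = STD_NORMAL.cdf(a / b) + STD_NORMAL.cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator


# Simulated chip: background N(100, 20) plus exponential signal with mean 100.
random.seed(0)
pm_values = [random.gauss(100, 20) + random.expovariate(0.01) for _ in range(500)]
mu_hat, sigma_hat, alpha_hat = estimate_parameters(pm_values)
corrected = [adjust(s, mu_hat, sigma_hat, alpha_hat) for s in pm_values]
print(mu_hat, sigma_hat, alpha_hat, min(corrected))
```

For readings far above the background the adjustment is close to subtracting miu + sigma^2 * alpha, while readings inside the background range are pulled to small positive values instead of going negative.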

Affy data --- normalization
*** This is also relevant to other array platforms!
Goal: to reduce chip effects, including non-linear effects.
Difficulty: the sample is different for each chip, so we can’t match a gene on chip A to the same gene on chip B and expect them to have the same intensity.
Figure (panels: PM, MM).
Instead, assumptions are made about the overall distribution of the signals on each chip. For example: some house-keeping genes don’t change; the overall distribution of concentrations doesn’t change; ...

Affy data --- normalization
Quantile normalization: match the quantiles between two chips. Assumes that the distribution of gene abundances is the same between samples.
$$ x_{\text{norm}} = F_2^{-1}(F_1(x)) $$
$x$: value in the chip to be normalized
$F_1$: distribution function in the chip to be normalized
$F_2$: distribution function in the reference chip
Nature Protocols 2, 2958 - 2974 (2007)
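
With empirical distribution functions, the mapping above amounts to replacing the k-th smallest value on the chip by the k-th smallest value on the reference chip. A minimal sketch, assuming both chips have the same number of probes and ignoring special handling of ties (`quantile_normalize` is an illustrative name):

```python
def quantile_normalize(chip, reference):
    """Map each value on `chip` to the reference value of the same rank."""
    if len(chip) != len(reference):
        raise ValueError("this sketch assumes equally sized chips")
    reference_sorted = sorted(reference)
    # Indices of `chip` ordered from smallest to largest value.
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    normalized = [0.0] * len(chip)
    for rank, index in enumerate(order):
        normalized[index] = reference_sorted[rank]
    return normalized


print(quantile_normalize([5.0, 1.0, 3.0], [10.0, 20.0, 30.0]))  # [30.0, 10.0, 20.0]
```

After normalization the chip has exactly the same set of values as the reference; only the assignment of values to probes comes from the chip itself.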

Affy data --- RMA summary
Model fitting: median polish (robust against outliers).
Alternately remove the row medians and the column medians until convergence.
What remains is the residual. After subtracting the residual from the original table, the row and column medians of the fitted table are the estimates of the effects.

Affy data --- RMA summary
Worked example (figure): remove the row medians, then remove the column medians, and repeat. After the final sweep the procedure has converged; the remaining table is the residual.

Affy data ---- rma summary * This reflects the assumption that probe effects have median zero.
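
The sweeps described on these slides can be sketched as follows. This is a generic Tukey median polish written for illustration, assuming rows are probes and columns are chips, with log-scale intensities as input; the function name and the bookkeeping of an overall term are choices made here. Row effects are re-centered so that their median is zero, matching the assumption noted above.

```python
from statistics import median


def median_polish(table, max_iter=20, tol=1e-9):
    """Fit table[i][j] ~ overall + row_effect[i] + col_effect[j] by median sweeps."""
    residual = [list(row) for row in table]
    n_rows, n_cols = len(residual), len(residual[0])
    overall = 0.0
    row_effect = [0.0] * n_rows
    col_effect = [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and move them into the row effects.
        row_med = [median(row) for row in residual]
        for i in range(n_rows):
            row_effect[i] += row_med[i]
            residual[i] = [v - row_med[i] for v in residual[i]]
        shift = median(col_effect)
        col_effect = [c - shift for c in col_effect]
        overall += shift
        # Remove column medians and move them into the column effects.
        col_med = [median(residual[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_effect[j] += col_med[j]
            for i in range(n_rows):
                residual[i][j] -= col_med[j]
        shift = median(row_effect)
        row_effect = [r - shift for r in row_effect]
        overall += shift
        # Converged when a full pass removes (almost) nothing.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_effect, col_effect, residual


# Purely additive toy table: 10 + probe effect (-1, 0, 1) + chip effect (-2, 0, 2).
toy = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
overall, probe_effects, chip_effects, resid = median_polish(toy)
print(overall, probe_effects, chip_effects)
```

Under this layout the expression value for chip j would be taken as `overall + chip_effects[j]`.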

SNP Variations in DNA sequence. Single Nucleotide Polymorphism (SNP) --- a single letter change in the DNA. Occurs every few hundred bases. Each form is called an “allele”. Almost all SNPs have only two alleles. Allele frequencies are often different between ethnic groups. http://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Dna-SNP.svg/180px-Dna-SNP.svg.png

Correlations between SNPs
Why measure the SNP alleles? DNA changes in two ways during evolution:
(1) Point mutation -> SNPs
(2) Recombination. This happens in large segments -> alleles of adjacent SNPs are highly dependent.
Haplotype: a group of alleles linked closely enough to be inherited mostly as a unit.
http://www.evolutionpages.com/images/crossing_over.gif
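
One common way to put a number on the dependence between the alleles of two SNPs is the squared correlation r^2 from linkage disequilibrium analysis. The slides do not define it, so the sketch below is an added illustration: the function name, the input layout (one allele pair per chromosome) and the toy data are assumptions.

```python
def ld_r_squared(haplotypes):
    """Squared allelic correlation between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    one pair per observed chromosome. Both SNPs must be polymorphic.
    """
    n = len(haplotypes)
    ref1, ref2 = haplotypes[0]  # use the first observed alleles as reference
    p1 = sum(a == ref1 for a, _ in haplotypes) / n
    p2 = sum(b == ref2 for _, b in haplotypes) / n
    p12 = sum((a, b) == (ref1, ref2) for a, b in haplotypes) / n
    d = p12 - p1 * p2  # deviation from independence
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))


# Alleles always inherited together: complete dependence.
linked = [("A", "G")] * 5 + [("C", "T")] * 5
# All four combinations equally frequent: independence.
unlinked = [("A", "G"), ("A", "T"), ("C", "G"), ("C", "T")]
print(ld_r_squared(linked), ld_r_squared(unlinked))  # 1.0 0.0
```

Values near 1 mean that one SNP's allele almost determines its neighbour's allele, so a nearby marker can stand in for the other.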

Why SNP?
From the homepage of the International HapMap Project: http://www.hapmap.org/originhaplotype.html.en
Figure 1: This diagram shows two ancestral chromosomes being scrambled through recombination over many generations to yield different descendant chromosomes. If a genetic variant marked by the A on the ancestral chromosome increases the risk of a particular disease, the two individuals in the current generation who inherit that part of the ancestral chromosome will be at increased risk. Adjacent to the variant marked by the A are many SNPs that can be used to identify the location of the variant.

Why SNP?
Nature Genetics 26, 151 - 157 (2000)
Figure 1. Schematic model of trait aetiology. The phenotype under study, Ph, is influenced by diverse genetic, environmental and cultural factors (with interactions indicated in simplified form). Genetic factors may include many loci of small or large effect, GPi, and polygenic background. Marker genotypes, Gx, are near to (and hopefully correlated with) genetic factor, Gp, that affects the phenotype. Genetic epidemiology tries to correlate Gx with Ph to localize Gp. Above the diagram, the horizontal lines represent different copies of a chromosome; vertical hash marks show marker loci in and around the gene, Gp, affecting the trait. The red Pi are the chromosomal locations of aetiologically relevant variants, relative to Ph.
Slide annotations on the figure: “SNPs”; “The gene deciding phenotype”.

SNP array The SNP array Affymetrix.com

SNP array The SNP array 40 probes per SNP (20 for forward strand and 20 for reverse strand.) PM/MM strategy. Data summary (generating AA/AB/BB calls) omitted here. Affymetrix.com

SNP array
SNP array -> genotype calls -> association analysis, linkage analysis
SNP array -> genotype calls -> loss of heterozygosity
SNP array -> signal strength -> copy number aberration

CNA --- Background
Copy Number Aberration (CNA):
  A form of chromosomal aberration.
  Deviation from the regular 2 copies for some segments of the chromosomes.
  One of the key characteristics of cancer.
CNA in cancer:
  Reduces the copy number of tumor-suppressor genes.
  Increases the copy number of oncogenes.
  Possibly related to metastasis.

CNA --- the statistician’s task High density arrays allow us to identify “focused CNA”: copy number change in small DNA segments. With the high per-probeset noise, how to achieve high sensitivity AND specificity?

CNA – maximizing sensitivity/specificity
Two approaches that complement each other:
(1) Reducing noise at the single probeset level: based on dose-response (Huang et al., 2006); based on sequence properties (Nannya et al., 2005).
(2) Segmentation methods: smoothing; Hidden Markov Model-based methods; Circular Binary Segmentation; ...
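
As an illustration of the simplest item in the segmentation list, smoothing, the sketch below runs a median filter along the chromosome and then labels each probeset. It is not one of the cited methods: the window size, the thresholds of +/-0.3 on the log2 ratio, and the toy data are assumptions chosen only to show how smoothing suppresses a single noisy probeset.

```python
from statistics import median


def running_median(values, window=3):
    """Median of each value and its neighbours within `window` positions."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def call_states(smoothed, gain=0.3, loss=-0.3):
    """Label each position from its smoothed log2 ratio (thresholds are hypothetical)."""
    labels = []
    for value in smoothed:
        if value >= gain:
            labels.append("amplified")
        elif value <= loss:
            labels.append("deleted")
        else:
            labels.append("normal")
    return labels


# Toy log2(tumor/normal) ratios ordered along a chromosome:
# six normal probesets (one noisy outlier, 0.9) followed by six amplified ones.
log_ratios = [0.0, 0.1, -0.1, 0.9, 0.0, 0.0, 0.6, 0.5, 0.7, 0.6, 0.6, 0.5]
calls = call_states(running_median(log_ratios))
print(calls)
```

The outlier at the fourth position is called normal after smoothing, while the contiguous run of elevated ratios is kept as an amplified segment.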

HMM data segmentation Figure: segments assigned to the hidden states Amplified, Normal and Deleted. Fridlyand et al. Journal of Multivariate Analysis, June 2004, V. 90, pp. 132-153

Forward-backward fragment assembling

Some examples: Top: model cell line, 3-copy segment in chromosome 9. Bottom: cancer sample.

LOH Loss of Heterozygosity (LOH) Happens in segments of DNA. Keith W. Brown and Karim T.A. Malik, 2001, Expert Reviews in Molecular Medicine

LOH On SNP array, LOH will yield identical calls (AA or BB, rather than AB) for a number of consecutive SNPs. Discov Med. 2011 Jul;12(62):25-32.
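
The signature described above, a stretch of consecutive homozygous calls, can be located with a simple scan. This is only a sketch: `homozygous_runs`, the minimum run length of 5 and the toy genotype calls are assumptions, and a real analysis would also compare against a matched normal sample, since homozygous stretches occur by chance.

```python
def homozygous_runs(calls, min_len=5):
    """Return (start, end) index pairs, end exclusive, of runs of AA/BB calls."""
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call in ("AA", "BB"):
            if start is None:
                start = i  # a new homozygous run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(calls) - start >= min_len:
        runs.append((start, len(calls)))
    return runs


# Toy genotype calls for consecutive SNPs along one chromosome.
genotypes = ["AB", "AA", "AB", "AA", "BB", "AA", "AA", "BB", "AB", "BB"]
print(homozygous_runs(genotypes))  # [(3, 8)]
```

Only the five consecutive homozygous calls at positions 3 to 7 are reported; the isolated homozygous calls are too short to count.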

GWAS http://www.mpg.de/10680/Modern_psychiatry © Pasieka, Science Photo Library

GWAS Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer Nature Genetics 41, 986 - 990 (2009) 

GWAS