Gene expression array and SNP array

Gene expression array and SNP array
Microarrays
Introduction to gene expression arrays
Data pre-processing
SNPs
Copy number aberration
Loss of heterozygosity
Association studies

Basics of microarrays
An “old” technology: some predict microarrays will be replaced by deep sequencing.
Currently much cheaper and faster than sequencing; widely used.
Timeline of DNA microarray developments:
1991: Photolithographic printing (Affymetrix)
1994: First cDNA collections are developed at Stanford
1995: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
1996: Commercialization of arrays (Affymetrix)
1997: Genome-wide expression monitoring in S. cerevisiae (yeast)
2000: Portraits/signatures of cancer
2003: Introduction into clinical practice
2004: Whole human genome on one microarray
2005: First next-generation sequencing machine
2006: All exons measured on one microarray

Basics of microarrays Microarrays exploit the chemical binding (base pairing) between the four nucleotides: A --- T and C --- G. The double-stranded DNA structure is formed through this binding:

Basics of microarrays TTAAGTCGTACCCGTGTACGGGCGC
AATTCAGCATGGGCACATGCCCGCG
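
The pairing rule can be checked directly on the two strands shown above. The following is a minimal Python sketch; the names `PAIR` and `complement` are illustrative and not part of the slides.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Return the base-by-base complement of a DNA sequence (same reading order)."""
    return "".join(PAIR[base] for base in seq)


target = "TTAAGTCGTACCCGTGTACGGGCGC"
print(complement(target))  # AATTCAGCATGGGCACATGCCCGCG
```

The output is the second strand printed on the slide, which is also the probe sequence that would bind this target.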

Basics of microarrays
Workflow: amplified DNA segments -> fluorescence labeling -> hybridization on the array -> reading by photo scanner -> digitizing into fluorescence values -> quantifying the amount of each target sequence.
Two strategies:
(1) One sample on each array. The amount is calculated from the spot intensity.
(2) Two samples, differentially labeled, on each array. The relative amount is given by the ratio between the two fluorescence signals.
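
For strategy (2), the ratio is usually reported on a log2 scale so that a two-fold increase and a two-fold decrease are symmetric around zero. A minimal sketch follows; the function name and the intensity values are made up for illustration.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """Relative amount of a target on a two-sample array, on the log2 scale."""
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical spot: the sample channel is four times brighter than the reference.
print(log2_ratio(2000.0, 500.0))  # 2.0
```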

Gene expression arrays
[Figure: a gene on the DNA with start codon, exons, and introns; the mRNA with its poly-A tail; the protein.]
DNA (2 copies)
mRNA (multiple copies): the amount of these guys is easy to measure. And it is positively correlated with the protein amount!
Protein (multiple copies): the amount of these guys matters! But they are hard to measure.

Gene expression array --- Affymetrix
The Affymetrix platform is one of the most widely used.

Gene expression arrays -- Affy
Here we use the U133 system for illustration.
Some 20 probes per gene;
Selected from the 3’ end of the gene sequence;
Not necessarily evenly spaced --- sequence property matters;
The probes are located at random locations on the chip.
TTAAGTCGTACCCGTGTACGGGCGC   Target sequence
AATTCAGCATGGGCACATGCCCGCG   Perfect match (PM) probe
AATTCAGCATGGACACATGCCCGCG   Mis-match (MM) probe
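
To see where the MM probe differs from the PM probe, the two sequences above can be compared base by base. This is a small illustrative sketch; `mismatch_positions` is a hypothetical helper name.

```python
def mismatch_positions(pm, mm):
    """Return the 1-based positions at which two equal-length probes differ."""
    if len(pm) != len(mm):
        raise ValueError("probes must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b]


pm = "AATTCAGCATGGGCACATGCCCGCG"  # perfect match probe from the slide
mm = "AATTCAGCATGGACACATGCCCGCG"  # mis-match probe from the slide
print(len(pm), mismatch_positions(pm, mm))  # 25 [13]
```

The single changed base sits at position 13, the middle of the 25-base probe.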

Gene expression array - affy
The hope was that mismatch probes won’t bind the target sequence.

Gene expression array --- affy

Microarray data We are going to focus on pre-processing for now.
Downstream analyses are more in the realm of traditional statistics: multiple testing, clustering, classification…… They are common across different high-throughput techniques.

Microarray data Issues:
Background level variation, caused by variations in overall RNA concentration in the sample, the image reader, etc.
Within every probeset, each probe has a different sensitivity/specificity, caused by cross-hybridization, different chemical properties, etc.
Across chips, the fluorescence intensity-concentration response curve can differ, caused by variations in sample processing, the image reader, etc.

Affy data --- general strategy
(1) Background correction (within chip)
(2) Normalization (across chips)
(3) Probe-set level expression value (within chip)
(4) Presence/absence call (within chip)
(5) Probeset-level statistical analysis (combining chips)

Affy data --- general strategy
There are many processing methods. The most popular include:
MAS 5.0 (Affymetrix): flawed, but it comes with the Affymetrix software and is thus widely used by non-experts.
dChip (Cheng Li & Wing Wong): good performance and versatile. Stand-alone Windows application. Can handle arrays other than expression arrays.
RMA (Rafael Irizarry et al.): good performance. Easily used in R/Bioconductor.

Affy data --- RMA Background correction
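For each array, assumes that the observed PM signal is the sum of an exponentially distributed true signal (rate lambda) and a normally distributed background (mean miu, standard deviation sigma).
Example parameter settings shown: lambda=1, miu=1, sigma= ; lambda=5, miu=1, sigma=1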

Affy data --- RMA Background correction
For each array, estimate the parameters from the PM signal distribution:
Find the overall mode by kernel density estimation;
Find miu and sigma from the PM values lower than the overall mode (sample mean and sd);
Find lambda from the PM values higher than the overall mode (1/(sample mean minus the overall mode));
then adjust the PM readings (s is the PM signal; lambda is replaced by alpha in this expression):
See the derivation here:
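
The estimation steps above can be sketched in Python as follows. Several things here are assumptions rather than content from the slides: the adjustment uses the closed-form conditional mean commonly quoted for the exponential-plus-normal model, E[signal | s] = a + b [phi(a/b) - phi((s-a)/b)] / [Phi(a/b) + Phi((s-a)/b) - 1] with a = s - mu - sigma^2 alpha and b = sigma; the kernel bandwidth rule is an arbitrary choice; and all function names and the simulated data are hypothetical. Production implementations (for example in Bioconductor) differ in details.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def kde_mode(values, grid_size=200):
    """Locate the mode of a sample with a Gaussian kernel density estimate."""
    bandwidth = 1.06 * stdev(values) * len(values) ** (-0.2)  # assumed rule of thumb
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]

    def density(x):
        # Unnormalized density is enough for finding the maximum.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return max(grid, key=density)


def estimate_parameters(pm_values):
    """Follow the slide: mode by KDE, mu/sigma below the mode, alpha above it."""
    mode = kde_mode(pm_values)
    lower = [v for v in pm_values if v < mode]
    upper = [v for v in pm_values if v > mode]
    mu = mean(lower)
    sigma = stdev(lower)
    alpha = 1.0 / (mean(upper) - mode)
    return mu, sigma, alpha


def adjust(s, mu, sigma, alpha):
    """Background-adjusted signal: conditional mean of the true signal given s."""
    std_normal = NormalDist()
    a = s - mu - alpha * sigma ** 2
    b = sigma
    numerator = std_normal.pdf(a / b) - std_normal.pdf((s - a) / b)
    denominator = std_normal.cdf(a / b) + std_normal.cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator


# Simulated PM values: exponential signal plus normal background (made-up parameters).
rng = random.Random(0)
pm = [rng.expovariate(0.01) + rng.gauss(100.0, 20.0) for _ in range(300)]
mu_hat, sigma_hat, alpha_hat = estimate_parameters(pm)
adjusted = [adjust(s, mu_hat, sigma_hat, alpha_hat) for s in pm]
print(mu_hat, sigma_hat, alpha_hat)
```

A useful property of this adjustment is that the corrected value stays positive even when the raw reading is below the estimated background mean, so a later log transform remains defined.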

Affy data --- normalization
*** This is also relevant to other array platforms!
Goal: to reduce chip effects, including non-linear effects.
Difficulty: the sample is different for each chip. We can’t match a gene on chip A to the same gene on chip B and expect them to have the same intensity.
Assumptions are therefore made about the overall distribution of the signals on each chip. For example:
Some house-keeping genes don’t change;
The overall distribution of concentrations doesn’t change;
……

Affy data --- normalization
Quantile normalization --- match the quantiles between two chips. Assumes that the distribution of gene abundances is the same between samples.
$x_{norm} = F_2^{-1}(F_1(x))$, where
$x$: value in the chip to be normalized
$F_1$: distribution function in the chip to be normalized
$F_2$: distribution function in the reference chip
Nature Protocols 2, (2007)
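
The mapping F2^-1(F1(x)) can be illustrated with empirical distribution functions. This is a minimal sketch under simple assumptions (empirical CDFs, no special handling of ties); `quantile_normalize` and the toy values are hypothetical.

```python
from bisect import bisect_right


def quantile_normalize(chip, reference):
    """Map each value x on `chip` to F2^-1(F1(x)) using empirical distributions."""
    n, m = len(chip), len(reference)
    chip_sorted = sorted(chip)
    ref_sorted = sorted(reference)
    normalized = []
    for x in chip:
        # Empirical F1(x) = k / n, where k counts chip values <= x.
        k = bisect_right(chip_sorted, x)
        # Empirical inverse of F2: the smallest reference value whose cumulative
        # fraction reaches k / n. Integer arithmetic avoids rounding surprises.
        idx = -(-k * m // n) - 1
        normalized.append(ref_sorted[idx])
    return normalized


chip = [5.0, 2.0, 9.0]
reference = [10.0, 30.0, 20.0]
print(quantile_normalize(chip, reference))  # [20.0, 10.0, 30.0]
```

After the mapping, the chip has the same set of values as the reference while each probe keeps its rank within its own chip.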

Affy data --- RMA summary
Model-fitting: median polish (robust against outliers).
Alternately remove the row and column medians until convergence.
The remainder is the residual;
After subtracting the residual, the row and column medians are the estimates of the effects.

Affy data --- RMA summary
[Worked example: remove row median -> remove column median -> remove row median -> remove column median -> converged. What remains is the residual.]

* This reflects the assumption that probe effects have median zero.
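
The median polish procedure described above can be sketched in a few lines. This follows Tukey's generic algorithm; treating rows as probes and columns as chips, and all names and the toy table, are assumptions made for illustration.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-8):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping out medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    overall = 0.0
    for _ in range(max_iter):
        # Remove row medians from the residuals.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Remove column medians from the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        # Stop once a full sweep no longer changes anything.
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Toy log-intensity table: 3 probes (rows) by 3 chips (columns), one outlier.
table = [[2.0, 4.0, 6.0],
         [3.0, 5.0, 7.0],
         [4.0, 6.0, 20.0]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
```

In this layout the per-chip expression summary would be `overall + col_eff[j]`, and the outlying cell ends up in the residual instead of distorting the estimates.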

SNP Variations in DNA sequence.
Single Nucleotide Polymorphism (SNP) --- a single-letter change in the DNA.
Occurs every few hundred bases.
Each form is called an “allele”.
Almost all SNPs have only two alleles.
Allele frequencies often differ between ethnic groups.

Correlations between SNPs
Why measure the SNP alleles?
DNA changes in two ways during evolution:
Point mutation -> SNPs
Recombination: this happens in large segments. -> Alleles of adjacent SNPs are highly dependent.
Haplotype: a group of alleles linked closely enough to be inherited mostly as a unit.

Why SNP? This is on the homepage of the International HapMap Project
Figure 1: This diagram shows two ancestral chromosomes being scrambled through recombination over many generations to yield different descendant chromosomes. If a genetic variant marked by the A on the ancestral chromosome increases the risk of a particular disease, the two individuals in the current generation who inherit that part of the ancestral chromosome will be at increased risk. Adjacent to the variant marked by the A are many SNPs that can be used to identify the location of the variant.

Why SNP? Nature Genetics 26, 151 - 157 (2000) SNPs
Figure 1. Schematic model of trait aetiology. The phenotype under study, Ph, is influenced by diverse genetic, environmental and cultural factors (with interactions indicated in simplified form). Genetic factors may include many loci of small or large effect, GPi, and polygenic background. Marker genotypes, Gx, are near to (and hopefully correlated with) genetic factor, Gp, that affects the phenotype. Genetic epidemiology tries to correlate Gx with Ph to localize Gp. Above the diagram, the horizontal lines represent different copies of a chromosome; vertical hash marks show marker loci in and around the gene, Gp, affecting the trait. The red Pi are the chromosomal locations of aetiologically relevant variants, relative to Ph.
Slide annotation on Gp: the gene deciding phenotype

SNP array The SNP array Affymetrix.com

SNP array The SNP array
40 probes per SNP (20 for the forward strand and 20 for the reverse strand).
PM/MM strategy.
Data summary (generating AA/AB/BB calls) omitted here.
Affymetrix.com

SNP array
[Diagram: uses of SNP array output. Genotype calls: association analysis, linkage analysis, loss of heterozygosity. Signal strength: copy number aberration.]

CNA --- Background Copy Number Aberration (CNA):
A form of chromosomal aberration
Deviation from the regular 2 copies for some segments of the chromosomes
One of the key characteristics of cancer
CNA in cancer:
Reduces the copy number of tumor-suppressor genes
Increases the copy number of oncogenes
Possibly related to metastasis

CNA --- the statistician’s task
High density arrays allow us to identify “focused CNA”: copy number change in small DNA segments. With the high per-probeset noise, how to achieve high sensitivity AND specificity?

CNA – maximizing sensitivity/specificity
Two approaches that complement each other:
(1) Reducing noise at the single-probeset level:
Based on dose-response (Huang et al., 2006)
Based on sequence properties (Nannya et al., 2005)
(2) Segmentation methods:
Smoothing;
Hidden Markov Model-based methods;
Circular Binary Segmentation
……
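
As an illustration of the simplest item in the list, smoothing, a running median along the chromosome suppresses single-probeset spikes while keeping the edges of longer segments. This is only a sketch of the general idea, not one of the cited methods; the window size and the toy values are made up.

```python
from statistics import median


def running_median(values, window=5):
    """Smooth a sequence of per-probeset values ordered by chromosomal position."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


# Hypothetical log2 copy-number ratios: one noisy spike, then a gained segment.
log2_ratios = [0.0, 0.1, 2.5, -0.1, 0.0, 0.6, 0.5, 0.7, 0.6, 0.5]
print(running_median(log2_ratios, window=3))
```

A larger window removes more noise but can erase short, focused aberrations, which is the sensitivity/specificity trade-off raised above.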

HMM data segmentation
[Figure: segmentation into hidden states Amplified, Normal, Deleted]
Fridlyand et al. Journal of Multivariate Analysis, June 2004, V. 90, pp

Forward-backward fragment assembling

Some examples: Top: model cell line, 3-copy segment in chromosome 9
Bottom: Cancer sample

LOH Loss of Heterozygosity (LOH) Happens in segments of DNA.
Keith W. Brown and Karim T.A. Malik, 2001, Expert Reviews in Molecular Medicine

LOH On SNP array, LOH will yield identical calls (AA or BB, rather than AB) for a number of consecutive SNPs. Discov Med Jul;12(62):25-32.
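
The pattern described above, a stretch of consecutive homozygous calls, can be searched for with a simple scan. This is a naive sketch with a hypothetical function name and an arbitrary length threshold; homozygous runs also occur by chance, so in practice a candidate region is usually compared with a matched normal sample.

```python
def homozygous_runs(calls, min_length=5):
    """Return (start, end) index pairs, end exclusive, of runs of AA/BB calls."""
    runs = []
    start = None
    for i, call in enumerate(calls):
        if call in ("AA", "BB"):
            if start is None:
                start = i  # a new homozygous run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(calls) - start >= min_length:
        runs.append((start, len(calls)))
    return runs


# Hypothetical genotype calls for consecutive SNPs along a chromosome.
calls = ["AB", "AA", "BB", "AA", "AA", "BB", "AB", "AA", "AB"]
print(homozygous_runs(calls, min_length=4))  # [(1, 6)]
```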

GWAS Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer Nature Genetics 41, (2009)
