Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics


1 Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics kjarcher@vcu.edu

2 Microarrays A snapshot that captures the activity pattern of thousands of genes at once. Custom spotted arrays Affymetrix GeneChip

3 Spotted Microarray Process [Figure labels: CTRL (control sample), TEST (test sample)]

4 Affymetrix GeneChip® Probe Arrays Each probe cell or feature contains millions of copies of a specific oligonucleotide probe. Over 250,000 different probes complementary to genetic information of interest. [Figure labels: GeneChip Probe Array (1.28 cm); Hybridized Probe Cell (24 µm); oligonucleotide probe; single-stranded, fluorescently labeled DNA target; image of hybridized probe array]

5 Applications of microarrays Cancer research: Molecular characterization of tumors on a genomic scale; more reliable diagnosis and effective treatment of cancer Immunology: Study of host genomic responses to bacterial infections Model organisms: Multifactorial experiments monitoring expression response to different treatments and doses, over time or in different cell types etc.

6 Applications of Microarrays Compare mRNA transcript levels in different types of cells, i.e., vary –Tissue (liver vs. brain); –Treatment (Drugs A, B, and C); –State (tumor vs. normal); –Organism (yeast, different strains); –Timepoint; –etc.

7

8 Affymetrix Design PM: GCGCCGGCTGCAGGAGCAGGAGGAG MM: GCGCCGGCTGCACGAGCAGGAGGAG 11 – 20 probe pairs interrogate each gene

9 Image Analysis: Pixel Level Data 6 x 6 matrix of pixels for each PM and MM probe HG-U133A GeneChip

10 Expression Quantification PM: GCGCCGGCTGCAGGAGCAGGAGGAG MM: GCGCCGGCTGCACGAGCAGGAGGAG PM and MM intensities are combined to form an expression measure for the probe set (gene)

11 Expression Quantification Initially, Affymetrix signal was calculated as $\mathrm{AvgDiff}_A = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$, where j indexes the probe pairs for each probe set A. This is known as the “Average Difference” method. Problems: –Large variability in PM-MM –MM probes may be measuring signal for another gene/EST –PM-MM calculations are sometimes negative
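The Average Difference calculation can be sketched in a few lines of Python. This is only an illustration: the function name `average_difference` and the intensity values are made up for the example, and the sketch assumes every probe pair in the probe set is included.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for one probe set with 4 probe pairs.
pm = [1200.0, 950.0, 300.0, 410.0]
mm = [400.0, 500.0, 650.0, 380.0]
avg_diff = average_difference(pm, mm)

# A probe set whose MM probes are brighter than its PM probes
# yields a negative "expression" value.
negative_case = average_difference([100.0, 120.0], [150.0, 200.0])
```

The second call illustrates the last problem listed on the slide: nothing in the calculation prevents a negative result.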

12 Expression Quantification The mean of a random variable X is a measure of central location of the density of X. The variance of a random variable is a measure of spread or dispersion of the density of X. $\mathrm{Var}(X) = E[(X-\mu)^2] = E(X^2) - \mu^2$ Standard deviation $= \sqrt{\mathrm{Var}(X)} = \sigma$
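As a quick numerical check of the variance identity, the sketch below computes the variance both ways for a small set of values. It assumes the values are treated as an equally weighted population (dividing by n, not n - 1). The numbers are hypothetical PM - MM differences, not data from the slides.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance_definition(xs):
    """E[(X - mu)^2], treating xs as an equally weighted population."""
    mu = mean(xs)
    return mean([(x - mu) ** 2 for x in xs])


def variance_shortcut(xs):
    """E(X^2) - mu^2, the algebraically equivalent form."""
    mu = mean(xs)
    return mean([x * x for x in xs]) - mu ** 2


# Hypothetical PM - MM differences for one probe set.
diffs = [800.0, 450.0, -350.0, 30.0]
var_a = variance_definition(diffs)
var_b = variance_shortcut(diffs)
sd = math.sqrt(var_a)
```

The standard deviation here is large relative to the mean of the differences, which is the kind of PM - MM variability the Average Difference method suffers from.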

13 Expression Quantification Illustration: Average Difference.xls

14

15 Sources of Obscuring Variation in Microarray Measurements Sample handling (degree of physical manipulation, time from extirpation to freezing); Microarray manufacture; Sample processing (extraction procedure, RNA integrity & purity, RNA labeling); Processing differences (hybridization chambers, washing modules, scanners); Personnel differences; Random differences in signal intensity in a data set which covary with the biological process

16 Normalization The purpose of normalization is to remove experimental artifacts of no direct interest, that is, the removal of systematic effects other than differential expression. Normalization procedures often include –background subtraction, –detection of outliers, –and removal of variation due to differences in sample preparation, array differences, differences in dye labeling efficiencies, and scanning differences.
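The slides do not say which normalization method was used. As one simple illustration of removing an array-wide systematic effect, the sketch below rescales each array so that all arrays share a common median intensity. The function name `scale_normalize` and the data are hypothetical, and the approach assumes positive intensities and covers none of the other steps listed (background subtraction, outlier detection).

```python
from statistics import median


def scale_normalize(arrays):
    """Rescale each array so its median equals the mean of all array medians.

    Assumes every array has a positive median intensity.
    """
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Two hypothetical arrays with the same expression pattern;
# the second was scanned twice as bright.
arrays = [
    [100.0, 200.0, 400.0],
    [200.0, 400.0, 800.0],
]
normalized = scale_normalize(arrays)
```

After rescaling, the two arrays are identical, so the brightness difference (a systematic effect of no direct interest) no longer looks like differential expression.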

17 16 Replicate HG-U133A GeneChips, Before normalization

18 16 Replicate HG-U133A GeneChips, After normalization

19

20 Taxonomy of Microarray Data Analysis Methods Unsupervised Learning: The statistical analysis seeks to find structure in the data without knowledge of class labels. Supervised Learning: Class or group labels are known a priori and the goal of the statistical analysis pertains to identifying differentially expressed genes (AKA feature selection) or identifying combinations of genes that are predictive of class or group membership.

21 Unsupervised Learning Unsupervised learning or clustering involves the aggregation of samples into groups based on similarity of their respective expression patterns without knowledge of class labels. Examples of Unsupervised Learning methods include –Hierarchical clustering –k-means –k-medoids –Self Organizing Maps –Principal Components –Multidimensional Scaling
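To make the idea concrete, here is a minimal k-means sketch that groups samples by the similarity of their expression profiles without using class labels. It is a toy version under stated assumptions: Euclidean distance, centroids initialized at the first k samples (real implementations usually use random or smarter starts), and made-up expression values.

```python
import math


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Four hypothetical samples measured on three genes.
samples = [
    [1.0, 1.2, 0.9],
    [5.0, 5.1, 4.8],
    [1.1, 0.9, 1.0],
    [4.9, 5.2, 5.0],
]
labels, centroids = kmeans(samples, k=2)
```

Samples 1 and 3 end up in one cluster and samples 2 and 4 in the other, based only on their expression patterns.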

22 Supervised Learning Example methods for Class comparison/ Feature selection include –T-test / Wilcoxon rank sum test –F-test / Kruskal Wallis test –etc. Example methods for Class Prediction include –Weighted voting –K nearest neighbors –Compound Covariate Predictors –Classification trees –Support vector machines –etc.
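As an illustration of class comparison, the sketch below computes a two-sample t statistic for each gene and ranks genes by its absolute value. It uses the Welch (unequal variance) form, which is one variant of the t-test, and stops at the statistic without computing p-values. Gene names and expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log expression values: (tumor samples, normal samples).
expr = {
    "gene_A": ([8.1, 7.9, 8.3, 8.0], [5.0, 5.2, 4.9, 5.1]),
    "gene_B": ([6.0, 6.4, 5.8, 6.1], [6.1, 5.9, 6.3, 6.0]),
}
t_stats = {gene: welch_t(tumor, normal) for gene, (tumor, normal) in expr.items()}

# Genes with the largest |t| are the strongest candidates for
# differential expression between the two classes.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
```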

23 Supervised Learning: Class Prediction Risk of over-fitting the data: may have a perfect discriminator for the data set at hand but the same model may perform poorly on independent data sets. Most prediction methods are intended for large ‘n’ (samples) small ‘p’ (covariates) datasets. Process is to –Fit model –Check model adequacy –Make an inference

24 Class Prediction: Checking Model Adequacy Regardless of the algorithm used, it is essential to calculate an unbiased estimate of the true error rate once the prediction rule has been defined.

25 Class Prediction: Checking Model Adequacy In a data rich situation, – randomly divide the dataset into two parts, representing a training and test dataset. –Build the prediction algorithm using the training dataset –Once a final model has been developed, the prediction rule is applied to the test dataset to estimate the misclassification error
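The split-sample procedure above can be sketched as follows. The helper names `train_test_split` and `error_rate`, the 30% test fraction, and the sample identifiers are assumptions for illustration; the slides do not specify a split ratio.

```python
import random


def train_test_split(samples, labels, test_fraction=0.3, seed=0):
    """Randomly divide the data into training and test parts."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(ids, seq):
        return [seq[i] for i in ids]

    return (pick(train_idx, samples), pick(train_idx, labels),
            pick(test_idx, samples), pick(test_idx, labels))


def error_rate(predicted, actual):
    """Fraction of test samples whose predicted class is wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


# Ten hypothetical arrays with known class labels.
samples = [f"array_{i}" for i in range(10)]
labels = ["tumor"] * 5 + ["normal"] * 5
train_X, train_y, test_X, test_y = train_test_split(samples, labels)

# Example: 1 wrong prediction out of 4 gives an error of 0.25.
example_error = error_rate(["tumor", "normal", "tumor", "normal"],
                           ["tumor", "tumor", "tumor", "normal"])
```

The key point is that the test part is set aside before any model building, and is used only once, to score the final prediction rule.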

26 Class Prediction: Checking Model Adequacy For small sample sizes, withholding a large portion of the data for validation purposes may limit the ability to develop a prediction rule. Therefore, use cross-validation techniques to assess the error.

27 Class Prediction: Checking Model Adequacy K-fold cross-validation requires one to randomly split the dataset into K equally sized groups. Thereafter, the model is fit to K-1 parts of the data and the generalization error is calculated using the Kth remaining part of the data. This procedure is repeated so that the generalization error is estimated for each of the K parts of the data, providing an overall estimate of the generalization error and its associated standard error.

28 Class Prediction: Checking Model Adequacy [Figure: the data split into 10 groups, numbered 1 to 10] Leave out data in group 3. Fit the model to the data in groups 1 – 2, 4 – 10 (learning dataset). Calculate the error using observations in group 3 as the test dataset. Do this for each of the 10 partitions.
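The leave-one-group-out loop can be sketched as follows. This is a minimal illustration, not the method used in the slides: the nearest-centroid classifier is a stand-in for whatever prediction rule is being evaluated, all function names and data are hypothetical, and the code assumes at least as many samples as folds.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k groups of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid_fit(X, y):
    """Stand-in classifier: one mean expression profile per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def cross_validate(X, y, k=10, seed=0):
    """Return the mean fold error and its standard error across the k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        # Fit on the other k-1 groups (the learning dataset).
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [l for i, l in enumerate(y) if i not in held_out]
        model = nearest_centroid_fit(train_X, train_y)
        # Score on the group that was left out.
        wrong = sum(nearest_centroid_predict(model, X[i]) != y[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    mean_err = sum(fold_errors) / k
    var = sum((e - mean_err) ** 2 for e in fold_errors) / (k - 1)
    return mean_err, math.sqrt(var / k)


# Twenty hypothetical samples in two well-separated classes.
X = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 10.0] for i in range(10)]
y = ["normal"] * 10 + ["tumor"] * 10
mean_err, std_err = cross_validate(X, y, k=10)
```

Because the model is refit from scratch inside every fold, no observation is ever used both to build the rule and to score it.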

