Microarrays
Lawrence Hunter, Ph.D., Director, Computational Bioscience Program, University of Colorado School of Medicine
Tzu Lip Phang, Ph.D., Associate Professor of Bioinformatics, Division of Pulmonary Sciences and Critical Care Medicine, University of Colorado School of Medicine
The Central Dogma: Genome, Transcriptome
Microarrays in the Literature
Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., et al. (2012). NCBI GEO: archive for functional genomics data sets--update. Nucleic acids research, 41(D1)
Public Data Usages: – Preliminary data/results, hypothesis generation – Test algorithms – Power analysis (sample size calculation) – Enhance sample size
Array technology. Basic idea: Genomic material (DNA/RNA) hybridizes best to exactly complementary sequences. Method: – Probes are attached to a substrate in known locations – DNA/RNA in one or more samples is fluorescently labelled – Samples are hybridized to the probe array, excess is washed off, and fluorescence readings are taken for each position
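The basic idea of exact complementarity can be illustrated with a minimal sketch. The function names `reverse_complement` and `hybridizes_exactly` are hypothetical, and the model is deliberately simplified: it only checks for a perfect complementary match and ignores partial hybridization, temperature, and RNA bases (U).

```python
# Simplified model: a probe "hybridizes" only if its exact reverse
# complement occurs somewhere in the target sequence (DNA alphabet only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridizes_exactly(probe: str, target: str) -> bool:
    """True if the target contains the site exactly complementary to the probe."""
    return reverse_complement(probe) in target.upper()


if __name__ == "__main__":
    probe = "GCAT"
    print(reverse_complement(probe))
    print(hybridizes_exactly(probe, "TTATGCTT"))
```

A real array relies on hybridization kinetics rather than string matching, so this only conveys why probe sequence determines which sample molecules bind at a given position.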
Microarray: Primer
Array synthesis: Photolithography for oligonucleotides. Cost is proportional to the length of the oligo, not the number of features (genes) per chip! Many layers compared to computer chips.
Affymetrix Probe Sets: 11 to 16 probe pairs per set; each pair consists of a 25mer perfect match (PM) probe and a 25mer mismatch (MM) probe
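As an illustration of a PM/MM probe pair, here is a minimal sketch. It assumes the conventional design in which the MM probe is identical to the PM probe except that the middle (13th) base is replaced by its complement; the slide itself does not state this, and the function name `make_mismatch` is hypothetical.

```python
# Assumption: the MM probe differs from the PM probe only at the middle
# (13th) base of the 25mer, which is swapped for its complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def make_mismatch(pm_probe: str) -> str:
    """Derive a mismatch (MM) probe from a 25mer perfect match (PM) probe."""
    pm_probe = pm_probe.upper()
    if len(pm_probe) != 25:
        raise ValueError("expected a 25mer probe")
    middle = len(pm_probe) // 2  # index 12, i.e. the 13th base
    return pm_probe[:middle] + COMPLEMENT[pm_probe[middle]] + pm_probe[middle + 1:]


if __name__ == "__main__":
    pm = "A" * 12 + "C" + "T" * 12
    print(pm)
    print(make_mismatch(pm))
```

The MM signal is meant to estimate non-specific binding, so the difference between PM and MM intensities is one way to approximate specific signal.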
Gene Expression: – Still the most common use for microarrays – Aim to determine differential expression between groups of samples, e.g. disease and control – Generate hypotheses about the mechanisms underlying the disease of interest
Basic Statistical Analysis
Experimental Design: Biological replication is essential – Technical replication is not essential except for quality control studies. Pooling biological samples to reduce array variability – Increases sample size without running more chips – BUT, if individual variation is important, pooling washes out the effect. Power analysis is essential
Power Analysis: How many biological replicates? My experience: at least 3, preferably 5, even 7. Bioconductor: SSPA
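A rough sense of how sample size depends on effect size can be had from the textbook normal-approximation formula for comparing two group means, n = 2 (z_{1-alpha/2} + z_{power})^2 / d^2 per group, where d is the difference in means divided by the standard deviation. This is a sketch for a single gene and a single test; it is not the method used by SSPA, and the function name `samples_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate replicates per group for a two-sided, two-group comparison.

    effect_size is the standardized difference (mean difference / SD).
    Uses the normal approximation, so it is optimistic for very small n.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile corresponding to the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (1.0, 1.5, 2.0):
        print(d, samples_per_group(d))
```

Because thousands of genes are tested at once, the per-gene alpha in a real microarray study is usually much smaller than 0.05, which pushes the required number of replicates up.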
Preprocessing: Includes image analysis, normalization, and data transformation. Data normalization: – Remove systematic errors introduced in labeling, hybridization and scanning procedures – Correct these errors while preserving biological variability / information
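The slide does not name a specific normalization method. As one commonly used example, quantile normalization forces every array to share the same intensity distribution, which removes array-wide systematic shifts. The sketch below is a minimal version under simplifying assumptions (equal-length arrays, no special handling of tied values); the function name `quantile_normalize` is hypothetical.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position, which is a simplification.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    rank_means = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    chips = [[5, 2, 3], [4, 1, 6]]
    print(quantile_normalize(chips))
```

After normalization every array contains the same set of values, only assigned to different probes, so differences between arrays reflect rank changes rather than overall brightness.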
Why normalization?
A different look … (plot: technical replicate difference vs. average intensity values)
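The two quantities named on this slide, the difference between technical replicates and their average intensity, are what is commonly plotted in an MA plot. A minimal sketch of the calculation follows, assuming strictly positive raw intensities and a log2 scale (the slide does not state the scale); the function name `ma_values` is hypothetical.

```python
from math import log2


def ma_values(replicate_a, replicate_b):
    """Return (M, A) lists for two replicate arrays of positive intensities.

    M is the log2 difference between replicates, A is the average log2 intensity.
    """
    m_values, a_values = [], []
    for x, y in zip(replicate_a, replicate_b):
        log_x, log_y = log2(x), log2(y)
        m_values.append(log_x - log_y)
        a_values.append((log_x + log_y) / 2)
    return m_values, a_values


if __name__ == "__main__":
    m, a = ma_values([8.0, 100.0, 1500.0], [2.0, 110.0, 1400.0])
    print(m)
    print(a)
```

For well-behaved technical replicates M should scatter around zero at every value of A; a trend in M across A points to an intensity-dependent bias that normalization should remove.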
To normalize or not to …
AffyComp: Rafael Irizarry, Department of Biostatistics, Johns Hopkins University
Statistical Testing. Hypothesis testing: Are the means of two groups different from each other? – Fold change – Student's t-test
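Both measures can be computed for a single gene as sketched below. The fold change is taken here as the ratio of group means, and the t statistic uses the pooled-variance (Student) form with n1 + n2 - 2 degrees of freedom. The function names are hypothetical, and the p-value lookup is left out because it needs a t-distribution, which the Python standard library does not provide (a statistics package such as SciPy would be needed for that step).

```python
from math import sqrt
from statistics import mean, variance


def fold_change(control, disease):
    """Ratio of the disease group mean to the control group mean."""
    return mean(disease) / mean(control)


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two replicates")
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0]
    disease = [4.0, 5.0, 6.0]
    print(fold_change(control, disease))
    print(student_t(control, disease))
```

Fold change ignores variability within groups, while the t statistic scales the difference by it, which is why the two can rank genes differently.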
Microarray Scatter Plot
Student's t-Test
What is Multiple Comparison Testing??! Alpha level = 0.05
Gene | P-value | Critical level | Ho
Gene 1 | | <=0.05 | 1
Gene 2 | | <=0.05 | 1
Gene 3 | | <=0.05 | 1
Gene 4 | | <=0.05 | 1
Gene 5 | | <=0.05 | 1
Gene 6 | 0.09 | <=0.05 | 0
Gene 7 | 0.05 | <=0.05 | 0
Gene 8 | 0.09 | <=0.05 | 0
Gene 9 | 0.2 | <=0.05 | 0
Gene 10 | 0.3 | <=0.05 | 0
When the number of tests is large …
Gene | P-value | Critical level | Ho
Gene 1 | | <=0.05 | 1
Gene 2 | | <=0.05 | 1
Gene 3 | | <=0.05 | 1
Gene 4 | | <=0.05 | 1
Gene 5 | | <=0.05 | 1
Gene 6 | 0.09 | <=0.05 | 0
… | … | … | …
Gene | | <=0.05 | 0
Gene | | <=0.05 | 0
Alpha level = 0.05 … wrong genes …
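The scale of the problem can be sketched with simple arithmetic. If every null hypothesis were true and the tests were independent, the expected number of false positives is the number of tests times alpha, and the chance of at least one false positive is 1 - (1 - alpha)^m. The figure of 1000 tests used in the example is an assumption borrowed from the "0.05 / 1000" on the Bonferroni slide, and the function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives if all null hypotheses are true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Family-wise error rate, assuming independent tests and all nulls true."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    for m in (1, 10, 1000):
        print(m, expected_false_positives(m), prob_at_least_one_false_positive(m))
```

With 10 tests the chance of at least one false call is already about 40 percent, and with 1000 tests roughly 50 genes would be expected to pass the 0.05 threshold by chance alone.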
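Correction … Bonferroni. Alpha level = 0.05 / 1000 = 0.00005
Gene | P-value | Critical level | Ho
Gene 1 | | <=0.00005 |
Gene 2 | | <=0.00005 |
Gene 3 | | <=0.00005 |
Gene 4 | | <=0.00005 |
Gene 5 | | <=0.00005 |
Gene 6 | 0.09 | <=0.00005 |
… | … | … | …
Gene | | <=0.00005 |
Gene | | <=0.00005 |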
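The Bonferroni correction amounts to dividing alpha by the number of tests and comparing each p-value against that stricter threshold. A minimal sketch follows; the function names and the example p-values are made up for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test critical level under the Bonferroni correction."""
    return alpha / n_tests


def bonferroni_reject(p_values, alpha: float = 0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 1000))
    # Illustrative p-values only, not taken from the slides.
    print(bonferroni_reject([0.00001, 0.04, 0.09]))
```

The threshold shrinks in direct proportion to the number of genes tested, which is why genes with modest but real effects are easily missed under this correction.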
Strike the balance … Bonferroni (most conservative) | False Discovery Rate | No correction (most lenient). The False Discovery Rate (FDR) of a set of predictions is the expected fraction of false predictions in the set. Example: If the algorithm returns 100 genes with a false discovery rate of 0.3, then we should expect 70 of them to be correct.
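The slide does not say which procedure is used to control the FDR. One widely used choice is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k whose p-value is at most (k / m) times the target FDR, and reject the k smallest. The sketch below implements that procedure as an illustration; the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr: float = 0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns one boolean per input p-value (in the original order):
    True where the null hypothesis is rejected at the given FDR level.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


if __name__ == "__main__":
    # Illustrative p-values only, not taken from the slides.
    print(benjamini_hochberg([0.013, 0.02, 0.9, 0.9], fdr=0.05))
```

Note the step-up behaviour: a p-value that fails its own rank threshold is still rejected if a larger p-value further down the sorted list passes, which is what makes the procedure less conservative than Bonferroni.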
Put them together
Result Validation: RT-PCR is the most common method. Genes at the borderline of differential expression – Their measurability is reduced by random error. For highly differentially expressed genes, having sufficient replicates would serve as validation.
Biological Interpretation