Microarray Data Analysis
The Bioinformatics Side of the Bench
The anatomy of your data files from MAS 5.0 (Microarray Suite 5.0)
.DAT
.CEL
.EXP
.CHP
.txt files generated from .CHP
Quality Control (QC) of the chip – visual inspection
Look at the .DAT file or the .CHP file image
–Scratches? Spots?
–Corners and outside border checkerboard appearance (B2 oligo)
  Positive hybridization control
  Used by software to place grid over image
–Array name is written out in oligos!
Scratch on a chip
Possible chip contamination
Internal controls
B. subtilis genes (added poly-A tails)
–Assessment of quality of sample preparation
–Also as hybridization controls
–Not used in our module
More internal controls
Eukaryotic hybridization controls (bioB, bioC, bioD, cre)
–E. coli and P1 bacteriophage biotin-labeled cRNAs
–Spiked into the hybridization cocktail
–Assess hybridization efficiency
And still more internal controls
Actin and GAPDH assess RNA sample/assay quality
–Compare signal values from the 3’ end to signal values from the 5’ end
–The 3’/5’ ratio generally should not exceed 3
Percent genes present (%P)
–Replicate samples should have similar %P values
MAS 5.0 output files
For each transcript (gene) on the chip:
–signal intensity
–a “present” or “absent” call (presence call)
–p-value (significance value) for making that call
Each gene is associated with a GenBank accession number (NCBI database)
How are transcripts determined to be present or absent?
Probe pair intensities (perfect match, PM, vs. mismatch, MM)
–generate a detection p-value
–assign “Present”, “Absent”, or “Marginal” call for transcript
Every probe pair in a probe SET has a potential “vote” for presence call
Discrimination score
Probe pairs “vote” via discrimination score (R)
R is compared to a predetermined threshold: Tau
–R > Tau = present
–R < Tau = absent
Voting result expressed as p-value
–Reflects confidence of expression call
Altering Tau
You can fine-tune Tau yourself within MAS 5.0
Increasing Tau reduces “false positives”, but may also reduce the number of TRUE present calls
Our rule: use the default!
Calculation of R
R = (PM - MM) / (PM + MM)
–(PM - MM): intensity difference of probe pair
–(PM + MM): overall hybridization intensity
–R value closer to 1 (PM >> MM): lower p-value (detection call is more significant)
–R value close to 0 or negative (MM ≥ PM): higher p-value (detection call is less significant)
–One-sided Wilcoxon’s Signed Rank test used to determine Detection p-value
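The voting scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the MAS 5.0 implementation: the function names are hypothetical, the values tau = 0.015, alpha1 = 0.04 and alpha2 = 0.06 are commonly cited defaults treated here as assumptions, and details such as saturated probe pairs are ignored. The p-value is the exact one-sided signed-rank p-value for the hypothesis that the median of R - Tau is greater than 0.

```python
def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair in a probe set."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_p_value(values):
    """Exact one-sided Wilcoxon signed-rank p-value (H1: median > 0)."""
    diffs = sorted((v for v in values if v != 0), key=abs)  # zeros carry no sign
    if not diffs:
        return 1.0
    # Doubled mid-ranks stay integers even when absolute values tie.
    ranks2 = [0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        for k in range(i, j + 1):
            ranks2[k] = i + j + 2  # twice the average of ranks i+1 .. j+1
        i = j + 1
    observed = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # counts[s] = number of sign assignments whose positive rank sum equals s
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** len(diffs)


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (p-value, call); tau and the alpha cutoffs are assumed defaults."""
    scores = discrimination_scores(pm, mm)
    p = signed_rank_p_value([r - tau for r in scores])
    if p < alpha1:
        return p, "Present"
    if p < alpha2:
        return p, "Marginal"
    return p, "Absent"


# Made-up intensities for one probe set with 11 probe pairs.
pm = [1200, 950, 1800, 1500, 700, 1100, 2100, 1300, 900, 1600, 1250]
mm = [200, 300, 400, 350, 150, 250, 500, 300, 200, 400, 300]
print(detection_call(pm, mm))
```

Raising `tau` in this sketch shifts every vote downward, which is why a larger Tau yields fewer present calls.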
Calculating signal
One-Step Tukey Biweight Estimate
–Yields robust weighted mean
–Relatively insensitive to even extreme outliers
Signal intensity value is created
–related to amount of transcript present for that gene
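To see why the estimate is robust, here is a minimal sketch of a one-step Tukey biweight. The tuning constant c = 5 and the small epsilon are assumptions, and the function name is hypothetical. MAS 5.0 applies the estimate to log-transformed, background-adjusted probe values, a step this sketch leaves out.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    mad = median([abs(v - m) for v in values])  # median absolute deviation
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - m) / (c * mad + epsilon)  # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # far points get weight 0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


# Made-up log2 probe values; the last one is an extreme outlier.
probe_values = [8.0, 8.1, 7.9, 8.2, 7.8, 15.0]
print(tukey_biweight(probe_values))            # stays close to 8
print(sum(probe_values) / len(probe_values))   # plain mean is pulled upward
```

The outlier receives a weight of zero, so it does not move the estimate at all, whereas the ordinary mean is dragged well above the bulk of the values.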
Thank goodness for software!!!
MAS 5.0 does these calculations for you
–.CHP file
Basic analysis is possible in MAS 5.0, but it won’t handle replicates
Import MAS 5.0 (.CHP) data into GeneSifter
–web-based microarray data analysis software package designed BY biologists FOR biologists
How do we want to analyze these data?
Pairwise analysis is most appropriate
–Control vs. DMSO
List of genes that are “upregulated” or “downregulated”
Determine fold up or down cutoffs
–What is significant? 1.5 fold up/down? 2 fold up/down? 10 fold up/down?
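A fold cutoff is simple arithmetic on the two signal values. The sketch below assumes one mean signal per gene per condition; the gene names, signals, and function names are invented for illustration. Downregulation is reported as a negative fold, so a drop from 200 to 50 reads as -4.

```python
def signed_fold_change(control, treated):
    """Fold change as treated/control; decreases are reported as negative folds."""
    if control <= 0 or treated <= 0:
        raise ValueError("signals must be positive")
    ratio = treated / control
    return ratio if ratio >= 1 else -1 / ratio


def filter_by_fold(signals, cutoff=2.0):
    """Keep genes whose absolute fold change meets the cutoff."""
    changed = {}
    for gene, (control, treated) in signals.items():
        fold = signed_fold_change(control, treated)
        if abs(fold) >= cutoff:
            changed[gene] = fold
    return changed


# Hypothetical mean signals: gene -> (control, treated)
signals = {
    "gene_a": (100.0, 250.0),  # up 2.5 fold
    "gene_b": (200.0, 50.0),   # down 4 fold
    "gene_c": (300.0, 330.0),  # essentially unchanged
}
print(filter_by_fold(signals, cutoff=2.0))
```

Changing `cutoff` from 2.0 to 1.5 or 10 shows directly how the length of the gene list depends on the threshold chosen.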
Normalization
“Normalizing” data allows comparisons ACROSS different chips
–Intensity of fluorescent markers might be different from one batch to the other
–Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION
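One simple form of normalization is global scaling: every signal on a chip is multiplied by a single factor so that the chip's trimmed mean lands on a common target. The sketch below is only an illustration of that idea. The target of 500, the 2% trim, and the function names are assumptions, and the normalization offered by any particular software package may work differently.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def scale_chip(signals, target=500.0, trim_fraction=0.02):
    """Scale one chip so its trimmed mean equals target; returns (signals, factor)."""
    factor = target / trimmed_mean(signals, trim_fraction)
    return [s * factor for s in signals], factor


# Made-up chips: chip_b is the same sample read twice as brightly.
chip_a = [100.0, 200.0, 300.0, 400.0]
chip_b = [200.0, 400.0, 600.0, 800.0]
scaled_a, factor_a = scale_chip(chip_a)
scaled_b, factor_b = scale_chip(chip_b)
print(factor_a, factor_b)
print(scaled_a == scaled_b)
```

Because every gene on a chip is multiplied by the same factor, ratios between genes on that chip are unchanged; only the overall brightness is brought into line.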
Statistics
Statistical tests allow us to determine how SIGNIFICANT the data are
t-test
–compares the means of two groups while taking into account the standard deviations of those groups
p value (probability value) of ≤ 0.05
–(if there were no REAL change in gene expression, a difference this large would arise by chance only 5 times out of 100 or less)
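For a single gene with replicate chips in each group, the t statistic and its p-value can be computed as sketched below. This uses Welch's unequal-variance form of the t-test, which is an assumption: the slides do not say which variant the analysis software applies. The p-value comes from numerically integrating the t density with Simpson's rule so that only the standard library is needed. Function names and signal values are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom (groups need some variance)."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_b) - mean(group_a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value from the t distribution via Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Made-up replicate signals for one gene.
control = [100.0, 110.0, 90.0]
treated = [200.0, 210.0, 190.0]
t, df = welch_t(control, treated)
print(t, df, two_sided_p(t, df))
```

With only three replicates per group the degrees of freedom are small, so a large fold change can still fail the 0.05 cutoff when the replicates disagree with each other.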
Present or absent?
Can do analysis on genes that are considered “absent” under all conditions
Better: the transcript should be called “present” in at least ONE of the two conditions in a pairwise analysis
Thresholds/cutoffs
What is a significant change in gene expression?
–Some think 2 fold at the lowest
–Judgement call
–Can also set upper limit of expression changes
Remember we are talking about changes in mRNA expression
–does that always mean more protein?
The output
Run analysis, get output of a GENE LIST
–List indicates which genes are up- or downregulated
–p values for t-test
–Graphs of signal levels
Absolute numbers not as important here as the trends you see
–Now what????
Follow the links
Click on a gene
Find links to other databases
Follow links to discover what the protein does
Now the fun part begins…
Back to Biology
Do the changes you see in gene expression make sense BIOLOGICALLY?
If they don’t make sense, can you hypothesize as to why those genes might be changing?
Leads to many, many more experiments
Validation
Not enough to just do microarrays
Usually “validate” microarray results via some other technique
–RT-PCR
–TaqMan
–Northern analysis
–Protein level analysis
No technique is perfect…
Why microarrays?
Ask a single question, and get more answers than you dreamed of!
Can assess GLOBAL changes in gene expression under a certain experimental condition
Can discover new pathways and gene regulation; the possibilities are almost endless
Caveat…
There is NO standard way to analyze microarray data
Still figuring out how to get the “best” answers from microarray experiments
Best to combine knowledge of biology, statistics, and computers to get answers
One last note
Microarrays are “cutting edge” technology
You now have experience doing a technique that most Ph.D.s have never done
Looks great on a resume…