CS491JH: Data Mining in Bioinformatics
Introduction to Microarray Technology
  Technology Background
  Data Processing Procedure
  Characteristics of Data
  Data Integration and Data Mining
Substrates for High Throughput Arrays
  Nylon Membrane: single label (33P)
  Glass Slides: dual label (Cy3, Cy5)
  GeneChip: single label (biotin-streptavidin)
GeneChip® Probe Arrays
[Figure: GeneChip probe array (1.28 cm) and image of the hybridized probe array. A 24 µm probe cell holds millions of copies of a specific oligonucleotide probe; the array carries >200,000 different complementary probes. Single-stranded, labeled RNA target binds the oligonucleotide probe in a hybridized probe cell.]
GeneChip® Expression Array Design
[Figure: gene sequence (5´ to 3´) covered by multiple oligo probes; probes designed to be Perfect Match and probes designed to be Mismatch.]
Procedures for Target Preparation
Cells -> Poly(A)+ / Total RNA -> cDNA -> IVT (Biotin-UTP, Biotin-CTP) -> Labeled transcript -> Fragment (heat, Mg2+) -> Labeled fragments -> Hybridize (16 hours) -> Wash & Stain -> Scan
Microarray Technology
Printing Arrays on 50 slides (NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab)
Ratio of expression of genes from two sources
Cells from condition A / Cells from condition B -> RNA (total or mRNA) -> cDNA -> Label (Dye 1 / Dye 2) -> Mix
Readout per gene: equal, over, under
(NSF / U of Illinois Microarray Workshop, Steve Clough / Vodkin Lab)
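The equal / over / under readout on this slide can be sketched in a few lines. This is only an illustration, assuming background-corrected, positive intensities for the two dyes; the function name `classify_spot`, the log2 scale, and the two-fold cutoff are assumptions chosen for the example, not values taken from the slides.

```python
import math


def classify_spot(dye1, dye2, fold_cutoff=2.0):
    """Return (log2 ratio, label) for one spot.

    dye1, dye2: positive intensities of the two labeled samples (hypothetical inputs).
    fold_cutoff: assumed fold change beyond which a gene is called over or under.
    """
    if dye1 <= 0 or dye2 <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(dye2 / dye1)
    limit = math.log2(fold_cutoff)
    if log_ratio >= limit:
        label = "over"    # sample 2 is higher than sample 1
    elif log_ratio <= -limit:
        label = "under"   # sample 2 is lower than sample 1
    else:
        label = "equal"
    return log_ratio, label


print(classify_spot(100, 400))  # (2.0, 'over')
print(classify_spot(400, 100))  # (-2.0, 'under')
```

Working on a log scale makes a two-fold increase and a two-fold decrease symmetric around zero, which is why the ratio is usually log-transformed before any cutoff is applied.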
GSI Lumonics
Cattle and Soy Controls
Spots: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG)
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng), and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
Fetal Spleen (Cy3) vs. Adult Spleen (Cy5)
[Figure: array images with labeled spots: IgM, IgM heavy chain, MYLK, COL1A2.]
Placenta vs. Brain: 3800 Cattle Placenta Array (Cy3 / Cy5)
GenePix Image Analysis Software
Microarray Data Process
1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model Building
6. Data Mining (clustering, pattern recognition, etc.)
7. Validation
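Steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming the raw data are per-gene log ratios; global median centering is one common normalization choice and is not necessarily the method used in this course. The function names and the cutoff value are hypothetical.

```python
from statistics import median


def median_normalize(log_ratios):
    """Shift log ratios so that their median becomes zero (global median centering)."""
    center = median(log_ratios)
    return [value - center for value in log_ratios]


def filter_informative(gene_ids, normalized, cutoff):
    """Keep genes whose normalized log ratio lies above +cutoff or below -cutoff."""
    return [(gene, value) for gene, value in zip(gene_ids, normalized)
            if abs(value) > cutoff]


# Toy, made-up log ratios purely for illustration.
genes = ["g1", "g2", "g3", "g4", "g5"]
raw = [0.5, 0.5, 0.5, 2.5, -1.5]

clean = median_normalize(raw)                     # step 3: "clean" data
informative = filter_informative(genes, clean, cutoff=1.0)  # step 4: informative data
print(informative)  # [('g4', 2.0), ('g5', -2.0)]
```

Centering on the median rather than the mean keeps a handful of strongly changed genes from shifting the baseline for all the others.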
Scatterplot of Normalized Data (axes: Adult, Fetal)
Thresholds shown: > 0.3 and < -0.3
Characteristics of Data
Data can be viewed as an N x M matrix (N >> M):
  N is the number of genes
  M is the number of data points for each gene
Or as an N x (M+K) matrix:
  K is the number of features describing each gene (genome location, functional description, metabolic pathway, etc.)
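The N x (M+K) view can be illustrated with a small sketch that appends K annotation features to each gene's M measurements. The function name `build_gene_table`, the gene ids, the feature names, and all values are made up for illustration.

```python
def build_gene_table(expression, annotations, feature_names):
    """Join an N x M expression matrix with K annotation features per gene.

    expression: dict mapping gene id -> list of M measurements.
    annotations: dict mapping gene id -> dict of feature name -> value.
    feature_names: the K feature names to append, in order.
    Returns a dict mapping gene id -> row of length M + K.
    """
    table = {}
    for gene, values in expression.items():
        features = annotations.get(gene, {})
        # Missing annotations become None rather than being guessed.
        table[gene] = list(values) + [features.get(name) for name in feature_names]
    return table


# Toy, made-up values purely for illustration (N = 3 genes, M = 2, K = 2).
expression = {
    "gene_a": [0.8, -0.2],
    "gene_b": [-1.1, 0.4],
    "gene_c": [0.0, 0.1],
}
annotations = {
    "gene_a": {"location": "chr1", "pathway": "cell cycle"},
    "gene_b": {"location": "chr7"},
}
table = build_gene_table(expression, annotations, ["location", "pathway"])
print(table["gene_b"])  # [-1.1, 0.4, 'chr7', None]
```

In real data N is in the thousands while M is often a few dozen at most, so rows (genes) vastly outnumber columns.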
Model for Data Analysis
  Gene expression is a dynamic process.
  Each microarray experiment is a snapshot of that process.
  Basic biological knowledge is needed to build a model.
  Example assumption: in most experiments, only a small set of genes (hundreds to thousands) is significantly affected.
Data Mining
Need for Data Mining
  Data volumes are too large for traditional analysis methods
  Large numbers of records and high-dimensional data
  Only a small portion of the data is analyzed
  The decision support process becomes more complex
Functions of Data Mining
  Use the data to build predictors: prediction, classification, deviation detection, segmentation
  Generate more sophisticated summaries and reports to aid understanding of the data: find clusters and partitions in the data
Data Mining Methods
  Classification, Regression (Predictive Modeling)
  Clustering (Segmentation)
  Association Discovery (Summarization)
  Change and Deviation Detection
  Dependency Modeling
  Information Visualization
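Of the methods listed, clustering is perhaps the easiest to show compactly. The sketch below uses k-means, chosen here only as a familiar example (the slides do not name a specific algorithm). Seeding the centroids with the first k points is a simplification that keeps the run deterministic; the data are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (assignment, centroids).

    Centroids are seeded with the first k points (a simplifying assumption).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy, made-up expression profiles (two measurements per gene).
profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, centroids = kmeans(profiles, k=2)
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [[0.0, 0.5], [10.0, 10.5]]
```

In practice the result depends on the starting centroids, so several random restarts are normally compared.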
Clustered display of data from a time course of serum stimulation of primary human fibroblasts.
Cluster labels: Cholesterol Biosynthesis; Cell Cycle; Immediate Early Response; Signaling and Angiogenesis; Wound Healing and Tissue Remodeling
Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998), p. 14865
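A clustered display like this one comes from hierarchical clustering of expression profiles with a correlation-based similarity. The sketch below is a textbook average-linkage variant using the Pearson correlation; it is not necessarily the exact procedure of Eisen et al. The function names and the profiles are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, similarity).

    profiles: dict mapping gene name -> list of measurements.
    The similarity of two clusters is the mean pairwise Pearson correlation.
    """
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                sims = [pearson(profiles[m], profiles[n])
                        for m in clusters[a] for n in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[2]:
                    best = (a, b, score)
        a, b, score = best
        merges.append((a, b, score))
        # Replace the two merged clusters with their union.
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy, made-up time courses: "a" and "b" rise together, "c" falls.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 3, 2, 1]}
merges = average_linkage(toy)
print(merges[0][:2])  # ('a', 'b')
```

The order of merges defines the tree drawn beside the heat map; genes joined early have the most similar profiles.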
Self Organizing Maps
Molecular Classification of Cancer
Gene Expression Profile of Aging and Its Retardation by Caloric Restriction Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla
Expression Landscape of cell-cycle regulated genes in yeast
Multi-dimension data visualization