Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park
The Complete Microarray Bioinformatics Solution: Data Management, Databases, Statistical Analysis, Image Processing, Automation, Data Mining, Cluster Analysis
Microarrays: Microarrays work by exploiting the ability of a given mRNA molecule (target) to bind specifically to, or hybridize to, the DNA template (probe) from which it originated. Regulation of gene expression acts as both an "on/off" switch to control which genes are expressed in a cell and a "volume control" that increases or decreases the level of expression of particular genes as necessary.
The miracle of microarray experiments: “I think you should be more explicit here in step two.” Problems: experimental design, array fabrication, data analysis, normalization, QC, image processing, interpretation, data mining, communication.
Microarray Experiment Stages. Microarray: probe and layout design, spotting, hybridization. Image Analysis: array segmentation, background intensity extraction, target detection, target intensity extraction, ratio analysis. Data Mining: DB, normalization, clustering, PCA, hierarchical graph, SOM, time series, etc.
Expression analysis experiments. Laboratory work: RNA isolation; cDNA production and labeling; hybridization, washing, and drying. Computer work: scanning of the microarray; image analysis; normalization; statistical analysis.
[Figure: cells from condition A and cells from condition B; mRNA; cDNA labeled with Dye 1 and Dye 2; mix; outcome per gene: over, equal, or under.] Total or ratio of expression of genes from two sources.
GSI Lumonics Emission Spectra : Cy3 and Cy5
cDNA Microarrays: Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples. cDNAs are arrayed on each slide in a grid of spots. Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence. A sequence and its complement are present in the same spot. Different spots typically represent different genes, but some genes may be represented by multiple spots. BBSI
cDNA Microarray Probes: Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays. ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing the end of a cDNA that has been reverse transcribed from mRNA. [Figure: mRNA (AAAAAAAAA...A), cDNA (TTTTTTTTTT...T), EST]
Spotting cDNA Probes on Microarrays: Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer. The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate. The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide. The pin is washed and the process is repeated for a different probe. Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide. Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing.
Clone Tracking: source plates to array plates (96-channel pipette mapping); array plates to slide (8-pin arrayer mapping).
June 11, 2007 The PixSys 5500 Arraying Robot (Cartesian Technologies): vacuum wash station; print head that holds up to 32 pins in an 8x4 format; vacuum hold-down platform (50 slide capacity); robotic arm. Source: Dan Nettleton Course Notes Statistics 416/516X
Printing Arrays on 50 slides
Spotting the Probes on the Microarray (8 x 4 print head). [Figure: microarray slide and plate with wells holding probes in solution.] All spots of the same color are made at the same time. All spots in the same sector are made by the same pin.
[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each showing a spot for gene 201 and a spot for gene 576 (sequences TTCCAG, GATATG).] Each spot contains many copies of a sequence along with its complement (not shown).
Using cDNA Microarrays to Measure mRNA Levels: RNA is extracted from a target sample of interest. mRNA is reverse transcribed into cDNA. The resulting cDNA is labeled with a fluorescent dye and placed on a microarray slide. Dyed cDNA sequences hybridize to complementary probes spotted on the array. A laser excites the dye and a scanner records an image of the slide. The image is quantified to obtain measures of fluorescence intensity for each pixel. Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.
Difficult to Make Meaningful Comparisons between Genes The measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.). Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.
Using cDNA Microarrays to Measure mRNA Levels ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G ?????????? Sample 1 Sample 2 Microarray Slide Spots (Probes) Unknown mRNA Sequences (Target)
Convert to cDNA and Label with Fluorescent Dyes ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G Sample 1 Sample 2 ?????????? Sample 1 Sample 2
Mix Labeled cDNA and Hybridize to the Slide ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G Sample 1 Sample 2 ??????????
Excite Dye with Laser, Scan, and Quantify Signals ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G 67239 Sample 1 Sample 2
Oligonucleotides : Simplified Example. [Figure: gene 1 and gene 2; shared green regions indicate a high degree of sequence similarity throughout much of the transcript.] Oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA. Oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG...
Oligo Microarray Fabrication: Oligos can be synthesized and stored in solution for spotting, as is done with cDNA microarrays. Alternatively, oligo sequences can be synthesized directly on a slide or chip using various commercial technologies. In one approach, sequences are synthesized on a slide using ink-jet technology similar to that used in color printers: separate cartridges for the four bases (A, C, G, T) are used to build the oligonucleotides base by base on the slide. Affymetrix uses a photolithographic approach.
Image Analysis for Microarray 1. Array segmentation 2. Background intensity extraction 3. Target detection 4. Target intensity extraction 5. Ratio analysis Next, Data Mining !
Image Processing
Processing of images Addressing or gridding –Assigning coordinates to each of the spots Segmentation –Classification of pixels either as foreground or as background Intensity extraction (for each spot) –Foreground fluorescence intensity pairs (R, G) –Background intensities –Quality measures
Processing of images-Addressing: The basic structure of the images is known (determined by the arrayer). Parameters to address the spot positions: –Separation between rows and columns of grids –Individual translation of grids –Separation between rows and columns of spots within each grid –Small individual translation of spots –Overall position of the array in the image
Processing of images-Segmentation Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as measure of transcript abundance Production of a spot mask : set of foreground pixels for each spot
Processing of images-Segmentation
Segmentation methods and software that implements them:
–Fixed circle segmentation: ScanAlyze, GenePix, QuantArray
–Adaptive circle segmentation: GenePix, Dapple
–Adaptive shape segmentation: Spot, region growing and watershed
–Histogram segmentation: ImaGene, QuantArray, DeArray and adaptive thresholding
Processing of images-Fixed circle segmentation: Fits a circle with a constant diameter to all spots in the image. Easy to implement. The spots need to be of the same shape and size. [Figure: bad example]
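As a rough illustration of fixed circle segmentation, the sketch below builds a spot mask from a circle of constant radius and sums the pixels inside it. The function names, the list-of-rows image layout, and the assumption that the circle lies fully inside the image are illustrative choices, not taken from any of the packages named in the slides.

```python
def fixed_circle_mask(cx, cy, radius):
    """Return the set of (x, y) pixels inside a circle of constant radius.

    Assumes integer spot centres and a circle that lies fully inside the image.
    """
    r = int(radius)
    return {
        (x, y)
        for x in range(cx - r, cx + r + 1)
        for y in range(cy - r, cy + r + 1)
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    }


def spot_total(image, mask):
    """Sum the pixel intensities inside the mask (image is a list of rows)."""
    return sum(image[y][x] for x, y in mask)


# Hypothetical 5 x 5 image with a uniform intensity of 2.
image = [[2] * 5 for _ in range(5)]
mask = fixed_circle_mask(2, 2, 1)
total = spot_total(image, mask)
```

Because every spot gets the same radius, the mask will include background pixels for small spots and cut off foreground pixels for large ones, which is the weakness the slide points out.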
Processing of images-Adaptive circle segmentation: The circle diameter is estimated separately for each spot. Dapple finds spots by detecting the edges of spots (second derivative). Problematic if spots are oval rather than circular.
Processing of images-Adaptive shape segmentation Specification of starting points or seeds Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.
Processing of images-Histogram segmentation: Uses a target mask chosen to be larger than any spot. Foreground and background intensities are determined from the histogram of pixel values for pixels within the masked area. Example : QuantArray –Background : mean between 5th and 20th percentile –Foreground : mean between 80th and 95th percentile. Unstable when a large target mask is set to compensate for variation in spot size. [Figure: histogram with background and foreground regions]
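The percentile rule quoted for QuantArray can be sketched as follows. This is only an approximation under stated assumptions: the function names are hypothetical, percentiles are taken by simple index truncation on the sorted pixel list, and the real software may define the bands differently.

```python
def percentile_band_mean(pixels, lo_pct, hi_pct):
    """Mean of the sorted pixel values that fall between two percentiles."""
    ordered = sorted(pixels)
    n = len(ordered)
    lo = int(n * lo_pct / 100)
    hi = max(lo + 1, int(n * hi_pct / 100))  # keep at least one pixel in the band
    band = ordered[lo:hi]
    return sum(band) / len(band)


def histogram_segment(mask_pixels):
    """Return (foreground, background) estimates for the pixels in a target mask."""
    background = percentile_band_mean(mask_pixels, 5, 20)
    foreground = percentile_band_mean(mask_pixels, 80, 95)
    return foreground, background


# Hypothetical masked area whose pixel values are 0..99.
foreground, background = histogram_segment(list(range(100)))
```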
Processing of images-Data Quality: irregular size or shape, irregular placement, low intensity, saturation, spot variance, background variance. [Figure panels: indistinguishable, saturated, bad print, artifact, misalignment]
Intensity Extraction : Spot intensity. The total amount of hybridization for a spot is proportional to the total fluorescence at the spot. Spot intensity = sum of pixel intensities within the spot mask. Since later calculations are based on ratios between Cy5 and Cy3, we compute the ratio of the median pixel values over the spot mask.
Intensity Extraction : Background intensity. Motivation : a spot’s measured intensity includes a contribution from non-specific hybridization and other chemicals on the glass. Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA -> it could be interesting to use local negative controls (spotted DNA that should not hybridize). Different background methods : local background, morphological opening, constant background, no adjustment.
Intensity Extraction : Local background. Focuses on small regions surrounding the spot mask; the background is the median of pixel values in this region. Most software packages implement such an approach (ScanAlyze, ImaGene, Spot, GenePix). By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.
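A minimal sketch of the local background idea, assuming a ring-shaped region around the spot centre: pixels closer than `inner_radius` (the spot plus a safety gap) are skipped, and the median is taken over the rest of the ring. The ring shape, the parameter names, and the list-of-rows image layout are assumptions; each package defines its own background region.

```python
from statistics import median


def local_background(image, cx, cy, inner_radius, outer_radius):
    """Median pixel value in a ring around (cx, cy).

    Pixels within inner_radius are skipped so that the estimate does not
    depend on exactly where the segmentation drew the spot boundary.
    """
    values = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if inner_radius ** 2 < d2 <= outer_radius ** 2:
                values.append(value)
    return median(values)


# Hypothetical 11 x 11 image: background 10, bright spot of radius 2 in the centre.
image = [[10] * 11 for _ in range(11)]
for y in range(11):
    for x in range(11):
        if (x - 5) ** 2 + (y - 5) ** 2 <= 4:
            image[y][x] = 100
background = local_background(image, 5, 5, inner_radius=3, outer_radius=5)
```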
Quality measures (-> Flag) How good are foreground and background measurements ? –Variability measures in pixel values within each spot mask –Spot size –Circularity measure –Relative signal to background intensity Based on these measurements, one can flag a spot
Quantification of expression: For each spot on the slide we calculate Red intensity = Rfg - Rbg and Green intensity = Gfg - Gbg (fg = foreground, bg = background), and combine them in the log (base 2) ratio log2(Red intensity / Green intensity).
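The quantification above translates almost directly into code. In this sketch the function name is hypothetical, and returning `None` when a background-corrected intensity is not positive is an assumed convention (the slides do not say how such spots are handled).

```python
import math


def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red / Green) for one spot."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None  # assumed convention: the log ratio is undefined, flag the spot
    return math.log2(red / green)


# Red 900 over background 100, green 300 over background 100: log2(800 / 200).
m = log_ratio(900, 100, 300, 100)
```

A value of 0 means equal signal in both channels, +1 a two-fold excess of red, and -1 a two-fold excess of green.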
Spot intensity data
Gene Expression Data: Gene expression data on p genes for n samples, arranged as a matrix with genes as rows and slides (slide 1, slide 2, slide 3, slide 4, slide 5, …) as columns. Example entry: gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity). These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
GPR Data File Format
Knowledge Discovery in Microarrays: Image Processing, Data Cleaning, Quality Control, Data Imputation, Normalization, Gene Selection, Differential Expression, Feature Selection, Outcomes, Hypotheses, Publications, Treatments, Pattern Assessment, Annotation, Expert Analysis, Experiments, Pattern Recognition, Clustering, PCA/SVD, Associations
Uses of DNA microarrays Expression analysis Genotyping Resequencing In toxicology, cancer research, environmental research, population genetics, association studies, etc.
First question : What do I want to find out? Screen thousands of (differentially expressed) genes to find the relevant ones in my process of interest: - Healthy/diseased comparison - Marker genes of tumor changes - Genes expressed in a particular physiological phase. Follow a pattern of expression level changes in a time series to find co-expressed genes (pathways, regulatory networks). Study the presence of interesting genes (e.g. genus-specific ID of the microbial community).
Second question : Why do you use analysis tools? Normalization: to manipulate the data so that differences due to - different array - time of hybridization, or - sample are removed before the results from different arrays are compared to each other.
Second question : Why do you use analysis tools? To find genes that are significantly differentially expressed - for that you need statistics to evaluate your data and to separate true results from noise or error.
Second question : Why do you use analysis tools? Data mining: to connect the information to other biological data - Co-expressed genes may be involved in the same function / state / condition - Does a group (cluster) of genes with similar expression profiles have a common function?
Normalization: Why normalize data? To balance the fluorescence intensities of the two dyes (green Cy3 and red Cy5). To allow the comparison of expression levels across experiments (slides). To adjust the scale of the relative gene expression levels (as measured by log ratios) across replicate experiments.
Source of error Pin geometry Slide heterogeneity mRNA preparation Fluorescent labeling Hybridization conditions Scanning and image analysis
Data Normalization. [Figure panels: calibrated, red and green equally detected; uncalibrated, red light under-detected]
M vs. A plot
Box plot
Normalization: Channel biases Before Normalization…
Normalization: Channel biases After Normalization…
Additional Normalization: Dye swap –Combine relative expression levels without explicit normalization –Compute lowess fit for log2(RR’/GG’)/2 vs. (A + A’)/2 –Normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction
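To make the dye-swap quantities concrete, the sketch below computes them for a single gene measured on an array and on its dye-swapped replicate. It covers only the per-gene arithmetic: the lowess fit of the bias against average intensity is not implemented, and the function names and example intensities are made up for illustration.

```python
import math


def ma_values(red, green):
    """M = log2(R/G) and A = average log2 intensity for one spot."""
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


def dye_swap_combine(red, green, red_swapped, green_swapped):
    """Combine a spot with its dye-swapped replicate.

    Returns (signal, bias, mean_a):
      signal = (M - M') / 2, the log ratio with the dye bias cancelled out
      bias   = (M + M') / 2 = log2(RR'/GG') / 2, the quantity a lowess fit
               against mean_a = (A + A') / 2 would smooth into c(A)
    """
    m, a = ma_values(red, green)
    m_swapped, a_swapped = ma_values(red_swapped, green_swapped)
    return 0.5 * (m - m_swapped), 0.5 * (m + m_swapped), 0.5 * (a + a_swapped)


# Made-up example: true 4-fold change, red channel detected at half strength.
signal, bias, mean_a = dye_swap_combine(200, 100, 50, 400)
```

In this example the first array reads M = 1 and the swapped array reads M' = -3; the combined signal recovers the true log ratio of 2 and the bias estimate is -1.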
What is clustering? Clustering methods are used to –find genes from the same biological process –group experiments with similar conditions. Different clustering methods can give different results. The physically motivated ones are more robust. Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions.
Football Booing Cheering
Cluster Analysis: Cluster genes based on expression profiles –Gene expression across several treatments. Hypothesis: Genes with similar function have similar expression profiles.
Hierarchical clustering: example, two dimensions. [Figure: tumour and healthy samples clustered; axes: log (ratio), Pearson correlation]
Clustering methods: hierarchical clustering (avg. linkage). Calculate the distance matrix; calculate averages of the most similar items. [Figure: dendrogram of items 1, 2, 3, 4]
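A small sketch of hierarchical clustering with average linkage, assuming 1 minus the Pearson correlation as the distance (the slides mention Pearson correlation but do not fix the distance definition). It recomputes pairwise distances on each pass, which is fine for a handful of profiles but not efficient for real data sets; profiles must not be constant, since their correlation is undefined.

```python
import math


def pearson_distance(u, v):
    """1 - Pearson correlation between two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


def average_linkage(profiles):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pair_dists = [pearson_distance(profiles[a], profiles[b])
                              for a in clusters[i] for b in clusters[j]]
                d = sum(pair_dists) / len(pair_dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up profiles: two rising genes and two falling genes.
profiles = [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [6, 4, 2.5]]
merges = average_linkage(profiles)
```

The merge history is exactly what a dendrogram draws: each tuple is one join, and the distance is the height of that join.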
Hierarchical clustering: Isomorphism. Look at the branching pattern when assessing similarity, not simply the sample (or gene) order!
Analysis. k-means clustering. [Figure axes: Condition 1 (e.g. expression), Condition 2 (e.g. age), Cond. 3 (e.g. stage)]
Step 1. Genes are randomly divided into k groups. k is assigned by the user (k = 4 for this illustration).
Step 2. The centroid of each group is determined.
Step 3. Genes are reassigned to their closest centroid.
Step 4. The centroids’ positions are recalculated.
Step 5. The assignment/recalculation steps are repeated until group membership no longer changes.
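The five steps can be sketched in plain Python as below. Some details are assumptions rather than part of the slides: the random partition is dealt round-robin so that no group starts empty, an empty group is re-seeded from a random data point, and `max_iter` caps the number of passes.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random partition into k groups (shuffle, then deal round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k
    for _ in range(max_iter):
        # Steps 2 and 4: compute the centroid of each group.
        centroids = []
        for g in range(k):
            members = [p for p, lab in zip(points, labels) if lab == g]
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 3: reassign every gene to its closest centroid.
        new_labels = [min(range(k), key=lambda g: sq_dist(p, centroids[g]))
                      for p in points]
        # Step 5: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Made-up data: two well-separated groups of genes in two conditions.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
```

Different seeds can give different partitions on less clearly separated data, so it is common to run the procedure several times and compare the results.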
Analysis. Self-organizing map. Similar to k-means clustering; however, in addition to dividing genes into groups based on expression patterns, SOM arranges the groups in a two-dimensional map. Useful for visualizing distinct expression patterns and determining which of these patterns are variants of one another. [Figure axes: Condition 1 (e.g. expression), Condition 2 (e.g. age), Cond. 3 (e.g. stage)]
Self-Organizing Maps (Silicon Genetics, 2003)
Situate a grid of nodes along a plane where the data points are distributed.
Sample a gene and subject the closest node and neighboring nodes to its ‘gravitational’ influence.
Sample another gene, and so on, until all genes have been sampled several times over.
Each cluster is defined with reference to a node: it comprises those genes for which that node is the closest node.
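The procedure above can be sketched as follows. Everything beyond the slide text is an assumption chosen for brevity: nodes start at randomly chosen data points, the learning rate and neighbourhood radius shrink linearly over time, grid distance is Manhattan distance, and the pull on a neighbour is divided by one plus its grid distance from the closest node.

```python
import random


def closest_node(nodes, x):
    """Grid position of the node whose weight vector is nearest to x."""
    return min(nodes, key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))


def train_som(data, rows, cols, epochs=20, seed=0):
    rng = random.Random(seed)
    # Assumption: each node starts at a randomly chosen data point.
    nodes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):       # visit genes in random order
            frac = 1.0 - step / total_steps         # shrinks from 1 towards 0
            rate = 0.5 * frac
            radius = max(rows, cols) / 2.0 * frac
            bmu = closest_node(nodes, x)
            for (r, c), w in nodes.items():
                grid_dist = abs(r - bmu[0]) + abs(c - bmu[1])
                if grid_dist <= radius:
                    pull = rate / (1 + grid_dist)   # weaker pull on neighbours
                    for d in range(len(w)):
                        w[d] += pull * (x[d] - w[d])
            step += 1
    return nodes


# Made-up data: genes measured under two conditions.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
nodes = train_som(data, rows=2, cols=2)
clusters = {i: closest_node(nodes, x) for i, x in enumerate(data)}
```

Genes that map to the same grid position form one cluster, and neighbouring grid positions tend to hold related expression patterns.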
Analysis. QT clustering. The user defines a minimum number of genes per cluster and a maximum cluster diameter. The method starts with a random single gene, then adds the gene that is closest to the starting gene, thereby increasing the diameter of the cluster. It keeps adding genes until none can be added without the diameter growing beyond the allowed cutoff. Some genes may not be clustered. [Figure axes: Condition 1 (e.g. expression), Condition 2 (e.g. age), Cond. 3 (e.g. stage)]
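A hedged sketch of QT-style clustering follows. It differs from the slide in one respect so that the result is reproducible: instead of one random starting gene, every remaining gene is tried as a seed and the largest candidate cluster is kept. Euclidean distance and the function names are also assumptions.

```python
import math


def grow_cluster(points, seed, available, max_diameter):
    """Grow a cluster from seed, always adding the gene that keeps it tightest."""
    members = [seed]
    candidates = sorted(set(available) - {seed})
    while candidates:
        def farthest(c):
            # Distance from candidate c to its farthest current member.
            return max(math.dist(points[c], points[m]) for m in members)
        best = min(candidates, key=farthest)
        if farthest(best) > max_diameter:
            break  # any further gene would push the diameter past the cutoff
        members.append(best)
        candidates.remove(best)
    return members


def qt_cluster(points, max_diameter, min_size):
    """Return (clusters, unclustered) as lists of point indices."""
    available = set(range(len(points)))
    clusters = []
    while available:
        best = max((grow_cluster(points, s, available, max_diameter)
                    for s in sorted(available)), key=len)
        if len(best) < min_size:
            break  # remaining genes stay unclustered
        clusters.append(sorted(best))
        available -= set(best)
    return clusters, sorted(available)


# Made-up data: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters, unclustered = qt_cluster(points, max_diameter=2, min_size=2)
```

Unlike k-means, the number of clusters is not chosen in advance, and outliers are left out rather than forced into a group.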
Identifying differential expression: Fold change. Fold change cut-off: global or sliding window, with thresholds T up and T down. [Figure: expression in groups A and B]
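A global fold-change cut-off can be sketched as below, assuming a symmetric threshold on the log2 scale and a default of two-fold (both are illustrative choices; the sliding-window variant is not shown).

```python
import math


def fold_change_call(mean_a, mean_b, cutoff=2.0):
    """Classify a gene by the fold change of group B relative to group A."""
    log_fc = math.log2(mean_b / mean_a)
    threshold = math.log2(cutoff)
    if log_fc >= threshold:
        return "up"
    if log_fc <= -threshold:
        return "down"
    return "unchanged"


# Made-up mean expression values for three genes in groups A and B.
calls = [fold_change_call(100, 400), fold_change_call(400, 100), fold_change_call(100, 150)]
```

A fixed cut-off ignores variability between replicates, which is why the statistical tests on the next slide are usually preferred when replicates are available.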
Identifying differential expression: Parametric statistics. Comparing means (1-2 groups): Student’s t. Comparing means (>2 groups): 1-way ANOVA. Comparing means (multifactor): 2-way ANOVA. Hypotheses: H0: A = B (A - B = 0); H1: A ≠ B. [Figure: expression in groups A and B]
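For the two-group case, the Student's t statistic with pooled variance can be computed with the standard library alone, as sketched below. The function name is hypothetical, and turning the statistic into a p-value needs the t distribution, which the standard library does not provide, so that step is left out.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Made-up log ratios for one gene in groups A and B.
t_stat, df = student_t([1, 2, 3], [4, 5, 6])
```

With thousands of genes tested at once, the resulting p-values would also need a multiple-testing correction before calling genes significant.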
Challenge : Data analysis across multiple dimensions of biology
Consult the literature on about 300 genes?
TP53 alone gives you publications... Too many data
Search every gene in PubMed?
Harmony of text-mining tool, database and visualization for pathway analysis. Sources: PubMed, open access, Google. Indexing the scientific literature: entity-based index, semantic index, automatic reader’s digest, document summary. Extracting interactions to create databases for systems biology. Text-mining tool: software solution for knowledge management and pathway analysis of high-throughput data.
Pathway Analysis based on DEG
Companies: Affymetrix –Single dye, high density, oligonucleotide arrays. Nimblegen –Similar to Affy, but specializing in custom arrays. Agilent –cDNA microarrays, scanners, etc. Axon Instruments –Scanners, software, etc.
Data Sources: Several central microarray databases exist –Stanford Microarray Database –GEO (NCBI) –READ (RIKEN). Find more at http://ihome.cuhk.edu.hk/~b400559/arraysoft_public.html
R and Bioconductor: R is an open-source clone of S-Plus. Bioconductor is a collection of microarray data analysis tools –Primarily written by research scientists. Very wide range of statistical and machine learning analyses. Command-line/programming language –More flexible, but more difficult to use and learn
TM4 (TIGR): MADAM –Microarray data manager. MEV –Multiple experiment viewer. Spotfinder –Image processing. ArrayViewer
Other Software: A more comprehensive list: http://ihome.cuhk.edu.hk/~b400559/arraysoft.html. Pathway Analysis, Data management, Annotation, …
Thank you!