Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park.


The Complete Microarray Bioinformatics Solution: data management, databases, statistical analysis, image processing, automation, data mining, cluster analysis.

Microarrays
Microarrays work by exploiting the ability of a given mRNA molecule (target) to bind specifically to, or hybridize to, the DNA template (probe) from which it originated. Gene regulation acts as both an "on/off" switch that controls which genes are expressed in a cell and a "volume control" that increases or decreases the level of expression of particular genes as necessary.

The miracle of microarray experiments
“I think you should be more explicit here in step two.”
Problems: experimental design, array fabrication, data analysis, normalization, QC, image processing, interpretation, data mining, communication.

Microarray Experiment Stages
- Microarray: probe and layout design, spotting, hybridization
- Image analysis: array segmentation, background intensity extraction, target detection, target intensity extraction, ratio analysis
- Data mining: DB, normalization, clustering, PCA, hierarchical graph, SOM, time series, etc.

Expression analysis experiments
- Laboratory work: RNA isolation; cDNA production and labeling; hybridization, washing, and drying
- Computer work: scanning of the microarray; image analysis; normalization; statistical analysis

[Figure: mRNA from cells in condition A and cells in condition B is converted to cDNA, labeled with Dye 1 and Dye 2, and mixed. Each spot reads as equal, over, or under. Caption: total or ratio of expression of genes from two sources.]

GSI Lumonics Emission Spectra : Cy3 and Cy5

cDNA Microarrays
- Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples.
- cDNAs are arrayed on each slide in a grid of spots.
- Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence.
- A sequence and its complement are present in the same spot.
- Different spots typically represent different genes, but some genes may be represented by multiple spots.
(BBSI)

cDNA Microarray Probes
Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays. ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing the end of a cDNA that has been reverse transcribed from mRNA.
[Figure: an mRNA with its poly-A tail (AAAAAAAAA...A), the cDNA with the matching poly-T stretch (TTTTTTTTTT...T), and the EST.]

Spotting cDNA Probes on Microarrays
- Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer.
- The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate.
- The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide.
- The pin is washed and the process is repeated for a different probe.
- Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide.
- Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing.

Clone Tracking
[Figure: source plates are mapped to array plates by a 96-channel pipette; array plates are mapped to the slide by an 8-pin arrayer.]

The PixSys 5500 Arraying Robot (Cartesian Technologies)
[Figure labels: vacuum wash station; print head holding up to 32 pins in an 8x4 format; vacuum hold-down platform (50-slide capacity); robotic arm.]
Source: Dan Nettleton, Course Notes, Statistics 416/516X (slide dated June 11, 2007).

Printing Arrays on 50 slides

Spotting the Probes on the Microarray (8 x 4 print head)
[Figure: a plate with wells holding probes in solution, next to a microarray slide.]
All spots of the same color are made at the same time. All spots in the same sector are made by the same pin.

[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, each showing a spot for gene 201 and a spot for gene 576, with the example sequences TTCCAG and GATATG.]
Each spot contains many copies of a sequence along with its complement (not shown).

Using cDNA Microarrays to Measure mRNA Levels
- RNA is extracted from a target sample of interest.
- mRNA is reverse transcribed into cDNA.
- The resulting cDNA is labeled with a fluorescent dye and placed on a microarray slide.
- Dyed cDNA sequences hybridize to complementary probes spotted on the array.
- A laser excites the dye and a scanner records an image of the slide.
- The image is quantified to obtain measures of fluorescence intensity for each pixel.
- Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.

Difficult to Make Meaningful Comparisons between Genes
The measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence). Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.

Using cDNA Microarrays to Measure mRNA Levels
[Figure: a microarray slide whose spots (probes) hold the sequences ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, and CGATA...G, next to the unknown mRNA sequences (target) from Sample 1 and Sample 2.]

Convert to cDNA and Label with Fluorescent Dyes
[Figure: the same slide, with the Sample 1 and Sample 2 targets converted to cDNA and labeled.]

Mix Labeled cDNA and Hybridize to the Slide
[Figure: the mixed Sample 1 and Sample 2 cDNA hybridized to the probes on the slide.]

Excite Dye with Laser, Scan, and Quantify Signals
[Figure: the same slide with a quantified signal value shown for each spot, for Sample 1 and Sample 2.]

Oligonucleotides: Simplified Example
[Figure: transcripts of gene 1 and gene 2; shared green regions indicate a high degree of sequence similarity throughout much of the transcript.]
- Oligo probe for gene 1: ATTACTAAGCATAGATTGCCGTATA
- Oligo probe for gene 2: GCGTATGGCATGCCCGGTAAACTGG

Oligo Microarray Fabrication
- Oligos can be synthesized and stored in solution for spotting, as is done with cDNA microarrays.
- Oligo sequences can also be synthesized directly on a slide or chip using various commercial technologies.
- In one approach, sequences are synthesized on a slide using ink-jet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build the oligonucleotides base by base on the slide.
- Affymetrix uses a photolithographic approach.


Image Analysis for Microarray
1. Array segmentation
2. Background intensity extraction
3. Target detection
4. Target intensity extraction
5. Ratio analysis
Next: data mining!

Image Processing

Processing of images
- Addressing or gridding: assigning coordinates to each of the spots
- Segmentation: classification of pixels either as foreground or as background
- Intensity extraction (for each spot): foreground fluorescence intensity pairs (R, G), background intensities, quality measures

Processing of images: Addressing
The basic structure of the images is known (determined by the arrayer). Parameters to address the spot positions:
- Separation between rows and columns of grids
- Individual translation of grids
- Separation between rows and columns of spots within each grid
- Small individual translation of spots
- Overall position of the array in the image

Processing of images: Segmentation
- Classification of pixels as foreground or background, so that fluorescence intensities can be calculated for each spot as a measure of transcript abundance
- Production of a spot mask: the set of foreground pixels for each spot

Processing of images-Segmentation

Segmentation methods and software that implements them:
- Fixed circle segmentation: ScanAlyze, GenePix, QuantArray
- Adaptive circle segmentation: GenePix, Dapple
- Adaptive shape segmentation: Spot, region growing and watershed
- Histogram segmentation: ImaGene, QuantArray, DeArray and adaptive thresholding

Processing of images: Fixed circle segmentation
- Fits a circle with a constant diameter to all spots in the image
- Easy to implement
- The spots need to be of the same shape and size
[Figure: a bad example.]
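Fixed circle segmentation can be sketched in a few lines. This is a minimal illustration, not the method of any package named on the slides; the function names, the 5x5 patch, and the pixel values are all hypothetical.

```python
def fixed_circle_mask(height, width, center_row, center_col, radius):
    """Return a height x width grid of booleans; True marks foreground pixels."""
    return [
        [(r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
         for c in range(width)]
        for r in range(height)
    ]


def split_pixels(image, mask):
    """Split pixel values into (foreground, background) lists using the mask."""
    fg, bg = [], []
    for image_row, mask_row in zip(image, mask):
        for value, inside in zip(image_row, mask_row):
            (fg if inside else bg).append(value)
    return fg, bg


# Hypothetical 5x5 patch: a bright spot in the middle of a dim background.
patch = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 95, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
# The same radius would be reused for every spot on the slide.
mask = fixed_circle_mask(5, 5, center_row=2, center_col=2, radius=1)
foreground, background = split_pixels(patch, mask)
```

Because the radius is fixed, a spot that is larger, smaller, or off-center relative to the circle leaks signal into the background or background into the signal, which is the weakness the slide points out.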

Processing of images: Adaptive circle segmentation
- The circle diameter is estimated separately for each spot
- Dapple finds spots by detecting the edges of spots (second derivative)
- Problematic if a spot has an oval shape

Processing of images: Adaptive shape segmentation
- Specification of starting points, or seeds
- Regions grow outwards from the seed points, preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region

Processing of images: Histogram segmentation
- Uses a target mask chosen to be larger than any spot
- Foreground and background intensities are determined from the histogram of pixel values for pixels within the masked area
- Example: QuantArray
  - Background: mean between the 5th and 20th percentile
  - Foreground: mean between the 80th and 95th percentile
- Unstable when a large target mask is set to compensate for variation in spot size
[Figure: histogram of pixel values with the background and foreground regions marked.]
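The percentile rule in the QuantArray example can be sketched as follows. The percentile bounds come from the slide; the rank convention (truncating `n * p / 100` to an index), the function names, and the pixel values are assumptions made for illustration, not QuantArray's actual implementation.

```python
def percentile_band_mean(pixels, low, high):
    """Mean of the sorted pixel values whose rank lies between two percentiles."""
    ordered = sorted(pixels)
    n = len(ordered)
    start = int(n * low / 100)
    stop = max(start + 1, int(n * high / 100))  # always keep at least one pixel
    band = ordered[start:stop]
    return sum(band) / len(band)


def histogram_segment(pixels):
    """Return (foreground, background) estimates for the pixels in a target mask."""
    background = percentile_band_mean(pixels, 5, 20)
    foreground = percentile_band_mean(pixels, 80, 95)
    return foreground, background


# Hypothetical target mask: 70 background pixels and 30 spot pixels.
mask_pixels = [100] * 70 + [1000] * 30
fg, bg = histogram_segment(mask_pixels)
```

If the mask is made much larger than the spot, the spot pixels can fall above the 95th percentile and drop out of the foreground band entirely, which is one way to read the instability noted on the slide.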

Processing of images: Data quality
Problems to look for: irregular size or shape, irregular placement, low intensity, saturation, spot variance, background variance.
[Figure panels: indistinguishable, saturated, bad print, artifact, misalignment.]

Intensity Extraction: Spot intensity
- The total amount of hybridization for a spot is proportional to the total fluorescence at the spot
- Spot intensity = sum of pixel intensities within the spot mask
- Since later calculations are based on ratios between Cy5 and Cy3, we compute the ratio of the median pixel values over the spot mask

Intensity Extraction: Background intensity
- Motivation: a spot’s measured intensity includes a contribution from non-specific hybridization and other chemicals on the glass
- Fluorescence from regions not occupied by DNA should be different from that of regions occupied by DNA, so it could be interesting to use local negative controls (spotted DNA that should not hybridize)
- Different background methods: local background, morphological opening, constant background, no adjustment

Intensity Extraction: Local background
- Focuses on small regions surrounding the spot mask
- The background estimate is the median of pixel values in this region
- Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix
- By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure
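A local background estimate of this kind can be sketched as the median over a ring around the spot, with a gap left between the spot and the ring. The ring geometry, the function name, and the pixel values are assumptions for illustration; each package named above defines its own region.

```python
from statistics import median


def local_background(image, center_row, center_col, inner_radius, outer_radius):
    """Median of pixels whose distance d from the spot center satisfies
    inner_radius < d <= outer_radius."""
    values = []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            d2 = (r - center_row) ** 2 + (c - center_col) ** 2
            if inner_radius ** 2 < d2 <= outer_radius ** 2:
                values.append(value)
    return median(values)


# Hypothetical 7x7 patch: spot (200), a halo of bleed-over (60), background (20).
patch = [[20] * 7 for _ in range(7)]
for r in range(7):
    for c in range(7):
        d2 = (r - 3) ** 2 + (c - 3) ** 2
        if d2 <= 1:
            patch[r][c] = 200
        elif d2 <= 4:
            patch[r][c] = 60

# inner_radius=2 skips the halo immediately surrounding the spot.
background = local_background(patch, 3, 3, inner_radius=2, outer_radius=3)
```

Setting `inner_radius` to the spot radius instead would pull the halo pixels into the estimate, which illustrates why the gap makes the result less dependent on how well the segmentation found the spot edge.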

Quality measures (-> Flag)
How good are the foreground and background measurements?
- Variability measures in pixel values within each spot mask
- Spot size
- Circularity measure
- Relative signal to background intensity
Based on these measurements, one can flag a spot.

Quantification of expression
For each spot on the slide we calculate
  Red intensity = Rfg - Rbg
  Green intensity = Gfg - Gbg
where fg = foreground and bg = background, and combine them in the log (base 2) ratio
  log2(Red intensity / Green intensity)
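As a worked example of the formula on this slide, the sketch below computes the background-corrected log ratio for one spot. Returning `None` when a corrected intensity is not positive is an assumption about how to handle that case, and the intensity values are made up.

```python
import math


def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """log2 of background-corrected red over background-corrected green."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        # Assumption: the ratio is undefined here, so the spot would be flagged.
        return None
    return math.log2(red / green)


up = log_ratio(1100, 100, 600, 100)       # (1000 / 500)  -> 1.0, red is 2x green
down = log_ratio(350, 100, 1100, 100)     # (250 / 1000)  -> -2.0, green is 4x red
flagged = log_ratio(90, 100, 600, 100)    # red below background -> None
```

On the log2 scale a two-fold increase and a two-fold decrease are symmetric around zero (+1 and -1), which is the usual reason for working with log ratios rather than raw ratios.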

Spot intensity data

Gene Expression Data
Gene expression data on p genes for n samples.
[Table: rows are genes, columns are slides (slide 1, slide 2, slide 3, slide 4, slide 5, …).]
Gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

GPR Data File Format


Knowledge Discovery in Microarrays
[Diagram labels: Image Processing; Data Cleaning; Quality Control; Data Imputation; Normalization; Gene Selection; Differential Expression; Feature Selection; Outcomes; Hypotheses; Publications; Treatments; Pattern Assessment; Annotation; Expert Analysis; Experiments; Pattern Recognition; Clustering; PCA/SVD; Associations.]

Uses of DNA microarrays
Expression analysis, genotyping, and resequencing, in toxicology, cancer research, environmental research, population genetics, association studies, etc.

First question: What do I want to find out?
- Screen thousands of (differential) genes to find relevant ones in my process of interest:
  - Healthy/diseased comparison
  - Marker genes of tumor changes
  - Genes expressed in a particular physiological phase
- Follow a pattern of expression level changes in a time series to find co-expressed genes (pathways, regulatory networks)
- Study the presence of interesting genes (e.g. genus-specific ID of the microbial community)

Second question: Why do you use analysis tools?
Normalization: to manipulate the data so that differences due to
- different array,
- time of hybridization, or
- sample
are removed before the results from different arrays are compared to each other.

Second question: Why do you use analysis tools?
To find genes that are significantly differentially expressed. For that you need statistics to evaluate your data and to separate true results from noise or error.

Second question: Why do you use analysis tools?
Data mining: to connect the information to other biological data.
- Co-expressed genes may be involved in the same function / state / condition
- Does a group (cluster) of genes with similar expression profiles have a common function?

Normalization: Why normalize data?
- To balance the fluorescence intensities of the two dyes (green Cy3 and red Cy5)
- To allow the comparison of expression levels across experiments (slides)
- To adjust the scale of the relative gene expression levels (as measured by log ratios) across replicate experiments
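One simple way to balance the two dyes is global median centering of the log ratios on a slide. The slides do not say which method was used, so this is only an illustrative sketch; it assumes most genes are not differentially expressed, and the values are made up.

```python
from statistics import median


def median_center(log_ratios):
    """Shift a slide's log2(R/G) values so that their median is zero."""
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical slide where the red channel reads systematically high.
raw = [0.5, 1.5, 0.0, 1.0, 0.5]
normalized = median_center(raw)
```

A constant shift like this corrects an overall dye imbalance but not one that changes with spot intensity; that case is what the lowess-based approach mentioned later in the deck addresses.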

Sources of error: pin geometry, slide heterogeneity, mRNA preparation, fluorescent labeling, hybridization conditions, scanning and image analysis.

Data Normalization
[Figure panels: calibrated, red and green equally detected; uncalibrated, red light under-detected.]

M vs. A plot
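For reference, the two axes of an M vs. A plot are conventionally defined as M = log2(R/G) and A = (1/2) log2(R x G), computed per spot from the background-corrected intensities. The sketch below computes them for one hypothetical spot; the function name and the values are illustrative.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot: the log ratio and the mean log intensity."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


# Hypothetical background-corrected intensities for a single spot.
m, a = ma_values(1024, 256)
```

Plotting M against A for every spot shows whether the log ratio drifts with overall intensity, which a plot of R against G tends to hide.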

Box plot

Normalization: Channel biases Before Normalization…

Normalization: Channel biases After Normalization…

Additional Normalization
Dye swap
- Combine relative expression levels without explicit normalization
- Compute lowess fit for log2(RR'/GG')/2 vs. log2(A + A')/2
- Normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction

What is clustering?
- Clustering methods are used to find genes from the same biological process and to group experiments with similar conditions
- Different clustering methods can give different results. The physically motivated ones are more robust.
- Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions

Football Booing Cheering

Cluster Analysis
- Cluster genes based on expression profiles (gene expression across several treatments)
- Hypothesis: genes with similar function have similar expression profiles

Hierarchical clustering: example, two dimensions
[Figure: tumour and healthy samples (tumour 1 to 5, healthy 1 to 4) plotted by log (ratio) and grouped by Pearson correlation.]

Clustering methods: hierarchical clustering
Calculate the distance matrix, then calculate the averages of the most similar.

Clustering methods: hierarchical clustering (avg. linkage)
Calculate the distance matrix, then calculate the averages of the most similar.
[Figure: dendrogram over items 1, 2, 3, 4.]

Hierarchical clustering: isomorphism
Look at the branching pattern when assessing similarity, not simply the sample (or gene) order!
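The hierarchical clustering slides above can be illustrated with a small agglomerative sketch. It assumes average linkage over a distance of 1 minus the Pearson correlation, which matches the terms used on the slides but is not necessarily the exact procedure behind the figures; the gene names and profiles are hypothetical, and constant profiles (zero variance) are not handled.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_1, cluster_2, distance) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(pearson_distance(profiles[a], profiles[b])
                        for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical log-ratio profiles across four slides.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [8.0, 6.0, 4.5, 2.0],
}
merges = average_linkage(profiles)
```

The merge history fixes the branching pattern but not the left-to-right order of the leaves, since either child of a merge can be drawn first; that is the isomorphism point made on the slide.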

Analysis: k-means clustering
[Figure axes on each step: Condition 1 (e.g. expression), Condition 2 (e.g. age), Condition 3 (e.g. stage).]
Step 1. Genes are randomly divided into k groups. k is assigned by the user; k = 4 for this illustration.
Step 2. The centroid for each group is determined.
Step 3. Each gene is reassigned to its closest centroid.
Step 4. The centroids’ positions are recalculated.
Step 5. The assignment and recalculation steps are reiterated until all genes are assigned to the immutable groups ( times).
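The five steps above can be sketched in plain Python as follows. The names, the fixed random seed, the iteration cap, and the two-dimensional example points are assumptions for illustration; the initial partition deals genes out in shuffled order so that no group starts empty.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random division into k groups (dealt out so none is empty).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 2 and 4: (re)compute the centroid of each group.
        for g in range(k):
            members = [p for p, label in zip(points, labels) if label == g]
            if members:  # an emptied group keeps its previous centroid
                centroids[g] = centroid(members)
        # Step 3: reassign each gene to its closest centroid.
        new_labels = [min(range(k), key=lambda g: sq_dist(p, centroids[g]))
                      for p in points]
        # Step 5: stop once the groups no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical genes measured under two conditions, forming two tight groups.
genes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(genes, k=2)
```

The result depends on the random starting partition, so in practice the procedure is usually run several times with different seeds.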

Analysis: Self-organizing map
- Similar to k-means clustering; however, in addition to dividing genes into groups based on expression patterns, SOM arranges the groups in a two-dimensional map
- Useful for visualizing distinct expression patterns and determining which of these patterns are variants of one another
[Figure axes: Condition 1 (e.g. expression), Condition 2 (e.g. age), Condition 3 (e.g. stage).]

Self-Organizing Maps (Silicon Genetics, 2003)
- Situate a grid of nodes along a plane where the data points are distributed.
- Sample a gene and subject the closest node and neighboring nodes to its ‘gravitational’ influence.
- Sample another gene, and so on, until all genes have been sampled several times over.
- Each cluster is defined with reference to a node: it comprises those genes for which that node is the closest node.
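The sampling procedure on these slides can be sketched as a small self-organizing map. The Gaussian neighborhood, the linearly decaying pull, the initialization at randomly chosen data points, and all names and parameter values are assumptions chosen to keep the example short; they are not taken from the Silicon Genetics material.

```python
import math
import random


def closest_node(nodes, x):
    """Grid coordinate of the node whose weights are nearest to x."""
    return min(nodes, key=lambda n: sum((a - b) ** 2 for a, b in zip(nodes[n], x)))


def train_som(data, rows, cols, epochs=50, seed=0, rate=0.5, radius=1.0):
    rng = random.Random(seed)
    # Situate the grid: each node starts at a randomly chosen data point.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # the pull weakens as training proceeds
        for x in rng.sample(data, len(data)):  # sample every gene once per epoch
            winner = closest_node(nodes, x)
            for n, weights in nodes.items():
                grid_d2 = (n[0] - winner[0]) ** 2 + (n[1] - winner[1]) ** 2
                # The winner and its grid neighbors are pulled toward the gene.
                pull = rate * decay * math.exp(-grid_d2 / (2 * radius ** 2))
                for i in range(len(weights)):
                    weights[i] += pull * (x[i] - weights[i])
    return nodes


def assign(data, nodes):
    """Each gene's cluster is the node closest to it."""
    return [closest_node(nodes, x) for x in data]


# Hypothetical genes measured under two conditions.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
nodes = train_som(data, rows=1, cols=2)
labels = assign(data, nodes)
```

Because neighboring nodes are pulled together, nodes that end up close on the grid tend to represent related expression patterns, which is what distinguishes the map from plain k-means.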

Analysis: QT clustering
- User-defined minimum number of genes in a cluster and maximum cluster diameter
- Starts with a random single gene, then adds the gene that is closest to the starting gene, thereby increasing the diameter of the cluster
- Adds new genes until no gene can be added without the diameter growing beyond the allowed cutoff
- Some genes may not be clustered
[Figure axes: Condition 1 (e.g. expression), Condition 2 (e.g. age), Condition 3 (e.g. stage).]
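A sketch of QT clustering follows. It uses a common variant rather than the single random start described on the slide: a candidate cluster is grown from every remaining gene and the largest one is kept, and the gene added at each step is the one that keeps the diameter smallest. Euclidean distance, the names, and the example points are assumptions.

```python
import math


def diameter_with(cluster, candidate, points):
    """Largest distance between the candidate and any current member."""
    return max(math.dist(points[candidate], points[m]) for m in cluster)


def grow_cluster(seed, available, points, max_diameter):
    cluster, diameter = [seed], 0.0
    remaining = set(available) - {seed}
    while remaining:
        # Pick the gene whose addition keeps the diameter smallest.
        best = min(remaining, key=lambda g: (diameter_with(cluster, g, points), g))
        new_diameter = max(diameter, diameter_with(cluster, best, points))
        if new_diameter > max_diameter:
            break  # no gene fits without exceeding the cutoff
        cluster.append(best)
        diameter = new_diameter
        remaining.remove(best)
    return cluster


def qt_cluster(points, max_diameter, min_size):
    """Return (clusters, unclustered gene names)."""
    available = set(points)
    clusters = []
    while available:
        candidates = [grow_cluster(s, available, points, max_diameter)
                      for s in sorted(available)]
        largest = max(candidates, key=len)
        if len(largest) < min_size:
            break  # whatever is left stays unclustered
        clusters.append(sorted(largest))
        available -= set(largest)
    return clusters, sorted(available)


# Hypothetical genes measured under two conditions.
points = {
    "g1": (0, 0), "g2": (0, 1), "g3": (1, 0),
    "g4": (10, 10), "g5": (10, 11), "g6": (50, 50),
}
clusters, unclustered = qt_cluster(points, max_diameter=2, min_size=2)
```

Unlike k-means, the number of clusters is not chosen in advance, and an outlier such as `g6` is simply left out rather than forced into a group.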

Identifying differential expression: Fold change
- Fold change cut-off: global or sliding window
- Thresholds: T up and T down
[Figure: expression in group A and group B.]
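A global fold-change cut-off can be sketched as below. The thresholds `t_up` and `t_down` are expressed on the log2 scale, and their default values (a two-fold change in either direction) are assumptions; the sliding-window variant mentioned on the slide is not shown.

```python
import math


def classify_fold_change(mean_a, mean_b, t_up=1.0, t_down=-1.0):
    """Label a gene by the log2 fold change of group B relative to group A."""
    log_fc = math.log2(mean_b / mean_a)
    if log_fc >= t_up:
        return "up"
    if log_fc <= t_down:
        return "down"
    return "unchanged"


# Hypothetical mean expression levels in groups A and B.
calls = [
    classify_fold_change(100, 400),   # four-fold higher in B
    classify_fold_change(400, 100),   # four-fold lower in B
    classify_fold_change(100, 150),   # 1.5-fold, below the cut-off
]
```

A fold-change rule ignores the variability within each group, which is why the deck follows it with parametric statistics.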

Identifying differential expression: Parametric statistics
- Comparing means (1-2 groups): Student’s t
- Comparing means (>2 groups): 1-way ANOVA
- Comparing means (multifactor): 2-way ANOVA
Hypotheses: H0: μA = μB, i.e. μA - μB = 0; H1: μA ≠ μB
[Figure: expression in group A and group B.]
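For the two-group case, the Student's t statistic with pooled variance can be computed as in this sketch. It returns only the statistic and the degrees of freedom; turning these into a p-value needs a t distribution, which is left to a statistics package. The function name and the expression values are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pooled estimate of the common variance of the two groups.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical log expression values for one gene in groups A and B.
t, df = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

On a microarray this test is repeated for every gene, so the resulting p-values are normally adjusted for multiple testing before genes are called differentially expressed.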

Challenge : Data analysis across multiple dimensions of biology

Consult the literature on about 300 genes?

TP53 alone gives you publications... Too many data

Search every gene in PubMed?

Harmony of text-mining tool, database and visualization for pathway analysis
- Sources: PubMed, open access, Google
- Indexing the scientific literature: entity-based index, semantic index, automatic reader’s digest, document summary
- Text-mining tool: extracting interactions to create databases for systems biology
- Software solution for knowledge management and pathway analysis of the high-throughput data

Pathway Analysis based on DEG

Companies
- Affymetrix: single dye, high density, oligonucleotide arrays
- Nimblegen: similar to Affy, but specializing in custom arrays
- Agilent: cDNA microarrays, scanners, etc.
- Axon Instruments: scanners, software, etc.

Data Sources
Several central microarray databases exist:
- Stanford Microarray Database
- GEO (NCBI)
- READ (RIKEN)
Find more at http://ihome.cuhk.edu.hk/~b400559/arraysoft_public.html

R and Bioconductor
- R is an open-source clone of S-Plus
- Bioconductor is a collection of microarray data analysis tools, primarily written by research scientists
- Very wide range of statistical and machine learning analyses
- Command-line/programming language: more flexible, but more difficult to use and learn

TM4 (TIGR)
- MADAM: microarray data manager
- MEV: multiple experiment viewer
- Spotfinder: image processing
- ArrayViewer

Other Software
A more comprehensive list: http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
- Pathway analysis
- Data management
- Annotation
- …

Thank you!