COT 6930 HPC & Bioinformatics Microarray Data Analysis


COT 6930 HPC & Bioinformatics Microarray Data Analysis Xingquan Zhu Dept. of Computer Science and Engineering

[Figure: information flow DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the matching resources: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]

Outline Gene Expression and Biological Network DNA Microarray What, Why, and How DNA Microarray Microarray Construction Comparative Hybridization Data Analysis Public Databases

Gene Expression Biologically, genes are expressed when they are transcribed into RNA. The amount of mRNA indicates gene activity: no mRNA → gene is off; mRNA present → gene is on & performing its function. Some genes are always expressed in all tissues (an estimated 10,000 housekeeping / ubiquitous genes). Other genes are selectively on, depending on tissue, disease, and/or environment. Change in environment → change in gene expression, so the organism can respond.

Biological Network Gene expression does not happen in isolation. Individual genes code for function: produce mRNA → protein performing function. Sets of genes can form pathways: gene products can turn on / off other genes. Sets of pathways can form networks when pathways interact. Biology is a study of networks: genes, proteins, etc.

Types of Biological Networks Genetic network: interactions between genes and gene products. Gene regulation network: network of control decisions to turn genes on / off; a subset of the genetic network. Metabolic network: network of interactions between proteins that synthesize / break down molecules (enzymes, cofactors).

An example of Genetic Network

Gene Regulation Network

An example of Metabolic network

Examining Biological Networks – Benefits Learn about gene function / regulation: tissue differentiation; response to environmental factors. Identify / treat diseases: discover genetic causes of disease; evaluate effect of drugs. Detect impact of DNA sequence variation (mutations): detection of mutations (e.g., SNPs); genetic typing.

Examining Biological Networks – Approach Measure protein / mRNA in cells. In different tissues (e.g., brain vs. muscle): find genes / proteins with tissue-specific function. As environment changes: find genes / proteins responsible for the response. In healthy & diseased tissues: find proteins / genes responsible for disease (if any); help identify diseases based on gene expression. In different individuals: detect DNA sequence variation.

Examining Biological Networks Direct approach: measure protein production / interaction in the cell (2D electrophoresis, mass spectrometry, protein microarray). Advantages: precise results on proteins. Disadvantages: low throughput (for now).

Examining Biological Networks Indirect approach: measure mRNA production (gene expression) in the cell (random ESTs, DNA microarray). Advantages: high throughput; can test a large variety of mRNA simultaneously. Disadvantages: RNA level is not always correlated with protein level / function; misses changes at the protein level; results may thus be less precise.

Outline Gene Expression and Biological Network DNA Microarray What, Why, and How DNA Microarray Microarray Construction Comparative Hybridization Data Analysis Public Databases

DNA Microarray Question How to determine whether a gene is expressed, or how to measure mRNA?

DNA Microarray

Hybridization to the Chip

The Chip is Scanned

Images

Video: http://www.youtube.com/watch?v=VNsThMNjKhM

Oligonucleotide (GeneChip) vs. Spotted Arrays GeneChip microarray: a gene is represented by a probe set; a set of 11-16 probes forms a probe set; probe length: 25 bp; can use a small amount of RNA; efficient hybridization. Spotted microarray: one probe per gene; probe length: hundreds to 1k bp; less expensive.

GeneChip: Chip->Probeset->Probe pair->Probe [Figure: a 1.28 cm x 1.28 cm chip, zooming in to a probe set, a probe pair (PM and MM), and a single probe cell]

GeneChip Array Design 25-mer unique oligo; mismatch in the middle nucleotide; multiple probes (11~16) for each gene. From Affymetrix Inc.

Affymetrix GeneChip The second technology used in microarray experiments is used by Affymetrix. This technology is based upon growing specific oligos on a silicon substrate. Thus these are often called “gene chips”. Multiple variants are placed for each gene, with specific one-base variants as internal controls. (How many genes on this chip?) Typically no replicates...

Affymetrix GeneChip Here we can see an annotated close-up of an Affymetrix chip, with the regions relating to several genes highlighted.

DNA Microarray Design & Analysis Microarray construction Array design Choosing probe sequences Comparative Hybridization (data collection) Measure relative amount of mRNA Image processing of scanned images Spot detection, normalization, quantization Data Analysis Statistical test, noise handling (low-level) Clustering, classification (high-level)

cDNA Complementary DNA: sequences are the complements of the original mRNA sequences. Why don’t we simply capture mRNA? The environment is full of RNA-digesting enzymes, so free RNA is quickly degraded. To prevent the experimental samples from being lost, they are reverse-transcribed into the more stable DNA form.
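The complement relationship between an mRNA and its cDNA can be sketched in a few lines. This is only an illustration: the function name `first_strand_cdna` and the example sequence are made up here, and it assumes both strands are written 5'→3', so the complemented sequence is reversed.

```python
# Base pairing from an RNA template to the DNA strand synthesized against it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA of an mRNA, both written 5'->3'."""
    complement = [RNA_TO_DNA_COMPLEMENT[base] for base in mrna.upper()]
    # The two strands are antiparallel, so reverse to read the cDNA 5'->3'.
    return "".join(reversed(complement))

print(first_strand_cdna("AUGGCC"))  # prints GGCCAT
```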

cDNA

DNA Microarray Construction Drops (spots) of cDNA fragments as probes: attach to glass slide / nylon array at known locations; use mechanical pins & robotics. Use: label cDNA with fluorescent dyes (fluors); measure contrast in intensity; use laser / CCD scanner.

DNA Microarray: Automatic Detection

DNA Microarray Choice of probe: include genes of interest; examine sequence databases. Avoid redundancy: no duplicate probes. Avoid cross-hybridization: GeneChip alleviates this problem by using probe pairs (PM, MM). Can use software to help choose probes, or simply buy pre-designed arrays: complete genomes of yeast, Drosophila, C. elegans; 33,000+ human genes from GenBank RefSeq on 2 microarrays. Expensive but labor-saving.

DNA Microarray Design & Analysis Microarray construction Spotted cDNA arrays, in situ photolithography… Array design Choosing probe sequences Comparative Hybridization (data collection) Measure relative amount of mRNA Image processing of scanned images Spot detection, normalization, quantization Data Analysis Statistical test, noise handling (low-level) Clustering, classification (high-level)

Comparative Hybridization Goal: measure relative amount of mRNA expressed. Algorithm: (1) choose cell populations; (2) mRNA extraction and reverse transcription; (3) fluorescent labeling of cDNAs (normalized); (4) hybridization to microarray; (5) scan the hybridized array; (6) interpret scanned image.

Comparative Hybridization

Comparative Hybridization

Comparative Hybridization Color determined by relative RNA concentrations Brightness determined by total concentration
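As a rough sketch of how colour and brightness might be turned into numbers, assume each spot yields one intensity per dye channel (called `red` and `green` here; these names and the example values are hypothetical, not from the slides). The log ratio of the two channels captures the colour, and the average log intensity captures the brightness.

```python
import math

def spot_summary(red, green):
    """Return (log2 ratio, mean log2 intensity) for one two-channel spot."""
    log_ratio = math.log2(red / green)         # sign and size reflect the colour
    brightness = 0.5 * math.log2(red * green)  # average of the two log intensities
    return log_ratio, brightness

# Hypothetical background-corrected intensities: (red, green) per spot.
spots = {
    "gene_a": (4000.0, 1000.0),  # more red: higher in the red-labelled sample
    "gene_b": (500.0, 500.0),    # equal signal in both channels
    "gene_c": (250.0, 1000.0),   # more green: higher in the green-labelled sample
}
for name, (red, green) in spots.items():
    ratio, brightness = spot_summary(red, green)
    print(f"{name}: log2 ratio = {ratio:+.2f}, mean log2 intensity = {brightness:.2f}")
```

Working on the log scale makes a 4-fold increase (+2) and a 4-fold decrease (-2) symmetric around zero.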

DNA Microarray Methodology Anatomy of a Comparative Gene Expression Study http://www.cs.wustl.edu/~jbuhler/research/array/#diagram Flash Animation http://www.bio.davidson.edu/courses/genomics/chip/chip.html

DNA Microarray Design & Analysis Microarray construction Spotted cDNA arrays, in situ photolithography… Array design Choosing probe sequences Comparative Hybridization (data collection) Measure relative amount of mRNA Image processing of scanned images Spot detection, normalization, quantization Data Analysis Statistical test, noise handling (low-level) Clustering, classification (high-level)

Streamlined Array Analysis [Flowchart: Raw data; Normalize; Filter (present/absent, minimum value, fold change); Significance (t-test); Classification (machine learning); Clustering (hierarchical CL, biclustering); Gene lists; Function (Gene Ontology)]

Microarray data [Figure: genes (Gene 1, Gene 2, ..., Gene N) measured across experiments E1, E2, E3]

Microarray data analysis begins with a data matrix (gene expression values versus samples). Typically, there are many genes (>> 10,000) and few samples (~ 10).

Low-Level Data Analysis Normalization: when you have variability in measurements, you need replication and statistics to find real differences. Significance test: it’s not just the genes with a 2-fold increase, but those with a significant p-value across replicates.
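A small sketch of why fold change alone is not enough: both hypothetical genes below show the same 2-fold change (a difference of 1 on the log2 scale), but only one is consistent across replicates. The data and the choice of Welch's t statistic are assumptions for illustration; turning the statistic into a p-value would additionally need the t distribution from a statistics library, which is not shown.

```python
import math
import statistics

def welch_t(control, treated):
    """Welch's t statistic for two groups of replicate measurements."""
    mean_c, mean_t = statistics.mean(control), statistics.mean(treated)
    var_c, var_t = statistics.variance(control), statistics.variance(treated)
    standard_error = math.sqrt(var_c / len(control) + var_t / len(treated))
    return (mean_t - mean_c) / standard_error

# Hypothetical log2 expression values, three replicates per condition.
steady = welch_t([8.0, 8.1, 7.9], [9.0, 9.1, 8.9])    # 2-fold, tight replicates
noisy = welch_t([8.0, 6.0, 10.0], [9.0, 7.0, 11.0])   # 2-fold, scattered replicates
print(f"steady gene t = {steady:.2f}, noisy gene t = {noisy:.2f}")
```

The steady gene gets a much larger t statistic than the noisy one, even though their fold changes are identical.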

Sources of Variability in Raw Data Biological variability. Sample preparation: probe labeling, RNA extraction. Experimental condition: temperature, time, mixing, etc. Scanning: laser and detector, chemistry of the fluorescent label. Image analysis: identifying and quantifying each spot on the array.

Data Normalization Can control for many of the experimental sources of variability (systematic, not random or gene-specific). Bring each image to the same average brightness. Can use simple math or fancy: divide by the mean (whole chip or by sectors), or LOESS (locally weighted regression). No sure biological standards.
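The "simple math" option, dividing by the whole-chip mean, might look like the sketch below. The function name and the intensity values are hypothetical; LOESS normalization is more involved and is not attempted here.

```python
def normalize_by_mean(intensities):
    """Scale one array so that its mean intensity becomes 1.0."""
    chip_mean = sum(intensities) / len(intensities)
    return [value / chip_mean for value in intensities]

# Hypothetical raw intensities: chip_b is the same pattern scanned twice as bright.
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]
print(normalize_by_mean(chip_a))
print(normalize_by_mean(chip_b))
```

After scaling, both chips have the same average brightness, so the remaining differences between them are not due to overall scanner or labeling intensity.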

Scatter plots One of the most common visualization methods for microarray data. Useful to compare gene expression values from two microarray experiments (e.g. control, experimental). Each dot corresponds to a gene expression value. Most dots fall along a line. Outliers represent up-regulated or down-regulated genes.

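Scatter plot analysis of microarray data [Figure: scatter plot with expression level running from low to high on the axes; outliers on either side of the line are labeled up and down]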

Differential Gene Expression in Different Tissue and Cell Types [Figure panels: Brain, Astrocyte, Fibroblast] We are interested in outliers. The major goal of a scatter plot is to identify genes that are differentially regulated between different experimental conditions.

DNA Microarray Design & Analysis Microarray construction Spotted cDNA arrays, in situ photolithography… Array design Choosing probe sequences Comparative Hybridization (data collection) Measure relative amount of mRNA Image processing of scanned images Spot detection, normalization, quantization Data Analysis Statistical test, noise handling (low-level) Clustering, classification (high-level)

Higher Level Data Analysis Computational tasks: Clustering Classification Statistical validation Data visualization Pattern detection Biological problems: Discovery of common sequences in co-regulated genes Meta-studies using data from multiple experiments Linkage between gene expression data and gene sequence/function/metabolic pathways databases

Microarray data [Figure: genes (Gene 1, Gene 2, ..., Gene N) measured across experiments E1, E2, E3]

Why care about “clustering”? [Figure: genes (Gene 1, Gene 2, ..., Gene N) across experiments E1, E2, E3, shown with the genes in two different orders] Discover functional relation: similar expression → functionally related. Assign function to unknown gene. Find which gene controls which other genes.

Types of Clustering Methods Hierarchical: link similar genes, build up to a tree of all. K-means clustering. Self-Organizing Maps (SOM): split all genes into similar sub-groups; finds its own groups (machine learning). Bi-clustering.

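Some distance measures Given vectors $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$. Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$. Correlation distance: $d(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient of $x$ and $y$.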
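The three measures can be written out directly. This is a minimal sketch using the standard textbook definitions, with correlation distance taken as 1 minus the Pearson correlation (an assumption, since the slide's own formula images are not in the transcript); the example profiles are made up.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 - Pearson correlation: small when two profiles have the same shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - covariance / (spread_x * spread_y)

# Hypothetical profiles: gene_b follows the same trend as gene_a at twice the level.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b), correlation_distance(gene_a, gene_b))
```

The two profiles are far apart by Euclidean and Manhattan distance but essentially at correlation distance zero, which is why the choice of measure changes which genes end up clustered together.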

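Finding a Centroid We use the following equation to find the n-dimensional centroid point amid k n-dimensional points $p_1, \dots, p_k$: $CP(p_1, \dots, p_k) = \left( \frac{1}{k}\sum_{j=1}^{k} p_{j,1}, \dots, \frac{1}{k}\sum_{j=1}^{k} p_{j,n} \right)$. Let’s find the midpoint between 3 2D points, say (2,4), (5,2), (8,9): $CP = \left( \frac{2+5+8}{3}, \frac{4+2+9}{3} \right) = (5, 5)$.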
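The centroid is just the coordinate-wise mean, which can be checked on the three points from the slide. The function name `centroid` is chosen here for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of k points that all have the same dimension."""
    k = len(points)
    # zip(*points) groups the first coordinates together, then the second, and so on.
    return tuple(sum(coords) / k for coords in zip(*points))

print(centroid([(2, 4), (5, 2), (8, 9)]))  # prints (5.0, 5.0)
```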

Hierarchical Clustering
Treat each example as a cluster
While (clusters > 1)
    Merge two clusters with the least distance
    Update cluster centroid
    Clusters--
Endwhile
Easy. No need to specify the number of clusters beforehand.
Trouble to interpret “tree” structure. Hard to interpret the relation between nodes, e.g. when one group of genes represses another group, they are anti-correlated and far away from each other.
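The loop above can be sketched in plain Python. This version measures the distance between clusters by their centroids and uses Euclidean distance; both are assumptions for the sketch (other linkage rules are common), and the function names and sample points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge order."""
    clusters = [[p] for p in points]  # treat each example as a cluster
    merges = []
    while len(clusters) > 1:
        centers = [centroid(c) for c in clusters]  # updated after every merge
        # Pick the pair of clusters whose centroids are closest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: math.dist(centers[pair[0]], centers[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

data = [(1.0, 1.0), (1.2, 1.0), (5.0, 5.0), (5.0, 5.3)]
for left, right in agglomerate(data):
    print(left, "+", right)
```

The recorded merge order is what a dendrogram draws: early merges join near neighbours, and the last merge joins the two most distant groups.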

K-means Algorithm
1. Choose k initial center points randomly
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster using only points within the cluster
4. Re-cluster all data using the new center points (this step could cause data points to be placed in a different cluster)
5. Repeat steps 3 & 4 until no data points are moved from one cluster to another in step 4, or some other convergence criterion is met
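A compact sketch of these steps in plain Python, assuming Euclidean distance and initial centers drawn from the data points themselves. The `seed` and `max_iter` parameters, the function names, and the sample data are additions for this illustration and are not part of the slides.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after running k-means on a list of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # choose k initial centers at random
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        # Recompute each center from the points currently in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return centers, assignment

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

Because the starting centers are random, different seeds can give different results on less cleanly separated data.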

An example with k=2 We pick k=2 centers at random. We cluster our data around these center points.

K-means example with k=2 We recalculate centers based on our current clusters

K-means example with k=2 We re-cluster our data around our new center points

K-means example with k=2 We repeat the last two steps until no more data points are moved into a different cluster

Cluster Quality Since any data can be clustered, how do we know our clusters are meaningful? Compare the size (diameter) of the cluster vs. the inter-cluster distance. Size can be measured as the distance between the members of a cluster and the cluster’s center, or as the diameter of the smallest sphere enclosing the cluster.

Cluster Quality Continued Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter. [Figure: two cases, each with cluster size=5; in one the nearest cluster is at distance=20, in the other at distance=5]
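The ratio can be computed from the points themselves. In this sketch the diameter is approximated by the largest pairwise distance inside the cluster, and the distance to another cluster is measured between centroids; both choices, like the function names and coordinates, are assumptions made to reproduce the size=5, distance=20 and distance=5 cases.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def diameter(cluster):
    """Largest distance between any two members of the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))

def quality_ratio(cluster, nearest_cluster):
    """Distance to the nearest cluster divided by the cluster's own diameter."""
    separation = math.dist(centroid(cluster), centroid(nearest_cluster))
    return separation / diameter(cluster)

cluster = [(0.0, 0.0), (3.0, 4.0)]       # diameter 5
far_neighbor = [(20.0, 0.0), (23.0, 4.0)]  # centroid 20 away
near_neighbor = [(5.0, 0.0), (8.0, 4.0)]   # centroid 5 away
print(quality_ratio(cluster, far_neighbor), quality_ratio(cluster, near_neighbor))
```

A larger ratio means the cluster is tight relative to how far away its neighbour is.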

Cluster Quality Continued Quality can be assessed simply by looking at the diameter of a cluster. A cluster can be formed even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.

k-means comments Strength: easy; relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. Weakness: sensitive to the initial seeds; applicable only when the mean is defined (then what about categorical data?); need to specify k, the number of clusters, in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.

A Problem of K-means Sensitive to outliers. Outlier: objects with extremely large values; may substantially distort the distribution of the data. When mean is not meaningful: K-medoids uses the most centrally located object in a cluster. [Figure: points plotted on axes running from 1 to 10, with cluster centers marked +]
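The effect of an outlier on the mean, and why a medoid is more robust, can be seen on a toy cluster. The medoid is taken here as the member with the smallest total distance to all other members, which is one common reading of "most centrally located object"; the data and function names are hypothetical.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; it need not coincide with any actual data point."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def medoid(points):
    """The cluster member with the smallest total distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Four nearby points plus one extreme outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
print("mean:", mean_point(cluster))
print("medoid:", medoid(cluster))
```

The mean is dragged far away from the four nearby points by the single outlier, while the medoid remains one of them.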

A Problem of K-means: Differing Density [Figure: Original Points vs. K-means (3 Clusters)]

Clusters with non-convex shapes [Figure: Original Points vs. K-means (2 Clusters)]

A parallel k-means package Parallel K-Means Data Clustering http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html

Other clustering methods Self-Organizing Maps (SOM): determine their own groups by using neural networks. Bi-clustering: simultaneously merge columns and rows into clusters (group of genes, group of examples).

Two-way clustering of genes (y-axis) and cell lines (x-axis)

Outline Gene Expression and Biological Network DNA Microarray What, Why, and How DNA Microarray Microarray Construction Comparative Hybridization Data Analysis Public Databases

Public Databases Gene expression data is an essential aspect of annotating the genome. Publication and data exchange for microarray experiments. Data mining / meta-studies. Common data format: XML. MIAME (Minimum Information About a Microarray Experiment).

GEO at the NCBI

ArrayExpress at EMBL

ArrayExpress at EMBL

Outline Gene Expression and Biological Network DNA Microarray What, Why, and How DNA Microarray Microarray Construction Comparative Hybridization Data Analysis Public Databases