Novel data clustering for microarrays and image segmentation
Andrew Knyazev

Presentation transcript:


We develop novel algorithms and parallel software for clustering large datasets. Target applications include the analysis of large microarray or tiling-array datasets in molecular biology and the segmentation of high-resolution images. Many data clustering codes are available, but on large datasets most of them are either prohibitively slow or produce unreliable, spurious clusters because they are ad hoc rather than mathematically based. We use spectral clustering, which has mathematical foundations in the spectral theory of graph Laplacians, principal component analysis, reversible Markov random walks, and models of mechanical vibrations of mass-spring systems. Spectral clustering produces high-quality clusters but requires the numerical solution of eigenvalue problems with many millions of unknowns. Our main expertise, developing parallel eigenvalue solvers, allows us to handle such problems efficiently, opening an opportunity for quality clustering of record-size datasets.

We have some experience analyzing Affymetrix microarrays, taking into account match and mismatch data, for clustering genes and experiments. Some of our ideas are already incorporated in the PROBESETVALUES function of MATLAB's Bioinformatics Toolbox.

We have preliminary results on segmenting multi-megapixel 2D and 3D images on a range of computing systems, from a modern desktop to parallel systems ranked among the 10 most powerful in the world. We have direct access to, and run our tests on, an IBM BG/L system with several thousand processors. For example, a 24-megapixel 2D image is segmented on the IBM BG/L in a matter of seconds. The 3D segmentation can be applied in a variety of situations, such as electron microscopy, 3D MRI scans, and tracking objects in movies. High-resolution 3D image segmentation is especially computationally challenging and requires both powerful computing resources and our sophisticated software.

• Microarrays: a massively parallel experiment
• Clustering: why?
• Clustering: how?
• Spectral clustering
• Connection to image segmentation
• Eigensolvers for spectral clustering

Microarrays: a massively parallel experiment 1/5

Affymetrix GeneChip DNA microarrays. Image courtesy: Affymetrix

Microarrays: a massively parallel experiment 2/5

GeneChip: oligonucleotide sequences are photolithographed on a quartz wafer in a pattern of dots of about 10 micrometers. The oligonucleotide (oligo) probes are 25-nucleotide chains, complementary to mRNA, for selected parts of a gene. For every gene there are a number (depending on chip design) of different oligo probes called perfect matches (PM). In addition, each PM has a corresponding mismatch oligo (MM) that differs from it in the middle base.

GeneChips are manufactured to include all currently known and predicted genes of a particular organism, e.g., H. sapiens. The physical locations of the oligo probes for each gene on the chip are recorded in the *.cdf file. A sample of mRNA extracted from the cells of an organism is, after pre-processing, hybridized with the GeneChip, giving PM and MM values that characterize gene expression in the cells.

Microarrays: a massively parallel experiment 3/5

Labelled cRNA targets derived from the mRNA of an experimental sample are hybridized to the oligo probes. During hybridization, complementary nucleotides line up and bind together via hydrogen bonds, in the same way as the two strands of DNA bind together. The chip is then scanned with a laser, giving the amount of each mRNA species represented. Image courtesy: cnx.org

Microarrays: a massively parallel experiment 4/5

A pool of mRNA is extracted from the cells of an organism and converted to a biotin-labelled strand (cRNA) that binds to the oligo probes on the GeneChip during hybridization. The higher the concentration of a particular mRNA in the testing pool, the greater the hybridization level of the PM probes and thus the amount of hybridized material on the processed GeneChip. A fluorescent stain that binds to the biotin is then applied, and the GeneChip is processed through a scanner that illuminates each dot of the GeneChip with a laser, causing the dots to fluoresce.

The image data of the scanned probe array is stored in a *.dat file. The Affymetrix GCOS software processes the *.dat file and generates a *.cel file containing all numerical data of the GeneChip experiment, e.g., probe locations and PM and MM intensities. The processing involves computing a square grid that locates the dots for the probes, normalizing intensities, using internal controls, and detecting outliers. More sophisticated *.dat to *.cel algorithms, e.g., taking into account the cRNA saturation, are being developed elsewhere.

Microarrays: a massively parallel experiment 5/5

The PM and MM values are not normally used directly for high-level statistical analysis. Instead, they are first converted into gene expression values, which involves:
• Detecting unreliable data by comparing PM and MM
• Adjusting for background and noise
• Calculating the single-array gene expression intensities, basically by averaging the adjusted PM values for each probe set

Alternatively, the Comparison Analysis (Experiment versus Baseline arrays) detects and quantifies changes in gene expression between two arrays, applying normalization of the data and using the Signal Log Ratio algorithms. Either way, the absolute or comparison gene expression values are stored in a *.chp file, which serves as the input for high-level statistical analysis. Typically, multiple GeneChip tests are performed, giving multiple *.chp files with gene expression values.
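The three conversion steps listed above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the Affymetrix algorithm: the function name `probe_set_expression`, the rule "a probe pair is unreliable when MM is not below PM", and the constant `background` level are all hypothetical simplifications.

```python
import numpy as np

def probe_set_expression(pm, mm, background=0.0):
    """Toy single-array expression value for one probe set.

    pm, mm: 1-D sequences of perfect-match and mismatch intensities.
    background: assumed constant background level (hypothetical parameter).
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    # Step 1: flag a probe pair as unreliable when MM is not below PM.
    reliable = pm > mm
    if not reliable.any():
        return float("nan")
    # Step 2: crude background adjustment, clipped at zero.
    adjusted = np.clip(pm[reliable] - background, 0.0, None)
    # Step 3: average the adjusted PM values of the probe set.
    return float(adjusted.mean())

# Made-up intensities for a probe set with four probe pairs.
pm = [1200.0, 900.0, 300.0, 1500.0]
mm = [400.0, 350.0, 450.0, 500.0]
value = probe_set_expression(pm, mm, background=100.0)
```

In this example the third probe pair is discarded because its MM intensity exceeds its PM intensity, and the remaining three adjusted PM values are averaged.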

Clustering: why?

Microarray experiments typically involve multiple microarrays:
• Studying a process over time, e.g., to measure the response to a drug or food.
• Looking for differences between states, e.g., normal cells versus cancer cells.

A typical goal is finding gene networks, i.e., groups of genes whose expression changes interdependently across samples. Given a sufficiently large number of microarrays, we want to reverse-engineer the regulatory network that controls gene expression. We need computer clustering of the microarray data to select an (ideally small) number of co-expressed genes of a gene network. Separate gene-knockout experiments on the selected genes can then be performed to confirm the discovered regulatory network biologically.

Clustering: how? The overview

• There is no good, widely accepted definition of clustering.
• The traditional graph-theoretical definition is combinatorial in nature and computationally infeasible.
• Heuristics rule! Good open-source software exists, e.g., METIS and CLUTO.
• Clustering can be performed hierarchically by agglomeration (bottom-up) or by division (top-down).

Figure: agglomeration clustering example.
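As a small illustration of bottom-up (agglomerative) clustering, the following sketch uses SciPy's `linkage` and `fcluster`. The expression matrix `X` is made up, and average linkage with Euclidean distance is just one common choice assumed here; the slides do not prescribe a particular linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 6 genes (rows) x 4 samples (columns); values are made up.
X = np.array([
    [1.0, 1.1, 0.9, 1.0],
    [1.1, 1.0, 1.0, 0.9],
    [0.9, 1.0, 1.1, 1.0],
    [5.0, 5.2, 4.9, 5.1],
    [5.1, 5.0, 5.0, 4.8],
    [4.9, 5.1, 5.2, 5.0],
])

# Bottom-up merging: start from singletons and repeatedly merge the two
# closest clusters; Z records the sequence of merges (the dendrogram).
Z = linkage(X, method="average", metric="euclidean")

# Cut the merge tree so that at most 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the same tree `Z` at different levels gives coarser or finer clusterings without recomputing the merges.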

Clustering: how? Co-clustering

Two-way clustering, co-clustering, or bi-clustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously!

Clustering: how? Algorithms

• Partitioning means determining clusters.
• Partitioning can be used recursively for hierarchical division.
• Many partitioning methods are known. Here we cover spectral partitioning using Fiedler vectors = Principal Component Analysis (PCA).
• PCA/spectral partitioning is known to produce high-quality clusters, but is considered expensive because it requires the solution of large-scale eigenproblems.
• Our expertise in eigensolvers comes to the rescue!

Eigenproblems in mechanical vibrations

Free transversal vibration without damping of the mass-spring system is described by an ODE system. The standing wave assumption leads to an eigenproblem. Component x_i describes the up-down movement of mass i.
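The formulas of this slide are not part of the transcript. A standard textbook form is sketched below, assuming a mass matrix M and a stiffness matrix K (the names used on the later PCA slide); the mode shape v, the frequency \omega, and the relation \lambda = \omega^2 are notation assumed here.

```latex
M \ddot{x}(t) + K x(t) = 0,
\qquad
x(t) = v \cos(\omega t)
\;\Longrightarrow\;
K v = \lambda M v, \quad \lambda = \omega^{2}.
```

Substituting the standing wave into the ODE gives -\omega^2 M v \cos(\omega t) + K v \cos(\omega t) = 0 for all t, which is the generalized eigenproblem on the right.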

Spectral clustering in mechanics

A 4-degree-of-freedom system has 4 modes of vibration and 4 natural frequencies; it is partitioned into 2 clusters using the second eigenvector. Images courtesy: Russell, Kettering U.

The main idea comes from mechanical vibrations and is intuitive: in a spring-mass system, masses that are tightly connected tend to move together synchronously in low-frequency free vibrations. Analysing the signs of the components of the low-frequency vibration modes, one component per mass, allows us to determine the clusters of masses!
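The sign-based partitioning can be tried on a tiny example. The sketch below assumes a chain of 4 unit masses, free at both ends, with two stiff springs and one weak spring in the middle; the stiffness values are made up for illustration.

```python
import numpy as np

# Assumed toy system: 4 unit masses in a chain, free at both ends.
# Springs 0-1 and 2-3 are stiff, spring 1-2 is weak (values are made up).
k01, k12, k23 = 10.0, 0.1, 10.0
K = np.array([
    [ k01,      -k01,       0.0,  0.0],
    [-k01, k01 + k12,      -k12,  0.0],
    [ 0.0,      -k12, k12 + k23, -k23],
    [ 0.0,       0.0,      -k23,  k23],
])

# With unit masses (M = I) the vibration modes solve K v = lambda v.
eigenvalues, modes = np.linalg.eigh(K)                   # ascending eigenvalues
frequencies = np.sqrt(np.clip(eigenvalues, 0.0, None))  # natural frequencies

# The lowest mode is the rigid-body shift (frequency zero);
# the signs of the second mode separate the two tightly connected groups.
second_mode = modes[:, 1]
clusters = second_mode > 0
```

The overall sign of an eigenvector is arbitrary, so only the grouping matters: masses 0 and 1 share one sign, masses 2 and 3 the other.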

Spectral clustering for simple graphs

Simple graphs are undirected graphs with no self-loops and no more than one edge between any two different vertices.
• A = symmetric adjacency matrix
• D = diagonal degree matrix
• Laplacian matrix L = D - A

L = K describes transversal vibrations of the spring-mass system (as well as Kirchhoff's law for electrical circuits of resistors).
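The definitions of A, D, and L translate directly into code. The following is a minimal dense sketch; the helper name `graph_laplacian` and the example edge list are hypothetical.

```python
import numpy as np

def graph_laplacian(n_vertices, edges):
    """Return (A, D, L) for a simple undirected graph given as (u, v) pairs."""
    A = np.zeros((n_vertices, n_vertices))
    for u, v in edges:
        if u == v:
            raise ValueError("self-loops are not allowed in a simple graph")
        A[u, v] = 1.0   # assigning (not adding) keeps at most one edge per pair
        A[v, u] = 1.0   # symmetric, because the graph is undirected
    D = np.diag(A.sum(axis=1))   # vertex degrees on the diagonal
    return A, D, D - A

# Hypothetical example: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A, D, L = graph_laplacian(6, edges)
```

Each row of L sums to zero, so the constant vector is an eigenvector with eigenvalue zero.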

Spectral clustering for simple graphs

The Laplacian matrix L is symmetric, its rows sum to zero, and its smallest eigenvalue is zero with a constant eigenvector (free boundary). The second eigenvector, called the Fiedler vector, describes the partitioning.
• The Fiedler eigenvector gives a bi-partitioning simply by separating its positive and negative components.
• By running K-means on the Fiedler eigenvector, one can find more than 2 partitions if the vector is close to piecewise constant after reordering.
• The same idea works with more eigenvectors of Lx = λx.

Example courtesy: Blelloch, CMU
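Both uses of the Fiedler vector, the sign split and K-means, can be sketched on a small graph. The graph below (three 4-vertex cliques chained by single edges) is a made-up example, and the K-means starting centroids (minimum, median, maximum of the vector) are an assumption chosen only to make the run deterministic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical graph: three 4-vertex cliques chained by single bridging edges.
n = 12
A = np.zeros((n, n))
for start in (0, 4, 8):
    for i in range(start, start + 4):
        for j in range(i + 1, start + 4):
            A[i, j] = A[j, i] = 1.0
for u, v in [(3, 4), (7, 8)]:
    A[u, v] = A[v, u] = 1.0

L = np.diag(A.sum(axis=1)) - A
eigenvalues, eigenvectors = np.linalg.eigh(L)   # ascending eigenvalues
fiedler = eigenvectors[:, 1]                     # second-smallest eigenpair

# Two partitions: split by sign.
two_way = fiedler > 0

# Three partitions: 1-D K-means on the Fiedler vector, started from
# explicit centroids so that the result does not depend on random seeds.
init = np.array([[fiedler.min()], [np.median(fiedler)], [fiedler.max()]])
centroids, three_way = kmeans2(fiedler.reshape(-1, 1), init, minit="matrix")
```

Here the Fiedler vector is nearly constant on each clique, so K-means recovers the three cliques, while the sign split can only separate the two outer ones; the middle clique has entries close to zero and its sign is not meaningful.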

PCA clustering for simple graphs

• The Fiedler vector is an eigenvector of Lx = λx; in the spring-mass system this corresponds to the stiffness matrix K = L and the mass matrix M = I (identity).
• Should not the masses with a larger adjacency degree be heavier? Let us take the mass matrix M = D, the degree matrix.
• The so-called N-cut eigenvectors, for the smallest eigenvalues of Lx = λDx, are the eigenvectors for the largest eigenvalues of Ax = µDx with µ = 1 - λ, since L = D - A.
• PCA for D^{-1}A computes the eigenvectors for the largest eigenvalues, which can then be used for clustering by K-means.
• D^{-1}A is row-stochastic and describes the Markov random walk probabilities on the simple graph.
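The relation µ = 1 - λ follows from (D - A)x = λDx, i.e., Ax = (1 - λ)Dx, and can be checked numerically. The sketch below uses a made-up 6-vertex graph and SciPy's dense generalized symmetric solver `scipy.linalg.eigh`; it is an illustration for small graphs, not a large-scale method.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical graph: two triangles joined by one edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A

# Generalized symmetric eigenproblems with the mass matrix M = D.
lam, X = eigh(L, D)    # L x = lambda D x, lambda ascending
mu, Y = eigh(A, D)     # A y = mu D y, mu ascending

# Random-walk matrix D^{-1} A: row i holds the transition probabilities from vertex i.
P = A / degrees[:, None]

# The smallest nontrivial lambda pairs with the second-largest mu;
# the signs of its eigenvector give the N-cut style bi-partition.
ncut_labels = X[:, 1] > 0
```

The eigenvectors of Lx = λDx are also eigenvectors of P = D^{-1}A with eigenvalues 1 - λ, which is the random-walk view of the same partition.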

Connection to image segmentation

Image pixels serve as graph vertices. Weighted graph edges are computed by comparing pixel colours.

Figure: an example displaying 4 Fiedler vectors of an image.

For images, we generate a sparse Laplacian by comparing only neighboring pixels when computing the edge weights. For microarrays, genes correspond to vertices, but we have to compare all genes, possibly getting a Laplacian with a large fill-in.
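A sparse Laplacian built from neighboring pixels can be sketched as follows. The slides do not give the weight function, so the Gaussian similarity exp(-(difference / sigma)^2), the parameter `sigma`, the 4-neighbour connectivity, and the helper name `image_laplacian` are all assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def image_laplacian(image, sigma=0.5):
    """Sparse graph Laplacian of a 2-D grayscale image with 4-neighbour edges.

    The Gaussian edge weight is an assumed, commonly used similarity;
    the slides do not specify the weight function.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    index = np.arange(h * w).reshape(h, w)   # pixel (i, j) -> vertex number

    # Horizontal and vertical neighbour pairs with their intensity differences.
    pairs = [(index[:, :-1], index[:, 1:], image[:, :-1] - image[:, 1:]),
             (index[:-1, :], index[1:, :], image[:-1, :] - image[1:, :])]
    rows = np.concatenate([p[0].ravel() for p in pairs])
    cols = np.concatenate([p[1].ravel() for p in pairs])
    diffs = np.concatenate([p[2].ravel() for p in pairs])
    weights = np.exp(-(diffs / sigma) ** 2)

    A = sp.coo_matrix((weights, (rows, cols)), shape=(h * w, h * w))
    A = (A + A.T).tocsr()                            # symmetric weighted adjacency
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())  # weighted degrees
    return (D - A).tocsr()

# Toy 4 x 6 image: dark left half, bright right half.
image = np.zeros((4, 6))
image[:, 3:] = 1.0
L = image_laplacian(image)
```

Each row has at most five nonzero entries (the pixel and its four neighbours), so the storage grows linearly with the number of pixels. Comparing all pairs of genes, as in the microarray case, would instead give a dense matrix.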

Eigensolvers for spectral clustering

• Our BLOPEX-LOBPCG software has proved to be efficient for large-scale eigenproblems for Laplacians from PDEs and for image segmentation, using the multiscale preconditioning of hypre.
• LOBPCG for massively parallel computers is available in our Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) package.
• BLOPEX is built into ... and is included as an external package in PETSc.
• On 1024 CPUs of BlueGene/L, we can compute the Fiedler vector of a 24-megapixel image in seconds (including the hypre algebraic multigrid setup).
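For a small-scale feel of LOBPCG, SciPy ships a serial implementation, `scipy.sparse.linalg.lobpcg`. The sketch below is not BLOPEX and uses no hypre preconditioner; it runs unpreconditioned LOBPCG on the Laplacian of a tiny 8 x 16 grid graph, which stands in for an image, and the helper `path_laplacian` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

def path_laplacian(n):
    """Laplacian of a path graph with n vertices (free ends)."""
    main = np.full(n, 2.0)
    main[0] = main[-1] = 1.0
    return sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)], [-1, 0, 1])

# Sparse Laplacian of an h x w grid graph; vertex number = i * w + j.
h, w = 8, 16
L = (sp.kron(path_laplacian(h), sp.identity(w))
     + sp.kron(sp.identity(h), path_laplacian(w))).tocsr()

n = h * w
rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, 1))   # random initial guess, block size 1
ones = np.ones((n, 1))             # known null vector of L, used as a constraint

# The smallest eigenpair in the complement of the constant vector is the
# Fiedler pair. No preconditioner is passed in this sketch.
eigenvalues, eigenvectors = lobpcg(L, X0, Y=ones, largest=False,
                                   tol=1e-6, maxiter=500)
fiedler = eigenvectors[:, 0]
segments = (fiedler > 0).reshape(h, w)
```

On this grid the Fiedler vector varies along the longer side, so the sign split cuts the grid into a left and a right half. For large problems a preconditioner, passed through the `M` argument of `lobpcg`, is what makes the method practical.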

Work in progress

• Segmentation of 3D images at the pixel level
• Multi-level/resolution segmentation
• Mathematical foundation of clustering
• Algorithm and software development for simultaneous clustering of genes and experiments
• Accurate clustering for large datasets

We need collaborators from the medical imaging and molecular biology communities to give us the data.