Fuzzy K means.

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1.
BioInformatics (3).
Basic Gene Expression Data Analysis--Clustering
Outlines Background & motivation Algorithms overview
Gene Regulation and Microarrays. Overview A. Gene Expression and Regulation B. Measuring Gene Expression: Microarrays C. Finding Regulatory Motifs.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
Figure 1: (A) A microarray may contain thousands of ‘spots’. Each spot contains many copies of the same DNA sequence that uniquely represents a gene from.
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Computational Biology, Part 12 Expression array cluster analysis Robert F. Murphy, Shann-Ching Chen Copyright  All rights reserved.
Clustering (Gene Expression Data) 6.095/ Computational Biology: Genomes, Networks, Evolution LectureOctober 4, 2005.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Introduction to Bioinformatics - Tutorial no. 12
 Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Inferring the nature of the gene network connectivity Dynamic modeling of gene expression data Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff,
What is Cluster Analysis?
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Bryan Heck Tong Ihn Lee et al Transcriptional Regulatory Networks in Saccharomyces cerevisiae.
Cluster Analysis Hierarchical and k-means. Expression data Expression data are typically analyzed in matrix form with each row representing a gene and.
Ulf Schmitz, Pattern recognition - Clustering1 Bioinformatics Pattern recognition - Clustering Ulf Schmitz
Assigning Numbers to the Arrows Parameterizing a Gene Regulation Network by using Accurate Expression Kinetics.
Comparative Expression Moran Yassour +=. Goal Build a multi-species gene-coexpression network Find functions of unknown genes Discover how the genes.
Analysis of microarray data
Computational Molecular Biology Biochem 218 – BioMedical Informatics Gene Regulatory.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Health and CS Philip Chan. DNA, Genes, Proteins What is the relationship among DNA Genes Proteins ?
Chapter 11 Table of Contents Section 1 Control of Gene Expression
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Eukaryotic Gene Expression The “More Complex” Genome.
More on Microarrays Chitta Baral Arizona State University.
Finish up array applications Move on to proteomics Protein microarrays.
Microarrays.
Gene regulation results in differential gene expression, leading to cell specialization.
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
Predicting protein degradation rates Karen Page. The central dogma DNA RNA protein Transcription Translation The expression of genetic information stored.
Gene expression. The information encoded in a gene is converted into a protein  The genetic information is made available to the cell Phases of gene.
Quantitative analysis of 2D gels Generalities. Applications Mutant / wild type Physiological conditions Tissue specific expression Disease / normal state.
An Overview of Clustering Methods Michael D. Kane, Ph.D.
Course Work Project Project title “Data Analysis Methods for Microarray Based Gene Expression Analysis” Sushil Kumar Singh (batch ) IBAB, Bangalore.
Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 Dept. Computer Science and Information Engineering.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Fuzzy C-Means Clustering
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Microarrays and Other High-Throughput Methods BMI/CS 576 Colin Dewey Fall 2010.
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
PLANT BIOTECHNOLOGY & GENETIC ENGINEERING (3 CREDIT HOURS) LECTURE 13 ANALYSIS OF THE TRANSCRIPTOME.
1 baySeq homework HS analysis: Out of 7388 genes with data, 1995 genes were DE at FDR
Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Microarray: An Introduction
Computational Biology
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
Hierarchical clustering approaches for high-throughput data
Chapter 12.5 Gene Regulation.
Cell Signaling.
EXTENDING GENE ANNOTATION WITH GENE EXPRESSION
Unit III Information Essential to Life Processes
Dimension reduction : PCA and Clustering
Predicting Gene Expression from Sequence
Presentation transcript:

Fuzzy K means

Fuzzy K means A gene can be assigned to several clusters Each gene is assigned to a cluster with a membership value between 0 and 1 The membership values of a gene add up to one Genes with lower membership values are not well represented by the cluster centroid Expression of genes with high membership values are close to cluster centroid

Centroid During the centroid refinement in each clustering cycle, new centroids were calculated on the basis of the weighted mean of all the gene –expression patterns in the data set according to

Membership Function Each gene’s membership m (a continuous variable from 0 to 1) is defined as:

Fuzzy K means The gene weight is (only on the seconed and the third round) empirically defined as: Where is the Pearson Correlation between Xi and Xn and is the correlation cutoff

Fuzzy K means In each clustering cycle , the centroids were iteratively refined until the average change was <0.001. Around 85 % of the centroids , stabilized within approximately 15 iterations , some of centroids required more : about 40 -60 iterations before stabilizing.

Fuzzy K means After each clustering cycle , each centroid was compared to all other centroids in the set , and centroid pairs correlated >0.9 were replaced by their average .

Visualization Tools

Cells respond to environment Various external messages Heat Biology Background Cells respond to the environment. This is clear because the same copy of the genome is located in every cell in an organism, but genes are expressed at different levels in different cells (otherwise, we’d have uniform behavior of cells, which is clearly not the case). Cells respond to different factors, including: environmental conditions, heat, food supply, and various external messages. Cells differentiate during development. Gene expression has a lot to do with what proteins are required at that particular point in the cells development or location. Responds to environmental conditions Food Supply

Genome is fixed – Cells are dynamic A genome is static Every cell in our body has a copy of same genome A cell is dynamic Responds to external conditions Saccharomyces cerevisiae cells follow a cell cycle of division and also budding. Cells differentiate during development

Gene regulation Gene regulation is responsible for dynamic cell Gene expression varies according to: Cell type External conditions

Transcription Factors Binding to DNA Transcription regulation: Certain transcription factors bind DNA Binding recognizes DNA substrings: Regulatory motifs

Regulation of Genes Transcription Factor (Protein) RNA polymerase DNA Gene Regulatory Element

Regulation of Genes Transcription Factor (Protein) RNA polymerase DNA Regulatory Element Gene

Regulation of Genes New protein RNA polymerase Transcription Factor DNA Regulatory Element Gene

The Challenges of Gene Expression Data Many genes have expression data patterns that are similar to multiple, distinct gene groups.

Results of Clustering Gene Expression CLUSTER is simple and easy to use De facto standard for microarray analysis Limitations: Hierarchical and other method clustering in general is not robust Genes may belong to more than one cluster

Gene can be co expressed with different gene groups in response to different conditions.

Saccharomyces cerevisiae The yeast Saccharomyces cerevisiae possesses sophisticated mechanisms to choreograph the expression of its 6200 genes in order to thrive or at list to survive in a wide range of environmental conditions.

The gene expression of 40 Yap1p targets, these genes were coordinately induced in responds to subset of conditions shown here ( labeled in red)

What is a microarray

What is a microarray (2) A 2D array of DNA sequences from thousands of genes Each spot has many copies of same gene Allow mRNAs from a sample to hybridize Measure number of hybridizations per spot

Goal of Microarray Experiments Measure level of gene expression across many different conditions: Expression Matrix M: {genes}{conditions}: Mij = |genei| in conditionj Deduce gene function Genes with similar function are expressed under similar conditions

Fuzzy K-Means clustering Each gene can belong to many clusters Soft (fuzzy) assignment of genes to clusters Each gene has 1.0 membership units, allocated amongst clusters based on correlation with means Cluster means are calculated by taking the weighted average of all the genes in the cluster A gene can be assigned to several clusters Each gene is assigned to a cluster with a membership value between 0 and 1 The membership values of a gene add up to one Genes with lower membership values are not well represented by the cluster centroid Expression of genes with high membership values are close to cluster centroid

Fuzzy K-Means clustering Algorithm: Use PCA to initialize cluster means 3 iterations of fuzzy k-means clustering, find k/3 clusters per iteration In each iteration, start with brand new clusters and initializations And a few more heuristic tricks

Initialization Use PCA to find a few eigenvectors for initialization These features capture the directions of maximum variance Must be orthonormal

Example Initialization k/3 centroids defined from k/3 first eigenvectors

Example First iteration of clustering

Iteration of the approach Remove genes that have a Pearson Correlation with a particular cluster greater than 0.7 Intuition: These strong signal from these genes has been accounted for Repeat

Removing Duplicate Centroids Centroids with Pearson correlation > 0.9 will be averaged. Allows selecting a large initial number of clusters, since duplicates will be removed

Repeat 3 times Output Cluster means Gene assignments to clusters

Regulatory systems that govern the expression of overlapping sets of genes in yeast.

Fuzzy K means ADVANTAGES The method can present overlapping clusters , revealing distinct features of each gene’s function and regulation. The resulting implication can be used to assign refined hypothetical functions to uncharacterized gene products and additional cellular roles of well none studied proteins .

Fuzzy K means ADVANTAGES It present more comprehensive groups of conditionally co regulate genes. It elucidate the environmental conditions that trigger changes in gene expression. It requires no a priori information about the dataset.

Fuzzy K means DISADVANTAGES Assignment of genes to the cluster requires a user – defined cutoff and selecting meaningful cutoff is a challenge. Fuzzy K means failed to identify a small number of groups that were identified by hierarchical clustering.

My opinion The unique advantages of fuzzy K means clustering make the technique a valuable tool for gene expression analysis , it’s flexibility can be used to reveal more complex correlations between gene expression patterns, promoting refined hypotheses of the role and regulation of gene expression changes.

In order to get over the limitations… combining hierarchical clustering with fuzzy K means can be useful..

Thank you !