
SPARCLE = SPArse ReCovery of Linear combinations of Expression. Presented by: Daniel Labenski. Seminar in Algorithmic Challenges in Analyzing Big Data in Biology and Medicine; Prof. Ron Shamir, Tel Aviv University. 1

Outline Introduction The SPARCLE representation method SPARCLE-based learning Results Discussion Conclusions 2

INTRODUCTION 3

Introduction Large-scale RNA expression measurements produce enormous amounts of data. Many methods have been developed for extracting insights into the interrelationships between genes. We do not yet know the function of all genes. Motivation: find functional associations between genes. 4

Introduction Key idea: the gene regulatory network is sparsely structured; the expression of any gene is directly regulated by only a few other genes. We therefore look for a concise representation of genes that explains the differential phenomena in gene expression. For example, genes that regulate specific pathways would correlate linearly with their targets. 5

SPARCLE The SPARCLE representation method 6

Sparse Representation Of Expression 7


Linear Algebra Reminder The expression profile of the objective gene is written as a linear combination of the expression profiles of the other genes in the genome; the coefficients are the variables to be solved for. 9
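The reminder above can be stated as a standard L1 sparse-recovery program. This is a sketch under the assumption that SPARCLE follows the usual formulation (the symbols A, b, w and the use of ε as the noise tolerance are inferred from the slides, not quoted from the paper):

```latex
\min_{w} \; \|w\|_1
\quad \text{subject to} \quad
\|A w - b\|_2 \le \varepsilon ,
```

where each column of $A$ is the expression profile of one of the other genes, $b$ is the expression profile of the objective gene, and the support of the minimizer $w$ (its non-zero entries) identifies the few genes used to reconstruct it. Minimizing the $\ell_1$ norm rather than counting non-zeros directly makes the problem convex while still promoting sparse solutions.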

SPARCLE illustration 10

Sparse Representation Of Expression 11

Sparse Representation Of Expression 12

Robustness test of expression profile representations 13

Robustness test of expression profile representations 14

Robustness test of expression profile representations 15

SPARCLE-BASED LEARNING 16

Prediction of pairwise associations between genes We want to use the results of the sparse reconstruction to reveal associated gene function, represented by Gene Ontology annotations and by a PPI map. Step #1: run SPARCLE on the dataset. Step #2: for each pair of genes, create a feature vector from the sparse representations found by SPARCLE. Step #3: use the feature vectors to train a model with AdaBoost. Step #4: predict the relationships between genes. 17

Extraction of feature vectors The following features were extracted: 1. The coefficient of each gene in the other's SPARCLE representation. 2. The number of genes in the intersection of the two genes' support sets. 3. The number of support sets containing both genes. 4. The L1 distance of each gene's expression profile from the convex hull of the other genes' vectors. 5. The Euclidean distance of each gene's expression profile from the subspace spanned by the other's support set. 18

Extraction of feature vectors 6. The support size of each gene. 7. The number of appearances of each gene in the supports of all other genes (entire genome). 8. The average and standard deviation of features 1-7 over 20 perturbation runs of SPARCLE on the same data (25% of the genes were randomly removed in each run). 9. Pearson's correlation between the two genes' expression profiles (normalized and unnormalized). 10. The mean, median and SD of Pearson's correlation of each gene with the other genes in the genome. 19
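To make features 1-3 above concrete, here is a minimal sketch of pairwise feature extraction. The representation format (a dict mapping each gene to its sparse coefficients) and the toy data are illustrative assumptions, not the authors' code.

```python
# Sketch of pairwise feature extraction from SPARCLE outputs (features 1-3).
# Assumes each gene's sparse representation is a dict {gene: coefficient};
# the toy data below is illustrative, not from the paper.

def pair_features(g1, g2, reps):
    """Build a partial feature vector for a gene pair.

    reps maps every gene in the genome to its sparse representation.
    """
    r1, r2 = reps[g1], reps[g2]
    s1, s2 = set(r1), set(r2)
    return {
        # 1. coefficient of each gene in the other's representation
        "coef_1_in_2": r2.get(g1, 0.0),
        "coef_2_in_1": r1.get(g2, 0.0),
        # 2. size of the intersection of the two support sets
        "support_overlap": len(s1 & s2),
        # 3. number of support sets (genome-wide) containing both genes
        "co_support_count": sum(
            1 for g, r in reps.items()
            if g not in (g1, g2) and g1 in r and g2 in r
        ),
    }

reps = {
    "A": {"B": 0.7, "C": -0.2},
    "B": {"C": 0.5, "D": 0.1},
    "C": {"A": 0.3, "B": 0.4},
    "D": {"A": 0.9, "B": -0.5},
}
f = pair_features("A", "B", reps)
```

In the toy data, gene B appears in A's representation with coefficient 0.7, their supports share one gene (C), and both genes appear together in the supports of C and D.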

GO Annotations A set of biological phrases (terms) that are applied to genes. GO is structured as 3 directed acyclic graphs, whose edges denote is_a and part_of relations. The deeper a term, the higher its resolution (the more specific the annotation). 20

GO Annotations In order to compare the genes' GO annotations, we need to choose the resolution of interest. A lower-resolution depth and a higher-resolution depth were selected. 21
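One way to pick a "resolution of interest" is to assign each term its depth in the DAG and compare annotations at a chosen depth. The sketch below computes term depths by BFS from the root; the toy hierarchy and term names are illustrative, not real GO identifiers, and the shortest-path notion of depth is one of several possible conventions.

```python
# Minimal sketch: map GO-like terms to their depth in a DAG so annotations
# can be compared at a chosen resolution. Toy graph; not real GO data.
from collections import deque

def term_depths(children, root):
    """BFS from the root; a term's depth is its shortest distance from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in depth:  # keep the shallowest depth in the DAG
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth

# Toy is_a hierarchy (parent -> children); "lipid_metabolism" has two parents,
# which is what makes GO a DAG rather than a tree.
children = {
    "biological_process": ["metabolic_process", "signaling"],
    "metabolic_process": ["lipid_metabolism"],
    "signaling": ["lipid_metabolism"],
}
depths = term_depths(children, "biological_process")
```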

AdaBoost (adaptive boosting) A supervised machine-learning algorithm; a general method for generating a strong classifier out of a set of weak classifiers. It is iterative: during each round of training, a new weak learner is added to the ensemble, and a weighting vector is adjusted to focus on the examples that were misclassified in previous rounds. 22
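The round structure described above can be sketched with the classic AdaBoost update over one-feature threshold stumps. This is an educational toy, not the implementation used in the paper (which boosts over the SPARCLE feature vectors):

```python
# Minimal AdaBoost sketch with threshold-stump weak learners.
import math

def stump_predict(stump, x):
    feat, thresh, sign = stump
    return sign if x[feat] > thresh else -sign

def train_adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # example weights, updated each round
    ensemble = []
    for _ in range(rounds):
        best = None
        # pick the stump minimizing the weighted error
        for feat in range(len(X[0])):
            for thresh in sorted({x[feat] for x in X}):
                for sign in (1, -1):
                    err = sum(w[i] for i in range(n)
                              if stump_predict((feat, thresh, sign), X[i]) != y[i])
                    if best is None or err < best[0]:
                        best = (err, (feat, thresh, sign))
        err, stump = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # re-weight: increase weight on misclassified examples, then normalize
        w = [w[i] * math.exp(-alpha * y[i] * stump_predict(stump, X[i]))
             for i in range(n)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data: label is positive iff the feature exceeds 2
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, -1, 1, 1]
model = train_adaboost(X, y, rounds=5)
```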

SPARCLE-based learning algorithm overview 23

Datasets The methods were applied to two datasets: Saccharomyces cerevisiae and Plasmodium falciparum. The gene expression measurements were extracted from the GEO database. 24

Datasets Saccharomyces cerevisiae: a species of yeast; 170 experiments; 6254 genes; knowledge-rich. Plasmodium falciparum: a parasite (causes malaria); 208 experiments; 4365 genes; poorly annotated. 25

RESULTS 26


(A) Support sizes of the solutions to the SPARCLE optimization problem (the number of genes used to reconstruct each particular gene), for all 6254 yeast genes analyzed. Sparse reconstruction of yeast gene expression profiles by SPARCLE. 28

Understanding the noise parameter ε. Cross-validation test for SPARCLE robustness. Support sizes of the solutions to the SPARCLE optimization problem (number of non-zero entries in the solution) of the yeast measurements, for (a) ε=0.25, (d) ε=0.5, and (g) ε=0.75. The cross-validation (CV) scores for each reconstructed support for an expression profile are plotted against the score obtained for a random support of the same size for (b) ε=0.25, (e) ε=0.5, and (h) ε=0.75. The cross-validation scores for each reconstructed support for an expression profile are plotted against the score obtained by a restricted SPARCLE run over 85 random profiles for (c) ε=0.25, (f) ε=0.5, and (i) ε=0.75; see Methods for details. Inset: histogram of p-values for the SPARCLE CV scores. 29

(C) Genes in the support of MEP1. The objective gene (MEP1) is indicated by a red square. Note that the majority of the genes are part of a PPI network. Sparse reconstruction of yeast gene expression profiles by SPARCLE. 30

(D) A sample of four objective genes (marked by red squares) whose supports show poor connectivity and a fragmented PPI network. PPI connectivity is retrieved from the BioGRID repository. Graphics are based on Pathway Palette (Askenazi et al., 2010). Sparse reconstruction of yeast gene expression profiles by SPARCLE. 31

Prediction of PPI and GO annotations (B) Prediction of PPI, as represented by the STRING database, by supervised learning from SPARCLE results (SPARCLE+AB). Accuracy is traded off with coverage by applying certainty thresholds on the classifier output. The other methods for predicting genes' interrelationships are as follows: Pearson's correlation of the expression profiles (Correlations), and a transitive-correlations method (SPath, see Section 2). (C) Prediction of associations for the GO Slim annotations, covering the CC ontology. 32
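The accuracy/coverage trade-off mentioned above is simple to sketch: predict only pairs whose classifier score clears a certainty threshold, so a higher threshold yields fewer but more reliable predictions. The scores and labels below are toy values, not results from the paper.

```python
# Sketch of trading accuracy against coverage by thresholding classifier scores.

def accuracy_coverage(scores, labels, threshold):
    """Predict only where |score| >= threshold; the sign of the score is the prediction."""
    decided = [(s, y) for s, y in zip(scores, labels) if abs(s) >= threshold]
    if not decided:
        return None, 0.0
    correct = sum(1 for s, y in decided if (s > 0) == (y > 0))
    return correct / len(decided), len(decided) / len(scores)

scores = [2.1, -1.8, 0.3, -0.2, 1.1, -0.9]  # toy ensemble scores
labels = [1, -1, -1, 1, 1, -1]              # toy ground truth
acc_low, cov_low = accuracy_coverage(scores, labels, 0.0)    # predict everything
acc_high, cov_high = accuracy_coverage(scores, labels, 1.0)  # only confident pairs
```

With the toy values, the unthresholded classifier covers all pairs but misclassifies the low-confidence ones, while the thresholded classifier covers half the pairs with perfect accuracy.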

Prediction of genes' associations according to GO A comparison of SPARCLE-based AdaBoost learning (SPARCLE+AB), correlation-based AdaBoost learning (Correlations+AB), correlation-based shortest path (SPath) (Zhou et al., 2002) and pairwise correlations on the raw data (Correlations) for S. cerevisiae (A–C) and P. falciparum (D–F) transcriptomes. The ontology branches CC (A, D), BP (B, E) and MF (C, F) were examined. 33

Discussion The method was not compared with clustering methods. The biological interpretation of the support sets was mostly indirect. The correlation measure performed very poorly on the malaria data and somewhat better on the yeast data. 34

Conclusion We introduced a natural, yet unexplored, approach to the problem of gene expression analysis. Geometrically, we uncovered linear subspaces that are overpopulated with expression profiles in the multidimensional space of the experiment set. The sparse representation is general and proved to be robust. The high performance of SPARCLE-based AdaBoost learning should be considered evidence of the principal information embedded in the geometric properties of the data. 35

QUESTIONS? 36

Bibliography Y. Prat, M. Fromer, N. Linial and M. Linial, Recovering key biological constituents through sparse representation of gene expression. Bioinformatics 27, 655 (2011). B. Berger, J. Peng and M. Singh, Computational solutions for omics data. Nature Reviews Genetics 14, 343 (2013). 37

Thank you! 38