Gene-Markers Representation for Microarray Data Integration Boston, 14-17 October 2007 Elena Baralis, Elisa Ficarra, Alessandro Fiori, Enrico Macii Department.

Presentation transcript:

Gene-Markers Representation for Microarray Data Integration Boston, October 2007 Elena Baralis, Elisa Ficarra, Alessandro Fiori, Enrico Macii Department of Control and Computer Engineering Politecnico di Torino (Italy)

2 Introduction
Goals
- Integrate heterogeneous datasets
- Build a system independent of a priori knowledge
- New representation of data and of synergies among genes
Open problems of integration
- Scaling issues
- Error bias
- Experimental conditions
- Different technologies or protocols

3 Framework purpose
Representation of synergies between genes
Gene-marker selection
- Common to all the datasets
- Basis of the new space representation
Gene-marker characteristics
- Common to all the datasets
- "Highly" representative of each dataset
- No outliers
- Independence

4 Innovation
Independence of a priori knowledge
- Biological information
- Data distribution
Fully automated
Applicable to problems
- With no knowledge
- With few weak hypotheses
Reference: Kang et al., "Integrating heterogeneous microarray data sources using correlation signatures," Data Integration in the Life Sciences, vol. 3615/2005, pp. 105–120, 2006

5 Framework
[Diagram: integration framework. Labels: Integration; Microarray repository; Microarray datasets; Dataset selection; Filtering; Feature selection and ranking; Gene-marker selection; Gene representation]

6 Filtering
- Remove flat genes
- Variance of a gene
- Filter
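The slide's formulas for the variance and the filter are not preserved in the transcript. As a minimal sketch of the idea, assuming a simple fixed variance cut-off (the function name, data layout, and `min_variance` value are hypothetical, not taken from the presentation), flat genes could be removed like this:

```python
from statistics import pvariance


def filter_flat_genes(expression, min_variance):
    """Keep genes whose variance across samples exceeds min_variance.

    expression: dict mapping gene name -> list of expression values,
                one value per sample (assumed layout).
    min_variance: hypothetical cut-off; the actual filter rule used in
                  the presentation is not given in the transcript.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if pvariance(values) > min_variance
    }


expression = {
    "geneA": [1.0, 1.0, 1.0, 1.0],  # perfectly flat
    "geneB": [0.5, 2.0, 3.5, 1.0],  # clearly varying
    "geneC": [2.0, 2.1, 2.0, 1.9],  # nearly flat
}
kept = filter_flat_genes(expression, min_variance=0.1)
print(sorted(kept))
```

In practice the cut-off would probably be derived from the data (for example a percentile of the per-gene variances) rather than fixed by hand.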

7 Feature selection
Eliminate less relevant features in the K-gene set
Different techniques
- Supervised
- Unsupervised
ANOVA in version 1.0 (Jeffery 2006)
- Rank based on F-value
- Binary and multi-class scenarios
Reference: Jeffery et al., "Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data", BMC Bioinformatics, vol. 7, no. 1, p. 359, July 2006
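Ranking genes by ANOVA F-value can be illustrated with a small pure-Python sketch. It computes the standard one-way ANOVA F statistic per gene and sorts genes by it; the function names and the data layout are illustrative assumptions, not the authors' implementation.

```python
def f_value(values, labels):
    """One-way ANOVA F statistic of one gene across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing F-value."""
    scores = {gene: f_value(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


labels = ["a", "a", "b", "b"]
expression = {
    "geneX": [1.0, 2.0, 5.0, 6.0],  # separates the two classes well
    "geneY": [1.0, 5.0, 2.0, 6.0],  # barely separates them
}
ranking = rank_genes(expression, labels)
print(ranking)
```

Because the F statistic is defined for any number of classes, the same code covers both the binary and the multi-class scenarios mentioned on the slide.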

8 Gene-marker selection
Merge ranks
Extraction of gene-markers
- The gene with the highest score is removed from the global rank and inserted into the gene-marker set
- Genes whose average quadratic correlation with the selected gene-markers exceeds a threshold (e.g. 20%) are pruned
- The procedure is repeated until L gene-markers are selected
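The extraction loop above can be sketched as a greedy procedure. This is an illustration under assumptions: the global rank is taken as an already merged list (the slide does not say how ranks are merged), "quadratic correlation" is read as squared Pearson correlation, and all names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_gene_markers(global_rank, expression, n_markers, threshold=0.2):
    """Greedy selection of up to n_markers weakly correlated gene-markers.

    global_rank: gene names, best first (assumed to be the merged rank).
    threshold:   maximum allowed average squared correlation with the
                 markers chosen so far (0.2 mirrors the slide's 20% example).
    """
    candidates = list(global_rank)
    markers = []
    while candidates and len(markers) < n_markers:
        # Move the best remaining gene into the marker set.
        markers.append(candidates.pop(0))
        # Prune genes that are too correlated with the selected markers.
        candidates = [
            g for g in candidates
            if mean([pearson(expression[g], expression[m]) ** 2
                     for m in markers]) <= threshold
        ]
    return markers


expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with g1
    "g3": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with g1
    "g4": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with g1
}
markers = select_gene_markers(["g1", "g2", "g3", "g4"], expression, n_markers=2)
print(markers)
```

Squaring the correlation means strongly anti-correlated genes are pruned as well, which matches the stated goal of independent gene-markers.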

9 Space transformation
New representation
- Matrix G, of dimensions N_tot x L
- Elements g_ij measure distance
  - Cosine correlation
  - Pearson correlation
  - Euclidean
  - Manhattan
[Figure: gene g_i shown relative to gene-markers m1, m2, m3]
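A minimal sketch of the transformation, assuming that row i of G holds the chosen measure between gene i and each of the L gene-markers within a single dataset (the function names and data layout are illustrative):

```python
import math


def cosine(x, y):
    """Cosine of the angle between two expression profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def transform(expression, markers, measure):
    """Map every gene to its vector of measures against the gene-markers.

    Returns a dict gene -> list of L values, i.e. one row of G per gene.
    """
    return {
        gene: [measure(values, expression[m]) for m in markers]
        for gene, values in expression.items()
    }


expression = {
    "g1": [1.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [1.0, 1.0],
}
G = transform(expression, ["g1", "g2"], manhattan)
print(G["g3"])
```

A Pearson-based measure could be passed as `measure` in the same way. Since every gene is described only by its relation to the shared gene-markers, rows computed in different datasets live in the same L-dimensional space.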

10 Experimental design
- Entropy evaluation
- Evaluation of noise reduction
- Stability of the model
- Conservation property with respect to biological information
Table columns: Datasets, Patients, Genes, Classes. Rows: DLBCL, Leukemia, Brain Tumors.

11 Entropy evaluation
Description of data distribution
- A high value implies a uniform distribution
- Distance-based entropy (Manoranjan 2002)
Tests
- Raw vs. transformed data
- Impact of the filtering phase
Reference: Manoranjan et al., "Feature selection for clustering - a filter solution", IEEE International Conference on Data Mining (ICDM), 2002
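The slide does not reproduce the entropy formula. The sketch below uses one similarity-based form often associated with this line of work, which is an assumption and may differ in detail from the cited paper: pairwise distances D_ij are turned into similarities S_ij = exp(-alpha * D_ij), with alpha chosen so that the mean distance maps to 0.5, and the entropy sums -(S log S + (1 - S) log(1 - S)) over all pairs.

```python
import math
from itertools import combinations


def distance_entropy(points):
    """Distance-based entropy of a list of points (lists of floats).

    Assumed form: S = exp(-alpha * D), alpha = ln(2) / mean(D),
    entropy = -sum over pairs of S*ln(S) + (1-S)*ln(1-S).
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    mean_dist = sum(dists) / len(dists) if dists else 0.0
    if mean_dist == 0:
        return 0.0  # all points coincide: no uncertainty
    alpha = math.log(2) / mean_dist
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        if 0.0 < s < 1.0:  # terms with s == 1 contribute zero
            total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total


clustered = [[0.0], [0.0], [1.0], [1.0]]    # two tight clusters
uniform = [[0.0], [1 / 3], [2 / 3], [1.0]]  # evenly spread points
print(distance_entropy(clustered), distance_entropy(uniform))
```

With this form, evenly spread points score higher than clustered ones, which is consistent with the slide's statement that a high value implies a uniform distribution.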

12 Entropy on transformation
Table columns: Datasets; Cosine correlation (Raw, Transformed); Pearson correlation (Raw, Transformed). Rows: DLBCL, Leukemia, Brain Tumors.

13 Impact of filtering phase
Table columns: Datasets; Raw data; Data transformed without filter; Data transformed with filter. Rows: DLBCL, Leukemia, Brain Tumors.

14 Subset genes
Reference    Description
TI           Triosephosphate Isomerase
HMG I        High mobility group protein gene exons 1-8
MIF          Macrophage migration inhibitory factor gene
PDE4B        Phosphodiesterase 4B, cAMP-specific (dunce (Drosophila)-homolog phosphodiesterase E4)
LDHA         Lactate dehydrogenase A
PRKCB1       (clones lambda-hPKC-beta [15, 802]) protein kinase C-beta-1
MINOR_1      Mitogen induced nuclear orphan receptor (MINOR_1) mRNA
PDE4A        Phosphodiesterase 4A, cAMP-specific (dunce (Drosophila)-homolog phosphodiesterase E2)
ENO1         ENO1 Enolase 1 (alpha)
MINOR_2      Mitogen induced nuclear orphan receptor (MINOR_2) mRNA
PKM2         Pyruvate kinase, muscle
amin4carb    5-aminoimidazole-4-carboxamide-1-beta-D-ribonucleotide transformylase/inosinicase
SLC
HSPD1        Heat shock 60 kD protein 1
PGAM1        Phosphoglycerate mutase 1 (brain)

15 Stability of the model

16 Conclusion
New method
- Based on dataset characteristics
- Automatic selection of gene-markers based on microarray data
- Independent of a priori or previous knowledge
- Definition of a new space representation
Results
- Reduction of entropy
- Conservation of biological information content
- Improved knowledge of the biological links between genes
Future work
- Implementation of unsupervised and supervised feature selection methods
- Integration of different kinds of information (ontologies)

17 Thank you for your attention!