Tutorial on "GRID Computing“ EMBnet Conference 2008 CNR - ITB GRID distribution supporting chaotic map clustering on large mixed microarray.

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

Clustering k-mean clustering Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
BioInformatics (3).
M. Kathleen Kerr “Design Considerations for Efficient and Effective Microarray Studies” Biometrics 59, ; December 2003 Biostatistics Article Oncology.
Cluster analysis for microarray data Anja von Heydebreck.
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very.
BASIC METHODOLOGIES OF ANALYSIS: SUPERVISED ANALYSIS: HYPOTHESIS TESTING USING CLINICAL INFORMATION (MLL VS NO TRANS.) IDENTIFY DIFFERENTIATING GENES Basic.
Automatic Identification of ROIs (Regions of interest) in fMRI data.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Mutual Information Mathematical Biology Seminar
Model and Variable Selections for Personalized Medicine Lu Tian (Northwestern University) Hajime Uno (Kitasato University) Tianxi Cai, Els Goetghebeur,
1 Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge Brief introduction to lectures.
Genomic Signal Processing: Ensemble Dependence Model for Classification and Prediction of Cancer Based on Gene Expression Data Joseph DePasquale Engineering.
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Modularity in Biological networks.  Hypothesis: Biological function are carried by discrete functional modules.  Hartwell, L.-H., Hopfield, J. J., Leibler,
Topologically biased random walks with application for community finding Vinko Zlatić Dep. Of Physics, “Sapienza”, Roma, Italia Theoretical Physics Division,
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Introduction to Bioinformatics - Tutorial no. 12
Fuzzy K means.
JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Clustering Unsupervised learning Generating “classes”
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
Expression profiling of peripheral blood cells for early detection of breast cancer Introduction Early detection of breast cancer is a key to successful.
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
2015 AprilUNIVERSITY OF HAIFA, DEPARTMENT OF STATISTICS, SEMINAR FOR M.A 1 Hastie, Tibshirani and Friedman.The Elements of Statistical Learning (2nd edition,
Genetic network inference: from co-expression clustering to reverse engineering Patrik D’haeseleer,Shoudan Liang and Roland Somogyi.
Test Of Distributed Data Quality Monitoring Of CMS Tracker Dataset H->ZZ->2e2mu with PileUp - 10,000 events ( ~ 50,000 hits for events) The monitoring.
Exploiting Clustering Techniques for Web Session Inference A.Bianco, G. Mardente, M. Mellia, M.Munafò, L. Muscariello (Politecnico di Torino)
More on Microarrays Chitta Baral Arizona State University.
Microarrays to Functional Genomics: Generation of Transcriptional Networks from Microarray experiments Joshua Stender December 3, 2002 Department of Biochemistry.
BIOINFOGRID: Bioinformatics Grid Application for Life Science Giorgio Maggi INFN and Politecnico di Bari
Finish up array applications Move on to proteomics Protein microarrays.
A Survey of Distributed Task Schedulers Kei Takahashi (M1)
Using Bayesian Networks to Analyze Whole-Genome Expression Data Nir Friedman Iftach Nachman Dana Pe’er Institute of Computer Science, The Hebrew University.
INFSO-RI Enabling Grids for E-sciencE BioDCV: a grid-enabled complete validation setup for functional profiling EGEE User Forum.
Gene expression analysis
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Model-based evaluation of clustering validation measures.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
341- INTRODUCTION TO BIOINFORMATICS Overview of the Course Material 1.
Cluster validation Integration ICES Bioinformatics.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Biclustering of Expression Data by Yizong Cheng and Geoge M. Church Presented by Bojun Yan March 25, 2004.
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CSTgrid: a whole genome comparison tool for.
Learning disjunctions in Geronimo’s regression trees Felix Sanchez Garcia supervised by Prof. Dana Pe’er.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
Community structure in graphs Santo Fortunato. More links “inside” than “outside” Graphs are “sparse” “Communities”
1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer.
Non-parametric Methods for Clustering Continuous and Categorical Data Steven X. Wang Dept. of Math. and Stat. York University May 13, 2010.
Information Bottleneck Method & Double Clustering + α Summarized by Byoung Hee, Kim.
An unsupervised conditional random fields approach for clustering gene expression time series Chang-Tsun Li, Yinyin Yuan and Roland Wilson Bioinformatics,
Enabling Grids for E-sciencE EGEE and gLite are registered trademarks GRID distribution supporting chaotic map clustering on large mixed.
Solving Weakened Cryptanalysis Problems for the Bivium Keystream Generator in the Volunteer Computing Project Oleg Zaikin, Alexander Semenov,
Clustering Machine Learning Unsupervised Learning K-means Optimization objective Random initialization Determining Number of Clusters Hierarchical Clustering.
High-throughput Biological Data The data deluge
Dimension reduction : PCA and Clustering
A, unsupervised hierarchical clustering of the expression of probe sets differentially expressed in the oral mucosa of smokers versus never smokers. A,
Presentation transcript:

Tutorial on "GRID Computing“ EMBnet Conference 2008 CNR - ITB GRID distribution supporting chaotic map clustering on large mixed microarray data sets Angelica Tulipano ITB - CNR, Section of Bari

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Microarray data Microarray data contain the expression values of thousands of genes This allows to detect the collective cell behavior Genes with similar temporal and spatial gene expression patterns are believed to be governed by a common regulatory logic Hundreds of experiments with the same array design are available

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008  comparing expression levels over a wide range of experiments can reveal new and valuable information about behaviours of genes. Microarray data

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Microarray data We selected 587 data sets covering 24 biological experiments from a collection of experiments of the Affimetrix microarray ‘Human Genome U133 Array Set HG-U133A’ related to genes. Data have been organized in an expression matrix D = n g x n s, where n g (=22215) is the number of genes and n s (=587) is the number of samples (experiments).

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Clustering To analyse the data, having no a priori knowledge on their structure, e.g. the number of classes or the geometric distribution, we have chosen an unsupervised hierarchical clustering algorithm, the Chaotic Map Clustering (CMC) [1] [1] L. Angelini, F. De Carlo, C. Marangi, M. Pellicoro and S. Stramaglia, Clustering Data by Inhomogeneous Chaotic Map Lattices, Phys. Rev. Lett., Vol. 85, No. 3, pp (2000).

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Chaotic Map Clustering CMC does not utilize the distances directly for data partitioning, but it generates ‘chaotic’ trajectories by assigning a dynamical variable (i.e. a chaotic map) to each data point. Pairs of maps are then coupled by means of a decreasing function of the distance of the corresponding data points. The mutual information between pairs of maps, in the stationary regime, is then used as the similarity index for clustering the data set. By setting different threshold values for the mutual information, a hierarchical partition of the data can be obtained as a tree of clusters.

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Control of the clustering partition To choose the optimal partition of the data between all the hierarchical levels found by the CMC we used the modularity [2], a measure of the quality of the division in clusters, looking for its peak in the tree of clusters. [4] Khan J, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med., Vol. 7, No. 6, pp (2001).. e ij connections between elements of cluster i and cluster j a i =  j e ij all connections to elements of cluster i for random connection: e ij = a i a j MODULARITY: Q =  i (e ii - a i 2 )

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Efficiency of CMC Comparison between CMC and Deterministic Annealing (DA). Maximal efficiency of CMC  87% DA  73%

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Cluster validation Cluster validation method based on resampling [5] : a cross-validation procedure where subsets of the data under investigation are constructed randomly, and the clustering algorithm is applied to each subset Connectivity matrix for each resampled matrix Tij = Those were compared to the connectivity matrix of the original matrix Starting from the overlap [6] of the original and resampling connectivity matrices we can define “sensitivity”, “positive predictive value” and “specifity”, useful quality measures of a clustering result. [5] Levine E. and Domany E., Resampling method for unsupervised estimation of cluster validity, Neural Comp., Vol. 13, pp , (2001). [6] Van deer Laan, M.J., Bryan, J., Gene espression analysis with the parametric bootstrap, Biostatistics, Vol. 2, No. 4: pp , (2001).

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Problems of resampling To validate the partition we have obtained by applying CMC to the original matrix 22215x587, we generated 50 randomly resampled matrices 16661x587 PROBLEM - clustering a single matrix of such a size takes about 2 hours; - clustering the whole set of resampled matrices would totally occupy one single CPU for about 4 days.

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Grid distribution A job submission tool [7] was used for the submission of a large number of jobs [7] Scheme of the Grid process UI DB RB SE Farm1 Farm2 Farm3 RB

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Results I The entire set of matrices was analysed using 59 worker nodes of the EGEE infrastructure within the Biomed VO The clustering process of a matrix of such size is very intensive, occupying more than 1.5 GB of RAM and this is a requirement which not all the worker nodes have. Some statistics about the distribution - 59 jobs on 59 worker nodes: - 50 succesfull - 4 failed - 5 stopped (64 bit) 3 hours of running (due to queue of the worker nodes)

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Results II The avg value of spec is 0.95 ppv is 0.65 sens is 0.81 these results validate our clusters but also reflect the complexity of the problem here considered

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Conclusion Using the GRID infrastructure we are able to validate efficiently and accurately large amount of evaluated clusters We demonstrated that CMC is a potential clustering algorithm for large and heterogeneous microarray data We showed an application where we clearly differentiated when to use the GRID infrastructure and when to use local resources

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Future perspectives An authomatic procedure to perform the analysis A larger number of data sets from heterogeneous experiments A bigger set of resampling matrices to improve the statistical validation Up to 500 WN’s to distribute the jobs

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September 2008 Acknowledgements - CNR, IAC Sezione di Bari: Carmela Marangi - CNR, ITB Sezione di Bari: Andreas Gisel, Giulia De Sario - Dipartimento Interateneo di Fisica, Bari: Leonardo Angelini - INFN, Sezione di Bari: Giacinto Donvito, Giorgio Maggi

Tutorial on "GRID Computing“, EMBnet Conference 2008, 17 September