Unlocking the potential of public available gene expression data for large-scale analysis Jonatan Taminau PhD defense, November 2012.


2 Introduction In this thesis: focus on the data-to-information step; focus on microarray technology. Data → Information → Knowledge

3 Introduction Data → Information
Data repositories:
+ Massive amounts
+ Examples: GEO, ArrayExpress
+ Publicly available!
Analysis software:
+ Commercial: CLC Bio, Spotfire, etc.
+ Free: Bioconductor, GenePattern, Galaxy, etc.
+ A lot of existing research

4 Introduction “Although hundreds of thousands of samples are publicly available, and several powerful analysis software solutions exist, the research community is facing a chasm between these two resources.” (Coletta et al., 2012) “One of the challenges for the future is how to integrate all the DNA microarray data that have been generated and deposited in public databases.” (Larsson et al., 2006)

5 Introduction We identified two hurdles for large-scale microarray analysis: ① Consistent retrieval of individual datasets. ② Integrative analysis of multiple datasets.

6 Outline Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 5, Chapter 6, Chapter 7, Chapter 8, Chapter 9

7 Outline Retrieval of data: Problem Statement, inSilico DB. Integrative Analysis: Problem Statement, Meta-Analysis, Merging, Application.

9 Retrieval of genomic data Data is online and freely available, but it is difficult to retrieve consistently (example: Baggerly & Coombes, 2011). What does consistent retrieval mean? Data retrieval is reproducible and tractable. No manual intervention is needed. All data is preprocessed the same way.

10 Retrieval of genomic data Typical microarray workflow: DNA microarray → Scanner → Image → Image analysis → CEL file (numerical ‘raw’ data) → Preprocessing → Gene expression matrix

11 Retrieval of genomic data CEL file (numerical ‘raw’ data) → Preprocessing → Gene expression matrix
Preprocessing is complex:
+ normalization/background correction
+ probe-to-gene mapping
+ versioning issues
+ etc.
Not documented! “only 48% of all data in GEO and ArrayExpress was submitted with raw data” (Larsson et al., 2006)

12 Retrieval of genomic data
+ Features
  - Genes or probes
  - range: 20k-30k
+ Instances
  - Patients, tissues, etc.
  - range:
+ Gene expression value x_ij
  - Expression of gene i in sample j
  - range between (log2 scaled)

13 Retrieval of genomic data What about phenotypical data or metadata? Extra information about the samples (age, gender, disease, etc.). No standard way of formatting this information: MIAME, ontologies, free text, etc. Also still an open problem.

14 Retrieval of genomic data Why is consistent retrieval from public repositories so important? Reproducibility of results. Comparison of new results with existing studies. Combining different studies.

16 The inSilico Database Result of the InSilico project, Innoviris ( ). 8 people from VUB & ULB. Provides consistently preprocessed and expert-curated genomic data. Being commercialized.

17 The inSilico Database What makes the inSilico Database so valuable? Not the fact that all data is precomputed, but how it is precomputed. What is the underlying engine? Genomic pipelines and the backbone.

18 The inSilico DB | Genomic Pipelines For every data type there is a different pipeline. Microarray pipeline (figure): jobs, dependencies, backbone.
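A pipeline made of jobs and dependencies can be sketched as a small dependency graph that is resolved into an execution order. This is only an illustration: the job names below are hypothetical and are not taken from the inSilico DB pipelines, and the actual backbone is described on the slides as a Java daemon, not Python.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; the slide only states that a pipeline
# consists of jobs and the dependencies between them.
# Each job maps to the set of jobs that must finish before it can start.
microarray_pipeline = {
    "download_raw": set(),
    "preprocess": {"download_raw"},
    "map_probes_to_genes": {"preprocess"},
    "curate_annotations": {"download_raw"},
    "build_expression_set": {"map_probes_to_genes", "curate_annotations"},
}


def execution_order(pipeline):
    """Return one valid order in which the jobs can be run."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(microarray_pipeline)
```

Any order returned this way runs a job only after all of its prerequisites, which is the property an automatic workflow system needs in order to avoid manual intervention.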

19 The inSilico DB | Backbone Automatic workflow system. Barely any manual intervention needed. Control of intermediate results. Pre-computation saves time (for the user). Streamlined error management. Automatic monitoring.

20 The inSilico DB | Backbone How does it work? Java daemon (recently replaced by an application server). Configuration files.

21 inSilicoDb package One thing was missing for large-scale analysis: programmatic access via scripting. Contains the basic functionality of InSilico DB. Makes automatic retrieval of data possible! Seamlessly integrates with other Bioconductor analysis tools. Published in Bioinformatics, downloaded > 2000 times.

23 Integrative Analysis “Combining the information of multiple, independent but related studies in order to extract more general and more reliable results”. Problem: how to do it? Two approaches: meta-analysis and merging.

24 Integrative Analysis Figure panels: Merging, Meta-Analysis

26 Meta-Analysis
+ Combining p-values
+ Combining effect sizes
+ Combining ranks
+ Vote counting
+ etc.
+ Depends on goal
+ Much focus on finding DEGs
+ Defines what the results look like
+ Consistent retrieval is essential!
+ inSilicoDb package
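As an illustration of the "combining p-values" family, the sketch below uses Fisher's method, one common choice. The slides do not say which combination rule was used in the thesis, so this is an assumption made for the example; it also assumes the per-study p-values for a gene are independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis.
    """
    if not p_values or any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single study the function returns that study's p-value unchanged; two studies that each report p = 0.05 for the same gene combine to a smaller p-value, which is the gain in power that motivates meta-analysis.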

27 Meta-Analysis | Stable Genes 365 studies were screened for stable genes. Motivation: interested in reference genes. Currently used genes (housekeeping genes) are not ideal. Need a compact and diverse list of genes that are stable under most conditions. In collaboration with Dr Bram de Craene (VIB-UGent).

28 Meta-Analysis | Stable Genes
(1) Retrieve data
+ inSilicoDb package
+ All 365 datasets downloaded in less than 100 min
(2) Calculate stability scores
+ For each gene: coefficient of variation, CV = sd / mean
+ Avoid lowly expressed genes
(3) Combine stability scores
+ For each gene take the median of its CVs
+ Rank and take the top 100
(4) Semantic similarity filtering
+ Exclude genes that are related
+ Uses gene annotation from GO
+ Innovative step!
+ From 100 to 10 genes
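Steps (2) and (3) can be sketched as follows. The input layout (one dict per dataset, mapping a gene to its expression values) and the `min_mean` cut-off for lowly expressed genes are assumptions for this example; the slides do not state how low expression was defined.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """CV = sd / mean for one gene within one dataset."""
    return stdev(values) / mean(values)


def rank_stable_genes(datasets, min_mean=0.0, top_n=100):
    """Rank genes by the median of their per-dataset CVs (lowest first).

    datasets: list of dicts mapping gene -> list of expression values.
    min_mean: hypothetical cut-off used to skip lowly expressed genes.
    """
    cvs = {}
    for dataset in datasets:
        for gene, values in dataset.items():
            if len(values) < 2 or mean(values) <= min_mean:
                continue  # too few samples, or expression too low
            cvs.setdefault(gene, []).append(coefficient_of_variation(values))
    combined = {gene: median(scores) for gene, scores in cvs.items()}
    return sorted(combined, key=combined.get)[:top_n]
```

Taking the median rather than the mean of the per-dataset CVs keeps a single unusual study from dominating a gene's stability score.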

29 Meta-Analysis | Stable Genes Status:
+ August 2012 | waiting for results…
+ September 2012 | first positive results!
+ November 2012 | second test case, positive feedback from NAR, manuscript in preparation…

31 Merging
+ Consistent retrieval is essential!
+ inSilicoDb package
+ Batch effects
+ Methods to remove
  - Location-scale
  - Matrix factorization
  - Discretization
+ Makes data compatible
+ Preprocessing not sufficient
+ Same as with single studies
+ Increased sample size!
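The location-scale family can be illustrated with its simplest member: center and scale each gene within each study before concatenating the samples. This is a minimal sketch under assumed data structures, not the implementation from the inSilicoMerging package.

```python
from statistics import mean, pstdev


def standardize_per_batch(batches):
    """Center and scale every gene within each batch (study) separately.

    batches: dict mapping batch name -> dict mapping gene -> list of values.
    Returns the same structure with per-batch gene mean 0 and sd 1.
    """
    adjusted = {}
    for batch, genes in batches.items():
        adjusted[batch] = {}
        for gene, values in genes.items():
            mu, sigma = mean(values), pstdev(values)
            if sigma == 0:
                sigma = 1.0  # constant gene: only remove the offset
            adjusted[batch][gene] = [(v - mu) / sigma for v in values]
    return adjusted


def merge_batches(adjusted):
    """Concatenate samples of all batches for the genes they share."""
    if not adjusted:
        return {}
    common = set.intersection(*(set(genes) for genes in adjusted.values()))
    return {
        gene: [v for genes in adjusted.values() for v in genes[gene]]
        for gene in sorted(common)
    }
```

Note that per-study standardization also removes real differences between studies when their class proportions differ, which is one reason more elaborate location-scale methods exist.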

32 Merging | Batch Effects Illustrative example of what batch effects can cause: we merged 4 different studies with thyroid samples. All studies contained normal and tumor samples. In collaboration with Wilma Van Staveren (IRIBHM, ULB). Samples are plotted in MDS space. We expect two clusters.

33 Merging | Batch Effects Figure panels: merging without batch effect removal; merging with batch effect removal. Legend: symbol for study, color for normal/tumor.

34 inSilicoMerging package R/Bioconductor package combining: 6 different merging methods, 5 visual inspection tools, 6 quantitative measures. Only resource so far combining all this functionality! Seamlessly integrates with the inSilicoDb package.

36 Identification of DEGs in Lung Cancer Idea: compare the meta-analysis and merging approaches for integrative analysis. We used lung cancer as a case study, chosen based on the content of inSilico DB. Ignore subtypes: DEGs can be seen as playing a role in the basic mechanisms of lung cancer.

37 Identification of DEGs in Lung Cancer What is our hypothesis? Due to the small sample sizes of individual studies there are a lot of false negatives when using meta-analysis. Can we avoid this by using merging as an alternative approach?

38 Identification of DEGs in Lung Cancer Merging, Meta-Analysis
Constraints:
+ fRMA preprocessed
+ > 30 samples
+ both normal and tumor
+ GPL96 or GPL570
Methodology:
+ apply limma
  - p-value 2
+ robustness test
  - iterations with 90% of data
  - resampling
+ inSilicoMerging package
+ take intersection
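The robustness test can be sketched as repeated DEG calling on random 90% subsets, keeping only genes found in every run. Several parts are assumptions for this example: `call_degs` is a simple stand-in based on a mean difference and is not limma, the number of iterations and the `min_diff` threshold are hypothetical, and "robust" is interpreted here as "recovered in all iterations".

```python
import random
from statistics import mean


def call_degs(normal, tumor, min_diff=1.0):
    """Stand-in DEG caller (not limma): flag genes whose mean log2
    expression differs by at least min_diff between the two groups."""
    return {
        gene for gene in normal
        if abs(mean(tumor[gene]) - mean(normal[gene])) >= min_diff
    }


def subsample(group, fraction, rng):
    """Keep the same random subset of sample columns for every gene."""
    n = len(next(iter(group.values())))
    k = min(n, max(1, round(fraction * n)))
    keep = sorted(rng.sample(range(n), k))
    return {gene: [values[i] for i in keep] for gene, values in group.items()}


def robust_degs(normal, tumor, iterations=100, fraction=0.9, seed=0):
    """Return genes called differentially expressed in every resampling run."""
    rng = random.Random(seed)
    robust = call_degs(normal, tumor)
    for _ in range(iterations):
        robust &= call_degs(subsample(normal, fraction, rng),
                            subsample(tumor, fraction, rng))
    return robust
```

A gene whose signal depends on one or two outlying samples tends to drop out in some of the runs, which matches the intent of using resampling to remove false positives.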

39 Identification of DEGs in Lung Cancer Meta-Analysis:

40 Identification of DEGs in Lung Cancer Merging:

41 Identification of DEGs in Lung Cancer Findings:
+ Resampling helps to remove false positives
+ Relatively low impact of batch effect removal methods
+ More DEGs identified through merging (102) than via meta-analysis (25)
“Deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.” (Xu et al., 2008)
No false positives?
+ checked literature
+ initial pathway analysis

42 Outline Retrieval of data: Problem Statement, inSilico DB. Integrative Analysis: Problem Statement, Meta-Analysis, Merging, Application. + Contributions + Conclusions

43 Contributions
+ Genomic pipelines / backbone (Ch 4)
+ Release of 2 publicly available R/Bioconductor packages (Ch 4 & 7)
+ Survey of batch effect removal methods (Ch 7)
+ Two applications
  - Identification of stable genes via meta-analysis (Ch 6)
  - Screening of potential biomarkers via integrative analysis (Ch 8)

44 Conclusions We identified two hurdles for large-scale microarray analysis: ① Consistent retrieval of individual datasets. ② Integration of multiple data sets for integrative analysis.

45 Conclusions ① Consistent retrieval of individual datasets → inSilicoDb package. ② Integration of multiple data sets for integrative analysis → inSilicoMerging package. Paving the road towards unlocking the potential of publicly available gene expression studies.

46 Thanks! + InSilico Team! + Jury! + Audience! + Yann-Michaël!