From Expression to Regulation: the online analysis of microarray data Gert Thijs K.U.Leuven, Belgium ESAT-SCD.



K.U.Leuven Founded in 1425 Situated in the center of Belgium Some numbers: students researchers professors University Hospital with beds

ESAT-SCD Faculty of Engineering Mathematical engineering (120) – Systems and control – Data mining and Neural Nets – Biomedical signal processing – Telecommunications – Bioinformatics – Cryptography

Bioinformatics team Research in medical informatics and bioinformatics Research on algorithmic methods Interdisciplinary team – 15 researchers (1 full professor, 4 post-docs, 10 Ph.D. students) – Engineering, physics, mathematics, computer science, biotech, and medicine Collaborative research with molecular biologists and clinicians – VIB MicroArray Facility: primary analysis of microarray data – University of Gent-VIB, Plant Genetics: motif discovery – KUL-VIB, Center for Human Genetics Neuronal development in mice neurons Targets of PLAG1 (pleiomorphic adenoma gene) – KUL, Obstetrics and Gynecology Diagnosis of ovarian tumors from ultrasonography (IOTA) Microarray analysis of ovarian tumor biopsies

Overview 1. Short introduction to microarrays 2. Exploratory analysis of microarray data 3. Clustering gene expression profiles 4. Upstream sequence retrieval 5. Motif finding in sets of co-expressed genes

cDNA microarrays Collaboration with VIB microarray facility cDNAs (genes, ESTs) spotted on array – Cy3, Cy5 labeling of samples – Hybridization (test, control) – Laser scanning & image analysis – Arabidopsis, mouse, and human

Microarray experiment 1.Collecting samples 2.Extracting mRNA 3.Labeling 4.Hybridizing 5.Scanning 6.Visualizing

Microarray production Clones Plasmid preparation PCR amplification Reordering Spotting Zoom - pins

From expression to regulation (Figure labels: Microarrays, Clustering, GenBank, Blast, Gibbs sampler; example identifiers A1234, Z4321)

Exploratory data analysis

Data exploration Subset selection based on – Gene Ontology functional classes – Keywords, gene names Check the expression profiles of individual genes Visualization expression profiles of gene families Link to upstream sequence retrieval

Gene Ontology

Subset selection

Profile inspection

Profile visualization

Sequence Retrieval

Clustering

Goal of clustering Exploration of microarray data Form coherent groups of – Genes – Patient samples (e.g., tumors) – Drug or toxin response Study these groups to get insight into biological processes – Genes in same clusters can have the same function or same regulation

K-means Initialization – Choose the number of clusters K and start from random positions for the K centers Iteration – Assign points to the closest center – Move each center to the center of mass of the assigned points Termination – Stop when the centers have converged or maximum number of iterations (Figure panels: initialization, iteration 1, iteration 3)
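
The three K-means phases listed above (initialization, iteration, termination) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the talk: the function name `kmeans`, the choice of K data points as starting centers, the squared Euclidean distance, and the toy `profiles` are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples (e.g. expression profiles)."""
    rng = random.Random(seed)
    # Initialization: the slide starts from random positions; as an assumption,
    # this sketch picks K distinct data points as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Iteration, part 1: assign each point to the closest center
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Iteration, part 2: move each center to the center of mass of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        # Termination: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated groups of two-point "profiles" (made up).
profiles = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(profiles, k=2)
```

Every point receives a label, including noisy ones, which is the behavior the later slides criticize.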

Hierarchical clustering Construction of gene tree based on correlation matrix
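
As a rough illustration of building a gene tree from a correlation matrix, the sketch below computes pairwise Pearson correlations, turns them into distances (1 - r), and merges the closest clusters step by step. The slide does not say which linkage rule is used; average linkage is an assumption here, as are the function names `pearson` and `gene_tree` and the toy profiles.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def gene_tree(profiles):
    """profiles: dict gene name -> expression values. Returns a tree of nested tuples."""
    names = list(profiles)
    # Correlation matrix converted to a distance matrix: d = 1 - r.
    dist = {(g, h): 1.0 - pearson(profiles[g], profiles[h]) for g in names for h in names}
    # Each cluster is a pair (subtree, list of member genes).
    clusters = [(g, [g]) for g in names]

    def avg_dist(c1, c2):
        # Average linkage (assumed): mean distance over all gene pairs.
        return sum(dist[g, h] for g in c1[1] for h in c2[1]) / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find and merge the two clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][0]

# Made-up profiles: geneA and geneB rise together, geneC falls.
example = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
tree = gene_tree(example)
```

In this toy case geneA and geneB are joined first, and geneC is attached to that pair last.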

K-means clustering Need for new clustering algorithms Noisy genes deteriorate consistency of profiles in cluster All genes forced into cluster

Adaptive quality-based clustering For discovery, biologists are looking for highly coherent, reliable clusters Other needs for clustering microarray data – Fast + limited memory (need to analyze thousands of genes) – No need to specify number of clusters in advance – Few and intuitive parameters AQBC = 2 step algorithm – Cluster center localization – Cluster radius estimation with EM Read more: – De Smet et al. (2002) Bioinformatics, in press.

Step 1: localization of cluster center

Step 2: re-estimation of cluster radius Distance from cluster center randomly distributed except for small group (= cluster elements) Size of cluster can be estimated automatically by EM Step 3: remove cluster points and look for new cluster

Comparison with K-means

Adaptive quality-based clustering (A.Q.B.C.)
User-defined parameters
– Quality criterion (QC, in %): defines how significantly a cluster should be separated from background
– Minimal number of genes in a cluster
Advantages
– Outcome not sensitive to parameter setting
– Number of clusters is determined automatically
– Based on QC an optimal radius is calculated for each cluster
– Set of smaller clusters containing genes with highly similar expression profile (fewer false positives)
– Noisy genes are rejected
Disadvantages
– Some information is rejected: clusters too small

K-means
User-defined parameters
– Number of clusters
– Number of iterations
Disadvantages
– Outcome sensitive towards parameter setting
– Extensive fine-tuning required to find optimal number of clusters
– Separation and merging of clusters based on visual inspection and not on statistical foundation
– No quality criterion: more false positives
– All genes will be clustered (noisy clusters)
Advantages
– Fewer true positives are rejected

Adaptive Quality-Based Clustering Web Interface

Cluster results page Upstream Sequence Retrieval

Upstream sequence retrieval

Upstream Sequence Retrieval 1. Identify all genes in cluster based on given accession number and gene name. 2. Delineate upstream region based on sequence annotation. 3. Check for presence of annotated upstream gene. 4. IF upstream gene found THEN select intergenic region ELSE BLAST the gene to find genomic DNA where the gene is annotated. 5. Parse BLAST reports to find intergenic regions. 6. Report results in GFF.
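
Steps 3, 4 and 6 above can be illustrated with a small sketch. It assumes genes are given as dictionaries with 1-based `start`, `end` and `strand` fields on a single annotated sequence; the helper names, the `intergenic_region` feature label, the `source` value and the `gene=` attribute style are hypothetical choices for the example, not the tool's actual output. The BLAST fallback of step 4 is not implemented: the function simply returns `None` when no upstream gene is annotated.

```python
def intergenic_upstream(gene, genes_on_sequence):
    """Return (start, end) of the intergenic region upstream of `gene`,
    or None when no upstream gene is annotated (the BLAST fallback case)."""
    if gene["strand"] == "+":
        # Upstream lies at lower coordinates: take the nearest gene ending before us.
        ends = [g["end"] for g in genes_on_sequence if g["end"] < gene["start"]]
        region = (max(ends) + 1, gene["start"] - 1) if ends else None
    else:
        # On the minus strand, upstream lies at higher coordinates.
        starts = [g["start"] for g in genes_on_sequence if g["start"] > gene["end"]]
        region = (gene["end"] + 1, min(starts) - 1) if starts else None
    if region is None or region[0] > region[1]:
        return None  # no upstream gene, or adjacent genes leave no gap
    return region

def gff_line(seqname, start, end, strand, gene_name, source="upstream_retrieval"):
    # Nine tab-separated GFF columns: seqname, source, feature, start, end,
    # score, strand, frame, attributes. "." marks an undefined value.
    fields = [seqname, source, "intergenic_region", str(start), str(end),
              ".", strand, ".", f"gene={gene_name}"]
    return "\t".join(fields)

# Made-up annotation of two genes on one sequence.
genes = [
    {"name": "g1", "start": 100, "end": 900, "strand": "+"},
    {"name": "g2", "start": 1500, "end": 2400, "strand": "+"},
]
region_g2 = intergenic_upstream(genes[1], genes)   # gap between g1 and g2
region_g1 = intergenic_upstream(genes[0], genes)   # no upstream gene annotated
line = gff_line("seq1", region_g2[0], region_g2[1], "+", "g2")
```

Returning `None` rather than guessing a fixed-length window keeps the two branches of step 4 explicit for the caller.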

Gene Identification

Selected sequences & genes to be blasted

Results blast report parsing

Selected sequences

Motif Finding

Transcriptional regulation Complex integration of multiple signals determines gene activity Combinatorial control

Identifying regulatory elements from expression data Cluster genes from microarray expression data to build clusters of co-expressed genes Co-expressed genes may share regulatory mechanisms Most regulatory sequences are found in the upstream region of the genes (up to 2 kb in A. thaliana) Motifs that are statistically overrepresented in the upstream regions are candidate regulatory sequences

Upstream sequence model Motifs are hidden in noisy background sequence. Data set contains two types of sequences: – Sequences with one or more copies of the common motif. – Sequences with no copy of the common motif.

Motif Sampler Algorithm based on the original Gibbs Sampling algorithm (Lawrence et al. 1993, Science 262: ) Probabilistic sequence model Changes and additions: – Use of higher-order background model. – Use of probability distribution to estimate number of copies. – Different motifs are found and masked in consecutive runs of the algorithm. Read more: – Thijs et al. (2001) Bioinformatics 17(12), – Thijs et al. (2002) J.Comp.Biol. 9(2),

Background model Representation of DNA sequence by a higher-order Markov chain. (Figure labels: intergenic region, core promoter, gene.) A reliable model can be built from selected intergenic DNA sequences. Intergenic sequence = non-coding region between two consecutive genes. Only regions that contain the core promoter are selected.
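
To make the idea of a higher-order Markov background concrete, the sketch below estimates the probability of each base given the preceding `order` bases from a set of training sequences, and then scores a sequence under that model. The function names, the pseudocount, and the uniform fallback for unseen contexts are assumptions for illustration; the sketch also assumes sequences contain only A, C, G and T, and it does not score the first `order` bases.

```python
from collections import defaultdict
from math import log

def train_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from intergenic sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0.0))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if base in "ACGT" and set(context) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in "ACGT"}
    return model

def log_prob(seq, model, order=3):
    """Log-probability of seq[order:] under the model; unseen contexts are uniform."""
    lp = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        lp += log(probs[seq[i]] if probs else 0.25)
    return lp

# Made-up training data and a first-order model for readability.
model = train_background(["ACGTACGTACGT"], order=1)
score_typical = log_prob("ACGT", model, order=1)
score_unusual = log_prob("AAAA", model, order=1)
```

A segment that looks like the training data scores higher than one that does not, which is what lets a motif stand out against the background.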

Algorithm: Initialization Calculate background model score Start from random set of motif positions Create initial motif model

Algorithm: iterative procedure 1. Score sequences with current motif model 2. Calculate distribution 3. Sample new alignment position 4. Iterate for fixed number of steps
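
The iterative procedure above can be sketched as a basic Gibbs site sampler in the spirit of Lawrence et al. This is a simplified illustration and not the Motif Sampler itself: it assumes exactly one motif copy per sequence, uses a zero-order (single-nucleotide) background instead of a higher-order model, and returns the last sampled alignment rather than selecting the best scoring positions. All names, the pseudocount and the toy sequences are assumptions.

```python
import random

BASES = "ACGT"

def motif_model(sites, width, pseudocount=0.5):
    """Position frequency matrix estimated from the aligned sites."""
    model = []
    for pos in range(width):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        model.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return model

def gibbs_motif_search(seqs, width, steps=200, seed=0):
    rng = random.Random(seed)
    # Background: single-nucleotide frequencies over all sequences (zero order).
    joined = "".join(seqs)
    bg = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in BASES}
    # Initialization: a random motif start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(steps):
        for i, seq in enumerate(seqs):
            # Motif model built from the sites of all other sequences.
            sites = [s[p:p + width]
                     for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            model = motif_model(sites, width)
            # 1. Score every position: motif likelihood relative to background.
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for k, base in enumerate(seq[p:p + width]):
                    w *= model[k][base] / bg[base]
                weights.append(w)
            # 2-3. Treat the scores as a distribution and sample a new position.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    final_sites = [s[p:p + width] for s, p in zip(seqs, starts)]
    return starts, motif_model(final_sites, width)

# Made-up sequences, each containing the word TGACGT somewhere.
seqs = ["ACCTTGACGTAAGC", "GGTGACGTCCATAT", "TTATCCGTGACGTA", "CATGACGTGGGTAC"]
starts, model = gibbs_motif_search(seqs, width=6, steps=50)
```

Because positions are sampled rather than chosen greedily, different seeds can give different alignments; running several times and comparing results is the usual remedy.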

Algorithm: Convergence Select best scoring positions from Wx to create motif and alignment

Motif Sampler

Motif Sampler results page

Example: Plant wounding 150 Arabidopsis genes Mechanical plant wounding 7 (or 8) time points over a 24 h period Adaptive quality-based clustering produces 8 clusters, of which 4 contain 5 or more genes. Search for a motif of length 8 and a motif of length 12 in these 4 clusters. Data: Reymond, P. et al. Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12(5):

Results: Cluster 1
TAArTAAGTCAC | 7 | TGAGTCA | tissue specific GCN4-motif
             |   | CGTCA | MeJA-responsive element
ATTCAAATTT | 8 | ATACAAAT | element associated to GCN4-motif
CTTCTTCGATCT | 5 | TTCGACC | elicitor responsive element

Results: Cluster 2
CCCGCGTTTCAA | 4 | CCCCCG | enhancer like element
TTGACyCGy | 5 | TGACG | MeJA responsive element
          |   | (T)TGAC(C) | Box-W1, elicitor responsive element
mACGTCACct | 7 | CGTCA | MeJA responsive element
           |   | ACGT | Abscisic response element

Results: Cluster 4
wATATATATmTT | 5 | TATATA | TATA-box like element
TCTwCnTC | 9 | TCTCCCT | TCCC-motif, part of light responsive element
ATAAATAkGCnT | 7 | - | -

Results: Cluster 8
yTGACCGTCcsa | 9 | CCGTCC | meristem specific activation of H4 gene
             |   | CCGTCC | A-box, light or elicitor responsive element
             |   | TGACG | MeJA responsive element
             |   | CGTCA | MeJA responsive element
CACGTGG | 5 | CACGTG | G-box, light responsive element
        |   | ACGT | Abscisic acid response element
GCCTymTT | 8 | - | -
AGAATCAAT | 6 | - | -

Conclusions Gene expression data can reveal useful information on transcriptional regulation. Adaptive quality-based clustering finds coherent groups of co-expressed genes. Use of higher-order background models improves performance of Motif Sampler. INCLUSive enables online analysis from clustering to motif finding

Acknowledgements ESAT-SCD Prof. Bart De Moor Dr. Yves Moreau Dr. Kathleen Marchal Frank De Smet Stein Aerts all others STWW Project Pierre Rouzé (VIB Gent, INRA) Stephane Rombauts (VIB Gent, INRA) Magali Lescot (LGPD, Marseille) IWT-Vlaanderen