Query-driven search methods for large microarray databases Matt Hibbs Troyanskaya Laboratory for BioInformatics and Functional Genomics.



Broad Goals/Challenges Characterize the function of proteins Learn the mechanisms of gene expression and regulation under many conditions –Growing amounts of data facilitate this goal Noise, heterogeneity, and biases in available data must be addressed

Specific Goals Large collection of S. cerevisiae microarray data –From > 80 publications –Totaling ~2400 conditions –Divided into ~130 “datasets” How can such a large amount of data be leveraged? –What can we learn? Or not learn? –Accessibility, usefulness to community

Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions


Central Dogma Transcription factors recruit or repress polymerase Transcription –DNA → mRNA Translation –mRNA → Proteins Proteins do work [Diagram labels: DNA, mRNA, Proteins, Ribosome, TF, Polymerase]

Molecular Measurements Measurements of protein abundance in a variety of conditions can suggest function –Difficult to measure accurately in a large-scale manner One step removed: measure abundance of mRNA transcripts as a proxy –Much easier to measure on a large scale –Several competing technologies reaching maturity

Basic Microarray Methodology Step 1: Prepare cDNA spots Step 2: Add mRNA to slide for hybridization Step 3: Scan hybridized array [Diagram: reference mRNA, add green dye; test mRNA, add red dye; hybridize]

Microarray Outputs Measure amounts of green and red dye on each spot Represent level of expression as a log ratio between these amounts [Raw image from Spellman et al., 1998]
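
The log ratio on this slide can be sketched as below. This is a minimal illustration, not the pipeline used in the talk: the base-2 logarithm, the red-over-green orientation (test over reference), and the `pseudocount` parameter are all assumptions.

```python
import math

def log_ratio(red, green, pseudocount=1.0):
    """Log2 ratio of test (red) to reference (green) intensity for one spot.

    The pseudocount is a hypothetical safeguard that keeps the ratio
    finite when either channel reads zero.
    """
    return math.log2((red + pseudocount) / (green + pseudocount))

# A spot twice as bright in red as in green (after the pseudocount) scores +1;
# the reverse scores -1; equal intensities score 0.
print(log_ratio(3, 1), log_ratio(1, 3), log_ratio(0, 0))
```

Positive values then indicate higher expression in the test sample, negative values higher expression in the reference.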

Microarray Outputs Log ratios in data matrix Missing values present Potentially high levels of noise [Figure axes: Experiments, Genes]

Additional Technology Two-color (homemade, Agilent) –Process just described, with 2 labeled samples undergoing competitive hybridization Single-color (Affymetrix) –Highly calibrated hybridization spots –Match and Mis-match spots for each oligo Other techniques/tricks –Randomized layouts, barcode arrays, tiling arrays, etc.

Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

Noise Sources Transcriptional noise –mRNA transcripts not a direct reflection of protein levels –Process of isolating mRNA can stress cells Especially true of older protocols/data Chemical noise –Fluorescent labels sensitive to environment Operator noise –High variation between scientists running the same experiment

Missing Values Several choices: –Ignore missing values –Remove genes with missing values –Impute missing values KNN-Impute –Replace missing values with a weighted average of the K-nearest neighbors –Used for analysis presented later
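
A rough sketch of the KNN-impute idea follows. It assumes a matrix stored as a list of gene rows with `None` marking missing values, a root-mean-square distance over the conditions two genes share, and inverse-distance weights; none of these details come from the slides, and the function name `knn_impute` is hypothetical.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with a distance-weighted average of the k nearest genes."""
    def distance(a, b):
        # Root mean squared difference over conditions observed in both genes.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]  # leave the input untouched
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes with an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for g, other in enumerate(matrix)
                if g != i and other[j] is not None
            )
            neighbors = [(d, v) for d, v in candidates[:k] if d < math.inf]
            if not neighbors:
                continue  # nothing to impute from; keep the gap
            # Closer neighbors get larger weights; the small constant avoids 1/0.
            weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            filled[i][j] = total / sum(weights)
    return filled

data = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
print(knn_impute(data, k=1))
```

Distances are computed on the original matrix, so values imputed earlier in the pass never influence later ones.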

Normalization “Bright” arrays –Whole arrays often normalized by average intensity Two-color –Choice of reference population can affect measurements –Avoid divide by zero errors Affymetrix –Convert hybridization values to log ratios Divide by average value Log transform

Clustering Analysis Distance metrics –Euclidean –Pearson –Spearman –… Algorithms –Hierarchical –K-means –SOM –…
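
For reference, the three distance metrics named on this slide can be written out in a few lines. This is only a sketch: the helper names are hypothetical, profiles are plain lists with no missing values, and constant profiles (zero variance) are not handled.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; 1 - pearson(x, y) is a common distance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = average
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

print(euclidean([0, 0], [3, 4]))
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Spearman depends only on rank order, so any monotonic relationship scores as a perfect correlation, which makes it less sensitive to outlying intensities than Pearson.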

Megaclustering Combining data from multiple sources can cause problems –Normalization differences –Technology differences –Noise biases Requires unified pre-processing and smart application of statistics

Apples to Apples Pearson correlation distributions not always normal –Large dependence on number of conditions [Figure: histograms of Pearson correlation coefficients for a 6-condition dataset and a 40-condition dataset]

Apples to Apples Fisher’s Z-score transform normalizes the distributions –Z = ln[(1+r)/(1-r)] / 2, where r = Pearson corr. coeff. [Figure: histograms of Z-scores for a 6-condition dataset and a 40-condition dataset]
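
In its standard form the Fisher transform is z = ½ ln((1 + r) / (1 − r)), which equals atanh(r). The sketch below implements it, plus one possible way of making datasets of different sizes comparable: under the usual normal-theory approximation z has a standard deviation close to 1/sqrt(n − 3) for n conditions, so multiplying by sqrt(n − 3) puts every dataset on roughly unit scale. That scaling step and the name `comparable_z` are assumptions; the slides do not say how the Z-scores were standardized.

```python
import math

def fisher_z(r):
    """Fisher transform: z = 0.5 * ln((1 + r) / (1 - r)), i.e. atanh(r)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def comparable_z(r, n_conditions):
    """Scale z by sqrt(n - 3) so datasets of different sizes share a scale.

    Assumed standardization, based on the approximate standard error
    1 / sqrt(n - 3) of the Fisher z statistic; requires n > 3.
    """
    if n_conditions <= 3:
        raise ValueError("need more than 3 conditions")
    return fisher_z(r) * math.sqrt(n_conditions - 3)

# The same r = 0.5 is far stronger evidence in 40 conditions than in 6.
print(comparable_z(0.5, 6), comparable_z(0.5, 40))
```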

Evaluation Measurements Gene Ontology (GO) –Hierarchical organization of biological processes, molecular functions, and cellular components –Cross-organism structure, organism-specific annotations –Closest available approximation of a “gold standard” True Positives and False Positives can be defined from the ontology –Node size, depth, expert voting used for cutoffs

Precision / Recall Calculate and sort distances between all pairs of genes Determine a cutoff, all pairs below cutoff are predicted “true,” above “false” Given these predictions, can calculate precision and recall –Precision = TP / (TP + FP) –Recall = TP / TotalPositives Slide the cutoff from smallest to largest distance to create a curve of precision / recall pairs –Ramp down from few, high confidence predictions to many, low confidence predictions
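
The sliding-cutoff procedure can be sketched as follows. The input format, a list of `(distance, is_positive)` tuples with one entry per gene pair, is an assumption, and pairs with tied distances are simply taken in sorted order rather than grouped.

```python
def precision_recall_curve(pairs):
    """Return (precision, recall) after each successive cutoff.

    pairs: iterable of (distance, is_positive), where is_positive says
    whether the gold standard counts the gene pair as a true relationship.
    """
    ranked = sorted(pairs, key=lambda p: p[0])  # smallest distance first
    total_positives = sum(1 for _, positive in ranked if positive)
    if total_positives == 0:
        raise ValueError("gold standard contains no positive pairs")
    curve = []
    tp = fp = 0
    for _, positive in ranked:
        # Moving the cutoff past this pair predicts it "true".
        if positive:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_positives))
    return curve

example = [(0.1, True), (0.2, False), (0.3, True), (0.9, False)]
for precision, recall in precision_recall_curve(example):
    print(round(precision, 3), round(recall, 3))
```

Early points on the curve correspond to the few, high-confidence predictions; later points trade precision for recall as the cutoff loosens.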

Example Precision/Recall of various data types

Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

Functional Biases Microarray experiments often targeted at a particular process, pathway, or function However, several “global” signals are often present –Ribosomal response –General Stress Response Some datasets do contain more targeted “local” signals as well

Ribosome Bias Precision/Recall of various data types

Ribosome Bias Precision/Recall excluding Ribosome Biogenesis

Process-specific P/R Can generate PR-curves on a per-GO term basis –TPs are pairs of genes annotated to term –FPs are pairs with one gene in term, with smallest common ancestor in very large term –Normalize by size of GO term Results for individual data sets can expose functional biases

Per-dataset Biases Typical Results

Per-dataset Biases Poor Results

Per-dataset Biases Diverse Results

Z-test for significance Difference between pair-wise distances for all genes in a term vs. background
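
One way to read this slide is as a one-sample Z-test: compare the mean pairwise value among genes in a term against the mean and standard deviation of the background distribution. The sketch below takes that reading. It is an assumption about the test's exact form, the function name is hypothetical, and it treats gene pairs as independent, which they are not strictly.

```python
import math

def z_test(term_values, background_mean, background_sd):
    """Z statistic and two-sided p-value for a term versus the background.

    term_values: pairwise distances (or Z-scores) among genes in one GO term.
    """
    n = len(term_values)
    mean = sum(term_values) / n
    z = (mean - background_mean) / (background_sd / math.sqrt(n))
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

z, p = z_test([1.0, 1.0, 1.0, 1.0], background_mean=0.0, background_sd=1.0)
print(z, round(p, 4))
```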

A Global View Z-test P-values Columns - datasets Rows - GO terms Red at a cutoff of

A Global View

A Local View

Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

Bi-clustering Traditional clustering will be driven by “global” signals and ignore “local” signals Bi-clustering identifies groups of genes and conditions rather than just genes [Figure panels: Traditional clustering, Bi-clustering]

Bi-clustering goals/issues Better capture biological reality –Genes only cooperate in certain conditions –Genes can have multiple functions –Datasets have functional biases Computationally difficult problem –Reducible to bi-clique finding (NP-complete) Heuristics, simplifications, approximations –e.g. δ-biclusters, SAMBA, PISA

Bi-clustering goals/issues Microarray noise can lead to spurious output –As compendiums increase in size, patterns by chance increase –Datasets have “smallest logical groupings” Restrict co-expression to these groups Long running times + large result sets –Difficult to validate results –Scientifically frustrating

Query-driven approach Allow users to specify a starting point for search –Leverages expert knowledge of domain –Known to be useful in other contexts bioPIXIE Identify conditions/datasets of interest based on the set of query genes Expand query set to include additional related genes in these conditions

Query-driven approach Reduces problem complexity to allow for real-time results Fast results allow for user-driven refinement of search criteria Extensible to larger data compendiums and more complex organisms –Locality sensitive hashing –Pre-processing

Query Weighting Identify data conditions related in query set –Average correlation, distance, etc. –Signal to Noise ratio of query –Centroid significance Additional genes related to query –Correlation, distance, etc. weighted by identified condition sets

Simple Scheme Weighted by correlation of query

Simple Scheme Results, weighted sum of correlation to query [Figure: genes listed in order of decreasing correlation]
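
A minimal sketch of such a simple scheme is given below. It assumes each dataset is a dict mapping gene name to an expression profile, that every query gene appears in every dataset, that the query has at least two genes, and that negative dataset weights are clamped to zero. These choices and the function names are illustrative; they are not taken from the talk.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (no constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dataset_weight(dataset, query):
    """Average pairwise correlation among query genes, clamped at zero."""
    pairs = list(combinations(query, 2))
    mean_r = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(mean_r, 0.0)

def rank_genes(datasets, query):
    """Score each non-query gene by a weighted sum of its correlation to the query."""
    scores = {}
    for dataset in datasets:
        weight = dataset_weight(dataset, query)
        for gene, profile in dataset.items():
            if gene in query:
                continue
            mean_r = sum(pearson(profile, dataset[q]) for q in query) / len(query)
            scores[gene] = scores.get(gene, 0.0) + weight * mean_r
    # Highest score first, i.e. decreasing weighted correlation to the query.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

d1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 2, 3, 5], "D": [4, 3, 2, 1]}
d2 = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4], "D": [2, 1, 4, 3]}
print(rank_genes([d1, d2], ["A", "B"]))
```

In this toy input the query genes are anti-correlated in `d2`, so that dataset receives zero weight and the ranking is driven entirely by `d1`.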

Ongoing Work Compare query weighting schemes UI challenges Scalability concerns –Indexing, Locality Sensitive Hashing –Human data Assess biological usefulness

Preliminary Conclusions Noise, functional biases, collection sizes require consideration in microarray analysis Evaluation metrics can be influenced by biases creating misleading results Query-driven approaches show promise –Targeted search –Computational feasibility / Real-time results –Extensibility

Acknowledgements Olga Troyanskaya Chad Myers Curtis Huttenhower Kai Li and lab Botstein and Kruglyak labs Kara Dolinski, Maitreya Dunham Jessy