Query-driven search methods for large microarray databases Matt Hibbs Troyanskaya Laboratory for BioInformatics and Functional Genomics.

1 Query-driven search methods for large microarray databases Matt Hibbs Troyanskaya Laboratory for BioInformatics and Functional Genomics

2 Broad Goals/Challenges Characterize the function of proteins Learn the mechanisms of gene expression and regulation under many conditions –Growing amounts of data facilitate this goal Noise, heterogeneity, and biases in available data must be addressed

3 Specific Goals Large collection of S. cerevisiae microarray data –From > 80 publications –Totaling ~2400 conditions –Divided into ~130 “datasets” How can such a large amount of data be leveraged? –What can we learn? Or not learn? –Accessibility, usefulness to community

4 Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

5 Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

6 Central Dogma Transcription factors recruit or repress polymerase Transcription –DNA → mRNA Translation –mRNA → Proteins Proteins do work [Diagram labels: DNA, mRNA, Proteins, Ribosome, TF, Polymerase]

7 Molecular Measurements Measurements of protein abundance in a variety of conditions can suggest function –Difficult to measure accurately in a large-scale manner One off: measure abundance of mRNA transcripts as a proxy –Much easier to measure on a large scale –Several competing technologies reaching maturity

8 Basic Microarray Methodology Step 1: Prepare cDNA spots Step 2: Add mRNA to slide for Hybridization Step 3: Scan hybridized array [Diagram: reference mRNA (add green dye) and test mRNA (add red dye) are hybridized to the array]

9 Microarray Outputs Measure amounts of green and red dye on each spot Represent level of expression as a log ratio between these amounts Raw Image from Spellman et al., 98
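The log ratio described on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the base-2 logarithm, the function name `log_ratio`, and the `pseudocount` guard against zero intensities are all assumptions. Following the earlier slide, the test sample is taken as red and the reference as green.

```python
import math

def log_ratio(red, green, pseudocount=1.0):
    """Return log2(red / green) for one spot.

    red   : intensity of the test-sample dye (assumed red channel)
    green : intensity of the reference dye (assumed green channel)
    pseudocount : hypothetical small constant added to both channels
                  so that a zero intensity does not cause a division error
    """
    return math.log2((red + pseudocount) / (green + pseudocount))

# Equal intensities give 0; a four-fold excess of red gives 2.
print(log_ratio(100.0, 100.0))
print(log_ratio(8.0, 2.0, pseudocount=0.0))
```

A positive value means the gene is more abundant in the test sample than in the reference; a negative value means the opposite.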

10 Microarray Outputs Experiments Genes Log ratios in data matrix Missing values present Potentially high levels of noise

11 Additional Technology Two-color (homemade, Agilent) –Process just described, with 2 labeled samples undergoing competitive hybridization Single-color (Affymetrix) –Highly calibrated hybridization spots –Match and Mis-match spots for each oligo Other techniques/tricks –Randomized layouts, barcode arrays, tiling arrays, etc.

12 Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

13 Noise Sources Transcriptional noise –mRNA transcripts not a direct reflection of protein levels –Process of isolating mRNA can stress cells Especially true of older protocols/data Chemical noise –Fluorescent labels sensitive to environment Operator noise –High variation between scientists running the same experiment

14 Missing Values Several choices: –Ignore missing values –Remove genes with missing values –Impute missing values KNN-Impute –Replace missing values with a weighted average of the K-nearest neighbors –Used for analysis presented later
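The KNN-based imputation named on this slide might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the published KNN-Impute implementation: missing values are represented as `None`, neighbors are ranked by root-mean-square difference over the conditions both genes have measured, and weights are inverse distances. The function name `knn_impute` is hypothetical.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with a distance-weighted average of the k nearest genes.

    matrix : list of rows (genes), each a list of log ratios or None.
    Returns a new matrix; the input is left unchanged.
    """
    def distance(a, b):
        # Compare only conditions measured in both genes (assumption).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for n, other in enumerate(matrix):
                if n == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d != math.inf:
                    candidates.append((d, other[j]))
            candidates.sort()
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to borrow from; leave the gap
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled

data = [[1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [5.0, 6.0, 9.0]]
print(knn_impute(data, k=1))
```

Closer neighbors dominate the average, so a gene with an almost identical profile contributes nearly all of the imputed value.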

15 Normalization “Bright” arrays –Whole arrays often normalized by average intensity Two-color –Choice of reference population can affect measurements –Avoid divide by zero errors Affymetrix –Convert hybridization values to log ratios Divide by average value Log transform
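One reading of the single-color conversion on this slide (divide by the average value, then log transform) is sketched here. The slide does not say which average is meant; this sketch assumes each gene's intensity is divided by that gene's mean across arrays. The function name `to_log_ratios` and the `floor` parameter, which clips very low intensities so the logarithm is always defined, are assumptions.

```python
import math

def to_log_ratios(intensities, floor=1.0):
    """Convert one gene's single-channel intensities to log2 ratios.

    intensities : raw hybridization values for one gene across arrays
    floor       : hypothetical lower bound applied before dividing,
                  avoiding log(0) and divide-by-zero errors
    """
    clipped = [max(v, floor) for v in intensities]
    average = sum(clipped) / len(clipped)
    return [math.log2(v / average) for v in clipped]

# A flat profile maps to all zeros.
print(to_log_ratios([2.0, 2.0, 2.0]))
```

After this step, single-color values are on the same log-ratio scale as two-color data, with the per-gene average playing the role of the reference channel.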

16 Clustering Analysis Distance metrics –Euclidean –Pearson –Spearman –… Algorithms –Hierarchical –K-means –SOM –…
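The three distance metrics listed on this slide can be written in a few lines each. The sketch below uses only the standard library; the function names are illustrative, and both correlation functions assume the input vectors are not constant (a zero-variance vector would divide by zero).

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

print(euclidean([0.0, 0.0], [3.0, 4.0]))
print(ranks([10, 20, 20, 30]))
```

Pearson and Spearman are similarities rather than distances; a common convention, assumed here rather than taken from the slides, is to cluster on 1 minus the correlation.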

17 Megaclustering Combining data from multiple sources can cause problems –Normalization differences –Technology differences –Noise biases Requires unified pre-processing and smart application of statistics

18 Apples to Apples Pearson correlation distributions not always normal –Large dependence on number of conditions 6 condition dataset 40 condition dataset Histograms of Pearson correlation coefficients

19 Apples to Apples Fisher’s Z-score transform normalizes the distributions –Z = ln[(1+r)/(1-r)] / 2, where r = Pearson corr. coeff. 6 condition dataset 40 condition dataset Histograms of Z-scores
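A minimal sketch of the Fisher transform follows, using its standard form z = ½ ln[(1 + r)/(1 − r)], which is the inverse hyperbolic tangent of r. The clamping parameter `eps` is an assumption added so that correlations of exactly ±1 do not produce an infinite value.

```python
import math

def fisher_z(r, eps=1e-6):
    """Fisher z-transform of a Pearson correlation coefficient r.

    eps : hypothetical clamp keeping r strictly inside (-1, 1)
    """
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

# The transform is odd and stretches values near +/-1.
for r in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(r, fisher_z(r))
```

Because the transformed values are approximately normal, they can be standardized within each dataset, which is what makes correlations from a 6-condition and a 40-condition dataset comparable.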

20 Evaluation Measurements Gene Ontology (GO) –Hierarchical organization of biological processes, molecular functions, and cellular components –Cross-organism structure, organism-specific annotations –Closest available approximation of a “gold standard” True Positives and False Positives can be defined from the ontology –Node size, depth, expert voting used for cutoffs

21 Precision / Recall Calculate and sort distances between all pairs of genes Determine a cutoff, all pairs below cutoff are predicted “true,” above “false” Given these predictions, can calculate precision and recall –Precision = TP / (TP + FP) –Recall = TP / TotalPositives Slide the cutoff from smallest to largest distance to create a curve of precision / recall pairs –Ramp down from few, high confidence predictions to many, low confidence predictions
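The sliding-cutoff procedure on this slide can be sketched as below. The function name and input layout are illustrative assumptions: `distances` holds one distance per gene pair and `labels` marks whether that pair counts as a true positive under the gold standard. Ties in distance are broken by sort order, and at least one positive pair is assumed.

```python
def precision_recall_curve(distances, labels):
    """Return (precision, recall) pairs as the cutoff slides outward.

    distances : distance for each gene pair (smaller = more confident)
    labels    : True if the pair is a gold-standard positive
    """
    total_positives = sum(1 for is_pos in labels if is_pos)
    ranked = sorted(zip(distances, labels), key=lambda pair: pair[0])
    tp = fp = 0
    curve = []
    for _, is_pos in ranked:
        # Moving the cutoff past this pair predicts it "true".
        if is_pos:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_positives))
    return curve

curve = precision_recall_curve([0.1, 0.2, 0.3, 0.4],
                               [True, False, True, False])
for precision, recall in curve:
    print(round(precision, 3), recall)
```

The first points on the curve correspond to the few, high-confidence predictions; the last point predicts every pair "true", so recall reaches 1 and precision falls to the overall fraction of positives.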

22 Example Precision/Recall of various data types

23 Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

24 Functional Biases Microarray experiments often targeted at a particular process, pathway, or function However, several “global” signals are often present –Ribosomal response –General Stress Response Some datasets do contain more targeted “local” signals as well

25 Ribosome Bias Precision/Recall of various data types

26 Ribosome Bias Precision/Recall excluding Ribosome Biogenesis

27 Process-specific P/R Can generate PR-curves on a per-GO term basis –TPs are pairs of genes annotated to term –FPs are pairs with one gene in term, with smallest common ancestor in very large term –Normalize by size of GO term Results for individual data sets can expose functional biases

28 Per-dataset Biases Typical Results

29 Per-dataset Biases Poor Results

30 Per-dataset Biases Diverse Results

31 Z-test for significance Difference between pair-wise distances for all genes in a term vs. background
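The slide does not give the test statistic, so the following is only one plausible form: a one-sample Z-test comparing the mean pairwise score within a GO term to the background mean and standard deviation. The function name, the use of similarity scores where larger means more related, and the upper-tail p-value are all assumptions.

```python
import math

def z_test(term_scores, background_mean, background_sd):
    """One-sample Z-test of within-term scores against the background.

    term_scores     : pairwise scores for gene pairs inside one GO term
    background_mean : mean score over all gene pairs
    background_sd   : standard deviation over all gene pairs
    Returns (z, upper-tail p-value).
    """
    n = len(term_scores)
    mean = sum(term_scores) / n
    z = (mean - background_mean) / (background_sd / math.sqrt(n))
    # Upper-tail probability of a standard normal, via the complementary error function.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

z, p = z_test([1.0, 1.0, 1.0, 1.0], background_mean=0.0, background_sd=1.0)
print(z, p)
```

Running this test for every (dataset, GO term) combination yields a matrix of p-values, one cell per combination.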

32 A Global View Z-test P-values Columns - datasets Rows - GO terms Red at a cutoff of 10^-10

33 A Global View

34

35 A Local View

36

37 Outline Microarray methodology Analysis concerns Functional Biases Improved Approaches Preliminary Conclusions

38 Bi-clustering Traditional clustering will be driven by “global” signals and ignore “local” signals Bi-clustering identifies groups of genes and conditions rather than just genes Traditional clustering Bi-clustering

39 Bi-clustering goals/issues Better capture biological reality –Genes only cooperate in certain conditions –Genes can have multiple functions –Datasets have functional biases Computationally difficult problem –Reducible to bi-clique finding NP-complete Heuristics, simplifications, approximations –e.g. δ-biclusters, SAMBA, PISA

40 Bi-clustering goals/issues Microarray noise can lead to spurious output –As compendiums increase in size, patterns by chance increase –Datasets have “smallest logical groupings” Restrict co-expression to these groups Long running times + large result sets –Difficult to validate results –Scientifically frustrating

41 Query-driven approach Allow users to specify a starting point for search –Leverages expert knowledge of domain –Known to be useful in other contexts bioPIXIE Identify conditions/datasets of interest based on the set of query genes Expand query set to include additional related genes in these conditions

42 Query-driven approach Reduces problem complexity to allow for real-time results Fast results allow for user-driven refinement of search criteria Extensible to larger data compendiums and more complex organisms –Locality sensitive hashing –Pre-processing

43 Query Weighting Identify data conditions related in query set –Average correlation, distance, etc. –Signal to Noise ratio of query –Centroid significance Additional genes related to query –Correlation, distance, etc. weighted by identified condition sets

44 Simple Scheme Weighted by correlation of query

45 Simple Scheme Results, weighted sum of correlation to query decreasing correlation
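The simple scheme on these two slides (weight each dataset by how well the query genes correlate in it, then rank other genes by a weighted sum of correlation to the query) might be sketched as follows. This is an illustrative reconstruction, not the presenter's implementation. The assumptions are: each dataset is a dict from gene name to expression vector, every dataset contains every gene, the query has at least two genes, no vector is constant, and negative dataset weights are clipped to zero. All function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dataset_weight(dataset, query):
    """Average pairwise correlation among query genes, clipped at zero."""
    pairs = list(combinations(query, 2))
    average = sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)
    return max(average, 0.0)

def rank_genes(datasets, query):
    """Rank non-query genes by weighted mean correlation to the query."""
    weights = [dataset_weight(d, query) for d in datasets]
    total = sum(weights)
    scores = {}
    for gene in datasets[0]:
        if gene in query:
            continue
        score = 0.0
        for dataset, w in zip(datasets, weights):
            mean_corr = sum(pearson(dataset[gene], dataset[q]) for q in query) / len(query)
            score += w * mean_corr
        scores[gene] = score / total if total else 0.0
    # Decreasing correlation, as on the results slide.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

coherent = {"q1": [1, 2, 3, 4], "q2": [2, 4, 6, 8],
            "g1": [1, 2, 3, 5], "g2": [4, 3, 2, 1]}
incoherent = {"q1": [1, 2, 3, 4], "q2": [4, 3, 2, 1],
              "g1": [4, 3, 1, 2], "g2": [1, 2, 3, 4]}
print(rank_genes([coherent, incoherent], ["q1", "q2"]))
```

In the toy example the second dataset receives zero weight because the query genes are anti-correlated there, so the ranking is driven entirely by the first dataset.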

46 Ongoing Work Compare query weighting schemes UI challenges Scalability concerns –Indexing, Locality Sensitive Hashing –Human data Assess biological usefulness

47 Preliminary Conclusions Noise, functional biases, collection sizes require consideration in microarray analysis Evaluation metrics can be influenced by biases creating misleading results Query-driven approaches show promise –Targeted search –Computational feasibility / Real-time results –Extensibility

48 Acknowledgements Olga Troyanskaya Chad Myers Curtis Huttenhower Kai Li and lab Botstein and Kruglyak labs Kara Dolinski, Maitreya Dunham Jessy

