Interpreting Microarray Expression Data Using Text Annotating the Genes Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik.


Interpreting Microarray Expression Data Using Text Annotating the Genes Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik University of Wisconsin – Madison

The Basic Task Given Microarray Expression Data & Text Annotations of Genes Generate Model of Expression

Motivation Lots of Data Available on the Internet –Microarray Expression Data –Text Annotations of Genes Maybe we can Make the Scientist’s Job Easier –Generate a Model of Expression Automatically –Easier First Step for the Human

Microarray Expression Data Each spot represents a gene in E. coli Colors Indicate Up- or Down-Regulation Under Antibiotic Shock For Our Purposes, 3 Classes –Up-Regulated –Down-Regulated –No-Change

Microarray Expression Data From “Genome-Wide Expression in Escherichia coli K-12”, Blattner et al., 1999

Our Microarray Experiment 4290 genes: 574 up-regulated, 333 down-regulated, 2747 un-regulated, 636 with not enough signal

Text Annotations of Genes The text from a sample SwissProt entry (b1382) –The “description” field HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION –The “keyword” field HYPOTHETICAL PROTEIN

Sample Rules From a Model for Up-Regulation IF (the annotation contains FLAGELLAR AND does NOT contain HYPOTHETICAL) OR (the annotation contains BIOSYNTHESIS) THEN the gene is up-regulated
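The sample rule set above is a disjunction of conjunctions over words in the annotation, so it can be checked mechanically. The sketch below is only an illustration: the tokenizer, the `UP_RULES` representation, and the function names are assumptions, not part of the original system.

```python
import re

# Each rule is (required_words, forbidden_words). A gene is predicted
# up-regulated if ANY rule fires. This encoding is an assumption made
# for illustration; the slides do not show the internal representation.
UP_RULES = [
    ({"FLAGELLAR"}, {"HYPOTHETICAL"}),
    ({"BIOSYNTHESIS"}, set()),
]

def tokenize(annotation):
    """Split annotation text into a set of upper-case word tokens."""
    return set(re.findall(r"[A-Z0-9.]+", annotation.upper()))

def predicts_up(annotation, rules=UP_RULES):
    """Return True if at least one rule matches the annotation."""
    words = tokenize(annotation)
    return any(req <= words and not (forb & words) for req, forb in rules)
```

Applied to the SwissProt description shown earlier for b1382 (HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION), neither rule fires, so the gene would not be predicted up-regulated by this rule set.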

Why use Machine Learning? Concerned with machines learning from available data Informed by text data, the learner can make a first-pass model for the scientist

Desired Properties of a Model Accurate –Measure with cross validation Comprehensible –Measure with model size Stable to Small Changes in the Data –Measure with random subsampling
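Accuracy measured by cross validation can be sketched as follows. The number of folds, the shuffling, and the `train`/`predict` callables are assumptions for illustration; the slides do not state how the folds were built.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(examples, labels, train, predict, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results.

    `train(xs, ys)` returns a model and `predict(model, x)` returns a label;
    both are hypothetical callables supplied by the caller.
    """
    correct = 0
    for held_out in k_fold_indices(len(examples), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(examples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train(train_x, train_y)
        correct += sum(predict(model, examples[i]) == labels[i] for i in held_out)
    return correct / len(examples)
```

Every gene is used for testing exactly once, so the pooled accuracy is an estimate of how the model would do on annotations it has not seen.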

Approaches Naïve Bayes –Statistical method –Uses all of the words (present or absent) PFOIL –Covering algorithm –Chooses words to use one at a time

Naïve Bayes For each word $w_i$, there are two likelihood ratios (lr): $lr(w_i\ \text{present}) = p(w_i\ \text{present} \mid \text{up}) / p(w_i\ \text{present} \mid \text{down})$ and $lr(w_i\ \text{absent}) = p(w_i\ \text{absent} \mid \text{up}) / p(w_i\ \text{absent} \mid \text{down})$. For each annotation, the lrs are combined to form a lr for a gene: $lr(\text{gene}) = \prod_i lr(w_i\ X)$, where X is either present or absent.
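A minimal sketch of these likelihood ratios, assuming the per-word ratios are combined by multiplication as in standard naive Bayes (done here as a sum of logs to avoid underflow). The add-one smoothing and all function names are assumptions and are not taken from the slides.

```python
import math
from collections import Counter

def train_word_lrs(up_docs, down_docs, vocab):
    """Return {word: (lr_present, lr_absent)} for up vs. down genes.

    Each doc is an iterable of words from one gene's annotation.
    Add-one smoothing (an assumption) keeps every ratio finite.
    """
    up_counts = Counter(w for doc in up_docs for w in set(doc))
    down_counts = Counter(w for doc in down_docs for w in set(doc))
    lrs = {}
    for w in vocab:
        p_up = (up_counts[w] + 1) / (len(up_docs) + 2)
        p_down = (down_counts[w] + 1) / (len(down_docs) + 2)
        lrs[w] = (p_up / p_down, (1 - p_up) / (1 - p_down))
    return lrs

def gene_log_lr(doc, lrs):
    """Log of the combined ratio: positive favors up, negative favors down."""
    words = set(doc)
    return sum(math.log(present if w in words else absent)
               for w, (present, absent) in lrs.items())

# Toy annotations, invented for illustration only.
up_docs = [["FLAGELLAR", "PROTEIN"], ["FLAGELLAR", "MOTOR"]]
down_docs = [["HYPOTHETICAL", "PROTEIN"], ["HYPOTHETICAL"]]
vocab = {"FLAGELLAR", "HYPOTHETICAL", "PROTEIN", "MOTOR"}
lrs = train_word_lrs(up_docs, down_docs, vocab)
```

Note that absent words contribute to the score as well as present ones, which is why this approach uses every word in the vocabulary for every gene.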

PFOIL Learn rules from data Produces multiple if-then rules from data Builds rules by adding one word at a time Easy to interpret models
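The covering idea (grow one rule a word at a time, remove the positives it covers, repeat) can be sketched as below. This is a simplified illustration, not the authors' PFOIL: the Laplace-smoothed precision score, the size limits, and the restriction to "word present" tests are all assumptions.

```python
def learn_rule(pos, neg, vocab, max_words=3):
    """Greedily grow one conjunction of required words.

    pos and neg are lists of word sets; vocab is a set of candidate words.
    """
    rule = set()
    # Add at least one word, then keep adding while negatives are still covered.
    while len(rule) < max_words and (neg or not rule):
        def score(w):
            p = sum(1 for d in pos if w in d)
            n = sum(1 for d in neg if w in d)
            return (p + 1) / (p + n + 2)  # smoothed precision (assumed heuristic)
        candidates = [w for w in vocab - rule if any(w in d for d in pos)]
        if not candidates:
            break
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        rule.add(best)
        pos = [d for d in pos if best in d]
        neg = [d for d in neg if best in d]
    return rule

def learn_rule_set(pos, neg, vocab, max_rules=5):
    """Covering loop: learn a rule, drop the positives it covers, repeat."""
    rules = []
    remaining = list(pos)
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining, list(neg), vocab)
        if not rule:
            break
        rules.append(rule)
        remaining = [d for d in remaining if not rule <= d]
    return rules
```

Each returned rule is a set of words that must all appear in the annotation; a gene is predicted positive if any rule matches. Rules with negated words, such as "does NOT contain HYPOTHETICAL", would need an extra kind of test that this sketch omits.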

Accuracy/Comprehensibility Tradeoff

Stabilized PFOIL Repeatedly run PFOIL on randomly sampled subsets For each word, count the number of models it appears in Restrict PFOIL to only those words that appear in a minimum of m models Rerun PFOIL with only those words
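The stabilization loop can be sketched independently of the rule learner. Here `learner` is a hypothetical callable that takes positive examples, negative examples, and a vocabulary, and returns a list of rules (each a set of words). The number of runs, the sampling fraction, and the `toy_learner` stand-in are assumptions for illustration.

```python
import random
from collections import Counter

def stable_vocabulary(pos, neg, vocab, learner, runs=20, fraction=0.5, m=10, seed=0):
    """Return the words that appear in at least m of `runs` subsampled models."""
    rng = random.Random(seed)
    appearances = Counter()
    for _ in range(runs):
        sub_pos = rng.sample(pos, max(1, int(len(pos) * fraction)))
        sub_neg = rng.sample(neg, max(1, int(len(neg) * fraction)))
        rules = learner(sub_pos, sub_neg, vocab)
        # Count each word once per model, however many rules it appears in.
        appearances.update(set().union(*rules) if rules else set())
    return {w for w, c in appearances.items() if c >= m}

def toy_learner(pos, neg, vocab):
    """Stand-in for PFOIL: one single-word rule, the most common
    positive-only word. Exists only to make the sketch runnable."""
    neg_words = set().union(*neg) if neg else set()
    counts = Counter(w for d in pos for w in d if w in vocab and w not in neg_words)
    return [{counts.most_common(1)[0][0]}] if counts else []
```

The final step from the slide is then a single call such as `learner(pos, neg, stable_vocabulary(pos, neg, vocab, learner))`, which reruns the learner on the full data with the restricted word list.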

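Stability Measure After running the algorithm N times to generate N rule sets, stability is computed from the following quantities (the formula itself was an image and is not preserved in this transcript): U = the set of words appearing in any rule set; count($w_i$) = number of rule sets containing word $w_i$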
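The quantities U and count(w_i) are straightforward to compute from N rule sets. The sketch below computes them and then combines them with one plausible normalization, the mean fraction of rule sets containing each word in U. That normalization is an assumption made for illustration and may differ from the formula on the original slide.

```python
from collections import Counter

def word_counts(rule_sets):
    """count(w_i): the number of rule sets in which word w_i appears.

    rule_sets is a list of N rule sets; each rule set is a list of rules,
    and each rule is a set of words. The keys of the result form U.
    """
    counts = Counter()
    for rules in rule_sets:
        counts.update(set().union(*rules) if rules else set())
    return counts

def assumed_stability(rule_sets):
    """Sum of count(w_i) over U, divided by N * |U| (ASSUMED normalization).

    Equals 1.0 when every rule set uses exactly the same words and
    approaches 1/N when no word is shared between rule sets.
    """
    counts = word_counts(rule_sets)
    if not counts:
        return 0.0
    return sum(counts.values()) / (len(rule_sets) * len(counts))
```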

Accuracy/Stability Tradeoff

Discussion Not very severe tradeoffs in Accuracy –vs. stability –vs. comprehensibility PFOIL not as good at characterizing data –suggests not many dependencies –need for “softer” rules

Future Directions M of N rules Permutation Test More Sources of Text Data

Take-Home Message This is just a first step toward an aid for understanding expression data Make expression models based on text instead of DNA sequence.

Acknowledgements This research was funded by the following grants: NLM 1 R01 LM , NSF IRI , NIH 2 P30 CA , and NIH 5 T32 GM08349.