Nucleotide Level

We define four statistics to describe how results are scored at the nucleotide level. If a base is part of an actual site and is predicted by the tool, it is a true positive. If it is only predicted, it is a false positive. If it is part of an actual site and not predicted, it is a false negative. If it is neither predicted nor part of an actual site, it is a true negative.

Site Level

We define three measures to describe how results are scored at the site level, together with a threshold that defines what percentage of a site must be predicted for the prediction to count as correct. If a site is an actual site and at least the threshold percentage of it (6.7% for these tests) is predicted by the tool, it is a true positive. If a site is only predicted, it is a false positive. If it is an actual site and is not predicted, it is a false negative. True negatives are not counted at the site level.

Phylogenetic vs. non-Phylogenetic

Some TFBS detection methods take prediction one step further. Similar to allowing students to use supplemental material on a test, these programs make use of orthologous sequence data and a phylogenetic tree (both generated by the user) in an attempt to raise sensitivity and specificity. To incorporate phylogenetic programs into MTAP, we implement methods for automatically generating orthologous sequence data and phylogenetic trees based on ribosomal RNA.

Results

Program outputs and scoring yielded a wide array of observations on current TFBS detection methods. Some key observations include the following:

Our method of automation returns sensitivity and specificity that are comparable to “hand-made” results. Figure 4 shows the nucleotide and site sensitivity, and the nucleotide specificity, for Tompa’s “hand-made” method versus our automated method. The graph suggests that our automated method returns results comparable to Tompa’s, which is evidence that our method is reliable for producing accurate results.
Nucleotide and site sensitivity are higher overall with our method, suggesting that automation is an acceptable alternative to manual assessment.

Phylogenetic programs are comparable in performance to non-phylogenetic programs over RegulonDB. Figure 5 depicts statistics for phylogenetic vs. non-phylogenetic programs over sequences 400 bp upstream of the coding sequence from RegulonDB, using genomes NC_000913, NC_007946, and AC_. The phylogenetic program PhyloGibbs compares favorably with MEME, Weeder, and AlignAce. PhyME returns low sensitivity, perhaps due to its reluctance to make predictions. More evaluations over different genome selections are needed to explore the benefits of phylogenetic approaches. Regardless, this illustrates the capabilities of MTAP for evaluating different classes of algorithms.

Abstract

Transcription factor binding sites (TFBS) regulate the expression of genes in the cell. Their discovery remains one of the most challenging problems in molecular biology. To solve this problem, traditional techniques have been supplemented by the development of computational prediction methods. Today over 100 algorithms exist for detecting TFBS in various problem domains, yet it remains unclear how to assess their performance. Comparing these tools across datasets is nearly impossible due to a lack of standards for running programs and reporting results. This work proposes a novel method for standardizing runtime procedures and assessing tool performance. To compare the methods appropriately, a standard reporting format was developed for each tool. We developed a pipeline framework to integrate, score, and evaluate TFBS identification for each method. We evaluated 9 computational methods for detecting TFBS and obtained statistics that describe their performance. These results allowed us to rank each method by performance on a range of datasets. Computational detection of TFBS remains a challenging problem.
Sensitivity and specificity remain low for the algorithms regardless of dataset. Our method for comparing tools exposes a unique view into the behavior of TFBS detection and enables the improvement of methods in the future.

Motivation

One of the most complex problems facing scientists today is the prediction of TFBS patterns in silico. Over 100 computational prediction tools have been developed to address the problem of predicting TFBS. As a result, it can be unclear to scientists which available method is right for their specific domain. A method that performs poorly on one dataset may perform better on another, but until recently there had been no way to determine this quickly. Comparing even a small set of TFBS prediction tools manually requires extensive knowledge of each program, hours of manual computation, and the ability to convert a number of different input/output formats into a format that is easily scored.

Our objective is to make the process of comparing and evaluating tool performance automated and accurate. Our work allows many popular, publicly available tools to be analyzed on a number of different datasets and parameters over a period of a few days. The Motif Tool Assessment Platform, or MTAP, allows scientists to learn more about which TFBS detection tool is right for their area of expertise. We are able to show that when tools are run through an automated pipeline, they perform just as well as (if not better than) when they are executed manually. MTAP converts results for each tool we examine into one standard format. This format can be used for evaluation, visualization, and integrative prediction. On a representative set of data, we are able to give insight into which programs perform best in varying domains. Finally, we can suggest the overall areas where TFBS prediction is lacking and where TFBS prediction algorithms perform well.
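To make the idea of scoring predictions held in one standard format more concrete, here is a minimal Python sketch. It is not MTAP's code or its actual output format: the `Site` record, the function names, and the toy data are all hypothetical. It applies the true/false positive definitions from the Nucleotide Level and Site Level sections, and it assumes the usual definitions of sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP). It also assumes that the site-level threshold is checked pairwise, one actual site against one predicted site, which the poster does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical standard record: a site on one sequence, half-open [start, end)."""
    seq_id: str
    start: int
    end: int


def covered_positions(sites):
    """Return the set of (seq_id, position) pairs covered by the given sites."""
    return {(s.seq_id, p) for s in sites for p in range(s.start, s.end)}


def nucleotide_counts(actual, predicted, seq_lengths):
    """Count TP, FP, FN and TN bases following the nucleotide-level definitions."""
    act = covered_positions(actual)
    pred = covered_positions(predicted)
    tp = len(act & pred)   # in an actual site and predicted
    fp = len(pred - act)   # only predicted
    fn = len(act - pred)   # in an actual site, not predicted
    tn = sum(seq_lengths.values()) - tp - fp - fn  # neither
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def overlap_fraction(actual_site, predicted_site):
    """Fraction of the actual site's bases covered by one predicted site.

    Assumes the actual site is non-empty (end > start).
    """
    if actual_site.seq_id != predicted_site.seq_id:
        return 0.0
    shared = (min(actual_site.end, predicted_site.end)
              - max(actual_site.start, predicted_site.start))
    return max(shared, 0) / (actual_site.end - actual_site.start)


def site_counts(actual, predicted, threshold):
    """Count TP, FP and FN sites; true negatives are not defined at this level."""
    tp = sum(1 for a in actual
             if any(overlap_fraction(a, p) >= threshold for p in predicted))
    fp = sum(1 for p in predicted
             if not any(overlap_fraction(a, p) >= threshold for a in actual))
    return {"TP": tp, "FP": fp, "FN": len(actual) - tp}


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Toy data: one 20 bp sequence, one annotated site, two predictions.
actual = [Site("s1", 5, 15)]
predicted = [Site("s1", 10, 18), Site("s1", 0, 3)]

nuc = nucleotide_counts(actual, predicted, {"s1": 20})
sites = site_counts(actual, predicted, threshold=0.067)  # 6.7%, as quoted above

print(nuc)    # {'TP': 5, 'FP': 6, 'FN': 5, 'TN': 4}
print(sites)  # {'TP': 1, 'FP': 1, 'FN': 0}
print("nucleotide sensitivity:", ratio(nuc["TP"], nuc["TP"] + nuc["FN"]))
print("nucleotide specificity:", ratio(nuc["TN"], nuc["TN"] + nuc["FP"]))
print("site sensitivity:", ratio(sites["TP"], sites["TP"] + sites["FN"]))
```

Representing every tool's predictions as the same kind of record is what lets a single scoring routine serve all tools; only the step that produces the records needs to differ per tool.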
Methods

The Motif Tool Assessment Platform (MTAP) is a platform for the integration of motif discovery tools. To accommodate the wide diversity of libraries and dependencies, the MTAP architecture was implemented using Java, Perl, Python, and C++. To handle computational demands, MTAP evaluates prediction tools simultaneously on a computer cluster. To ensure that our automation process reported results similar to those from manual investigation, we included 5 of the 13 tools from Tompa et al.’s evaluation of TFBS detection programs [1]. The remaining 8 methods were not included because they were unobtainable, obsolete, or existed only as web versions. To expand our methods beyond what had already been evaluated, we included additional tools based on popularity, availability, and the ability to be executed on the command line.

Input / Output

Input datasets were created using peer-reviewed public databases. Data was collected from RegulonDB, DBTBS, and Tompa et al.’s assessment on fly, human, yeast, and mouse. Outputs vary from tool to tool, so a custom parsing class was designed for each individual tool. The results are stored in the pipeline scaffold framework for scoring, and in a standard output format for later retrieval.

A Novel Approach for Integrating and Assessing Computational Tools to Detect Transcription Factor Binding Sites
Kathryn Dempsey*, Daniel Quest*°, Mohammad Shafiullah*, Dhundy Bastola* and Hesham Ali*°
*College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE
°Department of Pathology & Microbiology, University of Nebraska Medical Center, Omaha, NE

MTAP gives insight into program performance over different problem features. MTAP was run over the same 400 bp sequences generated with RegulonDB as in Figure 5. Figure 6 was produced with MTAP by plotting the ratio of the information content found in the motif to the background sequence size against site sensitivity (sSn).
Increasing the information content of the motif and decreasing the background sequence noise is expected to improve tool prediction performance. However, not all tools are able to take advantage of this feature. Some methods, such as AnnSpec, AlignAce, MEME, and Weeder (MEME and Weeder are highlighted), exploit an increase in information. As a result, their site sensitivity increases with the number of sequences. Other programs do not appear to take advantage of the amount of information given. This highlights one area where efforts to improve TFBS prediction can be focused.

Discussion and Conclusions

Computational detection of TFBS remains a challenging problem. Automation of computational TFBS prediction makes assessment possible where there had been no genome-wide means of assessment before. Overall, sensitivity and specificity remain low for the algorithms regardless of dataset, but by examining the patterns exposed by our work we can better suggest critical areas for improvement. We have shown that automating this process is not only possible, but also performs just as well as when runs are executed by hand. Automation cuts down the amount of time and labor required, and it also eliminates opportunities for human error when running the tools, scoring the output, and so on.

We have also shown that phylogenetic programs do not always outperform non-phylogenetic programs. These results raise more questions about phylogenetic programs and the additional information used in their implementation, such as the number and proximity of orthologs used and the size of the phylogenetic tree. All of these can now be investigated with the aid of MTAP. Finally, we have presented representative datasets that give a glimpse of the many different ways in which TFBS prediction tools can be ranked and compared. Our method for comparing tools exposes a unique view into the behavior of TFBS detection and enables the improvement of methods in the future.

References

[1] M. Tompa et al., Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites. Nature Biotechnology, vol. 23, no. 1, January 2005.
[2] H. Salgado et al., RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32 (2004), no. 1.
[3]

Acknowledgement

This project was supported by the NIH grant number P20 RR from the INBRE Program of the National Center for Research Resources. We would like to thank the developers of the motif detection tools for their help and for making their tools available.