A Novel Knowledge Based Method to Predicting Transcription Factor Targets


Background
The regulation of target gene expression by transcription factors is a major aspect of transcriptional regulation. Extensive efforts have been made to discover the target genes of transcription factors in both wet and dry labs. Since transcription factors and their targets may participate in the same biological pathways and share similar biological functions, we can infer regulatory relationships by analyzing Gene Ontology annotations and potential transcription factor DNA binding sites. Hence, the computational method we developed to predict potential transcription factor targets could be useful in research on transcriptional regulatory mechanisms.

Background
TF-TFBS-TFT triplets
–Transcription factors (TFs) regulate transcription factor targets (TFTs) by binding to transcription factor DNA binding sites (TFBSs).

Background
Our prediction strategy
[Diagram: the TF (GO encoding), the TFBS (0/1 encoding) and the TFT (GO encoding) are combined in the hybridization space; the predictor then answers True or False to the statement that the TF regulates the TFT by binding to the TFBS.]

Materials and Methods
Positive dataset
–The original dataset of transcription factors, their targets and their binding sites came from TRANSFAC v7.0. It was filtered in the following steps:
(1) Remove the TFs and TFTs with no SwissProt accessions.
(2) Remove the TFBSs shorter than 5 bp or longer than 25 bp.
(3) Finally, a positive dataset of 3430 TF-TFBS-TFT triplets, covering 143 TFs, 571 TFBSs and 1416 TFTs, was built.
Negative dataset
–The negative dataset was randomly generated from the positive dataset in the following steps:
(1) Random numbers i, j and k were drawn from uniform distributions on the intervals [1,143], [1,571] and [1,1416], respectively.
(2) The ith TF, jth TFBS and kth TFT were selected from the positive dataset and combined into a new triplet. The new triplet was discarded if it appeared in the original positive dataset; otherwise it was added to the negative dataset.
(3) Steps 1 and 2 were repeated until the size of the negative dataset reached 6860, twice that of the positive dataset.
(4) Finally, a negative dataset of 6860 TF-TFBS-TFT triplets, covering 140 TFs, 559 TFBSs and 1317 TFTs, was obtained.
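The negative-sampling steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `build_negative_set`, the toy identifiers and the fixed seed are hypothetical, and keeping each sampled triplet only once is an assumption, since the slides do not say how repeated draws were handled.

```python
import random

def build_negative_set(positives, n_negative, seed=0):
    """Sample random TF-TFBS-TFT triplets that are absent from the positive set."""
    rng = random.Random(seed)
    # Distinct TFs, TFBSs and TFTs that occur in the positive triplets.
    tfs = sorted({t[0] for t in positives})
    sites = sorted({t[1] for t in positives})
    targets = sorted({t[2] for t in positives})
    positive_set = set(positives)
    # Guard against asking for more triplets than can exist.
    capacity = len(tfs) * len(sites) * len(targets) - len(positive_set)
    if n_negative > capacity:
        raise ValueError("not enough distinct non-positive triplets")
    negatives = set()  # assumption: a repeated draw is kept only once
    while len(negatives) < n_negative:
        triplet = (rng.choice(tfs), rng.choice(sites), rng.choice(targets))
        if triplet not in positive_set:  # discard known positive triplets
            negatives.add(triplet)
    return sorted(negatives)

# Toy positive set with hypothetical identifiers.
positives = [("TF1", "S1", "G1"), ("TF2", "S2", "G2"), ("TF3", "S3", "G3")]
negatives = build_negative_set(positives, 2 * len(positives))
```

As in the slides, the requested size is twice the number of positive triplets.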

Numeric representation system
TF Gene Ontology representation system
–Functional annotations of TFs provided by GO were obtained by using the Uniprot2GO mapping provided by GOA Uniprot 34.0 (November 21st, 2005).
–Each TF can be represented as a 9525D (dimensional) vector by using each of the 9525 GO items as a vector base: if a given TF hits the GO item that is the ith of the 9525 GO items, the ith component of the 9525D vector is set to 1; otherwise it is set to 0.
–Thus, the TF sample can be formulated as $T = (t_1, t_2, \ldots, t_{9525})$, where $t_i = 1$ if the TF hits the ith GO item and $t_i = 0$ otherwise.

Numeric representation system
TFTs are encoded by using the same approach as TFs
–Each TFT can be represented as a 9525D (dimensional) vector by using each of the 9525 GO items as a vector base: $G = (g_1, g_2, \ldots, g_{9525})$, where $g_i = 1$ if the TFT hits the ith GO item and $g_i = 0$ otherwise.
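A minimal sketch of this GO-based 0/1 encoding, which applies to TFs and TFTs alike. The function name `go_vector` and the four-term vocabulary are hypothetical stand-ins; the real vector base is the list of 9525 GO items, and the annotations would come from the Uniprot2GO mapping.

```python
def go_vector(annotations, go_items):
    """Return a 0/1 vector with one component per GO item."""
    annotated = set(annotations)
    return [1 if item in annotated else 0 for item in go_items]

# Hypothetical toy vocabulary standing in for the 9525 GO items.
go_items = ["GO:0003677", "GO:0003700", "GO:0005634", "GO:0006355"]

# A protein annotated with the second and fourth items of the vocabulary.
tf_vector = go_vector({"GO:0003700", "GO:0006355"}, go_items)
print(tf_vector)  # [0, 1, 0, 1]
```

The order of `go_items` fixes the meaning of each component, so the same list has to be used for every TF and every TFT.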

Numeric representation system
Short nucleotide TFBS sequences are encoded using the 0/1 encoding system, which can be summarized as follows
–Firstly, TFBSs shorter than 25 bp are extended to exactly 25 bp by adding ‘N’ suffixes, e.g. ‘CCCCACGTAGCTAGACGTAG’ is extended to ‘CCCCACGTAGCTAGACGTAGNNNNN’; TFBSs that are exactly 25 bp long are left unchanged.
–Then, each TFBS can be represented as a 100D (dimensional) vector, e.g. ‘CCCCACGTAGCTAGACGTAGNNNNN’ is represented by the 100D binary vector 0010'0010'0010'0010'0001'0010'0100'1000'0001'0100'0010'1000'0001'0100'0001'0010'0100'1000'0001'0100'0000'0000'0000'0000'0000, in which each nucleotide is encoded with a 4D binary vector: A = 0001, C = 0010, G = 0100, T = 1000 and N = 0000.
–Finally, each TFBS can be formulated as $D = (d_1, d_2, \ldots, d_{100})$, where each $d_i$ can be either 0 or 1.
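The 0/1 encoding of a binding site can be sketched as below. The per-nucleotide codes are read off the worked binary string on the slide (C = 0010, A = 0001, G = 0100, T = 1000, padding N = 0000); the function name `encode_tfbs` and the error handling are assumptions made for illustration.

```python
# 4-bit codes inferred from the worked example; 'N' is the padding symbol.
CODES = {"A": "0001", "C": "0010", "G": "0100", "T": "1000", "N": "0000"}

def encode_tfbs(site, width=25):
    """Pad a binding site with 'N' to `width` bases and return 4 * width bits."""
    if len(site) > width:
        raise ValueError("binding site is longer than the encoding width")
    padded = site.upper().ljust(width, "N")
    return [int(bit) for base in padded for bit in CODES[base]]

bits = encode_tfbs("CCCCACGTAGCTAGACGTAG")
print(len(bits))  # 100
```

Sites shorter than 25 bp end in all-zero blocks, so the padding adds nothing to an inner product between two encoded sites.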

The hybridization space
To facilitate predicting the interactions between TFs and TFTs, a numeric representation covering the whole TF-TFBS-TFT triplet is developed. This can be done as follows. Suppose $T_x$, $D_y$ and $G_z$ are the xth TF, yth TFBS and zth TFT, respectively. The x−y−z TF-TFBS-TFT triplet can be expressed as $TDG(x, y, z) = T_x \oplus D_y \oplus G_z$, where $\oplus$ denotes vector concatenation, giving a vector of 9525 + 100 + 9525 = 19150 dimensions.
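A short sketch of how a triplet could be assembled in the hybridization space, assuming the three encodings are simply concatenated in TF, TFBS, TFT order (an assumption that matches the 19150 dimensions reported in the results: 9525 + 100 + 9525). The function name `hybrid_vector` and the placeholder vectors are hypothetical.

```python
GO_DIM = 9525    # number of GO items used as the vector base
SITE_DIM = 100   # 25 nucleotides x 4 bits

def hybrid_vector(tf_vec, tfbs_vec, tft_vec):
    """Concatenate the TF, TFBS and TFT encodings into one triplet vector."""
    return list(tf_vec) + list(tfbs_vec) + list(tft_vec)

# Placeholder encodings with one arbitrary component set in each part.
tf_vec = [0] * GO_DIM
tf_vec[3] = 1
tfbs_vec = [0] * SITE_DIM
tfbs_vec[2] = 1
tft_vec = [0] * GO_DIM
tft_vec[7] = 1

triplet = hybrid_vector(tf_vec, tfbs_vec, tft_vec)
print(len(triplet))  # 19150
```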

The hybridization space
[Diagram: the predictor maps a triplet in the hybridization space to 1 (the TF regulates the TFT through binding the TFBS) or 0 (it does not).]

The predictor
The Nearest Neighbor Algorithm
–Once the numeric representation is built, the predictor used in this contribution can be briefly described as follows. Suppose there are N triplets (R1, R2, ..., Ri, ..., RN) with known classification labels (L1, L2, ..., Li, ..., LN), where Li ∈ {true, false}; true indicates that the TF indeed acts on the TFT through the TFBS, and false indicates otherwise.
–Given a novel triplet R, is it true? To investigate this problem, the distance D(R, Ri) (1 ≤ i ≤ N) is defined as $D(R, R_i) = 1 - \frac{R \cdot R_i}{\|R\| \, \|R_i\|}$, where $R \cdot R_i$ is the inner product of R and Ri, and $\|R\|$ and $\|R_i\|$ are the moduli of R and Ri, respectively.

The predictor
The Nearest Neighbor Algorithm
–Once the distances are calculated, R is predicted to belong to the same category as its nearest neighbor.
–A tie occurs when there is more than one nearest neighbor.
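The nearest-neighbor rule can be sketched as follows. Two points are assumptions rather than the authors' stated choices: the distance is taken to be one minus the cosine similarity, built from the inner product and moduli named in the slides, and a tie is resolved by taking the first nearest neighbor found, because the slides do not say how ties were handled. All names are hypothetical.

```python
import math

def cosine_distance(r, ri):
    """1 - (r . ri) / (|r| |ri|); assumed form of the distance D(R, Ri)."""
    dot = sum(a * b for a, b in zip(r, ri))
    norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in ri))
    if norm == 0:
        return 1.0  # assumption: an all-zero vector is treated as maximally distant
    return 1.0 - dot / norm

def nearest_neighbor_label(query, samples, labels):
    """Return the label of the training triplet closest to `query`."""
    # min() keeps the first minimum, so ties go to the earliest sample (assumption).
    best = min(range(len(samples)), key=lambda i: cosine_distance(query, samples[i]))
    return labels[best]

# Toy 4D vectors standing in for 19150D triplet vectors.
samples = [[1, 1, 0, 0], [0, 0, 1, 1]]
labels = [True, False]
prediction = nearest_neighbor_label([1, 0, 0, 0], samples, labels)
print(prediction)  # True
```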

Results and Discussion
Jackknife cross-validation test

Dataset        Positive            Negative            Overall
Success rate   2630/2693 = 97.6%   5100/5337 = 95.6%   7730/8030 = 96.3%

After excluding the transcription factors and transcription factor targets with no GO annotations, 19150D (9525 + 100 + 9525) vectors were built for 2693 true triplets and 5337 artificial triplets. The result was obtained with k set to 0.5.
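A jackknife test holds out each sample in turn and predicts it from all the remaining samples. The sketch below illustrates that loop with a nearest-neighbor rule on toy data; the distance (one minus cosine similarity), the tie handling (first nearest neighbor wins) and all names are assumptions, and the parameter k mentioned above is not modeled because the slides do not define it.

```python
import math

def distance(r, ri):
    """Assumed distance: 1 - cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(r, ri))
    norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in ri))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def jackknife_success_rate(samples, labels):
    """Leave each sample out once and predict it from the remaining ones."""
    correct = 0
    for i, query in enumerate(samples):
        rest = [j for j in range(len(samples)) if j != i]
        nearest = min(rest, key=lambda j: distance(query, samples[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy 4D vectors standing in for the 19150D triplet vectors.
samples = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = [True, True, False, False]
rate = jackknife_success_rate(samples, labels)
print(rate)  # 1.0
```

The success rates in the table are this ratio computed separately for the positive triplets, the negative triplets and all triplets together.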

Conclusion
Identifying the targets of transcription factors is a fundamental problem in transcriptional regulation research. In this contribution, a knowledge-based method was proposed to identify TF-TFT relationships by integrating Gene Ontology annotations and transcription factor DNA binding preferences. The predictor we built achieved an overall success rate of 96.3% in the jackknife test, which indicates that the computational method we developed could be a useful tool for research on transcriptional regulatory mechanisms.

Thank you