

Journal report: High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions. Paper by: Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie. Memorial Sloan-Kettering Cancer Center, NY. Presented by Yaron Orenstein for the ACGT group meeting, 19 January 2011.

Introduction – Biological Background Gene regulatory programs are orchestrated by transcription factors (TFs). These proteins usually bind to binding sites (BSs) in the promoter region and enable or impede transcription of the gene. Accurately modeling the DNA sequence preferences of TFs is a key piece in unraveling the regulatory code.

Modeling BSs: PSSM model The most popular model to represent binding sites is the position-specific scoring matrix (PSSM), which holds one score per nucleotide (A, C, G, T) at each motif position. These motifs may match thousands of sites in intergenic regions, producing an unreliable list of potential TF target genes.
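
To make the PSSM idea concrete, here is a minimal sketch of log-odds scoring. The 4-position matrix and the uniform background are made-up illustrative values, not taken from the paper, and the function names are hypothetical.

```python
import math

# Hypothetical per-position base probabilities for a 4-bp motif (illustrative only).
PROBS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def pssm_score(site):
    """Log-odds score of a site that has the same length as the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PROBS, site))


def best_site(sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(PROBS)
    windows = range(len(sequence) - width + 1)
    return max(((i, pssm_score(sequence[i:i + width])) for i in windows),
               key=lambda pair: pair[1])
```

Because each position contributes independently, many near-consensus windows in a long intergenic region can reach a moderate score, which is one way to see why PSSM hits can be unreliable as target predictions.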

All possible 8-mers model This model contains a list of all possible 8-mers ranked by TF preference. This information can be obtained, for example, from PBM data by calculating an enrichment score (E-score) for each 8-mer. The disadvantages are its large size and lack of interpretability. In addition, sequence similarities between 8-mers are not considered.

Protein Binding Microarray data A PBM array contains ~41,000 probe sequences of 35 bp each, covering all possible DNA 10-mers. For each probe, the binding intensity is reported.

Support vector regression Motivation: predict real values based on a feature set. Given a training set of pairs $(x_i, y_i)$, find a function $f$ which best predicts $y$. For example, if $f$ is linear, then $f(x) = \langle w, x \rangle + b$, where $w$ is the vector of feature weights. $\lVert w \rVert^2$ is minimized under some error constraints.
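
For reference, the standard $\varepsilon$-insensitive form of linear SVR with $f(x) = \langle w, x \rangle + b$ can be written as below. The slide does not say which variant was used, so the tolerance $\varepsilon$, the trade-off constant $C$, and the slack variables $\xi_i, \xi_i^*$ are assumptions of this sketch.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi,\, \xi^*} \quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i} \left( \xi_i + \xi_i^* \right) \\
\text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0, \quad \xi_i^* \ge 0 .
\end{aligned}
```

Errors smaller than $\varepsilon$ cost nothing, and larger errors are penalized linearly through the slack variables.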

Example for SVR A simple way to predict binding intensity from PBM data is to use 8-mer features. Use an indicator feature for each 8-mer: – 1 if sequence x contains the 8-mer. – 0 if it does not.
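
The indicator features described above can be sketched as follows. The function names and the toy vocabulary are hypothetical, and reverse complements are deliberately not handled here.

```python
def kmers(sequence, k=8):
    """Set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def indicator_features(sequence, vocabulary, k=8):
    """One 0/1 entry per vocabulary k-mer: 1 if the sequence contains it, else 0."""
    present = kmers(sequence, k)
    return [1 if kmer in present else 0 for kmer in vocabulary]


# Toy usage: a 10-bp sequence against a 3-entry vocabulary.
features = indicator_features("ACGTACGTAA", ["ACGTACGT", "GGGGGGGG", "GTACGTAA"])
```

A full model would use all 8-mers as the vocabulary, which is what makes the feature vector so large.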

An overview

Methods They developed a training strategy for the SVR model that involves three key components: 1. The choice of kernel. 2. The sampling procedure for selecting the most informative training sequences. 3. The feature selection method.

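The di-mismatch kernel Let $\{\phi_1, \dots, \phi_M\}$ be the set of unique k-mers that occur in the set of training sequences. For a sequence $s$ of length $N$, define the set of substrings of length $k$ in $s$: $\{s_i = s[i..i+k-1] : 1 \le i \le N-k+1\}$. Then $s$ is represented by the feature vector $\Phi(s) = \left( \sum_i d(s_i, \phi_1), \dots, \sum_i d(s_i, \phi_M) \right)$, where $d(s_i, \phi_j)$ counts the number of matching dinucleotides between $s_i$ and $\phi_j$.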

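Example for the di-mismatch kernel Two non-consecutive pairs of mismatches lead to a mismatch count of 6, whereas 4 consecutive mismatches lead to a count of 5. Each isolated pair of adjacent mismatched positions disrupts 3 dinucleotides, while a run of 4 mismatched positions disrupts only 5, so scattered mismatches are penalized more than clustered ones.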
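
The two counts in this example can be checked with a short sketch that compares two equal-length k-mers one dinucleotide at a time. The function names and the toy 13-mers are hypothetical, and any mismatch threshold used by the actual kernel is not modeled.

```python
def di_mismatch_count(a, b):
    """Number of aligned dinucleotide positions at which two k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(1 for i in range(len(a) - 1) if a[i:i + 2] != b[i:i + 2])


def di_match_count(a, b):
    """Number of matching dinucleotides (a k-mer has k - 1 dinucleotides)."""
    return (len(a) - 1) - di_mismatch_count(a, b)


reference = "AAAAAAAAAAAAA"   # toy 13-mer
two_pairs = "AACCAAACCAAAA"   # two separate pairs of mismatched positions
one_run = "AAAACCCCAAAAA"     # four consecutive mismatched positions
```

Here `di_mismatch_count(reference, two_pairs)` gives 6 and `di_mismatch_count(reference, one_run)` gives 5, matching the slide.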

Sampling PBM data to obtain an informative training set They selected the set of “positive” training probes to be those sequences associated with normalized binding intensities Z ≥ 3.5. If there were more than 500, they selected the top 500 ranked by their binding signals. The same number of “negative” training probes was selected from the other end of the distribution.
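
The sampling rule above can be sketched as follows, assuming the normalized intensities are available as a plain list of Z-scores. The function name and argument names are hypothetical.

```python
def select_training_probes(zscores, z_min=3.5, cap=500):
    """Return (positive_indices, negative_indices) for SVR training.

    Positives are probes with Z >= z_min, limited to the `cap` strongest.
    Negatives are the same number of probes taken from the weakest end.
    """
    order = sorted(range(len(zscores)), key=lambda i: zscores[i], reverse=True)
    positives = [i for i in order if zscores[i] >= z_min][:cap]
    negatives = order[::-1][:len(positives)]
    return positives, negatives


# Toy usage with 8 probes.
toy_z = [5.0, 4.0, 3.6, 1.0, 0.0, -1.0, -2.0, -3.0]
toy_pos, toy_neg = select_training_probes(toy_z)
```

Training only on the two tails keeps the most informative probes and makes the training set far smaller than the full array.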

Feature Selection They selected the feature set to be those k-mers that are over-represented in either the “positive” or the “negative” probe class. They computed the mean di-mismatch score for each k-mer in each class and ranked features by the difference between these means. They used at most 4000 k-mers.
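
A minimal sketch of this ranking step is shown below. It assumes the per-probe di-mismatch scores are already computed and stored per k-mer, and it ranks by the absolute difference of class means so that k-mers enriched in either class are kept; the use of the absolute value is an assumption, as are all names.

```python
from statistics import mean


def rank_features(pos_scores, neg_scores, max_features=4000):
    """Rank k-mers by |mean score in positives - mean score in negatives|.

    pos_scores / neg_scores map each k-mer to a list of di-mismatch scores,
    one per probe in that class.
    """
    diffs = {kmer: abs(mean(pos_scores[kmer]) - mean(neg_scores[kmer]))
             for kmer in pos_scores}
    return sorted(diffs, key=diffs.get, reverse=True)[:max_features]


# Toy usage with three 3-mers and two probes per class.
toy_pos_scores = {"AAA": [5, 5], "CCC": [1, 1], "GGG": [2, 2]}
toy_neg_scores = {"AAA": [1, 1], "CCC": [1, 2], "GGG": [5, 6]}
selected = rank_features(toy_pos_scores, toy_neg_scores, max_features=2)
```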

Results First, they tested how well a model trained on one PBM array predicts the ranking of probe sequences on another PBM array. They used the “top 100” metric: how many of the 100 probes with the highest measured intensity were also ranked in the top 100 by the model. They compared against the PSSM and E-score (full 8-mer list) models.
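
The top-100 metric is an overlap count between two rankings, which can be sketched as below. The function name and the toy values are hypothetical.

```python
def top_n_overlap(measured, predicted, n=100):
    """Count probes that are in the top n by both measured and predicted values."""
    def top(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:n])
    return len(top(measured) & top(predicted))


# Toy usage with 6 probes and n = 3.
toy_measured = [9, 8, 7, 1, 2, 3]
toy_predicted = [9, 1, 8, 7, 0, 0]
overlap = top_n_overlap(toy_measured, toy_predicted, n=3)
```

In the toy data, probes 0 and 2 are in both top-3 sets, so the overlap is 2.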

The left scatter plot shows the detection of the top 100 probes using maximum E-scores (x-axis) and the SVR model (y-axis) in the prediction of in vitro TF binding preferences. Each point corresponds to one TF. The right panel is similar to the left, but compares the SVR versus PBM-derived PSSMs for the 114 mouse TFs.

Testing on ChIP-chip data

Prediction of in vivo occupancy They computed the binding occupancy using a sliding 36-mer window for scoring. They compared to: 1. PSSM. Log-odds scores were used. 2. E-score over a fixed threshold. 3. E-score based occupancy (using the median probe intensity of PBM probes containing the highest-scoring 8-mer pattern).
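
The sliding-window scoring can be sketched as follows. The `score_window` argument stands in for a trained model's prediction function and is a placeholder assumption, as is the toy scorer used in the usage lines.

```python
def occupancy_profile(sequence, score_window, width=36):
    """Score every window of the given width along the sequence."""
    return [score_window(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]


# Toy usage: a stand-in scorer that counts G's, with a short window.
toy_profile = occupancy_profile("AAGGAA", lambda window: window.count("G"), width=3)
toy_peak = max(toy_profile)
```

Plotting such a profile along an intergenic region gives a predicted binding profile, and taking its maximum gives a single score per region.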

Predicted binding profile for: – (left) yeast TF Ume6 along IGR iYFL022C – (right) yeast TF Gal4 along IGR iYFR026C

They computed the detection of the top 200 intergenic regions (IGRs) by the top 200 predictions, where the top 200 “bound” IGRs were determined by their p-value ranking. Prediction of in vivo binding is weak to very poor (due to indirect and competitive binding as well as other factors). Still, in 8 out of 9 examples the SVR method outperforms the occupancy score method of Zhu et al. (2009). Against the PSSM model it had 6 wins, 1 tie, and 2 losses.

ChIP-Seq At a high level: 1. A specific TF binds to the DNA. 2. An antibody against the TF pulls down the bound DNA fragments, and the unbound fragments are removed. 3. The TF is removed from the DNA. 4. The remaining fragments are sequenced. This gives an occupancy map with binding intensities for different genomic regions (measured by the number of reads from each region).

Testing on ChIP-seq data

They selected 1000 confident peak regions (60 bp each) and 1000 “negative” regions from flanking sequences (60 bp regions 300 bp away from the peaks). Model performance was measured by the area under the ROC curve (AUC), using the maximum SVR prediction score (over 36-mer windows) to rank ChIP-seq 60-mers. The ROC curve plots the true positive rate against the false positive rate.
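
The AUC used here equals the probability that a randomly chosen positive region scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and toy scores are hypothetical, and each score is assumed to be the maximum window score of a region.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy usage: 3 peak regions and 3 flanking regions.
toy_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

The pairwise loop is quadratic, which is fine for 1000 by 1000 regions; a rank-based computation would be preferable for much larger sets.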

SVRs trained on PBM arrays are able to capture ChIP-seq peaks better than PSSMs or the occupancy score.

Support Vector Machines Here we want to classify the data into two classes, i.e. the training set is $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$.
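
For reference, the standard soft-margin linear SVM for labels $y_i \in \{-1, +1\}$ can be written as below. The slide does not give the formulation, so the trade-off constant $C$ and the slack variables $\xi_i$ are assumptions of this sketch.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i} \xi_i \\
\text{subject to} \quad & y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0 .
\end{aligned}
```

A new sequence $x$ is then classified by the sign of $\langle w, x \rangle + b$, and the unsigned value can be used to rank sequences.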

Training discriminative models on ChIP-seq data They trained SVMs using the (13,5) parameters on 60-mer ChIP-seq peaks (positive sequences) and flanking negative sequences. Evaluation was done by computing AUCs on the same test sets of 1000 ChIP-seq peaks and 1000 flanking negative sequences using 10-fold cross-validation. The SVMs were tested against Weeder and MDscan, which determine overrepresented k-mer and PSSM motifs, respectively.

SVMs trained on ChIP-seq data capture sequence information from the genomic context of ChIP-seq peaks and improve in vivo prediction performance. There was no advantage to training regression models on ChIP-seq peaks labeled with real-valued occupancy.

PBM experiments may capture in vivo preference To investigate how some PBMs contain 2 different binding sites, they: 1. Clustered k-mer features based on their co-occurrence in the training sequences. 2. Projected highly weighted k-mers into 2 dimensions using principal component analysis (PCA). Two clusters were found, each representing a different motif. The SVR was trained on the features of each motif separately, and the AUCs were 0.75 and 0.54.

K-mers contributing to the (left) Oct4 PBM model and (right) Sox2 ChIP model, where each point represents a 13-mer and is colored according to its model weight. Star and circle point styles indicate different clusters. For the PBM-derived model, the clusters represent the primary and secondary binding motifs. For the ChIP-derived model, the clusters correspond to the motifs for Sox2 and its cofactor Oct4.

Summary A flexible new discriminative framework for learning TF binding models from high resolution in vitro and in vivo data. The SVR/SVM models better predict binding affinity and thus are more suitable for representing complex regulatory regions.

Possible directions to continue 1. Training jointly on PBM and ChIP-seq data for the same TF. 2. Developing multi-task training strategies for modeling the binding preferences of a class of structurally related TFs using features of the amino acid sequence. 3. Combining in vivo TF sequence preference models with data on chromatin state to predict TF target genes in new cell types.