Computational analyses of yeast and human chromatin William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.

Computational analyses of yeast and human chromatin William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Outline Sequence-based models of nucleosome positioning Footprinting protein binding sites genomewide

Genomic DNA Packaged into Chromatin [Figure labels: Nucleus; Chromatin Fiber; Trans-factor complex; DNaseI Hypersensitive Site; Organization of cis-regulatory sequences; Gene ‘domains’; Genes]

4/43 9.3% 33/ % 108/ %

Overall approach Microarray data from (Yuan et al. 2006).

Sequence spectrum Compute frequencies of substrings of length k (k-mers) for k = 1 up to 6. Treat reverse complements as the same k-mer. The resulting vector contains 2772 entries. A/T C/G AA/TT AC/GT AG/CT AT/AT CA/TG CC/GG CG/CG GA/TC GC/GC TA/TA AAA/TTT AAC/GTT AAG/CTT AAT/ATT TTTAAA/TTTAAA
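The spectrum described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the function names (`canonical`, `spectrum_index`, `spectrum`) are hypothetical, and normalizing counts to per-k frequencies is an assumption about what "frequencies" means here.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def spectrum_index(max_k=6):
    """Map every canonical k-mer (k = 1..max_k) to a position in the feature vector."""
    keys = []
    for k in range(1, max_k + 1):
        all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
        keys.extend(sorted({canonical(kmer) for kmer in all_kmers}))
    return {kmer: i for i, kmer in enumerate(keys)}


def spectrum(seq, index, max_k=6):
    """Frequency of each canonical k-mer in seq (assumed normalization: per k)."""
    features = [0.0] * len(index)
    for k in range(1, max_k + 1):
        n_windows = len(seq) - k + 1
        for i in range(max(n_windows, 0)):
            key = canonical(seq[i:i + k])
            if key in index:  # skips windows containing N or other symbols
                features[index[key]] += 1.0 / n_windows
    return features


index = spectrum_index()
print(len(index))  # number of features for k = 1..6
```

Counting canonical k-mers this way gives 2 + 10 + 32 + 136 + 512 + 2080 features for k = 1 to 6, which matches the 2772 entries stated on the slide.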

Primary results

The SVM recapitulates array data

10bp periodicity AA periodicity, Drew & Travers 1986 AA/TT/AT periodicity, Segal 2006 Periodicity in SVM score, Peckham 2007

Comparison of yeast models
Segal 2006:
– The model is positional.
– The model is generative.
– Compare predicted positions with 199 sites from the literature.
– 54% are within 35 bp.
– Expect 39% by chance.
– The model explains >50% of the signal.
– The model performs 15% better than chance.
Peckham 2007:
– The model is compositional.
– The model is discriminative.
– Compare predicted positions with sites derived from (Yuan 2006).
– 50% are within 40 bp.
– Expect 33% by chance.
– The model explains ~50% of the signal.
– The model performs 17% better than chance.

Two data sets
Dennis et al., Genome Research, kb regions upstream of 42 genes
– 50-mer probes every 20 bp
– 3 arrays, 3 copies of each probe, forward and reverse strand → 18 measurements per probe
Ozsolak et al., Nature Biotechnology, kb regions upstream of 3692 genes
– 50-mer probes every 10 bp
– 7 cell lines

Cross-validation results

Complementary aspects of chromatin accessibility
Dennis and A375 SVMs accurately identify low MNase accessibility.
MEC SVM accurately identifies high MNase accessibility.
→ Strong MNase digestion (MEC) allows the recognition of nucleosome disfavoring sequences.
→ Weak MNase digestion (A375) allows the recognition of nucleosome forming sequences.

Yeast and human concordance Each model was applied to the human ENCODE regions

Low- and high-scoring regions A375 SVM scores are averaged over 1000 top- and bottom-scoring regions. Flanking lines indicate standard error of the mean.

Dinucleotide frequencies MNase cleavage bias is unlikely to account for such large differences. Nucleosome forming sequences exhibit a 3bp periodicity of CG and GC dinucleotides. Nucleosome disfavoring sequences tend to be low complexity.

Transcription start sites A375 – weak digestion Recognizes nucleosome forming sequences MEC – strong digestion Recognizes nucleosome disfavoring sequences SVM scores are averaged over all TSSs in the ENCODE regions.

Summary An SVM can discriminate between MNase protected and MNase accessible sequences with high accuracy. The model learns to recognize complementary phenomena, depending upon the degree of MNase digestion. The model recapitulates known features of human chromatin. Most nucleosome positioning is boundary-event driven.

Methodology

60% of DNaseI cleavage occurs in intergenic regions

Individual footprints

Problem definition
Given
– Cut-counts at each position
– Unique mappability (Boolean) of each position
– Size range of footprints
– Size of the background window
Return
– A ranked list of non-overlapping footprints, each associated with a statistical confidence score

Scoring a candidate footprint Foreground window Background window A depletion score

Depletion score: binomial distribution
The probability that a window of size a within the target region will contain x or fewer cuts
– a: effective foreground window size
– b: effective background window size
– B: # of cuts in the background window
Score all overlapping windows of width k_min to k_max.
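The formula itself did not survive in this transcript. A plausible reading, stated here as an assumption, is a binomial lower tail: each of the B background cuts falls in the foreground window independently with probability a/b, and the score is the probability of seeing x or fewer cuts there. The function names below are hypothetical.

```python
from math import comb


def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1))


def depletion_score(x, a, b, B):
    """Lower-tail probability of x or fewer cuts in the foreground window.

    x: cuts observed in the foreground window
    a: effective foreground window size
    b: effective background window size
    B: number of cuts in the background window
    Assumption (not stated in the transcript): success probability p = a / b.
    """
    if not 0 < a <= b:
        raise ValueError("expected 0 < a <= b")
    return binomial_cdf(x, B, a / b)


# A 10-position window with no cuts, inside a 100-position background holding 20 cuts.
print(depletion_score(0, 10, 100, 20))
```

Under this reading a smaller score means stronger depletion, which is consistent with sorting scores in ascending order during selection.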

Depletion score: SNR
Signal-to-noise ratio
– λ: pseudo-count (0.01)
– Noise is computed by excluding foreground from the background window.
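The exact ratio is missing from the transcript, so the following is only one way the ingredients above could fit together. It assumes that signal is the cut density in the foreground window, that noise is the cut density in the background window with the foreground removed, and that the pseudo-count λ is added to both. The function name and argument layout are hypothetical.

```python
def snr_depletion(fg_cuts, fg_size, bg_cuts, bg_size, pseudo=0.01):
    """Ratio of foreground to background cut density (assumed form).

    The background window is assumed to contain the foreground window, so the
    noise term uses only the background positions outside the foreground.
    """
    noise_size = bg_size - fg_size
    if fg_size <= 0 or noise_size <= 0:
        raise ValueError("background must be strictly larger than foreground")
    signal = fg_cuts / fg_size
    noise = (bg_cuts - fg_cuts) / noise_size
    return (signal + pseudo) / (noise + pseudo)


# No cuts in a 10-position foreground, 90 cuts in the remaining 90 positions.
print(snr_depletion(fg_cuts=0, fg_size=10, bg_cuts=90, bg_size=100))
```

With this form, values well below 1 indicate a protected region, and the pseudo-count keeps the ratio defined when the flanks contain no cuts.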

Greedy selection
Generate a non-overlapping set of high-scoring windows
– Sort all of the depletion scores in ascending order
– Traverse the sorted list, accepting a scored window if it does not overlap a previously accepted window
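The two steps above translate almost directly into code. This is a minimal sketch: the `(score, start, end)` tuple layout with an exclusive end coordinate is an assumption, and the overlap check is written for clarity rather than speed.

```python
def greedy_select(windows):
    """Pick non-overlapping windows, best (lowest) depletion score first.

    windows: iterable of (score, start, end) tuples, end exclusive.
    Returns the accepted windows in the order they were accepted.
    """
    accepted = []
    for score, start, end in sorted(windows):  # ascending by score
        overlaps = any(start < e and s < end for _, s, e in accepted)
        if not overlaps:
            accepted.append((score, start, end))
    return accepted


candidates = [(0.01, 10, 20), (0.02, 15, 25), (0.03, 30, 40), (0.5, 0, 10)]
print(greedy_select(candidates))
```

In the example, the window at 15 to 25 is rejected because it overlaps the better-scoring window at 10 to 20. For genome-scale input, an interval index would be a natural replacement for the linear overlap scan.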

Empirical null model
– Shuffle the cut-counts at the level of genomic positions, together with the mappability information of each position
– Repeat the depletion scoring and greedy selection procedure on the shuffled data
– Generate a ranked list of footprints
– Estimate false discovery rate using Storey method.
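The shuffling and false discovery rate steps might look like the sketch below. Two simplifications are assumed and should be read as such: a single shuffled replicate of the same size as the real data, and a null proportion fixed at 1, which is more conservative than the Storey method named on the slide because that method estimates the proportion from the data. Function names are hypothetical.

```python
import random


def shuffle_positions(cuts, mappable, seed=None):
    """Permute genomic positions, keeping each cut-count with its mappability flag."""
    rng = random.Random(seed)
    paired = list(zip(cuts, mappable))
    rng.shuffle(paired)
    shuffled_cuts = [c for c, _ in paired]
    shuffled_mappable = [m for _, m in paired]
    return shuffled_cuts, shuffled_mappable


def estimate_fdr(observed_scores, null_scores, threshold):
    """Fraction of calls at or below threshold expected to be false.

    Simplified estimate: (# null scores <= t) / (# observed scores <= t),
    capped at 1, with the null proportion assumed to be 1.
    """
    n_observed = sum(1 for s in observed_scores if s <= threshold)
    n_null = sum(1 for s in null_scores if s <= threshold)
    if n_observed == 0:
        return 0.0
    return min(1.0, n_null / n_observed)


cuts, mappable = shuffle_positions([5, 0, 0, 7, 2], [True, True, False, True, True], seed=1)
print(cuts, mappable)
print(estimate_fdr([0.001, 0.002, 0.01, 0.2], [0.005, 0.3, 0.4, 0.5], threshold=0.01))
```

The null scores would come from running the same depletion scoring and greedy selection on the shuffled arrays.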

Evaluation: gold standard
MacIsaac set [MacIsaac et al. 2006]
– Conserved regulatory sites in yeast
– Identified from ChIP data
– 4387 sites with stringent thresholds
Imperfect
– Conservatively defined
– Different experimental conditions
Only used to compare different footprint detectors

Evaluation: metric
Recall = TP / (TP+FN)
Precision = TP / (TP+FP)

Results “What fraction of the MacIsaac motifs are in footprints?” “What fraction of the footprints contain a MacIsaac motif?”
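The two questions above correspond to recall and precision computed over intervals. A minimal sketch follows, assuming that "in a footprint" means the motif site lies entirely inside the footprint and that intervals are `(start, end)` pairs with an exclusive end. Both conventions are assumptions, not details given in the slides.

```python
def contains(footprint, site):
    """True if the site interval lies entirely inside the footprint interval."""
    return footprint[0] <= site[0] and site[1] <= footprint[1]


def precision_recall(footprints, sites):
    """Recall: fraction of sites inside some footprint.
    Precision: fraction of footprints containing some site."""
    sites_found = sum(1 for s in sites if any(contains(f, s) for f in footprints))
    footprints_hit = sum(1 for f in footprints if any(contains(f, s) for s in sites))
    recall = sites_found / len(sites) if sites else 0.0
    precision = footprints_hit / len(footprints) if footprints else 0.0
    return precision, recall


footprints = [(0, 20), (30, 50), (60, 80)]
sites = [(5, 12), (33, 40), (90, 97), (100, 108)]
print(precision_recall(footprints, sites))
```

In the example, two of the four sites fall inside a footprint and two of the three footprints contain a site.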

Results Binomial scoring performs better than the simple ratio. The rank transformation yields better results. Larger background widths are better. Using the double scoring scheme does not always help.

Results
– 238,133 candidate footprints.
– 4514 are significant at q<0.05.
– Estimated 10,716 footprints in total.
– Our algorithm identifies 40.0% of these at q<0.05.
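The slide does not show how the 40.0% figure was derived. One reading that reproduces it, offered here as an assumption, is that roughly 95% of the 4514 calls at q<0.05 are expected to be true footprints, and that this count is divided by the estimated total.

```python
significant = 4514       # footprints called at q < 0.05 (from the slide)
q_threshold = 0.05       # expected false fraction among those calls
estimated_total = 10716  # estimated number of true footprints (from the slide)

# Assumed derivation: expected true positives divided by the estimated total.
expected_true = significant * (1 - q_threshold)
fraction_identified = expected_true / estimated_total

print(f"{expected_true:.1f} expected true footprints")
print(f"{fraction_identified:.1%} of the estimated total")
```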

Scan footprints with MacIsaac motifs, using q< % of the footprints contain a motif. Also scan intergenic regions. Every motif occurs more frequently in footprints than in intergenic regions.

Footprints contain known motifs We identify 5800 footprints at q=0.05. Find 100 motifs with MEME. Identify 20 of these motifs with Tomtom. Motif information content is inversely correlated with Phastcons score (p < ).

Motif discovery 15 sites, E=7e-12. 41 sites, E=1e-29. 8 sites, E=6e sites, E=3e-6 7/8 sites occur in sigma LTRs associated with retrotransposons

MCM1 The first motif matches the core of the TRANSFAC MCM1 motif.

Motif discovery 41 sites, E=1e sites, E=3e occurrences in footprints. Of these, 42 are within 250bp 5’ of the start of a gene. 35 occurrences in footprints. Of these, 22 are within 250bp 5’ of the start of a gene.

Global view of chromatin organization

Summary Digital genomic footprinting provides a nucleotide-level map of DNaseI accessibility across the yeast genome. This map enables identification of individual protein binding sites. Dramatically improves the signal-to-noise ratio for motif searching. The method can be performed on any organism whose genome is sequenced, exposing its entire cis-regulatory framework in a single experiment.