Computational analyses of yeast and human chromatin William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Outline
– Sequence-based models of nucleosome positioning
– Footprinting protein binding sites genomewide
Organization of cis-regulatory sequences [Figure: genomic DNA packaged into chromatin. Labels: nucleus, chromatin fiber, gene ‘domains’, genes, DNaseI hypersensitive site, trans-factor complex.]
4/43 9.3% 33/ % 108/ %
Overall approach Microarray data from (Yuan et al. 2006).
Sequence spectrum
– Compute frequencies of substrings of length k (k-mers) for k = 1 up to 6.
– Treat reverse complements as the same k-mer.
– The resulting vector contains 2772 entries: A/T, C/G, AA/TT, AC/GT, AG/CT, AT/AT, CA/TG, CC/GG, CG/CG, GA/TC, GC/GC, TA/TA, AAA/TTT, AAC/GTT, AAG/CTT, AAT/ATT, ..., TTTAAA/TTTAAA
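The spectrum described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the function names (`canonical`, `spectrum_index`, `sequence_spectrum`) are hypothetical, and normalizing counts separately for each k is an assumption, since the slide only says "frequencies". Enumerating every k-mer for k = 1 to 6 and merging reverse complements does give the 2772 entries quoted on the slide.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def spectrum_index(max_k=6):
    """Map each canonical k-mer (k = 1..max_k) to a slot in the feature vector."""
    index = {}
    for k in range(1, max_k + 1):
        for letters in product("ACGT", repeat=k):
            key = canonical("".join(letters))
            if key not in index:
                index[key] = len(index)
    return index

def sequence_spectrum(seq, index, max_k=6):
    """Count canonical k-mers in seq; normalize within each k (an assumption)."""
    seq = seq.upper()
    vec = [0.0] * len(index)
    for k in range(1, max_k + 1):
        counts = {}
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
                key = canonical(kmer)
                counts[key] = counts.get(key, 0) + 1
                total += 1
        for key, count in counts.items():
            vec[index[key]] = count / total
    return vec

index = spectrum_index()
print(len(index))  # number of canonical k-mers for k = 1..6
```

A vector built this way for each probe sequence is the kind of input an SVM could be trained on.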
Primary results
The SVM recapitulates array data
10bp periodicity
– AA periodicity (Drew & Travers 1986)
– AA/TT/AT periodicity (Segal 2006)
– Periodicity in SVM score (Peckham 2007)
Comparison of yeast models
Segal 2006:
– The model is positional.
– The model is generative.
– Compare predicted positions with 199 sites from the literature.
– 54% are within 35 bp; expect 39% by chance.
– The model explains >50% of the signal.
– The model performs 15 percentage points better than chance (54% vs. 39%).
Peckham 2007:
– The model is compositional.
– The model is discriminative.
– Compare predicted positions with sites derived from (Yuan 2006).
– 50% are within 40 bp; expect 33% by chance.
– The model explains ~50% of the signal.
– The model performs 17 percentage points better than chance (50% vs. 33%).
Two data sets
Dennis et al., Genome Research
– kb regions upstream of 42 genes
– 50-mer probes every 20 bp
– 3 arrays, 3 copies of each probe, forward and reverse strand → 18 measurements per probe
Ozsolak et al., Nature Biotechnology
– kb regions upstream of 3692 genes
– 50-mer probes every 10 bp
– 7 cell lines
Cross-validation results
Complementary aspects of chromatin accessibility Dennis and A375 SVMs accurately identify low MNase accessibility. MEC SVM accurately identifies high MNase accessibility. Strong MNase digestion (MEC) allows the recognition of nucleosome disfavoring sequences. Weak MNase digestion (A375) allows the recognition of nucleosome forming sequences.
Yeast and human concordance Each model was applied to the human ENCODE regions
Low- and high-scoring regions A375 SVM scores are averaged over 1000 top- and bottom-scoring regions. Flanking lines indicate standard error of the mean.
Dinucleotide frequencies MNase cleavage bias is unlikely to account for such large differences. Nucleosome forming sequences exhibit a 3bp periodicity of CG and GC dinucleotides. Nucleosome disfavoring sequences tend to be low complexity.
Transcription start sites
– A375 (weak digestion): recognizes nucleosome forming sequences
– MEC (strong digestion): recognizes nucleosome disfavoring sequences
– SVM scores are averaged over all TSSs in the ENCODE regions.
Summary An SVM can discriminate between MNase protected and MNase accessible sequences with high accuracy. The model learns to recognize complementary phenomena, depending upon the degree of MNase digestion. The model recapitulates known features of human chromatin. Most nucleosome positioning is boundary-event driven.
Methodology
60% of DNaseI cleavage occurs in intergenic regions
Individual footprints
Problem definition
Given
– Cut-counts at each position
– Unique mappability (Boolean) of each position
– Size range of footprints
– Size of the background window
Return
– A ranked list of non-overlapping footprints, each associated with a statistical confidence score
Scoring a candidate footprint [Figure labels: foreground window, background window, a depletion score]
Depletion score: binomial distribution
The probability that a window of size a within the target region will contain x or fewer cuts
– a: effective foreground window size
– b: effective background window size
– B: # of cuts in the background window
Score all overlapping windows of width k_min to k_max.
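One way to read this score is as a binomial lower-tail probability. The sketch below assumes that each of the B background cuts falls in the foreground window independently with probability a/b; that parameterization is an interpretation of the slide, not something it states. The function name `binomial_depletion_score` is hypothetical.

```python
from math import comb

def binomial_depletion_score(x, a, b, B):
    """P(X <= x) for X ~ Binomial(B, a / b); smaller means stronger depletion.

    x: cuts observed in the foreground window
    a: effective foreground window size
    b: effective background window size
    B: number of cuts in the background window
    Assumes p = a / b, which is one reading of the slide.
    """
    if not 0 < a <= b:
        raise ValueError("need 0 < a <= b")
    p = a / b
    x = min(x, B)
    return sum(comb(B, i) * p**i * (1 - p) ** (B - i) for i in range(x + 1))

# A 10-position window with 0 cuts, inside a 100-position background with 50 cuts
print(binomial_depletion_score(0, 10, 100, 50))
```

Under this reading a footprint candidate with far fewer cuts than its share of the background receives a score near 0, so candidates can be ranked in ascending order of score.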
Depletion score: SNR
Signal-to-noise ratio
– λ: pseudo-count (0.01)
– Noise is computed by excluding foreground from the background window.
Greedy selection
Generate a non-overlapping set of high-scoring windows
– Sort all of the depletion scores in ascending order
– Traverse the sorted list, accepting a scored window if it does not overlap a previously accepted window
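The two steps above can be sketched as follows. The tuple layout `(score, start, end)` with half-open intervals and the name `greedy_select` are assumptions made for illustration. The overlap check is a simple linear scan; a real genome-scale run would likely use an interval index instead.

```python
def greedy_select(windows):
    """Pick non-overlapping windows in ascending order of depletion score.

    windows: iterable of (score, start, end) with half-open [start, end).
    Returns the accepted windows in the order they were accepted.
    """
    accepted = []
    for score, start, end in sorted(windows):
        # Accept only if the window is disjoint from every accepted window.
        if all(end <= s or start >= e for _, s, e in accepted):
            accepted.append((score, start, end))
    return accepted

candidates = [(0.01, 10, 20), (0.02, 15, 25), (0.03, 20, 30), (0.5, 0, 5)]
print(greedy_select(candidates))
```

In this toy input the second candidate is dropped because it overlaps the best-scoring window, while the other three are kept.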
Empirical null model
– Shuffle the cut-counts at the level of genomic positions, together with the mappability information of each position
– Repeat the depletion scoring and greedy selection procedure on the shuffled data
– Generate a ranked list of footprints
– Estimate the false discovery rate using the Storey method.
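A rough sketch of the shuffling step and of an empirical FDR estimate follows. It is deliberately simplified: the FDR function is the plain ratio of null to observed calls at a threshold, which corresponds to fixing the null proportion at 1 and so omits the null-proportion estimate that the Storey method adds. Both function names are hypothetical.

```python
import random

def shuffle_positions(cuts, mappable, seed=0):
    """Permute positions, keeping each cut-count paired with its mappability flag."""
    order = list(range(len(cuts)))
    random.Random(seed).shuffle(order)
    return [cuts[i] for i in order], [mappable[i] for i in order]

def empirical_fdr(observed, null, threshold):
    """Simplified FDR at a score threshold (smaller score = stronger footprint).

    observed: depletion scores of footprints called on the real data
    null:     depletion scores of footprints called on the shuffled data
    Assumes both lists come from data sets of the same size.
    """
    n_observed = sum(score <= threshold for score in observed)
    n_null = sum(score <= threshold for score in null)
    if n_observed == 0:
        return 0.0
    return min(1.0, n_null / n_observed)

print(empirical_fdr([0.01, 0.02, 0.5], [0.3, 0.4, 0.6], 0.5))
```

Shuffling the cut-count and the mappability flag together keeps unmappable positions from being treated as cut-free positions in the null data.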
Evaluation: gold standard
MacIsaac set [MacIsaac et al. 2006]
– Conserved regulatory sites in yeast
– Identified from ChIP data
– 4387 sites with stringent thresholds
Imperfect
– Conservatively defined
– Different experimental conditions
Only used to compare different footprint detectors
Evaluation: metric Recall = TP / (TP+FN) Precision = TP / (TP+FP)
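For interval data, the two ratios can be computed as in the sketch below. It assumes that a gold-standard site counts as recovered when it lies entirely inside a footprint; the slides do not state the matching criterion, so this rule and the names `contains` and `precision_recall` are illustrative only.

```python
def contains(footprint, site):
    """True if the site interval lies entirely inside the footprint interval."""
    return footprint[0] <= site[0] and site[1] <= footprint[1]

def precision_recall(footprints, sites):
    """Return (precision, recall) for (start, end) footprints vs. gold-standard sites.

    recall    = fraction of gold-standard sites inside some footprint
    precision = fraction of footprints containing some gold-standard site
    """
    sites_hit = sum(any(contains(f, s) for f in footprints) for s in sites)
    footprints_hit = sum(any(contains(f, s) for s in sites) for f in footprints)
    recall = sites_hit / len(sites) if sites else 0.0
    precision = footprints_hit / len(footprints) if footprints else 0.0
    return precision, recall

footprints = [(0, 20), (50, 70), (100, 120)]
sites = [(5, 12), (55, 60), (200, 210), (300, 305)]
print(precision_recall(footprints, sites))
```

Because the gold standard is conservative, precision measured this way is best used to compare detectors against each other rather than as an absolute error rate.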
Results “What fraction of the MacIsaac motifs are in footprints?” “What fraction of the footprints contain a MacIsaac motif?”
Results Binomial scoring performs better than the simple ratio. The rank transformation yields better results. Larger background widths are better. Using the double scoring scheme does not always help.
Results 238,133 candidate footprints; 4514 are significant at q<0.05. Estimated 10,716 footprints in total. Our algorithm identifies 40.0% of these at q<0.05.
Scan footprints with MacIsaac motifs, using q< …. …% of the footprints contain a motif. Also scan intergenic regions. Every motif occurs more frequently in footprints than in intergenic regions.
Footprints contain known motifs We identify 5800 footprints at q=0.05. Find 100 motifs with MEME. Identify 20 of these motifs with Tomtom. Motif information content is inversely correlated with Phastcons score (p < ).
Motif discovery
– 15 sites, E=7e-12
– 41 sites, E=1e-29
– 8 sites, E=6e…
– … sites, E=3e-6
7/8 sites occur in sigma LTRs associated with retrotransposons
MCM1 The first motif matches the core of the TRANSFAC MCM1 motif.
Motif discovery
– 41 sites, E=1e…: … occurrences in footprints. Of these, 42 are within 250bp 5’ of the start of a gene.
– … sites, E=3e…: 35 occurrences in footprints. Of these, 22 are within 250bp 5’ of the start of a gene.
Global view of chromatin organization
Summary Digital genomic footprinting provides a nucleotide-level map of DNaseI accessibility across the yeast genome. This map enables identification of individual protein binding sites. Dramatically improves the signal-to-noise ratio for motif searching. The method can be performed on any organism whose genome is sequenced, exposing its entire cis-regulatory framework in a single experiment.