Genes and Regulatory Elements Zhiping Weng U Mass Medical School
ENCODE: ENCyclopedia Of DNA Elements (The ENCODE Project Consortium, Science 2004, Nature 2007)
[Figure: the ENCODE pilot regions (m001 to m014 and r111 to r334) placed on chromosomes 1 to 22, X and Y]
Goal: Identify all functional elements in the human genome. Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence). Now genome-wide.
The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project Science, Vol 306, 636-640.
Gene RNA-seq
Epigenomics
Regulatory Elements chrX:129,143,739-129,661,948
The human genome 45% repetitive DNA 53% Unique and segmental duplicated DNA 2% genes (25,000) Where are the gene regulatory elements? G. Crawford
DNase hypersensitive (HS) sites identify active gene regulatory elements DNase I HS sites Regions hypersensitive to DNase Promoters Enhancers Silencers Insulators Locus control regions Meiotic recombination hotspots HS sites identify “open” regions of chromatin Crawford et al., Nature Methods 2006
DNase-chip to identify DNase HS sites or sequence directly. Crawford et al., Nature Methods 2006
Arrays used for DNase-chip NimbleGen arrays 385,000 50-mer oligos oligos spaced every 38 bases (12 base overlap) non-repetitive unique regions 1% of the genome (44 ENCODE regions) Crawford et al., Nature Methods 2006
DNase-chip Quality Assessment. Identification of DNase HS sites from 6 cell types. (a) Representative DNase-chip data from ENCODE region ENr231 (chr1:148,050,000-148,165,000). Note that there are common, ubiquitous and cell-line specific DNase HS sites. (b) DNase-chip from K562 cells identifies all five locus control region (LCR) DNase HS sites upstream of the β-globin locus, as well as the 3’ DNase HS site (in red). Additional DNase HS sites are identified around promoter regions of the globin genes. Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J, Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. +Co-corresponding authors PLoS Genetics, 8, 8-20.
DNase HS site (DHS) statistics in 6 cell lines
Cell lines, in column order for rows with six values: CD4, GM06990, K562, H9, HeLa, IMR90
# of DNase I chip hits identified: 1262, 1098, 1210, 1274, 1042, 1244
Specificity based on TSS-depleted ENCODE regions: >99.9%, 99.5%, 99.8%, 99.7%
Specificity based on “gold standard” negative DHS (n=134): 97%, 99%, 96%, 95%, 92%
# of proximal DHS: 784, 674, 666, 773, 575, 642
# of distal DHS: 478, 424, 538, 498, 464, 599
# of DHS containing a TSS: 402, 352, 357, 432, 318, 362
# of cell line specific DHS: 439, 341, 448, 400, 277, 463
# of ubiquitous DHS: 262, 266, 273, 259
Unique, common, and ubiquitous DNase HS sites GM CD4 HeLa H9 K562 IMR90 Ubiquitous HS sites 20% Cell-type specific and Common HS sites 80% Collectively, the DHS cover 8.3% of the ENCODE regions.
Have we reached saturation in identifying most DNase HS sites?
CpG content of DNase HS sites
Ubiquitous DNase HS sites are enriched for promoters (TSS). Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS). What about ubiquitous distal DNase HS sites?
Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF ChIP
Chromatin-immunoprecipitation (ChIP) - chip. Antibody against CTCF. Tiling array (ChIP-chip) or direct sequencing (ChIP-seq). Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome. Genome Research (2005)
The H19/IGF2 Locus is well insulated
DNase HS sites identify insulator in the Hox locus
Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.
CTCF motif sites are conserved
CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers
Ubiquitous DNase HS sites are enriched for promoters (TSS). Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS).
Ubiquitous proximal DNase HS sites
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)
Antibody against histone modification Tiling array Sequencing
Enrichment between tissue-specific H3K4me2 and DNase HS sites
Cell type-specific DNase HS sites correlate with cell type-specific histone modifications Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.
Cell type-specific DNase HS sites correlate with cell type-specific enhancers
Cell type-specific DNase HS sites correlate with cell type-specific gene expression
Use cell-type specific DNase HS sites for discovery
Transcriptional Motifs Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements
Finding enriched motifs in tissue-specific DNase HS sites Screen against a motif library, e.g., JASPAR or TRANSFAC STAT DHS #1 DHS #2 DHS #3 DHS #4 DHS #5 the Clover algorithm Myc/Max YY1 (etc.)
JASPAR: a database of transcription factor motifs
Clover: Cis-eLement OVERrepresentation Myc/Max DHS sequences 17.3 Raw score
The Clover Algorithm
Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.
Symbols used in the Clover raw score:
L_k: nucleotide at position k
W: motif width
S: a promoter sequence
M_s: number of motif locations in a sequence
A: all possibilities of choosing a subset of sequences
N: the total number of promoter sequences
Clover: Cis-eLement OVERrepresentation Myc/Max Control DNA sequences DHS sequences 4.2 6.6 18 9.1 17.3 Raw score P-value = 1/4
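One way to read the numbers on this slide is as an empirical p-value: the raw score of the DHS sequences (17.3) is compared with raw scores of control sequence sets (4.2, 6.6, 18, 9.1), and one of the four controls scores at least as high. The Python sketch below illustrates only that counting step. The function name is hypothetical, and how the control sets are generated is not shown here.

```python
def empirical_p_value(observed, control_scores):
    """Fraction of control raw scores that are at least as large as the observed score."""
    hits = sum(1 for score in control_scores if score >= observed)
    return hits / len(control_scores)


# Values read off the slide: DHS raw score and four control raw scores.
p = empirical_p_value(17.3, [4.2, 6.6, 18, 9.1])
print(p)
```

With only four controls the smallest achievable non-zero p-value is 1/4, so in practice many more control sets would be needed for a fine-grained estimate.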
Motifs enriched in cell-type specific DNase HS sites
Motifs enriched in cell-type specific DNase HS sites
Genome-wide DNase-chip and DNase-sequencing data CD4 cells 23 k proximal DNaseI HS sites 72 k distal DNaseI HS sites
Enriched transcription factor binding motifs in distal DNaseI HS sites Hematopoietic system: TAL1 AML PU.1 C/EBPα Immune system: STAT1, STAT3, STAT5 IRF1, IRF3 and IRF5
Identify motif clusters (modules) Distal DHS sequences acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg Find motif clusters in the human genome Enriched motifs
Finding motif clusters with a hidden Markov model Score Location in DNA Red = motif type 1 (e.g. TAL1) Blue = motif type 2 (e.g. ETS) 0.8 Cluster-Buster MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8. http://zlab.bu.edu/cluster-buster/ 0.1 0.1
Overlap between predicted motif clusters and distal DNase HS sites
$$\text{Enrichment of the overlap} = \frac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}}$$
[Figure: predicted motif clusters at a score cutoff compared with DNase HS sites]
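The enrichment on this slide is an observed-over-expected ratio: if DHS and motif clusters were placed independently, the expected overlap would be DHS × Motif clusters / Sequence space. A minimal Python sketch follows, assuming all four quantities are measured in the same unit (for example base pairs). The function and argument names are illustrative and do not come from the slides.

```python
def overlap_enrichment(overlap, sequence_space, dhs, motif_clusters):
    """Observed overlap divided by the overlap expected under independence."""
    expected = dhs * motif_clusters / sequence_space
    return overlap / expected


# Toy numbers: 100 bp of DHS and 100 bp of clusters in 1000 bp, sharing 50 bp.
print(overlap_enrichment(overlap=50, sequence_space=1000, dhs=100, motif_clusters=100))
```

A value above 1 means the two sets overlap more than expected by chance.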
Motif clusters can predict distal DNase HS sites genome-wide
Summary DNase HS sites identified from 6 cell types Cell-type specific Common Ubiquitous (found in all cell types studied) Ubiquitous DNase HS sites are likely to function as… Promoters (TSS) Insulators (CTCF) (no enhancers?) Ubiquitous sites indicative of housekeeping chromatin structure Cell-type specific DNase HS sites Correlate with histone modifications in a cell type-specific manner Correlate with gene expression in a cell type-specific manner Correlate with enhancer elements in a cell type-specific manner Contain cell type-specific motifs Motif clusters can predict DNase HS sites genome-wide
Motif Finding Many Slides by Bill Noble @ UW
Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation
What is a “Motif”? Generally, a recurring pattern, e.g. Sequence motif Structure motif Network motif More specifically, a set of similar substrings, within a family of diverged sequences. Protein sequence motifs DNA sequence motifs
Example motif
Motif in Logos Format
Gene 3’-Processing Signals RNA A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA. JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.
Splice site motif in logo format weblogo.berkeley.edu
Exonic Splicing Enhancers These motifs occur within exons and enhance splicing of introns from mRNA. Letter height indicates its frequency at that position. Fairbrother WG et al. (2002) Science 297(5583):1007-13
Transcription Factor Binding Sites Estrogen Receptor Transcription start DNA ERE (estrogen response element) Gene ERE Sequence Efp … a g g g t c a t g g t g a c c c t … TERT … t t g g t c a g g c t g a t c t c … Oxytocin … g c g g t g a c c t t g a c c c c … Lactoferrin … c a g g t c a a g g c g a t c t t … Angiotensin … t a g g g c a t c g t g a c c c g … VEGF … a t a a t c a g a c t g a c t g g …
Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation
Weight matrix. Probabilistic model: How likely is each letter at each motif position? [Figure: weight matrix with rows A, C, G, T and one column per motif position 1 to 9]
A. K. A. Weight matrices are also known as Position-specific scoring matrices Position-specific probability matrices Position-specific weight matrices
Scoring a motif model. A motif is interesting if it is very different from the background distribution. [Figure: the 9-column weight matrix with rows A, C, G, T, its columns marked from less interesting to more interesting]
Relative entropy
A motif is interesting if it is very different from the background distribution.
Use relative entropy*:
$$\sum_{i} \sum_{\beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_{\beta}}$$
p_{i,β} = probability of β in matrix position i
b_β = background frequency of β (in non-motif sequence)
* Relative entropy is sometimes called information content.
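As a worked example, the relative entropy of a weight matrix can be computed column by column. The sketch below assumes base-2 logarithms (so the result is in bits) and represents the matrix as a list of per-position dictionaries; both choices are illustrative. Cells with probability 0 contribute nothing to the sum.

```python
import math


def relative_entropy(matrix, background):
    """Sum over positions and letters of p * log2(p / b); zero-probability cells add 0."""
    total = 0.0
    for column in matrix:  # one dict of letter -> probability per motif position
        for letter, p in column.items():
            if p > 0:
                total += p * math.log2(p / background[letter])
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
perfect_column = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
print(relative_entropy([perfect_column], uniform))  # a fully conserved position
print(relative_entropy([uniform], uniform))         # a background-like position
```

Under a uniform DNA background, a fully conserved position contributes 2 bits and a position matching the background contributes 0.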
Scoring motif instances. A motif instance matches if it looks like it was generated by the weight matrix. [Figure: candidate instances such as “A C G G C G C C T” compared against the weight matrix and labeled “Not likely!”, “Hard to tell”, or “Matches weight matrix”]
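Log likelihood ratio
A motif instance matches if it looks like it was generated by the weight matrix.
Use the log likelihood ratio:
$$\sum_{i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$$
It measures how much more the instance looks like the weight matrix than like the background.
β_i: the character at position i of the instance; p is the weight matrix probability and b the background frequency.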
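A small Python sketch of this score, and of scanning a sequence with it, is shown below. The two-position `toy_matrix`, the uniform background, and the function names are made up for illustration. The code assumes every matrix entry is non-zero (for example after adding pseudocounts), since a zero would make the logarithm undefined.

```python
import math

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background


def log_likelihood_ratio(instance, matrix, background=BACKGROUND):
    """Sum over positions of log2(p[i][letter] / b[letter]) for one candidate instance."""
    return sum(math.log2(matrix[i][ch] / background[ch]) for i, ch in enumerate(instance))


def scan(sequence, matrix, background=BACKGROUND):
    """Score every window of the motif width; returns (start, score) pairs."""
    width = len(matrix)
    return [
        (start, log_likelihood_ratio(sequence[start:start + width], matrix, background))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical 2-position motif that prefers "AG".
toy_matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
best_start, best_score = max(scan("CCAGCC", toy_matrix), key=lambda pair: pair[1])
print(best_start, round(best_score, 2))
```

Positive scores mean the window is better explained by the motif than by the background; negative scores mean the opposite.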
Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation
Position-specific scoring matrix
[Figure: a PSSM with one row per amino acid and one column per motif position]
This PSSM assigns the sequence NMFWAFGH a score of 0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12.
Significance of scores Motif Scanning algorithm 45 Low score = not a motif High score = motif occurrence How high is high enough? LENENQGKCTIAEYKYDGKKASVYNSFVS
Computing a p-value. Compute the scores for all possible sequences whose length matches the motif. Use these scores to compute a p-value. The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)
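For a short motif the null score distribution can be enumerated exactly. The sketch below scores every possible sequence of the motif's width with a log likelihood ratio, weights each by its probability under the background (the null model), and sums the probability of scores at or above a threshold. The toy matrix and the function name are assumptions for illustration, and the enumeration over 4^W sequences is only practical for small W.

```python
import itertools
import math


def score_p_value(matrix, background, threshold):
    """Probability, under the background model, of a score >= threshold."""
    width = len(matrix)
    p_value = 0.0
    for letters in itertools.product(background, repeat=width):
        score = sum(math.log2(matrix[i][ch] / background[ch]) for i, ch in enumerate(letters))
        if score >= threshold:
            p_value += math.prod(background[ch] for ch in letters)
    return p_value


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Only "AG" reaches a score of 2.9 with this matrix, so the p-value is 1/16.
print(score_p_value(toy_matrix, background, threshold=2.9))
```

For longer motifs the same distribution is usually built position by position with dynamic programming rather than by full enumeration.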
Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation
Motif discovery problem Given sequences Find motif seq. 1 seq. 2 seq. 3 IGRGGFGEVY at position 515 LGEGCFGQVV at position 430 VGSGGFGQVY at position 682
Motif discovery problem Given: a sequence or family of sequences. Find: the number of motifs the width of each motif the locations of motif occurrences
Why is this hard? Input sequences are long (thousands or millions of residues). Motif may be subtle: instances are short, and instances are only slightly similar.
       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Globin motifs
Alternating approach
1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.
Examples: Gibbs Sampler (Lawrence et al.), MEME (expectation maximization / Bailey, Elkan), ANN-Spec (neural network / Workman, Stormo)
Three Ingredients of Almost any Bioinformatics Method Search space Scoring scheme Search algorithm (= optimization technique) Mathematically precise formulation of the problem Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.
Expectation-Maximization Guarantees finding a local optimum. Widely used in bioinformatics: The Baum-Welch algorithm for training HMMs is an example So is K-means clustering (e.g. used to analyze microarray data).
Expectation-maximization (EM)
foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
Sample DNA sequences >ce1cg TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC AAAAATGGAAGTCCACAGTCTTGACAG >ara GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG CTATGCCATAGCATTTTTATCCATAAG >bglr1 ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC TGTGAGCATGGTCATATTTTTATCAAT >crp CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC ACATTACCGTGCAGTACAGTTGATAGC
Motif occurrences >ce1cg taatgtttgtgctggtttttgtggcatcgggcgagaata gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC aaaaatggaagtccacagtcttgacag >ara gacaaaaacgcgtaacaaaagtgtctataatcacggcag aaaagtccacattgattaTTTGCACGGCGTCACactttg ctatgccatagcatttttatccataag >bglr1 acaaatcccaataacttaattattgggatttgttatata taactttataaattcctaaaattacacaaagttaataac TGTGAGCATGGTCATatttttatcaat >crp cacaaagcgaaagctatgctaaaacagtcaggatgctac agtaatacattgatgtactgcatgtaTGCAAAGGACGTC ACattaccgtgcagtacagttgatagc
Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…

     T     T     T     G     A    T C G T T
A  0.17  0.17  0.17  0.17  0.50  ...
C  0.17  0.17  0.17  0.17  0.17
G  0.17  0.17  0.17  0.50  0.17
T  0.50  0.50  0.50  0.17  0.17
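The starting matrix on this slide gives the observed letter probability 0.50 at each position and splits the rest evenly over the other three letters (1/6, shown as 0.17). A short Python sketch of that conversion is given below. The function name and the `match_prob` parameter are illustrative, and the 0.5 value is simply the one read off the slide.

```python
def subsequence_to_matrix(subsequence, alphabet="ACGT", match_prob=0.5):
    """One column per position: match_prob for the observed letter, the rest shared evenly."""
    other = (1.0 - match_prob) / (len(alphabet) - 1)
    return [
        {letter: (match_prob if letter == observed else other) for letter in alphabet}
        for observed in subsequence
    ]


matrix = subsequence_to_matrix("TTTGATCGTT")
for letter in "ACGT":
    # Print the first five columns, as on the slide.
    print(letter, [round(column[letter], 2) for column in matrix[:5]])
```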
Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

     T     T     T     G     A    T C G T T
A  0.17  0.17  0.17  0.17  0.50  ...
C  0.17  0.17  0.17  0.17  0.17
G  0.17  0.17  0.17  0.50  0.17
T  0.50  0.50  0.50  0.17  0.17

Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...
Scoring each subsequence
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA

Subsequences      Score
TGTGCTGGTTTTTGT   2.95
GTGCTGGTTTTTGTG   4.62
TGCTGGTTTTTGTGG   2.31
GCTGGTTTTTGTGGC   ...

Select from each sequence the subsequence with maximal score.
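The selection step can be sketched in Python as follows. The score used here is the additive one written on the slides (a sum of matrix entries along the window); a likelihood-based score would multiply the entries or add their logarithms instead. The toy matrix built from "TTG" and the function names are illustrative only.

```python
def additive_score(window, matrix):
    """Sum of the matrix entries selected by each letter of the window (as on the slides)."""
    return sum(matrix[i][ch] for i, ch in enumerate(window))


def best_subsequence(sequence, matrix):
    """Return (start, window, score) for the highest-scoring window in the sequence."""
    width = len(matrix)
    candidates = [
        (start, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
    ]
    start, window = max(candidates, key=lambda c: additive_score(c[1], matrix))
    return start, window, additive_score(window, matrix)


# Toy 3-position matrix seeded from "TTG": 0.5 for the seed letter, 1/6 otherwise.
toy_matrix = [{a: (0.5 if a == seed else 1 / 6) for a in "ACGT"} for seed in "TTG"]
print(best_subsequence("ACTTGA", toy_matrix))
```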
Re-estimating motif matrix
Occurrences
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC

Counts
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001
Adding pseudocounts Counts + Pseudocounts Counts A 111243122111151
Converting to frequencies
Counts + Pseudocounts
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

     T     T     T     G     A    T C G T T
A  0.13  0.13  0.13  0.25  0.50  ...
C  0.13  0.13  0.25  0.13  0.25
G  0.13  0.38  0.13  0.50  0.13
T  0.63  0.38  0.50  0.13  0.13
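The three slides on counting, adding pseudocounts, and converting to frequencies can be reproduced with a few lines of Python. The sketch below uses the four occurrences from the slides and a pseudocount of 1 per letter, which matches the numbers shown. The function names are illustrative.

```python
def count_matrix(occurrences, alphabet="ACGT"):
    """Per-position letter counts from equal-length motif occurrences."""
    width = len(occurrences[0])
    return [
        {letter: sum(1 for occ in occurrences if occ[i] == letter) for letter in alphabet}
        for i in range(width)
    ]


def to_frequencies(counts, pseudocount=1):
    """Add a pseudocount to every cell, then normalize each column to sum to 1."""
    frequencies = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        frequencies.append({letter: (c + pseudocount) / total for letter, c in column.items()})
    return frequencies


occurrences = ["TTTGATCGTTTTCAC", "TTTGCACGGCGTCAC", "TGTGAGCATGGTCAT", "TGCAAAGGACGTCAC"]
counts = count_matrix(occurrences)
freqs = to_frequencies(counts)
for letter in "ACGT":
    # First five columns, before rounding to two decimals as on the slide.
    print(letter, [freqs[i][letter] for i in range(5)])
```

With 4 occurrences and 4 pseudocounts each column sums to 8, so the first column gives T = 5/8 = 0.625 and 1/8 = 0.125 for the other letters, which the slide rounds to 0.63 and 0.13.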
Expectation-maximization
foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle. Solution: Associate with each start site a probability of motif occurrence.
Converting to probabilities Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA Occurrences Score Prob TGTGCTGGTTTTTGT 2.95 0.023 GTGCTGGTTTTTGTG 4.62 0.037 TGCTGGTTTTTGTGG 2.31 0.018 GCTGGTTTTTGTGGC ... ... Total 128.2 1.000
Computing weighted counts Occurrences Prob TGTGCTGGTTTTTGT 0.023 GTGCTGGTTTTTGTG 0.037 TGCTGGTTTTTGTGG 0.018 GCTGGTTTTTGTGGC ... 1 2 3 4 5 … A C G T Include counts from all subsequences, weighted by the degree to which they match the motif model.
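The soft version of the update can be sketched in Python as below: scores are normalized into probabilities over start sites (score divided by the total), and every window then adds its letters to the count matrix with that probability as its weight. The function names and the toy input are hypothetical, and this illustrates the weighting idea only, not MEME's full update.

```python
def scores_to_probs(scores):
    """Normalize non-negative scores so that they sum to 1 over all start sites."""
    total = sum(scores)
    return [score / total for score in scores]


def weighted_counts(sequence, width, probs, alphabet="ACGT"):
    """Accumulate letter counts from every window, weighted by that window's probability."""
    counts = [{letter: 0.0 for letter in alphabet} for _ in range(width)]
    for start, prob in enumerate(probs):
        for i, ch in enumerate(sequence[start:start + width]):
            counts[i][ch] += prob
    return counts


# Toy example: three windows of width 2 in "ACGT" with scores 1, 1 and 2.
probs = scores_to_probs([1, 1, 2])
counts = weighted_counts("ACGT", 2, probs)
print(probs)
print(counts)
```

Because the probabilities sum to 1 within a sequence, each column of the weighted count matrix also sums to 1 for that sequence.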
Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.

Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.

Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.

Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.

Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.

Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.

Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.
MEME algorithm
do
    for (width = min; width < max; width *= 2)
        foreach possible starting point
            run 1 iteration of EM
        select candidate starting points
        foreach candidate
            run EM to convergence
    select best motif
    erase motif occurrences
until (E-value of found motif > threshold)
Gibbs Sampling: a type of Markov chain Monte Carlo (MCMC) method
Maximization Versus Sampling We are given some huge search space. Every point Z in the search space has some score SZ defined as before. Sampling: wander around the search space in such a way that how often we visit each point is proportional to πZ=exp(SZ). Maximization: find the point with the highest πZ, a likelihood ratio value between 0 and +∞. EM does maximization and MCMC does sampling. MCMC attempts to escape local optima.
Gibbs Sampling
Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.
Suppose the search space is a 2D rectangle. (Typically, many dimensions!)
Start at a random point X.
1. Randomly pick a dimension.
2. Look at all points along this dimension. Move to one of them randomly, proportional to its score π.
Repeat.
Initialization. Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}. Example instances for sequences 1 to 5: ACAGTGT, TTAGACC, GTGACCA, ACCCAGG, CAGGTTT
Gibbs sampler
1. Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.
Steps 2 & 3 (search):
2. Throw away an instance s_i: the remaining (t - 1) instances define a weight matrix.
3. The weight matrix defines an instance probability at each position of input string S_i. Pick a new s_i according to this probability distribution.
Return the highest-scoring motif seen.
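The steps above can be sketched as a small Python program. It is a minimal illustration under several assumptions that the slides do not specify: DNA sequences over ACGT, a uniform background of 0.25, a pseudocount of 1, one held-out sequence visited in turn rather than at random, and relative entropy as the score used to remember the best motif seen. All function names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def build_matrix(instances, width, pseudocount=1.0):
    """Column-wise letter frequencies (with pseudocounts) from aligned instances."""
    matrix = []
    for i in range(width):
        column = {letter: pseudocount for letter in ALPHABET}
        for instance in instances:
            column[instance[i]] += 1
        total = sum(column.values())
        matrix.append({letter: c / total for letter, c in column.items()})
    return matrix


def relative_entropy(matrix, background=0.25):
    """Score of a motif matrix against a uniform background, in bits."""
    return sum(p * math.log2(p / background) for column in matrix for p in column.values())


def gibbs_sampler(sequences, width, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial instance (start position) in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_starts, best_score = list(starts), float("-inf")
    for _ in range(iterations):
        for held_out, seq in enumerate(sequences):
            # Step 2: drop one instance; the remaining t - 1 define the weight matrix.
            others = [
                s[p:p + width]
                for j, (s, p) in enumerate(zip(sequences, starts))
                if j != held_out
            ]
            matrix = build_matrix(others, width)
            # Step 3: likelihood ratio of every window, then sample a new start.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i, ch in enumerate(seq[start:start + width]):
                    ratio *= matrix[i][ch] / 0.25
                weights.append(ratio)
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
        # Remember the highest-scoring motif seen so far.
        instances = [s[p:p + width] for s, p in zip(sequences, starts)]
        score = relative_entropy(build_matrix(instances, width))
        if score > best_score:
            best_starts, best_score = list(starts), score
    return best_starts, best_score


seqs = ["ACGTACGTTGACCA", "TTGACCAGGGTTTA", "CCCTTGACCATTTT"]
starts, score = gibbs_sampler(seqs, width=7, iterations=20, seed=1)
print(starts, round(score, 2))
```

Because the new start is sampled rather than chosen by maximum, different seeds can end in different places; running several seeds and keeping the best score is a common precaution.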
Sampler step illustration: ACAGTGT TAGGCGT ACACCGT ??????? CAGGTTT A C G T .45 .05 .25 .65 .85 ACAGTGT TAGGCGT ACACCGT ACGCCGT CAGGTTT sequence 4 11% ACGCCGT:20% ACGGCGT:52%
Comparison Both EM and Gibbs sampling involve iterating over two steps Convergence: EM converges when the PSSM stops changing. Gibbs sampling runs until you ask it to stop. Solution: EM may not find the motif with the highest score. Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.
Comparison of motif finders
Summary Motifs are represented by weight matrices. Motif quality is measured by relative entropy. Motif occurrences are scored using log likelihood ratios. EM and the Gibbs sampler attempt to find a motif with maximal relative entropy. Both algorithms alternate between predicting instances and predicting the weight matrix.
Homework Go to UCSC genome browser to get the top 100 regions bound by CTCF Use MEME to find the binding motif of CTCF
Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation
Motif discovery problem (l,d)-k Problem: Given a sample, find all patterns of length l such that there are k occurrences of the pattern with up to d mismatches in each.
Two motif representations Matrices are richer: complete emission distributions, and position-specific. Matrix scores may correspond to binding energies. Patterns are more tractable and allow for performance guarantees. Pattern + data matrix.
Pattern-driven approach. Consider all 4^l patterns of length l. Compare each pattern to every l-mer in the data. Return all (l,d)-k patterns. Running time: O(4^l n), where n is the total length of input sequences.
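A brute-force Python sketch of the pattern-driven approach is given below. It assumes the DNA alphabet ACGT and interprets "k occurrences" as at least k l-mers in the data within d mismatches of the pattern; the function names are illustrative. The enumeration of all 4^l patterns makes it usable only for small l.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pattern_driven(sequences, l, d, k):
    """Return every pattern of length l with at least k occurrences within d mismatches."""
    lmers = [s[i:i + l] for s in sequences for i in range(len(s) - l + 1)]
    found = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        occurrences = sum(1 for lmer in lmers if hamming(pattern, lmer) <= d)
        if occurrences >= k:
            found.append(pattern)
    return found


# Toy data: three 4-mers that are all within 1 mismatch of ACGT.
print(pattern_driven(["ACGT", "ACGA", "TCGT"], l=4, d=1, k=3))
```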
Sample-driven approach. Only consider l-mers that occur in the data, plus their neighbors. Efficient, but requires a hash table of size 4^l. Suggested by Waterman [1984]; improved upon by Sagot [1998] and Pavesi [2001].
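WINNOWER Condition: if two instances each differ from the pattern by at most d mismatches, they differ from each other by at most 2d mismatches. [Figure: a pattern with two instances, distances labeled d and 2d] Pevzner and Sze, 2001.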
WINNOWER Condition A G C T A C A A d 2d 2d A G A T G C C A d d 2d A C
Graph constructed by WINNOWER For (15,4)-signal, we connect all words with distance at most 8: atgaccgggatactgatAgAAgAAAGGttGGGtataatggagtacgataa atgacttcAAtAAAAcGGcGGGtgctctcccgattttgagtatccctggg gcaatcgcgaaccaagctgagaattggatgtcAAAAtAAtGGaGtGGcac gtcaatcgaaaaaacggtggaggatttcAAAAAAAGGGattGgaccgctt real signals signal edges spurious signals spurious edges
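The graph construction can be sketched in Python as follows: every l-mer in the data is a vertex, and two vertices are joined when their Hamming distance is at most 2d (8 for a (15,4)-signal). Restricting edges to l-mers from different sequences is an assumption of this sketch, as are the function names; the quadratic pairwise comparison is written for clarity, not speed.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def winnower_graph(sequences, l, d):
    """Vertices are (sequence index, position, l-mer); edges join l-mers within 2d mismatches."""
    vertices = [
        (seq_index, pos, seq[pos:pos + l])
        for seq_index, seq in enumerate(sequences)
        for pos in range(len(seq) - l + 1)
    ]
    edges = []
    for a in range(len(vertices)):
        for b in range(a + 1, len(vertices)):
            seq_a, _, word_a = vertices[a]
            seq_b, _, word_b = vertices[b]
            # Assumption: only words from different sequences are connected.
            if seq_a != seq_b and hamming(word_a, word_b) <= 2 * d:
                edges.append((a, b))
    return vertices, edges


vertices, edges = winnower_graph(["AAAA", "AAAT", "TTTT"], l=4, d=1)
print(len(vertices), edges)
```

A motif with one instance per sequence then appears as a clique with one vertex from each sequence, which is what the pruning slides try to isolate.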
Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A
Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T A T C A A G C T G C C A A G C T T A A A
Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T A T C A A G C T G C C A Pruning Preserves Cliques.
Pattern Graph Pruning (k=4) C T A C A A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A
Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A A G C T T A A A
Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A
Pattern Graph Pruning (k=4) Empty graph = no valid patterns.
Pattern Graph Pruning (k=4) C T A C A A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A
Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A T A Cannot prune any more edges.
MITRA Framework (MIsmatch TRee Approach). Eskin and Pevzner, 2002.
Split pattern space: 4 easier subproblems. Removes many edges. Perform WINNOWER in each pattern space.
Example: the space of patterns of length 12
? ? ? ? ? ? ? ? ? ? ? ?
can be split into 4 pattern subspaces:
A ? ? ? ? ? ? ? ? ? ? ?
C ? ? ? ? ? ? ? ? ? ? ?
G ? ? ? ? ? ? ? ? ? ? ?
T ? ? ? ? ? ? ? ? ? ? ?
MITRA-Graph WINNOWER style graph in each subspace. For edge to exist, three conditions: m1 A G C T A C A T t A G A ? ? ? ? ? m2 A C G A A T A C
MITRA-Graph Algorithm
[Flowchart] Initial Pattern Graph → Prune Edges → Is Pattern Space Empty? (yes: Done; no: continue) → Is Pattern Complete? (yes: Output Pattern; no: Split Pattern Space into A, C, G, T) → New Pattern Spaces → Prune Edges ...
Mismatch Tree Approach (MITRA-Count, MITRA-Graph) Hard Problem. Worst case takes exponential time. MITRA Benefits: Guaranteed to find all patterns. Prune uninteresting portions of search space. Easily Parallelizable. Detects long patterns. Minimal memory requirements for MITRA-Counts. Drawbacks: Computational overhead of data structure. MITRA-Graph needs to store edges (memory).