Genes and Regulatory Elements


Genes and Regulatory Elements Zhiping Weng U Mass Medical School

ENCODE: ENCyclopedia Of DNA Elements (The ENCODE Project Consortium, Science 2004, Nature 2007). Goal: Identify all functional elements in the human genome. Pilot phase: 1% of the genome is being annotated very extensively (30 Mb of sequence). Now genome-wide. [Figure: locations of the ENCODE pilot regions (m001 to m014, r111 to r334) on chromosomes 1 to 22, X and Y]

The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project Science, Vol 306, 636-640.

Gene RNA-seq

Epigenomics

Regulatory Elements chrX:129,143,739-129,661,948

The human genome 45% repetitive DNA 53% Unique and segmental duplicated DNA 2% genes (25,000) Where are the gene regulatory elements? G. Crawford

DNase hypersensitive (HS) sites identify active gene regulatory elements DNase I HS sites Regions hypersensitive to DNase Promoters Enhancers Silencers Insulators Locus control regions Meiotic recombination hotspots HS sites identify “open” regions of chromatin Crawford et al., Nature Methods 2006

DNase-chip to identify DNase HS sites or sequence directly. Crawford et al., Nature Methods 2006

Arrays used for DNase-chip NimbleGen arrays 385,000 50-mer oligos oligos spaced every 38 bases (12 base overlap) non-repetitive unique regions 1% of the genome (44 ENCODE regions) Crawford et al., Nature Methods 2006

DNase-chip Quality Assessment Identification of DNase HS sites from 6 cell types. (a) Representative DNase-chip data from ENCODE region ENr231 (chr1:148,050,000-148,165,000). Note how there are common, ubiquitous and cell-line specific DNase HS sites. (b) DNase-chip from K562 cells identifies all five locus control region (LCR) DNase HS sites upstream of the β-globin locus, as well as the 3’ DNase HS site (in red). There are additional DNase HS sites identified around promoter regions of the globin genes. Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J, Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)

DNase HS site (DHS) statistics in 6 cell lines
                                     CD4    GM06990  K562   H9     HeLa   IMR90
# of DNase I chip hits identified    1262   1098     1210   1274   1042   1244
# of proximal DHS                    784    674      666    773    575    642
# of distal DHS                      478    424      538    498    464    599
# of DHS containing a TSS            402    352      357    432    318    362
# of cell line specific DHS          439    341      448    400    277    463
Specificity based on TSS-depleted ENCODE regions: >99.9% 99.5% 99.8% 99.7% [fewer than six values survive; columns not assigned]
Specificity based on “gold standard” negative DHS (n=134): 97% 99% 96% 95% 92% [fewer than six values survive; columns not assigned]
# of ubiquitous DHS: 262 266 273 259 [fewer than six values survive; columns not assigned]

Unique, common, and ubiquitous DNase HS sites GM CD4 HeLa H9 K562 IMR90 Ubiquitous HS sites 20% Cell-type specific and Common HS sites 80% Collectively, the DHS cover 8.3% of the ENCODE regions.

Have we reached saturation in identifying most DNase HS sites?

CpG content of DNase HS sites

Ubiquitous DNase HS sites are enriched for promoters (TSS) Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS). What about ubiquitous distal DNase HS sites?

Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF ChIP

Chromatin-immunoprecipitation (ChIP) - chip: antibody against CTCF, tiling array. Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome Genome Research (2005). Direct sequencing → ChIP-seq

The H19/IGF2 Locus is well insulated

DNase HS sites identify insulator in the Hox locus

Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.

CTCF motif sites are conserved

CTCF sites make up a greater % of ubiquitous distal DNase HS sites than enhancers

Ubiquitous DNase HS sites are enriched for promoters (TSS) Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Ubiquitous proximal DNase HS sites

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Antibody against histone modification Tiling array Sequencing

Enrichment between tissue-specific H3K4me2 and DNase HS sites

Cell type-specific DNase HS sites correlate with cell type-specific histone modifications Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.

Cell type-specific DNase HS sites correlate with cell type-specific enhancers

Cell type-specific DNase HS sites correlate with cell type-specific gene expression

Use cell-type specific DNase HS sites for discovery

Transcriptional Motifs Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements

Finding enriched motifs in tissue-specific DNase HS sites Screen against a motif library, e.g., JASPAR or TRANSFAC STAT DHS #1 DHS #2 DHS #3 DHS #4 DHS #5 the Clover algorithm Myc/Max YY1 (etc.)

JASPAR: a database of transcription factor motifs

Clover: Cis-eLement OVERrepresentation Myc/Max DHS sequences 17.3 Raw score

The Clover Algorithm Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:1372-1381.
Clover raw score, symbols:
L_k: nucleotide at position k
W: motif width
S: a promoter sequence
M_s: number of motif locations in a sequence
A: all possibilities of choosing a subset of sequences
N: the total number of promoter sequences

Clover: Cis-eLement OVERrepresentation Myc/Max Control DNA sequences DHS sequences 4.2 6.6 18 9.1 17.3 Raw score P-value = 1/4
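The P-value on this slide can be read as an empirical P-value: the fraction of control sequence sets whose raw score is at least as high as the raw score of the DHS sequences. The sketch below assumes exactly that reading and uses the five raw scores shown on the slide; the function name is hypothetical, and the real Clover program draws many more control sets than four.

```python
def empirical_p_value(observed_score, control_scores):
    """Fraction of control sets scoring at least as high as the observed set."""
    hits = sum(1 for score in control_scores if score >= observed_score)
    return hits / len(control_scores)

# Raw scores from the slide: four control sets, and 17.3 for the DHS sequences.
control_scores = [4.2, 6.6, 18, 9.1]
print(empirical_p_value(17.3, control_scores))
```

Only one control set (raw score 18) reaches the observed 17.3, which gives the 1/4 shown on the slide.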

Motifs enriched in cell-type specific DNase HS sites


Genome-wide DNase-chip and DNase-sequencing data CD4 cells 23 k proximal DNaseI HS sites 72 k distal DNaseI HS sites

Enriched transcription factor binding motifs in distal DNaseI HS sites Hematopoietic system: TAL1 AML PU.1 C/EBPα Immune system: STAT1, STAT3, STAT5 IRF1, IRF3 and IRF5

Identify motif clusters (modules) Distal DHS sequences acgtcggctgacaccaggtctgcttgattcgatgagattgaattcgtaggagctggattagag ggcttggggcttgaggcttgacaccatatcgtagcgctgagttgctgagtttcgtatggcgct cgatgcttattagcggctattataggctagctaggcaatacacatcgctgatatagcggctta tgagatagcgtgctagctatatggattggaatattcggcgctgaaaggtcttagctagtcgta aatatatgcgcgtatgcgtatggcgggtatatgggggcttggtcttttttttcgcttaggtcg Find motif clusters in the human genome Enriched motifs

Finding motif clusters with a hidden Markov model: Cluster-Buster. [Figure: score versus location in DNA; red = motif type 1 (e.g. TAL1), blue = motif type 2 (e.g. ETS); diagram values 0.8, 0.1, 0.1] MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):3666-8. http://zlab.bu.edu/cluster-buster/

Overlap between predicted motif clusters and distal DNase HS sites. $\text{Enrichment of the overlap} = \dfrac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}}$ [Figure labels: Predicted motif clusters, Cutoff, DNase HS sites]
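The enrichment on this slide is the observed overlap divided by the overlap expected if DHS and motif clusters were placed independently in the sequence space. A minimal sketch, assuming all four quantities are measured in base pairs (the function and parameter names are hypothetical):

```python
def overlap_enrichment(overlap_bp, sequence_space_bp, dhs_bp, cluster_bp):
    """Observed overlap divided by the overlap expected under independence."""
    expected_overlap = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected_overlap

# Toy numbers: 100 bp of DHS and 100 bp of clusters in a 1000 bp space
# would overlap by 10 bp by chance, so a 50 bp overlap is a 5-fold enrichment.
print(overlap_enrichment(50, 1000, 100, 100))
```

An enrichment of 1 means the overlap is what chance alone would produce.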

Motif clusters can predict distal DNase HS sites genome-wide

Summary DNase HS sites identified from 6 cell types Cell-type specific Common Ubiquitous (found in all cell types studied) Ubiquitous DNase HS sites are likely to function as… Promoters (TSS) Insulators (CTCF) (no enhancers?) Ubiquitous sites indicative of housekeeping chromatin structure Cell-type specific DNase HS sites Correlate with histone modifications in a cell type-specific manner Correlate with gene expression in a cell type-specific manner Correlate with enhancer elements in a cell type-specific manner Contain cell type-specific motifs Motif clusters can predict DNase HS sites genome-wide

Motif Finding Many Slides by Bill Noble @ UW

Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation

What is a “Motif”? Generally, a recurring pattern, e.g. Sequence motif Structure motif Network motif More specifically, a set of similar substrings, within a family of diverged sequences. Protein sequence motifs DNA sequence motifs

Example motif

Motif in Logos Format

Gene 3’-Processing Signals RNA A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA. JH Graber et al. (2002) Nucleic Acids Research 30(8):1851-8.

Splice site motif in logo format weblogo.berkeley.edu

Exonic Splicing Enhancers These motifs occur within exons and enhance splicing of introns from mRNA. Letter height indicates its frequency at that position. Fairbrother WG et al. (2002) Science 297(5583):1007-13

Transcription Factor Binding Sites [Figure labels: Estrogen Receptor, Transcription start, DNA, ERE (estrogen response element)]
Gene          ERE Sequence
Efp           … a g g g t c a t g g t g a c c c t …
TERT          … t t g g t c a g g c t g a t c t c …
Oxytocin      … g c g g t g a c c t t g a c c c c …
Lactoferrin   … c a g g t c a a g g c g a t c t t …
Angiotensin   … t a g g g c a t c g t g a c c c g …
VEGF          … a t a a t c a g a c t g a c t g g …

Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation

Weight matrix Probabilistic model: How likely is each letter at each motif position? [Figure: weight matrix with rows A, C, G, T and columns 1 to 9; cell values not recoverable from the extracted text]

A. K. A. Weight matrices are also known as Position-specific scoring matrices Position-specific probability matrices Position-specific weight matrices

Scoring a motif model A motif is interesting if it is very different from the background distribution. [Figure: the same weight matrix (rows A, C, G, T; columns 1 to 9) with columns marked "less interesting" and "more interesting"; cell values not recoverable from the extracted text]

Relative entropy A motif is interesting if it is very different from the background distribution. Use relative entropy*: $\sum_{i} \sum_{\beta} p_{i,\beta} \log \frac{p_{i,\beta}}{b_{\beta}}$, where $p_{i,\beta}$ = probability of $\beta$ in matrix position $i$ and $b_{\beta}$ = background frequency of $\beta$ (in non-motif sequence). * Relative entropy is sometimes called information content.
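A small sketch of the relative entropy of a weight matrix against a background distribution. The base-2 logarithm (so the result is in bits), the list-of-dicts matrix layout, and the function name are assumptions made for illustration.

```python
import math

def relative_entropy(matrix, background):
    """Sum over positions and letters of p * log2(p / b); zero-probability cells add 0."""
    total = 0.0
    for column in matrix:
        for letter, p in column.items():
            if p > 0:
                total += p * math.log2(p / background[letter])
    return total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# One fully conserved column (always A) and one column identical to the background.
toy_matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(relative_entropy(toy_matrix, uniform))
```

A column identical to the background contributes nothing, while a fully conserved column contributes the maximum of 2 bits under a uniform DNA background.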

Scoring motif instances A motif instance matches if it looks like it was generated by the weight matrix. Example instance: “ACGGCGCCT”. [Figure: the weight matrix (rows A, C, G, T; columns 1 to 9) with positions annotated "Not likely!", "Hard to tell" and "Matches weight matrix"; cell values not recoverable from the extracted text]

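Log likelihood ratio A motif instance matches if it looks like it was generated by the weight matrix. Use the log likelihood ratio $\sum_{i} \log \frac{p_{i,\beta_i}}{b_{\beta_i}}$, where $\beta_i$ is the character at position $i$ of the instance, $p$ is the weight matrix and $b$ is the background frequency. It measures how much more the instance looks like the weight matrix than like the background.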
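The log likelihood ratio of a candidate instance can be sketched as follows. The matrix layout (one dict per position), the base-2 logarithm, and the toy numbers are assumptions; every matrix entry is taken to be nonzero, which is what pseudocounts ensure in practice.

```python
import math

def log_likelihood_ratio(instance, matrix, background):
    """Sum over positions of log2(matrix probability / background probability)."""
    return sum(
        math.log2(matrix[i][letter] / background[letter])
        for i, letter in enumerate(instance)
    )

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical two-column matrix that prefers A at both positions.
toy_matrix = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
]
print(log_likelihood_ratio("AA", toy_matrix, uniform))  # looks like the motif
print(log_likelihood_ratio("GG", toy_matrix, uniform))  # looks like background
```

Positive values mean the instance is better explained by the weight matrix, negative values that it is better explained by the background.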

Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation

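Position-specific scoring matrix [Figure: PSSM with one row per amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) and integer scores; cell values not recoverable from the extracted text] This PSSM assigns the sequence NMFWAFGH a score of 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12.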

Significance of scores [Figure: a sequence such as LENENQGKCTIAEYKYDGKKASVYNSFVS goes into the motif scanning algorithm, which returns a score such as 45] Low score = not a motif. High score = motif occurrence. How high is high enough?

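Computing a p-value Compute the scores for all possible sequences whose length matches the motif, and use the resulting score distribution to compute a p-value. The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null), i.e. the probability, under the null (background) model, of a score at least as extreme as the one observed.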
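For an integer-valued PSSM the score distribution over all sequences of the motif's length does not need explicit enumeration: it can be built one column at a time. The sketch below assumes an independent-letter background and counts scores greater than or equal to the threshold; the toy PSSM and the function names are hypothetical.

```python
from collections import defaultdict

def score_distribution(pssm, background):
    """Probability of every achievable total score under the background model."""
    dist = {0: 1.0}
    for column in pssm:
        updated = defaultdict(float)
        for score, prob in dist.items():
            for letter, letter_score in column.items():
                updated[score + letter_score] += prob * background[letter]
        dist = dict(updated)
    return dist

def p_value(pssm, background, threshold):
    """Pr(score >= threshold | background)."""
    dist = score_distribution(pssm, background)
    return sum(prob for score, prob in dist.items() if score >= threshold)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_pssm = [{"A": 2, "C": -1, "G": -1, "T": -1}] * 2  # two identical columns
print(p_value(toy_pssm, uniform, 4))
```

The work grows with the motif width times the number of distinct scores, rather than with the number of possible sequences.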

Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation

Motif discovery problem Given sequences Find motif seq. 1 seq. 2 seq. 3 IGRGGFGEVY at position 515 LGEGCFGQVV at position 430 VGSGGFGQVY at position 682

Motif discovery problem Given: a sequence or family of sequences. Find: the number of motifs the width of each motif the locations of motif occurrences

Why is this hard? Input sequences are long (thousands or millions of residues). The motif may be subtle: instances are short, and instances are only slightly similar.

Globin motifs
       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Alternating approach: (1) Guess an initial weight matrix. (2) Use the weight matrix to predict instances in the input sequences. (3) Use the instances to predict a weight matrix. (4) Repeat 2 & 3 until satisfied. Examples: Gibbs Sampler (Lawrence et al.), MEME (expectation maximization / Bailey, Elkan), ANN-Spec (neural network / Workman, Stormo)

Three Ingredients of Almost any Bioinformatics Method Search space Scoring scheme Search algorithm (= optimization technique) Mathematically precise formulation of the problem Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.

Expectation-Maximization Guarantees finding a local optimum. Widely used in bioinformatics: The Baum-Welch algorithm for training HMMs is an example So is K-means clustering (e.g. used to analyze microarray data).

Expectation-maximization (EM)
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score

Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences (upper case)
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc

Starting point: …gactgttttTTTGATCGTTTTCACaaaaatgg…
Subsequence begins T T T G A T C G T T; the first five matrix columns are shown.
     T     T     T     G     A     ...
A    0.17  0.17  0.17  0.17  0.50  ...
C    0.17  0.17  0.17  0.17  0.17
G    0.17  0.17  0.17  0.50  0.17
T    0.50  0.50  0.50  0.17  0.17
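The starting matrix on this slide gives the observed letter probability 0.50 at each position and splits the remaining 0.50 evenly over the other three letters (0.17 each). A sketch of that conversion, with a hypothetical function name and the 0.50 exposed as a parameter:

```python
def seed_matrix(subsequence, match_prob=0.5, alphabet="ACGT"):
    """One column per letter: match_prob for the observed letter, the rest shared equally."""
    other = (1.0 - match_prob) / (len(alphabet) - 1)
    return [
        {letter: (match_prob if letter == observed else other) for letter in alphabet}
        for observed in subsequence.upper()
    ]

matrix = seed_matrix("TTTGATCGTTTTCAC")
for letter in "ACGT":
    print(letter, [f"{column[letter]:.2f}" for column in matrix[:5]])
```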

Re-estimating motif occurrences. Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
     T     T     T     G     A     ...
A    0.17  0.17  0.17  0.17  0.50  ...
C    0.17  0.17  0.17  0.17  0.17
G    0.17  0.17  0.17  0.50  0.17
T    0.50  0.50  0.50  0.17  0.17
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Scoring each subsequence. Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Subsequence       Score
TGTGCTGGTTTTTGT   2.95
GTGCTGGTTTTTGTG   4.62
TGCTGGTTTTTGTGG   2.31
GCTGGTTTTTGTGGC   ...
Select from each sequence the subsequence with maximal score.
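Scoring every window of a sequence and keeping the best one can be sketched as below. The score follows the slides' simplified convention of adding the matrix entries of the observed letters (a likelihood would multiply them instead); the three-column matrix and the function names are hypothetical.

```python
def score_window(window, matrix):
    """Slide-style score: add the matrix entry of each observed letter."""
    return sum(matrix[i][letter] for i, letter in enumerate(window))

def best_window(sequence, matrix):
    """Return (score, start, window) for the highest-scoring window, or None."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(window, matrix)
        if best is None or score > best[0]:
            best = (score, start, window)
    return best

# Hypothetical matrix seeded from the subsequence "TTG".
other = 0.5 / 3
toy_matrix = [
    {"A": other, "C": other, "G": other, "T": 0.5},
    {"A": other, "C": other, "G": other, "T": 0.5},
    {"A": other, "C": other, "G": 0.5, "T": other},
]
print(best_window("ACTTGA", toy_matrix))
```

Ties are resolved in favor of the leftmost window because the comparison is strict.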

Re-estimating motif matrix
Occurrences: TTTGATCGTTTTCAC TTTGCACGGCGTCAC TGTGAGCATGGTCAT TGCAAAGGACGTCAC
Counts (one digit per motif position):
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Adding pseudocounts Counts + Pseudocounts Counts A 111243122111151

Converting to frequencies
Counts + Pseudocounts (one digit per motif position):
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112
Frequencies (first five columns):
     T     T     T     G     A     ...
A    0.13  0.13  0.13  0.25  0.50  ...
C    0.13  0.13  0.25  0.13  0.25
G    0.13  0.38  0.13  0.50  0.13
T    0.63  0.38  0.50  0.13  0.13
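The three steps on these slides (count letters per position, add a pseudocount of 1, divide by the column total) can be reproduced from the four occurrences listed earlier. The function names are hypothetical; the pseudocount of 1 matches the digits shown on the slide.

```python
def count_matrix(occurrences, alphabet="ACGT"):
    """Per-letter list of counts, one entry per motif position."""
    width = len(occurrences[0])
    return {
        letter: [sum(1 for occ in occurrences if occ[i] == letter) for i in range(width)]
        for letter in alphabet
    }

def to_frequencies(counts, pseudocount=1):
    """Add the pseudocount to every cell, then normalize each column to sum to 1."""
    letters = list(counts)
    width = len(counts[letters[0]])
    freqs = {letter: [] for letter in letters}
    for i in range(width):
        column_total = sum(counts[letter][i] + pseudocount for letter in letters)
        for letter in letters:
            freqs[letter].append((counts[letter][i] + pseudocount) / column_total)
    return freqs

occurrences = ["TTTGATCGTTTTCAC", "TTTGCACGGCGTCAC", "TGTGAGCATGGTCAT", "TGCAAAGGACGTCAC"]
counts = count_matrix(occurrences)
freqs = to_frequencies(counts)
for letter in "ACGT":
    print(letter, "".join(str(c) for c in counts[letter]), freqs[letter][:5])
```

With four occurrences and four letters every column total is 8, so the first T entry is 5/8 = 0.625, shown as 0.63 on the slide.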

Expectation-maximization
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score

Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle. Solution: Associate with each start site a probability of motif occurrence.

Converting to probabilities. Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Occurrence        Score   Prob
TGTGCTGGTTTTTGT   2.95    0.023
GTGCTGGTTTTTGTG   4.62    0.037
TGCTGGTTTTTGTGG   2.31    0.018
GCTGGTTTTTGTGGC   ...     ...
Total             128.2   1.000

Computing weighted counts Occurrences Prob TGTGCTGGTTTTTGT 0.023 GTGCTGGTTTTTGTG 0.037 TGCTGGTTTTTGTGG 0.018 GCTGGTTTTTGTGGC ... 1 2 3 4 5 … A C G T Include counts from all subsequences, weighted by the degree to which they match the motif model.
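Instead of counting only the best window, every window contributes to the counts in proportion to its probability. A minimal sketch, assuming the probabilities are obtained by dividing each window's score by the total score as on the previous slide (function names and toy windows are hypothetical):

```python
def normalize(scores):
    """Turn non-negative scores into probabilities that sum to 1."""
    total = sum(scores)
    return [score / total for score in scores]

def weighted_counts(windows, probs, alphabet="ACGT"):
    """Each window adds its probability (not 1) to the cell of each of its letters."""
    width = len(windows[0])
    counts = {letter: [0.0] * width for letter in alphabet}
    for window, prob in zip(windows, probs):
        for i, letter in enumerate(window):
            counts[letter][i] += prob
    return counts

windows = ["AC", "CA"]
probs = normalize([3.0, 1.0])
print(weighted_counts(windows, probs))
```

When one window carries all the probability this reduces to the hard counts of the earlier slides.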


Problem: How do we estimate counts accurately when we have only a few examples? Solution: Use Dirichlet mixture priors.
Problem: Too many possible starting points. Solution: Save time by running only 1 iteration of EM at first.
Problem: Too many possible widths. Solution: Consider widths that vary by a factor of √2 and adjust motifs afterwards.
Problem: Algorithm assumes exactly one motif occurrence per sequence. Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.
Problem: The EM algorithm finds only one motif. Solution: Probabilistically erase the motif from the data set, and repeat.
Problem: The motif model is too simplistic. Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.
Problem: The EM algorithm does not tell you how many motifs there are. Solution: Compute statistical significance of motifs and stop when they are no longer significant.

MEME algorithm
do
  for (width = min; width < max; width *= √2)
    foreach possible starting point
      run 1 iteration of EM
    select candidate starting points
    foreach candidate
      run EM to convergence
  select best motif
  erase motif occurrences
until (E-value of found motif > threshold)

Gibbs Sampling: a type of Markov chain Monte Carlo (MCMC) method

Maximization Versus Sampling We are given some huge search space. Every point Z in the search space has some score S_Z defined as before. Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z). Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞. EM does maximization and MCMC does sampling. MCMC attempts to escape local optima.

Gibbs Sampling Use a Markov chain to wander around the search space: if we are at point X, move to point Y with probability M_XY. Suppose the search space is a 2D rectangle. (Typically, many dimensions!) (1) Start at a random point X. (2) Randomly pick a dimension. (3) Look at all points along this dimension. (4) Move to one of them randomly, proportional to its score π. (5) Repeat.

Initialization Randomly guess an instance si from each of t input sequences {S1, ..., St}. sequence 1 ACAGTGT TTAGACC GTGACCA ACCCAGG CAGGTTT sequence 2 sequence 3 sequence 4 sequence 5

Gibbs sampler Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}. Steps 2 & 3 (search): Throw away an instance si: remaining (t - 1) instances define weight matrix. Weight matrix defines instance probability at each position of input string Si Pick new si according to probability distribution Return highest-scoring motif seen
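The steps above can be sketched as a compact site sampler. This is an illustration under stated assumptions, not the published Gibbs sampler of Lawrence et al.: one occurrence per sequence, a pseudocount of 1, no background model, and the log probability of the current instances under their own profile as the score used to remember the best configuration. All function names are hypothetical.

```python
import math
import random

def profile(instances, width, alphabet="ACGT", pseudocount=1.0):
    """Column-wise letter frequencies of the instances, with pseudocounts."""
    columns = []
    for i in range(width):
        counts = {letter: pseudocount for letter in alphabet}
        for inst in instances:
            counts[inst[i]] += 1
        total = sum(counts.values())
        columns.append({letter: counts[letter] / total for letter in alphabet})
    return columns

def window_prob(window, columns):
    """Probability of a window under the profile."""
    prob = 1.0
    for i, letter in enumerate(window):
        prob *= columns[i][letter]
    return prob

def gibbs_sampler(seqs, width, iterations=1000, seed=0):
    rng = random.Random(seed)
    # Initialization: a random instance in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best_starts, best_score = list(starts), float("-inf")
    for _ in range(iterations):
        k = rng.randrange(len(seqs))  # throw away the instance of sequence k
        others = [s[p:p + width] for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
        columns = profile(others, width)
        # Probability of every possible start position in sequence k.
        weights = [window_prob(seqs[k][p:p + width], columns)
                   for p in range(len(seqs[k]) - width + 1)]
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
        # Remember the highest-scoring configuration seen so far.
        current = [s[p:p + width] for s, p in zip(seqs, starts)]
        full = profile(current, width)
        score = sum(math.log(window_prob(inst, full)) for inst in current)
        if score > best_score:
            best_score, best_starts = score, list(starts)
    return best_starts, best_score

seqs = ["ACGTACGTTTGACA", "TTGACAGGGCCCAA", "CCCTTGACATTTGG"]
starts, score = gibbs_sampler(seqs, width=6, iterations=200, seed=1)
print([s[p:p + 6] for s, p in zip(seqs, starts)], score)
```

Because each new position is sampled rather than maximized, different seeds can return different instances; running longer or from several seeds is the usual remedy.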

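Sampler step illustration: instances ACAGTGT, TAGGCGT, ACACCGT, ??????? (sequence 4, thrown away), CAGGTTT. The weight matrix built from the remaining instances assigns a probability to each candidate instance in sequence 4 (e.g. ACGCCGT: 20%, ACGGCGT: 52%), and a new instance is drawn from that distribution (here ACGCCGT). [Figure: weight matrix with rows A, C, G, T; cell values not recoverable from the extracted text]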

Comparison Both EM and Gibbs sampling involve iterating over two steps Convergence: EM converges when the PSSM stops changing. Gibbs sampling runs until you ask it to stop. Solution: EM may not find the motif with the highest score. Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Comparison of motif finders

Summary Motifs are represented by weight matrices. Motif quality is measured by relative entropy. Motif occurrences are scored using log likelihood ratios. EM and the Gibbs sampler attempt to find a motif with maximal relative entropy. Both algorithms alternate between predicting instances and predicting the weight matrix.

Homework Go to UCSC genome browser to get the top 100 regions bound by CTCF Use MEME to find the binding motif of CTCF

Outline What is a sequence motif? Weight matrix representation Motif search Motif discovery Expectation-maximization Gibbs sampling Patterns-with-mismatches representation

Motif discovery problem (l,d)-k Problem: Given a sample, find all patterns of length l such that there are k occurrences of the pattern with up to d mismatches in each.

Two motif representations Matrices are richer: complete emission distributions, and position-specific. Matrix scores may correspond to binding energies. Patterns are more tractable and allow for performance guarantees. Pattern + data → matrix.

Pattern-driven approach Consider all 4^l patterns of length l. Compare each pattern to every l-mer in the data. Return all (l,d)-k patterns. Running time: O(4^l n), where n is the total length of input sequences.
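A brute-force sketch of the pattern-driven approach: enumerate all 4^l patterns and count the l-mers in the data within d mismatches of each. Reading "k occurrences" as "at least k" is an assumption, as are the function names; the enumeration is only practical for small l.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pattern_driven(seqs, l, d, k, alphabet="ACGT"):
    """All patterns of length l with at least k l-mers in the data within d mismatches."""
    lmers = [s[i:i + l] for s in seqs for i in range(len(s) - l + 1)]
    found = []
    for letters in product(alphabet, repeat=l):
        pattern = "".join(letters)
        occurrences = sum(1 for word in lmers if hamming(pattern, word) <= d)
        if occurrences >= k:
            found.append(pattern)
    return found

print(pattern_driven(["ACGT", "ACGA", "TCGT"], l=4, d=1, k=3))
```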

Sample-driven approach Only consider l-mers that occur in the data, plus their neighbors. Efficient, but requires a hash table of size 4^l. Suggested by Waterman [1984]; improved upon by Sagot [1998] and Pavesi [2001].

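WINNOWER Condition: if two instances each differ from the pattern in at most d positions, they differ from each other in at most 2d positions. [Figure: a pattern and two of its instances, labeled with distances d and 2d] Pevzner and Sze, 2001.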

WINNOWER Condition A G C T A C A A d 2d 2d A G A T G C C A d d 2d A C

Graph constructed by WINNOWER For (15,4)-signal, we connect all words with distance at most 8: atgaccgggatactgatAgAAgAAAGGttGGGtataatggagtacgataa atgacttcAAtAAAAcGGcGGGtgctctcccgattttgagtatccctggg gcaatcgcgaaccaagctgagaattggatgtcAAAAtAAtGGaGtGGcac gtcaatcgaaaaaacggtggaggatttcAAAAAAAGGGattGgaccgctt real signals signal edges spurious signals spurious edges
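The graph on this slide can be sketched as follows: every l-mer is a vertex, and two vertices are joined when their Hamming distance is at most 2d (8 for a (15,4)-signal). Connecting only l-mers that come from different sequences is an assumption of this sketch, as are the function names; the quadratic pairwise comparison is for illustration only.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def winnower_graph(seqs, l, d):
    """Vertices are (sequence index, position, l-mer); edges join l-mers within 2d."""
    vertices = [
        (si, pos, s[pos:pos + l])
        for si, s in enumerate(seqs)
        for pos in range(len(s) - l + 1)
    ]
    edges = []
    for u, v in combinations(vertices, 2):
        # Assumption: only words from different sequences are connected.
        if u[0] != v[0] and hamming(u[2], v[2]) <= 2 * d:
            edges.append((u[:2], v[:2]))
    return vertices, edges

vertices, edges = winnower_graph(["AAAA", "AAAT", "TTTT"], l=4, d=1)
print(edges)
```

Instances of the same pattern then form a clique with one vertex per sequence, which is the structure the pruning slides search for.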

Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A

Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T A T C A A G C T G C C A A G C T T A A A

Pattern Graph Pruning (k=4) C T A C A A A G C T A C C A A G C T A T C A A G C T G C C A Pruning Preserves Cliques.

Pattern Graph Pruning (k=4) C T A C A A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A

Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A A G C T T A A A

Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A

Pattern Graph Pruning (k=4) Empty graph = no valid patterns.

Pattern Graph Pruning (k=4) C T A C A A A G C T T T A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A A A A G C T T A T A

Pattern Graph Pruning (k=4) C T A C A A A G C T A T C A A G C T G C C A A G C T T A A A A G C T T A T A Cannot prune any more edges.

MITRA Framework (MIsmatch TRee Approach) Eskin and Pevzner, 2002. Split the pattern space into 4 easier subproblems. Removes many edges. Perform WINNOWER in each pattern space. Example: the space of patterns of length 12 (? ? ? ? ? ? ? ? ? ? ? ?) can be split into 4 pattern subspaces:
A ? ? ? ? ? ? ? ? ? ? ?
C ? ? ? ? ? ? ? ? ? ? ?
G ? ? ? ? ? ? ? ? ? ? ?
T ? ? ? ? ? ? ? ? ? ? ?

MITRA-Graph WINNOWER style graph in each subspace. For edge to exist, three conditions: m1 A G C T A C A T t A G A ? ? ? ? ? m2 A C G A A T A C

MITRA-Graph Algorithm Initial Pattern Graph yes Prune Edges Is Pattern Space Empty? Done no yes Output Pattern Is Pattern Complete? no A New Pattern Spaces C Split Pattern Space G T

Mismatch Tree Approach (MITRA-Count, MITRA-Graph) Hard problem: the worst case takes exponential time. MITRA benefits: Guaranteed to find all patterns. Prunes uninteresting portions of the search space. Easily parallelizable. Detects long patterns. Minimal memory requirements for MITRA-Count. Drawbacks: Computational overhead of the data structure. MITRA-Graph needs to store edges (memory).