Genes and Regulatory Elements

Zhiping Weng U Mass Medical School

ENCODE: ENCyclopedia Of DNA Elements
(The ENCODE Project Consortium, Science 2004, Nature 2007)
[Figure: the ENCODE pilot regions (m001-m014 and r111-r334) marked on human chromosomes 1-22, X, and Y]
Goal: Identify all functional elements in the human genome.
Pilot phase: 1% of the genome (30 Mb of sequence) is being annotated very extensively.
Now genome-wide.

The ENCODE Project Consortium (2004)
The ENCODE (ENCyclopedia Of DNA Elements) Project Science, Vol 306,

Gene RNA-seq

Epigenomics

Regulatory Elements chrX:129,143, ,661,948

The human genome
45% repetitive DNA
53% unique and segmental duplicated DNA
2% genes (25,000)
Where are the gene regulatory elements?
G. Crawford

DNase hypersensitive (HS) sites identify active gene regulatory elements
DNase I HS sites are regions hypersensitive to DNase:
  Promoters
  Enhancers
  Silencers
  Insulators
  Locus control regions
  Meiotic recombination hotspots
HS sites identify “open” regions of chromatin.
Crawford et al., Nature Methods 2006

DNase-chip to identify DNase HS sites
or sequence directly. Crawford et al., Nature Methods 2006

Arrays used for DNase-chip
NimbleGen arrays
385, mer oligos
Oligos spaced every 38 bases (12 base overlap)
Non-repetitive unique regions
1% of the genome (44 ENCODE regions)
Crawford et al., Nature Methods 2006

DNase-chip Quality Assessment
Identification of DNase HS sites from 6 cell types.
(a) Representative DNase-chip data from ENCODE region ENr231 (chr1:148,050, ,165,000). Note the common, ubiquitous, and cell-line-specific DNase HS sites.
(b) DNase-chip from K562 cells identifies all five locus control region (LCR) DNase HS sites upstream of the β-globin locus, as well as the 3' DNase HS site (in red). Additional DNase HS sites are identified around the promoter regions of the globin genes.
Xi H., Shulha H.P., Lin J.M., Vales T.R., Fu Y., Bodine D.M., McKay R.D.J, Chenoweth J.G., Tesar P.J., Furey T.S., Ren B., Weng Z.+, Crawford G.E.+ (2007) Identification and characterization of cell type-specific and ubiquitous chromatin regulatory structures in the human genome. PLoS Genetics, 8, 8-20. (+Co-corresponding authors)

DNase HS site (DHS) statistics in 6 cell lines
Cell lines: CD4, GM06990, K562, H9, HeLa, IMR90
# of DNase I chip hits identified: 1262, 1098, 1210, 1274, 1042, 1244
Specificity based on TSS-depleted ENCODE regions: >99.9%, 99.5%, 99.8%, 99.7%
Specificity based on “gold standard” negative DHS (n=134): 97%, 99%, 96%, 95%, 92%
# of proximal DHS: 784, 674, 666, 773, 575, 642
# of distal DHS: 478, 424, 538, 498, 464, 599
# of DHS containing a TSS: 402, 352, 357, 432, 318, 362
# of cell line specific DHS: 439, 341, 448, 400, 277, 463
# of ubiquitous DHS: 262, 266, 273, 259
Rows with six values follow the cell-line order above.

Unique, common, and ubiquitous DNase HS sites
[Figure: DNase HS sites in GM, CD4, HeLa, H9, K562, and IMR90 cells]
Ubiquitous HS sites: 20%
Cell-type specific and common HS sites: 80%
Collectively, the DHS cover 8.3% of the ENCODE regions.

Have we reached saturation in identifying
most DNase HS sites?

CpG content of DNase HS sites

Ubiquitous DNase HS sites are enriched for promoters (TSS)
Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS).
What about ubiquitous distal DNase HS sites?

Most Distal (non-TSS) ubiquitous DNase sites are insulators bound by CTCF
ChIP

Chromatin-immunoprecipitation (ChIP) - chip
Antibody against CTCF; tiling array
Kim T.H. et al. Direct Isolation and Identification of Promoters in the Human Genome. Genome Research (2005)
Direct sequencing → ChIP-seq

The H19/IGF2 Locus is well insulated

DNase HS sites identify insulator in the Hox locus

Cell culture insulator assays demonstrate that DNaseI HS sites (that overlap CTCF) display enhancer blocking activity.

CTCF motif sites are conserved

CTCF sites make up a greater % of ubiquitous
distal DNase HS sites than enhancers


Ubiquitous proximal DNase HS sites

Locations of cell-type specific, common, and ubiquitous DNase HS sites with respect to the Transcription Start Site (TSS)

Antibody against histone modification Tiling array Sequencing

Enrichment between tissue-specific H3K4me2 and DNase HS sites

Cell type-specific DNase HS sites correlate with cell type-specific histone modifications
Similarly for H3K4me1, H3K4me3, H3ac and H4ac, for which we have experimental data.

Cell type-specific DNase HS sites correlate
with cell type-specific enhancers

Cell type-specific DNase HS sites correlate
with cell type-specific gene expression

Use cell-type specific DNase HS sites for discovery

Transcriptional Motifs
Gene transcription is controlled by molecules (transcription factors, or TFs) binding to short DNA sequences (cis-elements, TF motifs) in promoters and distal elements

Finding enriched motifs in tissue-specific DNase HS sites
Screen against a motif library, e.g., JASPAR or TRANSFAC, using the Clover algorithm.
[Figure: DHS sequences (DHS #1 to DHS #5) are screened for library motifs such as STAT, Myc/Max, YY1, etc.]

JASPAR: a database of transcription factor motifs

Clover: Cis-eLement OVERrepresentation
[Figure: the Myc/Max motif scored against the DHS sequences gives a raw score of 17.3]

The Clover Algorithm
Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z (2004). Detection of Functional DNA Motifs Via Statistical Overrepresentation. Nucleic Acids Res. 32:
Symbols used in the Clover raw score:
  L_k: nucleotide at position k
  W: motif width
  S: a promoter sequence
  M_s: number of motif locations in a sequence
  A: all possibilities of choosing a subset of sequences
  N: the total number of promoter sequences

Clover: Cis-eLement OVERrepresentation
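Myc/Max raw score on the DHS sequences: 17.3
Myc/Max raw scores on four sets of control DNA sequences: 4.2, 6.6, 18, 9.1
P-value = 1/4 (one of the four control sets scores at least as high as the DHS sequences)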
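The P-value on this slide can be read as the fraction of control sets whose raw score is at least as large as the raw score of the DHS sequences. A minimal sketch of that comparison, using the five numbers from the slide; the function name `empirical_p_value` is made up for illustration, and the real Clover program computes the raw scores themselves, which is not shown here.

```python
def empirical_p_value(observed, control_scores):
    """Fraction of control scores that are at least as large as the observed score."""
    if not control_scores:
        raise ValueError("need at least one control score")
    hits = sum(1 for score in control_scores if score >= observed)
    return hits / len(control_scores)


# Numbers taken from the slide: one DHS raw score and four control raw scores.
dhs_raw_score = 17.3
control_raw_scores = [4.2, 6.6, 18, 9.1]

p_value = empirical_p_value(dhs_raw_score, control_raw_scores)
print(p_value)  # 0.25, i.e. 1/4
```

With only four control sets the smallest non-zero P-value is 1/4, so in practice many more control sets are needed to resolve small P-values.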

Motifs enriched in cell-type specific DNase HS sites

Genome-wide DNase-chip and DNase-sequencing data
CD4 cells
23 k proximal DNaseI HS sites
72 k distal DNaseI HS sites

Enriched transcription factor binding motifs in distal DNaseI HS sites
Hematopoietic system: TAL1, AML, PU.1, C/EBPα
Immune system: STAT1, STAT3, STAT5, IRF1, IRF3, and IRF5

Identify motif clusters (modules)
[Figure: distal DHS sequences → enriched motifs → find motif clusters in the human genome]

Finding motif clusters with a hidden Markov model
[Figure: score plotted against location in DNA. Red = motif type 1 (e.g. TAL1); blue = motif type 2 (e.g. ETS). The values 0.8, 0.1, and 0.1 label the model diagram.]
Cluster-Buster: MC Frith, MC Li, Z Weng (2003). Cluster-Buster: Finding dense clusters of motifs in DNA sequences. Nucleic Acids Research, 31(13):

Overlap between predicted motif clusters and distal DNase HS sites
\[ \text{Enrichment of the overlap} = \frac{\text{Overlap} \times \text{Sequence space}}{\text{DHS} \times \text{Motif clusters}} \]
[Figure: predicted motif clusters above a score cutoff compared with DNase HS sites]
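The enrichment ratio compares the observed overlap with the overlap expected if DHS and motif clusters were placed independently in the sequence space. A small sketch, assuming all four quantities are measured in base pairs; the function name and the example numbers are hypothetical and are not taken from the talk.

```python
def overlap_enrichment(overlap_bp, sequence_space_bp, dhs_bp, cluster_bp):
    """Observed overlap divided by the overlap expected under independence."""
    expected_overlap = dhs_bp * cluster_bp / sequence_space_bp
    return overlap_bp / expected_overlap


# Hypothetical example: a 1 Mb sequence space, 50 kb of DHS,
# 20 kb of predicted motif clusters, and 4 kb shared by both.
enrichment = overlap_enrichment(
    overlap_bp=4_000,
    sequence_space_bp=1_000_000,
    dhs_bp=50_000,
    cluster_bp=20_000,
)
print(enrichment)  # 4.0: four times more overlap than expected by chance
```

A value near 1 means the predicted clusters overlap DHS no more than chance would suggest.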

Motif clusters can predict distal DNase HS sites genome-wide

Summary
DNase HS sites identified from 6 cell types:
  Cell-type specific
  Common
  Ubiquitous (found in all cell types studied)
Ubiquitous DNase HS sites are likely to function as…
  Promoters (TSS)
  Insulators (CTCF)
  (no enhancers?)
Ubiquitous sites are indicative of housekeeping chromatin structure.
Cell-type specific DNase HS sites:
  Correlate with histone modifications in a cell type-specific manner
  Correlate with gene expression in a cell type-specific manner
  Correlate with enhancer elements in a cell type-specific manner
  Contain cell type-specific motifs
Motif clusters can predict DNase HS sites genome-wide.

Motif Finding Many Slides by Bill Noble @ UW

Outline
What is a sequence motif?
Weight matrix representation
  Motif search
  Motif discovery
    Expectation-maximization
    Gibbs sampling
Patterns-with-mismatches representation

What is a “Motif”?
Generally, a recurring pattern, e.g.
  Sequence motif
  Structure motif
  Network motif
More specifically, a set of similar substrings within a family of diverged sequences.
  Protein sequence motifs
  DNA sequence motifs

Example motif

Motif in Logos Format

Gene 3’-Processing Signals
RNA A simplified representation of the arrangement of control elements (with example sequences) that identify the 3'-processing site in yeast mRNA. JH Graber et al. (2002) Nucleic Acids Research 30(8):

Splice site motif in logo format
weblogo.berkeley.edu

Exonic Splicing Enhancers
These motifs occur within exons and enhance splicing of introns from mRNA. Letter height indicates its frequency at that position. Fairbrother WG et al. (2002) Science 297(5583):

Transcription Factor Binding Sites
[Figure: the estrogen receptor binds an ERE (estrogen response element) in the DNA upstream of the transcription start]
Gene          ERE sequence
Efp           …agggtcatggtgaccct…
TERT          …ttggtcaggctgatctc…
Oxytocin      …gcggtgaccttgacccc…
Lactoferrin   …caggtcaaggcgatctt…
Angiotensin   …tagggcatcgtgacccg…
VEGF          …ataatcagactgactgg…

Weight matrix
Probabilistic model: How likely is each letter at each motif position?
[Table: a weight matrix with rows A, C, G, T and columns for motif positions 1-9; each entry is the probability of that letter at that position]
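One way to obtain such a matrix is to count letters column by column in a set of aligned binding sites. The sketch below does this for the six ERE sequences listed on the transcription factor binding site slide; the helper name `probability_matrix` is an assumption for illustration, and no pseudocounts are used here, so unseen letters get probability 0.

```python
from collections import Counter

# The six aligned ERE sequences from the transcription factor binding site slide.
ERE_SITES = {
    "Efp":         "agggtcatggtgaccct",
    "TERT":        "ttggtcaggctgatctc",
    "Oxytocin":    "gcggtgaccttgacccc",
    "Lactoferrin": "caggtcaaggcgatctt",
    "Angiotensin": "tagggcatcgtgacccg",
    "VEGF":        "ataatcagactgactgg",
}


def probability_matrix(sites):
    """Return one {letter: probability} dict per alignment column."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for position in range(width):
        counts = Counter(site[position] for site in sites)
        matrix.append({letter: counts[letter] / len(sites) for letter in "acgt"})
    return matrix


matrix = probability_matrix(list(ERE_SITES.values()))
for position, column in enumerate(matrix, start=1):
    print(position, column)
```

Columns where one letter dominates (for example the third column, where five of the six sites have g) are the positions that carry most of the motif signal.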

A. K. A.
Weight matrices are also known as:
  Position-specific scoring matrices
  Position-specific probability matrices
  Position-specific weight matrices

Scoring a motif model
A motif is interesting if it is very different from the background distribution.
[Figure: the 9-position weight matrix (rows A, C, G, T), with columns labeled from “less interesting” to “more interesting”]

Relative entropy
A motif is interesting if it is very different from the background distribution.
Use relative entropy*:
\[ \sum_{i=1}^{W} \sum_{\beta \in \{A,C,G,T\}} p_{i,\beta} \log_2 \frac{p_{i,\beta}}{b_\beta} \]
p_{i,β} = probability of β in matrix position i
b_β = background frequency (in non-motif sequence)
W = motif width
* Relative entropy is sometimes called information content.
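A short sketch of the relative entropy calculation, assuming the matrix is stored as one dictionary of letter probabilities per position and that terms with probability 0 contribute 0 (the usual convention). The two-column matrix is a made-up example.

```python
import math


def relative_entropy(matrix, background):
    """Sum over positions and letters of p * log2(p / b), in bits."""
    total = 0.0
    for column in matrix:
        for letter, p in column.items():
            if p > 0:  # 0 * log(0) is treated as 0
                total += p * math.log2(p / background[letter])
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Made-up matrix: position 1 is always A, position 2 looks like the background.
toy_matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

print(relative_entropy(toy_matrix, uniform))  # 2.0 bits, all from position 1
```

The fully conserved column contributes 2 bits and the background-like column contributes nothing, which matches the "more interesting" versus "less interesting" distinction above.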

Scoring motif instances
A motif instance matches if it looks like it was generated by the weight matrix.
[Figure: the 9-position weight matrix (rows A, C, G, T) compared with the candidate instance “ACGGCGCCT”; positions are labeled “Matches weight matrix”, “Hard to tell”, or “Not likely!”]

Log likelihood ratio
A motif instance matches if it looks like it was generated by the weight matrix.
Use the log likelihood ratio:
\[ \sum_{i=1}^{W} \log_2 \frac{p_{i,\beta_i}}{b_{\beta_i}} \]
β_i: the character at position i of the instance
It measures how much more the instance looks like the weight matrix than like the background.
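The log likelihood ratio of one instance is a sum of per-position terms, which the following sketch spells out. The matrix layout (one dictionary per position) and the toy numbers are assumptions for illustration; a probability of exactly 0 would make the logarithm undefined, which is one reason pseudocounts are added when a matrix is estimated from few sites.

```python
import math


def log_likelihood_ratio(instance, matrix, background):
    """Sum of log2(p[i][letter] / b[letter]) over the positions of the instance."""
    if len(instance) != len(matrix):
        raise ValueError("instance and matrix must have the same width")
    return sum(
        math.log2(matrix[i][letter] / background[letter])
        for i, letter in enumerate(instance)
    )


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Made-up two-position matrix that prefers A at position 1 and T at position 2.
toy_matrix = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]

print(log_likelihood_ratio("AT", toy_matrix, uniform))  # 2.0: looks like the motif
print(log_likelihood_ratio("GA", toy_matrix, uniform))  # -2.0: looks like background
```

Positive scores mean the instance is better explained by the weight matrix, negative scores that it is better explained by the background.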

Position-specific scoring matrix
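[Table: a PSSM with one row per amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) and one column per motif position]
This PSSM assigns the sequence NMFWAFGH a total score of 12 (the sum of the per-position scores).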

Significance of scores
[Figure: the sequence LENENQGKCTIAEYKYDGKKASVYNSFVS is passed to a motif scanning algorithm, which returns the score 45]
Low score = not a motif
High score = motif occurrence
How high is high enough?

Computing a p-value
Compute the scores for all possible sequences whose length matches the motif.
Use these scores to compute a p-value.
The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value.
p-value = Pr(data|null)
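For a short motif the score distribution can be obtained by brute force: enumerate every sequence of the motif's width, score it, and add up the background probability of the sequences that reach the threshold. The sketch below assumes a log likelihood ratio score, independent background letters, and a threshold test of "at least" rather than "greater than"; the matrix is a made-up two-position example, and enumeration is only practical for small widths.

```python
import itertools
import math


def score_p_value(matrix, background, threshold):
    """Probability, under the background, that a random sequence scores >= threshold."""
    letters = sorted(background)
    p_value = 0.0
    for sequence in itertools.product(letters, repeat=len(matrix)):
        score = sum(
            math.log2(matrix[i][letter] / background[letter])
            for i, letter in enumerate(sequence)
        )
        if score >= threshold:
            p_value += math.prod(background[letter] for letter in sequence)
    return p_value


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_matrix = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]

print(score_p_value(toy_matrix, uniform, threshold=2))  # 0.0625: only "AT" scores 2
print(score_p_value(toy_matrix, uniform, threshold=1))  # 0.1875: "AT", "AG", "CT"
```

For realistic motif widths the same distribution is usually built position by position instead of by full enumeration, since the number of sequences grows as 4 to the power of the width.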

Motif discovery problem
Given sequences: seq. 1, seq. 2, seq. 3
Find motif:
  IGRGGFGEVY at position 515
  LGEGCFGQVV at position 430
  VGSGGFGQVY at position 682

Given: a sequence or family of sequences.
Find:
  the number of motifs
  the width of each motif
  the locations of motif occurrences

Why is this hard?
Input sequences are long (thousands or millions of residues).
Motif may be subtle:
  Instances are short.
  Instances are only slightly similar.

Globin motifs

        xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU    V.LSPADKTN..VKAAWGKVG.AHAGE YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR    M.LTDAEKKE..VTALWGKAA.GHGEE YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK    V.LSAADKTN..VKGVFSKIG.GHAEE YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU    VHLTPEEKSA..VTALWGKVN.VDEVG G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR    VHLSGGEKSA..VTNLWGKVN.INELG G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK    VHWTAEEKQL..ITGLWGKVNvAD.CG A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU    G.LSDGEWQL..VLNVWGKVE.ADIPG HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR    G.LSDGEWQL..VLKVWGKVE.GDLPG HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB   M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI  A.LTEKQEAL..LKQSWEVLK.QNIPA HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL    GVLTDVQVAL..VKSSFEEFN.ANIPK N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB   M.L.DQQTIN..IIKATVPVLkEHGVT ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

        xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU    QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR    QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK    QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU    AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR    AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK    AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU    EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR    EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB   T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI  NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL    NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB   Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

        xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU    VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR    VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK    VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU    VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR    IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK    IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU    QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR    HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB   TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI  ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL    AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB   EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Alternating approach
1. Guess an initial weight matrix.
2. Use the weight matrix to predict instances in the input sequences.
3. Use the instances to predict a weight matrix.
4. Repeat 2 & 3 until satisfied.
Examples:
  Gibbs Sampler (Lawrence et al.)
  MEME (expectation maximization / Bailey, Elkan)
  ANN-Spec (neural network / Workman, Stormo)

Three Ingredients of Almost any Bioinformatics Method
Search space
Scoring scheme
Search algorithm (= optimization technique)
Mathematically precise formulation of the problem
Strictly speaking, Gibbs sampling and expectation-maximization are search algorithms. They are not specific to motif discovery; indeed they were first used in other contexts.

Expectation-Maximization
Guarantees finding a local optimum.
Widely used in bioinformatics:
  The Baum-Welch algorithm for training HMMs is an example.
  So is K-means clustering (e.g. used to analyze microarray data).

Expectation-maximization (EM)
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score

Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences (upper case)
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc

Starting point
…gactgttttTTTGATCGTTTTCACaaaaatgg…
[Figure: the subsequence TTTGATCGTTTTCAC is converted to an initial matrix with rows A, C, G, T]

Re-estimating motif occurrences
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
[Figure: the current matrix (rows A, C, G, T) is aligned against a window of the sequence]
Score =

Scoring each subsequence
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Subsequences (each receives a score):
  TGTGCTGGTTTTTGT
  GTGCTGGTTTTTGTG
  TGCTGGTTTTTGTGG
  GCTGGTTTTTGTGGC
  ...
Select from each sequence the subsequence with maximal score.

Re-estimating motif matrix
Occurrences:
  TTTGATCGTTTTCAC
  TTTGCACGGCGTCAC
  TGTGAGCATGGTCAT
  TGCAAAGGACGTCAC
[Table: counts of A, C, G, T at each of the 15 motif positions]

Adding pseudocounts
Counts → Counts + Pseudocounts
Row A of the counts + pseudocounts table (a pseudocount of 1 per cell): 1 1 1 2 4 3 1 2 2 1 1 1 1 5 1

Converting to frequencies
[Figure: the counts + pseudocounts table (rows A, C, G, T) is normalized column by column into a frequency matrix]
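The counting, pseudocount, and normalization steps can be checked on the four occurrences from the "Re-estimating motif matrix" slide. The sketch assumes a pseudocount of 1 per cell, which reproduces the A row 111243122111151 shown on the pseudocount slide; the function names are made up for illustration.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]


def counts_with_pseudocounts(occurrences, pseudocount=1):
    """Per-letter lists of column counts, each cell started at the pseudocount."""
    width = len(occurrences[0])
    counts = {letter: [pseudocount] * width for letter in "ACGT"}
    for occurrence in occurrences:
        for position, letter in enumerate(occurrence):
            counts[letter][position] += 1
    return counts


def to_frequencies(counts):
    """Divide every cell by its column total so that each column sums to 1."""
    width = len(counts["A"])
    totals = [sum(counts[letter][i] for letter in "ACGT") for i in range(width)]
    return {
        letter: [counts[letter][i] / totals[i] for i in range(width)]
        for letter in "ACGT"
    }


counts = counts_with_pseudocounts(OCCURRENCES)
frequencies = to_frequencies(counts)
print(counts["A"])           # [1, 1, 1, 2, 4, 3, 1, 2, 2, 1, 1, 1, 1, 5, 1]
print(frequencies["A"][13])  # 0.625: A was seen in all four occurrences at position 14
```

Because of the pseudocounts, no letter gets frequency 0, so every subsequence keeps a finite log likelihood ratio in the next iteration.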

Expectation-maximization
foreach subsequence of width W
  convert subsequence to a matrix
  do {
    re-estimate motif occurrences from matrix
    re-estimate matrix model from motif occurrences
  } until (matrix model stops changing)
end
select matrix with highest score
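The pseudocode can be turned into a small runnable sketch. This version takes the single best-scoring window per sequence at each step (the "max" variant described so far, not the probabilistic version used by MEME), assumes a uniform background and a pseudocount of 1, and stops when the chosen occurrences stop changing or after a fixed number of iterations. All names and the four toy sequences are hypothetical.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumed uniform background frequency


def profile(occurrences, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length occurrences."""
    columns = []
    for i in range(len(occurrences[0])):
        counts = {letter: pseudocount for letter in ALPHABET}
        for occurrence in occurrences:
            counts[occurrence[i]] += 1
        total = sum(counts.values())
        columns.append({letter: c / total for letter, c in counts.items()})
    return columns


def best_window(sequence, matrix):
    """Return (score, window) for the highest-scoring window of the sequence."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(matrix[i][ch] / BACKGROUND) for i, ch in enumerate(window))
        if best is None or score > best[0]:
            best = (score, window)
    return best


def hard_em(sequences, width, max_iterations=50):
    """Try every subsequence as a starting point; keep the best converged motif."""
    best_score, best_occurrences = float("-inf"), None
    for seed_sequence in sequences:
        for start in range(len(seed_sequence) - width + 1):
            matrix = profile([seed_sequence[start:start + width]])
            occurrences = None
            for _ in range(max_iterations):
                new_occurrences = [best_window(s, matrix)[1] for s in sequences]
                if new_occurrences == occurrences:  # model stopped changing
                    break
                occurrences = new_occurrences
                matrix = profile(occurrences)
            score = sum(best_window(s, matrix)[0] for s in sequences)
            if score > best_score:
                best_score, best_occurrences = score, occurrences
    return best_score, best_occurrences


# Toy input: the 6-mer ACGTAC is planted in each sequence with different flanks.
toy_sequences = ["TTACGTACGG", "CAACGTACTT", "GGACGTACCA", "ACACGTACAG"]
score, occurrences = hard_em(toy_sequences, width=6)
print(occurrences)  # ['ACGTAC', 'ACGTAC', 'ACGTAC', 'ACGTAC']
```

On real data the planted motif is rarely this clean, which is where the limitations of taking the maximum at every step start to matter.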

Problem: This procedure doesn't allow the motifs to move around very much. Taking the max is too brittle. Solution: Associate with each start site a probability of motif occurrence.

Converting to probabilities
Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Occurrences (each receives a score and a probability; the probabilities sum to the total):
  TGTGCTGGTTTTTGT
  GTGCTGGTTTTTGTG
  TGCTGGTTTTTGTGG
  GCTGGTTTTTGTGGC
  Total

Computing weighted counts
Occurrences        Prob
TGTGCTGGTTTTTGT    0.023
GTGCTGGTTTTTGTG    0.037
TGCTGGTTTTTGTGG    0.018
GCTGGTTTTTGTGGC    ...
[Table: weighted counts with rows A, C, G, T and columns 1, 2, 3, 4, 5, …]
Include counts from all subsequences, weighted by the degree to which they match the motif model.
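A sketch of the two steps on these slides: turn the score of every window in a sequence into a probability, then accumulate letter counts weighted by those probabilities. It assumes the probability of a window is its likelihood ratio divided by the sum over all windows of the same sequence (one occurrence per sequence, equal prior for every start). The matrix, sequence, and function names are toy assumptions, not the values from the slide.

```python
ALPHABET = "ACGT"


def occurrence_probabilities(sequence, matrix, background):
    """Likelihood ratio of each window, normalized to sum to 1 over the sequence."""
    width = len(matrix)
    ratios = []
    for start in range(len(sequence) - width + 1):
        ratio = 1.0
        for i, letter in enumerate(sequence[start:start + width]):
            ratio *= matrix[i][letter] / background[letter]
        ratios.append(ratio)
    total = sum(ratios)
    return [ratio / total for ratio in ratios]


def weighted_counts(sequence, probabilities, width):
    """Add each window's letters to the count table, weighted by its probability."""
    counts = [{letter: 0.0 for letter in ALPHABET} for _ in range(width)]
    for start, probability in enumerate(probabilities):
        for i, letter in enumerate(sequence[start:start + width]):
            counts[i][letter] += probability
    return counts


def column(favoured):
    """Toy matrix column: 0.7 for the favoured letter, 0.1 for the others."""
    return {letter: (0.7 if letter == favoured else 0.1) for letter in ALPHABET}


background = {letter: 0.25 for letter in ALPHABET}
toy_matrix = [column("A"), column("C"), column("G")]
toy_sequence = "TTACGTT"

probs = occurrence_probabilities(toy_sequence, toy_matrix, background)
counts = weighted_counts(toy_sequence, probs, len(toy_matrix))
print([round(p, 3) for p in probs])  # the window starting at index 2 (ACG) dominates
print(counts[0])                     # weighted counts for motif position 1
```

Unlike the max-based step, every window contributes a little, so the matrix can drift toward a better alignment over several iterations.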

Problem: How do we estimate counts accurately when we have only a few examples?
Solution: Use Dirichlet mixture priors.

Problem: Too many possible starting points.
Solution: Save time by running only 1 iteration of EM at first.

Problem: Too many possible widths.
Solution: Consider widths that vary by 2 and adjust motifs afterwards.

Problem: Algorithm assumes exactly one motif occurrence per sequence.
Solution: Normalize motif occurrence probabilities across all sequences, using a user-specified parameter.

Problem: The EM algorithm finds only one motif.
Solution: Probabilistically erase the motif from the data set, and repeat.

Problem: The motif model is too simplistic.
Solution: Use a two-component mixture model that captures the background distribution. Allow the background model to be more complex, e.g. a Markov model.

Problem: The EM algorithm does not tell you how many motifs there are.
Solution: Compute statistical significance of motifs and stop when they are no longer significant.

MEME algorithm
do
  for (width = min; width < max; width *= 2)
    foreach possible starting point
      run 1 iteration of EM
    select candidate starting points
    foreach candidate
      run EM to convergence
  select best motif
  erase motif occurrences
until (E-value of found motif > threshold)

Gibbs Sampling
A type of Markov chain Monte Carlo (MCMC) method

Maximization Versus Sampling
We are given some huge search space.
Every point Z in the search space has some score S_Z, defined as before.
Sampling: wander around the search space in such a way that how often we visit each point is proportional to π_Z = exp(S_Z).
Maximization: find the point with the highest π_Z, a likelihood ratio value between 0 and +∞.
EM does maximization and MCMC does sampling.
MCMC attempts to escape local optima.

Gibbs Sampling
Use a Markov chain to wander around the search space. If we are at point X, move to point Y with probability M_XY.
Suppose the search space is a 2D rectangle. (Typically, many dimensions!)
Start at a random point X.
1. Randomly pick a dimension.
2. Look at all points along this dimension. Move to one of them randomly, proportional to its score π.
Repeat.

Initialization
Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.
  sequence 1: ACAGTGT
  sequence 2: TTAGACC
  sequence 3: GTGACCA
  sequence 4: ACCCAGG
  sequence 5: CAGGTTT

Gibbs sampler
1. Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.
Steps 2 & 3 (search):
2. Throw away an instance s_i; the remaining (t - 1) instances define a weight matrix.
3. The weight matrix defines an instance probability at each position of input string S_i. Pick a new s_i according to this probability distribution.
4. Return the highest-scoring motif seen.
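The steps above can be sketched as a short site sampler. This version makes several assumptions that the slide leaves open: it holds out the sequences in turn rather than choosing one at random, uses a uniform background and a pseudocount of 1, scores a set of instances by the relative entropy of its matrix, and runs for a fixed number of sweeps. The names and toy sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumed uniform background frequency


def profile(instances, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length instances."""
    columns = []
    for i in range(len(instances[0])):
        counts = {letter: pseudocount for letter in ALPHABET}
        for instance in instances:
            counts[instance[i]] += 1
        total = sum(counts.values())
        columns.append({letter: c / total for letter, c in counts.items()})
    return columns


def relative_entropy(matrix):
    return sum(p * math.log2(p / BACKGROUND) for col in matrix for p in col.values())


def gibbs_sampler(sequences, width, sweeps=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial instance (start position) in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_score, best_starts = float("-inf"), list(starts)
    for _ in range(sweeps):
        for held_out, sequence in enumerate(sequences):
            # Step 2: drop one instance; the others define the weight matrix.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(sequences, starts))
                      if j != held_out]
            matrix = profile(others)
            # Step 3: sample a new start in proportion to its likelihood ratio.
            weights = []
            for start in range(len(sequence) - width + 1):
                weight = 1.0
                for i, letter in enumerate(sequence[start:start + width]):
                    weight *= matrix[i][letter] / BACKGROUND
                weights.append(weight)
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 4: remember the highest-scoring set of instances seen so far.
        instances = [s[p:p + width] for s, p in zip(sequences, starts)]
        score = relative_entropy(profile(instances))
        if score > best_score:
            best_score, best_starts = score, list(starts)
    return best_score, best_starts


toy_sequences = ["TTACGTACGG", "CAACGTACTT", "GGACGTACCA", "ACACGTACAG"]
score, starts = gibbs_sampler(toy_sequences, width=6)
print(score, starts)
```

Because each new start is sampled rather than maximized, different seeds can give different runs; the best set of instances seen is what gets reported.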

Sampler step illustration:
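Current instances: ACAGTGT, TAGGCGT, ACACCGT, ???????, CAGGTTT (the instance from sequence 4 has been thrown away)
[Figure: weight matrix (rows A, C, G, T) built from the remaining four instances; sample entries .45, .05, .25, .65, .85]
Candidate instances in sequence 4: 11% (instance not legible), ACGCCGT: 20%, ACGGCGT: 52%
New instances: ACAGTGT, TAGGCGT, ACACCGT, ACGCCGT, CAGGTTT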

Comparison
Both EM and Gibbs sampling involve iterating over two steps.
Convergence:
  EM converges when the PSSM stops changing.
  Gibbs sampling runs until you ask it to stop.
Solution:
  EM may not find the motif with the highest score.
  Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Comparison of motif finders

Summary
Motifs are represented by weight matrices.
Motif quality is measured by relative entropy.
Motif occurrences are scored using log likelihood ratios.
EM and the Gibbs sampler attempt to find a motif with maximal relative entropy.
Both algorithms alternate between predicting instances and predicting the weight matrix.

Homework
Go to the UCSC genome browser to get the top 100 regions bound by CTCF.
Use MEME to find the binding motif of CTCF.

(l,d)-k Problem: Given a sample, find all patterns of length l such that there are k occurrences of the pattern with up to d mismatches in each.

Two motif representations
Matrices are richer: complete emission distributions, and position-specific.
Matrix scores may correspond to binding energies.
Patterns are more tractable and allow for performance guarantees.
Pattern + data → matrix.

Pattern-driven approach
Consider all 4^l patterns of length l.
Compare each pattern to every l-mer in the data.
Return all (l,d)-k patterns.
Running time: O(4^l n), where n is the total length of input sequences.
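A direct, brute-force sketch of the pattern-driven approach: enumerate every pattern of length l and count the l-mers in the data that lie within d mismatches. It reads "k occurrences" as "at least k", which is an assumption, and it is only usable for small l because of the 4^l enumeration. The function names and the toy input are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pattern_driven(sequences, l, d, k):
    """Return every length-l pattern with at least k l-mers within d mismatches."""
    lmers = [s[i:i + l] for s in sequences for i in range(len(s) - l + 1)]
    found = []
    for letters in product("ACGT", repeat=l):  # all 4^l patterns
        pattern = "".join(letters)
        occurrences = sum(1 for word in lmers if hamming(pattern, word) <= d)
        if occurrences >= k:
            found.append(pattern)
    return found


toy_sequences = ["ACGT", "ACGA", "TTTT"]
print(pattern_driven(toy_sequences, l=4, d=1, k=2))
# ['ACGA', 'ACGC', 'ACGG', 'ACGT']
```

Note that ACGC and ACGG are reported even though neither occurs in the data: a pattern only needs k instances within d mismatches.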

Sample-driven approach
Only consider l-mers that occur in the data, plus their neighbors.
Efficient, but requires a hash table of size 4^l.
Suggested by Waterman [1984]; improved upon by Sagot [1998] and Pavesi [2001].

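WINNOWER Condition
Pevzner and Sze, 2001.
Each instance differs from the pattern in at most d positions, so any two instances of the same pattern differ from each other in at most 2d positions.
[Figure: instance, pattern, instance (8-mers such as AGCTACAA), with distance d from the pattern to each instance and 2d between the two instances]
[Figure: with three instances of one pattern, every pair of instances is within distance 2d]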

Graph constructed by WINNOWER
For a (15,4)-signal, we connect all words with distance at most 8:
atgaccgggatactgatAgAAgAAAGGttGGGtataatggagtacgataa
atgacttcAAtAAAAcGGcGGGtgctctcccgattttgagtatccctggg
gcaatcgcgaaccaagctgagaattggatgtcAAAAtAAtGGaGtGGcac
gtcaatcgaaaaaacggtggaggatttcAAAAAAAGGGattGgaccgctt
[Figure legend: real signals, signal edges, spurious signals, spurious edges]
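A sketch of the graph construction: every l-mer in the sample is a vertex, and two vertices are connected when their Hamming distance is at most 2d. It assumes edges are only drawn between l-mers from different sequences, and it compares all pairs of l-mers, which is quadratic in the input size. Names and toy data are hypothetical; the pruning that WINNOWER applies afterwards is not shown.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def winnower_graph(sequences, l, d):
    """Vertices are (sequence index, position); edges join l-mers within 2d mismatches."""
    words = [
        (seq_index, pos, seq[pos:pos + l])
        for seq_index, seq in enumerate(sequences)
        for pos in range(len(seq) - l + 1)
    ]
    vertices = [(seq_index, pos) for seq_index, pos, _ in words]
    edges = []
    for (si, pi, wi), (sj, pj, wj) in combinations(words, 2):
        if si != sj and hamming(wi, wj) <= 2 * d:
            edges.append(((si, pi), (sj, pj)))
    return vertices, edges


toy_sequences = ["AAAA", "AAAT", "CCCC"]
vertices, edges = winnower_graph(toy_sequences, l=4, d=1)
print(vertices)  # [(0, 0), (1, 0), (2, 0)]
print(edges)     # [((0, 0), (1, 0))]
```

Instances of one pattern form a clique in this graph with one vertex per sequence, which is what the pruning steps on the following slides look for.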

Pattern Graph Pruning (k=4)
[Figure sequence: graphs whose vertices are 8-mer instances; edges and vertices are pruned step by step]
Pruning preserves cliques.
Empty graph = no valid patterns.
Otherwise, pruning stops when no more edges can be pruned.

MITRA Framework (MIsmatch TRee Approach)
Eskin and Pevzner, 2002.
Split the pattern space into 4 easier subproblems. This removes many edges.
Perform WINNOWER in each pattern space.
Example: the space of patterns of length 12
  ? ? ? ? ? ? ? ? ? ? ? ?
can be split into 4 pattern subspaces:
  A ? ? ? ? ? ? ? ? ? ? ?
  C ? ? ? ? ? ? ? ? ? ? ?
  G ? ? ? ? ? ? ? ? ? ? ?
  T ? ? ? ? ? ? ? ? ? ? ?

MITRA-Graph
WINNOWER-style graph in each subspace.
For an edge to exist, three conditions must hold.
[Figure: instances m1 = AGCTACAT and m2 = ACGAATAC compared with the pattern subspace t = AGA?????]

MITRA-Graph Algorithm
1. Build the initial pattern graph.
2. Prune edges.
3. Is the pattern space empty? If yes, done.
4. If no, is the pattern complete? If yes, output the pattern.
5. If no, split the pattern space (A, C, G, T) and continue with each new pattern space.

Mismatch Tree Approach (MITRA-Count, MITRA-Graph)
Hard problem. Worst case takes exponential time.
MITRA benefits:
  Guaranteed to find all patterns.
  Prunes uninteresting portions of the search space.
  Easily parallelizable.
  Detects long patterns.
  Minimal memory requirements for MITRA-Count.
Drawbacks:
  Computational overhead of the data structure.
  MITRA-Graph needs to store edges (memory).
