CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery.


Challenges in Computational Biology
- Genome assembly
- Gene finding
- Regulatory motif discovery
- Database lookup
- Gene expression analysis
- Sequence alignment
- Evolutionary theory
- Cluster discovery
- Gibbs sampling
- Protein network analysis
- Emerging network properties
- Regulatory network inference
- Comparative genomics
- RNA folding
Diagram labels: DNA, RNA transcript. Example motif instances shown in the diagram: TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT.

Slide figure: a wall of raw genomic DNA sequence, shown first unannotated and then with four classes of element highlighted: promoter motifs, 3' UTR motifs, exons, and introns.

Comparing genomes reveals functional elements
- Ultra-conserved elements
- Protein-coding genes
- Short regulatory motifs

Regulatory Motif Discovery
Figure labels: GAL1 gene (ATGACTAAATCTCATTCAGAAGAAGTGA), upstream site sequences CCCCWCGGCCG and CGGCCG, binding factors Gal4 and Mig1.
Gene regulation
- Genes are turned on / off in response to changing environments
- Gene regulatory logic is controlled by sequence motifs
- Specialized proteins (transcription factors) recognize motifs
What makes motif discovery hard?
- Motifs are short (6-8 bp) and usually degenerate
- They act at variable distances upstream (or downstream) of the target gene

Regulatory Motif Discovery
- Study known motifs
- Derive conservation rules
- Discover novel motifs

Known motifs are preferentially conserved
Alignment of a promoter region in human, dog, mouse, and rat, with the Gabpa and Errα binding sites highlighted on the slide:

human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT TTTTTTTTTGCCGTTCAAG-AG

Is this enough to discover motifs? No.
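Counting how often a motif is conserved in an alignment like the one above can be sketched as follows. This is a minimal illustration, not the lecture's procedure: it assumes a site counts as conserved when the motif matches in the same ungapped alignment columns in every species, and the function names and the toy alignment are made up for the example. The motif TGACCTY is the ERRA pattern from the summary table later in the deck.

```python
# IUPAC codes used here, mapped to the bases they allow.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
         "K": "GT", "M": "AC", "N": "ACGT"}


def matches(motif, window):
    """True if every base in the window is allowed by the motif code at that position."""
    return len(window) == len(motif) and all(
        base in CODES[code] for code, base in zip(motif, window))


def count_conserved(motif, alignment, reference="human"):
    """Return (total, conserved) motif occurrences in the reference row.

    alignment maps species name -> aligned row; all rows have equal length and
    use '-' for gaps. A gap never matches, so gapped windows are not conserved.
    """
    k = len(motif)
    ref = alignment[reference]
    total = conserved = 0
    for i in range(len(ref) - k + 1):
        if matches(motif, ref[i:i + k]):
            total += 1
            if all(matches(motif, row[i:i + k]) for row in alignment.values()):
                conserved += 1
    return total, conserved


# Toy alignment (hypothetical, not taken from the slide).
toy = {
    "human": "CCTGACCTTGGGTTGACCTCA",
    "dog":   "GCTGACCTTGGGCTGACATCA",
    "mouse": "CGTGACCTTGGGCTGACCTCA",
    "rat":   "CCTGACCTTGGGTTGACCTCA",
}
total, conserved = count_conserved("TGACCTY", toy)
```

Here the second human occurrence is not counted as conserved because the dog row carries a substitution inside the site.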

Known motifs are frequently conserved
Across the human promoter regions, the Errα motif:
- appears 434 times
- is conserved 162 times
Species compared: human, dog, mouse, rat
Errα conservation rate: 37%
Compare to random control motifs
- Conservation rate of control motifs: 6.8%
- Errα enrichment: 5.4-fold
- Errα significance: 25 standard deviations under the binomial
Resulting measure: Motif Conservation Score (MCS)
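The Errα numbers can be checked with a normal approximation to the binomial. This is a sketch under the assumption that the "standard deviations under binomial" figure is a z-score of the conserved count against the control conservation rate; the function name is illustrative and does not come from the lecture.

```python
import math


def conservation_z_score(conserved, total, basal_rate):
    """Z-score of the conserved count under Binomial(total, basal_rate)."""
    expected = total * basal_rate
    std_dev = math.sqrt(total * basal_rate * (1.0 - basal_rate))
    return (conserved - expected) / std_dev


# Numbers from the slide: 162 of 434 Errα sites conserved, 6.8% control rate.
rate = 162 / 434                                  # observed conservation rate
enrichment = rate / 0.068                         # fold enrichment over controls
z = conservation_z_score(162, 434, 0.068)         # standard deviations above expectation
print(f"rate={rate:.3f} enrichment={enrichment:.2f} z={z:.1f}")
```

The z-score lands at about 25, which agrees with the figure quoted on the slide.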

MCS distribution of all 6-mers shows excess conservation
- High-scoring patterns include known motifs
- Excess is specific to promoters and 3' UTRs (not introns)
- For MCS > 6, estimated specificity is 97%
Plot: motif density versus Motif Conservation Score (MCS).
Select motifs with MCS > 6.0, then cluster.

Hill-climbing in sequence space
Seed selection
- Three mini-motif conservation criteria (CC1, CC2, CC3)
Motif extension
- Non-random conservation of neighbors
Motif collapsing
- Merge neighbors using hierarchical clustering, avg-max-linkage
Re-scoring complex motifs
- Motif conservation score for full motifs (MCS)

Test 1: Intergenic conservation
Plot axes: total count, conserved count. Highlighted mini-motif: CGG-11-CCG.

Test 1: Selecting mini-motifs
Estimate basal rate of conservation
- Expected conservation rate at the evolutionary distances observed
- Average conservation rate of non-outlier mini-motifs
Score conservation of mini-motif
- k: conserved motif occurrences
- n: total motif occurrences
- r: basal conservation rate
- Evaluate binomial probability of observing k successes out of n trials
Assign z-score to each mini-motif
- Bulk of distribution is symmetric
- Estimate specificity as (R-L)/R
- Select cutoff: 5.0 sigma
- 1190 mini-motifs, 97.5% non-random
Figure labels: conservation rate, r, N, binomial score, right tail, left tail, specificity, cutoff.
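The specificity estimate (R-L)/R can be sketched as below. The reading assumed here is that R counts mini-motifs with z-score at or above the cutoff, L counts those at or below the negated cutoff, and that the symmetric bulk makes L an estimate of the false positives hiding in R. The function name and toy scores are hypothetical.

```python
def tail_specificity(z_scores, cutoff):
    """Estimate specificity as (R - L) / R for a symmetric null distribution.

    R: number of scores in the right tail (z >= cutoff)
    L: number of scores in the left tail (z <= -cutoff), used as an estimate
       of how many right-tail scores arise by chance.
    Returns None when the right tail is empty.
    """
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return None
    return (right - left) / right


# Toy z-scores (made up): four in the right tail, one in the left tail.
toy_scores = [6.1, 5.5, 7.0, 5.2, -5.3, 0.1, -1.2, 2.0]
spec = tail_specificity(toy_scores, cutoff=5.0)
```

Raising the cutoff trades the number of selected mini-motifs against the estimated specificity.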

Test 2: Intergenic vs. Coding
Plot axes: coding conservation, intergenic conservation. Highlighted mini-motif: CGG-11-CCG. Annotated region: higher conservation in genes.

Test 3: Upstream vs. Downstream
Plot axes: downstream conservation, upstream conservation. Highlighted mini-motif: CGG-11-CCG. Annotations: "Downstream motifs?", "Most patterns".

Constructing full motifs
Pipeline: Test 1, Test 2, Test 3 → 2,000 mini-motifs → Extend → Collapse → Merge → 72 full motifs
Sequence fragments shown in the slide example: CTA, CGA, CTGRC, CGAA, ACCTGCGAACTGRCCGAACTRAY.

Extending mini-motifs
Separate conserved and non-conserved instances
- Causal set and random set (slide example strings: CTACGA 6, CTxxGA 6, CTACGARGW, CTxxGAYHS)
Find maximally discriminating neighborhood
- Counts N1, N2, M1, M2
Evaluate non-randomness of neighborhood
- chi-square contingency test on [N1,M1], [N2,M2]
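The chi-square contingency test on [N1,M1], [N2,M2] can be sketched with the standard 2x2 Pearson statistic. The slide does not spell out what each count means, so the interpretation in the comments (rows are the two instance sets, columns are "has the neighboring base" versus "does not") is an assumption, as is the function name.

```python
def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic (1 degree of freedom) for [[n1, m1], [n2, m2]].

    Assumed layout: row 1 = one instance set, row 2 = the other;
    column 1 = instances with the candidate neighbor, column 2 = without.
    Returns 0.0 if any row or column sum is zero (no test possible).
    """
    total = n1 + m1 + n2 + m2
    row1, row2 = n1 + m1, n2 + m2
    col1, col2 = n1 + n2, m1 + m2
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return total * (n1 * m2 - m1 * n2) ** 2 / (row1 * row2 * col1 * col2)


# Toy counts (made up): the neighbor is seen in 30 of 40 instances of one set
# but only 10 of 40 instances of the other.
stat = chi_square_2x2(30, 10, 10, 30)
```

A statistic far above 3.84 (the 5% critical value for one degree of freedom) suggests the neighboring position is not distributed the same way in the two sets.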

Systematically test candidate patterns
Pipeline: all potential motifs → evaluate MCS → cluster similar motifs
Result: 174 motifs in promoters, 106 motifs in 3' UTRs
Slide example pattern elements: GTC, AGT, R, R, Y, gap, S, W
Enumerate
- Length between 6 and 15 nt, allow central gap
- 11-letter alphabet (A C G T, 2-fold codes, N)
Score
- Compute binomial score (conserved vs. total)
- Select MCS > 6.0, giving 97% specificity
Cluster
- Sequence similarity
- Overlapping occurrences
Are these real?
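The 11-letter alphabet and the central gap can be made concrete with a small pattern matcher. This is a sketch only: it assumes the "2-fold codes" are the six standard IUPAC two-base codes (R, Y, S, W, K, M) and that a gapped pattern is written as in the earlier slides, for example CGG-11-CCG. The helper names are invented for the example.

```python
import re

# Four bases, six two-fold IUPAC codes, and N: eleven letters in total.
TWO_FOLD = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
ALPHABET = {**{b: b for b in "ACGT"}, **TWO_FOLD, "N": "ACGT"}


def pattern_to_regex(pattern):
    """Compile a pattern such as 'CGG-11-CCG' or 'RCGCANGCGY' into a regex.

    A purely numeric token between dashes is a gap of that many arbitrary bases.
    """
    parts = []
    for token in pattern.split("-"):
        if token.isdigit():
            parts.append("[ACGT]{%d}" % int(token))
        else:
            parts.extend("[%s]" % ALPHABET[c] for c in token)
    return re.compile("".join(parts))


def count_occurrences(pattern, sequence):
    """Count matches, including overlapping ones, on the given strand."""
    lookahead = re.compile("(?=%s)" % pattern_to_regex(pattern).pattern)
    return sum(1 for _ in lookahead.finditer(sequence))


# Toy sequence (made up) containing one gapped site.
toy = "AACGG" + "A" * 11 + "CCGTT"
n_hits = count_occurrences("CGG-11-CCG", toy)
```

Running the same counter over all occurrences and over the conserved subset gives the two numbers the binomial score compares.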

Functions of discovered motifs

Evidence of motif function
Promoter motifs:
(1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
Diagram: gene model with the promoter (174 motifs) upstream of the ATG and the 3' UTR (106 motifs) after the stop codon.

(1) Promoter motifs match known TF binding sites
Compare discovered motifs to the TRANSFAC database of 125 known motifs
- 55% of TRANSFAC motifs match discovered motifs
- 45% of discovered motifs match TRANSFAC motifs (only 2% of control sequences match TRANSFAC motifs)

(2) Promoter motifs show preferred distance to TSS
32% of discovered motifs show strong positional bias
- Discovered motifs occur preferentially within 200 bp of the Transcription Start Site
- Individual motifs show strong peaks, regardless of conservation
Plot: distance from TSS for each of the 174 discovered motifs (Motif 8 highlighted), comparing motif instances in human with conserved motif sites in all four species.

(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
Figure panels: new motifs, known TFs.

Summary for promoter motifs

Rank  Discovered motif  Known TF motif  Tissue enrichment / distance bias
1     RCGCAnGCGY        NRF-1           Yes
2     CACGTG            MYC             Yes
3     SCGGAAGY          ELK-1           Yes
4     ACTAYRnnnCCCR     -               Yes
5     GATTGGY           NF-Y            Yes
6     GGGCGGR           SP1             Yes
7     TGAnTCA           AP-1            Yes
8     TMTCGCGAnR        -               Yes
9     TGAYRTCA          ATF3            Yes
10    GCCATnTTG         YY1             Yes
11    MGGAAGTG          GABP            Yes
12    CAGGTG            E12             Yes
13    CTTTGT            LEF1            Yes
14    TGACGTCA          ATF3            Yes
15    CAGCTG            AP-4            Yes
16    RYTTCCTG          C-ETS-2         Yes
17    AACTTT            IRF1(*)         Yes
18    TCAnnTGAY         SREBP-1         Yes
19    GKCGCn(7)TGAYG    -               Yes
20    GTGACGY           E4F1            Yes
21    GGAAnCGGAAnY      -               Yes
22    TGCGCAnK          -               Yes
23    TAATTA            CHX10           Yes
24    GGGAGGRR          MAZ             Yes
25    TGACCTY           ERRA            Yes

Motifs without a known TF match (-) are marked "New" on the slide.

174 promoter motifs
- 70 match known TF motifs
- 115 show expression enrichment
- 60 show positional bias
- 75% have evidence
Control sequences
- < 2% match known TF motifs
- < 5% show expression enrichment
- < 3% show positional bias
- < 7% false positives
Most discovered motifs are likely to be functional.

Summary of Promoter Motifs

Similar analysis in the 5% most conserved regions in human (figure panels grouped by motif length in bp)


Overview of Motif Discovery Algorithms

Motif Representation
Example binding sites: GTATAA, CTATAA, GTCTTA, ATATAC, GTAATA, TTGTAC, GTATTA, GTATTC, ATCTAA
Representations:
- Consensus: GTATAA
- IUPAC: GTATAM
- PSSM
- Complex dependency graphical models
- Nonparametric: graph or bag of words
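The consensus and PSSM representations can be derived from the nine example sites listed on the slide. The sketch below uses a pseudocount of 1 and a uniform background of 0.25 per base; both values, and the function names, are assumptions made for illustration rather than choices stated in the lecture.

```python
import math
from collections import Counter

# The nine binding sites shown on the slide.
INSTANCES = ["GTATAA", "CTATAA", "GTCTTA", "ATATAC", "GTAATA",
             "TTGTAC", "GTATTA", "GTATTC", "ATCTAA"]


def count_matrix(instances):
    """One Counter of base counts per motif position."""
    return [Counter(column) for column in zip(*instances)]


def consensus(instances):
    """Most frequent base at each position (ties broken in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda b: counts[b])
                   for counts in count_matrix(instances))


def frequency_matrix(instances, pseudocount=1.0):
    """Per-position base frequencies, smoothed with a pseudocount."""
    matrix = []
    for counts in count_matrix(instances):
        total = sum(counts.values()) + 4 * pseudocount
        matrix.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return matrix


def log_odds_score(matrix, site, background=0.25):
    """Sum of per-position log2 odds of the site under the motif vs. background."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(matrix, site))


pssm = frequency_matrix(INSTANCES)
best = consensus(INSTANCES)
```

The computed consensus agrees with the one on the slide, and the PSSM additionally records that the second position is invariant while the fifth is nearly an even split between A and T.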

Motif Representation - Pairwise Dependencies
Complex dependency graphical models

Motif Representation - MotifScan
Example binding sites: GTATAA, CTATAA, TTGTAC, GTCTTA, GTAATA, ATATAC, GTATTA, GTATTC, ATCTAA

Motif Finding
Given a set of promoter sequences
- For example, common expression pattern of the respective genes in microarrays

ACCGAGAGTATAAGCTTACGTGACTTGCATGATCTTGCGATGTGTGTTCAGCT
ATCGTACGTTGAGGAGAGGCGGTAATAGAAGTACGTCGATGTCGTCGTACAT
TTCCTATAAGATCGACTGTAGGGAGAGTCTCTGAGAGTATTGCTGGCATGTG
ACTTCGAGGAGAGATTCTCTAGATCTATGCTGTGGTATTAAGAGATCTCTAG
ATCGATGCGCTGATCGCTATAATATATCGGCGGTATCTGGTTGATCTGGTGT
GACTGATGTATCGTATCTGATCTGTCGGTATAATATAGCTGTCTGATTAGTTG
TCTCTAGATGCTGTGCTGATGGTCTTATCGATGTGCGACGGTAATAGTATCCT

Find a common motif that they share

GTATAA GTAATA CTATAA GTATTA CTATAA GTATAA GTAATA

Most Popular Approaches
Expectation Maximization: MEME
- Sequences are mixtures of
  - Motif model M, e.g., a motif PSSM
  - Background model B, e.g., 3rd-order model of promoters
- Learn model by
  - Starting from a random M and a B learned from promoters
  - Assigning each position in the input to M or B, accordingly
  - Re-estimating M and B based on current assignments
Gibbs Sampling: AlignACE, BioProspector
- Update one sequence x at a time
  - Remove x from M
  - Pick a new location in x based on M
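The Gibbs sampling loop (remove one sequence from M, then pick a new location in it based on M) can be sketched as follows. This is a toy version, not AlignACE or BioProspector: it assumes exactly one site per sequence, a uniform background instead of a higher-order model, a pseudocount of 1, and a fixed number of iterations. All names are illustrative.

```python
import random


def profile(sites, k, pseudocount=1.0):
    """Per-position base frequencies (with pseudocounts) for the given sites."""
    prof = []
    for j in range(k):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        prof.append({b: c / total for b, c in counts.items()})
    return prof


def gibbs_motif_search(seqs, k, iterations=500, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from random site positions.
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                  # sequence x to update
        others = [s[p:p + k]
                  for j, (s, p) in enumerate(zip(seqs, pos)) if j != i]
        prof = profile(others, k)                     # M with x removed
        x = seqs[i]
        weights = []
        for start in range(len(x) - k + 1):
            w = 1.0
            for j, base in enumerate(x[start:start + k]):
                w *= prof[j][base] / 0.25             # odds vs. uniform background
            weights.append(w)
        # Sample the new location in x in proportion to its odds under M.
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos
```

Because the new location is sampled rather than chosen greedily, the procedure can escape poor starting positions; in practice it is restarted from several seeds and the best-scoring alignment is kept.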

MotifCut
Construct a graph of all promoters
- Each k-mer in each promoter is a node
- Nodes are connected with edges of weight proportional to sequence similarity
Find maximum density subgraph

Example promoters:
ACAGGATCACTGATGCAGCATGCATGCATCG
CTAGTCGTAGTCTCGATCTAGCTGTGTGTC
CATGATGCGCGATCTTGCTGTGGTCATTAGC
ATCGAGGCGAGAGAGATCTCTCTAGTGTACT

k-mers of the first promoter: ACAGGAT, CAGGATC, AGGATCA, GGATCAC, …
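The graph construction can be sketched in a few functions. Two substitutions are made here and should not be read as MotifCut's actual method: the edge weight is the plain fraction of matching positions (with a hypothetical min_sim threshold to keep the graph sparse), and the maximum density subgraph is approximated by greedy peeling, which repeatedly removes the lowest-degree node and keeps the densest node set seen.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def similarity(a, b):
    """Fraction of matching positions (a stand-in for the real edge weight)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_graph(promoters, k, min_sim=0.7):
    """Nodes are (promoter index, offset, k-mer); edges map (u, v) -> weight."""
    nodes = [(p, i, km) for p, seq in enumerate(promoters)
             for i, km in enumerate(kmers(seq, k))]
    edges = {}
    for u in range(len(nodes)):
        for v in range(u + 1, len(nodes)):
            w = similarity(nodes[u][2], nodes[v][2])
            if w >= min_sim:          # min_sim is an assumed sparsity threshold
                edges[(u, v)] = w
    return nodes, edges


def densest_subgraph(n_nodes, edges):
    """Greedy peeling approximation; density = total edge weight / node count."""
    alive = set(range(n_nodes))
    degree = [0.0] * n_nodes
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    total = sum(edges.values())
    best, best_density = set(alive), total / max(len(alive), 1)
    while len(alive) > 1:
        u = min(alive, key=lambda x: degree[x])   # weakest remaining node
        alive.remove(u)
        for (a, b), w in edges.items():
            if (a == u and b in alive) or (b == u and a in alive):
                degree[b if a == u else a] -= w
                total -= w
        density = total / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
    return best, best_density
```

The nodes in the returned set are the k-mers that are mutually most similar, which is the candidate motif; the quadratic edge construction is only practical for small inputs.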