Motif search GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

One-minute response
Would like to keep format of ½ theory and ½ Python. (x5)
I agree that we don’t need to go over sample problems at beginning of next class.
The stats went from very basic to showing a difficult proof.
Explanation of the proof was kind of a waste of time.
Deriving Bonferroni made sense conceptually but might have helped with more concrete example.

One-minute response: positives
Thank you for giving lots of time to ask questions.
I really understood the statistics part very well. Thank you!
I like that we spent more time on programming.
Python section was very useful / do-able / easy to follow.
It was nice to get the separate reference sheet.

One-minute response: pacing
I thought pace for both were good today. (x2)
Good pacing today. (x2)
Loved the pace of both parts.
Class was really good today.
Good pacing, good exercises.
I liked the pacing for the class.
Good pace and contents overall. (x2)
Theory seemed to move slowly today.
I think we could go a little faster, like introducing if and for loops in the same class.
I don’t think we should go slower with Python.
Going slower through practice problems helped, as well as working through the code.

One-minute response: other
My biggest problem is I keep forgetting to put commas in my print statements.
The second problem was difficult for me to get without help.
Bonus problems for class exercises could be fun.
I could use more sample problems for the theory side.
Wish we spent more time on probability. It seems like we are doing really small chunks.
I would prefer to use a larger % of class time for Python.
Perhaps going more in depth on why you used certain parts of code would help with homework.

Outline
Motifs: definition and motivation
Motif databases
Scanning with a PSSM

What is a motif?
A set of similar substrings within a family of diverged sequences.
[Figure: a motif marked within a long DNA or protein sequence]

Protein motifs
Annotated regions: protein binding site, phosphorylation site, structural motif
HAHU    V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR    M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK    V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU    VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR    VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK    VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU    G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR    G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB   M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY..PDIQNKFSQaFKDLASIKD
GPUGNI  A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL    GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB   M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

Transcription factor binding site motifs Transcription factor binding sites

Why identify motifs?
In proteins:
  Identify functionally important regions of a protein family
  Find similarities to known proteins
In DNA:
  Discover how genes are regulated
  Discover how splicing is regulated

Globin motifs
        xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU    V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR    M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK    V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU    VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR    VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK    VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU    G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR    G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB   M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI  A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL    GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB   M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

        xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU    QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR    QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK    QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU    AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR    AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK    AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU    EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR    EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB   T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI  NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL    NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB   Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

        xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU    VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR    VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK    VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU    VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR    IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK    IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU    QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR    HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB   TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI  ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL    AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB   EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Splice site motif in logo format weblogo.berkeley.edu

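Position-specific scoring matrix
This PSSM assigns the sequence NMFWAFGH a score of 0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12.
[Figure: PSSM with one row per amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) and one column per motif position; only scattered entries survive in this transcript.]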

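Representing a motif as a PSSM
Convert these nine 6-letter sequences into a PSSM. In practice, we add a small pseudocount, convert to frequencies, and take the log.
Sites: AAGTGT TAATGT AATTGT AATTGA ATCTGT TGTTGT AAATGA TTTTGT
(Eight distinct sites appear here; each column of the count matrix sums to 9, which is consistent with AATTGT occurring twice.)
Counts at positions 1 to 6:
A   6  6  2  0  0  2
C   0  0  1  0  0  0
G   0  1  1  0  9  0
T   3  2  5  9  0  7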
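The counting step can be sketched in Python. This is a minimal illustration, not the course's own code: the function name `count_matrix` is hypothetical, and the site list assumes that AATTGT occurs twice, which is what the column totals of 9 imply.

```python
def count_matrix(sites, alphabet="ACGT"):
    """Count how often each letter occurs at each position of the aligned sites."""
    width = len(sites[0])
    counts = {letter: [0] * width for letter in alphabet}
    for site in sites:
        for position, letter in enumerate(site):
            counts[letter][position] += 1
    return counts

# Eight sites from the slide plus an assumed second copy of AATTGT.
sites = ["AAGTGT", "TAATGT", "AATTGT", "AATTGT", "AATTGA",
         "ATCTGT", "TGTTGT", "AAATGA", "TTTTGT"]

counts = count_matrix(sites)
for letter in "ACGT":
    print(letter, counts[letter])
```

Under that assumption, the printed rows reproduce the count matrix on the slide.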

Representing motifs as PSSMs
Sites: AAGTGT TAATGT AATTGT AATTGA ATCTGT TGTTGT AAATGA TTTTGT
Counts:
A   6     6     2     0     0     2
C   0     0     1     0     0     0
G   0     1     1     0     9     0
T   3     2     5     9     0     7
Step 1. Add a pseudocount (0.25 per cell here):
A   6.25  6.25  2.25  0.25  0.25  2.25
C   0.25  0.25  1.25  0.25  0.25  0.25
G   0.25  1.25  1.25  0.25  9.25  0.25
T   3.25  2.25  5.25  9.25  0.25  7.25
Step 2. Convert to frequencies (divide by the column total of 10):
A   0.625  0.625  0.225  0.025  0.025  0.225
C   0.025  0.025  0.125  0.025  0.025  0.025
G   0.025  0.125  0.125  0.025  0.925  0.025
T   0.325  0.225  0.525  0.925  0.025  0.725
Step 3. Take the logarithm. The values shown are consistent with log base 2 of the frequency divided by a uniform background of 0.25; for example, log2(0.625 / 0.25) = 1.32.
A   1.32   1.32  -0.15  -3.32  -3.32  -0.15
C  -3.32  -3.32  -1.00  -3.32  -3.32  -3.32
G  -3.32  -1.00  -1.00  -3.32   1.89  -3.32
T   0.38  -0.15   1.07   1.89  -3.32   1.54
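The three steps can be sketched in Python as follows. The function name `counts_to_pssm` and its parameters are hypothetical. The pseudocount of 0.25 is read off the slide, while the uniform background of 0.25 and the base-2 logarithm are assumptions inferred from the numbers shown, not something the slide states.

```python
import math

# Count matrix from the slide: one list of six position counts per letter.
counts = {
    "A": [6, 6, 2, 0, 0, 2],
    "C": [0, 0, 1, 0, 0, 0],
    "G": [0, 1, 1, 0, 9, 0],
    "T": [3, 2, 5, 9, 0, 7],
}

def counts_to_pssm(counts, pseudocount=0.25, background=0.25):
    """Add a pseudocount, convert to frequencies, then take log2 odds."""
    width = len(next(iter(counts.values())))
    pssm = {letter: [] for letter in counts}
    for position in range(width):
        # Column total after adding the pseudocount to every cell.
        total = sum(counts[letter][position] + pseudocount for letter in counts)
        for letter in counts:
            frequency = (counts[letter][position] + pseudocount) / total
            pssm[letter].append(math.log2(frequency / background))
    return pssm

pssm = counts_to_pssm(counts)
for letter in "ACGT":
    print(letter, [round(score, 2) for score in pssm[letter]])
```

Rounded to two decimals, the output matches the final matrix on the slide, which supports the assumed background and log base.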

Scanning for motif occurrences
Given: a long DNA sequence, and a DNA motif represented as a PSSM
Find: occurrences of the motif in the sequence
Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
PSSM:
A   1.32   1.32  -0.15  -3.32  -3.32  -0.15
C  -3.32  -3.32  -1.00  -3.32  -3.32  -3.32
G  -3.32  -1.00  -1.00  -3.32   1.89  -3.32
T   0.38  -0.15   1.07   1.89  -3.32   1.54

Scanning for motif occurrences
First window, TAATGT (sequence positions 1 to 6): 0.38 + 1.32 - 0.15 + 1.89 + 1.89 + 1.54 = 6.87
Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG

Scanning for motif occurrences
Second window, AATGTT (sequence positions 2 to 7): 1.32 + 1.32 + 1.07 - 3.32 - 3.32 + 1.54 = -1.39
Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
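The sliding-window scan can be sketched in Python. This is an illustrative sketch rather than the course's implementation: the function name `scan` is hypothetical, only the forward strand is scored, and the PSSM values are the rounded ones shown on the slides.

```python
# Rounded PSSM from the slides: one list of six position scores per letter.
PSSM = {
    "A": [1.32, 1.32, -0.15, -3.32, -3.32, -0.15],
    "C": [-3.32, -3.32, -1.00, -3.32, -3.32, -3.32],
    "G": [-3.32, -1.00, -1.00, -3.32, 1.89, -3.32],
    "T": [0.38, -0.15, 1.07, 1.89, -3.32, 1.54],
}

def scan(sequence, pssm):
    """Score every window of motif width; return (start, window, score) tuples."""
    width = len(pssm["A"])
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the matrix entry for each letter at its position in the window.
        score = sum(pssm[letter][i] for i, letter in enumerate(window))
        results.append((start, window, score))
    return results

sequence = "TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG"
hits = scan(sequence, PSSM)

# Show the best-scoring windows first (start positions are 0-based).
for start, window, score in sorted(hits, key=lambda hit: hit[2], reverse=True)[:3]:
    print(start, window, round(score, 2))
```

The first two windows reproduce the scores worked out on the slides (6.87 for TAATGT and -1.39 for AATGTT). Deciding which scores count as real occurrences needs a threshold or p-value, which this sketch does not attempt.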

Summary
DNA and protein motifs represent functionally or structurally important sequence elements.
Motifs are typically represented using position-specific scoring matrices.
A PSSM can be used to scan a given DNA or protein sequence to search for occurrences of the motif.