Motif search GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

1 Motif search GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

2 One-minute response Would like to keep format of ½ theory and ½ python. x5 I agree that we don’t need to go over sample problems at beginning of next class. The stats went from very basic to showing a difficult proof. Explanation of the proof was kind of a waste of time. Deriving Bonferroni made sense conceptually but might have helped with more concrete example.

3 One-minute response: positives
Thank you for giving lots of time to ask questions. I really understood the statistics part very well. Thank you! I like that we spent more time on programming. Python section was very useful / do-able / easy to follow. It was nice to get the separate reference sheet.

4 One-minute response: pacing
I thought pace for both were good today. x2 Good pacing today. x2 Loved the pace of both parts. Class was really good today. Good pacing, good exercises. I liked the pacing for the class. Good pace and contents overall. x2 Theory seemed to move slowly today. I think we could go a little faster, like introducing if and for loops in the same class. I don’t think we should go slower with Python. Going slower through practice problems helped, as well as working through the code.

5 One-minute response: other
My biggest problem is I keep forgetting to put commas in my print statements. The second problem was difficult for me to get without help. Bonus problems for class exercises could be fun. I could use more sample problems for the theory side. Wish we spent more time on probability. It seems like we are doing really small chunks. I would prefer to use a larger % of class time for Python. Perhaps going more in depth on why you used certain parts of code would help with homework.

6 Outline
Motifs – definition and motivation
Motif databases
Scanning with a PSSM

7 What is a motif?
A set of similar substrings within a family of diverged sequences.
[Figure: a motif marked within a long DNA or protein sequence]

8 Protein motifs
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY..PDIQNKFSQaFKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE
Annotated features: protein binding site, phosphorylation site, structural motif

9 Transcription factor binding site motifs
Transcription factor binding sites

10 Why identify motifs?
In proteins: Identify functionally important regions of a protein family. Find similarities to known proteins.
In DNA: Discover how genes are regulated. Discover how splicing is regulated.

11 Globin motifs
       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

12 Splice site motif in logo format
weblogo.berkeley.edu

16 Position-specific scoring matrix
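This PSSM assigns the sequence NMFWAFGH a total score of 12, obtained by summing one matrix entry per position.
[Table: PSSM with one row per amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) and one column per motif position; most entries not preserved]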
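The scoring rule on this slide (look up one entry per position and add them) can be sketched in a few lines of Python. The matrix below is a made-up three-column DNA example, not the protein PSSM from the slide, and the names `pssm` and `score_sequence` are hypothetical.

```python
# Hypothetical 3-column PSSM over the DNA alphabet; the values are invented
# for illustration and do not come from the slide.
pssm = {
    "A": [2, -1, 0],
    "C": [-1, 3, -2],
    "G": [0, -2, 1],
    "T": [-3, 0, 4],
}


def score_sequence(pssm, seq):
    """Return the sum of the PSSM entries selected by each letter of seq."""
    width = len(next(iter(pssm.values())))
    if len(seq) != width:
        raise ValueError("sequence length must equal the motif width")
    # Position i of the sequence selects row seq[i], column i of the matrix.
    return sum(pssm[letter][pos] for pos, letter in enumerate(seq))


# "ACT" selects the entries 2, 3, and 4.
print(score_sequence(pssm, "ACT"))
```

Storing the matrix as a dictionary of rows keeps the lookup close to how the slide draws it: one row per letter, one column per position.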

17 Representing a motif as a PSSM
Convert these nine 6-letter sequences into a PSSM. In practice, we add a small pseudocount, convert to frequencies, and take the log.
AAGTGT TAATGT AATTGT AATTGA ATCTGT TGTTGT AAATGA TTTTGT
[Table: count matrix with rows A, C, G, T; entries not preserved]

18 Representing motifs as PSSMs
Add a pseudocount. Convert to frequencies. Take the logarithm.
AAGTGT TAATGT AATTGT AATTGA ATCTGT TGTTGT AAATGA TTTTGT
[Tables: the matrix with rows A, C, G, T after each step; entries not preserved]
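A minimal Python sketch of the three steps, using the eight sequences that survive in this transcript. The pseudocount of 1, the base-2 logarithm, and the division by a uniform background of 0.25 (which turns the log frequency into a log-odds score) are all assumptions; the slide does not state which values were used. The name `build_pssm` is hypothetical.

```python
import math

# The sequences as listed in this transcript.
sequences = [
    "AAGTGT", "TAATGT", "AATTGT", "AATTGA",
    "ATCTGT", "TGTTGT", "AAATGA", "TTTTGT",
]


def build_pssm(seqs, pseudocount=1.0, background=0.25, alphabet="ACGT"):
    """Build a log-odds PSSM: count, add pseudocount, normalize, take log."""
    width = len(seqs[0])
    # Each column total is the number of sequences plus one pseudocount per letter.
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = {letter: [] for letter in alphabet}
    for pos in range(width):
        column = [s[pos] for s in seqs]
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            # Assumption: score relative to a uniform background frequency.
            pssm[letter].append(math.log2(freq / background))
    return pssm


pssm = build_pssm(sequences)
for letter in "ACGT":
    print(letter, " ".join(f"{value:6.2f}" for value in pssm[letter]))
```

The pseudocount matters because a letter that never appears in a column would otherwise have frequency zero, and the logarithm of zero is undefined.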

19 Scanning for motif occurrences
Given: a long DNA sequence, and a DNA motif represented as a PSSM.
Find: occurrences of the motif in the sequence.
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
[Table: PSSM with rows A, C, G, T; entries not preserved]

20 Scanning for motif occurrences
[Sum of per-position PSSM entries for one window; terms not preserved] = 6.87
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG

21 Scanning for motif occurrences
[Sum of per-position PSSM entries for another window; only the term – 3.32 is preserved] = -1.39
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
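Scanning amounts to sliding a window of the motif's width along the sequence and scoring each window. The sketch below shows one way to do this in Python. The three-column `toy_pssm` and the threshold of 6 are invented for illustration (the slide's own matrix is not preserved), and the function name `scan` is hypothetical. Only the DNA sequence is taken from the slides.

```python
# Hypothetical 3-column PSSM that gives its top score (6) only to "TGT".
toy_pssm = {
    "A": [-1, -1, -1],
    "C": [-1, -1, -1],
    "G": [-1, 2, -1],
    "T": [2, -1, 2],
}

# The DNA sequence shown on the scanning slides.
dna = "TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG"


def scan(pssm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(next(iter(pssm.values())))
    hits = []
    # The last valid window starts at len(sequence) - width.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[letter][pos] for pos, letter in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


hits = scan(toy_pssm, dna, threshold=6)
for start, window, score in hits:
    print(start, window, score)
```

This version assumes every letter in the sequence has a row in the matrix; an ambiguity code such as N would raise a `KeyError` and would need separate handling. How to choose the threshold is a separate question that this sketch does not address.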

22 Summary DNA and protein motifs represent functionally or structurally important sequence elements. Motifs are typically represented using position-specific scoring matrices. A PSSM can be used to scan a given DNA or protein sequence to search for occurrences of the motif.

