Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics: Pattern Recognition and Gene Finding

1

2 Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics Pattern Recognition and Gene Finding Today is the last class. Would you please tell students: 1. Please submit all assignments. The last assignment is due today because your class has no assignment. 2. Please finish the course evaluation.

3 Apr 4 May 2

4

5

6

7

8

9

10 Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics Pattern Recognition and Gene Finding An alternative (Through software tools)

11 Lives of the Scientist

12 World’s Greatest Explorer

13

14

15 Expect = 4e-98 World’s Greatest Musicologist

16 World’s Greatest Microbiologist

17 3337901 TACACCAGAT ATTGATGTCG TTTTGATGGA TGTAATGATG CCAGAAATGG 3337951 ACGGTTACGA AACAACAAGC TTAATCCGCC AAAACGAGCA ATTTAAATCT 3338001 TTGCCGATTA TTGCACTGAC AGCTAAAGCC ATGCAAGGCG ATCGCGAGAA 3338051 GTGTATTGAA GCGGGTGCAT CAGACTACAT CACCAAACCC GTAGATACTG 3338101 AACAACTGCT TTCACTCTTG CGTGTTTGGC TATACCGTTA ATTGGGGCAG 3338151 GGGGCAGGGA GCCGTTGCAA CTATTTCAAC CCTAATAGGG ATTTTGATGA 3338201 ATTGCAATTC CTCCTTCCTC TGGCTCTGCC ACCGTTCAGC AACTTGGTTT 3338251 CAATCCCTGA TAGGGATTTT GATGAATTGC AATATATTAT TTCACAACTG 3338301 GTAAAAACGC TAAAGGTTTA GTTTCAATCC CTGATAGGGA TTTTGATGAA 3338351 TTGCAATGTT AAACTGGTCT GCTTTGCCGA TACCCAAATA TTGCTAGGTT 3338401 TCAATCCCTG ATAGGGATTT TGATGAATTG CAATGAAATC AGAAACATCT 3338451 TTGATTTTTT TGACCATGTT TCAATCCCTG ATAGGGATTT TGATGAATTG 3338501 CAATTTTTTG GGGAAGAGGT AATCTGAAAC AGAATTTAGT ATTTGTTTCA 3338551 ATCCCTGATA GGGATTTTGA TGAATTGCAA TGTTGTTACT TAATCCGTCA 3338601 AATAGTCCCA TTAGATGTTT CAATCCCTGA TAGGGATTTT GATGAATTGC 3338651 AATTTTGTGT TACTTGAATT ACTTTGTTGT AATATGCTGG TTTCAATCCC 3338701 TGATAGGGAT TTTGATGAAT TGCAATCAGC AACGTATGCT GTGGGATGCT 3338751 GGATATGCAC GTTTCAATCC CTGATAGGGA TTTTGATGAA TTGCAATTTG 3338801 CATATCTCCA TCCAACTGTA TTCAGCTGAA AAGTTTCAAT CCCTGATAGG 3338851 GATTTTGATG AATTGCAATC TTCGGCATAA CCATTCTTCC ACCTCCAGTA

18 AATAAAGCTTTACAAA CCAAACTCTGGCTTCA ATTGTGTAACCCAAGC TTTGATTCTTTCCTCTG TTAAATCGGATTGATT ATCTTCATCAAGGGCA AGACCTACAAATTTAC CATCACGAACAGCTTT AGACTCACTGAATTCA TAACCTTCTGTAGGCC AATAGCCAACTGTTTC ACCACCATTTTCTGAA ATTTTTTCCTCTAGAAT ACCGAGGGCATCTTGA AATGTATCAGGATAAC CAACCTGGTCTCCAGG AGCAAAATAAGCAAC TTTTTTGCCGATGAAGT CAATGTTATCTAACTC ATCATAAAAATTTTCC CAATCACTTTGCAATT CTCCAACATTCCAGGT AGGACAACCAACAAC GATATAATCGTAGTTA TTGAAATCACTTGGTT CAGCTTGTGAAATATC ATATAAAGTTACAACA CTATCACCACCAAACT CCTTCTGAATTATTTCT GATTCAGTTTGGGTATT GCCTGTTTGAGTACCA AAAAATAAACCAATA TTAGACATTTTTACTCC TTTTATGTATTTGCAAA ATTATTTCAATTAAAA TATTTAGTAATAATTA ATTGTTAGCTAGCTAA TAATTAAATTTTTATTA CAATCATTGTAAAAGG CATTGAAAAAGTAAAT AAAAATTTTTATTCTAC GTTATTTCAAAAATAT TTACTTACATATACTTA ACCTTTATAGTGATGT AATATACTCTAATTCC TATTTTACTTATAAATA CCATCTCAGCTTAATG TAACGAATTTTTCTGTT TATCTTTAAATACAAA AAATTCAACAAAACTA CAGAAAATTAATCTTA ATAACACAAAACAAG TATCAATCTGTAATAC AACTAAGCTTAAATAA ATTAATAGAAAGCTTC ATCTATCTAATAGGTT GAGAATAGTTTATGTC TAATGACATAAATTCA TTCGTGTTGATTTCATT TGGGTATATTCATCTG ATTTAGGATTTACTCC ATTAAGTTTGTACTCAT CAATGCCCGCCTGTTG GTATCCACAATTCTCA TACAGTGCGCGAGCAA AGTAATCAATCGTTCG TCGCCATATCTAACTTT GAGTCAAACAAACCA GTTGGATTACCAACCC TCAACTAATCGCTTCTT TAAGGCGAGCGATCGC ACATTTAACTGTTGGTT GTCACAAGAGAACTA ATACTACAGCAGTATA TTTAACAACTAAGGGT GGTTCAACTTTCGCTG CGACTCCTCCAACGCG CTGAAATACACAGGA CTGATGCGATCGCAAA CTCTTTGACTAAATTCC ATACATTATCATGACC ATCTCCCAAACAAACA AGTGGGTTAACCAGAT GCTGACTATTAACATC CCCTGAGTTCGGAGTT GTAGGTCTATTTGACT GGTTCAAAGCGATGAT GGAACGGCTTTGTTGC ATGAATTAAAAAAAG ACACACCATCACCTAC TTCTAGGATAGACACA TCAAACGTCCCACCGC CTAAGTCAAATACCAA GATAATTTCGTTAGTTT TCTTGTCAAGTCCGTA AGCGAGGGCCGCCGC CGTGGGCTAGTTGATA ATTCGCAGAACTTTAA TCCCGGCAATTCTACT GGCATCTTTGGTAGCC TGCCGTTGAGAGTCAT TGAAATAGGCAGGGG TGGTAATTACCGCTTG CCTCACTGGTTCCCCC AGATATGTGCTGGCAT CATCTATCAGCTTGCG GACTACCTCATACCAT TTCACGAAAAACCTGA TACACATGTAAACTCT GAAACCCTTGCTGTAT CAAAGTTTTGTAATTA CGAATTACGAATTACG AATTGATATCAGCCGA GATTTCTTCGGGTGAA AATTCCTTGTTCAGAG CGGGACAGTGTAGCTT 
GACATTGCCATTACTG TCACGTACCACTTTGT AAGTAACTTGTTTTGC CTCTTGCGTAACTTCAT CATACCTGCGCCCGAT GAACCGCTTCACAGAA TAAAAAGTGTTTTCTG GGTTCATTACACCCTG GCGCTT

19 Expect = 4e-98 TCTACTTATA TTCAATCCAC AGGGCTACAC AAGAGTCTGT TGAATGAACA CATACATGGT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA Globin Blast

20

21

22

23

24

25

26 AATAAAGCTTTACAAA CCAAACTCTGGCTTCA ATTGTGTAACCCAAGC TTTGATTCTTTCCTCTG TTAAATCGGATTGATT ATCTTCATCAAGGGCA AGACCTACAAATTTAC CATCACGAACAGCTTT AGACTCACTGAATTCA TAACCTTCTGTAGGCC AATAGCCAACTGTTTC ACCACCATTTTCTGAA ATTTTTTCCTCTAGAAT ACCGAGGGCATCTTGA AATGTATCAGGATAAC CAACCTGGTCTCCAGG AGCAAAATAAGCAAC TTTTTTGCCGATGAAGT CAATGTTATCTAACTC ATCATAAAAATTTTCC CAATCACTTTGCAATT CTCCAACATTCCAGGT AGGACAACCAACAAC GATATAATCGTAGTTA TTGAAATCACTTGGTT CAGCTTGTGAAATATC ATATAAAGTTACAACA CTATCACCACCAAACT CCTTCTGAATTATTTCT GATTCAGTTTGGGTATT GCCTGTTTGAGTACCA AAAAATAAACCAATA TTAGACATTTTTACTCC TTTTATGTATTTGCAAA ATTATTTCAATTAAAA TATTTAGTAATAATTA ATTGTTAGCTAGCTAA TAATTAAATTTTTATTA CAATCATTGTAAAAGG CATTGAAAAAGTAAAT AAAAATTTTTATTCTAC GTTATTTCAAAAATAT TTACTTACATATACTTA ACCTTTATAGTGATGT AATATACTCTAATTCC TATTTTACTTATAAATA CCATCTCAGCTTAATG TAACGAATTTTTCTGTT TATCTTTAAATACAAA AAATTCAACAAAACTA CAGAAAATTAATCTTA ATAACACAAAACAAG TATCAATCTGTAATAC AACTAAGCTTAAATAA ATTAATAGAAAGCTTC ATCTATCTAATAGGTT GAGAATAGTTTATGTC TAATGACATAAATTCA TTCGTGTTGATTTCATT TGGGTATATTCATCTG ATTTAGGATTTACTCC ATTAAGTTTGTACTCAT CAATGCCCGCCTGTTG GTATCCACAATTCTCA TACAGTGCGCGAGCAA AGTAATCAATCGTTCG TCGCCATATCTAACTTT GAGTCAAACAAACCA GTTGGATTACCAACCC TCAACTAATCGCTTCTT TAAGGCGAGCGATCGC ACATTTAACTGTTGGTT GTCACAAGAGAACTA ATACTACAGCAGTATA TTTAACAACTAAGGGT GGTTCAACTTTCGCTG CGACTCCTCCAACGCG CTGAAATACACAGGA CTGATGCGATCGCAAA CTCTTTGACTAAATTCC ATACATTATCATGACC ATCTCCCAAACAAACA AGTGGGTTAACCAGAT GCTGACTATTAACATC CCCTGAGTTCGGAGTT GTAGGTCTATTTGACT GGTTCAAAGCGATGAT GGAACGGCTTTGTTGC ATGAATTAAAAAAAG ACACACCATCACCTAC TTCTAGGATAGACACA TCAAACGTCCCACCGC CTAAGTCAAATACCAA GATAATTTCGTTAGTTT TCTTGTCAAGTCCGTA AGCGAGGGCCGCCGC CGTGGGCTAGTTGATA ATTCGCAGAACTTTAA TCCCGGCAATTCTACT GGCATCTTTGGTAGCC TGCCGTTGAGAGTCAT TGAAATAGGCAGGGG TGGTAATTACCGCTTG CCTCACTGGTTCCCCC AGATATGTGCTGGCAT CATCTATCAGCTTGCG GACTACCTCATACCAT TTCACGAAAAACCTGA TACACATGTAAACTCT GAAACCCTTGCTGTAT CAAAGTTTTGTAATTA CGAATTACGAATTACG AATTGATATCAGCCGA GATTTCTTCGGGTGAA AATTCCTTGTTCAGAG CGGGACAGTGTAGCTT 
GACATTGCCATTACTG TCACGTACCACTTTGT AAGTAACTTGTTTTGC CTCTTGCGTAACTTCAT CATACCTGCGCCCGAT GAACCGCTTCACAGAA TAAAAAGTGTTTTCTG GGTTCATTACACCCTG GCGCTT Program the computer

27 Biology researchers do not program Program the computer 10 Biology and Microbiology Depts at major universities

28 Why hasn't it happened? Programming languages An alternative

29 Lives of the Scientist (Part II)

30 Repeated sequences bacterial genomes REP sequences genes Genome of E. coli K12 str MG1655

31

32

33

34 Algorithm to extract REP sequences Pattern

35 Algorithm to extract REP sequences Pattern "

36 Algorithm to extract REP sequences Pattern "repeat_region "

37 Algorithm to extract REP sequences Pattern "repeat_region "

38 Algorithm to extract REP sequences Pattern "repeat_region " Special symbols... As many of previous character as possible

39 Algorithm to extract REP sequences Pattern "repeat_region... " Special symbols... As many of previous character as possible

40 Algorithm to extract REP sequences Pattern "repeat_region... " Special symbols... As many of previous character as possible # A single digit

41 Algorithm to extract REP sequences Pattern "repeat_region...# " Special symbols... As many of previous character as possible # A single digit

42 Algorithm to extract REP sequences Pattern "repeat_region...#... " Special symbols... As many of previous character as possible # A single digit

43 Algorithm to extract REP sequences Pattern "repeat_region...#... " Special symbols... As many of previous character as possible # A single digit () Capture what's inside

44 Algorithm to extract REP sequences Pattern "repeat_region...(#...) " Special symbols... As many of previous character as possible # A single digit () Capture what's inside

45 Algorithm to extract REP sequences Pattern "repeat_region...(#...) " Special symbols... As many of previous character as possible # A single digit () Capture what's inside

46 Algorithm to extract REP sequences Pattern "repeat_region...(#...) " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character

47 Algorithm to extract REP sequences Pattern "repeat_region...(#...)** " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character

48 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...) " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character

49 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)* " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character

50 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)* " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary

51 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*.. " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary

52 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*.. " Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or ''

53 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*..' '" Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or ''

54 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*..'( )'" Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or ''

55 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*..'(*..)'" Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or ''
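
For readers more used to conventional regular expressions, the pattern built up on these slides can be approximated in Python's `re` syntax. This is only a rough sketch, not BioBIKE's own implementation. The symbol mapping follows the slide legend (`...` as `+`, `#` as `\d`, `*` as `.`, `..` as `*?`). Three things are assumptions: that the `...` after `repeat_region` applies to a space lost in extraction, that the quote character may be either `'` or `"` (the legend entry for quotes did not survive), and the sample text, which is an invented fragment in the style of a GenBank feature table.

```python
import re

# Assumed translation of the slide's pattern into Python regex syntax.
REP_PATTERN = re.compile(
    r"repeat_region +"   # literal text, then as many spaces as possible
    r"(\d+)"             # capture: start coordinate (digits)
    r".."                # any two characters (the '..' between coordinates)
    r"(\d+)"             # capture: end coordinate (digits)
    r".*?"               # any characters, as few as necessary
    r"['\"](.*?)['\"]",  # capture: text between quotes (quote style assumed)
    re.DOTALL,           # let '.' cross line breaks
)

# Hypothetical annotation text, for illustration only.
SAMPLE = '''
     repeat_region   100..135
                     /note="REP example a"
     gene            190..255
     repeat_region   300..338
                     /note="REP example b"
'''


def extract_rep_features(text):
    """Return (start, end, note) for every repeat_region match in text."""
    return [(int(s), int(e), note) for s, e, note in REP_PATTERN.findall(text)]


if __name__ == "__main__":
    for feature in extract_rep_features(SAMPLE):
        print(feature)
```

The three capture groups correspond to the three pairs of parentheses in the slide's pattern: start coordinate, end coordinate, and the quoted description.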

56 We start Go to: www.people.vcu.edu/~elhaij Click: MICR 653

57 www.people.vcu.edu/~elhaij Click MICR 653 Using Firefox

58

59 biobike.csbc.vcu.edu

60

61

62 Function palette Workspace Results window

63

64 General Syntax of BioBIKE Function-name Argument (object) Keyword object Flag The basic unit of BioBIKE is the function box. It consists of the name of a function, perhaps one or more required arguments, and optional keywords and flags. A function may be thought of as a black box: you feed it information, it produces a product.

65 General Syntax of BioBIKE
Function-name Argument (object) Keyword object Flag
Function boxes contain the following elements:
Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
Argument: Required, acted on by function
Keyword clause: Optional, more information
Flag: Optional, more (yes/no) information

66 General Syntax of BioBIKE
Function-name Argument (object) Keyword object Flag
… and icons to help you work with functions:
Option icon: Brings up a menu of keywords and flags
Clear/Delete icon: Removes information you entered or removes box entirely
Action icon: Brings up a menu enabling you to execute a function, copy and paste information, get help, etc.

67 Functions Sin Angle Sin (angle)

68 Functions Length Entity

69 Functions Length Entity "icahLnlna bormA" 14 Abraham Lincoln "Abraham Lincoln" 192 14 variable vs literal

70 Functions Length Entity "icahLnlna bormA" 14 Abraham Lincoln "Abraham Lincoln" 192 14 US-presidents 44 list vs single value

71 Functions Length Entity "icahLnlna bormA" 14 Abraham Lincoln "Abraham Lincoln" 192 14 US-presidents 44 (188 170 189 163 …) single application of a function vs iteration of a function

72 Arcsin Functions Sin Angle

73 Arcsin Functions Angle Sin (angle) Nested functions Evaluated from the inside out A box is replaced by its value
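
The inside-out rule can be mimicked in ordinary Python, purely as an analogy to BioBIKE's nested boxes. The angle value below is an arbitrary choice for illustration.

```python
import math

angle = 0.5  # radians; an arbitrary illustrative value

# Step by step: the inner box is evaluated first and replaced by its value.
inner = math.sin(angle)
outer = math.asin(inner)

# The same computation written as one nested expression.
nested = math.asin(math.sin(angle))

print(inner, outer, nested)
```

Both routes give the same result because the nested form is evaluated in exactly the order of the step-by-step form.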

74 Gene (npf0076) Functions "transposase"

75 Gene (npf0076) Functions Nested functions Evaluated from the inside out A box is replaced by its value

76 Gene (npf0076) Pitfalls (the most common error in the language) CLOSE BOXES BEFORE EXECUTING White is incompatible with execution

77 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*..'(*..)'" Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or ''

78

79 Algorithm to extract REP sequences Pattern "repeat_region...(#...)**(#...)*..'(*..)'" Special symbols... As many of previous character as possible # A single digit () Capture what's inside * Any character.. As few of previous character as necessary '' or '' s

80

81 Mining files for data Pattern matching Works great Highly flexible Quick and easy BUT... Unforgiving (1 mismatch → death)

82 Conserved motifs of methyltransferases Pattern "[DS]PP[YF]" Special symbols [ ] Character set
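
The character-set pattern on this slide is also valid as a Python regular expression. A minimal sketch of searching a protein with it follows; the protein fragment is invented for illustration and is not a real methyltransferase.

```python
import re

# [DS] = D or S, then P, P, then [YF] = Y or F
MOTIF = re.compile(r"[DS]PP[YF]")


def find_motif(protein):
    """Return (1-based position, matched text) for each motif occurrence."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(protein)]


# Hypothetical protein fragment, invented for illustration.
protein = "MKLVDPPYGANTSPPFQRSPPW"
print(find_motif(protein))
```

The final `SPPW` in the fragment is not reported, which illustrates the "unforgiving" nature of pattern matching: one residue outside the character set and the site is missed.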

83 Searching for conserved motifs Pattern matching Ignores lots of information Unforgiving (1 mismatch → death) Quick and easy Position-specific scoring matrices (PSSMs)

84 Searching for conserved motifs Pattern matching Ignores lots of information Unforgiving (1 mismatch → death) Quick and easy Position-specific scoring matrices (PSSMs) Needs training set What if you don’t have one?

85 Lives of the Scientist (Part III)

86

87 New pattern discovery (Meme, Gibbs sampler, BioProspector)
What to do with no training set?
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin         GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E           TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14            GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17            TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19  ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1     GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin bβ2     GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.   CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin   TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin           CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
“TATA box”?

88 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence

89 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences

90 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Matches:
GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG
Frequency table:
     1   2   3   4   5   6   7   8   9   10  11  12
A    0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.4 0.2 0.0 0.2 0.2
C    0.0 0.8 1.0 0.4 0.4 0.2 0.0 0.2 0.2 0.4 0.4 0.0
G    1.0 0.0 0.0 0.4 0.6 0.6 1.0 0.2 0.6 0.4 0.0 0.4
T    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.2 0.4 0.4
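
Step 3 can be reproduced in a few lines of Python. The sketch below takes the five 12-base matches listed on the slide and counts, column by column, how often each base occurs. The function and variable names are illustrative choices, not part of Meme.

```python
from collections import Counter

# The five matches shown on the slide, one per sequence.
MATCHES = [
    "GACAGGGCAGAA",
    "GCCCGGGTGTTT",
    "GCCGGGGACGCG",
    "GCCCCCGGGCCT",
    "GCCGCAGAGCTG",
]


def frequency_table(matches, alphabet="ACGT"):
    """Return {base: [frequency at position 1, 2, ...]} for aligned matches."""
    n = len(matches)
    table = {base: [] for base in alphabet}
    for column in zip(*matches):      # one column per motif position
        counts = Counter(column)
        for base in alphabet:
            table[base].append(counts[base] / n)
    return table


table = frequency_table(MATCHES)
for base, freqs in table.items():
    print(base, " ".join(f"{f:.1f}" for f in freqs))
```

Each printed row corresponds to one row (A, C, G, T) of the slide's table, with one value per motif position.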

91 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Matches:
GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG
Frequency table:
     1   2   3   4   5   6   7   8   9   10  11  12
A    0.0 0.2 0.0 0.2 0.0 0.2 0.0 0.4 0.2 0.0 0.2 0.2
C    0.0 0.8 1.0 0.4 0.4 0.2 0.0 0.2 0.2 0.4 0.4 0.0
G    1.0 0.0 0.0 0.4 0.6 0.6 1.0 0.2 0.6 0.4 0.0 0.4
T    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.2 0.4 0.4
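
One plausible reading of Step 4 is sketched below: the probability of a 12-base site under the frequency table is the product of the per-position frequencies, and it is made "relative" by dividing by the probability of the same site under a background model. The uniform background of 0.25 per base and the function name are assumptions. Real tools typically also add pseudocounts so that a single zero does not wipe out the whole score, which this sketch omits.

```python
import math

# Frequency table copied from the slide (rows A, C, G, T; columns 1-12).
FREQ = {
    "A": [0.0, 0.2, 0.0, 0.2, 0.0, 0.2, 0.0, 0.4, 0.2, 0.0, 0.2, 0.2],
    "C": [0.0, 0.8, 1.0, 0.4, 0.4, 0.2, 0.0, 0.2, 0.2, 0.4, 0.4, 0.0],
    "G": [1.0, 0.0, 0.0, 0.4, 0.6, 0.6, 1.0, 0.2, 0.6, 0.4, 0.0, 0.4],
    "T": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.2, 0.4, 0.4],
}

BACKGROUND = 0.25  # assumed: all four bases equally likely by chance


def relative_probability(site, freq=FREQ, background=BACKGROUND):
    """Ratio of P(site | frequency table) to P(site | background)."""
    p_table = math.prod(freq[base][i] for i, base in enumerate(site))
    p_background = background ** len(site)
    return p_table / p_background


for site in ("GACAGGGCAGAA", "GCCGGGGACGCG", "TTTTTTTTTTTT"):
    print(site, relative_probability(site))
```

A ratio above 1 means the site fits the table better than chance; a site containing any base with frequency 0.0 at its position scores exactly 0.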

92 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score

93 How does Meme work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
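
The six steps can be strung together as a toy loop. The sketch below follows the slide's simplified description, not the actual MEME program, which uses expectation maximization. Several choices are assumptions made for illustration: "best match" means the window with the fewest mismatches, the score is the product of table-to-background probability ratios over all matches, and the motif width (12), the number of rounds, and the random seed are arbitrary.

```python
import random

SEQUENCES = [
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",  # snRNA U1 (pU1-6)
    "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",  # histone H1t
    "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",  # HMG-14
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",  # TP1
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",  # protamine P1
]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(pattern, sequence):
    """Step 2: the window of sequence with the fewest mismatches to pattern."""
    k = len(pattern)
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return min(windows, key=lambda w: hamming(pattern, w))


def frequency_table(matches):
    """Step 3: per-position base frequencies, as a list of dicts."""
    n = len(matches)
    return [{b: col.count(b) / n for b in "ACGT"} for col in zip(*matches)]


def score(matches, table, background=0.25):
    """Step 4: product of table/background probability ratios (assumed score)."""
    total = 1.0
    for match in matches:
        for pos, base in enumerate(match):
            total *= table[pos][base] / background
    return total


def discover(sequences, width=12, rounds=200, seed=0):
    """Steps 1-6: return (best score, seed pattern, matches)."""
    rng = random.Random(seed)
    best = (0.0, None, None)
    for _ in range(rounds):                                      # Step 6
        source = rng.choice(sequences)                           # Step 1
        start = rng.randrange(len(source) - width + 1)
        candidate = source[start:start + width]
        matches = [best_match(candidate, s) for s in sequences]  # Step 2
        table = frequency_table(matches)                         # Step 3
        current = score(matches, table)                          # Step 4
        if current > best[0]:                                    # Step 5
            best = (current, candidate, matches)
    return best


if __name__ == "__main__":
    best_score, pattern, matches = discover(SEQUENCES)
    print(pattern, best_score)
    for m in matches:
        print(" ", m)
```

Fixing the seed makes the run repeatable; changing it, or raising `rounds`, explores different starting patterns.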

94 New pattern discovery (Meme, Gibbs sampler, BioProspector)
What to do with no training set?
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin         GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E           TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14            GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17            TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19  ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1     GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin bβ2     GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.   CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin   TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin           CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA

95 Searching for conserved motifs Pattern matching Ignores lots of information Unforgiving (1 mismatch → death) Quick and easy Position-specific scoring matrices (PSSMs) Needs training set Meme, Gibbs sampler, et al. (PSSM in reverse) Relatively unbiased Can't easily handle variable-length gaps DETAILS

96 Moral of the Stories

97

98

99 Are you comfortable using programming in the service of your research?
None… This is beyond my responsibilities in the lab.
I have zero experience in computer programming before this class
I am about 60% confident in using Python
I have experience using Python, Java, Unix & DOS environments, R, mySQL/SQL, and SAS
I have no experience in programming
I have had some R experience… However, I am still a novice

100 www.people.vcu.edu/~elhaij Click MICR 653 Using Firefox

101

102 Scientific Questions I. What determines the beginning of a gene?

103 Scientific Questions I. What determines the beginning of a gene?

104 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? HIV

105 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated?

106 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated?

107 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs)

108 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs)

109 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data

110 Metabolic correlates to N-deprivation What enzymes of carbon metabolism are affected by N-starvation? Cyanobacteria use primarily the reactions of the Pentose Phosphate Pathway to break down glucose derivatives. They use carbon fixation reactions to build glucose. These sets overlap a great deal. Carbon fixation Pentose Phosphate Pathway Glycogen metabolism

111 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data RNAseq

112 Measuring RNA through Microarrays Spot Courtesy of Inst. für Hormon- und Fortpflanzungsforschung, Universität Hamburg RNA from cell type #1 + RNA from cell type #2 Scan for red fluorescence Scan for green fluorescence Combine images Type #1 RNA > Type #2 RNA Type #2 RNA > Type #1 RNA Type #1 RNA ≈ Type #2 RNA

113 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data Difference in intensity chip to chip different conditions or different replicates

114 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data Difference in intensity chip to chip different conditions or different replicates

115 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data V. CRISPRs in enteric bacteria GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACAATAACTGGATAGCACTAGCAGAAGGGCTAGAAGGTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACGTAT

116 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data V. CRISPRs in enteric bacteria GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACAATAACTGGATAGCACTAGCAGAAGGGCTAGAAGGTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACGTAT
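
The sequence on this slide contains a long direct repeat separated by a unique spacer, which is the signature of a CRISPR array. A brute-force way to find such a repeat is sketched below. It is an illustration only, not the method used in the course: the function name and the minimum length of 20 are arbitrary choices, and the quadratic search is practical only for short sequences like this one.

```python
def longest_repeat(seq, min_len=20):
    """Return (repeat, first_start, second_start) for the longest substring
    that occurs twice without overlapping, or None if none reaches min_len."""
    n = len(seq)
    for length in range(n // 2, min_len - 1, -1):   # try longest first
        for i in range(n - 2 * length + 1):
            # Look for a second copy that starts after the first one ends.
            j = seq.find(seq[i:i + length], i + length)
            if j != -1:
                return seq[i:i + length], i, j
    return None


# Sequence copied from the slide.
SEQ = "GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACAATAACTGGATAGCACTAGCAGAAGGGCTAGAAGGTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACGTAT"

result = longest_repeat(SEQ)
if result is not None:
    repeat, first, second = result
    spacer = SEQ[first + len(repeat):second]
    print("repeat:", repeat, "at", first, "and", second)
    print("spacer:", spacer)
```

The text between the two copies is the spacer, the part of a CRISPR array that differs from unit to unit.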

117 Scientific Questions VI. Finding targets for DNA-binding proteins

118 Scientific Questions I. What determines the beginning of a gene? II. Where in a bacterial genome are viruses integrated? III. Determination of short tandem repeats (STRs) IV. Analysis of gene expression data V. CRISPRs in enteric bacteria VI. Finding targets for DNA-binding proteins (targets known) VII. Finding targets for DNA-binding proteins (genes known)

119

