Welcome to Advanced Molecular Genetics, Bioinformatics, and Computational Genomics Pattern Recognition and Gene Finding
15. Apr 27 20
An alternative
Lives of the Scientist
Expect = 4e-98
TACACCAGAT ATTGATGTCG TTTTGATGGA TGTAATGATG CCAGAAATGG ACGGTTACGA AACAACAAGC TTAATCCGCC AAAACGAGCA ATTTAAATCT TTGCCGATTA TTGCACTGAC AGCTAAAGCC ATGCAAGGCG ATCGCGAGAA GTGTATTGAA GCGGGTGCAT CAGACTACAT CACCAAACCC GTAGATACTG AACAACTGCT TTCACTCTTG CGTGTTTGGC TATACCGTTA ATTGGGGCAG GGGGCAGGGA GCCGTTGCAA CTATTTCAAC CCTAATAGGG ATTTTGATGA ATTGCAATTC CTCCTTCCTC TGGCTCTGCC ACCGTTCAGC AACTTGGTTT CAATCCCTGA TAGGGATTTT GATGAATTGC AATATATTAT TTCACAACTG GTAAAAACGC TAAAGGTTTA GTTTCAATCC CTGATAGGGA TTTTGATGAA TTGCAATGTT AAACTGGTCT GCTTTGCCGA TACCCAAATA TTGCTAGGTT TCAATCCCTG ATAGGGATTT TGATGAATTG CAATGAAATC AGAAACATCT TTGATTTTTT TGACCATGTT TCAATCCCTG ATAGGGATTT TGATGAATTG CAATTTTTTG GGGAAGAGGT AATCTGAAAC AGAATTTAGT ATTTGTTTCA ATCCCTGATA GGGATTTTGA TGAATTGCAA TGTTGTTACT TAATCCGTCA AATAGTCCCA TTAGATGTTT CAATCCCTGA TAGGGATTTT GATGAATTGC AATTTTGTGT TACTTGAATT ACTTTGTTGT AATATGCTGG TTTCAATCCC TGATAGGGAT TTTGATGAAT TGCAATCAGC AACGTATGCT GTGGGATGCT GGATATGCAC GTTTCAATCC CTGATAGGGA TTTTGATGAA TTGCAATTTG CATATCTCCA TCCAACTGTA TTCAGCTGAA AAGTTTCAAT CCCTGATAGG GATTTTGATG AATTGCAATC TTCGGCATAA CCATTCTTCC ACCTCCAGTA
AATAAAGCTTTACAAA CCAAACTCTGGCTTCA ATTGTGTAACCCAAGC TTTGATTCTTTCCTCTG TTAAATCGGATTGATT ATCTTCATCAAGGGCA AGACCTACAAATTTAC CATCACGAACAGCTTT AGACTCACTGAATTCA TAACCTTCTGTAGGCC AATAGCCAACTGTTTC ACCACCATTTTCTGAA ATTTTTTCCTCTAGAAT ACCGAGGGCATCTTGA AATGTATCAGGATAAC CAACCTGGTCTCCAGG AGCAAAATAAGCAAC TTTTTTGCCGATGAAGT CAATGTTATCTAACTC ATCATAAAAATTTTCC CAATCACTTTGCAATT CTCCAACATTCCAGGT AGGACAACCAACAAC GATATAATCGTAGTTA TTGAAATCACTTGGTT CAGCTTGTGAAATATC ATATAAAGTTACAACA CTATCACCACCAAACT CCTTCTGAATTATTTCT GATTCAGTTTGGGTATT GCCTGTTTGAGTACCA AAAAATAAACCAATA TTAGACATTTTTACTCC TTTTATGTATTTGCAAA ATTATTTCAATTAAAA TATTTAGTAATAATTA ATTGTTAGCTAGCTAA TAATTAAATTTTTATTA CAATCATTGTAAAAGG CATTGAAAAAGTAAAT AAAAATTTTTATTCTAC GTTATTTCAAAAATAT TTACTTACATATACTTA ACCTTTATAGTGATGT AATATACTCTAATTCC TATTTTACTTATAAATA CCATCTCAGCTTAATG TAACGAATTTTTCTGTT TATCTTTAAATACAAA AAATTCAACAAAACTA CAGAAAATTAATCTTA ATAACACAAAACAAG TATCAATCTGTAATAC AACTAAGCTTAAATAA ATTAATAGAAAGCTTC ATCTATCTAATAGGTT GAGAATAGTTTATGTC TAATGACATAAATTCA TTCGTGTTGATTTCATT TGGGTATATTCATCTG ATTTAGGATTTACTCC ATTAAGTTTGTACTCAT CAATGCCCGCCTGTTG GTATCCACAATTCTCA TACAGTGCGCGAGCAA AGTAATCAATCGTTCG TCGCCATATCTAACTTT GAGTCAAACAAACCA GTTGGATTACCAACCC TCAACTAATCGCTTCTT TAAGGCGAGCGATCGC ACATTTAACTGTTGGTT GTCACAAGAGAACTA ATACTACAGCAGTATA TTTAACAACTAAGGGT GGTTCAACTTTCGCTG CGACTCCTCCAACGCG CTGAAATACACAGGA CTGATGCGATCGCAAA CTCTTTGACTAAATTCC ATACATTATCATGACC ATCTCCCAAACAAACA AGTGGGTTAACCAGAT GCTGACTATTAACATC CCCTGAGTTCGGAGTT GTAGGTCTATTTGACT GGTTCAAAGCGATGAT GGAACGGCTTTGTTGC ATGAATTAAAAAAAG ACACACCATCACCTAC TTCTAGGATAGACACA TCAAACGTCCCACCGC CTAAGTCAAATACCAA GATAATTTCGTTAGTTT TCTTGTCAAGTCCGTA AGCGAGGGCCGCCGC CGTGGGCTAGTTGATA ATTCGCAGAACTTTAA TCCCGGCAATTCTACT GGCATCTTTGGTAGCC TGCCGTTGAGAGTCAT TGAAATAGGCAGGGG TGGTAATTACCGCTTG CCTCACTGGTTCCCCC AGATATGTGCTGGCAT CATCTATCAGCTTGCG GACTACCTCATACCAT TTCACGAAAAACCTGA TACACATGTAAACTCT GAAACCCTTGCTGTAT CAAAGTTTTGTAATTA CGAATTACGAATTACG AATTGATATCAGCCGA GATTTCTTCGGGTGAA AATTCCTTGTTCAGAG CGGGACAGTGTAGCTT GACATTGCCATTACTG 
TCACGTACCACTTTGT AAGTAACTTGTTTTGC CTCTTGCGTAACTTCAT CATACCTGCGCCCGAT GAACCGCTTCACAGAA TAAAAAGTGTTTTCTG GGTTCATTACACCCTG GCGCTT
Expect = 4e-98 TCTACTTATA TTCAATCCAC AGGGCTACAC AAGAGTCTGT TGAATGAACA CATACATGGT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCGTAAAC CTCTAACATG ATGTCAGCAA TGAATAAACT TTGTTAAAGG TACAAATGAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT AAACCTGTAT GGTTACATGA ACTGCCTAAA TTATATATTT TAAGAAATTA ATTGCAATTA CCCCAGCTGT CATTAAAAAG AGGCAAATAC GACAGCACTG ACCCTCAAGA AGGCACCGGC GCTGAAATTC CGCTGAGAGC AGAGTGGTAC CCCTGCACCA GGTCTTTCCT GTGGGCACTG ATGAATGACT GAACGAACGA TTGAATGAAA Globin Blast
Biology researchers do not program Program the computer 10 Biology and Microbiology Depts at major universities
Why hasn't it happened? Programming languages An alternative
Lives of the Scientist (Part II)
Repeated sequences bacterial genomes REP sequences genes Genome of E. coli K12 str MG1655
Algorithm to extract REP sequences
Pattern "repeat_region...(#...)**(#...)*..'(*..)'"
Special symbols
  ...   As many of previous character as possible
  #     A single digit
  ()    Capture what's inside
  *     Any character
  ..    As few of previous character as necessary
  '' or ''
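For readers more used to conventional regular expressions, the BioBIKE pattern above can be approximated in Python. This is only a rough sketch under assumptions the slide does not state: that the input is GenBank-style feature text, that the quote symbol in the pattern stands for the double quotes around a /note value, and that the first `...` is meant to skip the spaces after `repeat_region`. The sample text, its coordinates, and the names `SAMPLE`, `REP_PATTERN`, and `extract_reps` are invented for illustration.

```python
import re

# Hypothetical GenBank-style feature lines (coordinates and names are invented).
SAMPLE = '''     repeat_region   5565..5669
                     /note="REP1a"
     gene            5683..6459
                     /gene="geneX"
     repeat_region   15388..15420
                     /note="REP2"
'''

# Rough regex counterpart of "repeat_region...(#...)**(#...)*..'(*..)'":
#   \s+     spaces after the keyword (assumed meaning of the first "...")
#   (\d+)   "(#...)"  capture as many digits as possible
#   ..      "**"      any two characters (the ".." between the coordinates)
#   .*?"    "*..'"    as few characters as necessary, up to a quote
#   (.*?)"  "(*..)'"  capture as few characters as necessary, up to the closing quote
REP_PATTERN = re.compile(r'repeat_region\s+(\d+)..(\d+).*?"(.*?)"', re.DOTALL)


def extract_reps(text):
    """Return (start, end, description) for every repeat_region in text."""
    return [(int(start), int(end), note)
            for start, end, note in REP_PATTERN.findall(text)]


if __name__ == "__main__":
    for start, end, note in extract_reps(SAMPLE):
        print(start, end, note)
```

The `gene` feature in the sample is skipped because the match must begin with the literal text `repeat_region`.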
We start
Using Firefox, go to: biobike.csbc.vcu.edu
Click: MICR 653
Screen areas: Function palette, Workspace, Results window
General Syntax of BioBIKE
Function-name | Argument (object) | Keyword object | Flag

The basic unit of BioBIKE is the function box. It consists of the name of a function, perhaps one or more required arguments, and optional keywords and flags. A function may be thought of as a black box: you feed it information, it produces a product.

Function boxes contain the following elements:
Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
Argument: Required, acted on by function
Keyword clause: Optional, more information
Flag: Optional, more (yes/no) information

… and icons to help you work with functions:
Option icon: Brings up a menu of keywords and flags
Clear/Delete icon: Removes information you entered or removes box entirely
Action icon: Brings up a menu enabling you to execute a function, copy and paste information, get help, etc.
Functions Sin Angle Sin (angle)
Functions
Length (Entity)
Examples shown: "icahLnlna bormA", 14, Abraham Lincoln, "Abraham Lincoln", US-presidents, 44, ( …)
Distinctions illustrated:
variable vs literal
list vs single value
single application of a function vs iteration of a function
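The three distinctions above carry over to most programming languages. Here is a small Python sketch, with the built-in `len` standing in for the Length function and a shortened, hypothetical list in place of the full list of presidents. The variable names are invented for illustration.

```python
# A quoted string is a literal; an unquoted name is a variable that holds a value.
name = "John Adams"
literal_length = len("name")      # length of the four-character literal "name"
variable_length = len(name)       # length of the value stored in the variable

# A list is measured as a whole: its length is the number of elements.
presidents = ["George Washington", "John Adams", "Thomas Jefferson"]
list_length = len(presidents)

# Iteration applies the function to each element instead of to the list itself.
each_length = [len(p) for p in presidents]

print(literal_length, variable_length, list_length, each_length)
# prints: 4 10 3 [17, 10, 16]
```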
Functions: nested functions
Arcsin (Sin (angle))
Evaluated from the inside out
A box is replaced by its value
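The inside-out rule can be seen in any language with nested calls. Below is a minimal Python sketch using the standard `math` module. The input angle is an arbitrary choice, and the stepwise version only spells out what the nested call already does.

```python
import math

angle = math.pi / 6            # hypothetical input, in radians

# Step by step: the inner box is evaluated first and replaced by its value...
inner_value = math.sin(angle)
stepwise = math.asin(inner_value)

# ...which is exactly what the nested call does in a single expression.
nested = math.asin(math.sin(angle))

print(stepwise == nested)
```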
Functions
Gene (npf0076)
"transposase"
Nested functions: evaluated from the inside out; a box is replaced by its value

Pitfalls (the most common error in the language)
CLOSE BOXES BEFORE EXECUTING
White is incompatible with execution
Mining files for data
Pattern matching: works great; highly flexible; quick and easy. BUT...

Searching for conserved motifs
Pattern matching: quick and easy, but unforgiving (one mismatch is death)
Conserved motifs of methyltransferases
Pattern "[DS]PP[YF]"
Special symbols
  [ ]   Character set
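Character sets work the same way in Python's `re` module, so the methyltransferase pattern can be tried directly. The protein fragments below are invented for illustration. The third one differs from the first by a single residue and shows how unforgiving exact pattern matching is.

```python
import re

# Same pattern as on the slide: D or S, then PP, then Y or F.
MOTIF = re.compile(r"[DS]PP[YF]")

# Hypothetical protein fragments, invented for illustration.
proteins = {
    "exact_D_Y": "MKLIDPPYGNSE",
    "exact_S_F": "AVTSPPFLLK",
    "one_mismatch": "MKLIDPAYGNSE",   # second P replaced by A
}

hits = {name: MOTIF.findall(seq) for name, seq in proteins.items()}

for name, found in hits.items():
    print(name, found)
```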
Searching for conserved motifs
Pattern matching: quick and easy; unforgiving (one mismatch is death); ignores lots of information
Position-specific scoring matrices (PSSMs): needs training set. What if you don't have one?
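To make the PSSM idea concrete, here is a minimal Python sketch. It assumes a small, hypothetical training set of aligned TATA-like sites (not taken from the lecture), adds a pseudocount of 1 so that unseen bases do not score minus infinity, and assumes a uniform background of 0.25 per base. All function names are invented. Unlike exact pattern matching, every window receives a score, so a site with one mismatch is ranked lower rather than missed.

```python
import math

# Hypothetical training set: aligned examples of a TATA-like motif.
TRAINING = ["TATAAA", "TATATA", "TATAAG", "TATAAA"]


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per motif position mapping base -> log2 odds score."""
    pssm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm


def score_window(pssm, window):
    """Sum the position-specific scores of the bases in one window."""
    return sum(col[base] for col, base in zip(pssm, window))


def best_hit(pssm, seq):
    """Return (score, position, window) of the highest-scoring window."""
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    pssm = build_pssm(TRAINING)
    upstream = "CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA"
    print(best_hit(pssm, upstream))
```

The sketch needs the training set up front, which is exactly the limitation raised above.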
Lives of the Scientist (Part III)
New pattern discovery (Meme, Gibbs sampler, BioProspector)
What to do with no training set?

Human sequences 5' to transcriptional start
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin           GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E             TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14              GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17              TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19    ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1       GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin bβ2       GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.     CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin     TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin             CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
How does Meme work?

snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT

Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
    GACAGGGCAGAA
    GCCCGGGTGTTT
    GCCGGGGACGCG
    GCCCCCGGGCCT
    GCCGCAGAGCTG
    Frequency table rows: A C G T
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps
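The steps listed above can be sketched in Python. This is a deliberately simplified illustration of the slide's outline, not the actual MEME implementation: "best match" is taken here to mean the window with the fewest mismatches, the frequency table uses a pseudocount of 1, the score is a log-odds sum against an assumed uniform background, and Step 1 tries every window of the first sequence instead of choosing one arbitrarily. All function names are hypothetical.

```python
import math

# The five upstream sequences shown on the slide.
SEQUENCES = [
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",  # snRNA U1 (pU1-6)
    "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",  # histone H1t
    "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",  # HMG-14
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",  # TP1
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",  # protamine P1
]


def best_match(pattern, seq):
    """Step 2: window of seq with the fewest mismatches to pattern."""
    width = len(pattern)
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return min(windows, key=lambda w: sum(a != b for a, b in zip(w, pattern)))


def frequency_table(matches, pseudocount=1.0):
    """Step 3: position-dependent base frequencies, one dict per column."""
    table = []
    for column in zip(*matches):
        total = len(column) + 4 * pseudocount
        table.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return table


def relative_score(matches, table, background=0.25):
    """Step 4: summed log2(frequency / background) over all matches."""
    return sum(math.log2(col[base] / background)
               for match in matches
               for col, base in zip(table, match))


def discover(sequences, width=12):
    """Steps 1 to 6: try each window of the first sequence as the candidate."""
    best = None
    first = sequences[0]
    for start in range(len(first) - width + 1):                  # Step 1
        candidate = first[start:start + width]
        matches = [best_match(candidate, s) for s in sequences]  # Step 2
        table = frequency_table(matches)                         # Step 3
        score = relative_score(matches, table)                   # Step 4
        if best is None or score > best[0]:                      # Step 5
            best = (score, candidate, matches)
    return best                                                  # Step 6 is the loop


if __name__ == "__main__":
    score, candidate, matches = discover(SEQUENCES)
    print(round(score, 2), candidate)
    for match in matches:
        print(match)
```

Because no training set is supplied, the frequency table is built from the data itself, which is why the approach is described as a PSSM in reverse.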
Searching for conserved motifs
Pattern matching: quick and easy; unforgiving (one mismatch is death); ignores lots of information
Position-specific scoring matrices (PSSMs): needs training set
Meme, Gibbs sampler, et al. (PSSM in reverse): relatively unbiased; can't easily handle variable-length gaps
Moral of the Stories
Are you comfortable using programming in the service of your research?
I have absolutely no experience in computer programming
I have minimal knowledge...
I hope to gain more experience with programs used in bioinformatics.
I can program in python
I have a well defined background in programming
I do not have any previous experience with computer programming
Yes
Please briefly describe the nature of your research
What more do you hope to gain before the semester ends?
Most classes are a lecture followed by a short instruction on how to do the assignment. This had not provided enough time for me to appreciate the programs being used.
I want more hands-on time with the computer.
Click MICR 653 Using Firefox
Scientific Questions

I. What determines the beginning of a gene?

II. Where in a bacterial genome are viruses integrated?
    HIV

III. Determination of short tandem repeats (STRs)

IV. Analysis of gene expression data
    RNAseq
    Measuring RNA through Microarrays (Courtesy of Inst. für Hormon- und Fortpflanzungsforschung, Universität Hamburg)
      Spot: RNA from cell type #1 + RNA from cell type #2
      Scan for red fluorescence; scan for green fluorescence; combine images
      Legend: Type #1 RNA > Type #2 RNA; Type #2 RNA > Type #1 RNA; Type #1 RNA Type #2 RNA
    Difference in intensity chip to chip: different conditions or different replicates

V. CRISPRs in enteric bacteria
    GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACAATAACTGGATAGCACTAGCAGAAGGGCTAGAAGGTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAACGTAT

VI. Finding targets for DNA-binding proteins
    Figure labels: cI, cro, OR3, OR2, OR1, PRM
    CTTTTTTGTGCTCATACGTTAAATCTATCACCGCAAGGGATAAATATCTAACACCGTGCGTGTTGACTATTTTACCTCTGGCGGTGATAATGGTTGCATGTACTAAGGAGGTTGTATGGAACAACGCATA
    GAAAAAACACGAGTATGCAATTTAGATAGTGGCGTTCCCTATTTATAGATTGTGGCACGCACAACTGATAAAATGGAGACCGCCACTATTACCAACGTACATGATTCCTCCAACATACCTTGTTGCGTAT
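For question V, the repeat unit of a CRISPR array can be recovered directly from the fragment shown for that question. The following is a brute-force Python sketch, not a production CRISPR finder: it assumes exact, non-overlapping copies of the repeat and an arbitrary minimum length of 20, and the function name `longest_repeat` is hypothetical.

```python
# CRISPR fragment copied from the slide for question V.
CRISPR_FRAGMENT = (
    "GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAAC"
    "AATAACTGGATAGCACTAGCAGAAGGGCTAGAAG"
    "GTTTCAATCCCTGATAGGGATTTTAGAGGGTTTTAAC"
    "GTAT"
)


def longest_repeat(seq, min_len=20):
    """Return (unit, first_pos, second_pos) for the longest substring that
    occurs at least twice without overlapping, or None if there is none."""
    for length in range(len(seq) // 2, min_len - 1, -1):
        first_seen = {}
        for i in range(len(seq) - length + 1):
            sub = seq[i:i + length]
            if sub in first_seen and i - first_seen[sub] >= length:
                return sub, first_seen[sub], i
            first_seen.setdefault(sub, i)
    return None


if __name__ == "__main__":
    unit, first, second = longest_repeat(CRISPR_FRAGMENT)
    spacer = CRISPR_FRAGMENT[first + len(unit):second]
    print("repeat:", unit)
    print("spacer:", spacer)
```

The text between the two copies of the repeat is the spacer, the part of a CRISPR array that differs from one unit to the next.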