Billions and Billions of Bases How does a biologist maintain a grip on reality?


1

2 Billions and Billions of Bases How does a biologist maintain a grip on reality?

3 46 chromosomes ~3 billion nucleotides The Human Genome Project One millionth of total

4 The Human Genome Project TGAGACACATATTTTTGATATTCCAGTTGTTGCAATC GAATGTAAAACATATTTAGATCTTTAAATGTATGGTAC ATTCAAGATCCAACCTTCATTCTAGTGTTTAAAGAGAAC TGATTTGTTTGCAGGGGCAGGAGGCTTTGGTTTAGGTTTTG AAATGGCAGGCTTCTCTGTACCTTTATCTGTTGAAATTGATAC C TGGGCTTGTGATACACTACGCTACAACCGCCCTGATTCAACAGT TA TTCAAAATGATATCGGTAACTTTAGTACAGAAAATGACGTTAAGA ATA TCTGCAACTTTAAACCTGATATTATTATTGGCGGGCCTCCATGCCAG GGATT TAGTATTGCTGGGCCAGCCCAAAAAGATCCTAAAGATCCTAGAAATG GTTTATT CATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAAAGCGTTTGTC ATGGAAAA CGTAAAAGGATTGCTATCAAGGAAAAATGCAGAAGGTTTTAAAGTTATAG ATATTATTAAG AAAACATTTGAAGAACTTGGTTATTTTGTCGAAGTATGGGTTTTAAATGCTG CGGAATATGGCATT CCGCAAATTAGAGAACGTATTTTTATTGTTGGCAATAAAAAAGGTAAAGTACT AGGTATTCCTAAAAAAA CACATTCTCTGCAATTTTTAAATTTAAATAGGTCTCAATTATCGATCTTCGATGAT ATGAGTATTATACCTGCACTAA CTTTGTGGGACGCAATATCAGACTTACCAGAACTTAATGCGCGTGAAGGAAGTGAA GAGCAACCCTATCATTTAAAACCTC AAAATACTTATCAGACTTGGGCTAGAAATGGTAGTGCTACGCTTTACAATCATGTTGCAAT GGAACATTCTGACCGTTTAGTAGAACG TTTCCGGCATATAAAATGGGGTGAATCCAGTTCGGATGTATCTAAAGAACATGGAGCTAGACGACGT AGTGGTAATGGTGAATTATCAAACAAATCA TATGATCAGAATAATCGCCGTTTAAATCCTCATAAACCGTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTG TCCATCCTTTTCAACATCGAAATTTAACAGCCCGT GAAGGAGCTAGAATCCAATCTTTTCCAGATAACTATAGATTTTTTGGAAAAAAAACTGTCGTATCTCATAAACTATTGCATCGA GAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATC AAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATCATCTTCTAGAGAAATTAGAGTTATGCCAACAACTGATAGAAATCCTCTA GTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAA TACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAGAACTGAATATGACAAATGGCATAAAGCAAATATGAACCTGGTTGGACCAAAATCAGAAATTACTGACCA AGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTT TTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGACGACTAGACGACGTAGCATAATACGAGTCATAACGGCATATATG GCAGCCTCACTCATTTCTGGGAGACGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCC 
AAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATCAGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCAGATGACTTCAACTTTTTCCAGTAATTCTGGACGCTCTTCTAACAGTTCCATCAAAGTATA GGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAA AAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAC AACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTA GAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGAT CAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT

5 The Human Genome Project AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA CCCAAGCTTTGATTCTTTCCT CTGTTAAATCGGATTGATTAT CTTCATCAAGGGCAAGACCT ACAAATTTACCATCACGAAC AGCTTTAGACTCACTGAATT CATAACCTTCTGTAGGCCAA TAGCCAACTGTTTCACCACC ATTTTCTGAAATTTTTTCCTCT AGAATACCGAGGGCATCTTG AAATGTATCAGGATAACCAA CCTGGTCTCCAGGAGCAAAA TAAGCAACTTTTTTGCCGATG AAGTCAATGTTATCTAACTC ATCATAAAAATTTTCCCAAT CACTTTGCAATTCTCCAACAT TCCAGGTAGGACAACCAAC AACGATATAATCGTAGTTAT TGAAATCACTTGGTTCAGCTT GTGAAATATCATATAAAGTT ACAACACTATCACCACCAAA CTCCTTCTGAATTATTTCTGA TTCAGTTTGGGTATTGCCTGT TTGAGTACCAAAAAATAAAC CAATATTAGACATTTTTACTC CTTTTATGTATTTGCAAAATT ATTTCAATTAAAATATTTAGT AATAATTAATTGTTAGCTAG CTAATAATTAAATTTTTATTA CAATCATTGTAAAAGGCATT GAAAAAGTAAATAAAAATT TTTATTCTACGTTATTTCAAA AATATTTACTTACATATACTT AACCTTTATAGTGATGTAAT ATACTCTAATTCCTATTTTAC TTATAAATACCATCTCAGCTT AATGTAACGAATTTTTCTGTT TATCTTTAAATACAAAAAAT TCAACAAAACTACAGAAAA TTAATCTTAATAACACAAAA CAAGTATCAATCTGTAATAC AACTAAGCTTAAATAAATTA ATAGAAAGCTTCATCTATCT AATAGGTTGAGAATAGTTTA TGTCTAATGACATAAATTCA TTCGTGTTGATTTCATTTGGG TATATTCATCTGATTTAGGAT TTACTCCATTAAGTTTGTACT CATCAATGCCCGCCTGTTGG TATCCACAATTCTCATACAG TGCGCGAGCAAAGTAATCA ATCGTTCGTCGCCATATCTA ACTTTGAGTCAAACAAACCA GTTGGATTACCAACCCTCAA CTAATCGCTTCTTTAAGGCG AGCGATCGCACATTTAACTG TTGGTTGTCACAAGAGAACT AATACTACAGCAGTATATTT AACAACTAAGGGTGGTTCAA CTTTCGCTGCGACTCCTCCAA CGCGCTGAAATACACAGGA CTGATGCGATCGCAAACTCT TTGACTAAATTCCATACATT ATCATGACCATCTCCCAAAC AAACAAGTGGGTTAACCAG ATGCTGACTATTAACATCCC CTGAGTTCGGAGTTGTAGGT CTATTTGACTGGTTCAAAGC GATGATGGAACGGCTTTGTT GCATGAATTAAAAAAAGAC ACACCATCACCTACTTCTAG GATAGACACATCAAACGTCC CACCGCCTAAGTCAAATACC AAGATAATTTCGTTAGTTTTC TTGTCAAGTCCGTAAGCGAG GGCCGCCGCCGTGGGCTAGT TGATAATTCGCAGAACTTTA ATCCCGGCAATTCTACTGGC ATCTTTGGTAGCCTGCCGTTG AGAGTCATTGAAATAGGCAG GGGTGGTAATTACCGCTTGC CTCACTGGTTCCCCCAGATA TGTGCTGGCATCATCTATCA GCTTGCGGACTACCTCATAC CATTTCACGAAAAACCTGAT ACACATGTAAACTCTGAAAC CCTTGCTGTATCAAAGTTTTG TAATTACGAATTACGAATTA CGAATTGATATCAGCCGAGA TTTCTTCGGGTGAAAATTCCT TGTTCAGAGCGGGACAGTGT AGCTTGACATTGCCATTACT 
GTCACGTACCACTTTGTAAG TAACTTGTTTTGCCTCTTGCG TAACTTCATCATACCTGCGC CCGATGAACCGCTTCACAGA ATAAAAAGTGTTTTCTGGGT TCATTACACCCTGGCGCTT

6 The Human Genome Project

7

8

9

10 A Walk in the Forest * Photo courtesy of www.webshots.com

11 Observation * Photos courtesy of www.webshots.com and Peter Smallwood

12 Observation * Photos courtesy of www.webshots.com and Peter Smallwood

13 Observation * Photos courtesy of www.webshots.com and Peter Smallwood

14 Observation * Photos courtesy of www.webshots.com and Peter Smallwood

15 Experiment * Photos courtesy of www.webshots.com and Peter Smallwood

16 Filters: Information reducers Squirrel filter

17 Filters: Information reducers Molecule filter

18 Filters: Information reducers Sequence filter How organism is made How organism works TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT CTCCGTAAAC CTCTAAC...

19 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site

20 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Active site Cell interaction Metabolism, Architecture Genetic code Rules of folding

21 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Active site Gives us: Custom antibiotics Genetic code Rules of folding

22 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Gives us: Custom antibiotics Custom antibodies Custom enzymes New materials Genetic code Rules of folding Active site
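The genetic-code step shown on these slides can be sketched in a few lines of Python. This is only an illustration: the function names `translate` and `three_letter` are made up for this sketch, and it assumes the standard genetic code and a reading frame starting at the first base. It covers the "Genetic code" arrow only; the "Rules of folding" step has no comparably simple rule.

```python
# Standard genetic code, laid out in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One-letter to three-letter amino acid names ("*" marks a stop codon).
THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "*": "Stop",
}


def translate(dna):
    """Translate DNA codon by codon, starting at the first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def three_letter(protein):
    """Render a one-letter protein string in the slide's Met-Thr-... style."""
    return "-".join(THREE[aa] for aa in protein)


print(three_letter(translate("ATGACTTATGATCAACGCACAGGGCTA")))
# prints Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu
```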

23 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control Begin transcription End transcription Splice transcript Begin translation ATGACTTATGATCAACGCACAGGGCTA 3% ? TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

24 From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97% TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA ? Begin transcription End transcription Splice transcript Begin translation

25 From Sequence to Organism How does Nature do it? Natural filters/transformations Selective transcription Selective processing Translation Folding DNA Functional protein

26 From Sequence to Organism How does Nature do it? Natural filters/transformations DNA Functional protein Simulation of Nature Surrogate Processes From Sequence to Organism How can WE do it?

27 Simulation of Nature Utterance of W Shakespeare Utterance of George W Bush “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” ???

28 From Sequence to Organism How can WE do it? Surrogate Processes Utterance of W Shakespeare Utterance of George W Bush “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” Word frequency

29 From Sequence to Organism How can WE do it? Surrogate Processes Utterance of W Shakespeare Utterance of George W Bush “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” Word frequency, words/sentence…
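The two surrogate measures named on the slide, word frequency and words per sentence, can be computed without understanding the text at all. That is the point of a surrogate process. The sketch below is a minimal illustration; the function names and the simple regular-expression tokenizer are assumptions of this example, not part of the presentation.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def word_frequencies(text):
    """Return each word's share of all words in the text."""
    words = WORD.findall(text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def words_per_sentence(text):
    """Return the mean number of words per sentence (split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(WORD.findall(s.lower())) for s in sentences) / len(sentences)


sample = "We must give our military every tool and weapon it needs to prevail."
print(words_per_sentence(sample))
print(sorted(word_frequencies(sample).items(), key=lambda kv: -kv[1])[:3])
```

Comparing such profiles between two authors says nothing about how either author thinks, yet it can still tell their texts apart. Gene finders exploit sequence statistics in the same way.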

30 From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC Characteristics of coding sequences/introns Gene finders Predicted coding regions My sequence

31 From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters Gene finders Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Function?

32 From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters Gene finders Similarity finders My predicted gene Sequence/motif databases globin globin? Similar genes

33 Surrogate Filters Gene finders Start/Stop codon search CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)

34 CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG Surrogate Filters Gene finders Start/Stop codon search Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA) Highly inaccurate
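A start/stop codon search of the kind described on these two slides can be sketched as follows. The function names, the `min_codons` parameter, and the choice to report only the first start codon before each stop are assumptions of this sketch. The start and stop codon sets are the ones listed on the slide.

```python
STARTS = {"ATG", "GTG", "TTG"}   # start codons listed on the slide
STOPS = {"TAA", "TAG", "TGA"}    # stop codons listed on the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Scan the three forward frames; return (frame, start, end) tuples.

    start is the index of the first base of the start codon and end is
    the index just past the stop codon. min_codons counts the codons
    before the stop, including the start codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def find_orfs_both_strands(seq, min_codons=1):
    """Run the search on the given strand ('+') and its reverse complement ('-')."""
    return {
        "+": find_orfs(seq, min_codons),
        "-": find_orfs(reverse_complement(seq), min_codons),
    }


slide_sequence = (
    "CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATG"
    "TACAAATGGAAATATGCAA"
)
print(find_orfs_both_strands(slide_sequence))
```

Because start and stop codons occur by chance in any sequence, a search like this reports many stretches that are not genes. That is consistent with the slide's verdict of "Highly inaccurate".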

35 Surrogate Filters Gene finders Hidden Markov Model (HMM)-based recognition Step 1: Create model through extensive training set AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

36 Step 1: Create model through extensive training set AAAA: 33% AAAC: 25% AAAG: 12% AAAT: 30% Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

37 Step 1: Create model through extensive training set AACA: 30% AACC: 20% AACG: 15% AACT: 35% AAA AAC AAG AAT ACA... TTG TTT Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
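Step 1 amounts to counting, for every three-base context in the training set, how often each base follows it, then turning the counts into percentages. The sketch below shows that counting step only. The function name `train_markov` is hypothetical, no pseudocounts are added, and it builds a plain 3rd-order Markov chain rather than a full hidden Markov model with hidden states.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order=3):
    """Estimate P(next base | preceding `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1

    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        # Bases never seen after this context get probability 0.0.
        model[context] = {base: followers[base] / total for base in "ACGT"}
    return model


# Toy training set (not the sequences from the slide).
model = train_markov(["AAAAAAC", "AAAGAAAT"])
print(model["AAA"])
```

With a realistic training set of known genes, the row for `AAA` would play the role of the slide's "AAAA: 33% AAAC: 25% AAAG: 12% AAAT: 30%".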

38 Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes (3rd-order Markov model)
(rows: preceding three bases; columns: next base)
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
...
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene: AAAGCAA… 0.12

39 Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes (3rd-order Markov model)
(rows: preceding three bases; columns: next base)
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
...
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene: AAAGCAA… 0.12 x 0.15

40 Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes (3rd-order Markov model)
(rows: preceding three bases; columns: next base)
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
...
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene: AAAGCTA… 0.12 x 0.15... So far, not a good candidate!
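The running product on these slides (0.12 for G after AAA, then 0.15 for C after AAG) can be reproduced with a short sketch. Only the two table rows needed for the first five bases are copied from the slide. The function names are hypothetical. Comparing against a uniform background of 0.25 per base is an assumption of this sketch, one plausible reading of what makes a candidate "not good"; the slides do not say what the product is compared with.

```python
import math

# Two rows copied from the slide's table: P(next base | preceding three bases).
SLIDE_TABLE = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}


def transition_probs(seq, model, order=3):
    """Look up the probability of each base given the `order` bases before it."""
    return [model[seq[i - order:i]][seq[i]] for i in range(order, len(seq))]


def sequence_probability(seq, model, order=3):
    """Multiply the per-base probabilities, as on the slide."""
    return math.prod(transition_probs(seq, model, order))


def log_odds_vs_uniform(seq, model, order=3):
    """Log-odds against a uniform background (assumed here: 0.25 per base).

    Raises ValueError if any transition has probability zero.
    """
    return sum(math.log(p / 0.25) for p in transition_probs(seq, model, order))


print(sequence_probability("AAAGC", SLIDE_TABLE))  # 0.12 x 0.15
print(log_odds_vs_uniform("AAAGC", SLIDE_TABLE))   # negative: below background
```

A practical scorer would sum logarithms over the whole candidate rather than multiply raw probabilities, since a product over hundreds of bases underflows quickly.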

41 Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model Candidate genes Predicted genes

42 Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model Candidate genes Predicted genes Conform to standard model Challenge accepted beliefs

43 Computers are powerful globin Highly filtered output Easy to grasp High-level insights Unfiltered output Confusing Basic insights

44 Computers are tempting

45 Globin Computers are tempting

46 Crisis in Bioinformatics 1. Need high-level filters 2. Need access to raw phenomena 3. Need new tools for new phenomena 4. Need intuitive representation of results Need a new generation 5. Need ability to build new tools

47 View of the Future

48 View of the Future Integration of information ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Cell interaction Metabolism, Architecture Genetic code Rules of folding Active site

49 Prochlorococcus MED4 Prochlorococcus MIT9313

50 Gene present in Prochlorococcus MED4 MED4 is naturally adapted to grow in high light. How do cells control response to light? Ortholog absent in Prochlorococcus MIT9313 MIT9313 is naturally adapted to grow in low light Ortholog present in Synechocystis PCC 6803 Reason will become apparent in a moment Synechocystis PCC 6803 ortholog responds to high light Gene turns on by factor > 2 in response to high light What genes are related to the adaptation to high light? Look for:

51 Build set Display set Click on Build Set to begin finding orfs with the desired specifications HELP Set operation

52 All items in All open reading frames of All amino acid sequences of All intergenic regions of Human-annotated orfs of Private set Public set All open reading frames of Build set Display set Choose set type Goal is to find all open reading frames within Prochlorococcus MED4 that meet certain specifications, so click on All open reading frames in CancelHELPSet operation

53 All items in All open reading frames of Arthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Unicellular Filamentous All Prochlorococcus MED4 Build set Display set Choose set type Choose database Click on Prochlorococcus MED4 Cancel HELP Set operation

54 All items in All open reading frames of Prochlorococcus MED4 Display set such that: Variable Data Operation Function Done Choose set type Choose database Build set You will ask that an ortholog of each desired MED4 gene exists in Synechocystis PCC 6803. It is convenient to define the ortholog now. Click the Variable button. Cancel HELP Set operation

55 All items in All open reading frames ofProchlorococcus MED4 Display set such that: Variable Data Item New variable Variable Choose set typeChoose database New variable Build set Item refers to the MED4 orf under consideration. You want to define its ortholog in Synechocystis, so click on New variable OperationFunctionDone CancelHELPSet operation

56 All items in All open reading frames ofProchlorococcus MED4 Display set such that: Variable Data 6803 ortholog Type variable name = Choose set typeChoose database Build set You can name the variable representing the ortholog anything you like. For this simulation, a name is provided. Press the Enter key OperationFunctionDone CancelHELPSet operation

57 All items in All open reading frames ofProchlorococcus MED4 Display set such that: VariableData 6803 ortholog Type variable name = Closest ortholog of Protein product of Upstream region of Downstream region of Ortholog of (item Choose set typeChoose database Choose function Build set One variable can be defined with respect to another in several ways. The relationship you want is Ortholog of OperationFunctionDone CancelHELPSet operation

58 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData = Ortholog of (item in Arthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Choose database Synechocystis PCC6803 ) Choose function Build set Clicking on Synechocystis PCC6803 defines the variable 6803 ortholog as the ortholog in Synechocystis to a given orf of MED4. 6803 ortholog Type variable name OperationFunctionDone CancelHELPSet operation

59 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Synechocystis PCC 6803 Build set ) The first limitation on the MED4 orf is that no ortholog of it exists in MIT9313. To evoke the concept of ortholog, press the Function button = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name OperationFunctionDone CancelHELPSet operation

60 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set Click on Ortholog of Closest ortholog of Protein product of Upstream region of Downstream region of Ortholog of Choose function Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name OperationFunctionDone CancelHELPSet operation

61 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set As always, Item refers to the orf of MED4 that is being defined. You want to specify that an ortholog of it in MIT9313 doesn’t exist, so click on Item. Item 6803 ortholog Variable Item ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function OperationFunctionDone CancelHELPSet operation

62 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set Clicking on Prochlorococcus MIT9313 defines an ortholog of a MED4 gene in MIT9313 (if such an ortholog exists) Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Arthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Choose database ) Prochlorococcus MIT9313 OperationFunctionDone CancelHELPSet operation

63 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set You want to keep only those MED4 genes where an ortholog in MIT9313 does NOT exist, so click on doesn’t exist. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) =  exists doesn’t exist Op OperationFunctionDone CancelHELPSet operation

64 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set That completes one specification, but there are more. Click on the Operation button to connect one specification to the next. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op OperationFunctionDone CancelHELPSet operation

65 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set You want both the first specification AND the second to be true, so click on AND. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND OR AND Op OperationFunctionDone CancelHELPSet operation

66 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set The second specification is that microarray data for the 6803 ortholog meets a certain criterion. To get at that data, press the Data button Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ OperationFunctionDone CancelHELPSet operation

67 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set The data you want is for the 6803 ortholog. Click on 6803 ortholog. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Item 6803 ortholog New variable Variable 6803 ortholog in OperationFunctionDone CancelHELPSet operation

68 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set Choose the Hihara experiment, which measured expression changes upon shift from low light to high light. If you didn’t know which experiment was appropriate, you could have clicked on Choose data set for a description of the choices Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( 6803 ortholog Variable in Microarray:Hihara1(6803) Microarray:Suzuki1(6803) Microarray:Yoshimura1(6803) Microarray:Meeks(Npun) Microarray:Golden(7120) Choose data set Microarray:Hihara1(6803) ) OperationFunctionDone CancelHELPSet operation High light vs low light experiment

69 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set You want the ratio of experimental condition to control to exceed a specified value. Click on >. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) < < or = = > or = > > Op 6803 ortholog OperationFunctionDone CancelHELPSet operation

70 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set You can type in the value you want. For this simulation a number is supplied. Press the Enter key. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) > OpValue ] +2 6803 ortholog OperationFunctionDone CancelHELPSet operation

71 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set No more specifications. Press the Done button. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) > OpValue ] +2 6803 ortholog OperationFunctionDone CancelHELPSet operation

72 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set This was a complicated search. If you wanted to do it again, you could save the search description. In this case, just save the results by clicking on Save only results. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) > OpValue ] +2 6803 ortholog Save results and script Save only results Save only results OperationFunctionDone CancelHELPSet operation

73 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set All MED4 genes meeting the given specifications will be collected into a set. You can name the set anything you want. For this simulation, a name is provided. Press the Enter key. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) > OpValue ] +2 Light-specific genes Type name of set 6803 ortholog OperationFunctionDone CancelHELPSet operation

74 Build setDisplay set :all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus :all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus :all0688 hupS [NiFe] uptake hydrogenase small subunit :alr0692 similar to nifU :alr0874 nifH2 dinitrogenase reductase :asr1309 similar to nifU :alr1407 nifV1 homocitrate synthase :asr1408 nifZ iron-sulfur cofactor synthesis :asr1408 nifT Set: Light-specific genes ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus ProcMed4:all0688 hupS [NiFe] uptake hydrogenase small subunit ProcMed4:alr0692 similar to nifU ProcMed4:alr0874 psbBX dinitrogenase reductase ProcMed4:asr1309 similar to nifU ProcMed4:alr1407 psbY1 homocitrate synthase ProcMed4:asr1408 psbX iron-sulfur cofactor synthesis ProcMed4:asr1408 nifT The results are displayed as a list of orfs (Of course, the search capabilities do not now exist, and the results of the described search are unknown) Clicking on the name of any orf brings you to its page (see Scenarios 1 and 2). Clicking on circles next to the orf names allows you to modify the set. The genetic neighborhood of each orf is shown to the right. DoneHELPSet operation [WARNING: Fantasy filtration not in effect!]

75 Prochlorococcus MED4: pll1290 Replicon: Chromosome Coordinates: 1533026 (stop) <- 1533931 (start-TTG)Human Length = 301 amino acids Strand: Complementary Gene name(s): proXM Function: Putative type II DNA cytosine methyltransferase (CAGCTG-specific)Human Classification: Type II beta (N4)Human Activity: Protects against: PvuII Experiment In vivo activity: existsExperiment Cyanobacterial orthologs: none ProcMED4 Proteus vulgaris Salmonella paratyphi Streptomyces spectabilis OptionsAnnotate Main Menu History More A A A A A HELP [WARNING: Fantasy filtration not in effect!]

76 Build set. All items in: All open reading frames of Prochlorococcus MED4, such that: [Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist] AND [data for (6803 ortholog in Microarray:Hihara1(6803)) > +2], where the variable 6803 ortholog = Ortholog of (item in Synechocystis PCC 6803). This was a complicated search. If you wanted to do it again, you could save the search description. In this case, just save the results by clicking on Save only results. Save options: Save results and script, Save only results. Menu prompts: Choose set type, Choose database, Choose function, Choose data set, Type variable name. Buttons: Build set, Display set, Variable, Data, Operation, Function, Done, Cancel, HELP, Set operation.

77 Equivalent script that bypasses interface: FOR orf IN (orfs:ProcMED4) { 6803ortholog = Ortholog(orf,orfs:Syny6803); WHEN (NOT Exists(Ortholog(orf,orfs:Proc9313)) AND Data(6803ortholog,microarray:Hihara1) > +2) { COLLECT orf INTO light_specific_genes; } } DISPLAY (light_specific_genes, "BNC"); or MAIL (light_specific_genes,Rocap@Ocean.Washington.Edu,"BNC"); The same search could have been conducted through the script shown above. The script interface makes possible complex searches beyond the scope of the graphical interface.
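
The logic of that search can also be sketched in an ordinary programming language. The Python below is only an illustration: the gene identifiers, ortholog tables, and expression values are invented, and the function name light_specific_genes and the dictionary layout are assumptions rather than anything the knowledge base is known to provide.

```python
# Hypothetical data: none of these identifiers or values come from a real database.
# Each ortholog table maps a MED4 orf to its ortholog in another genome;
# a missing key means that no ortholog exists.
ortholog_in_6803 = {"med4_a": "syn_1", "med4_b": "syn_2", "med4_c": "syn_3"}
ortholog_in_9313 = {"med4_b": "mit_7"}
# Invented expression values for Synechocystis PCC 6803 genes in one microarray data set.
hihara1 = {"syn_1": 3.1, "syn_2": 4.0, "syn_3": 0.5}


def light_specific_genes(orfs, to_6803, to_9313, microarray, threshold=2.0):
    """Collect orfs with no MIT9313 ortholog whose 6803 ortholog exceeds the threshold."""
    collected = []
    for orf in orfs:
        if orf in to_9313:
            continue  # an MIT9313 ortholog exists, so the orf is excluded
        ortholog = to_6803.get(orf)
        if ortholog is None:
            continue  # no 6803 ortholog, so there is no microarray value to test
        value = microarray.get(ortholog)
        if value is not None and value > threshold:
            collected.append(orf)
    return collected


med4_orfs = ["med4_a", "med4_b", "med4_c", "med4_d"]
result = light_specific_genes(med4_orfs, ortholog_in_6803, ortholog_in_9313, hihara1)
print(result)
```

Treating a missing ortholog or a missing microarray value as "does not pass" is a design choice made for this sketch; the slides do not say how the script language would handle those cases.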

78 All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set Operation Function Done Cancel HELP Set operation HELP ???

79 Cyanobacterial Knowledge Base Virtual Help Desk How to search for data? How to build a new filter?

80 Cyanobacterial Knowledge Base Virtual Help Desk How to......I don’t know! Virtual Help Desk Staff HELP

81 Cyanobacterial Knowledge Base Virtual Help Desk Upper echelons Staff You Virtual Help Desk Staff HELP

82 Billions and Billions of Bases How does a biologist maintain a grip on sanity? reality?

83 View of the Future Interplay of low- & high-level perception ProcMED4 Proteus vulgaris Salmonella paratyphi Streptomyces spectabilis

84 View of the Future Interplay of low- & high-level perception Anab7120 Proteus vulgaris Salmonella paratyphi Streptomyces spectabilis TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97% TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

85 Anabaena Chromosome (6413771 bp): 4001 to 5000 cgcccaacaataacaaatgtgtaatctagaccttctgccttgagttcctt ggcgcggttttcggcacgacggatgacgttggtattgtaaccgccgcaca aaccacgatcgccagaaataactagcaagcctactgatttaacttcccgt tttttcagtagaggtaagtctacatcttcaaaccgtagacgagtttgcaa accgtataatacttgtgccaaacggtcagcaaaaggacgagtagcgatta cttgttcttgggcgcgacgtacacgcgccgccgctaccagccgcatggct tctgtgattttcttggtgtttttgaccgactgaatgcgatcgcgtattga tttgagattaggcataatatttgttgattgtcagttgtcagttgtcagtt gtcagttgtcagtgtctattgctactgaccactgaccaatgactaatgac taattacgctgtagctttgaaggtctttttgtagtcttctaaagctgcct tcaatgctttttcttcatcatcacccagtgctttcttcgattgtacgtct tggaagtaggggttaacgccggacttcaagtaatctctcaagcctttggt gaaggtggtgactttatcaacagggatatcatctaagtaaccgttgatac ctgcgtacagaatggctacttgttcagctacggatagaggctgattttgg gactgtttgaggagttcccgcaggcgttgacctcttgccaattggtcttg ggtggctttatctaggtcggaagcaaattgcgcgaaggcttggaggtcgt caaactgtgctagttcgagcttaatcttaccagcaacttttttcatcgct ttggtttgtgccgcagaacccacacgggatacagagataccagggtttac agccggacgaataccagcgttaaataagtcagaagataagaatatctgac cgtctgtaatagaaattacgttggtaggaatgtaggcagaaacgtcacca Typical output of current programs

86 Future: Sequence plus genetic context Noncoding region

87 Future: Both filtered and raw data

88

89 Filters: Information reducers Build filter to find repeated sequences TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT TGGTCTCCGACCGACCGTAGGTCATCG CTTGTACTGAGCGAAGTCGAAGTA CTTGTACTGAGCGTAGCCGAAGTA GTTCGACTGAGCGTAGTCGAAGTC... Repeat filter Entire genomeRepeated sequences
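
One simple way such a repeat filter could work is to record every fixed-length window of the genome and keep the windows that occur more than once. The slides do not say how the filter is implemented, so the exact-match approach, the function name find_repeats, and the toy genome below are all assumptions made for illustration; only the 24-base repeat itself is taken from the slide.

```python
from collections import defaultdict


def find_repeats(sequence, k):
    """Return {k-mer: [start positions]} for every k-mer that occurs more than once."""
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


# Toy input: one repeat shown on the slide, planted twice in invented filler sequence.
repeat = "CTTGTACTGAGCGAAGTCGAAGTA"
genome = "AAAA" + repeat + "CCCC" + repeat + "GGGG"
repeats = find_repeats(genome, len(repeat))
print(repeats[repeat])  # start positions of the planted repeat
```

Exact matching misses copies that differ by a base or two, which is why a second, mismatch-tolerant step is still needed to group related repeats into families.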

90

91 Filters: Information reducers Build repeats filter TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT TGGTCTCCGACCGACCGTAGGTCATCG CTTGTACTGAGCGAAGTCGAAGTA CTTGTACTGAGCGTAGCCGAAGTA GTTCGACTGAGCGTAGTCGAAGTC... Repeat filter Entire genomeRepeated sequences NIS-1: repeat family

92 Alignment of NIS-1 (…271 more)

93 Filters: Information reducers. Build secondary repeats filter. Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10). Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2). Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1). Distance between A and B = 2.

94 Filters: Information reducers. Build secondary repeats filter. Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10). Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2). Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1). Distance between A and B = 2. Distance between A and C = 5.

95 Filters: Information reducers. Build secondary repeats filter. Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10). Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2). Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1). Distance between A and B = 2. Distance between B and C = 5. Do for all pairs of subfamilies.
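
The "do for all pairs" step can be sketched in a few lines of Python. This assumes that distance means the number of mismatched positions between two equal-length sequences (the Hamming distance), which reproduces the distances given on the slides; the function name mismatches and the dictionary layout are illustrative choices.

```python
from itertools import combinations

# Subfamily sequences as given on the slides.
subfamilies = {
    "A": "CTTGTACTGAGCGAAGTCGAAGTA",
    "B": "CTTGTACTGAGCGTAGCCGAAGTA",
    "C": "GTTCGACTGAGCGTAGTCGAAGTC",
}


def mismatches(seq1, seq2):
    """Count positions at which two equal-length sequences differ (Hamming distance)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(base1 != base2 for base1, base2 in zip(seq1, seq2))


# Compare every unordered pair of subfamilies exactly once.
distances = {
    (name1, name2): mismatches(subfamilies[name1], subfamilies[name2])
    for name1, name2 in combinations(sorted(subfamilies), 2)
}
print(distances)  # {('A', 'B'): 2, ('A', 'C'): 5, ('B', 'C'): 5}
```

A position-by-position count only works while all family members have the same length; repeats that differ by insertions or deletions would need an alignment-based distance instead.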

96 Relationship between related repeats in genome (sequences within NIS-1 repeat family). Diameter: copies of exact repeats. Distance: number of mismatches.

97 Crisis in Bioinformatics 1. Need high-level filters 2. Need access to raw phenomena Integrated knowledge base

98 Crisis in Bioinformatics 1. Need high-level filters 2. Need access to raw phenomena 3. Need new tools for new phenomena 4. Need intuitive representation of results Integrated knowledge base Tools that bridge levels of perception

99 Crisis in Bioinformatics 1. Need high-level filters 2. Need access to raw phenomena (Integrated knowledge base) 3. Need new tools for new phenomena 4. Need intuitive representation of results (Tools that bridge levels of perception) 5. Need ability to build new tools (Short term: Graphical programming, Human help. Long term: Need a new generation)

100 Billions and Billions of Bases How does a biologist maintain a grip on reality? Filtering reality Raw reality Real questions with real answers

101 Pre-genomic Molecular Biology

102

103

104

105

106

107 How do we figure out how cars are made? Genetic approach. Biochemical approach.

108 Pre-genomic Molecular Biology Geneticist’s Approach

109

110 Isolation of Defective Gene Pre-genomic Molecular Biology Geneticist’s Approach

111 Pre-genomic Molecular Biology How do we figure out how cars are made? Genetic approach. Biochemical approach.

112 Pre-genomic Molecular Biology Biochemist’s Approach

113

114

115

116 Pre-genomic Molecular Biology How do we figure out how cars are made? Genetic approach. Biochemical approach.

117 Pre-genomic Molecular Biology: How we viewed the world. One component at a time. Highly filtered perception. Many local viewpoints.

118 Post-genomic Molecular Biology

119 Post-genomic Molecular Biology Bioinformaticist’s Approach (long term) Assemble the whole

120 Post-genomic Molecular Biology Bioinformaticist’s Approach (short term) Identify critical parts

121 Globin Current Biology

122 AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA CCCAAGCTTTGATTCTTTCCT CTGTTAAATCGGATTGATTAT CTTCATCAAGGGCAAGACCT ACAAATTTACCATCACGAAC AGCTTTAGACTCACTGAATT CATAACCTTCTGTAGGCCAA TAGCCAACTGTTTCACCACC ATTTTCTGAAATTTTTTCCTCT AGAATACCGAGGGCATCTTG AAATGTATCAGGATAACCAA CCTGGTCTCCAGGAGCAAAA TAAGCAACTTTTTTGCCGATG AAGTCAATGTTATCTAACTC ATCATAAAAATTTTCCCAAT CACTTTGCAATTCTCCAACAT TCCAGGTAGGACAACCAAC AACGATATAATCGTAGTTAT TGAAATCACTTGGTTCAGCTT GTGAAATATCATATAAAGTT ACAACACTATCACCACCAAA CTCCTTCTGAATTATTTCTGA TTCAGTTTGGGTATTGCCTGT TTGAGTACCAAAAAATAAAC CAATATTAGACATTTTTACTC CTTTTATGTATTTGCAAAATT ATTTCAATTAAAATATTTAGT AATAATTAATTGTTAGCTAG CTAATAATTAAATTTTTATTA CAATCATTGTAAAAGGCATT GAAAAAGTAAATAAAAATT TTTATTCTACGTTATTTCAAA AATATTTACTTACATATACTT AACCTTTATAGTGATGTAAT ATACTCTAATTCCTATTTTAC TTATAAATACCATCTCAGCTT AATGTAACGAATTTTTCTGTT TATCTTTAAATACAAAAAAT TCAACAAAACTACAGAAAA TTAATCTTAATAACACAAAA CAAGTATCAATCTGTAATAC AACTAAGCTTAAATAAATTA ATAGAAAGCTTCATCTATCT AATAGGTTGAGAATAGTTTA TGTCTAATGACATAAATTCA TTCGTGTTGATTTCATTTGGG TATATTCATCTGATTTAGGAT TTACTCCATTAAGTTTGTACT CATCAATGCCCGCCTGTTGG TATCCACAATTCTCATACAG TGCGCGAGCAAAGTAATCA ATCGTTCGTCGCCATATCTA ACTTTGAGTCAAACAAACCA GTTGGATTACCAACCCTCAA CTAATCGCTTCTTTAAGGCG AGCGATCGCACATTTAACTG TTGGTTGTCACAAGAGAACT AATACTACAGCAGTATATTT AACAACTAAGGGTGGTTCAA CTTTCGCTGCGACTCCTCCAA CGCGCTGAAATACACAGGA CTGATGCGATCGCAAACTCT TTGACTAAATTCCATACATT ATCATGACCATCTCCCAAAC AAACAAGTGGGTTAACCAG ATGCTGACTATTAACATCCC CTGAGTTCGGAGTTGTAGGT CTATTTGACTGGTTCAAAGC GATGATGGAACGGCTTTGTT GCATGAATTAAAAAAAGAC ACACCATCACCTACTTCTAG GATAGACACATCAAACGTCC CACCGCCTAAGTCAAATACC AAGATAATTTCGTTAGTTTTC TTGTCAAGTCCGTAAGCGAG GGCCGCCGCCGTGGGCTAGT TGATAATTCGCAGAACTTTA ATCCCGGCAATTCTACTGGC ATCTTTGGTAGCCTGCCGTTG AGAGTCATTGAAATAGGCAG GGGTGGTAATTACCGCTTGC CTCACTGGTTCCCCCAGATA TGTGCTGGCATCATCTATCA GCTTGCGGACTACCTCATAC CATTTCACGAAAAACCTGAT ACACATGTAAACTCTGAAAC CCTTGCTGTATCAAAGTTTTG TAATTACGAATTACGAATTA CGAATTGATATCAGCCGAGA TTTCTTCGGGTGAAAATTCCT TGTTCAGAGCGGGACAGTGT AGCTTGACATTGCCATTACT GTCACGTACCACTTTGTAAG 
TAACTTGTTTTGCCTCTTGCG TAACTTCATCATACCTGCGC CCGATGAACCGCTTCACAGA ATAAAAAGTGTTTTCTGGGT TCATTACACCCTGGCGCTT Future Biology

123 AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA CCCAAGCTTTGATTCTTTCCT CTGTTAAATCGGATTGATTAT CTTCATCAAGGGCAAGACCT ACAAATTTACCATCACGAAC AGCTTTAGACTCACTGAATT CATAACCTTCTGTAGGCCAA TAGCCAACTGTTTCACCACC ATTTTCTGAAATTTTTTCCTCT AGAATACCGAGGGCATCTTG AAATGTATCAGGATAACCAA CCTGGTCTCCAGGAGCAAAA TAAGCAACTTTTTTGCCGATG AAGTCAATGTTATCTAACTC ATCATAAAAATTTTCCCAAT CACTTTGCAATTCTCCAACAT TCCAGGTAGGACAACCAAC AACGATATAATCGTAGTTAT TGAAATCACTTGGTTCAGCTT GTGAAATATCATATAAAGTT ACAACACTATCACCACCAAA CTCCTTCTGAATTATTTCTGA TTCAGTTTGGGTATTGCCTGT TTGAGTACCAAAAAATAAAC CAATATTAGACATTTTTACTC CTTTTATGTATTTGCAAAATT ATTTCAATTAAAATATTTAGT AATAATTAATTGTTAGCTAG CTAATAATTAAATTTTTATTA CAATCATTGTAAAAGGCATT GAAAAAGTAAATAAAAATT TTTATTCTACGTTATTTCAAA AATATTTACTTACATATACTT AACCTTTATAGTGATGTAAT ATACTCTAATTCCTATTTTAC TTATAAATACCATCTCAGCTT AATGTAACGAATTTTTCTGTT TATCTTTAAATACAAAAAAT TCAACAAAACTACAGAAAA TTAATCTTAATAACACAAAA CAAGTATCAATCTGTAATAC AACTAAGCTTAAATAAATTA ATAGAAAGCTTCATCTATCT AATAGGTTGAGAATAGTTTA TGTCTAATGACATAAATTCA TTCGTGTTGATTTCATTTGGG TATATTCATCTGATTTAGGAT TTACTCCATTAAGTTTGTACT CATCAATGCCCGCCTGTTGG TATCCACAATTCTCATACAG TGCGCGAGCAAAGTAATCA ATCGTTCGTCGCCATATCTA ACTTTGAGTCAAACAAACCA GTTGGATTACCAACCCTCAA CTAATCGCTTCTTTAAGGCG AGCGATCGCACATTTAACTG TTGGTTGTCACAAGAGAACT AATACTACAGCAGTATATTT AACAACTAAGGGTGGTTCAA CTTTCGCTGCGACTCCTCCAA CGCGCTGAAATACACAGGA CTGATGCGATCGCAAACTCT TTGACTAAATTCCATACATT ATCATGACCATCTCCCAAAC AAACAAGTGGGTTAACCAG ATGCTGACTATTAACATCCC CTGAGTTCGGAGTTGTAGGT CTATTTGACTGGTTCAAAGC GATGATGGAACGGCTTTGTT GCATGAATTAAAAAAAGAC ACACCATCACCTACTTCTAG GATAGACACATCAAACGTCC CACCGCCTAAGTCAAATACC AAGATAATTTCGTTAGTTTTC TTGTCAAGTCCGTAAGCGAG GGCCGCCGCCGTGGGCTAGT TGATAATTCGCAGAACTTTA ATCCCGGCAATTCTACTGGC ATCTTTGGTAGCCTGCCGTTG AGAGTCATTGAAATAGGCAG GGGTGGTAATTACCGCTTGC CTCACTGGTTCCCCCAGATA TGTGCTGGCATCATCTATCA GCTTGCGGACTACCTCATAC CATTTCACGAAAAACCTGAT ACACATGTAAACTCTGAAAC CCTTGCTGTATCAAAGTTTTG TAATTACGAATTACGAATTA CGAATTGATATCAGCCGAGA TTTCTTCGGGTGAAAATTCCT TGTTCAGAGCGGGACAGTGT AGCTTGACATTGCCATTACT GTCACGTACCACTTTGTAAG 
TAACTTGTTTTGCCTCTTGCG TAACTTCATCATACCTGCGC CCGATGAACCGCTTCACAGA ATAAAAAGTGTTTTCTGGGT TCATTACACCCTGGCGCTT Future Biology

124 Globin TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT Current Biology Current Life

125 “Axis of Evil...” Current Life

126 “No war for oil...” Globin Current Life

127 “No war for oil...” Globin Current Life

128

129

130 Contact Information Jeff Elhai Department of Biology Virginia Commonwealth University Richmond, VA E-Mail: ElhaiJ@VCU.Edu Tel: 804-828-0794 Web: www.people.vcu.edu/~elhaij/

