Billions and Billions of Bases How does a biologist maintain a grip on reality?

The Human Genome Project: 46 chromosomes, ~3 billion nucleotides (one millionth of total)

The Human Genome Project TGAGACACATATTTTTGATATTCCAGTTGTTGCAATC GAATGTAAAACATATTTAGATCTTTAAATGTATGGTAC ATTCAAGATCCAACCTTCATTCTAGTGTTTAAAGAGAAC TGATTTGTTTGCAGGGGCAGGAGGCTTTGGTTTAGGTTTTG AAATGGCAGGCTTCTCTGTACCTTTATCTGTTGAAATTGATAC C TGGGCTTGTGATACACTACGCTACAACCGCCCTGATTCAACAGT TA TTCAAAATGATATCGGTAACTTTAGTACAGAAAATGACGTTAAGA ATA TCTGCAACTTTAAACCTGATATTATTATTGGCGGGCCTCCATGCCAG GGATT TAGTATTGCTGGGCCAGCCCAAAAAGATCCTAAAGATCCTAGAAATG GTTTATT CATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAAAGCGTTTGTC ATGGAAAA CGTAAAAGGATTGCTATCAAGGAAAAATGCAGAAGGTTTTAAAGTTATAG ATATTATTAAG AAAACATTTGAAGAACTTGGTTATTTTGTCGAAGTATGGGTTTTAAATGCTG CGGAATATGGCATT CCGCAAATTAGAGAACGTATTTTTATTGTTGGCAATAAAAAAGGTAAAGTACT AGGTATTCCTAAAAAAA CACATTCTCTGCAATTTTTAAATTTAAATAGGTCTCAATTATCGATCTTCGATGAT ATGAGTATTATACCTGCACTAA CTTTGTGGGACGCAATATCAGACTTACCAGAACTTAATGCGCGTGAAGGAAGTGAA GAGCAACCCTATCATTTAAAACCTC AAAATACTTATCAGACTTGGGCTAGAAATGGTAGTGCTACGCTTTACAATCATGTTGCAAT GGAACATTCTGACCGTTTAGTAGAACG TTTCCGGCATATAAAATGGGGTGAATCCAGTTCGGATGTATCTAAAGAACATGGAGCTAGACGACGT AGTGGTAATGGTGAATTATCAAACAAATCA TATGATCAGAATAATCGCCGTTTAAATCCTCATAAACCGTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTG TCCATCCTTTTCAACATCGAAATTTAACAGCCCGT GAAGGAGCTAGAATCCAATCTTTTCCAGATAACTATAGATTTTTTGGAAAAAAAACTGTCGTATCTCATAAACTATTGCATCGA GAAGAAAGATTTGATGAAAAATTTCTTTGTCAATATAATC AAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATCATCTTCTAGAGAAATTAGAGTTATGCCAACAACTGATAGAAATCCTCTA GTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAA TACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCAGAACTGAATATGACAAATGGCATAAAGCAAATATGAACCTGGTTGGACCAAAATCAGAAATTACTGACCA AGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTT TTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGACGACTAGACGACGTAGCATAATACGAGTCATAACGGCATATATG GCAGCCTCACTCATTTCTGGGAGACGCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCC 
AAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATCAGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCAGATGACTTCAACTTTTTCCAGTAATTCTGGACGCTCTTCTAACAGTTCCATCAAAGTATA GGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAA AAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAC AACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTA GAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGAT CAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT

The Human Genome Project AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA CCCAAGCTTTGATTCTTTCCT CTGTTAAATCGGATTGATTAT CTTCATCAAGGGCAAGACCT ACAAATTTACCATCACGAAC AGCTTTAGACTCACTGAATT CATAACCTTCTGTAGGCCAA TAGCCAACTGTTTCACCACC ATTTTCTGAAATTTTTTCCTCT AGAATACCGAGGGCATCTTG AAATGTATCAGGATAACCAA CCTGGTCTCCAGGAGCAAAA TAAGCAACTTTTTTGCCGATG AAGTCAATGTTATCTAACTC ATCATAAAAATTTTCCCAAT CACTTTGCAATTCTCCAACAT TCCAGGTAGGACAACCAAC AACGATATAATCGTAGTTAT TGAAATCACTTGGTTCAGCTT GTGAAATATCATATAAAGTT ACAACACTATCACCACCAAA CTCCTTCTGAATTATTTCTGA TTCAGTTTGGGTATTGCCTGT TTGAGTACCAAAAAATAAAC CAATATTAGACATTTTTACTC CTTTTATGTATTTGCAAAATT ATTTCAATTAAAATATTTAGT AATAATTAATTGTTAGCTAG CTAATAATTAAATTTTTATTA CAATCATTGTAAAAGGCATT GAAAAAGTAAATAAAAATT TTTATTCTACGTTATTTCAAA AATATTTACTTACATATACTT AACCTTTATAGTGATGTAAT ATACTCTAATTCCTATTTTAC TTATAAATACCATCTCAGCTT AATGTAACGAATTTTTCTGTT TATCTTTAAATACAAAAAAT TCAACAAAACTACAGAAAA TTAATCTTAATAACACAAAA CAAGTATCAATCTGTAATAC AACTAAGCTTAAATAAATTA ATAGAAAGCTTCATCTATCT AATAGGTTGAGAATAGTTTA TGTCTAATGACATAAATTCA TTCGTGTTGATTTCATTTGGG TATATTCATCTGATTTAGGAT TTACTCCATTAAGTTTGTACT CATCAATGCCCGCCTGTTGG TATCCACAATTCTCATACAG TGCGCGAGCAAAGTAATCA ATCGTTCGTCGCCATATCTA ACTTTGAGTCAAACAAACCA GTTGGATTACCAACCCTCAA CTAATCGCTTCTTTAAGGCG AGCGATCGCACATTTAACTG TTGGTTGTCACAAGAGAACT AATACTACAGCAGTATATTT AACAACTAAGGGTGGTTCAA CTTTCGCTGCGACTCCTCCAA CGCGCTGAAATACACAGGA CTGATGCGATCGCAAACTCT TTGACTAAATTCCATACATT ATCATGACCATCTCCCAAAC AAACAAGTGGGTTAACCAG ATGCTGACTATTAACATCCC CTGAGTTCGGAGTTGTAGGT CTATTTGACTGGTTCAAAGC GATGATGGAACGGCTTTGTT GCATGAATTAAAAAAAGAC ACACCATCACCTACTTCTAG GATAGACACATCAAACGTCC CACCGCCTAAGTCAAATACC AAGATAATTTCGTTAGTTTTC TTGTCAAGTCCGTAAGCGAG GGCCGCCGCCGTGGGCTAGT TGATAATTCGCAGAACTTTA ATCCCGGCAATTCTACTGGC ATCTTTGGTAGCCTGCCGTTG AGAGTCATTGAAATAGGCAG GGGTGGTAATTACCGCTTGC CTCACTGGTTCCCCCAGATA TGTGCTGGCATCATCTATCA GCTTGCGGACTACCTCATAC CATTTCACGAAAAACCTGAT ACACATGTAAACTCTGAAAC CCTTGCTGTATCAAAGTTTTG TAATTACGAATTACGAATTA CGAATTGATATCAGCCGAGA TTTCTTCGGGTGAAAATTCCT TGTTCAGAGCGGGACAGTGT AGCTTGACATTGCCATTACT 
GTCACGTACCACTTTGTAAG TAACTTGTTTTGCCTCTTGCG TAACTTCATCATACCTGCGC CCGATGAACCGCTTCACAGA ATAAAAAGTGTTTTCTGGGT TCATTACACCCTGGCGCTT

The Human Genome Project

A Walk in the Forest * Photo courtesy of

Observation * Photos courtesy of and Peter Smallwood

Observation * Photos courtesy of and Peter Smallwood

Observation * Photos courtesy of and Peter Smallwood

Observation * Photos courtesy of and Peter Smallwood

Experiment * Photos courtesy of and Peter Smallwood

Filters: Information reducers Squirrel filter

Filters: Information reducers Molecule filter

Filters: Information reducers Sequence filter How organism is made How organism works TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT CTCCGTAAAC CTCTAAC...

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site
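The "Genetic code" step on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the standard genetic code, reads the DNA coding strand directly in frame 0, and the names `CODON_TABLE`, `THREE_LETTER` and `translate` are made up for this sketch rather than taken from the presentation.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# One-letter to three-letter amino acid names, as used on the slide.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def translate(dna):
    """Translate a coding-strand DNA string, frame 0, stopping at a stop codon.

    Codons containing anything other than A, C, G, T raise KeyError.
    """
    dna = dna.upper()
    residues = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":  # stop codon
            break
        residues.append(THREE_LETTER[amino])
    return "-".join(residues)


print(translate("ATGACTTATGATCAACGCACAGGGCTA"))
```

The table lookup is the easy part; the slide's point is that the later steps (rules of folding, active site) have no comparably simple rule.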

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site Cell interaction Metabolism, Architecture

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Active site Gives us: Custom antibiotics Genetic code Rules of folding

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Gives us: Custom antibiotics Custom antibodies Custom enzymes New materials Genetic code Rules of folding Active site

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control Begin transcription End transcription Splice transcript Begin translation ATGACTTATGATCAACGCACAGGGCTA 3% ? TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

From Sequence to Organism How does Nature do it? ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of transcriptional and post-transcriptional control TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97% TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA ? Begin transcription End transcription Splice transcript Begin translation

From Sequence to Organism How does Nature do it? Natural filters/transformations Selective transcription Selective processing Translation Folding DNA Functional protein

From Sequence to Organism How does Nature do it? Natural filters/transformations DNA Functional protein From Sequence to Organism How can WE do it? Simulation of Nature Surrogate Processes

Simulation of Nature Utterance of W Shakespeare Utterance of George W Bush “Whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” ???

From Sequence to Organism How can WE do it? Surrogate Processes Utterance of W Shakespeare Utterance of George W Bush “Whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” Word frequency

From Sequence to Organism How can WE do it? Surrogate Processes Utterance of W Shakespeare Utterance of George W Bush “Whether ‘tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” “We must give our military every tool and weapon it needs to prevail...” Word frequency, words/sentence…
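The surrogate process named on these slides (word frequency, words per sentence) can be sketched as follows. This is a rough illustration under simple assumptions: sentences end at `.`, `!` or `?`, words are runs of ASCII letters and apostrophes, and the function name `text_profile` is hypothetical.

```python
import re
from collections import Counter


def text_profile(text):
    """Return crude surface statistics that can stand in for 'who said this?'."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-case words; apostrophes are kept so "'tis" stays one token.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "word_freq": Counter(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


profile = text_profile(
    "We must give our military every tool and weapon it needs to prevail."
)
print(profile["word_freq"].most_common(3), profile["words_per_sentence"])
```

Comparing such profiles between an unknown utterance and known samples does not simulate how either speaker composes a sentence; it only exploits measurable regularities, which is how the slides define a surrogate process.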

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC Characteristics of coding sequences/introns Gene finders Predicted coding regions My sequence

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters Gene finders Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Function?

From Sequence to Organism How can WE do it? Natural filters/transformations Selective transcription Selective processing Translation Folding/function Surrogate filters Gene finders Similarity finders My predicted gene Sequence/motif databases globin globin? Similar genes

Surrogate Filters Gene finders Start/Stop codon search CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA)

CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG Surrogate Filters Gene finders Start/Stop codon search Look for start codons (ATG) (GTG,TTG) Look for stop codons (TAA,TAG,TGA) Highly inaccurate
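A minimal sketch of the start/stop codon search, scanning all three frames on both strands. The start codons (ATG, GTG, TTG) and stop codons (TAA, TAG, TGA) are the ones listed on the slide; the function names, the tuple layout of the results and the `min_codons` parameter are assumptions made for this illustration. As the slide warns, the approach is highly inaccurate on its own: any start codon followed in frame by a stop codon is reported.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, strand, min_codons):
    """Collect start-to-stop stretches in each of the three frames of one strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first start codon since the last stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    # (strand, frame, start, end, sequence), 0-based, end exclusive,
                    # coordinates relative to the strand being scanned
                    orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


def find_orfs(seq, min_codons=1):
    """Search both strands; min_codons counts codons before the stop."""
    seq = seq.upper()
    return (scan_strand(seq, "+", min_codons)
            + scan_strand(reverse_complement(seq), "-", min_codons))


for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)
```

Raising `min_codons` removes the shortest hits, but length alone cannot separate real genes from chance open reading frames, which motivates the statistical models on the following slides.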

Surrogate Filters Gene finders Hidden Markov Model (HMM)-based recognition Step 1: Create model through extensive training set AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set AAAA: 33% AAAC: 25% AAAG: 12% AAAT: 30% Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition AAA AAC AAG AAT ACA... TTG TTT Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set AACA: 30% AACC: 20% AACG: 15% AACT: 35% AAA AAC AAG AAT ACA... TTG TTT Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Training Set AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
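Step 1 can be sketched as counting which nucleotide follows each trinucleotide in the training set and turning the counts into frequencies. What the slides tabulate (for example AAAA: 33%, AAAC: 25%) is a 3rd-order Markov chain; the sketch below builds only that table, not a full hidden Markov model with hidden states. The function name `train_markov` and the dictionary layout are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order=3):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]   # e.g. "AAA"
            following = seq[i + order]   # the base that comes next
            counts[context][following] += 1

    model = {}
    for context, following_counts in counts.items():
        total = sum(following_counts.values())
        # Frequencies for all four bases; bases never seen get 0.0.
        model[context] = {base: following_counts[base] / total for base in "ACGT"}
    return model


# Tiny made-up training set, just to show the shape of the result.
model = train_markov(["AAAAAAC"])
print(model["AAA"])
```

With a realistic training set of known genes, each of the 64 trinucleotide contexts ends up with four probabilities like the percentages shown on the slide.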

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (rows: AAA AAC AAG AAT ACA ... TTG TTT; columns: A C G T) Candidate gene AAAGCAA… 0.12

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (rows: AAA AAC AAG AAT ACA ... TTG TTT; columns: A C G T) Candidate gene AAAGCAA… 0.12 x

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model (rows: AAA AAC AAG AAT ACA ... TTG TTT; columns: A C G T) Candidate gene AAAGCTA… 0.12 x So far, not a good candidate!

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model Candidate genes Predicted genes

Surrogate Filters Gene finders Class 3: Hidden Markov Model (HMM)-based recognition Step 2: Assess candidate genes 3rd-order Markov model Candidate genes Predicted genes Conform to standard model Challenge accepted beliefs
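Step 2 multiplies, position by position, the probability of each nucleotide given the three before it (0.12 for G after AAA in the slides). A sketch of that scoring is below. It sums log-probabilities instead of multiplying raw values so that long candidates do not underflow. The `floor` value for unseen transitions and the toy model are assumptions: the percentages come from the Step 1 slides, and any context not listed there is simply treated as unseen.

```python
import math


def score_candidate(seq, model, order=3, floor=1e-6):
    """Log-probability of a candidate under a 3rd-order Markov model.

    `model` maps a trinucleotide context to {base: probability}.
    Unseen contexts or zero probabilities are replaced by `floor` (an assumption).
    """
    seq = seq.upper()
    log_p = 0.0
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        following = seq[i + order]
        p = model.get(context, {}).get(following, 0.0)
        log_p += math.log(max(p, floor))
    return log_p


# Toy model using only the two rows shown on the Step 1 slides.
toy_model = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
}
print(score_candidate("AAAG", toy_model))
print(score_candidate("AAAA", toy_model))
```

A candidate is kept only if its score is high relative to what the training set would predict, which is why, as the last slide notes, predicted genes conform to the standard model and unusual genes tend to be missed.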

Computers are powerful globin Highly filtered output Easy to grasp High-level insights Unfiltered output Confusing Basic insights

Computers are tempting

Globin Computers are tempting

Crisis in Bioinformatics 1. Need high-level filters 2. Need access to raw phenomena 3. Need new tools for new phenomena 4. Need intuitive representation of results 5. Need ability to build new tools Need a new generation

View of the Future

View of the Future Integration of information ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu... Genetic code Rules of folding Active site Cell interaction Metabolism, Architecture

Prochlorococcus MED4 Prochlorococcus MIT9313

How do cells control response to light? What genes are related to the adaptation to high light? Look for: (1) Gene present in Prochlorococcus MED4 (MED4 is naturally adapted to grow in high light). (2) Ortholog absent in Prochlorococcus MIT9313 (MIT9313 is naturally adapted to grow in low light). (3) Ortholog present in Synechocystis PCC 6803 (reason will become apparent in a moment). (4) Synechocystis PCC 6803 ortholog responds to high light (gene turns on by factor > 2 in response to high light).

Build setDisplay set Click on Build Set to begin finding orfs with the desired specifications HELPSet operation

All items in All open reading frames of All amino acid sequences of All intergenic regions of Human-annotated orfs of Private set Public set All open reading frames of Build set Display set Choose set type Goal is to find all open reading frames within Prochlorococcus MED4 that meet certain specifications, so click on All open reading frames in CancelHELPSet operation

All items in All open reading frames ofArthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Unicellulular Filamentous All Prochlorococcus MED4 Build setDisplay set Choose set typeChoose database Click on Prochlorococcus MED4 CancelHELPSet operation

All items in All open reading frames ofProchlorococcus MED4 Display set such that: Variable DataOperationFunctionDone Choose set typeChoose database Build set You will ask that an ortholog of each desired MED4 genes exists in Synechocystis PCC It is convenient to define the ortholog now. Click the Variable button CancelHELPSet operation

All items in All open reading frames ofProchlorococcus MED4 Display set such that: Variable Data Item New variable Variable Choose set typeChoose database New variable Build set Item refers to the MED4 orf under consideration. You want to define its ortholog in Synechocystis, so click on New variable OperationFunctionDone CancelHELPSet operation

All items in All open reading frames ofProchlorococcus MED4 Display set such that: Variable Data 6803 ortholog Type variable name = Choose set typeChoose database Build set You can name the variable representing the ortholog anything you like. For this simulation, a name is provided. Press the Enter key OperationFunctionDone CancelHELPSet operation

All items in All open reading frames ofProchlorococcus MED4 Display set such that: VariableData 6803 ortholog Type variable name = Closest ortholog of Protein product of Upstream region of Downstream region of Ortholog of (item Choose set typeChoose database Choose function Build set One variable can be defined with respect to another in several ways. The relationship you want is Ortholog of OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData = Ortholog of (item in Arthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Choose database Synechocystis PCC6803 ) Choose function Build set Clicking on Synechocystis PCC6803 defines the variable 6803 ortholog as the ortholog in Synechocystis to a given orf of MED ortholog Type variable name OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Synechocystis PCC 6803 Build set ) The first limitation on the MED4 orf is that no ortholog of it exists in MIT9313. To evoke the concept of ortholog, press the Function button = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set Click on Ortholog of Closest ortholog of Protein product of Upstream region of Downstream region of Ortholog of Choose function Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set As always, Item refers to the orf of MED4 that is being defined. You want to specify that an ortholog of it in MIT9313 doesn’t exist, so click on Item. Item 6803 ortholog Variable Item ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set Clicking on Prochlorococcus MIT9313 defines an ortholog of a MED4 gene in MIT9313 (if such an ortholog exists) Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Arthrobacter platensis Gloeobacter violaceus Microcystis aeruginosa Nostoc punctiforme Nostoc PCC 7120 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus S120 Synechococcus PCC6301 Synechococcus PCC7942 Synechococcus WH Synechocystis PCC 6803 Thermosynechococcus Trichodesmium Choose database ) Prochlorococcus MIT9313 OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set You want to keep only those MED4 genes where an ortholog in MIT9313 does NOT exist, so click on doesn’t exist. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) =  exists doesn’t exist Op OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set That completes one specification, but there are more. Click on the Operation button to connect one specification to the next. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: VariableData Build set You want both the first specification AND the second to be true, so click on AND. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND OR AND Op OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set The second specification is that microarray data for the 6803 ortholog meets a certain criterion. To get at that data, press the Data button Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set The data you want is for the 6803 ortholog. Click on 6803 ortholog. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Item 6803 ortholog New variable Variable 6803 ortholog in OperationFunctionDone CancelHELPSet operation

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set Choose the Hihara experiment, which measured expression changes upon shift from low light to high light. If you didn’t know which experiment was appropriate, you could have clicked on Choose data set for a description of the choices Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( 6803 ortholog Variable in Microarray:Hihara1(6803) Microarray:Suzuki1(6803) Microarray:Yoshimura1(6803) Microarray:Meeks(Npun) Microarray:Golden(7120) Choose data set Microarray:Hihara1(6803) ) OperationFunctionDone CancelHELPSet operation High light vs low light experiment

All items in All open reading frames of Choose set type Prochlorococcus MED4 Choose database Display set such that: Variable Data Build set You want the ratio of experimental condition to control to exceed a specified value. Click on >. Item Variable ( in Synechocystis PCC 6803 ) = Ortholog of (item in Choose databaseChoose function 6803 ortholog Type variable name Ortholog of Choose function Prochlorococcus MIT9313 Choose database ) doesn’t exist Op AND Op [ data for ( Variable in Microarray:Hihara1(6803) Choose data set ) < < or = = > or = > > Op 6803 ortholog OperationFunctionDone CancelHELPSet operation

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that:
  6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
  Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ( 6803 ortholog in Microarray:Hihara1(6803) ) > Value ]
You can type in the value you want. For this simulation a number is supplied. Press the Enter key.

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that:
  6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
  Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ( 6803 ortholog in Microarray:Hihara1(6803) ) > Value ]
No more specifications. Press the Done button.

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that:
  6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
  Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ( 6803 ortholog in Microarray:Hihara1(6803) ) > Value ]
Save menu: Save results and script | Save only results
Selected: Save only results
This was a complicated search. If you wanted to do it again, you could save the search description. In this case, just save the results by clicking on Save only results.

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that:
  6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
  Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ( 6803 ortholog in Microarray:Hihara1(6803) ) > +2 ]
Type name of set: Light-specific genes
All MED4 genes meeting the given specifications will be collected into a set. You can name the set anything you want. For this simulation, a name is provided. Press the Enter key.

Build set | Display set

:all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus
:all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus
:all0688 hupS [NiFe] uptake hydrogenase small subunit
:alr0692 similar to nifU
:alr0874 nifH2 dinitrogenase reductase
:asr1309 similar to nifU
:alr1407 nifV1 homocitrate synthase
:asr1408 nifZ iron-sulfur cofactor synthesis
:asr1408 nifT

Set: Light-specific genes
ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, C terminus
ProcMed4:all0687 hupL [NiFe] uptake hydrogenase large subunit, N terminus
ProcMed4:all0688 hupS [NiFe] uptake hydrogenase small subunit
ProcMed4:alr0692 similar to nifU
ProcMed4:alr0874 psbBX dinitrogenase reductase
ProcMed4:asr1309 similar to nifU
ProcMed4:alr1407 psbY1 homocitrate synthase
ProcMed4:asr1408 psbX iron-sulfur cofactor synthesis
ProcMed4:asr1408 nifT

The results are displayed as a list of orfs. (Of course, the search capabilities do not now exist, and the results of the described search are unknown.) Clicking on the name of any orf brings you to its page (see Scenarios 1 and 2). Clicking on circles next to the orf names allows you to modify the set. The genetic neighborhood of each orf is shown to the right.
Buttons: Done | HELP | Set operation
[WARNING: Fantasy filtration not in effect!]

Prochlorococcus MED4: pll1290
Replicon: Chromosome
Coordinates: (stop) < (start-TTG) [Human]
Length = 301 amino acids
Strand: Complementary
Gene name(s): proXM
Function: Putative type II DNA cytosine methyltransferase (CAGCTG-specific) [Human]
Classification: Type II beta (N4) [Human]
Activity: Protects against: PvuII [Experiment]
In vivo activity: exists [Experiment]
Cyanobacterial orthologs: none
ProcMED4 | Proteus vulgaris | Salmonella paratyphi | Streptomyces spectabilis
Menu: Options | Annotate | Main Menu | History | More | HELP
[WARNING: Fantasy filtration not in effect!]

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that:
  6803 ortholog = Ortholog of (item in Synechocystis PCC 6803)
  Ortholog of (item in Prochlorococcus MIT9313) doesn’t exist AND [ data for ( 6803 ortholog in Microarray:Hihara1(6803) ) > Value ]
Save menu: Save results and script | Save only results
Selected: Save results and script
This was a complicated search. If you wanted to do it again, you could save the search description. In this case, just save the results by clicking on Save only results.

Equivalent script that bypasses interface

FOR orf IN (orfs:ProcMED4) {
    6803ortholog = Ortholog(orf,orfs:Syny6803);
    WHEN (NOT Exists(Ortholog(orf,orfs:Proc9313))
          AND Data(6803ortholog,microarray:Hihara1) > +2) {
        COLLECT orf INTO light_specific_genes;
    }
}
DISPLAY (light_specific_genes, "BNC");    (or MAIL)

The same search could have been conducted through the script shown above. The script interface makes possible complex searches beyond the scope of the graphical interface.
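For readers more used to a general-purpose language, the logic of the script can be sketched in Python. This is only an illustration under stated assumptions: the knowledge base and its script language are hypothetical in the talk itself, and the dictionaries, identifiers (`orfA`, `s1`, ...) and expression values below are invented toy data, not real genes or measurements. The threshold of 2 mirrors the `+2` in the script; its units are whatever the data set uses.

```python
def light_specific_genes(med4_orfs, ortholog_in_6803, ortholog_in_9313,
                         hihara_ratio, threshold=2.0):
    """Collect MED4 orfs that lack an MIT9313 ortholog and whose
    6803 ortholog exceeds the threshold in the Hihara data set."""
    selected = []
    for orf in med4_orfs:
        # NOT Exists(Ortholog(orf, orfs:Proc9313))
        if ortholog_in_9313.get(orf) is not None:
            continue
        # 6803ortholog = Ortholog(orf, orfs:Syny6803)
        ortholog = ortholog_in_6803.get(orf)
        if ortholog is None:
            continue  # no 6803 ortholog, so no expression data to test
        # Data(6803ortholog, microarray:Hihara1) > +2
        ratio = hihara_ratio.get(ortholog)
        if ratio is not None and ratio > threshold:
            selected.append(orf)  # COLLECT orf INTO light_specific_genes
    return selected


# Toy data (all identifiers and values are made up for illustration).
med4_orfs = ["orfA", "orfB", "orfC", "orfD"]
ortholog_in_6803 = {"orfA": "s1", "orfB": "s2", "orfC": "s3"}
ortholog_in_9313 = {"orfB": "m1"}
hihara_ratio = {"s1": 3.5, "s2": 4.0, "s3": 1.2}

result = light_specific_genes(med4_orfs, ortholog_in_6803,
                              ortholog_in_9313, hihara_ratio)
print(result)
```

In this toy run only `orfA` survives: `orfB` has an MIT9313 ortholog, `orfC` falls below the threshold, and `orfD` has no 6803 ortholog.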

Build set: All items in "All open reading frames" of Prochlorococcus MED4 such that: ...
HELP ???

Cyanobacterial Knowledge Base Virtual Help Desk How to search for data? How to build a new filter?

Cyanobacterial Knowledge Base Virtual Help Desk How to......I don’t know! Virtual Help Desk Staff HELP

Cyanobacterial Knowledge Base Virtual Help Desk Upper echelons Staff You Virtual Help Desk Staff HELP

Billions and Billions of Bases How does a biologist maintain a grip on sanity? reality?

View of the Future Interplay of low- & high-level perception ProcMED4 Proteus vulgaris Salmonella paratyphi Streptomyces spectabilis

View of the Future Interplay of low- & high-level perception Anab7120 Proteus vulgaris Salmonella paratyphi Streptomyces spectabilis TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA 97% TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

Anabaena Chromosome ( bp): 4001 to 5000 cgcccaacaataacaaatgtgtaatctagaccttctgccttgagttcctt ggcgcggttttcggcacgacggatgacgttggtattgtaaccgccgcaca aaccacgatcgccagaaataactagcaagcctactgatttaacttcccgt tttttcagtagaggtaagtctacatcttcaaaccgtagacgagtttgcaa accgtataatacttgtgccaaacggtcagcaaaaggacgagtagcgatta cttgttcttgggcgcgacgtacacgcgccgccgctaccagccgcatggct tctgtgattttcttggtgtttttgaccgactgaatgcgatcgcgtattga tttgagattaggcataatatttgttgattgtcagttgtcagttgtcagtt gtcagttgtcagtgtctattgctactgaccactgaccaatgactaatgac taattacgctgtagctttgaaggtctttttgtagtcttctaaagctgcct tcaatgctttttcttcatcatcacccagtgctttcttcgattgtacgtct tggaagtaggggttaacgccggacttcaagtaatctctcaagcctttggt gaaggtggtgactttatcaacagggatatcatctaagtaaccgttgatac ctgcgtacagaatggctacttgttcagctacggatagaggctgattttgg gactgtttgaggagttcccgcaggcgttgacctcttgccaattggtcttg ggtggctttatctaggtcggaagcaaattgcgcgaaggcttggaggtcgt caaactgtgctagttcgagcttaatcttaccagcaacttttttcatcgct ttggtttgtgccgcagaacccacacgggatacagagataccagggtttac agccggacgaataccagcgttaaataagtcagaagataagaatatctgac cgtctgtaatagaaattacgttggtaggaatgtaggcagaaacgtcacca Typical output of current programs

Future: Sequence plus genetic context Noncoding region

Future: Both filtered and raw data

Filters: Information reducers Build filter to find repeated sequences TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT TGGTCTCCGACCGACCGTAGGTCATCG CTTGTACTGAGCGAAGTCGAAGTA CTTGTACTGAGCGTAGCCGAAGTA GTTCGACTGAGCGTAGTCGAAGTC... Repeat filter Entire genomeRepeated sequences
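One simple way to picture a repeat filter is as a count of every fixed-length word in the genome, keeping only the words that occur more than once. The sketch below assumes exact matches on one strand and a fixed word length `k`; it is not the method used in the talk, which is not described, and the toy genome string is invented.

```python
from collections import Counter


def repeated_kmers(genome, k, min_copies=2):
    """Return every length-k word that occurs at least min_copies times.

    Overlapping occurrences are counted; only the given strand is scanned.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_copies}


# Toy sequence (invented for illustration).
toy_genome = "ACGTACGTTTACGT"
repeats = repeated_kmers(toy_genome, k=4)
print(repeats)
```

The filter reduces information in the sense of the slide: the full sequence goes in, and only the short list of repeated words comes out.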

Filters: Information reducers
Build repeats filter
Entire genome [raw sequence, as on the previous slide] -> Repeat filter -> Repeated sequences:
CTTGTACTGAGCGAAGTCGAAGTA
CTTGTACTGAGCGTAGCCGAAGTA
GTTCGACTGAGCGTAGTCGAAGTC
...
NIS-1: repeat family

Alignment of NIS-1 (…271 more)

Filters: Information reducers
Build secondary repeats filter
Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10)
Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2)
Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1)
A: CTTGTACTGAGCGAAGTCGAAGTA
B: CTTGTACTGAGCGTAGCCGAAGTA
Distance (A, B) = 2

Filters: Information reducers
Build secondary repeats filter
Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10)
Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2)
Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1)
Distance (A, B) = 2
A: CTTGTACTGAGCGAAGTCGAAGTA
C: GTTCGACTGAGCGTAGTCGAAGTC
Distance (A, C) = 5

Filters: Information reducers
Build secondary repeats filter
Subfamily A: CTTGTACTGAGCGAAGTCGAAGTA (copy number = 10)
Subfamily B: CTTGTACTGAGCGTAGCCGAAGTA (copy number = 2)
Subfamily C: GTTCGACTGAGCGTAGTCGAAGTC (copy number = 1)
Distance (A, B) = 2
B: CTTGTACTGAGCGTAGCCGAAGTA
C: GTTCGACTGAGCGTAGTCGAAGTC
Distance (B, C) = 5
Do for all pairs of subfamilies
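The pairwise comparison can be sketched in a few lines of Python. The sketch assumes that "distance" means the number of mismatched positions between two equal-length sequences (Hamming distance), which agrees with the values shown on the slides and with the later legend "Distance: number of mismatches". The sequences and copy numbers are those on the slides; the function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


subfamilies = {
    "A": "CTTGTACTGAGCGAAGTCGAAGTA",
    "B": "CTTGTACTGAGCGTAGCCGAAGTA",
    "C": "GTTCGACTGAGCGTAGTCGAAGTC",
}

# Copy number of each subfamily = how many exact copies the repeat filter found.
copies = [subfamilies["A"]] * 10 + [subfamilies["B"]] * 2 + [subfamilies["C"]]
copy_number = Counter(copies)

# "Do for all pairs of subfamilies"
distances = {
    (p, q): hamming(subfamilies[p], subfamilies[q])
    for p, q in combinations(sorted(subfamilies), 2)
}

for (p, q), d in distances.items():
    print(f"Distance ({p}, {q}) = {d}")
```

The copy numbers and the table of pairwise distances are the inputs needed for a picture of the family in which each subfamily is drawn with a size given by its copy number and a separation given by its distance from the others.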

Relationship between related repeats in genome (sequences within NIS-1 repeat family)
Diameter: Copies of exact repeats
Distance: Number of mismatches

Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
   Integrated knowledge base

Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
   Integrated knowledge base
3. Need new tools for new phenomena
4. Need intuitive representation of results
   Tools that bridge levels of perception

Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
   Integrated knowledge base
3. Need new tools for new phenomena
4. Need intuitive representation of results
   Tools that bridge levels of perception
5. Need ability to build new tools
   Short term: Graphical programming, human help
   Long term: Need a new generation

Billions and Billions of Bases How does a biologist maintain a grip on reality? Filtering reality Raw reality Real questions with real answers

Pre-genomic Molecular Biology

How do we figure out how cars are made? Genetic approach | Biochemical approach

Pre-genomic Molecular Biology Geneticist’s Approach

Pre-genomic Molecular Biology Geneticist’s Approach: Isolation of Defective Gene

Pre-genomic Molecular Biology How do we figure out how cars are made? Genetic approach | Biochemical approach

Pre-genomic Molecular Biology Biochemist’s Approach

Pre-genomic Molecular Biology How do we figure out how cars are made? Genetic approach | Biochemical approach

Pre-genomic Molecular Biology: How we viewed the world
One component at a time
Highly filtered perception
Many local viewpoints

Post-genomic Molecular Biology

Post-genomic Molecular Biology Bioinformaticist’s Approach (long term) Assemble the whole

Post-genomic Molecular Biology Bioinformaticist’s Approach (short term) Identify critical parts

Globin Current Biology

AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA CCCAAGCTTTGATTCTTTCCT CTGTTAAATCGGATTGATTAT CTTCATCAAGGGCAAGACCT ACAAATTTACCATCACGAAC AGCTTTAGACTCACTGAATT CATAACCTTCTGTAGGCCAA TAGCCAACTGTTTCACCACC ATTTTCTGAAATTTTTTCCTCT AGAATACCGAGGGCATCTTG AAATGTATCAGGATAACCAA CCTGGTCTCCAGGAGCAAAA TAAGCAACTTTTTTGCCGATG AAGTCAATGTTATCTAACTC ATCATAAAAATTTTCCCAAT CACTTTGCAATTCTCCAACAT TCCAGGTAGGACAACCAAC AACGATATAATCGTAGTTAT TGAAATCACTTGGTTCAGCTT GTGAAATATCATATAAAGTT ACAACACTATCACCACCAAA CTCCTTCTGAATTATTTCTGA TTCAGTTTGGGTATTGCCTGT TTGAGTACCAAAAAATAAAC CAATATTAGACATTTTTACTC CTTTTATGTATTTGCAAAATT ATTTCAATTAAAATATTTAGT AATAATTAATTGTTAGCTAG CTAATAATTAAATTTTTATTA CAATCATTGTAAAAGGCATT GAAAAAGTAAATAAAAATT TTTATTCTACGTTATTTCAAA AATATTTACTTACATATACTT AACCTTTATAGTGATGTAAT ATACTCTAATTCCTATTTTAC TTATAAATACCATCTCAGCTT AATGTAACGAATTTTTCTGTT TATCTTTAAATACAAAAAAT TCAACAAAACTACAGAAAA TTAATCTTAATAACACAAAA CAAGTATCAATCTGTAATAC AACTAAGCTTAAATAAATTA ATAGAAAGCTTCATCTATCT AATAGGTTGAGAATAGTTTA TGTCTAATGACATAAATTCA TTCGTGTTGATTTCATTTGGG TATATTCATCTGATTTAGGAT TTACTCCATTAAGTTTGTACT CATCAATGCCCGCCTGTTGG TATCCACAATTCTCATACAG TGCGCGAGCAAAGTAATCA ATCGTTCGTCGCCATATCTA ACTTTGAGTCAAACAAACCA GTTGGATTACCAACCCTCAA CTAATCGCTTCTTTAAGGCG AGCGATCGCACATTTAACTG TTGGTTGTCACAAGAGAACT AATACTACAGCAGTATATTT AACAACTAAGGGTGGTTCAA CTTTCGCTGCGACTCCTCCAA CGCGCTGAAATACACAGGA CTGATGCGATCGCAAACTCT TTGACTAAATTCCATACATT ATCATGACCATCTCCCAAAC AAACAAGTGGGTTAACCAG ATGCTGACTATTAACATCCC CTGAGTTCGGAGTTGTAGGT CTATTTGACTGGTTCAAAGC GATGATGGAACGGCTTTGTT GCATGAATTAAAAAAAGAC ACACCATCACCTACTTCTAG GATAGACACATCAAACGTCC CACCGCCTAAGTCAAATACC AAGATAATTTCGTTAGTTTTC TTGTCAAGTCCGTAAGCGAG GGCCGCCGCCGTGGGCTAGT TGATAATTCGCAGAACTTTA ATCCCGGCAATTCTACTGGC ATCTTTGGTAGCCTGCCGTTG AGAGTCATTGAAATAGGCAG GGGTGGTAATTACCGCTTGC CTCACTGGTTCCCCCAGATA TGTGCTGGCATCATCTATCA GCTTGCGGACTACCTCATAC CATTTCACGAAAAACCTGAT ACACATGTAAACTCTGAAAC CCTTGCTGTATCAAAGTTTTG TAATTACGAATTACGAATTA CGAATTGATATCAGCCGAGA TTTCTTCGGGTGAAAATTCCT TGTTCAGAGCGGGACAGTGT AGCTTGACATTGCCATTACT GTCACGTACCACTTTGTAAG 
TAACTTGTTTTGCCTCTTGCG TAACTTCATCATACCTGCGC CCGATGAACCGCTTCACAGA ATAAAAAGTGTTTTCTGGGT TCATTACACCCTGGCGCTT Future Biology


Globin TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT Current Biology Current Life

“Axis of Evil...” Current Life

“No war for oil...” Globin Current Life


Contact Information Jeff Elhai Department of Biology Virginia Commonwealth University Richmond, VA Tel: Web: