Published by Rodger Walton. Modified over 9 years ago.
1
Manifestations of a Code
Genes, genomes, bioinformatics and cyberspace – and the promise they hold for biology education
2
The iPlant Collaborative Vision
www.iPlantCollaborative.org
Enable life science researchers and educators to use and extend cyberinfrastructure.
3
What is a genome?
A GENOME is all of a living thing’s genetic material. The genetic material is DNA (DeoxyriboNucleic Acid). DNA, a double-helical molecule, is made up of four nucleotide “letters”: A, G, T, and C.
Slide: JGI, 2009
4
What is sequencing?
Just as computer software is rendered in long strings of 0s and 1s, the GENOME or “software” of life is represented by a string of the four nucleotides, A, G, C, and T. To understand the software of either a computer or a living organism, we must know the order, or sequence, of these informative bits.
Slide: JGI, 2009
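To make the software analogy concrete, here is a minimal sketch of one way to write a DNA string as 0s and 1s, using two bits per nucleotide. The bit assignment (A=00, C=01, G=10, T=11) and the function names are arbitrary choices for illustration; the slide does not specify any encoding.

```python
# Arbitrary 2-bit code per nucleotide (an assumption for illustration only).
BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASES = {bits: base for base, bits in BITS.items()}


def dna_to_bits(seq):
    """Encode a DNA string as a string of 0s and 1s, two bits per base."""
    return "".join(BITS[base] for base in seq.upper())


def bits_to_dna(bits):
    """Decode a bit string produced by dna_to_bits back into DNA letters."""
    return "".join(BASES[bits[i:i + 2]] for i in range(0, len(bits), 2))
```

Because the code is reversible, the order of the bits carries exactly the same information as the order of the nucleotides.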
5
Economics of Scale
[Figure: sequence production (billions of bases/month) and cost (cents per base) by year, 1989 to 2007, with the launch and completion of the Human Genome Project marked.]
Slide: JGI, 2009
6
Important Dates in Genomics
1986 DOE announces Human Genome Initiative: $5.3 million to develop technology.
1990 DOE & NIH present their HGP plan to Congress.
1997 Escherichia coli genome published.
1997 Yeast genome published.
2000 Fruit fly (Drosophila) genome published.
2000 Working draft of the human genome announced.
2000 Thale cress (Arabidopsis) genome published (2x).
2002 Rice genome published (2x).
2003 Human genome published.
2006 First tree genome published in Science.
2007 First metagenomics study published.
7
Coming into the Genome Age
For the first time in the history of science, students can work with the same data and tools that are used by researchers.
Learning by posing and answering questions.
Students generate new knowledge.
8
Workshop Objectives
Illustrate the evolving concept of “gene.”
Conceptualize a “big picture” of complex, dynamic genomes.
Guide students to address real problems through modern genome science.
Use educational and research interfaces for bioinformatics.
Work with “real” genome sequences gathered by students – in the lab or online.
http://gfx.dnalc.org/files/evidence
9
Exciting? >mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA 
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT CTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCA GCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT
10
This better?
11
What do we know about genes?
Expressed (Transcribed)
– Transcriptional start & termination sites (TXSS, TXTS)
– Transcription artefacts (cDNA & ESTs)
Regulated
– Promoters (TATAAA)
– Transcription Factor Binding Sites
– CpG (cytosine methylation)
Meaningful (Translated)
– 3n basepairs
– Codon usage
– Translational start & stop/termination codons (TLSS, TLTS)
– Translation artefacts (proteins)
Spliced
– Splice sites (GT-AG)
Derived (Homology: Paralogy/Orthology)
– Search for known genes, proteins (BLAST)
12
How might this knowledge help to find genes?
Predict genes
– Look for potential starts and stops.
– Connect them into open reading frames (ORFs).
– Filter for “correct” length & codon usage.
Search databases
– Known genes: UniGene
– Known proteins: UniProt
Use transcript evidence
– cDNA
– ESTs
– proteins
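As a rough illustration of the "codon usage" filter, the sketch below tallies the in-frame triplets of a candidate ORF. The function name is hypothetical, and real predictors compare such tallies against species-specific codon frequency tables, which are not shown here.

```python
from collections import Counter


def codon_usage(orf):
    """Count each in-frame triplet of an ORF.

    Trailing bases that do not fill a complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))
```

For example, `codon_usage("ATGGCTGCTTAA")` counts ATG once, GCT twice, and TAA once.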
13
Canonical splice sites
[Diagram: pre-mRNA with exon, intron, 5’ splice site and 3’ splice site.] Reddy, A.S.N. Annu. Rev. Plant Biol. 2007 58:267-94
Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant J. (2004) 39, 877–885)
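The GT…AG rule and the 93% figure can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and any intron sequences passed in would be made-up test strings, not data from the cited study.

```python
def is_canonical_intron(intron):
    """True if an intron (DNA letters) starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def canonical_fraction(introns):
    """Fraction of the given introns that follow the GT...AG consensus."""
    if not introns:
        return 0.0
    return sum(is_canonical_intron(i) for i in introns) / len(introns)


# Arithmetic behind the slide's figure: 1470 of 1588 sites.
percent_canonical = 1470 / 1588 * 100  # about 92.6, which rounds to 93
```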
14
Multiple splice variants produced from the same gene An example from A. thaliana Not a rare event!!!
15
Annotation workflow
[Diagram with the steps: Get DNA sequence; Gather biological evidence; Generate mathematical evidence; Build gene models; Browse in context; Analyze large amounts of data; Find gene families.]
16
Walk or…
17
Early concept (2009)
18
DNA Subway 2014
19
Molecular biology and bioinformatics concepts
RepeatMasker
– Eukaryotic genomes contain large amounts of repetitive DNA.
– Transposons can be located anywhere.
– Transposons can mutate like any other DNA sequence.
FGenesH Gene Predictor
– Protein-coding information begins with start, followed by codons, ends in stop.
– Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
– Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
– Gene prediction programs search for patterns to predict genes and their structure.
– Different gene prediction programs may predict different genes and/or structures.
Multiple Gene Predictors
– The protein-coding sequence of an mRNA is flanked by untranslated regions (UTRs).
– UTRs hold regulatory information.
BLAST Searches
– Gene or protein homologs share similarities due to common ancestry.
– Biological evidence is needed to curate gene models predicted by computers.
– mRNA transcripts and protein sequence data provide “hard” evidence for genes.
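The DNA/mRNA equivalence mentioned above (ATG and AUG, TAA and UAA, GT and GU) amounts to swapping T for U on the coding strand. A minimal sketch, with hypothetical function names:

```python
def dna_to_mrna(coding_strand):
    """Write a DNA coding-strand sequence in mRNA letters (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def mrna_to_dna(mrna):
    """Write an mRNA sequence in DNA coding-strand letters (U becomes T)."""
    return mrna.upper().replace("U", "T")
```

This is why gene predictors can search genomic DNA directly for ATG, TAA, TAG, TGA and GT…AG rather than working on RNA sequences.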
20
What is a gene? Can we define a gene? Has the definition of a gene changed? How can we find genes?
21
Views
Genes as “independent hereditary units” (1866), Mendel
Genes as “beads on strings” (1926), Morgan
One gene, one enzyme (1941), Beadle & Tatum
DNA is the molecule of heredity (), Avery
DNA > RNA > Protein (1953), Crick, Watson, Wilkins
22
More Insights
Transposons (1940s-50s), McClintock
Repetitive DNA (Human: 50%; Lily: 98%)
Reverse transcription (1970), Temin & Baltimore
Split genes (1977), Roberts & Sharp
RNA interference (1998), Fire and Mello
“Fluid” genomes (Philadelphia Chromosome)
23
Sequence & course material repository
http://gfx.dnalc.org/files/evidence & iPlant Wiki
Don’t open items, save them to your computer!!
Annotation (sequences & evidence)
Manuals (DNA Subway, Apollo, JalView)
Presentations (.ppt files)
Prospecting (sequences)
Readings (Bioinformatics tools, splicing, etc.)
Worksheets (Word docs, handouts, etc.)
25
Let’s Do It!
>mouse_ear_cress_1080
GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAG
GCTGAGAGCAGTGCATATAGATATCTTTCGTACTCATCTGCTTTTTCTGGTCT
CCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTC
GACCTTTTCCAATCAGGTGCTTCTGGTGTGTCTACTACTATCAGTTTTAGGTC
TTTGTATACCTGATCTTATCTGCTACTGAGGCTTGTAAAAGTGATTAAAACTG
TGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCG
GTGTGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAA
GAAGATGAACTCTCATTGACTGAAAGCGGGTTGAAGAGTGAAGATGGCGTTAT
TATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACC
AAGGGAGAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGC
ATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGTTTGAAGTTTCTTA
ACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTA
GAGCTAACCAGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTT
GTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAGATGTTGATCTTGTTGG
TAATAATGGAGA
27
How can we find genes? Search for them Look them up
28
How do I get from this… >mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA 
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT CTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCA GCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT
29
…to this?
30
Meaning?
31
Mathematical Tools (Code; statistics)
32
Comparative Tools (Database searches)
33
Operating computationally
Go to beginning of sequence; start SCAN.
If ATG, register putative TLSS; then
– Move in 3-steps & count steps (= COUNTS).
– If 3-step = (TAA or TAG or TGA), register putative TLTS.
– If TLTS is registered, evaluate COUNTS (= triplets):
  If COUNTS < minimum, discard; then go behind ATG above and start SCAN.
  If COUNTS > maximum, discard; then go behind ATG above and start SCAN.
  If minimum < COUNTS < maximum, record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN.
Arrive at end of sequence; stop SCAN.
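The scan described on this slide can be sketched in Python as follows. This is one possible reading of the slide, not the algorithm of any particular gene predictor: the function name, the default bounds, and the 0-based coordinates are assumptions, COUNTS includes the ATG itself, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_orfs(seq, min_codons=2, max_codons=1000):
    """Return (tlss, tlts, counts) for every ATG with an in-frame stop.

    tlss   -- 0-based index of the A of the ATG
    tlts   -- 0-based index of the first base of the stop codon
    counts -- triplets from the ATG up to, but not including, the stop
    Only ORFs with min_codons < counts < max_codons are recorded.
    """
    seq = seq.upper()
    genes = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)       # SCAN for the next putative TLSS
        if start == -1:
            break                          # end of sequence: stop SCAN
        counts = 0
        i = start
        while i + 3 <= len(seq):           # move in 3-steps
            if seq[i:i + 3] in STOP_CODONS:
                if min_codons < counts < max_codons:
                    genes.append((start, i, counts))
                break                      # putative TLTS reached
            counts += 1
            i += 3
        pos = start + 3                    # go behind this ATG and resume SCAN
    return genes
```

For instance, `scan_orfs("CCATGAAATTTTAGGG")` finds the ATG at index 2, steps through ATG, AAA and TTT, and stops at TAG at index 11. An ATG with no in-frame stop before the end of the sequence is not recorded.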
34
Annotation workflow
[Diagram with the steps: Get/Generate sequence; Biological evidence; Mathematical evidence; Construct gene models; Browse results; Browse in context; Analyze large data sets; Find gene families.]
35
Annotation Cheat Sheet
A. DNA Subway
– Open existing project or generate new (Red square)
– Run RepeatMasker
– Generate evidence (Predictions, BLAST searches)
– Synthesize evidence into gene models (Apollo)
– Browse results locally and in context (Phytozome)
– Conduct functional analysis (link from Browser)
– Prospect for gene family (Yellow Line from Browser)
B. Apollo
– Select region that holds biological gene evidence
– Optimize work space and zoom to region (View tab)
– Expand all tiers (Tiers tab)
– Drag evidence item(s) onto workspace (mouse)
– Edit to match biol. evidence (right-click item for tools)
– Record what was done in Annotation Info Editor
– Assess necessity to build alternative model(s)
– Upload model(s) to DNA Subway (File tab)
36
Predictors (mathematical evidence)
Utilize predominantly mathematical (statistical) methods.
Search for patterns:
– Some score starts, stops, splice sites (GenScan).
– Some score nucleotides (Augustus, FGenesH).
Few incorporate EST data and/or known genes/proteins.
Require optimization for each new species (training).
Accuracy:
– False positives (scoring non-genes as genes): 5%-50%.
– False negatives (missed genes): 5%-40%.
– Weak at, or incapable of, determining first and last exons, and UTRs.
Specific for gene models (spliced genes, non-spliced genes).
Specialty predictors (tRNA Scan, RepeatMasker).
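The two error types can be computed by comparing a predictor's output with a trusted reference annotation. The sketch below assumes genes are compared by identifier and defines the rates as "share of predictions that are not real genes" and "share of real genes that were missed"; the slide does not say how its percentages were calculated, so these definitions and the function name are assumptions.

```python
def prediction_error_rates(predicted, reference):
    """Return (false_positive_rate, false_negative_rate) for a gene predictor.

    predicted -- identifiers of genes called by the predictor
    reference -- identifiers of genes in a trusted annotation
    """
    predicted, reference = set(predicted), set(reference)
    false_pos = predicted - reference   # non-genes scored as genes
    false_neg = reference - predicted   # real genes that were missed
    fp_rate = len(false_pos) / len(predicted) if predicted else 0.0
    fn_rate = len(false_neg) / len(reference) if reference else 0.0
    return fp_rate, fn_rate
```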
37
Search tools (biological evidence)
Search sequence (molecules; tangible) databases:
– Known genes
– Known proteins
– cDNAs & ESTs
Utilize alignment methods (BLAST, BLAT).
Reliability:
– Good in determining gene locations and general gene structures.
– Weak in exactly determining exon/intron borders.
– Unlikely to correctly determine TXSS and TXTS.
– Should be used with cDNA/EST from same species as genome.
38
Sequence & course material repository
http://gfx.dnalc.org/files/evidence
Don’t open items, save them to your computer!!
Annotation (sequences & evidence)
Manuals (DNA Subway, Apollo, JalView)
Presentations (.ppt files)
Prospecting (sequences)
Readings (Bioinformatics tools, splicing, etc.)
Worksheets (Word docs, handouts, etc.)
BCR-ABL (temporary; not course-related)
39
Canonical splice sites
[Diagram: pre-mRNA with exon, intron, 5’ splice site and 3’ splice site.] Reddy, A.S.N. Annu. Rev. Plant Biol. 2007 58:267-94
Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant J. (2004) 39, 877–885)
40
Multiple splice variants produced from the same gene An example from A. thaliana Not a rare event!!!
41
Alternative Splicing
Removing different segments from mRNAs leads to alternative splice forms of a gene/transcript.
Can occur in any part of the transcript, including UTRs, and can alter start codons, stop codons, reading frame, CDS, UTRs.
May alter stability/lifetime, translation (time, location, duration), protein sequence, or some or all of the above.
Alternative splice forms = Protein isoforms
Contributes to protein diversity.
Degree of alternative splicing varies with species.
DNALC Clip: http://dnalc.org/resources/3d/
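A toy model shows how keeping or skipping a segment yields different transcripts from the same pre-mRNA. The function name, the coordinates, and any sequence passed in are hypothetical; the sketch only joins chosen segments and does not model how the splicing machinery selects sites.

```python
def splice(pre_mrna, exons):
    """Join the chosen exon segments of a pre-mRNA into a mature transcript.

    exons -- list of (start, end) pairs, 0-based and end-exclusive,
             given in 5' to 3' order.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)
```

With a made-up pre-mRNA built from three exons and two GU…AG introns, passing all three exon ranges gives the full-length form, while omitting the middle range gives an exon-skipping variant with a shorter coding sequence.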
42
Alternative Splicing
The exons and introns of a particular gene get shuffled to create multiple isoforms of a particular protein.
First demonstrated in the late 1970s in adenovirus.
Fairly well characterized in animals (at least somewhat better than in plants).
Contributes to protein diversity.
Affects mRNA stability.
43
How are AS events detected?
Based on cDNA and EST data
Alignment against genome sequence
High-throughput RNA-seq
PCR-based assays
44
Alternative splicing in metazoans
Alternative splicing is well characterized in animals.
As many as 96% of human genes may have multiple splice forms.
The functional significance of alternative splicing is still poorly understood.
Alternative splicing in animals. Nature Genetics Research 36; 2004
Bridging the gap between genome and transcriptome. Nucleic Acids Research 32, 2004.
[Figure: Splice statistics for human genes]
45
Alternative splicing in plants
RuBisCo alternative splicing is one of the first plant examples:
“The data presented here demonstrate the existence of alternative splicing in plant systems, but the physiological significance of synthesizing two forms of rubisco activase remains unclear. However, this process may have important implications in photosynthesis. If these polypeptides were functionally equivalent enzymes in the chloroplast, there would be no need for the production of both….”
46
Biological significance of AS in plants
…includes:
- regulation of flowering;
- resistance to diseases;
- enzyme activity (timing, duration, turn-over time, location).
Most genome databases give alternatively spliced plant gene variants.
47
Example: Disease resistance in tobacco
- Nicotiana tabacum resistance gene N is involved in resistance to TMV.
- Alternative splicing is required to achieve resistance.
- Alternative transcripts N S (short) and N L (long).
- N S encodes full-length, N L a truncated protein.
- Splice variants produced by alternative splicing confer resistance (D).
- Splice variants produced by cDNAs do not confer resistance (A, B, C).
48
Example: Jasmonate signaling in Arabidopsis
- Plant hormone; affects cell division, growth, reproduction and responses to insects, pathogens, and abiotic stress factors.
- Jasmonate Signaling Repressor Protein JAZ10 splice variants JAZ10.1, JAZ10.3 and JAZ10.4 differ in susceptibility to degradation.
- Phenotypic effects include male sterility, altered root growth.
49
Example: Jasmonate signaling in Arabidopsis
- Alternative splice sites C’ and D’ lead to different splice variants.
- JAZ10.3: premature stop codon in D exon, intact JAS domain.
- JAZ10.4: truncated C exon, protein lacks JAS domain.
- JAZ10 is encoded by At5G13220.
50
Sequence & course material repository
http://gfx.dnalc.org/files/evidence
Don’t open items, save them to your computer!!
Annotation (sequences & evidence)
Manuals (DNA Subway, Apollo, JalView)
Presentations (.ppt files)
Prospecting (sequences)
Readings (Bioinformatics tools, splicing, etc.)
Worksheets (Word docs, handouts, etc.)
BCR-ABL (temporary; not course-related)