A very short introduction (in plants)
Presentation transcript:

Sequence & course material repository
- Annotation (sequences & evidence)
- Manuals (DNA Subway, Apollo, JalView)
- Presentations (.ppt files)
- Prospecting (sequences)
- Readings (Bioinformatics tools, splicing, etc.)
- Worksheets (Word docs, handouts, etc.)
- BCR-ABL (temporary; not course-related)

Manifestations of a Code Genes, genomes, bioinformatics and cyberspace – and the promise they hold for biology education

Plants are amazing – and so are their genomes
- Largest flower (~1 m)
- Oldest plant (> 5000 years)
- Tallest organism (> 100 m)
Slide: ASPB, 2009

What is a genome?
A GENOME is all of a living thing’s genetic material. The genetic material is DNA (DeoxyriboNucleic Acid). DNA, a double helical molecule, is made up of four nucleotide “letters”: A, G, T, and C.
Slide: JGI, 2009

What is sequencing?
Just as computer software is rendered in long strings of 0s and 1s, the GENOME or “software” of life is represented by a string of the four nucleotides, A, G, C, and T. To understand the software of either, a computer or a living organism, we must know the order, or sequence, of these informative bits.
Slide: JGI, 2009
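
Treating a genome as a string of letters can be illustrated with a few lines of Python. This is only a minimal sketch: the function name base_composition is an assumption made for illustration, and the fragment is the first 30 letters of the mouse-ear cress example sequence used in these slides.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Count how often each of the four nucleotide letters occurs."""
    # Drop whitespace and ignore case so pasted sequences work directly.
    cleaned = "".join(sequence.split()).upper()
    counts = Counter(cleaned)
    return {base: counts.get(base, 0) for base in "ACGT"}


# First 30 letters of the example sequence (illustrative fragment only).
fragment = "GAAATAATCAATGGAATATGTAGAGGTCTC"
composition = base_composition(fragment)
```

Counting letters says nothing about meaning yet; it only shows that once the order of the letters is known, the sequence can be processed like any other text.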

Exciting? >mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA 
CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT CTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCA GCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT

Much better

Annotation workflow
- Find gene families
- Generate mathematical evidence
- Analyze large data amounts
- Browse in context
- Build gene models
- Gather biological evidence
- Get DNA sequence

Walk or…

…take DNA Subway

Molecular biology and bioinformatics concepts

RepeatMasker
- Eukaryotic genomes contain large amounts of repetitive DNA.
- Transposons can be located anywhere.
- Transposons can mutate like any other DNA sequence.

FGenesH Gene Predictor
- Protein-coding information begins with a start codon, is followed by codons, and ends with a stop codon.
- Codons in mRNA (AUG, UAA,…) have sequence equivalents in DNA (ATG, TAA,…).
- Most eukaryotic introns have “canonical splice sites,” GT---AG (mRNA: GU---AG).
- Gene prediction programs search for patterns to predict genes and their structure.
- Different gene prediction programs may predict different genes and/or structures.

Multiple Gene Predictors
- The protein-coding sequence of an mRNA is flanked by untranslated regions (UTRs).
- UTRs hold information for the half-lives of mRNAs and regulatory purposes.
- Gene > mRNA > CDS.

BLAST Searches
- Gene or protein homologs share similarities due to common ancestry.
- Biological evidence is needed to curate gene models predicted by computers.
- mRNA transcripts and protein sequence data provide “hard” evidence for genes.
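
The correspondence between DNA codons and mRNA codons mentioned above can be sketched in Python. The sketch assumes the DNA is given as the coding (sense) strand, so the mRNA equivalent is obtained simply by writing U for T; the names dna_to_mrna and DNA_STOP_CODONS are illustrative and not part of any tool named in the slides.

```python
def dna_to_mrna(coding_strand: str) -> str:
    """Return the mRNA equivalent of a coding-strand DNA sequence (T -> U)."""
    return coding_strand.upper().replace("T", "U")


# Start codon and the three stop codons, written in DNA letters.
DNA_START_CODON = "ATG"
DNA_STOP_CODONS = ("TAA", "TAG", "TGA")

mrna_start = dna_to_mrna(DNA_START_CODON)
mrna_stops = [dna_to_mrna(codon) for codon in DNA_STOP_CODONS]

# The canonical intron boundaries GT---AG read GU---AG in the transcript.
mrna_splice_donor = dna_to_mrna("GT")
```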

How do we find genes?
- Search for them
- Look them up

How do I get to this…

From this… >mouse_ear_cress_1080 (the same raw sequence shown on the “Exciting?” slide above)

Meaning?

Mathematical Tools (Code; statistics)

Comparative Tools (Database searches)

What do we know about genes?

Expressed (transcribed)
- Transcriptional start & termination sites (TXSS, TXTS)
- Transcription artefacts (cDNA & ESTs)

Regulated
- Promoters (TATAAA)
- Transcription factor binding sites
- CpG (cytosine methylation)

Meaningful (translated)
- 3n basepairs
- Codon usage
- Translational start & stop/termination codons (TLSS, TLTS)
- Translation artefacts (proteins)

Spliced
- Splice sites (GT-AG)

Derived (homology: paralogy/orthology)
- Search for known genes, proteins (BLAST)

How might this knowledge help to find genes?

Predict genes
- Look for potential starts and stops.
- Connect them into open reading frames (ORFs).
- Filter for “correct” length & codon usage.

Search databases
- Known genes: UniGene
- Known proteins: UniProt

Use transcript evidence
- cDNA
- ESTs
- Proteins

Operating computationally
Go to beginning of sequence → start SCAN
If ATG → register putative TLSS; then
- Move in 3-steps & count steps (= COUNTS)
- If 3-step = (TAA or TAG or TGA) → register putative TLTS
- If register → evaluate COUNTS (= triplets)
  - If COUNTS < minimum → discard; then go behind ATG above and start SCAN
  - If COUNTS > maximum → discard; then go behind ATG above and start SCAN
  - If minimum < COUNTS < maximum → record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN
Arrive at end of sequence → stop SCAN
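
The scan described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the method of any particular gene predictor: the function name scan_orfs and the default thresholds are illustrative, COUNTS is taken to include the ATG but not the stop codon, only the forward strand is scanned, and introns are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_orfs(sequence: str, minimum: int = 2, maximum: int = 1000) -> list:
    """Return (start, end, counts) for each ATG...stop stretch that passes the filter.

    start is the index of the A in ATG (putative TLSS); end is the index just
    past the stop codon (putative TLTS); counts is the number of triplets
    from ATG up to, but not including, the stop codon.
    """
    seq = "".join(sequence.split()).upper()
    genes = []
    pos = 0
    while pos <= len(seq) - 3:                      # SCAN until end of sequence
        if seq[pos:pos + 3] == "ATG":               # register putative TLSS
            counts = 1                              # the ATG itself
            cursor = pos + 3
            while cursor <= len(seq) - 3:           # move in 3-steps
                if seq[cursor:cursor + 3] in STOP_CODONS:
                    # putative TLTS found: evaluate COUNTS
                    if minimum < counts < maximum:
                        genes.append((pos, cursor + 3, counts))
                    break
                counts += 1
                cursor += 3
        pos += 1                                    # go behind the ATG, resume SCAN
    return genes


toy_sequence = "CCATGAAATTTTAGGG"                   # hypothetical toy input
toy_genes = scan_orfs(toy_sequence, minimum=2, maximum=10)
```

Because the scan resumes just behind each ATG, an ATG nested inside a longer reading frame is evaluated separately, which is why the length filter matters. A stretch that reaches the end of the sequence without a stop codon is not recorded.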

Annotation workflow
- Find gene families
- Mathematical evidence
- Analyze large data sets
- Browse in context
- Construct gene models
- Biological evidence
- Browse results
- Get/Generate sequence

Annotation Cheat Sheet

A. DNA Subway
1. Open existing project or generate new (Red square)
2. Run RepeatMasker
3. Generate evidence (Predictions, BLAST searches)
4. Synthesize evidence into gene models (Apollo)
5. Browse results locally and in context (Phytozome)
6. Conduct functional analysis (link from Browser)
7. Prospect for gene family (Yellow Line from Browser)

B. Apollo
1. Select region that holds biological gene evidence
2. Optimize work space and zoom to region (View tab)
3. Expand all tiers (Tiers tab)
4. Drag evidence item(s) onto workspace (mouse)
5. Edit to match biol. evidence (right-click item for tools)
6. Record what was done in Annotation Info Editor
7. Assess necessity to build alternative model(s)
8. Upload model(s) to DNA Subway (File tab)

Predictors (mathematical evidence)
- Utilize predominantly mathematical (statistical) methods.
- Search for patterns
  - Some score starts, stops, splice sites (GenScan).
  - Some score nucleotides (Augustus, FGenesH).
- Few incorporate EST data and/or known genes/proteins.
- Require optimization for each new species (training).
- Accuracy:
  - False positives (scoring non-genes as genes): 5% to 50%.
  - False negatives (missed genes): 5% to 40%.
  - Weak at, or incapable of, determining first and last exons, and UTRs.
- Specific for gene models (spliced genes, non-spliced genes).
- Specialty predictors (tRNA Scan, RepeatMasker).
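
The two error types listed above can be made concrete with a short Python sketch that compares a set of predicted genes with a curated reference set. The slides do not say how the percentages were calculated, so the denominators used here (false positives relative to all predictions, false negatives relative to all reference genes) are an assumption, as are the function name and the gene labels.

```python
def prediction_error_rates(predicted: set, reference: set) -> tuple:
    """Return (false_positive_rate, false_negative_rate) for a gene predictor.

    Assumption: false positives are expressed as a fraction of all predictions,
    false negatives as a fraction of all reference genes.
    """
    false_positives = predicted - reference      # predicted, but not real genes
    false_negatives = reference - predicted      # real genes the predictor missed
    fp_rate = len(false_positives) / len(predicted) if predicted else 0.0
    fn_rate = len(false_negatives) / len(reference) if reference else 0.0
    return fp_rate, fn_rate


# Hypothetical gene labels, for illustration only.
predicted_genes = {"g1", "g2", "g3", "g4"}
reference_genes = {"g1", "g2", "g3", "g5", "g6"}
fp_rate, fn_rate = prediction_error_rates(predicted_genes, reference_genes)
```

In this toy case one of four predictions is wrong and two of five reference genes are missed.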

Search tools (biological evidence)
- Search sequence databases:
  - Known genes
  - Known proteins
  - cDNAs & ESTs
- Utilize alignment methods (BLAST, BLAT).
- Reliability:
  - Good in determining gene locations and general gene structures.
  - Weak in exactly determining exon/intron borders.
  - Unlikely to correctly determine TXSS and TXTS.
  - Should be used with cDNA/EST from same species.

mRNA Splicing
- During RNA processing, internal segments are removed from the transcript and the remaining segments are spliced together.
- Internal RNA segments that are removed are named introns; the spliced segments are defined as exons.
- Splicing causes mRNA to be “missing” segments present in the DNA template and primary transcript.
- Most transcripts in eukaryotes are spliced.
- Exception: 1-exon genes (no exons without introns).

Canonical splice sites
[Figure: pre-mRNA with exon, intron, 5’ splice site and 3’ splice site labelled. Reddy, S.N. Annu. Rev. Plant Biol]
Of 1588 examined predicted splice sites in Arabidopsis, 1470 sites (93%) followed the canonical GT…AG consensus. (Plant (2004) 39, 877–885)
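
A check for the canonical GT…AG consensus, together with the percentage quoted above, can be sketched in Python. The function name is_canonical_intron and the toy intron sequences are assumptions for illustration; the counts 1470 and 1588 are the ones given on the slide.

```python
def is_canonical_intron(intron: str) -> bool:
    """True if an intron (DNA letters) starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical toy introns, for illustration only.
canonical_example = is_canonical_intron("GTAAGTTTTCAG")
non_canonical_example = is_canonical_intron("GCAAGTTTTCAG")

# Share of canonical sites reported on the slide: 1470 of 1588.
canonical_percent = round(1470 / 1588 * 100)
```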

Alternative Splicing
Multiple splice variants = multiple proteins from the same gene
Not a rare event!!!
- Alternative splice sites C’ and D’ lead to different splice variants
- JAZ10.3: premature stop codon in D exon, intact JAS domain
- JAZ10.4: truncated C exon, protein lacks JAS domain
- JAZ 10 encoded by At5G13220
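
How one transcript can yield several mature mRNAs can be sketched in Python by joining different sets of exons. Everything in this sketch is hypothetical: the toy pre-mRNA is not the JAZ10 sequence, and it illustrates exon skipping rather than the alternative splice sites C’ and D’ described above. The function name splice and the coordinates are assumptions.

```python
def splice(pre_mrna: str, exons: list) -> str:
    """Join the given exon intervals (start, end), end exclusive, in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy pre-mRNA in DNA letters: exon, intron (GT...AG), exon, intron, exon.
toy_pre_mrna = "ATGGCC" + "GTCCAG" + "AAACCC" + "GTTTAG" + "GGGTGA"
exon_1, exon_2, exon_3 = (0, 6), (12, 18), (24, 30)

variant_full = splice(toy_pre_mrna, [exon_1, exon_2, exon_3])   # all exons kept
variant_skipped = splice(toy_pre_mrna, [exon_1, exon_3])        # exon 2 skipped
```

Both variants keep the start and stop codons of the toy gene but differ in the codons between them, so they would encode different proteins.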

Example: Jasmonate signaling in Arabidopsis
- Plant hormone; affects cell division, growth, reproduction and responses to insects, pathogens, and abiotic stress factors.
- Jasmonate Signaling Repressor Protein JAZ 10 splice variants JAZ 10.1, JAZ 10.3 and JAZ 10.4 differ in susceptibility to degradation.
- Phenotypic consequences include male sterility and altered root growth.

Example: Disease resistance in tobacco
- Nicotiana tabacum resistance gene N is involved in resistance to TMV.
- Alternative splicing is required to achieve resistance.
- Alternative transcripts N S (short) and N L (long).
- N S encodes a full-length protein, N L a truncated protein.
- Splice variants produced by alternative splicing confer resistance (D).
- Splice variants produced by cDNAs do not confer resistance (A, B, C).

…take DNA Subway