Sequence Analysis with Artemis and Artemis Comparison Tool (ACT). Caribbean Bioinformatics Workshop, 18th-29th January, 2010.



Genome Informatics Workshop: Gene Finding

atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt 
ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa

Sequencing is just the beginning of the process. The next step is extracting and interpreting the information: what is there, where are the genes, which genes are they, and how do we find them? SEQUENCE ANNOTATION

Strategies for sequence annotation
Predictive methods: interpretation of the DNA sequence into genes according to rules
Comparative methods: interpretation of the DNA sequence into genes according to similarities with other sequences
Experimental methods: interpretation of the DNA sequence into genes according to experimental results (e.g. cDNA)

EST Blast Hit

Gene prediction programs: ORFs and CDSs. ORFs are not equivalent to CDSs: not all open reading frames (ORFs) are coding sequences (CDSs).
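The distinction between ORFs and CDSs is easier to see once ORFs are actually enumerated. The following is a minimal sketch, not the method of any gene finder named in these slides: it scans all six reading frames for ATG-to-stop stretches above a length cut-off. The function names, the returned tuple layout and the default `min_codons=100` are assumptions made for illustration.

```python
# Illustrative ORF scan: ATG ... stop in all six frames (names are hypothetical).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=100):
    """List ORFs as (strand, frame, start, end, n_codons).

    start/end are 0-based, end-exclusive coordinates on the scanned strand
    (for "-" they refer to the reverse-complemented sequence). n_codons counts
    the codons from the first ATG up to, but not including, the stop codon.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if codon == "ATG" and start is None:
                    start = pos
                elif codon in STOP_CODONS:
                    if start is not None:
                        n_codons = (pos - start) // 3
                        if n_codons >= min_codons:
                            orfs.append((strand, frame, start, pos + 3, n_codons))
                    start = None
    return orfs


example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "C"
print(find_orfs(example, min_codons=5))
```

Every tuple returned is only an ORF. Whether it is a real CDS still has to be judged from other evidence, such as codon usage or similarity to known proteins.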

Gene prediction programs: GeneFinder, Glimmer, ORPHEUS, PHAT, GeneMark

Gene finding: accurately predict a sample set of genes (using sequence base composition, sequence alignment to a related gene such as an orthologue, and sequence alignment to transcript data such as ESTs); use that sample as the training set for the gene finding software; the software then predicts the full gene set.

Gene finding programs. Gene finding software packages use Hidden Markov Models. They predict coding, intergenic and intron sequences. They need to be trained on a specific organism. Never perfect!
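To give a feel for how a Hidden Markov Model labels a sequence, here is a toy two-state sketch decoded with the Viterbi algorithm. It is far simpler than any real gene finder: the two states, and every start, transition and emission probability below, are invented for illustration (a real package would estimate them from an organism-specific training set and would also model codon structure and introns).

```python
import math

# Toy model: all numbers below are made-up assumptions, not trained values.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    # AT-rich background, with relatively GC-richer coding sequence.
    "noncoding": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "coding": {"A": 0.30, "T": 0.30, "G": 0.20, "C": 0.20},
}


def viterbi(seq):
    """Return the most probable state path for an A/C/G/T string."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at the first base.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per later position
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT")
print("".join("C" if s == "coding" else "n" for s in path))
```

Changing the emission table changes the labelling, which is one way to see why a model trained on one organism can perform poorly on another.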

Gene prediction programs: Problems
ORFs are not equivalent to CDSs. Gene prediction programs find new genes that share properties with a given set of genes. They can be confounded by:
- Sequence constraints (ribosomal proteins etc.)
- Sequence biases
- Different sets of genes
- Horizontal gene transfer
- Non-coding DNA

Gene prediction programs: Problems. Different gene training sets: Plasmodium falciparum (original annotation vs. updated annotation)

Gene prediction programs: Problems. Non-protein coding regions: S. typhi ribosomal RNA genes (tracks: glimmer, genefinder, orpheus, final)

Gene prediction programs: Problems. Non-protein coding regions: N. meningitidis DNA repeats (tracks: glimmer, orpheus, final)

Gene prediction programs: Problems Pseudogenes: M. leprae

Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer

Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS

Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis

Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation

Gene prediction programs: Statistics
[Table; most values did not survive extraction.]
Columns: Organism, Size (Mb), G+C, CDS prediction (Glimmer, ORPHEUS, other, Final)
Rows: Campylobacter jejuni, Neisseria meningitidis A, Salmonella typhi, Yersinia pestis, Mycobacterium leprae
Surviving entries: Start-to-stop >100 aa; 1605 intact, 1115 pseudo
Sources: TIGR CMR; GeneFinder (Krogh + Larson, pers. comm.)

The Gene Prediction Process. DNA SEQUENCE; ANALYSIS SOFTWARE (AT content, gene finders, codon usage, BlastX, FASTA, ESTs); Annotator; Useful CDS prediction

Eukaryotic gene (diagram labels): 5' UTR; Exon I with ATG start; introns bounded by GT ... AG; Exon II; Exon III with stop; 3' UTR; mRNA with CAP and poly(A) tail (AAAAAAAAAA); cDNA (TTTTTTTTT); ESTs

AT content. Coding regions have a higher GC content in AT-rich genomes.
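The observation above suggests a simple screen: slide a window along the sequence and look for stretches where GC content rises above the AT-rich background. The sketch below shows one way to compute such a profile. The function names, window size and step are arbitrary choices for illustration and are not taken from Artemis.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=120, step=60):
    """Return (window_start, gc_fraction) pairs for full-length windows only."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy example: an AT-only stretch followed by a GC-only stretch.
print(windowed_gc("AAAATTTT" + "GGGGCCCC", window=8, step=8))
```

Windows whose GC fraction stands out from their neighbours are candidates worth inspecting, not confirmed coding regions.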

AT content

CODON USAGE. Codon bias is different for each organism. Base composition in coding regions is constrained, but it is not constrained in non-coding regions. The codon usage of any particular gene can influence its expression.

Codon usage. All organisms have a preferred set of codons. Relative usage of the four valine codons:

Codon  Malaria  Trypanosoma
GUU    0.41     0.28
GUC    0.06     0.19
GUA    0.42     0.14
GUG    0.11     0.39

Codon Usage

Codon Usage Table (each entry: codon, frequency per thousand codons, count in parentheses)

UUU  34.3 (26847)   UCU  15.3 (11956)   UAU  45.6 (35709)   UGU  15.3 (11942)
UUC   7.3 ( 5719)   UCC   5.3 ( 4141)   UAC   5.5 ( 4340)   UGC   2.4 ( 1872)
UUA  49.2 (38527)   UCA  18.2 (14239)   UAA   1.0 (  813)   UGA   0.2 (  188)
UUG  10.1 ( 7911)   UCG   2.8 ( 2154)   UAG   0.2 (  123)   UGG   5.2 ( 4066)
CUU   8.7 ( 6776)   CCU   9.1 ( 7148)   CAU  19.5 (15287)   CGU   3.3 ( 2561)
CUC   1.7 ( 1354)   CCC   2.5 ( 1982)   CAC   3.9 ( 3020)   CGC   0.5 (  354)
CUA   5.4 ( 4217)   CCA  13.1 (10221)   CAA  25.1 (19650)   CGA   2.4 ( 1878)
CUG   1.3 ( 1044)   CCG   0.9 (  742)   CAG   3.3 ( 2598)   CGG   0.2 (  184)
AUU  34.0 (26611)   ACU  12.8 (10050)   AAU 105.5 (82591)   AGU  21.6 (16899)
AUC   5.9 ( 4636)   ACC   5.5 ( 4312)   AAC  18.5 (14518)   AGC   3.8 ( 2994)
AUA  44.7 (34976)   ACA  22.8 (17822)   AAA  90.5 (70863)   AGA  16.9 (13213)
AUG  20.9 (16326)   ACG   3.8 ( 2951)   AAG  19.2 (15056)   AGG   3.9 ( 3091)
GUU  18.1 (14200)   GCU  12.5 ( 9811)   GAU  55.5 (43424)   GGU  16.6 (12960)
GUC   2.6 ( 2063)   GCC   3.2 ( 2541)   GAC   8.6 ( 6696)   GGC   1.6 ( 1269)
GUA  18.2 (14258)   GCA  12.6 ( 9871)   GAA  65.8 (51505)   GGA  16.7 (13043)
GUG   4.9 ( 3806)   GCG   1.1 (  890)   GAG  10.1 ( 7878)   GGG   2.9 ( 2243)
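The counts in this table can be turned into per-amino-acid fractions by dividing each codon's count by the total for its synonymous codons. As a small worked example, the sketch below does this for the four valine codons (GUU, GUC, GUA, GUG), using the counts copied from the table. The function name is an illustrative assumption.

```python
# Valine codon counts copied from the codon usage table above.
VALINE_COUNTS = {"GUU": 14200, "GUC": 2063, "GUA": 14258, "GUG": 3806}


def synonymous_fractions(counts):
    """Convert raw counts for one amino acid's codons into fractions summing to 1."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


fractions = synonymous_fractions(VALINE_COUNTS)
for codon, frac in fractions.items():
    print(f"{codon} {frac:.2f}")
```

The rounded results (0.41, 0.06, 0.42, 0.11) agree with the valine fractions listed for Malaria on the earlier "Codon usage" slide, which suggests, though the slide does not say so, that this table is the malaria one.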

Codon Usage in Artemis: forward frames and reverse frames

Codon usage & gene finding in Leishmania

Transcriptional units in Leishmania: DNA strand-switches

GC frame plot. Plots the third-position GC content of each frame of a DNA sequence. In coding DNA the GC content of the 3rd base is often higher. A good predictor of coding regions in malaria and trypanosomes.
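The quantity behind a GC frame plot can be sketched in a few lines: for each window, take every third base starting at the third codon position of each of the three forward frames and compute its GC fraction. This is an illustration of the idea only, not the Artemis implementation. The function name and the default window and step are assumptions, and the step is kept a multiple of 3 so that frames stay aligned between windows.

```python
def gc3_by_frame(seq, window=120, step=30):
    """Return (window_start, (gc3_frame0, gc3_frame1, gc3_frame2)) per window.

    Frame k is the reading frame whose first codon starts at offset k, so its
    third codon positions are at offsets k+2, k+5, ... within the window.
    step should be a multiple of 3 so the frames line up across windows.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        row = []
        for frame in range(3):
            thirds = chunk[frame + 2::3]  # third codon positions of this frame
            gc = sum(base in "GC" for base in thirds)
            row.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(row)))
    return rows


# Toy example: repeated ATG has G at every third position of frame 0 only.
print(gc3_by_frame("ATG" * 10, window=30, step=30))
```

In a real plot the frame whose curve rises clearly above the other two over a region is the likely coding frame there. The reverse strand can be handled by applying the same function to the reverse complement.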

GC frame plot of tubulin gene cluster on T. brucei Chr 1

Large-scale nucleotide plots in Artemis: S. typhi genome (GC content, GC deviation, Karlin signature)

Homology data. Coding regions are more conserved than non-coding regions due to selective pressure. Comparing all possible translations of the sequence against all known proteins (BLASTX) will give clues to known genes.
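"All possible translations" means the six reading frames: three on the forward strand and three on the reverse complement. The sketch below produces them so they could, for instance, be compared against a protein database. It assumes the standard genetic code (NCBI translation table 1), and the function names and frame labels are illustrative choices rather than the interface of any BLAST program.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' an unrecognised codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for label, protein in sorted(six_frame_translations("ATGGCTTAA").items()):
    print(label, protein)
```

A significant protein match in one frame, and none in the others, is a strong hint about both the presence and the frame of a gene.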

Gene finding: using ACT TBLASTX comparisons P. knowlesi P. falciparum P. yoelii

Using FASTA / BLAST Results. FASTA is a global alignment tool; BLAST is a local alignment tool. (Panels: BLAST, FASTA)

Functional assignment: alignments of modular proteins A B A B C A B C

Gene finding by RNA-Seq (transcriptional landscape of Neospora caninum). Tracks: Day 3 tachyzoites (RNA-Seq), Day 4 tachyzoites (RNA-Seq)

Transcriptome sequencing in Neospora: TBLASTX matches between N. caninum Chr08 and T. gondii Chr08 visualised in ACT, with Day 3 and Day 4 tachyzoite RNA-Seq tracks and the 5' UTR and 3' UTR marked. RNA-Seq is useful for predicting/confirming UTR boundaries.

RNA-Seq: correcting gene models. Panels: Before and After, each with a %GC plot. Legend: 16hr, 32hr, 48hr