Wellcome Trust Workshop Working with Pathogen Genomes Module 2 Gene Prediction.


The Annotation Process
[Diagram labels: DNA sequence; analysis software; useful information; annotator]

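Gene finding
[Flow diagram] Evidence is used to accurately predict a sample set of genes:
- sequence base composition
- sequence alignment to a related gene (e.g. an orthologue)
- sequence alignment to transcript data (e.g. ESTs)
The sample set becomes the training set for the gene finding software, which then predicts the full gene set.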

DNA in Artemis
[Screenshot labels: AT content; forward translations; reverse translations; DNA and amino acids]

Gene prediction programs: ORFs and CDSs
ORFs are not equivalent to CDSs: not all open reading frames are coding sequences.
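The ORF/CDS distinction can be illustrated with a minimal six-frame scan. This is only a sketch: the function names, the 0-based coordinates and the `min_codons` threshold are illustrative assumptions, and an ORF here simply means a stop-free stretch, not a predicted gene.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_orfs(seq, min_codons=3):
    """List stop-free stretches in all six frames.

    Each hit is (strand, frame, start, end), 0-based and end-exclusive,
    measured on the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        orfs.append((strand, frame, run_start, i))
                    run_start = i + 3
            # Stretch that is still open at the end of the sequence.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - run_start) // 3 >= min_codons:
                orfs.append((strand, frame, run_start, last))
    return orfs


toy = "ATGAAACCCGGGTAA"
print(find_orfs(toy))
```

In this toy sequence the longest stop-free stretch lies on the reverse strand, even though the only stretch that starts with ATG and ends with a stop codon is on the forward strand: a long ORF on its own is not evidence of a CDS.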

GC content
In AT-rich genomes, coding regions have a higher GC content than non-coding regions.
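A GC content plot of this kind can be sketched as a sliding-window calculation. The window and step sizes below are arbitrary assumptions, not the settings used in Artemis.

```python
def gc_windows(seq, window=120, step=30):
    """Return (window start, GC fraction) pairs along a DNA sequence."""
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        points.append((start, gc / window))
    return points


# Toy sequence: an AT-only stretch followed by a GC-only stretch.
toy = "AT" * 10 + "GC" * 10
print(gc_windows(toy, window=20, step=10))
```

Windows whose GC fraction rises above the genome background are candidates for coding regions in an AT-rich genome, to be checked against other evidence.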

GC content

CODON USAGE
Codon bias is different for each organism. Base composition is constrained in coding regions, but it is not constrained in non-coding regions. The codon usage of any particular gene can influence its expression.
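Codon usage is derived by counting in-frame codons across coding sequences. The sketch below assumes a single in-frame CDS given as DNA and reports codons in RNA letters; the function names are illustrative.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence, reported as RNA triplets."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def per_thousand(counts):
    """Convert raw codon counts to frequencies per thousand codons."""
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


counts = codon_counts("ATGGTTGTTGTATAA")
print(counts)
print(per_thousand(counts))
```

In practice the counts would be pooled over many genes from the same organism before comparing organisms or individual genes.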

Codon usage
All organisms have a preferred set of codons.

Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39

Codon Usage

Codon Usage Table
Each entry: codon, frequency per thousand codons, (count).

UUU  34.3 (26847)   UCU  15.3 (11956)   UAU  45.6 (35709)   UGU  15.3 (11942)
UUC   7.3 ( 5719)   UCC   5.3 ( 4141)   UAC   5.5 ( 4340)   UGC   2.4 ( 1872)
UUA  49.2 (38527)   UCA  18.2 (14239)   UAA   1.0 (  813)   UGA   0.2 (  188)
UUG  10.1 ( 7911)   UCG   2.8 ( 2154)   UAG   0.2 (  123)   UGG   5.2 ( 4066)
CUU   8.7 ( 6776)   CCU   9.1 ( 7148)   CAU  19.5 (15287)   CGU   3.3 ( 2561)
CUC   1.7 ( 1354)   CCC   2.5 ( 1982)   CAC   3.9 ( 3020)   CGC   0.5 (  354)
CUA   5.4 ( 4217)   CCA  13.1 (10221)   CAA  25.1 (19650)   CGA   2.4 ( 1878)
CUG   1.3 ( 1044)   CCG   0.9 (  742)   CAG   3.3 ( 2598)   CGG   0.2 (  184)
AUU  34.0 (26611)   ACU  12.8 (10050)   AAU 105.5 (82591)   AGU  21.6 (16899)
AUC   5.9 ( 4636)   ACC   5.5 ( 4312)   AAC  18.5 (14518)   AGC   3.8 ( 2994)
AUA  44.7 (34976)   ACA  22.8 (17822)   AAA  90.5 (70863)   AGA  16.9 (13213)
AUG  20.9 (16326)   ACG   3.8 ( 2951)   AAG  19.2 (15056)   AGG   3.9 ( 3091)
GUU  18.1 (14200)   GCU  12.5 ( 9811)   GAU  55.5 (43424)   GGU  16.6 (12960)
GUC   2.6 ( 2063)   GCC   3.2 ( 2541)   GAC   8.6 ( 6696)   GGC   1.6 ( 1269)
GUA  18.2 (14258)   GCA  12.6 ( 9871)   GAA  65.8 (51505)   GGA  16.7 (13043)
GUG   4.9 ( 3806)   GCG   1.1 (  890)   GAG  10.1 ( 7878)   GGG   2.9 ( 2243)
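The per-amino-acid fractions on the earlier codon usage slide can be recovered from the counts in this table. As a small worked example, using the four valine codon counts copied from the table (the function name is illustrative):

```python
# Valine codon counts copied from the codon usage table.
VALINE_COUNTS = {"GUU": 14200, "GUC": 2063, "GUA": 14258, "GUG": 3806}


def synonymous_fractions(counts):
    """Fraction of each codon among the codons for one amino acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


for codon, fraction in synonymous_fractions(VALINE_COUNTS).items():
    print(codon, round(fraction, 2))
```

Rounded to two decimals, the fractions are 0.41, 0.06, 0.42 and 0.11, which are the valine values listed for malaria.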

Codon Usage in Artemis
[Screenshot labels: forward frames; reverse frames]

Gene prediction: Amino acid usage: Correlation scores
Within each window, the plot shows the correlation between the amino acid usage in that window and the global amino acid usage in EMBL. "Magic number" = 52.7 (arbitrary units).
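A rough sketch of a windowed correlation score follows. It assumes Pearson correlation over the 20 amino acid frequencies, which the slide does not specify, and it takes the reference usage as an input instead of the EMBL-wide figures; the scaling behind the "magic number" is not reproduced.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def usage(protein):
    """Frequency of each of the 20 amino acids in a protein string."""
    n = len(protein)
    return [protein.count(aa) / n for aa in AMINO_ACIDS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def window_scores(protein, reference, window=10, step=5):
    """Correlate the usage in each window with a reference usage vector."""
    return [
        (start, pearson(usage(protein[start:start + window]), reference))
        for start in range(0, len(protein) - window + 1, step)
    ]


translation = "MKLVAAGLEDMKLVAAGLED"  # hypothetical translated frame
print(window_scores(translation, usage(translation)))
```

A frame whose translation resembles typical protein composition scores close to 1, while a non-coding frame tends to score lower.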

Gene prediction: Correlation scores
[Figure: M. tuberculosis NADH dehydrogenase operon]

Gene prediction: Positional base preference (FramePlot)
Plots the GC content at each position of each reading frame of the DNA sequence. In G+C-rich organisms the GC content of the 3rd base is often higher; in A+T-rich organisms it is lower. Gives good prediction of coding regions in malaria, trypanosomes and G+C-rich prokaryotes.
[Figure labels: G+C content of chromosome; frame-specific G+C content (frames 1, 2, 3)]
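The frame-specific calculation can be sketched as below. This is a simplified illustration that looks only at the third codon position of the three forward frames over a whole sequence; FramePlot itself works on sliding windows, and the function name is an assumption.

```python
def gc3_by_frame(seq):
    """GC fraction at the third codon position for forward frames 1, 2, 3."""
    seq = seq.upper()
    fractions = []
    for frame in range(3):
        third_bases = seq[frame + 2::3]  # every third base, starting at codon position 3
        if third_bases:
            gc = sum(base in "GC" for base in third_bases)
            fractions.append(gc / len(third_bases))
        else:
            fractions.append(0.0)
    return fractions


# Toy sequence in which only frame 1 has G at every third position.
print(gc3_by_frame("AAGAAGAAGAAG"))
```

In a G+C-rich genome, the frame whose third-position GC stands out from the other two is the likely coding frame; in an A+T-rich genome the signal is inverted.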

Genefinding programs
Genefinding software packages use Hidden Markov Models (HMMs). They predict coding, intergenic and intron sequences, and they need to be trained on a specific organism. They are never perfect!

What is an HMM?
- A statistical model that represents a gene.
- Similar to a "weight matrix", but one that can recognise gaps and treat them in a systematic way.
- Has different "states" that represent introns, exons, intergenic regions, etc.
- Considers the "state" of the preceding sequence.
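To make the idea of states concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. Every probability is invented for illustration (a real gene finder is trained on the target organism and has many more states), and none of this reflects the internals of any particular program.

```python
from math import log

STATES = ("noncoding", "coding")
# Hypothetical parameters for an AT-rich genome with GC-richer coding regions.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for a DNA string (log space)."""
    scores = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        row, pointers = {}, {}
        for s in STATES:
            # Best previous state, given that the current state is s.
            prev = max(STATES, key=lambda p: scores[-1][p] + log(TRANS[p][s]))
            row[s] = scores[-1][prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointers[s] = prev
        scores.append(row)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATATGCGCGCGC"))
```

Because each transition depends on the previous state, the model prefers long runs of one state over flipping at every base, which is how the "state of the preceding sequence" influences the call.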

A typical HMM

Gene prediction programs: Problems
ORFs are not equivalent to CDSs. Gene prediction programs find new genes that share properties with a given set of genes. They can be confounded by:
- Sequence constraints (ribosomal proteins etc.)
- Sequence biases
- Sequence quality
- Different sets of genes
- Horizontal gene transfer
- Non-coding DNA

Gene prediction programs: Problems
Sequence composition variation: Y. pestis ribosomal proteins
[Figure tracks: glimmer; orpheus; final]

Gene prediction programs: Problems
Non-protein coding regions: S. typhi ribosomal RNA genes
[Figure tracks: glimmer; genefinder; final; orpheus]

Gene prediction programs: Problems
Non-protein coding regions: N. meningitidis DNA repeats
[Figure tracks: glimmer; orpheus; final]

Gene prediction programs: Problems Pseudogenes M. leprae

Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer

Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS

Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis

Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation

Gene prediction programs: Statistics (Krogh + Larson, pers. comm.)
Test genome: Mycobacterium marinum; 6,636,827 bp, 65.7% G+C. Predictions were compared to a manually curated gene set of 5519 genes (incl. 46 pseudogenes).
Table columns: program; genes; same start and stop (%); same stop only (%); total sharing stop (%); false negative (%); false positive (%).
Table rows: Glimmer; GeneMark; Glimmer; EasyGene; Orpheus.
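The comparison categories in this table can be computed along the following lines. This is a hedged sketch: genes are represented as hypothetical (strand, start, stop) tuples, a prediction is matched to a curated gene by strand and stop coordinate, and the published figures may have been derived differently.

```python
def compare_predictions(predicted, curated):
    """Classify predicted genes against a curated set by shared stop codon.

    Genes are (strand, start, stop) tuples; the coordinates are illustrative.
    """
    curated_by_stop = {(strand, stop): start for strand, start, stop in curated}
    same_start_and_stop = same_stop_only = false_positive = 0
    matched = set()
    for strand, start, stop in predicted:
        key = (strand, stop)
        if key not in curated_by_stop:
            false_positive += 1
            continue
        matched.add(key)
        if curated_by_stop[key] == start:
            same_start_and_stop += 1
        else:
            same_stop_only += 1
    return {
        "same_start_and_stop": same_start_and_stop,
        "same_stop_only": same_stop_only,
        "total_sharing_stop": same_start_and_stop + same_stop_only,
        "false_negative": len(curated_by_stop) - len(matched),
        "false_positive": false_positive,
    }


curated = [("+", 100, 400), ("+", 500, 900), ("-", 1300, 1000)]
predicted = [("+", 100, 400), ("+", 560, 900), ("+", 2000, 2300)]
print(compare_predictions(predicted, curated))
```

Matching on the stop codon first reflects that gene finders usually agree on the reading frame and stop, while the choice of start codon is the more error-prone part.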

Gene prediction programs: Problems
Splicing: Plasmodium falciparum
[Figure tracks: original annotation; updated annotation]

Homology Data
Coding regions are more conserved than non-coding regions due to selective pressure. Comparing all possible translations of the sequence against all known proteins (BLASTX) will give clues to known genes.
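"All possible translations" means the six reading frames. The sketch below produces them using the standard genetic code, which is an assumption (some organisms need a different translation table); the database search itself is not shown.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def six_frame(seq):
    """Return the translations of frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


print(six_frame("ATGGCCTAA"))
```

Each of the six protein strings would then be compared against the protein database, and hits are reported on the frame that produced them.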

BLASTX

Blastx on frame lines

EST sequencing
[Diagram labels: mRNA (CAP, 5'UTR, M, exon, intron, stop, 3'UTR, AAAAAAAAAA); cDNA (TTTTTTTTT); EST]

ESTs

Showing Multiple Evidence

Schistosoma mansoni expression

The Gene Prediction Process
[Diagram labels: DNA sequence; analysis software (AT content, gene finders, codon usage, BlastX, FASTA, ESTs); useful CDS prediction; annotator]

Gene prediction in eukaryotes: HMMs
[Figure legend]
- highlighted: manually reviewed gene structure
- pale brown: hit to H. contortus EST cluster in Nembase, found using PASA
- brown-green: hit to H. contortus individual ESTs in NCBI database, found using PASA
- pink/red blocks: hits to Uniprot
- bright green: twinscan prediction (homology based)
- pale pink: snap prediction (ab initio)
- yellow: hmmgene prediction (ab initio)
- pale blue: genscan prediction (ab initio)
- red: genefinder (ab initio)
- dark blue: fgenesh prediction (ab initio)
- jade green: augustus hints prediction (homology based)
- orange: augustus prediction (ab initio)
- purple: genewise prediction (homology based)

P. falciparum gene predictions (PlasmoDB)
[Figure panels: A; B]

Gene prediction in eukaryotes: HMMs
Dictyostelium discoideum gene predictions
[Figure tracks: Bartfinder; hmmgene; geneid; Phat; EST (contig); combined prediction]

Manual refinement
[Figure tracks: P. falciparum; P. knowlesi]

Ongoing manual annotation
e.g. PF14_0021, PF14_0022
[Figure tracks: P. falciparum; P. vivax; revised annotation (back to two genes!)]

Using FASTA Results
FASTA is used here as a global alignment tool. Moving from BLAST to FASTA reduces sensitivity but increases specificity.