Predicting Genes in Actinobacteriophages

Slides:



Advertisements
Similar presentations
Application to find Eukaryotic Open reading frames. Lab.
Advertisements

Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Predicting Genes in Mycobacteriophages December 8, In Silico Workshop Training D. Jacobs-Sera.
Finding Eukaryotic Open reading frames.
Single DNA Sequence Analysis Tools BME 110: CompBio Tools Todd Lowe May 6, 2008.
Bacteriophage Gene Functions Welkin Pope SEA-PHAGES In Silico Workshop, 2014.
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Finding prokaryotic genes and non intronic eukaryotic genes
Annotation Presentation Alternative Start Codons &
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Doug Raiford Lesson 3.  Have a fully sequenced genome  How identify the genes?  What do we know so far? 10/13/20152Gene Prediction.
Organizing information in the post-genomic era The rise of bioinformatics.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop January 31, 2012.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012.
DNA and Translation Gene: section of DNA that creates a specific protein Approx 25,000 human genes Proteins are used to build cells and tissue Protein.
Bacteriophage Gene Functions Welkin Pope SEA-PHAGES Bioinformatics Workshop, 2015.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Bacterial infection by lytic virus
Bacteriophage Gene Functions
ORF Calling.
bacteria and eukaryotes
Genome Annotation (protein coding genes)
Bacterial infection by lytic virus
A Quest for Genes What’s a gene? gene (jēn) n.
What is a Hidden Markov Model?
Isolation and characterization of the A3 bacteriophage Kady from the host Mycobacterium smegmatis John Sherwood1, Victoria Torres1, Jasmina Cunmulaj1,
SEA-PHAGES Bioinformatics Workshop Overview
Step 3 in Protein Synthesis
Comparison of Cluster S Phages
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Gene architecture and sequence annotation
Ab initio gene prediction
More on translation.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Identification and Characterization of pre-miRNA Candidates in the C
Amino acids are coded by mRNA base sequences.
Isolation and Annotation of Arthrobacteriophage
Amino acids are coded by mRNA base sequences.
What do you with a whole genome sequence?
Transcription and Translation
Python.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Microbial gene identification using interpolated Markov models
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
Amino acids are coded by mRNA base sequences.
GOMASHI ANNOTATION GENES 1-13
4a. Know the general pathway by which ribosomes synthesize proteins, using tRNAs to translate genetic information in mRNA.
Amino acids are coded by mRNA base sequences.
Welkin Pope SEA-PHAGES Bioinformatics Workshop, 2017
Prokaryotes Eukaryotes  
Presentation transcript:

Predicting Genes in Actinobacteriophages 2017 Bioinformatics Workshop Training SEA-PHAGES Cohort 10 D. Jacobs-Sera

How did they get to be that way? Context What are they? How did they get to be that way?

Major capsid protein Tail assembly chaperone Terminase Protease Portal

It is all about finding the patterns… Since the beginning of time, woman (being human) has tried to make order and sense out of her surroundings. Gene annotation and analysis is just a primal instinct to make order. Young children, as they prepare to enter school, are tested to see if they are ready by recognizing patterns, a form of making order. 1. Where will the dot appear in the 4th box? Remember, everything you need to know, you learned in kindergarten….

Make-Believe or Putative Remember, you are working in the putative gene world. All gene predictions are made with the best evidence to date. Most of that evidence is computational (bioinformatic), not experimental. Tomorrow’s data may give us better evidence, but your prediction today is the best it can be … today! Make good predictions following a consistent approach. Let these predictions lead to experimentation that can provide the evidence to improve future predictions.

>2730 1429 Number in GenBank: 821 How many phage genome sequences are in GenBank? >2730 How many mycobacteriophage genomes are sequenced? 1429 How many mycobacteriophage genomes are published? Tricky Question Number in GenBank: 821 684 last yeat

What is the universal format for a sequence? How many As, Cs, Ts, and Gs are in a mycobacteriophage genome? On average: ~70,000 base-pairs Range: ~40,000 to ~165,000 bp What is the universal format for a sequence? FASTA

How do you make sense of the nucleotide sequence? Convert to genes How do you convert ATCGs to genes? Codons Code for Amino Acids, Starts, Stops

Phages use the Bacterial and Plant Plastid code (NCBI: Table 11) 3 starts ATG (methionine) GTG (valine) TTG (leucine) 3 stops (TAA, TAG, TGA) Space in-between: Open Reading Frame -- ORF www.cen.ulaval.ca

ATGGACCTCTCGCCC ATG GAC CTC TCG CCC TGG ACC TCT CGC …. If there are 3 choices (frames) in the forward direction, how many are in the reverse direction?

Six-Frame Translations

Features found in mycobacteriophage genomes protein coding genes ✓ tRNAs ✓ tmRNAs AttP sites Terminators Frame shifts ✓ …

Gene Evaluations For each feature we have 3 questions Is it a gene? What is its start? What is its function?

Gene Evaluations We are always looking for the supporting data. We use 2 programs, Glimmer and GeneMark, to identify coding potential. We use Phamerator output for a visual representation of gene and nucleotide similarity. We use the guiding principles to remind of us all of the parameters. As we evaluate, we can: Add a gene Delete a gene Change a gene start

Supporting Data #1: Coding Potential Glimmer and GeneMark Use Hidden Markov Models to identify coding potential Use a sample of the genome Identify longest ORFS in that sample Calculate patterns in the nucleotides: 2 at a time, 4 at a time A hidden Markov model (HMM) is one in which you observe a sequence of emissions, but do not know the sequence of states the model went through to generate the emissions. Analyses of hidden Markov models seek to recover the sequence of states from the observed data. http://www.mathworks.com/help/stats/hidden-markov-models-hmm.html

GLIMMER http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

GeneMark Output (trained on M. tuberculosis)

OF BACTERIOPHAGE GENOME ANNOTATION GUIDING PRINCIPLES OF BACTERIOPHAGE GENOME ANNOTATION Found in “Phage Annotation, Genomics and Data Interpretation” Section of the Bioinformatics Guide 15 Key Directives Read for tomorrow

Comparisons with what we already know Phamerator comparisons BLAST comparisons At NCBI ***At phagesDB*** Starterator data “New & Improved!!!”

Phamerator map

Starterator data

Blast Comparisons

Tips for DNA Master Files Save .dnam5 file often Save .dnam5 file as a new name. (Then don’t save the old named one.) If working in a virtual machine, be mindful about closing/shutting down

Let’s get started! Gather Data Auto-annotate in DNA Master Gene Calling Functional Assignments

Mycobacteriophage RagingRooster Genome length: 68670 Cluster B3 GC% content: 67.6 Physical ends: none, circularly permuted Found in Baltimore, MD by Karoline Baldus from Coastal Carolina University

DNA Master Current Build 2534 ✓ ✓ seaphages@gmail.com aviator31 ✓ ✓