ORF Calling.

ORF Calling

ORF Calling: Why?
Need to know the protein sequence
Protein sequence is usually what does the work
Functional studies
Crystallography
Proteomics
Similarity studies
Proteins are better for remote similarities than DNA sequences
Protein sequences change more slowly than DNA sequences

ORF Calling
Extrinsic gene calling: compare your DNA sequences to known sequences. Needs other sequences that are known!
Intrinsic gene calling: only use information in your DNA sequences. Does not use other information.

Extrinsic gene calling
Start with DNA sequence
Translate in all 6 reading frames

Why are there 6 reading frames?
  3  AG TAA AAC TTT AAT TGT TGG TTA A
  2  A GTA AAA CTT TAA TTG TTG GTT AA
  1  AGT AAA ACT TTA ATT GTT GGT TAA
     AGT AAA ACT TTA ATT GTT GGT TAA
     ||| ||| ||| ||| ||| ||| ||| |||
     TCA TTT TGA AAT TAA CAA CCA ATT
 -1  TCA TTT TGA AAT TAA CAA CCA ATT
 -2  TC ATT TTG AAA TTA ACA ACC AAT T
 -3  T CAT TTT GAA ATT AAC AAC CAA TT
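The six frames on the slide above can be reproduced with a few lines of Python. This is a minimal sketch; the function names `reverse_complement`, `codons` and `six_frames` are made up for illustration. The three negative frames are read 5' to 3' on the reverse complement, which is the lower strand of the slide read from right to left.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons(dna, offset):
    """Split dna into complete codons, starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to the list of codons in that frame."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = codons(dna, offset)       # forward strand
        frames[-(offset + 1)] = codons(rc, offset)     # reverse strand
    return frames


if __name__ == "__main__":
    for frame, triplets in sorted(six_frames("AGTAAAACTTTAATTGTTGGTTAA").items()):
        print(f"{frame:>2}", " ".join(triplets))
```

Incomplete codons at the ends of a frame are dropped here, whereas the slide shows them as leftover one- or two-base fragments.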

Extrinsic gene calling
Start with DNA sequence
Translate in all 6 reading frames
Compare your sequence to known protein sequences
Find the ends of each, and call those genes!
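The translation step could be sketched as follows, assuming the standard genetic code and a sequence containing only A, C, G and T. The names `CODON_TABLE`, `translate` and `six_frame_translation` are illustrative, and the comparison against known proteins (for example with BLAST) is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' an unrecognised codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its protein translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    proteins = {}
    for offset in range(3):
        proteins[offset + 1] = translate(dna[offset:])
        proteins[-(offset + 1)] = translate(rc[offset:])
    return proteins
```

Each of the six protein strings would then be searched against a database of known proteins; the frame and coordinates of the best matches give the gene calls.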

For example
[Figure: a DNA sequence with similar protein sequences (e.g. from BLAST) aligned to it; the matched region is marked as a protein encoding gene]

Uses of extrinsic calling
This is how (most) metagenome ORF calling is done
Eukaryotic ORF calling, especially using EST sequences

Problems with extrinsic calling
Very slow (depending on search algorithm)
Dependent on your database
Only finds known genes

Alternatives to extrinsic gene calling
Intrinsic gene calling (ab initio gene calling)
What are the start codons? ATG
What are the stop codons? TAA, TAG, TGA

How frequently do stop codons appear?
Approximately once every 20 amino acids at random (3 of the 64 codons are stops)!
A random stretch of 100 amino acids is very likely to contain a stop codon!
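The arithmetic behind these numbers can be checked quickly. This sketch assumes all 64 codons are equally likely, which real genomes only approximate (composition varies with G+C content).

```python
STOP_CODONS = 3      # TAA, TAG, TGA
ALL_CODONS = 64      # 4 bases ** 3 positions

# Chance that a random codon is a stop, and the expected spacing between stops.
p_stop = STOP_CODONS / ALL_CODONS
expected_codons_between_stops = 1 / p_stop


def prob_no_stop(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (1 - p_stop) ** n_codons


if __name__ == "__main__":
    print(f"one stop every {expected_codons_between_stops:.1f} codons on average")
    print(f"P(no stop in 100 codons) = {prob_no_stop(100):.4f}")
```

Under this assumption a stop-free run of 100 codons arises by chance less than 1% of the time, which is why a length cutoff of about 100 amino acids removes most random ORFs.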

How to call ORFs (the easy way)
[Figure on each of the following slides: the DNA with its six reading frames, 3, 2, 1 above and -1, -2, -3 below]

Find all the stop codons

Find all the ORFs > x amino acids (x is often 100 amino acids)

Trim to those ORFs that have a start

Remove “shadow” ORFs: short ORFs that overlap others

Trim the start sites to first ATG

These are the ORFs
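The steps above can be sketched in Python. This is a rough illustration, not a production gene caller. Several choices are assumptions that the slides do not specify: ATG is the only start codon, ORFs running off the end of the sequence are ignored, and a "shadow" ORF is defined as one that is overlapped over more than half its length (`max_overlap`) by a longer ORF that was already kept.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def frame_orfs(seq, offset, min_codons):
    """ORFs in one frame as (start, end) on seq; end is just past the stop codon."""
    found = []
    stretch_start = offset
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] not in STOP_CODONS:
            continue
        # The stop-free stretch runs from stretch_start up to the stop at i.
        if (i - stretch_start) // 3 > min_codons:
            # Keep it only if it has a start, trimmed to the first in-frame ATG.
            for j in range(stretch_start, i, 3):
                if seq[j:j + 3] == "ATG":
                    found.append((j, i + 3))
                    break
        stretch_start = i + 3
    return found


def call_orfs(dna, min_codons=100, max_overlap=0.5):
    """Return (start, end, frame) tuples in forward-strand coordinates."""
    dna = dna.upper()
    n = len(dna)
    candidates = []
    for sign, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            for start, end in frame_orfs(seq, offset, min_codons):
                if sign == -1:
                    # Convert reverse-strand coordinates to the forward strand.
                    start, end = n - end, n - start
                candidates.append((start, end, sign * (offset + 1)))
    # Remove shadow ORFs: visit the longest first, drop heavily overlapped ones.
    kept = []
    for start, end, frame in sorted(candidates, key=lambda o: o[0] - o[1]):
        is_shadow = any(
            min(end, k_end) - max(start, k_start) > max_overlap * (end - start)
            for k_start, k_end, _ in kept
        )
        if not is_shadow:
            kept.append((start, end, frame))
    return sorted(kept)
```

Following the slide order, the length cutoff is applied to the stop-to-stop stretch before trimming to the first ATG, so a reported ORF can be shorter than `min_codons`.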

Intrinsic ORF calling using Markov Models

Markov Models
Based on language processing
Common for gene and protein finding, alignments, and so on

What is the most common word?
English: the
Spanish: el (la)
Portuguese: que

Scrabble

Scrabble
In Scrabble, how are the letters scored?
The most abundant letters (easiest to place on the board) are given the lowest score

Scrabble
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z

Frequency of letters

Making up sentences
If I want to make up a sentence, I could choose some letters at random, based on how frequently they occur in the language (i.e. their Scrabble score):
rla bsht es stsfa ohhofsd

Let's get clever!
What follows a period (“.”)? Usually a space “ ”
What follows a t? Usually an “i” (-tion, -tize, ...)

Frequency of two letters
When the first letter is “t” (from 3,269 words):
ti 51%
te 20%
ta 15%
th 8%

Level 1 analysis
Choose a letter based on the probability that it follows the letter before:
s h a n d t u c t h i n e y m e l e o l l d

Levels of analysis
1 letter (a, e, o …): zero order model
2 letters (th, ti, sh …): first order model
3 letters (the, and, …): second order model
4 letters (that, …): third order model
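A character-level Markov model of any order can be sketched in a few lines. The functions `train` and `generate` are illustrative names. Order 0 picks letters by their overall frequency, order 1 looks at the one letter before, and so on. When the generator reaches a context it never saw in training, this sketch simply jumps to a random known context; that is an arbitrary choice.

```python
import random
from collections import Counter, defaultdict


def train(text, order):
    """Count which letter follows each context of `order` letters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model


def generate(model, order, length, seed=0):
    """Generate `length` characters by sampling each letter from its context."""
    rng = random.Random(seed)
    out = rng.choice(sorted(model))          # start from a random known context
    while len(out) < length:
        followers = model.get(out[len(out) - order:])
        if not followers:
            out += rng.choice(sorted(model))  # unseen context: restart
            continue
        letters, weights = zip(*sorted(followers.items()))
        out += rng.choices(letters, weights=weights)[0]
    return out[:length]


if __name__ == "__main__":
    corpus = "the cat sat on the mat and then the rat ate that hat "
    for order in (0, 1, 2, 3):
        print(order, repr(generate(train(corpus, order), order, 40)))
```

With a larger training text, higher orders produce output that looks progressively more like real words.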

Markov models With about 10th order Markov models of English you get complete words and sentences!

Markov Models and ORF calling
Codons have three letters (ATG, CAC, GGG, ...)
Use a 2nd order Markov model for ORF calling
The probability of each letter is predicted from the two letters before it

Scrabble

Scrabble (México) Do English and Spanish use the same letters?

Scrabble (México)

Scrabble (US)
1 point: E, A, I, O, N, R, T, L, S, U
2 points: D, G
3 points: B, C, M, P
4 points: F, H, V, W, Y
5 points: K
8 points: J, X
10 points: Q, Z
Based on the front page of the NY Times!

Scrabble (Spanish)
1 point: A, E, O, I, S, N, L, R, U, T
2 points: D, G
3 points: C, B, M, P
4 points: H, F, V, Y
5 points: CH, Q
8 points: J, LL, Ñ, RR, X
10 points: Z

What about Scrabble scores for DNA?
They will vary with the composition of the organism!
Remember, some organisms have high G+C compared to A+T
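The DNA equivalent of letter frequencies is the base composition, which is easy to measure. A minimal sketch, with the illustrative names `base_frequencies` and `gc_content`; characters other than A, C, G and T (such as N) are ignored.

```python
from collections import Counter


def base_frequencies(dna):
    """Fraction of each of A, C, G, T among the unambiguous bases in dna."""
    counts = Counter(base for base in dna.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(dna):
    """Fraction of bases that are G or C."""
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]
```

Computing `gc_content` on genomes from different organisms shows why a model trained on one organism may score another organism's genes poorly.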

Markov Models and ORF calling
Use a 2nd order Markov model for ORF calling
The probability of each letter is predicted from the two letters before it
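A 2nd order model for DNA could look like the sketch below: it is trained on example coding sequences, then scores a candidate ORF by its log-likelihood. The function names and the add-one `pseudocount` are assumptions for illustration. Real gene finders generally also track codon position and compare against a non-coding background model, both of which are omitted here.

```python
import math
from collections import Counter, defaultdict


def train_dna_model(sequences, order=2, pseudocount=1.0):
    """Return prob(context, base): P(base | previous `order` bases)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1

    def prob(context, base):
        followers = counts.get(context, Counter())
        # The pseudocount keeps unseen combinations from getting probability 0.
        return (followers[base] + pseudocount) / (
            sum(followers.values()) + 4 * pseudocount
        )

    return prob


def log_likelihood(seq, prob, order=2):
    """Sum of log P(base | two bases before) along the sequence."""
    seq = seq.upper()
    return sum(
        math.log(prob(seq[i:i + order], seq[i + order]))
        for i in range(len(seq) - order)
    )
```

A candidate ORF that scores much higher under the trained model than a shuffled or random sequence of the same length is more likely to be a real gene.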

Problems!
Need to train the Markov model: not all organisms are the same
Can use phylogenetically close organisms
Can use “long ORFs”, which are likely to be correct because long stretches without a stop codon are unlikely to occur at random!

Interpolated Markov Model (the IMM in GLIMMER)
Markov models of order 1-8 (word size 2-9)
Discard (or ↓ weight) for rare words
Promote (or ↑ weight) for common words
Probability is the weighted sum of the probabilities from all orders 1-8 (word sizes 2-9)
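The idea of combining several orders can be illustrated with a toy version. This is not GLIMMER's actual weighting scheme. It is a simplified sketch in which each order's prediction is weighted by how often its context was seen in training, and contexts seen fewer than `min_count` times are discarded. Both the weighting rule and the parameter names are assumptions.

```python
from collections import Counter, defaultdict


def train_imm(sequences, max_order=8):
    """Count followers of every context of length 1..max_order."""
    counts = {k: defaultdict(Counter) for k in range(1, max_order + 1)}
    for seq in sequences:
        seq = seq.upper()
        for k, table in counts.items():
            for i in range(len(seq) - k):
                table[seq[i:i + k]][seq[i + k]] += 1
    return counts


def imm_probability(counts, history, base, min_count=5):
    """Weighted average over orders of P(base | last k bases of history)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for k, table in counts.items():
        if len(history) < k:
            continue
        followers = table.get(history[-k:])
        if not followers:
            continue
        seen = sum(followers.values())
        if seen < min_count:
            continue                       # rare word: discard this order
        weighted_sum += seen * (followers[base] / seen)  # common words count more
        total_weight += seen
    # With no usable context at all, fall back to a uniform guess.
    return weighted_sum / total_weight if total_weight else 0.25
```

Long contexts are informative but rarely seen often enough to trust, so the short contexts carry the prediction unless a long context is well supported.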

RNA genes
As with proteins, two main methods:
Ab initio (intrinsic)
Homology based (extrinsic)

Ribosomes Ribosomes are made of proteins and RNA

30S subunit from Thermus aquaticus Blue: protein Orange: rRNA

E. coli 16S rRNA secondary structure

Variable region Conserved region

Variable regions in the 16S rRNA
Vn: the 9 variable regions; (n): variable loop(s); forward/rev primers
V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49)
Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391

Ribosomes
Ribosomes are made of proteins and RNA
Prokaryotic ribosome:
Large subunit: 50S (5S and 23S rRNA genes)
Small subunit: 30S (16S rRNA gene)

Finding 16S genes
Easiest way is iterative: BLAST, align, trim
Problem: secondary structure makes identification of the ends difficult

Finding tRNA genes
Not as easy as rRNA
Much shorter
Varied sequence
Only conservation is 2° structure

tRNAscan-SE (Sean Eddy)
Use it!

How does this relate to tRNA? tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg

tRNA structure
Start of acceptor stem (7-9 bp)
D-loop: (4-6 bp) stem plus loop
Anticodon arm: (6 bp) stem plus loop with anticodon
T-loop: (4-5 bp) stem plus loop
End of acceptor stem (7-9 bp)
CCA to attach amino acid (may not be in the sequence; added during processing)