Biological Motivation Gene Finding


Biological Motivation Gene Finding Anne R. Haake Rhys Price Jones

Gene Finding: Why do it?
Find and annotate all the genes within the large volume of DNA sequence data. How many genes are there in an organism? What homologies exist?
Gain understanding of problems in basic science, e.g. gene regulation: what mechanisms are involved in transcription, splicing, etc.?
The relative emphasis placed on these two goals affects the design of computational approaches to gene finding.

Gene Finding by Biological Methods
Extract mRNA -> reverse transcribe to cDNA -> label the cDNA -> use the labeled cDNA as a probe against a DNA library -> gene found.

Gene Finding by Computational Methods
Dependent on good experimental data to build reliable predictive models.
Various aspects of gene structure and function provide the information used in gene finding programs.

[Figure 12.3] In prokaryotes, these processes (transcription and translation) are coupled. In eukaryotes, they are physically separated and there are more steps.

The Informatics View of Genes
Genes are character strings embedded in a much larger string called the genome.
Genes are composed of ordered elements associated with the fundamental genetic processes, including transcription, splicing, and translation.

Gene Finding
Cells recognize genes directly from the DNA sequence: they find genes through their own biological processes. It is not so easy for us.

CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...
Even when an entire genome has been sequenced, much work remains to make sense of the sequence. The task may appear simple, but it is not. For example, where do genes begin, and where do they end? How do we find the individual changes in sequence that are meaningful? Some are not, because not all sequence codes for the amino acids of proteins, and even within coding sequence some differences are tolerated. It is like finding a needle in a haystack. The example here is from the gene that, when mutated, is responsible for cystic fibrosis.


Types of Genes
Protein-coding genes: most genes.
RNA genes: rRNA, tRNA, snRNA (small nuclear RNA), snoRNA (small nucleolar RNA).
snRNAs are usually associated with proteins and are then known as snRNPs ("snurps"); examples are those involved in splicing.
snoRNAs are involved in the processing of rRNAs and perhaps in ribosome assembly.
Eukaryotic RNA polymerases: Pol I transcribes most rRNAs; Pol II transcribes mRNAs and most snRNAs and snoRNAs; Pol III transcribes tRNAs and 5S rRNA.

3 Major Categories of Information Used in Gene Finding Programs
Signals/features: sequence patterns with functional significance, e.g. splice donor and acceptor sites, start and stop codons, and promoter features such as TATA boxes, TF binding sites, and CpG islands.
Content/composition: statistical properties of coding vs. non-coding regions, e.g. codon bias, length of ORFs in prokaryotes, GC content.
Similarity: compare the DNA sequence to known sequences in a database, not only known proteins but also cDNAs and ESTs (expressed sequence tags). Comparison against proteins involves translation in all 6 possible reading frames. Repetitive sequences need special handling.

Looking for Protein-Coding Genes
Look for an ORF: it begins with a start codon, ends with a stop codon, has no internal stops, and is long (usually > 60-100 aa). If it is homologous to a "known" protein, it is more likely to be a gene.
Look for basal signals: transcription, splicing, translation.
Look for regulatory signals.
What to look for depends on the organism: prokaryotes vs. eukaryotes, vertebrates vs. fungi. In yeast, ~1% of genes have ORFs < 100 aa.
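The ORF criterion above (start codon, no internal stops, stop codon, minimum length) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_orfs`, the default threshold of 60 codons, and the restriction to the forward strand with ATG as the only start codon are choices made here, not something prescribed by the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=60, start_codons=("ATG",)):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the first base of the start codon, end is the
    index just past the stop codon, frame is 0, 1 or 2.
    Hypothetical helper written for illustration only.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first start codon seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon in start_codons:
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3  # coding codons, stop excluded
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

To cover the opposite strand as well, the same function could be run on the reverse complement of the sequence.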

Easier Problem: Gene Finding in Bacterial Genomes
Why? Dense genomes; short intergenic regions; uninterrupted ORFs; conserved signals; abundant comparative information; complete genomes available for many species.

What do Prokaryotic Genes look like?
From 5' to 3': promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe).

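Prokaryotic Gene Expression
DNA: Promoter, Cistron 1, Cistron 2, ..., Cistron N, Terminator.
Transcription (RNA polymerase) produces one polycistronic mRNA, 5' to 3', carrying cistrons 1, 2, ..., N.
Translation (ribosome, tRNAs, protein factors) uses the SD sequences in the polycistronic message and produces separate polypeptides (1, 2, 3, ...), each with its own N and C terminus.
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt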

Open Reading Frame (ORF)
Any stretch of DNA that potentially encodes a protein.
The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene.

Open Reading Frames
A C G T A A C T G A C T A G G T G A A T
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5'->3' direction and a further three in the reverse direction on the opposite strand. A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF).
Frame 1: ACG TAA CTG ACT AGG TGA AT
Frame 2:  CGT AAC TGA CTA GGT GAA
Frame 3:   GTA ACT GAC TAG GTG AAT
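As a rough illustration of the six reading frames, the following Python sketch splits a sequence into triplets for each frame on both strands. The function names and the "+1".."-3" frame labels are assumptions introduced here for the example; incomplete trailing triplets are simply dropped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to lists of triplets."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = codons(seq, k)   # forward strand
        frames[f"-{k + 1}"] = codons(rc, k)    # opposite strand
    return frames

frames = six_frames("ACGTAACTGACTAGGTGAAT")
print(" ".join(frames["+2"]))
```

Applied to the 20-nucleotide example from the slide, the three forward frames reproduce the triplet groupings shown there.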

ORFs as Gene Candidates
A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent).
Most prokaryotic genes code for proteins that are 60 or more amino acids in length.
The probability that a random sequence of n codons has no stop codons is (61/64)^n, assuming all 64 codons are equally likely.
When n is 50, there is a probability of about 91% that the random sequence contains a stop codon.
When n is 100, this probability exceeds 99%.
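The arithmetic behind these probabilities can be checked with a short Python sketch. It assumes, as the slide's formula implicitly does, that the 64 codons are equally likely and independent, which real genomes only approximate; the function name is made up for this example.

```python
def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon.

    Assumes 61 of the 64 equally likely codons are non-stop and that
    codons are independent (a simplification).
    """
    return (61 / 64) ** n_codons

for n in (50, 100):
    p_stop = 1 - prob_no_stop(n)
    print(f"n = {n}: P(at least one stop codon) = {p_stop:.3f}")
```

This is why long ORFs are informative: a stop-free stretch of 100 codons is unlikely to arise by chance in random sequence.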

Codon Bias
The genetic code is degenerate: equivalent triplet codons code for the same amino acid (http://www.pangloss.com/seidel/Protocols/codon.html).
Codon usage varies from organism to organism and from gene to gene.
Biological basis: avoidance of codons similar to stop codons; preference for codons that correspond to abundant tRNAs within the organism.

Codon Bias: Gene Differences
Codon      GAL4   ADH1
Gly GGG    0.21   0
Gly GGA    0.17   0
Gly GGT    0.38   0.93
Gly GGC    0.24   0.07
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
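Per-gene fractions like the Gly values above can be computed by counting codons in a coding sequence and normalizing within each group of synonymous codons. The sketch below is one way to do that in Python; it assumes the standard genetic code and an in-frame coding sequence, and the names `CODON_TABLE` and `relative_codon_usage` are introduced here for illustration.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def relative_codon_usage(cds):
    """Fraction of each codon among the codons for the same amino acid.

    cds is assumed to be an in-frame coding sequence; triplets containing
    characters other than A, C, G, T are ignored.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    totals = defaultdict(int)
    for codon, n in counts.items():
        if codon in CODON_TABLE:
            totals[CODON_TABLE[codon]] += n
    return {
        codon: n / totals[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE
    }
```

Comparing the resulting fractions between two genes, or between a candidate ORF and the genome-wide usage, gives a simple content-based signal.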

Codon Bias: Organism Differences
Yeast genome: Arg is specified by AGA 48% of the time (the other five equivalent codons ~10% each).
Fruit fly genome: Arg is specified by CGC 33% of the time (the other five ~13% each).
A complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/

GC Content
The proportion of GC relative to AT is a distinguishing feature of bacterial genomes. It varies dramatically across species and can serve as a means of identifying bacterial species.
It varies for several biological reasons: mutational bias of particular DNA polymerases; DNA repair mechanisms; horizontal gene transfer (transformation, transduction, conjugation).

GC Content
GC content may be different in recently acquired genes than elsewhere in the genome. This can lead to variations in the frequency of codon usage within coding regions: there may be significant differences in codon bias between different genes of a single bacterium's genome.
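One simple way to look for regions whose GC content departs from the rest of a genome is a sliding-window scan. The Python sketch below is only an illustration: the window and step sizes are arbitrary defaults chosen here, and ambiguous bases such as N are left out of the denominator.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def windowed_gc(seq, window=1000, step=500):
    """Return (window_start, gc_fraction) pairs along the sequence."""
    last_start = max(len(seq) - window, 0)
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, last_start + 1, step)
    ]
```

Windows whose GC fraction differs markedly from the genome-wide value are candidates for recently acquired DNA, though other explanations are possible.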

Ribosome Binding Sites
The RBS is also known as a Shine-Dalgarno sequence. It is species-dependent and should bind well to the 3' end of the 16S rRNA (part of the ribosome). It is usually found within 4-18 nucleotides upstream of the start codon of a true gene.

Shine-Dalgarno Sequence
A nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs. This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF's start codon, that improves the ORF's candidacy.
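The check described above can be sketched as a search for the best ungapped match to AGGAGG upstream of a candidate start codon. Several details are assumptions made for this example: the distance is measured as the number of nucleotides between the end of the candidate site and the start codon, the score is a plain count of matching positions, and the function name is hypothetical.

```python
SD_CONSENSUS = "AGGAGG"

def best_sd_match(seq, start_pos, min_gap=4, max_gap=18):
    """Find the best Shine-Dalgarno-like site upstream of a start codon.

    start_pos is the index of the first base of the start codon.
    Returns (score, gap, site), where score counts positions matching the
    consensus and gap is the spacer length, or None if there is no room.
    """
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        site_start = start_pos - gap - k
        if site_start < 0:
            break  # ran off the 5' end of the sequence
        site = seq[site_start:site_start + k]
        score = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if best is None or score > best[0]:
            best = (score, gap, site)
    return best
```

A gene finder might add the score, or a function of it, to an ORF's overall evidence rather than treating it as a hard filter, since real sites often deviate from the consensus.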

Bacterial Promoter
-35 box: T82 T84 G78 A65 C54 A45
Spacer: 16-18 bp
-10 box: T80 A95 T45 A60 A50 T96
+1 (transcription start): A or G
The number after each base is the percentage of promoters that have that base at that position.
Not so simple: remember, these are consensus sequences.
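Because these are consensus sequences rather than exact motifs, a scan has to score partial matches. The Python sketch below uses the percentages from the slide as simple additive weights; this scoring scheme is an assumption chosen for illustration (a real tool would more likely use a full position weight matrix with log-odds scores), and the names are hypothetical.

```python
# Consensus base and its conservation (%) at each position, from the slide.
MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]

def box_score(site, box):
    """Sum the percentages at positions where site matches the consensus."""
    return sum(w for base, (cons, w) in zip(site, box) if base == cons)

def scan_promoters(seq, spacers=(16, 17, 18)):
    """Return the best (score, position_of_-35_box, spacer) or None."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq)):
        for sp in spacers:
            j = i + 6 + sp  # start of the -10 box for this spacer length
            if j + 6 > len(seq):
                continue
            score = box_score(seq[i:i + 6], MINUS_35) + box_score(seq[j:j + 6], MINUS_10)
            hits.append((score, i, sp))
    return max(hits) if hits else None
```

A perfect match to both boxes scores 834 under this scheme; real promoters typically score lower, which is the point of the slide's caution.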

Termination Sequences
Stem/loop: an inverted repeat immediately preceding the run of uracils.
3'-U tail: a run of uracils at the 3' end of the transcript.
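A terminator of this kind can be searched for in DNA as an inverted repeat (the stem) separated by a short loop and followed by a run of T (U in the transcript). The Python sketch below requires a perfect inverted repeat, which is a strong simplification; the stem length, loop range and T-run length are arbitrary illustrative defaults, and the function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_terminators(seq, stem=6, loop_range=(3, 8), min_t_run=4):
    """Return (stem_start, loop_length) for perfect stem-loops followed by Ts."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem):
        left = seq[i:i + stem]
        for loop in range(loop_range[0], loop_range[1] + 1):
            r_start = i + stem + loop
            right = seq[r_start:r_start + stem]
            tail = seq[r_start + stem:r_start + stem + min_t_run]
            # The right arm must pair with the left arm, then the T run follows.
            if right == reverse_complement(left) and tail == "T" * min_t_run:
                hits.append((i, loop))
    return hits
```

Allowing mismatches in the stem, or scoring stem stability, would make the search more realistic at the cost of more candidates to evaluate.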