Genome Annotation Haixu Tang School of Informatics.

Slides:



Advertisements
Similar presentations
From Gene to Protein How Genes Work
Advertisements

Probabilistic sequence modeling I: frequency and profiles Haixu Tang School of Informatics.
An Introduction to Bioinformatics Finding genes in prokaryotes.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Gene Prediction: Statistical Approaches Lecture 22.
Introduction to Molecular Biology. G-C and A-T pairing.
1. Important Features a. DNA contains genetic template" for proteins.
Biological Motivation Gene Finding in Eukaryotic Genomes
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
FROM GENE TO PROTEIN: TRANSCRIPTION & RNA PROCESSING Chapter 17.
Central Dogma First described by Francis Crick
3. Genome Annotation: Gene Prediction. Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions.
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Nature and Action of the Gene
Genome Annotation BBSI July 14, 2005 Rita Shiang.
Chapter 10 genome, gene expression; genes as units of inheritance transmission of heritable characteristics; gene regulation, eukaryote chromosomes, alleles.
Gene Prediction in silico Nita Parekh BIRC, IIIT, Hyderabad.
Gene prediction. Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
RNA and Protein Synthesis
RNA and Protein Synthesis
Genome sequencing Haixu Tang School of Informatics.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Lecture 10, CS5671 Neural Network Applications Problems Input transformation Network Architectures Assessing Performance.
1 TRANSCRIPTION AND TRANSLATION. 2 Central Dogma of Gene Expression.
1 Genes and How They Work Chapter Outline Cells Use RNA to Make Protein Gene Expression Genetic Code Transcription Translation Spliced Genes – Introns.
Chapter 13. The Central Dogma of Biology: RNA Structure: 1. It is a nucleic acid. 2. It is made of monomers called nucleotides 3. There are two differences.
Gene Prediction: Statistical Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 20, 2005 ChengXiang Zhai Department of Computer Science.
From Genomes to Genes Rui Alves.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Protein Synthesis Chapter 17. Protein synthesis  DNA  Responsible for hereditary information  DNA divided into genes  Gene:  Sequence of nucleotides.
Eukaryotic Gene Structure. 2 Terminology Genome – entire genetic material of an individual Transcriptome – set of transcribed sequences Proteome – set.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
CHAPTER 13 RNA and Protein Synthesis. Differences between DNA and RNA  Sugar = Deoxyribose  Double stranded  Bases  Cytosine  Guanine  Adenine 
Microbial Genetics.  DNA replication is semi- conservative:  What does it mean? During cell division, each daughter cell inherits 2 DNA strands, One.
DNA and RNA II Sapling Chapter 6 short version You are responsible for textbook material covered by the worksheets. CP Biology Paul VI Catholic High School.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Finding genes in the genome
The ‘redundant’ code.
Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012.
CFE Higher Biology DNA and the Genome Transcription.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
From Gene to Protein Transcription and Translation.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Eukaryotic Gene Structure
RNA and Protein Synthesis
Which of the following would be the corresponding amino acid sequence that would be translated as a protein product of the following segment of DNA? A.
Molecular Biology DNA Expression
Wishful Adaptation Paragraph
Chapter 10 How Proteins are Made.
Overview: The Flow of Genetic Information
Chapter 13: Protein Synthesis
Gene architecture and sequence annotation
Recitation 7 2/4/09 PSSMs+Gene finding
Outline What is an amino acid / protein
More on translation.
Introduction to Bioinformatics II
Gene Prediction: Statistical Approaches
Genes and How They Work Chapter 15
Central Dogma Central Dogma categorized by: DNA Replication Transcription Translation From that, we find the flow of.
Chapter 14: Protein Synthesis
The Toy Exon Finder.
12-3: RNA and Protein Synthesis (part 1)
Presentation transcript:

Genome Annotation Haixu Tang School of Informatics

Genome and genes Genome: an organism’s genetic material (Car encyclopedia) Gene: a discrete units of hereditary information located on the chromosomes and consisting of DNA. (Chapters to make components of a car, or to use and drive a car).

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg Gene!

Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions of genes in a genome Gene Prediction: Computational Challenge

Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA Central Dogma: DNA -> RNA -> Protein

Codon: 3 consecutive nucleotides 4 3 = 64 possible codons Genetic code is degenerative and redundant –Includes start and stop codons –An amino acid may be coded by more than one codon (codon degeneracy) Translating Nucleotides into Amino Acids

In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations Systematically deleted nucleotides from DNA –Single and double deletions dramatically altered protein product –Effects of triple deletions were minor –Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein Codons

UAA, UAG and UGA correspond to 3 Stop codons that (together with Start codon ATG) delineate Open Reading Frames Genetic Code and Stop Codons

Six Frames in a DNA Sequence stop codons – TAA, TAG, TGA start codons - ATG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

Detect potential coding regions by looking at ORFs –A genome of length n is comprised of (n/3) codons –Stop codons break genome into segments between consecutive Stop codons –The subsegments of these that start from the Start codon (ATG) are ORFs ORFs in different frames may overlap Genomic Sequence Open reading frame ATGTGA Open Reading Frames (ORFs)

Long open reading frames may be a gene –At random, we should expect one stop codon every (64/3) ~= 21 codons –However, genes are usually much longer than this A basic approach is to scan for ORFs whose length exceeds certain threshold –This is naïve because some genes (e.g. some neural and immune system genes) are relatively short Long vs.Short ORFs

Testing ORFs: Codon Usage Create a 64-element hash table and count the frequencies of codons in an ORF Amino acids typically have more than one codon, but in nature certain codons are more in use Uneven use of the codons may characterize a real gene This compensate for pitfalls of the ORF length test

Codon Usage in Human Genome

AA codon /1000 frac Ser TCG Ser TCA Ser TCT Ser TCC Ser AGT Ser AGC Pro CCG Pro CCA Pro CCT Pro CCC AA codon /1000 frac Leu CTG Leu CTA Leu CTT Leu CTC Ala GCG Ala GCA Ala GCT Ala GCC Gln CAG Gln CAA Codon Usage in Mouse Genome

Transcription in prokaryotes Coding region Promoter Transcription start side Untranslated regions Transcribed region start codon stop codon 5’ 3’ upstream downstream Transcription stop side

Microbial gene finding Microbial genome tends to be gene rich (80%-90% of the sequence is coding sequence) Major problem – finding genes without known homologue.

Open Reading Frame Open Reading Frame (ORF) is a sequence of codons which starts with start codon, ends with a stop codon and has no stop codons in-between. Searching for ORFs – consider all 6 possible reading frames: 3 forward and 3 reverse Is the ORF a coding sequence? 1.Must be long enough (roughly 300 bp or more) 2.Should have average amino-acid composition specific for a given organism. 3.Should have codon usage specific for the given organism.

Gene finding using codon frequency frequency in coding regionfrequency in non-coding region Input sequence Compare Coding region or non-coding region

Example Codon position ACTG 128%33%18%21% 232%16%21%32% 333%15%14%38% frequency in genome 31%18%19%31% Assume: bases making codon are independent P(x|in coding) P(x|random) =  P(Ai at ith position) P(Ai in the sequence) i Score of AAAGAT:.28*.32*.33*.21*.26*.14.31*.31*.31*.31*.31*.19

Using codon frequency to find correct reading frame Consider sequence x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 x 9…. where x i is a nucleotide let p 1 = p x1 x2 x3 p x3 x4 x5 …. p 2 = p x2 x3 x4 p x5 x6 x7 …. p 3 = p x3 x4 x5 p x6 x7 x8 …. then probability that ith reading frame is the coding frame is: p i p 1 + p 2 + p 3 Algorithm: slide a window along the sequence and compute P i Plot the results P i =

Eukaryotic gene finding On average, vertebrate gene is about 30KB long Coding region takes about 1KB Exon sizes vary from double digit numbers to kilobases An average 5’ UTR is about 750 bp An average 3’UTR is about 450 bp but both can be much longer.

Exons and Introns In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) This makes computational gene prediction in eukaryotes even more difficult Prokaryotes don’t have introns - Genes in prokaryotes are continuous

Gene Structure

Gene structure in eukaryotes Promoter Transcription start side Untranslated regions Transcribed region start codon stop codon 5’ 3’ Transcription stop side exons Initial exon Final exon donor and acceptor sides GT AG

Central Dogma and Splicing exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding

Splicing Signals Exons are interspersed with introns and typically flanked by GT and AG

Splice site detection 5’ 3’ Donor site Position %

Consensus splice sites Donor: 7.9 bits Acceptor: 9.4 bits

Promoters Promoters are DNA segments upstream of transcripts that initiate transcription Promoter attracts RNA Polymerase to the transcription start site 5’ Promoter 3’

Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes. Two Approaches to Eukaryotic Gene Prediction

Ribosomal Binding Site

( Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996) Donor and Acceptor Sites: Motif Logos

Similarity-based gene finding Alignment of –Genomic sequence and (assembled) EST sequences –Genomic sequence and known (similar) protein sequences –Two or more similar genomic sequences

Cell or tissue Isolate mRNA and Reverse transcribe into cDNA Clone cDNA into a vector to Make a cDNA library 5’ 3’ EST Pick a clone And sequence the 5’ and 3’ Ends of cDNA insert dbEST Submit To dbEST Vectors Expressed Sequence Tags

Central Dogma and Splicing exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding

Splicing Sequence Alignment Potential splicing sites

Using Similarities to Find the Exon Structure Human EST (mRNA) sequence is aligned to different locations in the human genome Find the “best” path to reveal the exon structure of human gene EST sequence Human Genome

An annotated gene in human genome