Genome Annotation BBSI July 14, 2005 Rita Shiang.



Genome Annotation
Identification of important components in genomic DNA

What is a Gene?
Fundamental unit of heredity
DNA involved in producing a polypeptide; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns)
Entire DNA sequence including exons, introns, and noncoding transcription-control regions

What Components are Important in Protein Coding Genes?
Sequences that initiate transcription
Sequences that process hnRNA to mRNA
Signals important in translation

TATA Box
Lodish et al., Molecular Cell Biology, 2000, Fig.

Other Promoters
Initiator consensus – 5’ Py Py A(+1) N T/A Py Py Py (N = A, T, G or C; Py = pyrimidine = C or T)
GC-rich sequences
– Stretch of GC nucleotides ~100 bp upstream of start site (CpG not common in genome)
– Housekeeping genes
– Multiple initiation sites
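The initiator consensus can be written as a regular expression over DNA. The sketch below is only an illustration: the function name `find_initiators` is hypothetical, it scans one strand only, and real promoter prediction weighs many more signals than an exact consensus match.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py,
# where Py = C or T and N = any base. The lookahead allows overlapping hits.
INR = re.compile(r"(?=([CT]{2}A[ACGT][TA][CT]{3}))")

def find_initiators(dna):
    """Return 0-based positions of the +1 A for every exact consensus match."""
    dna = dna.upper()
    # The +1 A sits two bases after the start of each match.
    return [m.start() + 2 for m in INR.finditer(dna)]

hits = find_initiators("GGGCCAGTCTCGGG")
```

Here `hits` holds the position of the A in `CCAGTCTC`, the only stretch of the toy sequence that fits the consensus.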

Polyadenylation & Cleavage
Addition of a string of As to mRNAs
Polyadenylation signal AAUAAA found before cleavage site
GU or UU rich region ~50 bp from the cleavage site
Stabilizes mRNA transcripts
Lodish et al., Molecular Cell Biology, 2000, Fig.
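A minimal sketch of locating candidate polyadenylation signals, assuming the transcript is given as a plain string of bases. The function name `find_polya_signals` is hypothetical, and the sketch looks only for exact AAUAAA matches; it does not check the GU or UU rich region near the cleavage site.

```python
def find_polya_signals(seq, signal="AAUAAA"):
    """Return 0-based start positions of every exact signal match."""
    # Accept DNA or RNA input by converting T to U.
    rna = seq.upper().replace("T", "U")
    hits = []
    start = rna.find(signal)
    while start != -1:
        hits.append(start)
        # Step one base forward so overlapping matches are not missed.
        start = rna.find(signal, start + 1)
    return hits

positions = find_polya_signals("GGAATAAAGGCCAAUAAAGG")
```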

Splicing
Lodish et al., Molecular Cell Biology, 2000, Fig. 11,13. Electron micrograph of adenovirus DNA and hexon gene mRNA

Splice Reaction
Lodish et al., Molecular Cell Biology, 2000, Fig.

Splice Sites
Lodish et al., Molecular Cell Biology, 2000, Fig. 11,14.

Additional Splice Sites
Consensus: Py7NCAG-G (exon) AG-GUAAGU 98.12%
Nonconsensus: GC, U12 introns AC PuUAUCCUPy 0.76%
Other rare sequences 1%
Py = C or U, Pu = A or G
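Splice site classes are commonly named after the first and last two bases of the intron. The sketch below sorts an intron sequence into the GU-AG, GC-AG, or AU-AC class by those terminal dinucleotides. The function name `classify_intron` is hypothetical, and the sketch ignores the longer consensus and branch-point sequences, so it cannot by itself tell U2 introns from U12 introns.

```python
# Terminal dinucleotide pairs (5' end, 3' end) and their class labels.
INTRON_CLASSES = {
    ("GU", "AG"): "GU-AG",
    ("GC", "AG"): "GC-AG",
    ("AU", "AC"): "AU-AC",
}

def classify_intron(intron):
    """Classify an intron by its first and last two bases."""
    rna = intron.upper().replace("T", "U")
    if len(rna) < 4:
        return "too short"
    ends = (rna[:2], rna[-2:])
    return INTRON_CLASSES.get(ends, "other")

label = classify_intron("GTAAGTccccttttcag")
```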

Translation Signals
5’ cap structure directs ribosomal binding
AUG codes for methionine. The first AUG in a transcript is where translation starts
Open reading frame (ORF) – Stretch of sequence that codes for amino acids before a stop codon
Translation stop codons UAG, UAA, UGA
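The translation rules on this slide can be sketched as a small ORF finder: start at the first AUG, read codon by codon, and stop at the first in-frame UAG, UAA, or UGA. The function name `first_orf` is hypothetical, and the sketch takes the "first AUG" rule literally without any sequence-context checks.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def first_orf(mrna):
    """Return the ORF from the first AUG through the first in-frame stop codon,
    or None if there is no AUG or no in-frame stop."""
    rna = mrna.upper().replace("T", "U")
    start = rna.find("AUG")
    if start == -1:
        return None
    # Walk in steps of three from the start codon.
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None

orf = first_orf("GGAUGAAACCCUAGGG")
```

In the toy transcript the UGA that overlaps the start codon is out of frame, so reading continues to the in-frame UAG.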

Capping of 5’ RNA with 7-methylguanylate (m7G)
Lodish et al., Molecular Cell Biology, 2000, Fig.

Known Gene Components
Lodish et al., Molecular Cell Biology, 2000, Fig.

Genome Annotation
What is in a genome besides protein coding genes?

Repetitive DNA makes up at least 50% of the genome
Transposon-derived interspersed repeats
Inactive retroposed copies of genes – pseudogenes
Simple short repeats
Segmental duplications
Blocks of tandemly repeated sequences
– Centromeres
– Telomeres
– Short arm of acrocentric chromosomes
– Ribosomal gene clusters
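Of the repeat classes listed, simple short repeats are the easiest to illustrate in code. The sketch below finds perfect tandem repeats with a regular expression. The function name `find_simple_repeats`, the unit size limit of 6 bp, and the minimum of 4 copies are all assumptions chosen for the example, and dedicated repeat-masking tools also handle imperfect repeats and the other classes.

```python
import re

def find_simple_repeats(dna, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats of 1-6 bp units."""
    # The lazy quantifier prefers the shortest repeating unit; the
    # backreference then requires at least min_copies - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    repeats = []
    for m in pattern.finditer(dna.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        repeats.append((m.start(), unit, copies))
    return repeats

found = find_simple_repeats("GGCACACACACATT")
```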

Non-protein coding genes or non-coding RNA (ncRNA)
tRNA genes
rRNA genes
snRNA genes
– Splicing
– Telomere maintenance
snoRNA genes
Other
– microRNA

Annotation of Genomic DNA
Identifying Protein Coding Genes
Placing the genes on the genome (where are they?)

How Many Genes in the Genome?
Early on, based on reassociation kinetics, the estimate was ~40,000
Walter Gilbert estimated ~100,000 based on gene and genome size
70,000 – 80,000 based on an extrapolated number of CpG islands
With the human sequence, the estimate is 30,000 – 40,000

Annotation of Genomic DNA Specifically for Genes that Code for Proteins
Match genomic DNA to genes that have been previously cloned and sequenced, looking for sequence similarity using BLAST programs
Predict genes using computer programs to scan genomic DNA using known elements
Many strategies use a combination of both methods

cDNA Library Construction
Lodish et al., Molecular Cell Biology, 2000, Fig.

Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15

Gene Annotation Celera
Constructed gene models using sequence from cDNAs
Used UniGene database
– Partitions GenBank sequences (mRNAs & ESTs) into non-redundant set using 3’ UTRs
– 111,064 UniGene clusters for human

Gene Annotation Celera cont.
Predicts gene boundaries by identifying overlapping sets of EST and protein matches
Known full-length genes were annotated on the map (matched w/50% of the length & >92% identity)
Clusters that did not match a full-length gene were evaluated using other references
– Conservation of genomic sequence between mouse & human
– Similarity between human & rodent transcripts
– Similarity to known proteins
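The match criteria for known full-length genes amount to a two-part threshold test. The sketch below reads "50% of the length" as at least 50% coverage, which is an assumption, as are the function name `is_full_length_match` and its parameters. It illustrates the filter only and is not Celera's actual pipeline.

```python
def is_full_length_match(aligned_bases, identical_bases, gene_length,
                         min_coverage=0.50, min_identity=0.92):
    """Apply the slide's thresholds: coverage of the known gene and
    percent identity over the aligned region."""
    if gene_length <= 0 or aligned_bases <= 0:
        return False
    coverage = aligned_bases / gene_length      # fraction of the gene aligned
    identity = identical_bases / aligned_bases  # fraction of aligned bases that match
    # Assumption: coverage threshold is inclusive, identity threshold is strict (>92%).
    return coverage >= min_coverage and identity > min_identity

accepted = is_full_length_match(aligned_bases=600, identical_bases=570, gene_length=1000)
```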

Validation
Validated by construction of known genes (RefSeq)
6.1% of RefSeq genes were not annotated by Otto

Gene Annotation - Human Genome Sequencing Consortium
Start with Ensembl predicted genes
– ab initio predictions using Genscan, based on a probabilistic model of genome sequence composition and gene structure
– Confirm similarity to mRNAs, ESTs, protein motifs from all organisms
– Extend protein matches using GeneWise, which compares protein-based information to genomic sequence and allows for frameshifts and large introns
– Produces partial gene predictions

Consortium cont.
Merge Ensembl gene predictions w/ Genie predictions
– Genie identifies matches of mRNAs and ESTs
– Employs hidden Markov models (HMMs) to extend matches using ab initio statistical methods
– Links information from 5’ and 3’ ESTs from the same cDNA clone to complete a sequence from the ATG to the stop codon
– Can generate alternatively spliced products (though only the longest is used in this build)
Merge results with genes in RefSeq, SWISSPROT and TrEMBL databases

Validation
Validate method by comparing to a new set of known genes, a set of mouse cDNAs and genes on Chromosome 22 (finished sequence)
85% sensitivity
13% spurious predictions
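Figures such as sensitivity and the share of spurious predictions can be computed by comparing predicted genes against a reference set. The sketch below assumes sensitivity is the fraction of known genes recovered and the spurious rate is the fraction of predictions with no known counterpart. These definitions, the function name `evaluate_predictions`, and the toy gene identifiers are assumptions for illustration; the consortium's own procedure may differ.

```python
def evaluate_predictions(predicted, known):
    """Return (sensitivity, spurious_rate) for two collections of gene IDs."""
    predicted, known = set(predicted), set(known)
    recovered = predicted & known
    # Sensitivity: share of known genes that were predicted.
    sensitivity = len(recovered) / len(known) if known else 0.0
    # Spurious rate: share of predictions that match no known gene.
    spurious_rate = len(predicted - known) / len(predicted) if predicted else 0.0
    return sensitivity, spurious_rate

# Hypothetical toy identifiers, not real data.
sens, spur = evaluate_predictions({"g1", "g2", "g3", "x1"}, {"g1", "g2", "g3", "g4"})
```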

Factors Affecting Gene Annotation
Splice sites do not conform to consensus
Noncoding exons are common
– Exon: what remains after splicing removes the introns; the term does not refer to a stretch of coding information
– tRNAs are spliced but noncoding
– >35% of human genes have noncoding exons
– No statistical bias, so they are difficult to identify

Factors Affecting Gene Annotation Cont.
Internal exons can be very small
– Avg. size of internal exons is ~130 bp
– ~65% of vertebrate exons are bp
– >10% are <60 bp
– Exons < 10 bp have been identified
– Invected gene in Drosophila
  One of four exons is 6 bp (GTCGAA)
  Flanked by introns of 27.6 and 1.1 kb
  Not correctly recognized by cDNA alignment software and creates a frameshift in the gene
– Exons of size 0
  Resizing exons create an intermediate splice product

Places to View Annotated Genomes
National Center for Biotechnology Information (NCBI)
Ensembl
The Golden Path (UCSC Genome Browser)
Celera

Verification of Annotation in C. elegans by Experimentation
Complete genomic sequence
Small introns
Small intergenic regions

Results
11,984 cDNAs successfully cloned out of a prediction of 19,477
4,365 were not represented by cDNAs or ESTs
Failure of cloning could be due to:
– Wrongly predicted exons
– Very low expressing genes
– Not a real gene

Verification of intron/exon structures

Comparison of a Single Transcript

Greater than 50% of intron/exon structures need correcting?