Genomics: Gene Prediction and Annotations
Kishor K. Shende, Information Officer, Bioinformatics Center, Barkatullah University, Bhopal



Gene Prediction Strategies
[Figure: Prokaryotic gene architecture: promoter (-10 and -35 regions), initiation codon ATG, consecutive coding regions for Protein 1, Protein 2 and Protein 3, termination codon TAA, TAG or TGA. Eukaryotic gene architecture: promoter and regulatory sequences, initiation codon ATG, Exon-1, Intron-1, Exon-2 with splicing sites, termination codon TAA, TAG or TGA.]

Codon Usage Tables
- Most amino acids can be encoded by several synonymous codons
- Each organism has a characteristic pattern of codon usage
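A codon usage table can be sketched as a simple count of codons in a known coding sequence. The function name `codon_usage` and the short test sequence below are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing 1-2 bases.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy sequence (hypothetical): ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

In practice the counts would be pooled over many known genes of one organism, and frequencies are usually reported per amino acid rather than over all codons.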

Problems in Gene Prediction
- Distinguishing pseudogenes from functional genes
- Exon-intron structure in eukaryotes: the regions flanking exons are not well conserved
- Alternative splicing: shuffling of exons
- Genes can overlap each other and occur on either strand of the DNA

Gene Identification
1. Homology-based gene prediction
- Sequence similarity search against gene databases using the BLAST and FASTA search tools
- EST (Expressed Sequence Tag) similarity search
2. Ab initio gene prediction
- Prokaryotes: ORF finding
- Eukaryotes: promoter prediction, start and stop codon prediction, splice site prediction (exon-intron and intron-exon boundaries), poly(A) signal prediction

ORF Finding in Prokaryotes
Easier because:
- Small genomes have high gene density (Haemophilus influenzae: 85% genic)
- No introns, or very few
- Operons: one transcript, many genes
- Open reading frame (ORF): a contiguous set of codons that starts with the Met codon and ends with a stop codon

1. ORF Finding
- Simplest method
- An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
- Six possible reading frames: three start positions (position 1, 2 or 3) on the sense strand and three on the antisense strand
[Figure: sense and antisense strands with 5' and 3' ends, example sequence ATGCCATCAG TGCCATTGTA, start codon, reading-frame positions 1 to 3. Central dogma: DNA → mRNA → protein]
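The six reading frames can be enumerated with a few lines of Python. This is a minimal sketch; the helper names `reverse_complement` and `six_frames` are assumptions for illustration, and the input is the first part of the example sequence on the slide.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def six_frames(seq):
    """Return the codons of all six reading frames, keyed by (strand, offset)."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept.
            frames[(strand, offset)] = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
    return frames

frames = six_frames("ATGCCATCAG")
for key in sorted(frames):
    print(key, frames[key])
```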

ORF Prediction: based on the positions of the start codon and the stop codon
- Start codon: AUG
- Stop codon: UAG, UAA or UGA
- ORF between start and stop codon: protein-coding region, codes for a protein
- No protein: due to the presence of many in-frame stop codons
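The start/stop rule can be sketched as a scan of each forward reading frame. This is an illustrative simplification, not the method of any particular tool: the function name `find_orfs`, the `min_codons` threshold and the test sequence are assumptions, and the code works on the DNA alphabet (ATG, TAA, TAG, TGA) rather than the mRNA codons shown on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Find ATG...stop ORFs in the three forward frames.

    Returns (start, end, frame) tuples; coordinates are 0-based,
    end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Hypothetical test sequence with one ORF in frame 2.
orfs = find_orfs("CCATGAAATTTGGGTAGCC")
print(orfs)
```

To cover the antisense strand as well, the same function can be run on the reverse complement of the sequence.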

Example of ORF
Each sequence has six possible reading frames, three for each direction of transcription, and an ORF may occur in any of them.

Difficulty in ORF Prediction
1. Prokaryotes and viruses: multiple genes may be present on one mRNA, and genes may overlap, so that two different proteins are encoded in different reading frames of the same mRNA
2. Eukaryotes: a protein-coding region (exon) is followed by a non-coding region (intron)
3. Differential mRNA splicing creates different mRNAs, and hence different proteins
4. Variation of the genetic code from the universal code

Reliability of ORF Prediction: characteristics of ORF regions
1. An ordered list of specific codons that reflects the evolutionary origin of the gene and the constraints associated with gene expression
2. A characteristic pattern of use of synonymous codons, i.e. codons that stand for the same amino acid
3. In eukaryotes, strong preferences for codon pairs at intron-exon or exon-intron junctions
4. Genomes with high GC content show a strong bias toward G and C in the third codon position
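The third-position GC bias mentioned in the last point can be measured directly. The sketch below assumes an in-frame coding sequence; the function name `gc3` and the toy input are illustrative.

```python
def gc3(cds):
    """Fraction of G or C at the third position of each complete codon."""
    cds = cds.upper()
    thirds = [cds[i + 2] for i in range(0, len(cds) - 2, 3)]
    if not thirds:
        return 0.0
    return sum(b in "GC" for b in thirds) / len(thirds)

# Toy sequence (hypothetical): ATG GCC GCG AAA, third bases G, C, G, A
print(gc3("ATGGCCGCGAAA"))
```

Comparing this value for a candidate ORF with the value expected for the organism is one simple way to use the bias as evidence for or against coding potential.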

Three Tests of an ORF
Three tests have been devised to verify that a predicted ORF is in fact likely to encode a protein.
- First test: based on an unusual type of sequence variation found in ORFs (every third base tends to be the same more often than expected by chance)
- Second test: the ORF is analyzed to determine whether its codons correspond to those used in other genes of the same organism
- Third test: the ORF is translated into an amino acid sequence, and the resulting sequence is compared to the databases of existing sequences
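The translation step of the third test can be sketched with the standard genetic code. The names `CODON_TABLE` and `translate` are assumptions for illustration, and the database comparison itself (for example with BLAST) is not shown.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(orf):
    """Translate an in-frame DNA ORF, stopping at the first stop codon."""
    protein = []
    orf = orf.upper()
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAATTTGGGTAG"))
```

Organisms or organelles that deviate from the universal code would need a different table.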

Repeated Sequence Elements and Nucleosome Structure
1. Eukaryotic DNA is wrapped around histone-protein complexes
2. Some base pairs in the major or minor grooves of the DNA molecule face the nucleosome surface
3. Other base pairs face the outside of the structure
4. Nucleosomes located in promoter regions are remodeled in a manner that can make binding sites for regulatory proteins more or less available

Hidden Markov Model (HMM) of a Eukaryotic Internal Exon
Computational background: repeated sequence patterns have been found in introns and exons and near the transcription start site of eukaryotic genes.
Bending pattern: bending is influenced by
1. A repeated pattern, i.e. (not T, A or G, G)
2. The AA/TT dinucleotide

Ab initio gene prediction
Predictions are based on the observation that the DNA sequence of a gene is not random:
- A gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous codon usage.
- Non-coding ORFs are very short.
- A gene is expected to correspond to the longest ORF.
These methods look for the characteristic features of genes and give them high scores.

Ab initio gene prediction methods
- GeneScan: Fourier transform of the DNA sequence to find characteristic patterns.
- GeneParser: predicts the most likely combination of exons and introns, using dynamic programming.
- GeneMark: mostly for prokaryotes, also available for eukaryotes; uses hidden Markov models.
- Grail II: predicts exons, promoters and poly(A) sites, using a neural network plus dynamic programming.
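The slides do not describe how the Fourier-based method works in detail. As a rough illustration of the general idea only, coding regions tend to show a 3-base periodicity, which appears as a peak at frequency 1/3 in the Fourier spectrum of the base indicator sequences. The function `period3_power` below is a hypothetical sketch of that measurement, not the algorithm of any named program.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four bases, divided by length."""
    seq = seq.upper()
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient at frequency 1/3 of the 0/1 indicator sequence for this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * k / 3)
                    for k, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / n if n else 0.0

# A perfectly 3-periodic toy sequence scores high; a 4-periodic one scores near zero.
print(period3_power("ATG" * 10))
print(period3_power("ACGT" * 9))
```

A real application would compute this value in a sliding window along the genome and look for windows where it rises well above the background.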

Gene Preference Score: an important indicator of coding regions
Observation: the frequencies of codons and codon pairs differ between coding and non-coding regions.
Given a sequence of codons C, and assuming independence, the slide defines:
- the probability of finding C in a coding region: [equation missing in transcript]
- the probability of finding sequence "C" in non-coding regions: [equation missing in transcript]
- the gene preference score: [equation missing in transcript]
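The formulas for this slide did not survive in the transcript. A common form for such a score, assumed here rather than taken from the slide, multiplies per-codon frequencies under the independence assumption, P(C) = f(c1) f(c2) ... f(cm) for coding regions and P0(C) = f0(c1) f0(c2) ... f0(cm) for non-coding regions, and reports the log ratio log(P(C) / P0(C)). The function name and the frequency values below are made up for illustration.

```python
import math

def preference_score(codons, f_coding, f_noncoding):
    """Log-likelihood ratio log(P(C) / P0(C)), assuming independent codons."""
    # Summing log ratios avoids underflow from multiplying many small frequencies.
    return sum(math.log(f_coding[c] / f_noncoding[c]) for c in codons)

# Hypothetical codon frequencies, not measured from any genome.
f_coding = {"GCT": 0.04, "AAA": 0.02}
f_noncoding = {"GCT": 0.01, "AAA": 0.02}

score = preference_score(["GCT", "AAA", "GCT"], f_coding, f_noncoding)
print(score)
```

Under this form a positive score means the codon sequence is more likely under the coding model than under the non-coding model.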

Confirming gene location using EST libraries
- Expressed Sequence Tags (ESTs): sequenced short segments of cDNA. They are organized in the database "UniGene".
- If a region matches ESTs with high statistical significance, then it is a gene or a pseudogene.

Gene prediction accuracy
- True positives (TP): nucleotides that are correctly predicted to be within the gene.
- Actual positives (AP): nucleotides that are located within the actual gene.
- Predicted positives (PP): nucleotides that are predicted to be within the gene.
Sensitivity = TP / AP
Specificity = TP / PP (the usage common in gene finding; elsewhere this ratio is called precision)
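The two measures can be computed at the nucleotide level from sets of positions. This is a minimal sketch; the function name `accuracy` and the coordinates are hypothetical.

```python
def accuracy(actual, predicted):
    """Return (sensitivity, specificity) as defined on the slide.

    actual, predicted: sets of nucleotide positions inside the gene.
    """
    tp = len(actual & predicted)  # positions both annotated and predicted
    sensitivity = tp / len(actual) if actual else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Hypothetical example: real gene at 100-199, prediction at 150-229.
actual = set(range(100, 200))
predicted = set(range(150, 230))
sn, sp = accuracy(actual, predicted)
print(sn, sp)
```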


Common Difficulties of Gene Prediction
- First and last exons are difficult to annotate because they contain UTRs.
- Smaller genes are not statistically significant, so they are thrown out.
- Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.

Genome Analysis for Gene Prediction
Genome: the sum of the genes and intergenic sequences of a haploid cell.
The value of genome sequences lies in their annotation.
Annotation: characterizing genomic features using computational and experimental methods.
Genes: levels of annotation
- Gene prediction: where are the genes?
- What do they encode?
- Which proteins and pathways are they involved in?

Flowchart: Gene Prediction Process
Genomic DNA sequence
1. Translate in all six reading frames and compare to a protein sequence database
2. Perform a database similarity search against an EST database of the same organism
3. Use a gene prediction program to locate genes
4. Analyze the regulatory sequences in the gene

- Try this first, using BLAST and FASTA
- ORF finding
- Promoter, splice sites, poly(A) tail, 5' UTR, 3' UTR
- PSI-BLAST, PHI-BLAST and other BLAST/FASTA programs; EST and cDNA database searches
- Compare with the genomes of other organisms

Let's practice gene finding using some gene-finding programs:
1. GeneMark
2. GENSCAN
3. Grail II
4. Gene finder in GlimmerM

- HMMgene: prediction of genes in vertebrates and C. elegans
- Gene Discovery Page
- FramePlot: protein-coding region prediction tool for high GC-content bacteria
- tRNAscan-SE: search for transfer RNA genes in genomic sequence
- NETGENE: predict splice sites in human genes
- ORF Finder
- BCM Gene Finder
- Grail
- GeneMark
- Genie: a gene finder based on generalized hidden Markov models
- GENSCAN: predict complete gene structures
- Splice Site Prediction by Neural Network
- Procrustes
- GenePrimer
- GenLang
- MZEF Gene Finder
- Webgene: tools for prediction and analysis of protein-coding gene structure
- MAR-Finder: nuclear matrix attachment region prediction
- Glimmer: bacterial/archaeal gene finder

Promoter Regions, Transcription Factors and Signals
- TRANSFAC: transcription factor database
- TFD: Transcription Factor Database
- TransTerm: a translational signal database
- PLACE: a database of plant cis-acting regulatory DNA elements
- NNPP: Promoter Prediction by Neural Network
- FastM/ModelInspector
- TFSEARCH
- MatInd and MatInspector
- Transcription Element Search Software (TESS)
- CorePromoter (Core-Promoter Prediction Program)
- Gene Express: analysis of genomic regulatory sequences
- Signal Scan
- PromoterInspector
- Promoter Scan II
- Pol3scan
- TargetFinder: finds DNA-binding proteins

Overview GENE PREDICTION TOOLS

GeneMark TM
Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia

GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp
- Bacterial and archaeal gene prediction: use the parallel combination of the GeneMark and GeneMark.hmm programs
- Heuristic approach for gene prediction in prokaryotes: if the DNA sequence of interest belongs to a species whose name is not in the list of available models, use the Heuristic models option
- Self-training program of GeneMark: if the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS

Gene Prediction in Eukaryotes
Eukaryotic gene prediction: use the parallel combination of the GeneMark and GeneMark.hmm programs

Select the Related Organisms from this list

Gene Prediction in ESTs and cDNAs
To analyze ESTs and cDNAs

Gene Prediction in Viruses
Viral gene prediction through the virus database "VIOLIN"

GeneMark Output

New GENSCAN Web Server at MIT

GENSCAN Output

GrailEXP
GrailEXP can:
1. Locate protein-coding genes within a DNA sequence
2. Locate EST/mRNA alignments
3. Locate certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements

GrailEXP is a gene finder and also:
1. an EST alignment utility
2. an exon prediction program
3. a promoter/poly(A) recognizer
4. a CpG island finder
5. a repeat masker

GrailEXP predicts exons, genes, promoters, poly(A) sites, CpG islands, EST similarities, and repetitive elements within a DNA sequence

Glimmer: a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov ModelER) uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from noncoding DNA.
GlimmerHMM: for eukaryotic organisms.
GeneSplicer: a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.

GlimmerM Gene Finder
