Genome annotation and search for homologs

Genome of the week
Discuss the diversity and features of selected microbial genomes.
Link to the paper describing the genome on the MMG433 website.

Bacillus subtilis
Gram-positive soil bacterium
Genetically tractable, well-studied
Developmental pathways (sporulation, genetic competence)
Industrial and agricultural importance
4.2 Mb genome (sequence completed 1997)

B. subtilis genome features
4,106 protein-coding genes
10 rRNA operons
Nearly 50% of the genome consists of paralogous genes.
  – 77 ABC transporter binding proteins
10 phage-like regions, evidence of horizontal transfer.
Low-GC regions in the genome.
18 sigma factors, which initiate transcription.
34 two-component regulatory systems.

Sequencing of genomes
Hierarchical or contig-based sequencing
  – Clone smaller segments of the genome.
  – Labor intensive, slow
  – Not needed for sequencing microbial genomes
Shotgun method
  – Randomly clone and sequence kb-sized fragments of DNA to multiple-fold coverage.
  – Computationally intensive.
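
The "fold coverage" of a shotgun project is simple arithmetic, sketched below in Python. The 4.2 Mb genome size comes from the B. subtilis slide; the read length and read counts are illustrative assumptions, not values from the lecture. The last function uses the standard Poisson (Lander-Waterman) approximation for the fraction of bases that no read covers.

```python
import math


def fold_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced: c = N * L / G."""
    return n_reads * read_len / genome_len


def reads_needed(target_coverage: float, read_len: int, genome_len: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def expected_unsequenced_fraction(coverage: float) -> float:
    """Poisson approximation: probability a base is covered by no read, e^(-c)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome = 4_200_000   # B. subtilis genome size from the slide
    read_len = 500       # hypothetical read length
    print(fold_coverage(42_000, read_len, genome))
    print(reads_needed(8, read_len, genome))
    print(expected_unsequenced_fraction(8))
```

The exponential term shows why random sequencing needs several-fold redundancy: at low coverage a noticeable share of the genome is still missed by chance.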

Sequence assembly
Focus of this week’s lab exercise
Algorithms to align and edit multiple sequences
Phrap and Consed
Sequencher (commercial) for lab.

Finding functional features in a microbial genome
Genes
rRNA operons, tRNAs: programs available
Origin of replication (oriC), near the dnaA gene
Promoters
Transcription terminators
Horizontally transferred DNA
  – GC content
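
As a rough illustration of the GC-content clue for horizontally transferred DNA, the Python sketch below computes GC fraction in fixed windows and reports windows that deviate from the whole-sequence average. The window size, step, and deviation threshold are hypothetical parameters chosen for the example; the slides do not name a specific method or cutoff.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """(start position, GC fraction) for each full-length window."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def atypical_windows(seq: str, window: int, step: int,
                     max_deviation: float) -> list[tuple[int, float]]:
    """Windows whose GC fraction differs from the overall GC by more than
    max_deviation (an arbitrary, illustrative cutoff)."""
    overall = gc_fraction(seq)
    return [
        (start, gc)
        for start, gc in windowed_gc(seq, window, step)
        if abs(gc - overall) > max_deviation
    ]


if __name__ == "__main__":
    toy = "GC" * 10 + "AT" * 10 + "GC" * 10   # AT-rich island in a GC-rich toy sequence
    print(atypical_windows(toy, window=20, step=20, max_deviation=0.5))
```

An atypical window is only a hint: low-GC regions can have other causes, so flagged regions still need checking against features such as the phage-like regions noted for B. subtilis.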

Gene finding
Easy relative to eukaryotic genomes
  – No introns
  – 80-90% of DNA encodes genes, versus 5% in eukaryotes.
Find open reading frames (ORF scanning).
  – Scan from start codons (mostly ATG, but not always) to stop codons. The smallest ORFs are usually 300 nt in length.
  – Additional features: a good Shine-Dalgarno sequence (ribosome binding site), AGGAGG. Not essential.
  – Similarity matches to genes in other genomes.
  – Effective way of searching for ORFs.
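
ORF scanning can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by Glimmer or NCBI ORF finder: it checks all six reading frames, treats ATG, GTG, and TTG as possible start codons (an assumption reflecting the "mostly ATG, not always" note), and keeps ORFs of at least 300 nt by default. Coordinates are 0-based, end-exclusive, include the stop codon, and for the minus strand refer to the reverse-complemented sequence.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_len: int):
    """Yield (frame, start, end) for ORFs in the three frames of one strand.

    An ORF runs from the first start codon seen in a frame to the next
    in-frame stop codon; the reported length includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield frame, start, i + 3
                start = None


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, frame, start, end) tuples for all six reading frames."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_len):
            hits.append((strand, frame, start, end))
    return hits


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "G"
    print(find_orfs(toy, min_len=15))
```

Raising or lowering `min_len` trades false positives against missed short genes, which is why similarity matches and ribosome binding sites are used as supporting evidence.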

Gene finding programs
Genefinder, Grail, Glimmer (TIGR), etc.
ORF finder from NCBI
  – Will use in a future lab exercise and in the final annotation project

Annotating genes
How to assign preliminary functions to genes.
Automated programs.
Similarity searches
  – BLAST and PSI-BLAST
  – COGs, Pfam, CDD, other databases
  – Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome.
Functional characterization (individual genes)
  – Gene knockouts
  – Overexpression

In most cases computer annotation will only be able to predict function, NOT assign function.
  – The biological function of many genes has not been determined, even in model systems.
  – As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.

Molecular function: activity of a protein at the molecular level.
  – Examples would be ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate.
Biological function: cellular role of the protein.
  – Examples would be translation initiation, adapting to environmental changes, glycolysis.

Homologs, orthologs, and paralogs
Homologous genes are genes that share a common evolutionary ancestor.
  – Orthologs are genes found in different organisms that diverged from a common ancestral gene through speciation.
  – Paralogs are genes found in the same organism that arose from a common ancestral gene by duplication. The duplication could have occurred in the species or earlier.

Using BLAST to predict gene function
BLAST the predicted protein sequence against the non-redundant database.
Determine the best hits.
Automated annotation programs will often assign the function of the best hit to the gene being searched.
Automated annotations must be confirmed manually.

Assessment of BLAST output
What is the level of identity and similarity of the best hits?
  – The higher the identity, the more likely the proteins have similar functions.
Does the area of similarity cover the entire protein, or just part of it? (fig. 2.19)
  – Often you will find hits to only part of your protein, a GTP-binding domain for example.
Have any of the best hits been characterized experimentally?
  – With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.
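
The first two checks (identity, and whether the match spans the whole protein) can be computed from BLAST tabular output. The Python sketch below assumes the default 12-column tabular format of BLAST+ (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sample line, the query lengths, and the 80% coverage cutoff are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    pident: float   # percent identity over the aligned region
    qstart: int     # 1-based start of the alignment on the query
    qend: int       # 1-based end of the alignment on the query
    evalue: float


def parse_hit(line: str) -> Hit:
    """Parse one line of default 12-column tabular BLAST output."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10]))


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query protein spanned by the alignment."""
    return (hit.qend - hit.qstart + 1) / query_len


def classify(hit: Hit, query_len: int, min_cov: float = 0.8) -> str:
    """Label a hit 'full-length' or 'partial' using an arbitrary cutoff."""
    return "full-length" if query_coverage(hit, query_len) >= min_cov else "partial"


if __name__ == "__main__":
    # Hypothetical hit: 120 aligned residues of a longer query protein.
    line = "queryP\tsubjA\t45.2\t120\t60\t2\t10\t129\t5\t124\t1e-30\t110"
    hit = parse_hit(line)
    print(hit.pident, query_coverage(hit, 366), classify(hit, 366))
```

A "partial" label is the cue to ask whether the hit reflects only a shared domain, and neither label says anything about whether the subject protein has been characterized experimentally.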

Databases used in protein function analysis
COGs: Clusters of Orthologous Groups, proteins that are best hits against each other when genomes are compared.
Pfam: Protein families, more likely to identify conserved domains than full-length proteins.
TIGRfam: strives to find equivalogs, “proteins that are conserved with respect to FUNCTION since their last common ancestor”.
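
The "best hits against each other" idea behind COGs can be illustrated with a reciprocal-best-hit check between two genomes. This Python sketch is a simplification under stated assumptions, not the COG construction procedure itself: the gene names and bit scores are invented, and ties and score thresholds are ignored.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query to its highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score such as a
    BLAST bit score; the first subject seen wins ties.
    """
    best: dict[str, tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict[tuple[str, str], float],
                         b_vs_a: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores for genes a1, a2 (genome A) and b1, b2 (genome B).
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b1"): 150, ("a2", "b2"): 40}
    b_vs_a = {("b1", "a1"): 210, ("b1", "a2"): 140, ("b2", "a2"): 45}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data a2 also hits b1 strongly but is not b1's best hit, the pattern expected when a2 is a paralog rather than the ortholog.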

Databases used in protein function analysis
SMART: Simple Modular Architecture Research Tool.
PROSITE: Protein motifs.
CDD: Conserved Domain Database, linked to BLAST; includes Pfam, SMART, COGs.
InterPro: a database that brings together many of the above databases so that you can search them all at once.

Bottom line on databases
They are useful tools for assigning possible functions.
Be careful about annotations. For example, proteins in the same COG can be orthologs that have evolved different functions.
  – Many annotations are not backed up by experimental data.
  – Some databases are automated and have not been checked for accuracy.

Examples YqeH and DnaA

Protein function
Molecular function
  – YqeH: GTPase
  – DnaA: ATPase, DNA binding
Biological function
  – YqeH: Unknown
  – DnaA: DNA replication initiation