Published by Alexandrina Pope. Modified over 9 years ago.
1
Genome annotation and search for homologs
2
Genome of the week
Discuss the diversity and features of selected microbial genomes.
A link to the paper describing the genome is on the MMG433 website.
3
Bacillus subtilis
Gram-positive soil bacterium
Genetically tractable, well studied
Developmental pathways (sporulation, genetic competence)
Industrial and agricultural importance
4.2 Mb genome (sequence completed in 1997)
4
B. subtilis genome features
4,106 protein-coding genes
10 rRNA operons
Nearly 50% of the genome consists of paralogous genes.
–77 ABC transporter binding proteins
10 phage-like regions, a sign of horizontal transfer
Low-GC regions in the genome
18 sigma factors, which initiate transcription
34 two-component regulatory systems
5
Sequencing of genomes
Hierarchical or contig-based sequencing
–Clone smaller segments of the genome.
–Labor intensive, slow
–Not needed for sequencing microbial genomes
Shotgun method
–Randomly clone and sequence 1.5-2 kb fragments of DNA. 5-10 fold coverage.
–Computationally intensive
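The coverage figure on this slide can be turned into a rough read count. The sketch below assumes the usual definition coverage = (number of reads × read length) / genome size, and uses the Poisson (Lander-Waterman) approximation for the fraction of bases that no read covers. The slide gives no read length, so the 500 bp in the usage note is an illustrative assumption.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Smallest number of reads N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def fraction_unsequenced(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)
```

For a 4.2 Mb genome such as B. subtilis, `reads_needed(4_200_000, 500, 8)` gives the number of hypothetical 500 bp reads needed for 8-fold coverage, and `fraction_unsequenced(8)` estimates how much of the genome would still be missed by chance.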
6
Sequence assembly
Focus of this week's lab exercise
Algorithms to align and edit multiple sequences
Phrap and Consed
Sequencher (commercial) will be used for the lab.
7
Finding functional features in a microbial genome
Genes
rRNA operons, tRNAs: programs available
Origin of replication (oriC): near the dnaA gene
Promoters
Transcription terminators
Horizontally transferred DNA
–GC content
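One simple way to use GC content as a hint of horizontally transferred DNA is to slide a window along the genome and flag windows whose GC fraction falls well below the genome-wide average. This is a minimal sketch, not a published method. The window size, step, and `drop` threshold are assumptions chosen for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def low_gc_windows(genome: str, window: int = 1000, step: int = 500, drop: float = 0.10):
    """Return (start, gc) for windows whose GC is at least `drop` below the genome average."""
    baseline = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if baseline - gc >= drop:
            flagged.append((start, gc))
    return flagged
```

Flagged windows are only candidates. A low-GC region still needs other evidence, such as phage-like genes, before it is called horizontally transferred.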
8
Gene finding
Easy relative to eukaryotic genomes
–No introns
–80-90% of the DNA encodes genes, versus 5% in eukaryotes.
Find open reading frames (ORF scanning).
–Scan from start codons (mostly ATG, but not always) to stop codons. The smallest ORFs considered are usually 300 nt in length.
–Additional features: a good Shine-Dalgarno sequence (ribosome binding site), AGGAGG. Not essential.
–Similarity matches to genes in other genomes.
–Effective way of searching for ORFs.
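ORF scanning as described on this slide can be sketched in a few lines: read each of the three frames on both strands, open an ORF at a start codon, and close it at the next in-frame stop codon, keeping only ORFs of at least 300 nt. This is a simplified illustration, not how Glimmer or NCBI ORF finder work internally. It uses ATG only as the start codon, and coordinates on the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # assumption: ATG only; alternative start codons could be added here


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_len: int = 300):
    """Return (strand, start, end) for each ORF, 0-based, end-exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if end - start >= min_len:
                        orfs.append((strand, start, end))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Lowering `min_len` finds more candidate ORFs but also more false positives, which is why similarity matches and ribosome binding sites are used as supporting evidence.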
9
Gene finding programs
Genefinder, Grail, Glimmer (TIGR), etc.
ORF finder from NCBI
–Will be used in a future lab exercise and in the final annotation project
10
Annotating genes
How to assign preliminary functions to genes
Automated programs
Similarity searches
–BLAST and PSI-BLAST
–COGs, Pfam, CDD, other databases
–Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome.
Functional characterization (individual genes)
–Gene knockouts
–Overexpression
11
In most cases computer annotation will only be able to predict function, NOT assign function.
–The biological function of many genes has not been determined, even in model systems.
–As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.
12
Molecular function: the activity of a protein at the molecular level.
–Examples: ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate.
Biological function: the cellular role of the protein.
–Examples: translation initiation, adapting to environmental changes, glycolysis.
13
Homologs, orthologs, and paralogs
Homologous genes are genes that share a common evolutionary ancestor.
–Orthologs are genes found in different organisms that arose from a common ancestor through speciation.
–Paralogs are genes found in the same organism that arose from a common ancestor through gene duplication. The duplication could have occurred in that species or earlier.
14
Using BLAST to predict gene function
BLAST the predicted protein sequence against the non-redundant database.
Determine the best hits.
Automated annotation programs will often assign the function of the best hit to the gene being searched.
Automated annotations must be confirmed manually.
15
Assessment of BLAST output
What is the level of identity and similarity of the best hits?
–The higher the identity, the more likely the proteins have similar functions.
Does the area of similarity span the entire protein, or just part of it? (fig. 2.19)
–Often you will find hits to only part of your protein, a GTP-binding domain for example.
Have any of the best hits been characterized experimentally?
–With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.
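The first two questions above can be checked mechanically from BLAST tabular output (`-outfmt 6`), whose default columns include percent identity and the query start and end of the alignment. The sketch below computes query coverage and labels a hit as full-length or partial. The query length is passed in separately because it is not among the default columns, and the 0.8 cutoff is an arbitrary assumption, not a standard.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    pident: float  # percent identity over the aligned region
    qstart: int    # 1-based start of the alignment on the query
    qend: int      # 1-based end of the alignment on the query
    evalue: float


def parse_outfmt6(line: str) -> Hit:
    """Parse one line of default BLAST tabular output (12 tab-separated columns)."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10]))


def query_coverage(hit: Hit, query_len: int) -> float:
    """Fraction of the query protein spanned by the alignment."""
    return (abs(hit.qend - hit.qstart) + 1) / query_len


def describe(hit: Hit, query_len: int, full_length_cutoff: float = 0.8) -> str:
    """One-line summary: identity, coverage, and whether the hit looks domain-only."""
    cov = query_coverage(hit, query_len)
    kind = "full-length" if cov >= full_length_cutoff else "partial (possible shared domain)"
    return f"{hit.subject}: {hit.pident:.1f}% identity over {cov:.0%} of query, {kind}"
```

A partial hit with high identity often means a shared domain, such as a GTP-binding domain, and not a shared overall function. Whether any hit has been characterized experimentally still has to be checked by hand.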
16
Databases used in protein function analysis
COGs (Clusters of Orthologous Groups): proteins that are best hits against each other when comparing two genomes.
Pfam (protein families): more likely to identify conserved domains than full-length proteins.
TIGRfam: strives to find equivalogs, “proteins that are conserved with respect to FUNCTION since their last common ancestor”.
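The "best hits against each other" idea behind COGs can be illustrated as a reciprocal best hit test between two genomes. The sketch below assumes the best hit of every protein has already been found, for example with BLAST, and stored in two dictionaries. It shows the pairwise criterion only, not the full COG construction procedure, and the function and variable names are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> list:
    """Pairs (a, b) where b is the best hit of a in genome B and a is the best hit of b in genome A."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

Proteins whose best hits are not reciprocated are left out of the result, which is one reason paralogs and fast-evolving genes need a closer look.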
17
Databases used in protein function analysis
SMART: Simple Modular Architecture Research Tool
PROSITE: protein motifs
CDD (Conserved Domain Database): linked to BLAST; draws on Pfam, SMART, and COGs.
InterPro: a database that brings together many of the above databases so that you can search them all at once.
18
Bottom line on databases
They are useful tools for assigning possible functions.
Be careful about annotations.
–Example: proteins in the same COG can be orthologs that have evolved different functions.
–Many annotations are not backed up by experimental data.
–Some databases are automated and have not been checked for accuracy.
19
Examples: YqeH and DnaA
20
Protein function
Molecular function
–YqeH: GTPase
–DnaA: ATPase, DNA binding
Biological function
–YqeH: unknown
–DnaA: DNA replication initiation