Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)

Slides:



Advertisements
Similar presentations
1 Orthologs: Two genes, each from a different species, that descended from a single common ancestral gene Paralogs: Two or more genes, often thought of.
Advertisements

1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Basics of Comparative Genomics Dr G. P. S. Raghava.
Protein Modules An Introduction to Bioinformatics.
Single Motif Charles Yan Spring Single Motif.
Scaffold Download free viewer:
Protein Sequence Analysis - Overview Raja Mazumder Senior Protein Scientist, PIR Assistant Professor, Department of Biochemistry and Molecular Biology.
Enzymatic Function Module (KEGG, MetaCyc, and EC Numbers)
Annotation Presentation Alternative Start Codons &
Making Sense of DNA and protein sequence analysis tools (course #2) Dave Baumler Genome Center of Wisconsin,
© Wiley Publishing All Rights Reserved. Searching Sequence Databases.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Structure-based Evidence for Function (TIGRfam, Pfam and PDB)
© Wiley Publishing All Rights Reserved.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Introduction to Gene Mining Part B: How similar are plant and human versions of a gene? After completing part B, you will demonstrate How to use NCBI BLASTp.
T-COFFEE Multiple Alignments of Orthologous Sequences Horizontal Gene Transfer (Phylogenetic Trees) WebLogo.
Pathway Assignments. The assignment – Annotating Pathways KEGG Pathway Database.
Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.
Overview. What is Annotation? Annotation is the process of determining the location and function of all identifiable genes in a genome. Annotation is.
Anotation: Gene of which little is known What follows is a simulation of an orf page in the proposed graphical interface. The interface does not yet exist.
Cleaning Genomes: So easy - even a program head can do it Igor Bogorad.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Module 3 Sequence and Protein Analysis (Using web-based tools) Working with Pathogen Genomes - Uruguay 2008.
ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS, LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES. Alexander Kozik and Richard W. Michelmore.
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
Sequence-based Similarity Module (BLAST & CDD only ) & Horizontal Gene Transfer Module (Ortholog Neighborhood & GC content only)
Sequencing a genome and Basic Sequence Alignment
1 P6a Extra Discussion Slides Part 1. 2 Section A.
1 LSM2241 AY0910 Semester 2 MiniProject Briefing Round 5.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
Protein and RNA Families
Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.
Orthology & Paralogy Alignment & Assembly Alastair Kerr Ph.D. [many slides borrowed from various sources]
Basic Local Alignment Search Tool BLAST Why Use BLAST?
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Orthology & Paralogy Alignment & Assembly Alastair Kerr Ph.D. WTCCB Bioinformatics Core [many slides borrowed from various sources]
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Protein sequence databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen This also includes old material from my thesis
Copyright OpenHelix. No use or reproduction without express written consent1.
Copyright OpenHelix. No use or reproduction without express written consent1.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Welcome to the combined BLAST and Genome Browser Tutorial.
Bioinformatics What is a genome? How are databases used? What is a phylogentic tree?
Using BLAST to Identify Species from Proteins
Sequence similarity, BLAST alignments & multiple sequence alignments
Annotation Presentation
Demo: Protein Information Resource
Basics of Comparative Genomics
Sequence based searches:
Genome Annotation Continued
Genome Center of Wisconsin, UW-Madison
There are four levels of structure in proteins
Sequence Based Analysis Tutorial
BLAST.
What do you with a whole genome sequence?
Annotation Presentation
Basic Local Alignment Search Tool
Basics of Comparative Genomics
Basic Local Alignment Search Tool
Presentation transcript:

Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)

 Genes related by duplication within a genome  May evolve new function  Non-functional  Share homology with functional gene

Sequence Conservation Reflects Evolutionary Relationships (Ancestry) Homologs –Orthologs Genes duplicated via appearance of new species –Identical function in different organisms –Paralogs Genes duplicated within a species –Perform slightly different tasks in cell »Can develop new capabilities »Can become pseudogene if functionality lost but sequence similarity retained Insert Figure 8-41 from Microbiology – An Evolving Science © 2009 W.W. Norton & Company, Inc.

OID Scroll down Let’s hunt for paralogs first... Go to Gene Detail page for your gene

Select Paralogs/Orthologs from drop-down menu

Tables displaying orthologs and paralogs For those cases in which no paralogs are displayed, enter “No paralogs found” in your notebook. We will revisit the ortholog table to complete last of the modules next week For this module, focus on paralog table

For each paralog, perform a reciprocal BLAST search with the amino acid sequence for the paralog as your query against the P. limnophilus genome. Then inspect the alignment of this paralog with our assigned gene  notebook. Tables displaying paralogs only These are the statistics you record in your notebook.

Right-click on the paralog OID; open Gene Detail page in new tab On the Gene Detail page, right click on the number for amino acid sequence length

Use the paralog sequence for reciprocal BLAST search COPY the entire protein sequence in FASTA format then go to “Find Genes”

Select BLAST from menu PASTE paralog protein sequence into query box Change E-value to 1e-2 Select P. limnophilus as database Press “Run BLAST”

Find the Gene OID in hit list that corresponds to your assigned gene Reciprocal BLAST search results Inspect the pair-wise alignment Scroll down

Paralog (query) Original (sbjct) Reciprocal BLAST search results Copy/paste the alignment into your notebook

Recording results in your notebook Enter the gene OID, gene name, and alignment statistics from Gene Detail page for paralog (Will need to add heading/box for OID) Scroll down

Recording results in your notebook Remember, the alignment generated by reciprocal BLAST search is between the paralog (query) and your assigned gene (subjct). Repeat process for all paralogs with significant E-value

 Genes that are nonfunctional.  Will align well to known protein sequence on BLAST, CDD, and/or Pfam and may appear to encode a legitimate coding sequence (start/stop codons, Shine-Dalgarno sequence within proper distance, etc.)  Formed by one of two mechanisms:  By duplication of functional gene followed by mutagenesis that removes functionality  Degradation of a functional gene no longer required by organism

 Sequence is interrupted by more than one stop codon or frameshift, corresponds to a truncated Pfam less than 30% of predicted profile.  Sequence separated by another ORF  Missing key residues known to be required for functionality. New resource: ScanProsite If an ORF meets one of the following criteria, then it should be annotated as a possible pseudogene.

For this class, we will investigate possible pseudogenes using only criterion 1 and criterion 3. We do not have the resources to obtain sequence information needed for criterion 2; however, a brief explanation will be provided. Some things to keep in mind: pseudogenes are very RARE the criteria used to characterize pseudogenes is based on computational methods  experimental confirmation is required before a gene can be ruled nonfunctional

CRITERION #1  Sequence is interrupted by more than one stop codon or frameshift, corresponds to a truncated Pfam less than 30% of predicted profile  Navigate to Pfam database “Click”

CRITERION #1  Copy/paste the amino acid sequence in FASTA format for your assigned gene into the query box  Click “Go”

CRITERION #1  On results page, note the domain graphic Not a pseudogene: Possible pseudogene:  If the last domain is truncated and running to the end of the sequence, then one should investigate the possibility that this ORF is a pseudogene – HOW?

CRITERION #1 Possible pseudogene:  Determine whether this domain is required for functionality of the protein 1.Calculate length of sequence aligned to HMM EX: (113 – 1) + 1 = 113 residues

CRITERION #1 2.Determine the length of the HMM model EX: Calculate % coverage: Divide the value from step 1 by the value from step 2 and multiply by 100 EX: 113 / 113 x 100 = 100% If this value is < 30%, then this may be a pseudogene. Research (PubMed) must indicate that the domain is required for functionality before one can conclude it is a peudogene – look it up!

CRITERION #2  Sequence separated by another ORF  No tools available to easily do this on img/edu, but this is how it is done in theory 1.Obtain genomic DNA sequence that is flanking your ORF (1000s of kilobases on one side of your gene or the other) 2.Perform Pfam search 3.Note the domain graphic  If the second half of the fragmented domain is present in the flanking DNA, and the domain is required for functionality (consult PubMed!), then this may be a pseudogene. Possible pseudogene:

CRITERION #3  Missing key residues known to be required for functionality. Possible pseudogene:  Navigate to ScanProsite tool on Prosite database

New online resource! What is it? curated database of multiple sequence alignments of motifs used for the purpose of identifying domains and families of protein sequences alignments are known as profiles similar to Pfams or COGs What does it do? generally used to identify an open reading frame as a pseudogene by verifying the absence of catalytic residues also can be used to identify proteins present in distantly related microbes

 Copy/paste the amino acid sequence in FASTA format for your assigned gene into the query box  Deselect “Exclude motifs with high probability of occurrence”  Check the box “Show low level score”  Press “START THE SCAN” “Click”

ScanProsite Results Query sequence used for search Visual representation of profiles and patterns Portion of sequence aligning to profile (If scroll over, will highlight corresponding sequence in query above)

Analysis of ScanProsite Results – Do I have a pseudogene by Criterion 3?  Inspect the results  Underlined series of residues, which correspond to required features in a functional protein domain  Red colored residues, which correspond to necessary components of an active site

Analysis of ScanProsite Results – Do I have a pseudogene by Criterion 3?  If there are no underlined sequences or red residues, then the protein is missing the predicted features or active site components required for functionality  Although this is strong evidence that your gene is a pseudogene, you must confirm that there are no exceptions to a condition for functionality – HOW? Next slide

 Click on PS***** identification number in the title line for the profile in which the condition for functionality has not been met  On profile page, look for blue box in “Technical section”  If multiple sequences are detected in Swiss-Prot that match the profile, and the notes indicate there are exceptions to the conditions, then you cannot conclude you have a pseudogene Exceptions

 If multiple sequences are detected in Swiss-Prot that match the profile, and there are NO exceptions to the conditions, then you hypothesize that you have a pseudogene

Recording results in your notebook If you cannot conclude you have a pseudogene, enter “NO” in the box If you can conclude you may have a pseudogene, enter “YES” in the box with a brief note as to which criterion was used to come to this conclusion.