Identification of Protein Domains

Identification of Protein Domains Eden Dror Menachem Schechter Computational Biology Seminar 2004.
Presentation transcript:

Identification of Protein Domains

Orthologs and Paralogs: describing evolutionary relationships among genes (proteins). The two major ways of creating homologous genes are gene duplication and speciation. The term "homology" alone is not sufficiently specific, so additional terms are used:

Orthologs are two genes from two different species that derive from a single gene in the last common ancestor of the species. Paralogs are genes that derive from a single gene that was duplicated within a genome.

Co-orthologs are paralogs produced by duplications of orthologs subsequent to a given speciation event.

Inparalogs are paralogs in a given lineage that all evolved by gene duplications that happened after the speciation event. Outparalogs are paralogs in the given lineage that evolved by gene duplications that happened before the speciation event.
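The definitions above can be written as a small decision rule. This is only a sketch: the function name `classify_pair` and its two arguments are hypothetical, and it assumes the caller already knows, for example from a reconciled gene tree, which kind of event separated the two genes.

```python
from typing import Optional


def classify_pair(divergence_event: str,
                  duplication_after_speciation: Optional[bool] = None) -> str:
    """Name the relationship between two homologous genes.

    divergence_event: "speciation" or "duplication", the event at which
        the two gene lineages split (assumed to be known in advance).
    duplication_after_speciation: for duplications only, whether the
        duplication happened after (True) or before (False) the
        speciation event of interest; None if no speciation is specified.
    """
    if divergence_event == "speciation":
        return "orthologs"
    if divergence_event == "duplication":
        if duplication_after_speciation is None:
            return "paralogs"
        return "inparalogs" if duplication_after_speciation else "outparalogs"
    raise ValueError("divergence_event must be 'speciation' or 'duplication'")


print(classify_pair("speciation"))
print(classify_pair("duplication", duplication_after_speciation=True))
print(classify_pair("duplication", duplication_after_speciation=False))
```

The hard part in practice is inferring the divergence event itself; the rule only names the relationship once that is known.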

Orthologs and Paralogs. Orthologs: evolutionary functional counterparts in different species. Inparalogs: important for detecting lineage-specific adaptations.

Proteins: Databases of protein sequences are growing rapidly because of genome sequencing projects. Many new proteins belong to protein families with known functions (they show significant sequence similarity to family members). Only a small fraction of known proteins have functions determined by experiment. Databases that provide computational sequence analysis allow us to assign new proteins to known families, and thus to infer their likely function.

Protein Domains: A domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats. Module = mobile domain. Different domains have distinct functions. Many eukaryotic proteins have multiple domains.

Protein Domains (figures): PX domain with ligand; SH3 domain with ligand.

Identifying Protein Domains: Problems: defining the members of each family; building multiple alignments of the members; finding the boundaries of the domain.

Identifying Protein Domains: Little structural data is available, so domains are identified by sequence analysis. Sequence characterization of families helps in determining 3D structure and molecular functions. Even when the structure of the domain is not known, it may be possible to define its boundaries from sequence alone.

Identifying Protein Domains: Motif matches are often useful to indicate functional sites; however, they lack sensitivity and do not give a clear picture of the domain boundaries.

Identifying Protein Domains: Automatic methods are fast and effective and can handle large amounts of data, but they might fragment domain families or fuse distinct families together. Manual methods put the knowledge of protein experts to use, but they are slow and require a lot of manpower.

SMART (Simple Modular Architecture Research Tool): Web-based resource used for rapid annotation of protein domains and for analysis of domain architectures.

Domain Architecture (figure): Protein PA-3427CG, species Drosophila melanogaster; protein ENSMUSP, species Mus musculus; protein ENSANGP, species Anopheles gambiae.

SMART (Simple Modular Architecture Research Tool): Covers over 600 domain families. Provides information about function, subcellular localization, phyletic distribution, and tertiary structure. Based on HMMs (Hidden Markov Models).

SMART (Simple Modular Architecture Research Tool): Each HMM is built from a seed alignment. Threshold values are used to decide whether a match is a homologous domain.

SMART (Simple Modular Architecture Research Tool): Proteins are aligned by minimizing insertions/deletions in conserved alignment blocks, optimizing amino acid property conservation, and closing unnecessary gaps. Gapped alignments are preferred over ungapped ones because they allow prediction of domain boundaries and have greater information content. Entire structural domains are aligned.

PROSITE, database of protein families and domains: Database of biologically significant sites and patterns. Contains 1,609 profiles. Pattern: a conserved sequence of a few amino acids. Identifies to which known family of proteins (if any) a new sequence belongs. Used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.

PROSITE, database of protein families and domains: A protein that is too distant from any other for its resemblance to be detected by overall sequence alignment can still be classified according to a pattern. Patterns arise because the requirements of binding sites impose very tight constraints on the evolution of portions of the protein.

PROSITE, how is a pattern developed? A pattern should be as short as possible, detect all or most of the sequences it describes, and give as few false positives as possible. In other words: high sensitivity and high specificity.
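The two quality criteria can be made concrete with the usual definitions of sensitivity and specificity. The sketch below is illustrative only: the function names are our own, and the counts in the usage lines are made-up numbers, not PROSITE statistics.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true family members that the pattern detects."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no family members to evaluate")
    return true_pos / total


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-members that the pattern correctly rejects."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no non-members to evaluate")
    return true_neg / total


# Hypothetical counts from scanning a database with one pattern.
print(sensitivity(true_pos=90, false_neg=10))
print(specificity(true_neg=990, false_pos=10))
```

Making a pattern longer tends to raise specificity at the cost of sensitivity, which is why the slide asks for the shortest pattern that still keeps false positives low.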

PROSITE, how is a pattern developed? First, study reviews on a protein family. Then build an alignment table with particular attention to residues and regions important to the biological function of that family: enzyme catalytic sites; prosthetic group attachment sites (heme); amino acids involved in binding a metal ion; cysteines involved in disulfide bonds; regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

PROSITE, steps in the development of a pattern: Find a core pattern of 4-5 biologically significant residues. Test the pattern on a large database. If lucky, there is a correlation in this region, which indicates a good pattern. Mostly there is no correlation; in that case, gradually increase the size of the pattern or search for other patterns.

PROSITE, an example: Aligned sequences: ALRDFATHDDF, SMTAEATHDSI, ECDQAATHEAS. The conserved region they share is A-T-H followed by D or E. A pattern this small would probably pick up too many false positives.
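To see how such a short pattern behaves, it can be translated into a regular expression and run against the three example sequences. This is a minimal sketch: `prosite_to_regex` is our own helper, it covers only the basic PROSITE pattern syntax (`x`, `[...]`, `{...}`, `(n)` or `(n,m)` repeats, `<` and `>` anchors), and the pattern `A-T-H-[DE]` is our reading of the conserved core of the three sequences.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE pattern such as 'A-T-H-[DE]' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:                            # repetition, e.g. x(2,4)
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element               # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = re.compile(prosite_to_regex("A-T-H-[DE]"))
for seq in ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]:
    hit = regex.search(seq)
    print(seq, hit.start() if hit else None)
```

All three sequences match at the same offset, but a four-residue pattern with no further constraints is expected to occur by chance in many unrelated proteins, which is the false-positive problem the slide points out.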

Profiles characterize a protein family or domain over its entire length. Patterns describe small regions of high sequence similarity.

Research: Finding new domain families, automatic methods. The team started with 107 nuclear domains. Using SMART, they retrieved all proteins with at least one of these domains and characterized their complete domain structure. Regions not annotated by known SMART domain models were extracted together with their domain context.
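The extraction step amounts to taking the complement of the annotated domain intervals within each protein. A rough sketch follows; the function name, the 1-based inclusive coordinates, and the `min_length` cutoff of 30 residues are assumptions made for illustration and are not taken from the study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def unannotated_regions(protein_length: int,
                        domains: List[Interval],
                        min_length: int = 30) -> List[Interval]:
    """Return stretches of a protein not covered by any annotated domain.

    Only stretches of at least min_length residues are kept
    (the cutoff is a hypothetical choice).
    """
    regions = []
    cursor = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - cursor >= min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if protein_length - cursor + 1 >= min_length:
        regions.append((cursor, protein_length))
    return regions


# Hypothetical protein of 500 residues with two annotated domains.
print(unannotated_regions(500, [(50, 120), (300, 380)]))
```

Keeping the flanking domain annotations alongside each extracted region would preserve the "domain context" mentioned on the slide.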

Finding new domain families, automatic methods: Group proteins by region similarity. Find homologs by running PSI-BLAST on the longest region of every group (threshold E-value < 0.001). Determine the domain organization via SMART. Homologous regions are candidates for a novel domain family.
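The query selection and hit filtering can be sketched in a few lines. This does not run PSI-BLAST; it assumes hits are already available as (identifier, E-value) pairs, and all names and data below are hypothetical. Only the 0.001 threshold comes from the slide.

```python
from typing import Dict, List, Tuple


def longest_per_group(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick the longest region of each group as the search query."""
    return {name: max(regions, key=len) for name, regions in groups.items()}


def filter_hits(hits: List[Tuple[str, float]],
                max_evalue: float = 1e-3) -> List[Tuple[str, float]]:
    """Keep hits whose E-value is below the threshold (E-value < 0.001)."""
    return [(hit_id, e) for hit_id, e in hits if e < max_evalue]


# Hypothetical groups of similar unannotated regions.
groups = {"group1": ["MKTAYIAK", "MKTAYIAKQRQIS"], "group2": ["GSHMLE"]}
print(longest_per_group(groups))

# Hypothetical search results as (hit identifier, E-value) pairs.
hits = [("protA", 2e-10), ("protB", 5e-4), ("protC", 0.02)]
print(filter_hits(hits))
```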

Finding new domain families, manual confirmation: A region found in different domain contexts indicates a novel module family. Proteins with both nuclear and extracellular domains were excluded. Multiple alignments and the known locations of domains were used to define the domains' borders. Automatic searches (E-value < 0.1) followed by manual checks were used to find more members. Marginal similarity to a known domain family indicates a possible divergent family.

Prediction of Function, chromatin-binding domains: The protein SPT6, which contains a CSZ domain, regulates transcription through a histone-binding capability. It also contains two other types of domains, which are unlikely to bind histones. Therefore it was predicted that the CSZ domain has that function.

Research: A PSI-BLAST search with the C-terminal region (E-value < 10^-5) found UBX-containing proteins and metazoan homologs of PNGases. PNGases: proteins involved in the UPR. UPR: unfolded protein response. PUG: the name given to the homologous regions. PUG domains are found in proteins with domains central to ubiquitin-mediated proteolysis (UBA and UBX). In the Arabidopsis protein, the UBA domain is at the N-terminus.

Conclusion: PUG-containing proteins might link the UPR to ubiquitin-mediated protein degradation.

Figure: domain architectures combining PUG with UBA, UBX, and UBCc domains. Labels: PNGases, believed to have a role in the UPR; domains central to ubiquitin-mediated proteolysis.

Figures: Apoptosis: UBX domain from human FAF1. DNA-binding protein: C-terminal UBA domain of the human homologue of RAD23A (HHR23A).

Orthologs of PNGases in metazoans are present singly (not as multiple paralogs), so they are likely to have similar cellular localization. The ortholog in Saccharomyces cerevisiae is known to be localized mainly in the nucleus. It is therefore likely that the metazoan PNGases are localized in the nucleus too.

An HMM built from the PUG domain shows marginal similarity to IRE1p-like kinases, which are known to initiate the UPR as well. This suggests the presence of divergent PUG domains in the C-termini of these proteins. Analysis also revealed a conserved region in metazoan PNGases; it was named PAW and added to SMART.

The team found 28 novel nuclear domain families. Most of them have representatives in diverse molecular contexts in different species. Some are specific to a single species. Others are divergent members of previously recognized families.

The End