Protein Families, Motifs & Domains.

The Annotation Process
DNA SEQUENCE Useful Information ANALYSIS SOFTWARE Annotator

A Common Mistake! BLAST PROTEIN SEQUENCE Function Annotator

A word about BLAST and FASTA; Sequence alignment; Domains; Prosite; Pfam/HMMs; SignalP/TMHMM

BLAST Local Alignment
Suggests the presence of a common domain between two proteins. However, common domains can be conserved between proteins with very different functions, e.g. ATP binding is common to many proteins.

BLAST/FASTA
FASTA is used here as a global alignment tool; BLAST is local. Going from BLAST to FASTA reduces sensitivity and increases specificity.
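The difference between local and global alignment can be sketched with one small dynamic-programming routine. This is a minimal illustration of Smith-Waterman (local) and Needleman-Wunsch (global) scoring, not the heuristics that BLAST or FASTA actually run; the scoring values and the two short sequences are hypothetical, chosen so that both share one domain-like stretch and nothing else.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    local=True  -> Smith-Waterman: best-scoring pair of substrings.
    local=False -> Needleman-Wunsch: both sequences aligned end to end.
    The match/mismatch/gap values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical sequences sharing only the stretch GSGKST.
seq_a = "WWWGSGKST"
seq_b = "GSGKSTPPP"
print("local :", alignment_score(seq_a, seq_b, local=True))
print("global:", alignment_score(seq_a, seq_b, local=False))
```

The local score rewards the shared stretch alone, while the global score is dragged down by the unrelated flanking residues. That is why a strong local hit says a common domain may be present, but says little about the rest of the protein.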

Using FASTA Global Alignment
Annotation gained from homology hits is only as good as the annotation you are transferring. E.g. there are two different genes called ESAG2 in SWALL. Small changes in “your gene” might confer functional differences.

FASTA
10^-5: low-scoring hits can give good alignments.

10^-8: high-scoring hits can give poor alignments.

The big problem with searching public databases is…
There is a need to reduce the number of sequences we search and to prevent bad annotation from spreading.

Proteins with common functions have some common features. Domains and motifs arise from conserved residues. Families can be grouped, and profiles and HMMs derived. There is more to life than BLAST.

Sequence Alignment
Sequence alignments allow us to see which residues are important to a family of proteins. This lets us build motifs, profiles, fingerprints and HMMs to define families.
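As a rough sketch of how an alignment turns into a motif, the function below summarises each column of an ungapped alignment. The five sequences are the ones used in the Prosite pattern example of this deck. The rule for when a column becomes a wildcard (`max_alternatives`) is an assumption made for illustration; curated patterns are tuned by hand.

```python
def pattern_from_alignment(seqs, max_alternatives=2):
    """Summarise an ungapped alignment as a PROSITE-style pattern.

    A fully conserved column becomes that residue, a column with up to
    `max_alternatives` residues becomes [..], anything more variable becomes x.
    The threshold is an illustrative assumption.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    elements = []
    for column in zip(*seqs):
        residues = sorted({r.upper() for r in column})
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")

    # Collapse runs of x into x(n); the trailing None flushes a final run.
    collapsed, run = [], 0
    for element in elements + [None]:
        if element == "x":
            run += 1
            continue
        if run:
            collapsed.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            collapsed.append(element)
    return "-".join(collapsed)


aligned = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
print(pattern_from_alignment(aligned))
```

The output uses the `x(2)` form for a run of two wildcard positions, which corresponds to the `x2` shorthand written on the slide.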

Domains
A domain is a functional part of a protein. It may contain amino acid sequence motifs that can be used to identify it. More than one motif is known as a fingerprint.

DOMAINS
[Diagram: a domain and its alignment, described by Motifs (Prosite), Fingerprints, Blocks and Pfam (HMMs)]

Prosite http://us.expasy.org/prosite/
Maintained at the Swiss Institute of Bioinformatics. All motifs are checked for false positives and fine-tuned. Sometimes a family can be defined by more than one expression. Fingerprints and BLOCKS automatically scan proteins for a number of motifs.

Prosite (Bairoch et al. (1997) NAR 25(1) 217-221)
Single most conserved motifs, referred to as regular expressions or patterns. E.g. the sequences
cydeggis
cyedggis
cyeeggit
cyhgdggs
cyrgdgnt
give the pattern C-Y-x2-[DG]-G-x-[ST]
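Because a Prosite pattern is essentially a regular expression, it can be translated into one and used to scan a sequence. The converter below is a minimal sketch that handles only the elements seen in this deck (single residues, `x`, `[..]`, `{..}` and a fixed repeat written as `x2` or `x(2)`); it is not the official Prosite scanner, and the function name is hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed] or {excluded}, optionally
# followed by a repeat count written as (n) or as a bare number (slide style).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\)|(\d+))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, paren_count, bare_count = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        count = paren_count or bare_count
        regex.append(core + (f"{{{count}}}" if count else ""))
    return "".join(regex)


regex = re.compile(prosite_to_regex("C-Y-x2-[DG]-G-x-[ST]"))
for seq in ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]:
    print(seq, bool(regex.fullmatch(seq.upper())))
```

All five example sequences match, which is a quick check that the pattern on the slide really does summarise them.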

Prosite
PROSITE: PS00002
ID   GLYCOSAMINOGLYCAN; RULE.
AC   PS00002;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Glycosaminoglycan attachment site.
PA   S-G-x-G.
RU   Additional rules:
RU   There must be at least two acidic amino acids (Glu or Asp) from -2 to
RU   -4 relative to the serine.
CC   /TAXO-RANGE=??E??;
CC   /SITE=1,glycosaminoglycan;
CC   /SKIP-FLAG=TRUE;
DO   PDOC00002;
//
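The entry combines a pattern (PA) with an additional rule (RU) that a plain pattern match cannot express. A minimal sketch of applying both together is shown below; the function name and the test sequences are hypothetical, and positions are reported 0-based.

```python
import re


def glycosaminoglycan_sites(seq):
    """Return 0-based positions of serines that satisfy S-G-x-G and the RU rule.

    RU rule from the entry: at least two acidic residues (Glu or Asp)
    at positions -2 to -4 relative to the serine.
    """
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping S-G-x-G matches are all reported.
    for m in re.finditer(r"(?=SG.G)", seq):
        ser = m.start()
        # Residues at positions -4, -3 and -2 (fewer if the serine is near the start).
        upstream = seq[max(ser - 4, 0):max(ser - 1, 0)]
        acidic = sum(upstream.count(residue) for residue in "DE")
        if acidic >= 2:
            sites.append(ser)
    return sites


print(glycosaminoglycan_sites("AAEDASGAGAA"))  # pattern and rule both satisfied
print(glycosaminoglycan_sites("AAAAASGAGAA"))  # pattern matches, rule fails
```

The second sequence matches S-G-x-G but is rejected by the rule, which shows why the pattern line alone is not the whole definition of the site.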

Prosite documentation entry
*************************************
* Glycosaminoglycan attachment site *
*************************************

Proteoglycans [1] are complex glycoconjugates containing a core protein to which a variable number of glycosaminoglycan chains (such as heparin sulfate, chondroitin sulfate, etc.) are covalently attached. The glycosaminoglycans are attached to the core proteins through a xyloside residue which is in turn linked to a serine residue of the protein. A consensus sequence for the attachment site seems to exist [2]. However, it must be noted that this consensus is only based on the sequence of three proteoglycan core proteins.

-Consensus pattern: S-G-x-G [S is the attachment site]
 Additional rule: There must be at least two acidic amino acids from -2 to -4 relative to the serine.
-Last update: June 1988 / First entry.

[ 1] Hassel J.R., Kimura J.H., Hascall V.C. Annu. Rev. Biochem. 55: (1986).
[ 2] Bourdon M.A., Krusius T., Campbell S., Schwarz N.B. Proc. Natl. Acad. Sci. U.S.A. 84: (1987).

Prosite
A Prosite hit is a binary piece of information (True/False). However, some motifs are very simple, so they give many false positives. Some motifs should be found together. Documentation must always be read.
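A back-of-the-envelope calculation shows why very simple motifs produce many false positives. The sketch below assumes all 20 amino acids are equally likely and independent at every position, which is a deliberate simplification, and it ignores any additional rules attached to a pattern. The database size is a hypothetical round number.

```python
def chance_match_probability(allowed_per_position, alphabet_size=20):
    """Probability that a random position matches a pattern.

    `allowed_per_position` lists how many residues each pattern position
    accepts (1 for a fixed residue, 20 for x, 2 for [DG], ...).
    Assumes uniform, independent residue frequencies (a simplification).
    """
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / alphabet_size
    return probability


# S-G-x-G: three fixed residues and one wildcard.
p_simple = chance_match_probability([1, 1, 20, 1])
# C-Y-x2-[DG]-G-x-[ST]: three fixed residues, two two-way choices, three wildcards.
p_longer = chance_match_probability([1, 1, 20, 20, 2, 1, 20, 2])

residues_searched = 1_000_000  # hypothetical database size
print("S-G-x-G expected chance hits:", p_simple * residues_searched)
print("C-Y-x2-[DG]-G-x-[ST] expected chance hits:", p_longer * residues_searched)
```

Under these assumptions the four-residue pattern is expected to match by chance about a hundred times more often than the longer one, so a True result for a short pattern carries little weight on its own.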

Hidden Markov Models
Probabilistic models linking interconnecting states. Profile HMMs represent linear chains of match, delete or insert states. Each position in an alignment is assigned M, I or D. There is a defined probability of moving from one state to the next.

Hidden Markov Models
[Diagram: profile HMM with a begin state, match states M1-M4, insert states I0-I4 and delete states D1-D4]
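To make the match/insert/delete chain concrete, here is a toy two-column profile HMM and a function that multiplies transition and emission probabilities along one chosen path. Every state name follows the diagram, but every probability is invented for illustration, the I0 state is left out for brevity, and real tools such as HMMER sum or maximise over all paths rather than scoring a single one.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical transition probabilities between states.
TRANSITIONS = {
    ("begin", "M1"): 0.9, ("begin", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "I1"): 0.3, ("I1", "M2"): 0.7,
    ("D1", "M2"): 0.9, ("D1", "D2"): 0.1,
    ("M2", "end"): 1.0, ("D2", "end"): 1.0,
}

# Hypothetical emission probabilities; delete states emit nothing.
EMISSIONS = {
    "M1": {"C": 0.9, "A": 0.1},
    "M2": {"Y": 0.8, "F": 0.2},
    "I1": {residue: 1 / 20 for residue in AMINO_ACIDS},
}


def path_probability(path, sequence):
    """Joint probability of a state path and the residues it emits."""
    probability = 1.0
    position = 0
    previous = "begin"
    for state in path:
        probability *= TRANSITIONS.get((previous, state), 0.0)
        if state in EMISSIONS:  # match and insert states consume one residue
            if position >= len(sequence):
                return 0.0
            probability *= EMISSIONS[state].get(sequence[position], 0.0)
            position += 1
        previous = state
    # The path must account for every residue in the sequence.
    return probability if position == len(sequence) else 0.0


print(path_probability(["M1", "M2", "end"], "CY"))         # two matches
print(path_probability(["M1", "I1", "M2", "end"], "CGY"))  # one inserted residue
print(path_probability(["D1", "M2", "end"], "Y"))          # first column deleted
```

The path through the match states alone is by far the most probable for a sequence that fits the family; insertions and deletions are allowed but cost probability.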

Pfam
Pfam 7.0 contains a total of 3360 families.
Pfam is a database of two parts: Pfam-A (curated) and Pfam-B (automatically generated). All HMMs have a seed alignment, which is added to using the HMMER package.

Pfam annotation

Pfam Scores
E-value: the expect value, as for BLAST; the number of hits expected by chance.
Noise cutoff: the HMM score below which a hit is uninteresting.
Trusted cutoff: the HMM score above which there should be no false positives.
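The two cutoffs split hits into three bands, which can be sketched as a small classifier. The function name, the labels and the handling of scores that fall exactly on a cutoff are assumptions made for this example, and the cutoff values in the usage lines are hypothetical; each real Pfam family carries its own.

```python
def classify_hit(score, noise_cutoff, trusted_cutoff):
    """Place an HMM score relative to a family's noise and trusted cutoffs.

    Boundary handling (>= trusted, <= noise) is an assumption for this sketch.
    """
    if trusted_cutoff < noise_cutoff:
        raise ValueError("trusted cutoff should not be below the noise cutoff")
    if score >= trusted_cutoff:
        return "trusted"      # should not be a false positive
    if score <= noise_cutoff:
        return "noise"        # uninteresting
    return "borderline"       # between the cutoffs: inspect by hand


# Hypothetical cutoffs for one family.
for score in (35.2, 22.0, 8.5):
    print(score, classify_hit(score, noise_cutoff=18.0, trusted_cutoff=25.0))
```

Hits in the middle band are the ones where the family documentation and the alignment itself need to be checked before any annotation is transferred.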

Searching for Specific Domains
Signal peptides: secreted/targeted proteins
Transmembrane domains: membrane-bound proteins

InterPro curation

TMHMM What is a transmembrane domain?

TMHMM http://www.cbs.dtu.dk/services/TMHMM/

SIGNALP What is a signal peptide?
Any protein that has to be targeted to a specific part of the cell requires a signal peptide. The signal peptide ensures that the protein is translated at the ER, where it can enter the secretory pathway. I.e. the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.

SIGNALP

Using secondary databases for functional assignments
Better, more detailed, professional annotation. More powerful and sensitive search methods: HMMs, profiles, weight matrices. Not as good coverage.

The Gene Prediction Process
BLAST FASTA SignalP DNA SEQUENCE Functional Assignments ANALYSIS SOFTWARE TMHMM Pfam Prosite Annotator
