1 Lesson 5 Protein Prediction and Classification.

Presentation transcript:

1 Lesson 5 Protein Prediction and Classification

2 Learning about a protein What does a protein do??  Post-translational modifications – phosphorylation, glycosylation, etc.  Identifying patterns, motifs  Secondary structure  Tertiary/quaternary structure  Protein-protein interactions

3 Domains & Motifs

4 Domains  An analysis of known 3-D protein structures reveals that, rather than being monolithic, many of them contain multiple folding units.  Each such folding unit is a domain (typically more than 50 aa and fewer than 500 aa).

5 Domain example (figure: calcium/calmodulin-dependent protein kinase)  SH2 domains interact with phosphorylated tyrosines, and are thus part of intracellular signal-transducing proteins.  They are characterized by specific sequences and tertiary structure.

6 What is a motif??  A sequence motif = a certain sequence that is widespread and conjectured to have biological significance  Examples: KDEL – ER-lumen retention signal PKKKRKV – an NLS (nuclear localization signal)

7 More loosely defined motifs  KDEL (usually) +  HDEL (rarely) =  [HK]-D-E-L: H or K at the first position  This is called a pattern (in Biology), or a regular expression (in computer science)
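
The pattern on this slide maps directly onto a regular expression. The snippet below is a minimal sketch using Python's standard `re` module; the name `ER_RETENTION` and the test sequences are made up for illustration.

```python
import re

# [HK]-D-E-L written as a regular expression: H or K, then D, E, L.
ER_RETENTION = re.compile(r"[HK]DEL")

def has_retention_motif(sequence):
    """Return True if the sequence contains KDEL or HDEL anywhere."""
    return ER_RETENTION.search(sequence) is not None

print(has_retention_motif("MSTAAKDEL"))  # contains KDEL
print(has_retention_motif("MSTAAHDEL"))  # contains HDEL
print(has_retention_motif("MSTAARDEL"))  # R is not allowed at the first position
```

The character class `[HK]` plays the same role as the square brackets in the pattern notation.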

8 Syntax of a pattern  Example: W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].

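9 Patterns  W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].  Here x(9,11) means any amino acid, repeated between 9 and 11 times, and [FYV] means F or Y or V.  Example sequences: WOPLASDFGYVWPPPLAWS (matches), ROPLASDFGYVWPPPLAWS (no match: no W at a suitable position), WOPLASDFGYVWPPPLSQQQ (no match: no G, S, T, N or E at the required final position).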

10 Patterns - syntax  Residues are written in the standard IUPAC one-letter codes.  ‘x’: any amino acid.  ‘[]’: residues allowed at the position.  ‘{}’: residues forbidden at the position.  ‘()’: repetition of a pattern element is indicated in parentheses, as x(n) or x(n,m), giving the number or range of repetitions.  ‘-’: separates the pattern elements.  ‘<’: indicates an N-terminal restriction of the pattern.  ‘>’: indicates a C-terminal restriction of the pattern.  ‘.’: the period ends the pattern.
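
The syntax rules above can be sketched as a small translator from a pattern to a Python regular expression. This is an illustration under simplifying assumptions, not an official parser: the function name `prosite_to_regex` is hypothetical, and the N- and C-terminal markers are only handled at the start or end of an element, written as `<` and `>`.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repetition such as (9) or (9,11).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a pattern such as W-x(9,11)-[FYV]. into a regex string."""
    body = pattern.strip().rstrip(".")      # the period ends the pattern
    parts = []
    for element in body.split("-"):         # '-' separates the elements
        n_terminal = element.startswith("<")
        c_terminal = element.endswith(">")
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        spec, low, high = match.groups()
        if spec in ("x", "X"):
            regex = "."                     # any amino acid
        elif spec.startswith("{"):
            regex = "[^" + spec[1:-1] + "]" # forbidden residues
        else:
            regex = spec                    # single residue or [allowed]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_terminal else "") + regex + ("$" if c_terminal else ""))
    return "".join(parts)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].")
print(regex)
for seq in ("WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"):
    print(seq, bool(re.search(regex, seq)))
```

Running the translated expression over the three example sequences from the previous slide shows which of them the pattern accepts.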

11 Pattern ~ motif ~ signature  A pattern (similarly to a consensus and a profile) is a way to represent a conserved sequence  Whereas a profile and a consensus usually relate to the entire sequence, a pattern usually relates to a few tens of amino acids

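12 Profile-pattern-consensus  Multiple alignment: GTTCAA, GCTGAA, CTTCAC.  Consensus: GTTCAA (or NNTNAN when every variable position is masked).  Pattern: [GC]-[TC]-T-[GC]-A-[AC].  Profile: a table of the frequencies of A, T, C and G at each position.  Information: consensus < pattern < profile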
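
To make the three representations concrete, here is a small sketch that derives each of them from the alignment on the slide (GTTCAA, GCTGAA, CTTCAC). The function names are illustrative, and residues inside brackets are printed in alphabetical order, which does not change the meaning of the pattern.

```python
from collections import Counter

ALIGNMENT = ["GTTCAA", "GCTGAA", "CTTCAC"]

def column_counts(alignment):
    """Count the letters in each column of an ungapped alignment."""
    return [Counter(column) for column in zip(*alignment)]

def majority_consensus(alignment):
    """Most frequent letter per column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))

def strict_consensus(alignment, wildcard="N"):
    """Letter per column if fully conserved, otherwise a wildcard."""
    return "".join(next(iter(c)) if len(c) == 1 else wildcard
                   for c in column_counts(alignment))

def pattern(alignment):
    """Every letter observed in each column, in bracket notation."""
    parts = []
    for counts in column_counts(alignment):
        letters = "".join(sorted(counts))
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "-".join(parts)

def profile(alignment, alphabet="ACGT"):
    """Frequency of each letter at each position."""
    n = len(alignment)
    return {letter: [c[letter] / n for c in column_counts(alignment)]
            for letter in alphabet}

print(majority_consensus(ALIGNMENT))
print(strict_consensus(ALIGNMENT))
print(pattern(ALIGNMENT))
print(profile(ALIGNMENT))
```

The consensus keeps one letter per column, the pattern keeps the set of observed letters, and the profile also keeps how often each letter occurs, which is why the information content grows in that order.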

13 InterPro  InterPro: a collection of many protein signature databases (Prosite, Pfam, Prints … ) integrated into a hierarchical classification system

14 InterPro example

15 PTM – Post-Translational Modification

16 PTM – Post-Translational Modification  Phosphorylation: Tyr, Ser, Thr  Glycosylation (addition of sugars): Asn, Ser, Thr  Addition of fatty acids (e.g. N-myristoylation, S-palmitoylation)

17 So how to predict?  Take into account: 1. Context (motif): PKC (a kinase) recognizes X-S/T-X-R/K; N-myristoylation occurs at M-G-X-X-X-S/T. In many cases we don't know the exact motif! 2. Conservation: is the motif found (for instance, in human) also conserved in related organisms (for instance, in chimp)?
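
The two context motifs on this slide can be scanned for with regular expressions. This is only a sketch of the "context" step: the function names are made up, anchoring the myristoylation motif at the start of the sequence is an assumption, and a match is a candidate site, not a confirmed modification.

```python
import re

# X-S/T-X-R/K, wrapped in a lookahead so overlapping sites are all reported.
PKC_SITE = re.compile(r"(?=(.[ST].[RK]))")
# M-G-X-X-X-S/T, assumed here to sit at the very start of the protein.
N_MYRISTOYLATION = re.compile(r"^MG...[ST]")

def find_pkc_sites(sequence):
    """Return 0-based positions of the S/T residue in each candidate site."""
    return [match.start() + 1 for match in PKC_SITE.finditer(sequence)]

def has_n_myristoylation_signal(sequence):
    """Return True if the sequence starts with M-G-X-X-X-S/T."""
    return N_MYRISTOYLATION.match(sequence) is not None

print(find_pkc_sites("AASAKAATGR"))
print(has_n_myristoylation_signal("MGAAASLL"))
```

Because such motifs are only four to six residues long, they match many sequences by chance, which is one reason conservation is checked as a second line of evidence.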

18 Prediction problems  Signal for detection is very short  Not enough biological knowledge for characterizing the signal  Tertiary structure

19 Prediction will be more efficient if more information is available

20 Secondary Structure

21 Secondary Structure  Reminder: secondary structure is usually divided into three categories: alpha helix, beta strand (sheet), and anything else (turn/loop)

22 Secondary Structure  An easier question – what is the secondary structure when the 3D structure is known?

23 DSSP  DSSP (Dictionary of Protein Secondary Structure) assigns secondary structure to proteins whose 3D structure is known (for example, from a crystal structure).  H = alpha helix; B = residue in an isolated beta bridge; E = extended beta strand; G = 3-turn helix; I = 5-turn helix; T = hydrogen-bonded turn; S = bend
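
Since prediction methods usually work with the three categories from the earlier slide, the DSSP codes are often collapsed to three states. The mapping below is one common convention and an assumption here, not something the slides specify; other reductions exist.

```python
# Assumed reduction: all helix types -> H, strand and bridge -> E, the rest -> C.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-turn and 5-turn helices
    "E": "E", "B": "E",             # extended strand and isolated bridge
    "T": "C", "S": "C",             # turn and bend
}

def reduce_dssp(dssp_string):
    """Collapse a string of DSSP codes to H (helix), E (strand), C (other)."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "C") for code in dssp_string)

print(reduce_dssp("HHGGTTEEB S-"))
```

Any code not listed in the table, including a blank, is treated as "anything else".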

24 Predicting secondary structure from primary sequence

25 Chou and Fasman (1974)  Table of propensities with columns Name, P(a), P(b), P(turn) and one row for each of the 20 amino acids: Alanine, Arginine, Aspartic Acid, Asparagine, Cysteine, Glutamic Acid, Glutamine, Glycine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Proline, Serine, Threonine, Tryptophan, Tyrosine, Valine.  Each value is the propensity of an amino acid to be part of a certain secondary structure (e.g. Proline has a low propensity of being in an alpha helix or beta sheet, so it acts as a breaker)

26 Chou-Fasman prediction  Look for a series of >4 amino acids which all have (for instance) alpha helix values >100  Extend ( … )  Accept as alpha helix if average alpha score > average beta score  Example sequence, scored for α and β: Ala Pro Tyr Phe Phe Lys Lys His Val Ala Thr
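
The first and last steps above (find a run of more than 4 residues with alpha values above 100, then compare average alpha and beta scores) can be sketched as follows. The extension step is left out, the function name is hypothetical, and the propensity numbers are toy values chosen for illustration, not the published Chou-Fasman table.

```python
# Toy propensities for a handful of residues (NOT the published values).
TOY_ALPHA = {"A": 140, "E": 150, "L": 120, "K": 115, "P": 55, "G": 55}
TOY_BETA = {"A": 80, "E": 40, "L": 130, "K": 75, "P": 55, "G": 75}

def find_helix_nuclei(sequence, p_alpha, p_beta, window=5, threshold=100):
    """Return (start, end) windows that qualify as alpha-helix nuclei.

    A window qualifies if every residue has an alpha value above the
    threshold and the mean alpha score exceeds the mean beta score.
    """
    nuclei = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if all(p_alpha[aa] > threshold for aa in segment):
            mean_alpha = sum(p_alpha[aa] for aa in segment) / window
            mean_beta = sum(p_beta[aa] for aa in segment) / window
            if mean_alpha > mean_beta:
                nuclei.append((start, start + window))
    return nuclei

print(find_helix_nuclei("GAEKAEPG", TOY_ALPHA, TOY_BETA))
```

With real propensity tables the same loop would be run once for helices and once for strands, and overlapping predictions resolved by the score comparison.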

27 Chou and Fasman (1974)  Success rate of 50%

28 Improvements in the 1980s  Conservation in MSA  Smarter algorithms (e.g. HMM, neural networks).

29 Accuracy  Accuracy of prediction seems to hit a ceiling of 70-80%.  Accuracy by method: Chou & Fasman, 50%; adding the MSA, 69%; MSA + sophisticated computations, 70-80%
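
The slides do not say how accuracy is measured. A common choice, assumed here, is the per-residue three-state accuracy (often called Q3): the fraction of residues whose predicted state equals the observed one. The function name and example strings are illustrative.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Predicted vs. observed states for an 8-residue toy example.
print(q3_accuracy("HHHCCEEE", "HHHHCEEC"))
```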

30 Gene Ontology

31 GO  Gene Ontology: a project for consistent description of gene products in different databases.  Consistent description means common key definitions. Example: ‘protein synthesis’ or ‘translation’

32 GO  GO describes proteins in terms of: biological process, cellular component, molecular function  GO is not: a sequence database; a portal for sequence information

33 GO – structure  Example hierarchy, from general to specific: cellular component, cell, nucleus, nuclear chromosome
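
The hierarchy on this slide can be sketched as a parent lookup. The dictionary below encodes only the four terms shown, with simplified relationships; in the real ontology a term can have more than one parent, which is why the values are lists. The names `PARENTS` and `ancestors` are illustrative.

```python
# Each term maps to its direct parents (simplified from the slide).
PARENTS = {
    "nuclear chromosome": ["nucleus"],
    "nucleus": ["cell"],
    "cell": ["cellular component"],
    "cellular component": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term above the given one, nearest first."""
    found = []
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents[current])
    return found

print(ancestors("nuclear chromosome"))
```

A protein annotated with a specific term is implicitly annotated with all of that term's ancestors as well.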

34 GO example  Links from the Swiss-Prot entry of human protein kinase C alpha

35 Examples for use of GO  Enrichment for a GO category: 1. Do all up-regulated genes in a microarray you built belong to the same GO “molecular function” category? 2. You have predicted a new transcription factor binding site. Do all genes with this site belong to the same GO biological process?
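
In practice not all genes in a list share a category, so enrichment is usually phrased as a statistical question. One standard way to ask it, assumed here since the slides do not name a test, is the hypergeometric tail probability: how likely is it to see at least this many genes from a category in a list of this size by chance? The function name and the numbers are illustrative.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """P(at least `hits` annotated genes in a random sample).

    population: number of genes on the array
    annotated:  genes on the array that carry the GO category
    sample:     size of the gene list (e.g. up-regulated genes)
    hits:       genes in the list that carry the GO category
    """
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, min(annotated, sample) + 1))
    return tail / total

# Toy numbers: 10 genes, 5 annotated, a list of 5 genes that are all annotated.
print(enrichment_p_value(10, 5, 5, 5))
```

A small probability suggests the category is over-represented in the list; when many categories are tested, the values need a multiple-testing correction.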

36 Evaluation of prediction methods

37 Evaluation of prediction methods  Comparing our results to experimentally verified sites.  Our prediction gives: positive (hit) or negative.  Is the prediction correct?  True: true-positive or true-negative.  False: false-positive (false alarm) or false-negative (miss).

38 Method evaluation  Our prediction gives: positive (hit) or negative.  Is the prediction correct?  True: true-positive or true-negative.  False: false-positive (false alarm) or false-negative (miss).  A good method will be one with a high level of true-positives and true-negatives, and a low level of false-positives and false-negatives

39 Calibrating the method  All methods have a parameter (or a score) that can be calibrated to improve the accuracy of the method.  For example: the E-value cutoff in BLAST

40 Calibrating E-value cutoff  Reminder: the lower the E-value, the more ‘significant’ the alignment between the query and the hit.

41 Calibrating the E-value  What will happen if we raise the E-value cutoff (for instance, work with all hits with an E-value < 10)?  Our prediction gives: positive (hit) or negative.  Is the prediction correct?  True: true-positive or true-negative.  False: false-positive (false alarm) or false-negative (miss).

42 Calibrating the E-value  On the other hand, what if we lower the E-value cutoff (look only at hits with E-value < )?  Our prediction gives: positive (hit) or negative.  Is the prediction correct?  True: true-positive or true-negative.  False: false-positive (false alarm) or false-negative (miss).

43 Improving prediction  Trade-off between specificity and sensitivity

44 Sensitivity vs. specificity  Sensitivity = True positive / (True positive + False negative). The denominator represents all the proteins which are really phosphorylated, so sensitivity measures how well we hit real phosphorylations.  Specificity = True negative / (True negative + False positive). The denominator represents all the proteins which are really NOT phosphorylated, so specificity measures how well we avoid real non-phosphorylations.
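
The two definitions translate directly into code. The counts used below are invented for illustration.

```python
def sensitivity(true_positive, false_negative):
    """Fraction of real positives that the method predicts as positive."""
    return true_positive / (true_positive + false_negative)

def specificity(true_negative, false_positive):
    """Fraction of real negatives that the method predicts as negative."""
    return true_negative / (true_negative + false_positive)

# Toy example: 10 really phosphorylated proteins, 100 that are not.
print(sensitivity(true_positive=8, false_negative=2))
print(specificity(true_negative=90, false_positive=10))
```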

45  Raising the E-value cutoff to 10: sensitivity increases, specificity decreases  Lowering the E-value cutoff: sensitivity decreases, specificity increases
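
The effect of moving the cutoff can be seen on a small made-up example. The hit list below is hypothetical: each entry pairs an E-value with whether the hit is a real homolog, and the two cutoffs are chosen only to show the direction of the change.

```python
# Hypothetical hits: (E-value, is the hit a real homolog?)
HITS = [
    (1e-50, True), (1e-20, True), (1e-5, True), (0.5, True),
    (1e-3, False), (2.0, False), (8.0, False), (50.0, False),
]

def evaluate_cutoff(hits, cutoff):
    """Return (sensitivity, specificity) when accepting hits below the cutoff."""
    tp = sum(1 for e, real in hits if e < cutoff and real)
    fp = sum(1 for e, real in hits if e < cutoff and not real)
    fn = sum(1 for e, real in hits if e >= cutoff and real)
    tn = sum(1 for e, real in hits if e >= cutoff and not real)
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (10, 1e-10):
    sens, spec = evaluate_cutoff(HITS, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

A permissive cutoff keeps every real homolog but lets false alarms in, while a strict one removes the false alarms at the cost of missed homologs.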

46 Over-predictions: example  Many PTM predictors tend to over-predict: a high level of false positives means low specificity.  WHY? 1. Tertiary structure! (buried/exposed, tertiary motifs) 2. The phosphorylation recognition mechanism is not completely clear!

47 Next time on: Biological Sequences Analysis

48 The Human Genome

49 Horizontal (Lateral) Gene Transfer

50 Alternative splicing

51 Repetitive Elements