School B&I TCD Bioinformatics Proteins: structure, function, databases, formats.


Wot’s a protein, then?
Hierarchical:
A collection of amino acids (0-D)
– AACompIdent can identify a protein from its AA percentages
A sequence (string) of AAs (1-D)
Secondary structural elements: α-helix etc. (2-D)
Domains: (independent) functional units
Whole protein (from a single CDS) (3-D)
Quaternary structure: dimers, ribosomes
Interactome, pathways

Protein functions

Amino acid properties again … and again and again

Amino acid groups
KR (Lys, Arg): NH3+, basic
DE (Asp, Glu): COO-, acidic
WYF (Trp, Tyr, Phe): large aromatic
GP (Gly, Pro): α-helix breaking
C (Cys): disulphide –S–S– bridges
– C also has roles other than disulphide bridges

Secondary structure
α-helix (no Pro, Gly)
– 3.6 residues per turn
– Leucine zipper …LXXXXXXLXXXXXXL…
– Amphipathic helix (charged on one side)
– Transmembrane (α-helix, hydrophobic, ~21 AA long)
β-sheet
– 2-dimensional zigzag
Coil, random
Turn (kink)
Easy, like exon prediction
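The leucine zipper spacing on this slide (a Leu at every seventh position) can be searched for with a regular expression. This is only a rough sketch: the function name and the toy sequence are made up for illustration, and a real zipper prediction would need far more than this one pattern.

```python
import re

# L, six residues of any kind, L, six more, L: the ...LXXXXXXLXXXXXXL... pattern.
# The lookahead lets overlapping candidates be reported too.
HEPTAD_LEU = re.compile(r"(?=L.{6}L.{6}L)")

def find_leucine_heptads(seq):
    """Return 0-based start positions of L-x(6)-L-x(6)-L in seq."""
    return [m.start() for m in HEPTAD_LEU.finditer(seq.upper())]

# Toy sequence (hypothetical, not a real protein).
toy = "AAAL" + "AAAAAAL" * 2 + "AAA"
print(find_leucine_heptads(toy))
```

A hit only says the spacing is present; it says nothing about whether the region is actually helical.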

Patterns to recognise (more reliable in an MSA than in a single sequence)
Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
Runs of hydrophobic residues
– Interior/buried β-sheet
Residues with 3.5 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
Gaps/indels
– Probably surface, not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%

Conserved residues
W, F, Y: large hydrophobic, internal/core
– conserved W, F, Y are the best signal for domains
G, P: turns, can mark the end of an α-helix or β-sheet
C: conserved with reliable spacing suggests C-C disulphide bridges (e.g. defensins)
H, S: often catalytic sites in proteases (and other enzymes)
K, R, D, E: charged; ligand binding or salt bridge
L: very common AA but not conserved
– except in the leucine zipper L234567L234567L234567L

Basic information
How big is my protein?
Where are the β-sheets?
Is there a signal peptide?
Is there a trypsin cleavage site?
ProtParam tool (MWt etc.)
TMpred, TMHMM: transmembrane helices, inside/outside, external loops
JPRED for 2-D (secondary) structure
See the practical manual for examples
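The trypsin question can be answered directly from the sequence. A minimal sketch, assuming the commonly used simplified rule (cut after K or R unless the next residue is P); the function names and the test peptide are illustrative only.

```python
def trypsin_sites(seq):
    """Positions (as cut indices) after K or R, skipping K/R followed by P."""
    seq = seq.upper()
    return [i + 1 for i, aa in enumerate(seq[:-1])
            if aa in "KR" and seq[i + 1] != "P"]

def trypsin_digest(seq):
    """Split seq into the peptides implied by trypsin_sites."""
    seq = seq.upper()
    cuts = [0] + trypsin_sites(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

# Hypothetical peptide: R is followed by P, so no cut there.
print(trypsin_digest("AKRPGKL"))
```

Real digests are messier (missed cleavages, neighbouring acidic residues), so treat the output as a first guess.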

Tertiary structure
The holy grail of bioinformatics
3-D orientation of known α, β
Proteins are made of functional units, “domains”
– Tried, tested module
– Domain shuffling and exon boundaries
Bioinformatics tries to make predictive calls on aspects of the 3-D structure
Q. Why is 3-D important?
A. Structure = function
Difficult, like gene prediction

What bioinformatics can do about 3-D
Expressed/exported proteins have a signal peptide
Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
But homology to a known 3-D structure (X-ray, NMR) is the best predictor: threading
Plan to X-ray all “folds” in the human genome
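A hydropathy plot is just a sliding-window average over a per-residue scale. The sketch below assumes the Kyte-Doolittle scale and a window of 9 residues; neither choice comes from the slide, and other scales or window sizes are equally plausible.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of every full window along seq."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical sequence: charged ends around a hydrophobic stretch.
profile = hydropathy_profile("KKDDEELLIIVVLLIIVVAAKKDDEE")
print(max(profile), min(profile))
```

Long stretches with high averages are candidates for buried or membrane-spanning segments; low averages suggest surface exposure.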

recA

SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry:
ID RECA_ECOLI STANDARD; PRT; 352 AA.
AC P0A7G6; P03017; P26347; P78213;
RX MEDLINE= ; PubMed= ;;
RA Story R.M.,Weber I.T.,Steitz T.A.;
RT "The structure of the E. coli recA protein";
RL Nature 355: (1992).
DR EMBL; V00328; CAA ; -; Genomic_DNA.
DR PDB; 2REB;
DR PRINTS; PR00142; RECA.
DR ProDom; PD000229; RecA; 1.
DR SMART; SM00382; AAA; 1.
DR TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR PROSITE; PS00321; RECA_1; 1.
FT HELIX
FT TURN
FT STRAND
FT HELIX
UniProt is the key hub of bioinformatics databases
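The DR lines are what make UniProt a hub: each one points to another database. A rough sketch of pulling them out of the flat-file text, assuming the two-letter line codes shown on the slide; the function name is made up, and a real project would more likely use an existing parser library.

```python
def cross_references(entry_text):
    """Map database name -> list of primary identifiers from DR lines."""
    xrefs = {}
    for line in entry_text.splitlines():
        if not line.startswith("DR "):
            continue
        # Fields are ';'-separated; the line may end with a full stop.
        fields = [f.strip() for f in line[3:].rstrip(".").split(";") if f.strip()]
        if len(fields) >= 2:
            xrefs.setdefault(fields[0], []).append(fields[1])
    return xrefs

# A few lines copied from the slide.
sample = """ID RECA_ECOLI STANDARD; PRT; 352 AA.
DR PDB; 2REB;
DR PRINTS; PR00142; RECA.
DR PROSITE; PS00321; RECA_1; 1."""

for db, ids in cross_references(sample).items():
    print(db, ids)
```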

Homology?
LVMFWSIVGE Known1
L   W   GE
LIVYWTVIGE Unknown 40% ID
ILVFYTVVGD Known2
  V  TV G
LIVYWTVIGE Unknown 40% ID
Is Unknown part of the same family? Or is this just a 4/10 coincidence?
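The 40% figures can be checked by counting identical positions. A minimal sketch for sequences that are already aligned and of equal length (no gaps handled); the function name is illustrative.

```python
def percent_identity(a, b):
    """Percentage of positions with the same residue in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

unknown = "LIVYWTVIGE"
print(percent_identity("LVMFWSIVGE", unknown))  # Known1 vs Unknown
print(percent_identity("ILVFYTVVGD", unknown))  # Known2 vs Unknown
```

Identity alone ignores that the mismatches here are mostly conservative swaps (I/L/V, F/Y/W, S/T, D/E).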

RegEx
LVMFWSIVGE Known1
ILVFYTVVGD Known2
[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
LIVYWTVIGE Unknown
* ***** **
More convincing that it is the same family?
How would you modify the RegEx to include the 3rd sequence?
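The pattern can be tried out by translating it into a Python regular expression. The translation helper below is a hand-rolled sketch that only covers the syntax used here (`-` separators, `(n)` repeats, `x` wildcards), not the full PROSITE pattern language. The widened pattern is one possible answer to the question about the 3rd sequence, not necessarily the intended one.

```python
import re

def prosite_to_regex(pattern):
    """Minimal translation: drop '-', turn (n) into {n}, and x into '.'."""
    regex = pattern.replace("-", "")
    regex = re.sub(r"\((\d+)\)", r"{\1}", regex)
    return regex.replace("x", ".")

slide_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]"
# Assumption: allow I as well as V at position 8 so that Unknown fits.
widened_pattern = "[MILV](3)-[FYW](2)-[STA]-[MILV]-[IV]-G-[DE]"

seqs = {"Known1": "LVMFWSIVGE", "Known2": "ILVFYTVVGD", "Unknown": "LIVYWTVIGE"}

def matches(pattern, seq):
    return re.fullmatch(prosite_to_regex(pattern), seq) is not None

for name, seq in seqs.items():
    print(name, matches(slide_pattern, seq), matches(widened_pattern, seq))
```

Unknown fails the original pattern only at position 8 (I instead of V), which is why widening that one position is enough.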

Family Databases Three methods

Prosite
Groups families by a conserved motif, which is
– Present in all family members
– Absent in all other proteins
No/few false positives (selectivity)
All true positives (sensitivity)
Motif defined with a regular expression

What Prosite looks like
ID RECA_1; PATTERN.
AC PS00321;
DT APR-1990 (CREATED); NOV-1997
DE recA signature.
PA A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR /RELEASE=49.0,207132;
NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
DR Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
DR P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
Etc. for 70 lines
DR Q7UJJ0,RECA_RHOBA,N; Q9EVV7,RECA_STRTR,N; (false negatives)
DR Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F; (false positives)
3D 2REB; 2REC; (PDB structures)
DO PDOC00131; (documentation)
cf. SwissProt
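The NR lines give enough counts to put numbers on how well the recA pattern performs. A small sketch that reads the counts from the slide's NR lines and computes sensitivity and precision; "precision" is used here as a stand-in for what the slides call selectivity, and the parsing is a simplification that keeps only the first integer after each key.

```python
import re

nr_lines = """NR /RELEASE=49.0,207132;
NR /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;"""

def parse_nr(text):
    """Return {key: count}; RELEASE is skipped because it is not a count."""
    return {k: int(v) for k, v in re.findall(r"/(\w+)=(\d+)", text)
            if k != "RELEASE"}

counts = parse_nr(nr_lines)
tp, fp, fn = counts["POSITIVE"], counts["FALSE_POS"], counts["FALSE_NEG"]

sensitivity = tp / (tp + fn)  # fraction of true recAs that the pattern finds
precision = tp / (tp + fp)    # fraction of pattern hits that are true recAs

print(f"sensitivity = {sensitivity:.3f}, precision = {precision:.3f}")
```

With 11 false negatives against 2 false positives, the pattern misses real family members more often than it picks up strangers.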

Prosite problems
The RegEx is now breaking down as the number of recAs increases, so it no longer defines the protein
The database is now huge, so the probability of finding any short motif by chance is high
– Many copies of ELVIS hiding in UniProt
There may be more than one motif defining a family
A great first attempt and still useful, but too crude
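The ELVIS point can be made with a back-of-envelope estimate. The sketch below assumes all 20 amino acids are equally likely at every position and ignores sequence ends; the database size is a made-up round number, not a real UniProt statistic.

```python
def expected_chance_hits(motif_length, total_residues, p_per_position=1 / 20):
    """Expected number of chance matches of one exact motif.

    Assumes independent positions with probability p_per_position each,
    and roughly one possible start per residue in the database.
    """
    return total_residues * p_per_position ** motif_length

# Hypothetical database of 100 million residues.
db_residues = 10**8
for motif in ("ELVIS", "ELVISLIVES"):
    print(motif, expected_chance_hits(len(motif), db_residues))
```

Under these assumptions a 5-residue exact motif turns up dozens of times by chance, while a 10-residue one essentially never does; patterns with ambiguous positions such as [LIVM] are hit far more easily.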

Prints
A database of multiple domains/motifs
Multiple motifs abstracted to the database
Stored as a probability matrix
If two proteins have the same motifs in the same order, they are likely to be homologous
More biological/real/sensitive than Prosite

ProDom
A French DB
All-against-all search of the nr protein DB
Includes domains with no known function
– cf. synteny of non-coding regions
Great for determining the domain structure of a particular protein

Pfam
Moves up from the short, highly conserved, easily aligned bits of a protein family
Uses a PSSM (position-specific scoring matrix) … on complete aligned family members

PSSM
Multiple sequence alignment:
NSGTIVFLWP
DSGTAIFLKP
ESGTIIFLHN
DSDTVRSLKP
Posn1 50% D,N,E
Posn2 100% S
Posn3 75% G,D
Posn4 100% T
Posn5 50% I,A,V
Posn6 50% I,V,R
Posn7 75% F,S
Posn8 100% L
Posn9 50% K,H,W
Posn10 75% P,N
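The per-position percentages can be reproduced by counting residues column by column. This sketch gives raw frequencies only; scoring matrices used in practice typically add pseudocounts and compare against background frequencies, which is left out here.

```python
from collections import Counter

alignment = ["NSGTIVFLWP", "DSGTAIFLKP", "ESGTIIFLHN", "DSDTVRSLKP"]

def column_frequencies(seqs):
    """One dict per alignment column: residue -> fraction of sequences."""
    return [{aa: n / len(seqs) for aa, n in Counter(col).items()}
            for col in zip(*seqs)]

freqs = column_frequencies(alignment)
for pos, col in enumerate(freqs, start=1):
    top_aa, top_freq = max(col.items(), key=lambda kv: kv[1])
    others = ",".join(sorted(aa for aa in col if aa != top_aa))
    print(f"Posn{pos} {top_freq:.0%} {top_aa}" + (f" (also {others})" if others else ""))
```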

Domain take home
Run your protein against
– InterProScan
– CD server at NCBI
– Pfscan
It is likely that the crucial bit of info is in only one of the above