School B&I TCD Bioinformatics. Proteins: structure, function, databases, formats.


Wot’s a protein, then? Hierarchical
A collection of amino acids (0-D)
– AACompIdent can identify a protein from its AA percentages
A sequence (string) of AAs (1-D)
Secondary structural elements: α-helix etc. (2-D)
Domains – (independent) functional units
Whole protein (from a single CDS) (3-D)
Quaternary structure: dipeptides, ribosomes
Interactome, pathways

Protein functions

Amino acid properties again … and again and again

Amino acid groups
KR (Lys, Arg) NH3+ basic
DE (Asp, Glu) COO- acidic
WYF (Trp, Tyr, Phe) large aromatic
GP (Gly, Pro) α-helix breaking
C (Cys) disulphide –S–S– bridges
– C also occurs outside disulphide bridges, etc.

Secondary structure
α-helix (no Pro, Gly)
– 3.6 residues per turn
– Leucine zipper …LXXXXXXLXXXXXXL…
– Amphipathic helix (charged on one side)
– Transmembrane (α-helix, hydrophobic, ~21 AA long)
β-sheet
– 2-dimensional zigzag
Coil, random
Turn (kink)
Easy, like exon prediction

Patterns to recognise (more reliable in an MSA than in a single sequence)
Alternating hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
Runs of hydrophobic residues
– Interior/buried β-sheet
Residues with 3.5 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
Gaps/indels
– Probably surface, not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%

Conserved residues
W, F, Y: large hydrophobic, internal/core
– conserved W, F, Y are the best signal for domains
G, P: turns, can mark the end of an α-helix or β-sheet
C: conserved with reliable spacing suggests C–C disulphide bridges, e.g. defensins
H, S: often catalytic sites in proteases (and other enzymes)
K, R, D, E: charged, ligand binding or salt bridge
L: very common AA but not conserved
– except in a leucine zipper: L234567L234567L234567L

Basic information
How big is my protein? Where are the β-sheets? Is there a signal peptide? Is there a trypsin cleavage site?
ProtParam tool (MWt etc.)
TMpred, TMHMM: transmembrane helices, inside/outside, external loops
JPRED for secondary (2-D) structure
See the practical manual for examples
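The trypsin question on this slide can be sketched in a few lines. This is a minimal sketch, assuming the commonly quoted rule that trypsin cuts on the C-terminal side of K or R unless the next residue is P. The function names and the example peptide are made up for illustration, and real tools apply further exceptions.

```python
import re


def trypsin_sites(seq: str) -> list[int]:
    """Return 1-based positions of residues after which a cut is expected."""
    # K or R followed by any residue other than P (assumed simplified rule).
    return [m.start() + 1 for m in re.finditer(r"[KR](?=[^P])", seq)]


def digest(seq: str) -> list[str]:
    """Split a sequence into peptides at the predicted cleavage sites."""
    peptides, start = [], 0
    for pos in trypsin_sites(seq):
        peptides.append(seq[start:pos])
        start = pos
    peptides.append(seq[start:])
    return peptides


if __name__ == "__main__":
    peptide = "MKTAYRPLLKGE"  # hypothetical example peptide
    print(trypsin_sites(peptide))
    print(digest(peptide))
```

The R in the example is followed by P, so it is skipped; only the two K residues are reported as sites.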

Tertiary structure
The holy grail of bioinformatics
3-D orientation of known α, β elements
Proteins are made of functional units, “domains”
– Tried and tested modules
– Domain shuffling and exon boundaries
Bioinformatics tries to make predictive calls on aspects of the 3-D structure
Q. Why is 3-D important? A. Structure = function
Difficult, like gene prediction

What bioinformatics can do about 3-D
Expressed/exported proteins have a signal peptide
Hydropathy plot, antigenicity index, amphipathicity: get a handle on surface probability
But homology to a known 3-D structure (X-ray, NMR) is the best predictor: threading
Plan to X-ray all “folds” in the human genome

SwissProt/UniProt
Some of the 194 lines of info in a SwissProt entry:
ID   RECA_ECOLI STANDARD; PRT; 352 AA.
AC   P0A7G6; P03017; P26347; P78213;
RX   MEDLINE=92114994; PubMed=1731246;
RA   Story R.M., Weber I.T., Steitz T.A.;
RT   "The structure of the E. coli recA protein";
RL   Nature 355:318-325(1992).
DR   EMBL; V00328; CAA23618.1; -; Genomic_DNA.
DR   PDB; 2REB; X-ray; @=-.
DR   PRINTS; PR00142; RECA.
DR   ProDom; PD000229; RecA; 1.
DR   SMART; SM00382; AAA; 1.
DR   TIGRFAMs; TIGR02012; tigrfam_recA; 1.
DR   PROSITE; PS00321; RECA_1; 1.
FT   HELIX   72   85
FT   TURN    86   87
FT   STRAND  90   94
FT   HELIX  101  106
UniProt is the key hub of bioinformatics databases

Homology?
LVMFWSIVGE  Known1
L   W   GE
LIVYWTVIGE  Unknown   40% ID

ILVFYTVVGD  Known2
  V  TV G
LIVYWTVIGE  Unknown   40% ID
Is Unknown part of the same family? Or is this just a 4/10 coincidence?
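The 40% figures can be checked directly. A minimal sketch, assuming ungapped sequences of equal length as on the slide; the function names are illustrative.

```python
def match_line(a: str, b: str) -> str:
    """Show identical residues, with a blank at every mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    return "".join(x if x == y else " " for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    unknown = "LIVYWTVIGE"
    for name, known in [("Known1", "LVMFWSIVGE"), ("Known2", "ILVFYTVVGD")]:
        print(name, repr(match_line(known, unknown)),
              percent_identity(known, unknown))
```

Both comparisons give 4 identities out of 10, but at different positions, which is why simple pairwise identity cannot settle the family question here.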

RegEx
LVMFWSIVGE  Known1
ILVFYTVVGD  Known2
[MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE]
LIVYWTVIGE  Unknown
* ***** **
More convincing that it is the same family?
How would you modify the RegEx to include the 3rd sequence?
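The pattern on this slide translates directly into a standard regular expression. The sketch below hand-translates it for Python's `re` module and tests all three sequences. The relaxed pattern is only one possible answer to the slide's question (it assumes the fix is to allow I as well as V at position 8) and is not taken from the source.

```python
import re

# Hand translation of [MILV](3)-[FYW](2)-[STA]-[MILV]-V-G-[DE].
PATTERN = re.compile(r"[MILV]{3}[FYW]{2}[STA][MILV]VG[DE]")

# Assumed relaxation: position 8 accepts I or V so that Unknown also fits.
RELAXED = re.compile(r"[MILV]{3}[FYW]{2}[STA][MILV][IV]G[DE]")

SEQS = {
    "Known1": "LVMFWSIVGE",
    "Known2": "ILVFYTVVGD",
    "Unknown": "LIVYWTVIGE",
}

if __name__ == "__main__":
    for name, seq in SEQS.items():
        strict = bool(PATTERN.fullmatch(seq))
        relaxed = bool(RELAXED.fullmatch(seq))
        print(f"{name:8s} strict={strict} relaxed={relaxed}")
```

Unknown fails the strict pattern only at the fixed V, where it carries I. Every relaxation of this kind also raises the chance of matching unrelated proteins.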

Family Databases Three methods

Prosite
Groups families by a conserved motif, which is:
– Present in all family members
– Absent in all other proteins
No/few false positives (selectivity)
All true positives (sensitivity)
Motif defined with a regular expression

What Prosite looks like
ID   RECA_1; PATTERN.
AC   PS00321;
DT   APR-1990 (CREATED); NOV-1997
DE   recA signature.
PA   A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.
NR   /RELEASE=49.0,207132;
NR   /TOTAL=281(281); /POSITIVE=279(279); /UNKNOWN=0(0);
NR   /FALSE_POS=2(2); /FALSE_NEG=11; /PARTIAL=10;
DR   Q01840,RECA1_LACLA,T; P48291,RECA1_MYXXA,T;
DR   P48292,RECA2_MYXXA,T; Q9ZUP2,RECA3_ARATH,T;
(etc. for 70 lines)
DR   Q7UJJ0,RECA_RHOBA,N; Q9EVV7,RECA_STRTR,N;
DR   Q4X0X6,EXO70_ASPFU,F; Q5AZS0,EXO70_EMENI,F;
3D   2REB; 2REC;
DO   PDOC00131;
Labels: DR entries flagged N = false negatives; flagged F = false positives; 3D = PDB structures; DO = documentation (cf. SwissProt)
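The PA line can be turned into an ordinary regular expression mechanically. The sketch below is a simplified converter written for illustration, not the official Prosite tooling. It handles plain residues, `x`, `[...]`, `{...}`, `(n)`, `(n,m)` and the `<` / `>` anchors, and ignores rarer syntax. The test peptide is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a Prosite PA pattern into Python regex syntax (simplified)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    reca = prosite_to_regex("A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R.")
    print(reca)
    # Hypothetical peptide containing one match to the recA signature.
    print(re.search(reca, "GGALKIFSSLRGG"))
```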

Prosite problems
The RegEx is breaking down as the number of known recAs increases, so it no longer defines the protein
The database is now huge, so the probability of finding any short motif by chance is high
– Many copies of ELVIS hiding in UniProt
There may be more than one motif defining a family
A great first attempt and still useful, but too crude
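The "ELVIS" point can be made with back-of-envelope arithmetic. The sketch below assumes all 20 amino acids are equally likely and independent, which real proteins are not, and uses a hypothetical database size of 100 million residues chosen only for illustration. The function name is also illustrative.

```python
def expected_chance_hits(allowed_per_position: list[int],
                         db_residues: int,
                         alphabet: int = 20) -> float:
    """Expected number of random matches to a motif in a database.

    allowed_per_position: how many residues the motif accepts at each position.
    db_residues: total residues searched (approximates the number of start sites).
    """
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / alphabet
    return probability * db_residues


if __name__ == "__main__":
    DB_RESIDUES = 100_000_000  # assumed size, not a real UniProt figure

    # E-L-V-I-S: five fixed positions.
    print(expected_chance_hits([1, 1, 1, 1, 1], DB_RESIDUES))

    # A-L-[KR]-[IF]-[FY]-[STA]-[STAD]-[LIVMQ]-R: nine positions.
    print(expected_chance_hits([1, 1, 2, 2, 2, 3, 4, 5, 1], DB_RESIDUES))
```

Under these assumptions a five-residue word turns up dozens of times by chance, while the nine-position recA signature is expected well under once. Short motifs lose their meaning as the database grows.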

Prints
A database of multiple domains/motifs
Multiple motifs abstracted to the database
Stored as probability matrices
If two proteins have the same motifs in the same order, they are likely to be homologous
More biological/real/sensitive than Prosite

ProDom
A French DB
All-against-all search of the nr protein DB
Includes domains with no known function
– cf. synteny of non-coding regions
Great for determining the domain structure of a particular protein

Pfam
Moves up from the short, highly conserved, easily aligned bits of a protein family
Uses a PSSM (position-specific scoring matrix)
… on complete aligned family members

PSSM
Multiple sequence alignment:
1234567890
NSGTIVFLWP
DSGTAIFLKP
ESGTIIFLHN
DSDTVRSLKP
Position, frequency of the most common residue, residues seen (most common first):
Posn1   50%  D,N,E
Posn2  100%  S
Posn3   75%  G,D
Posn4  100%  T
Posn5   50%  I,A,V
Posn6   50%  I,V,R
Posn7   75%  F,S
Posn8  100%  L
Posn9   50%  K,H,W
Posn0   75%  P,N
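The per-position percentages follow from counting each alignment column. The sketch below computes raw column frequencies and a naive score for a query. It is deliberately simplified: real PSSMs normally use log-odds against background frequencies with pseudocounts, and the function names here are illustrative.

```python
from collections import Counter

ALIGNMENT = [
    "NSGTIVFLWP",
    "DSGTAIFLKP",
    "ESGTIIFLHN",
    "DSDTVRSLKP",
]


def column_frequencies(alignment: list[str]) -> list[dict[str, float]]:
    """Fraction of sequences carrying each residue, per alignment column."""
    n = len(alignment)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def naive_score(seq: str, freqs: list[dict[str, float]]) -> float:
    """Sum of observed frequencies of the query's residues (unseen = 0)."""
    return sum(col.get(aa, 0.0) for aa, col in zip(seq, freqs))


def consensus(freqs: list[dict[str, float]]) -> str:
    """Most frequent residue in each column."""
    return "".join(max(col, key=col.get) for col in freqs)


if __name__ == "__main__":
    freqs = column_frequencies(ALIGNMENT)
    for position, col in enumerate(freqs, start=1):
        print(position, sorted(col.items(), key=lambda kv: -kv[1]))
    print(consensus(freqs), naive_score(consensus(freqs), freqs))
```

Unlike a regular expression, this kind of matrix rewards a near miss instead of rejecting it outright.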

Domain take-home
Run your protein against:
– InterProScan
– CD server at NCBI
– Pfscan
It is likely that the crucial bit of info is in only one of the above
