Simple Rearrangements Reversals Blocks represent conserved genes. 1 32 4 10 5 6 8 9 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

Presentation transcript:

1

2 Simple Rearrangements

3

4 Reversals. Blocks represent conserved genes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

5 Reversals. Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10: the segment 4,…,8 is reversed and each of its blocks changes sign.
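The reversal on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `reverse_segment` and the 0-based inclusive indexing are illustrative choices, not taken from the slides.

```python
def reverse_segment(perm, i, j):
    """Return a copy of `perm` with positions i..j (0-based, inclusive)
    reversed in order and flipped in sign."""
    flipped = [-gene for gene in reversed(perm[i:j + 1])]
    return perm[:i] + flipped + perm[j + 1:]


identity = list(range(1, 11))                 # blocks 1..10
rearranged = reverse_segment(identity, 3, 7)  # reverse blocks 4..8
print(rearranged)  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
```

Applying the same reversal a second time restores the identity permutation, which is why sorting by reversals can be read in either direction.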

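6 Types of Rearrangements. Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6. Translocation: 1 2 3 | 4 5 6 → 1 2 6 | 4 5 3. Fusion: two chromosomes carrying 1 2 3 4 5 6 are joined into one. Fission: one chromosome carrying 1 2 3 4 5 6 is split into two.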

7 Sorting by reversals: 5 steps

8 Sorting by reversals: 4 steps

9 What is the reversal distance for this permutation? Can it be sorted in 3 steps?

10 From Signed to Unsigned Permutation (Continued). 0 5 6 10 9 15 16 12 11 7 8 14 13 17 18 3 4 1 2 19 20 22 21 23. Construct the breakpoint graph as usual. Notice the alternating cycles in the graph between every other vertex pair. Since these cycles came from the same signed vertex, we will not be performing any reversal on both pairs at the same time; therefore, these cycles can be removed from the graph.
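The slide does not spell out the mapping behind this sequence. The sketch below assumes the usual convention, in which a signed element +x becomes the pair (2x-1, 2x), -x becomes (2x, 2x-1), and the result is framed by 0 and 2n+1. Under that assumption the slide's sequence decodes to the signed permutation +3 -5 +8 -6 +4 -7 +9 +2 +1 +10 -11. The function names are illustrative.

```python
def signed_to_unsigned(signed):
    """Expand each signed element into an ordered pair and add the frame 0, 2n+1."""
    out = [0]
    for gene in signed:
        low, high = 2 * abs(gene) - 1, 2 * abs(gene)
        out.extend([low, high] if gene > 0 else [high, low])
    out.append(2 * len(signed) + 1)
    return out


def unsigned_to_signed(unsigned):
    """Inverse mapping: read the framed sequence back pair by pair."""
    inner = unsigned[1:-1]
    signed = []
    for k in range(0, len(inner), 2):
        first, second = inner[k], inner[k + 1]
        # An increasing pair encodes a positive element, a decreasing pair a negative one.
        signed.append(second // 2 if first < second else -(first // 2))
    return signed


slide_sequence = [0, 5, 6, 10, 9, 15, 16, 12, 11, 7, 8, 14, 13,
                  17, 18, 3, 4, 1, 2, 19, 20, 22, 21, 23]
print(unsigned_to_signed(slide_sequence))
```

The round trip `signed_to_unsigned(unsigned_to_signed(slide_sequence))` returns the slide's sequence unchanged, which is a quick consistency check on the assumed convention.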

11 Reversal Distance with Hurdles. Hurdles are obstacles in the genome rearrangement problem. They increase the number of reversals required to transform a permutation into the identity permutation. Let h(π) be the number of hurdles in permutation π and c(π) the number of cycles in its breakpoint graph. Taking hurdles into account, the following formula gives a tighter bound on reversal distance: d(π) ≥ n + 1 − c(π) + h(π).
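The bound can be evaluated once c(π) is known. The sketch below counts alternating cycles in the breakpoint graph of a signed permutation, assuming the usual construction: each +x is expanded to (2x-1, 2x) and each -x to (2x, 2x-1), black edges join neighbouring positions (0,1), (2,3), ... of the framed sequence, and gray edges join the values 2k and 2k+1. Hurdle detection is not implemented here; `hurdles` is a plain argument the caller must supply, and all function names are illustrative.

```python
def to_unsigned(signed):
    """Map +x to (2x-1, 2x) and -x to (2x, 2x-1), framed by 0 and 2n+1."""
    out = [0]
    for gene in signed:
        low, high = 2 * abs(gene) - 1, 2 * abs(gene)
        out += [low, high] if gene > 0 else [high, low]
    out.append(2 * len(signed) + 1)
    return out


def count_cycles(signed):
    """Count alternating black/gray cycles, c(pi), in the breakpoint graph."""
    p = to_unsigned(signed)
    black = {}
    for i in range(0, len(p), 2):      # black edges: adjacent positions (2i, 2i+1)
        black[p[i]], black[p[i + 1]] = p[i + 1], p[i]

    def gray(v):                       # gray edges: values 2k and 2k+1
        return v + 1 if v % 2 == 0 else v - 1

    seen, cycles = set(), 0
    for start in p:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:           # follow black, then gray, until the cycle closes
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray(w)
    return cycles


def reversal_lower_bound(signed, hurdles=0):
    """n + 1 - c(pi) + h(pi), with h(pi) supplied by the caller."""
    return len(signed) + 1 - count_cycles(signed) + hurdles


print(reversal_lower_bound([1, 2, 3, -8, -7, -6, -5, -4, 9, 10]))  # 1
```

For the permutation from slide 5 the bound is 1, which matches the single reversal that produced it.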

12 Median Problem. Goal: find M so that D_AM + D_BM + D_CM is minimized. NP-hard for most metric distances.

13 Genome Enumeration for Multichromosome Genomes. Genome enumeration for genomes on genes {1,2,3} (figure).

14 Rearrangement Phylogeny

15 Compute A Given Tree (Start)

16 Compute A Given Tree (First Median)

17 Compute A Given Tree (Second Median)

18 Compute A Given Tree (Third Median)

19 Compute A Given Tree (After 1st Iteration)

20 Binary Encoding

21 MLBE Sequences

22 Experimental Results (Equal Content) 80% inversion, 20% transposition

23 An Example: New Genomes. Original genomes: 1 2 3 4 5 6 7 8 9 10; 1 -4 5 2 8 10 9 -7 -6 3; … New genomes (even-numbered genes removed): 1 3 5 7 9; 1 5 9 -7 3; …
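The new genomes in this example are what remains after deleting genes 2, 4, 6, 8 and 10 from every genome while keeping order and sign. A minimal sketch of that step follows. It assumes that all genomes share the same gene content and that the jackknifing rate is the fraction of genes removed; the slides do not state either point explicitly, and the function names are illustrative.

```python
import random


def restrict_genome(genome, kept):
    """Keep only genes whose label (ignoring sign) is in `kept`, preserving order and sign."""
    return [gene for gene in genome if abs(gene) in kept]


def jackknife(genomes, rate, rng=random):
    """Remove the same randomly chosen fraction `rate` of genes from every genome."""
    genes = sorted({abs(gene) for gene in genomes[0]})
    n_keep = len(genes) - round(rate * len(genes))
    kept = set(rng.sample(genes, n_keep))
    return [restrict_genome(genome, kept) for genome in genomes]


genomes = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, -4, 5, 2, 8, 10, 9, -7, -6, 3],
]
# Reproduce the slide: keep only the odd-numbered genes.
print([restrict_genome(g, {1, 3, 5, 7, 9}) for g in genomes])
# One random replicate at a 40% rate (seeded so the run is repeatable).
print(jackknife(genomes, 0.4, random.Random(0)))
```

In a jackknife analysis this step would be repeated many times, with a tree rebuilt from each reduced data set and branch support taken from how often each branch recurs.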

24 Jackknifing Rate

25 Support Value Threshold - FP. Up to 90% of false positives (FP) can be identified with 85% as the threshold.

26 Jackknife Properties. Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified. A 40% jackknifing rate is reasonable. 85% is a conservative threshold; 75% can also be used. Low-support branches should be examined in detail.

27 Protein

28 In-silico Biochemistry. Online servers exist to determine many properties of your protein sequences: molecular weight, extinction coefficients, half-life. It is also possible to simulate protease digestion. All these analysis programs are available on www.expasy.ch.

29 Analyzing Local Properties. Many local properties are important for the function of your protein. Hydrophobic regions are potential transmembrane domains. Coiled-coil regions are potential protein-interaction domains. Hydrophilic stretches are potential loops. You can discover these regions using sliding-window techniques (easy) or using prediction methods such as hidden Markov models (more sophisticated).

30 Sliding-window Techniques. Ideal for identifying strong signals. Very simple methods. Few artifacts. Not very sensitive. Use ProtScale on www.expasy.org. Make the window the same size as the feature you’re looking for.
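A sliding-window profile is simple enough to compute by hand. The sketch below averages a per-residue score over an odd-sized window centred on each position. It uses the Kyte-Doolittle hydropathy scale as an example; that choice, the default window of 19 residues (roughly the length of a membrane-spanning helix) and the function name are assumptions for illustration, and this is not ProtScale's own code.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(sequence, window=19, scale=KYTE_DOOLITTLE):
    """Return (1-based centre position, mean score) for every full window."""
    if window % 2 == 0 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    scores = [scale[residue] for residue in sequence]
    profile = []
    for centre in range(half, len(sequence) - half):
        chunk = scores[centre - half:centre + half + 1]
        profile.append((centre + 1, sum(chunk) / window))
    return profile


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "KKDEKR" + "LIVALLIVAGLLIVALLIV" + "KRDEEK"
peak_position, peak_score = max(sliding_window_profile(toy), key=lambda item: item[1])
print(peak_position, round(peak_score, 2))
```

A window much shorter than the feature gives a noisy profile, and one much longer dilutes the signal, which is the reason for matching the window to the feature size.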

31 www.expasy.org/cgi-bin/protscale.pl

32

33

34 Hphob. / Eisenberg

35 Transmembrane Domains. Discovering a transmembrane domain tells you a lot about your protein. Many important receptors have 7 transmembrane domains. Transmembrane segments can be found using ProtScale. The most accurate predictions come from using TMHMM.

36 Using TMHMM. TMHMM is the best method for predicting transmembrane domains. TMHMM uses an HMM. Its principle is very different from that of ProtScale. TMHMM output is a prediction.

37 TMHMM vs. ProtScale

38 >sp|P78588|FREL_CANAX Probable ferric reductase transmembrane component OS=Candida albicans GN=CFL1 PE=3 SV=1 MTESKFHAKYDKIQAEFKTNGTEYAKMTTKSSSGSKTSTSASKSSKSTGSSNASKSSTNA HGSNSSTSSTSSSSSKSGKGNSGTSTTETITTPLLIDYKKFTPYKDAYQMSNNNFNLSIN YGSGLLGYWAGILAIAIFANMIKKMFPSLTNNLSGSISNLFRKHLFLPATFRKKKAQEFS IGVYGFFDGLIPTRLETIIVVIFVVLTGLFSALHIHHVKDNPQYATKNAELGHLIADRTG ILGTFLIPLLILFGGRNNFLQWLTGWDFATFIMYHRWISRVDVLLIIVHAITFSVSDKAT GKYKNRMKRDFMIWGTVSTICGGFILFQAMLFFRRKCYEVFFLIHIVLVVFFVVGGYYHL ESQGYGDFMWAAIAVWAFDRVVRLGRIFFFGARKATVSIKGDDTLKIEVPKPKYWKSVAG GHAFIHFLKPTLFLQSHPFTFTTTESNDKIVLYAKIKNGITSNIAKYLSPLPGNTATIRV LVEGPYGEPSSAGRNCKNVVFVAGGNGIPGIYSECVDLAKKSKNQSIKLIWIIRHWKSLS WFTEELEYLKKTNVQSTIYVTQPQDCSGLECFEHDVSFEKKSDEKDSVESSQYSLISNIK QGLSHVEFIEGRPDISTQVEQEVKQADGAIGFVTCGHPAMVDELRFAVTQNLNVSKHRVE YHEQLQTWA Search with Accession number P78588 http://www.uniprot.org/uniprot/

39 www.cbs.dtu.dk/services/TMHMM-2.0

40

41 Predicting Post-translational Modifications. Post-translational modifications often occur on similar motifs in different proteins. PROSITE is a database containing a list of known motifs, each associated with a function or a post-translational modification. You can search PROSITE by looking for each motif it contains in your protein (the server does that for you!). PROSITE entries come with extensive documentation on each function of the motif.

42 Searching for PROSITE Patterns. Search your protein against PROSITE on ExPASy: www.expasy.org/tools/scanprosite. PROSITE motifs are written as patterns. Short patterns are not very informative by themselves. They only indicate a possibility. Combine them with other information to draw a conclusion. Remember: not everything is in PROSITE!
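To see what "written as patterns" means in practice, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors outside brackets). It is an illustration, not the ScanProsite implementation; the function names and the toy sequence are hypothetical. The N-glycosylation pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}.' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low and high:
            repeat = "{" + low + "," + high + "}"
        elif low:
            repeat = "{" + low + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlapping hits included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_motifs("N-{P}-[ST]-{P}.", "AANGSAANPSA"))
```

A four-residue pattern like this matches by chance in many proteins, which is the slide's point: a hit is only a possibility until other evidence supports it.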

43 www.expasy.org/tools/scanprosite P12259

44 www.expasy.org/tools/scanprosite

45 Interpreting PROSITE Patterns. Check the pattern function: is it compatible with the protein? Sometimes patterns suggest nonexistent protein features. For instance, if you find a myristoylation pattern in a prokaryote, ignore it; prokaryotic proteins have no myristoylation! Short patterns are more informative if they are conserved across homologous sequences. In that case, you can build a multiple-sequence alignment. This slide shows an example.

46 Patterns and Domains. Patterns are usually the most striking feature of the more general motifs (called domains). Domains are less conserved than patterns but usually longer. In proteins, domain analysis is gradually replacing pattern analysis.

47 Protein Domains. Proteins are usually made of domains. A domain is an autonomous folding unit. Domains are more than 50 amino acids long. It’s common to find these together: a regulatory domain, a binding domain, a catalytic domain.

48 Discovering Domains. Researchers discover domains by comparing proteins that have similar functions, aligning those proteins, and identifying conserved segments. A domain is a multiple-sequence alignment formulated as a profile. For each column, a domain indicates which amino acid is more likely to occur.
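The idea of a profile can be made concrete with a toy alignment. The sketch below records, for each alignment column, the relative frequency of every residue seen there. The alignment and function names are hypothetical, and real profile collections use more elaborate models (pseudocounts, log-odds scores, hidden Markov models) than plain frequencies.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per column, ignoring gap characters ('-')."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def consensus(profile):
    """Most frequent residue per column ('-' where a column holds only gaps)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)


toy_alignment = ["ACDE", "ACDE", "AC-E", "TCDE"]
profile = build_profile(toy_alignment)
print(profile[0])          # column 1: mostly A, sometimes T
print(consensus(profile))  # ACDE
```

Columns with a single dominant residue, such as the C in column 2 here, are the conserved positions that later slides check when asking whether a match also implies function.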

49 Domain Collections. Scientists have been discovering and characterizing protein domains for more than 20 years. 8 collections of domains have been established. Manual collections are very precise but small. Automatic collections are very extensive but less informative. These collections overlap, have been assembled by different scientists, and have different strengths and weaknesses. We recommend using them all!

50 The Magnificent 8. Pfam is the most extensive manual collection. Pfam is often used as a reference.

51 Searching Domain Collections. Domains in Pfam often include known functions. A match between your protein and a domain is desirable. A match is a potential indication of a function. This is VERY informative for further research! Three servers exist to compare proteins and domain collections: InterProScan (www.ebi.ac.uk/interproscan), CD-Search (Conserved Domain; www.ncbi.nlm.nih.gov), and Motif Scan (www.ch.embnet.org).

52 Using InterProScan. InterProScan is the most comprehensive search engine for domain databases. It makes it possible to compare alternative results on most collections. It does not provide a statistical score.

53 >sp|P53539|FOSB_HUMAN Protein fosB OS=Homo sapiens GN=FOSB PE=1 SV=1 MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL

54 www.ebi.ac.uk/InterProScan

55

56 The CD-Search Output. The CD-Search output is less extensive than that of InterProScan. Results come with a statistical evaluation (E-value). 10e-15: low E-value, good match. 2.1: high E-value, bad match.

57 www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

58

59

60 Predicting Functions with Domains. Finding a match with a domain having a catalytic function is good news... but what, exactly, does it mean? A match indicates that your sequence has the domain structure... but does it also have the function? You cannot say before looking into these details: Where are the catalytic residues on the domain? Does your sequence have the right residues at these positions?

61 Looking into the Details. Catalytic residues are normally highly conserved in domains. Motif Scan makes it possible to check whether these important residues are conserved in your sequence. High bar above 0 = highly conserved residues. Green = your sequence has an expected residue. Red = your sequence has an unexpected residue.

62 Looking into the Details (cont’d.). R (arginine) is highly expected at this position: high bar, potential active site. If your protein has an arginine at this position, the bar is filled with green: your protein could be active.

63 myhits.isb-sib.ch/cgi-bin/motif_scan

64 Protein 3D Structure

65 Primary, Secondary and Tertiary Structures. Proteins are made of 20 amino acids. Proteins are on average 400 amino acids long. Protein structure has 3 levels: the primary structure is the sequence of a protein; the secondary structure is the local structure; the tertiary structure is the exact position of each atom on a 3D model.

66 Secondary Structures. Helix: the amino-acid chain twists like a spring. Beta strand or extended: the amino-acid chain forms a line without twisting. Random coil: the amino-acid chain has a structure neither helical nor extended. Amino-acid loops are usually coils.

67 Guessing the Secondary Structure of Your Protein. Secondary structure predictions are good. If your protein has enough homologues, expect 80% accuracy. The most accurate secondary structure prediction server is PSIPRED.

68 PSIPRED Output. Conf = confidence: 9 is the best, 0 the worst. Pred = every amino acid is assigned a letter: C for coil, E for extended or beta-strand, H for helix.
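Given a Pred string and its matching Conf string, the per-residue letters can be collapsed into segments with an average confidence each. This is a minimal sketch: the two strings below are made up for illustration and are not real PSIPRED output, and the function names are hypothetical.

```python
from itertools import groupby


def segments(pred):
    """Collapse a per-residue prediction string into (state, start, end) runs, 1-based."""
    runs, position = [], 1
    for state, run in groupby(pred):
        length = len(list(run))
        runs.append((state, position, position + length - 1))
        position += length
    return runs


def mean_confidence(conf, start, end):
    """Average of the confidence digits (0-9) over residues start..end, 1-based inclusive."""
    digits = [int(c) for c in conf[start - 1:end]]
    return sum(digits) / len(digits)


pred = "CCHHHHEEC"   # hypothetical Pred line
conf = "998877665"   # hypothetical Conf line, one digit per residue
for state, start, end in segments(pred):
    print(state, start, end, mean_confidence(conf, start, end))
```

Segments with a low average confidence are the ones to treat with most caution when interpreting the prediction.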

69 >gi|15892329|ref|NP_360043.1| translocation protein TolB [Rickettsia conorii str. Malish 7] MRNIIYFILSLLFSVTSYALETINIEHGRADPTPIAVNKFDADNSAADVLGHDMVKVISNDLKLSGLFRP ISAASFIEEKTGIEYKPLFAAWRQINASLLVNGEVKKLESGKFKVSFILWDTLLEKQLAGEMLEVPKNLW RRAAHKIADKIYEKITGDAGYFDTKIVYVSESSSLPKIKRIALMDYDGANNKYLTNGKSLVLTPRFARSA DKIFYVSYATKRRVLVYEKDLKTGKESVVGDFPGISFAPRFSPDGRKAVMSIAKNGSTHIYEIDLATKQL HKLTDGFGINTSPSYSPDGKKIVYNSDRNGVPQLYIMNSDGSDVQRISFGGGSYAAPSWSPRGDYIAFTK ITKGDGGKTFNIGIMKACPQDDENSERIITSGYLVESPCWSPNGRVIMFAKGWPSSAKAPGKNKIFAIDL TGHNEREIMTPADASDPEWSGVLN

70 bioinf.cs.ucl.ac.uk/psipred//?program=psipred

71

72

73

74 Predicting Other Secondary Features. It is also possible to predict these accurately: transmembrane segments, solvent accessibility, globularity, coiled-coil regions. All these predictions have an expected accuracy higher than 70%.

75 Servers www.predictprotein.org cubic.bioc.columbia.edu/predictprotein www.sdsc.edu/predicprotein www.cbi.pku.edu.cn/predictprotein

76 Predicting 3D Structures. Predicting 3D structures from sequences only is almost impossible. The only reliable way to establish the 3D structure of a protein is a real-world experiment: X-ray crystallography or nuclear magnetic resonance (NMR). Structures established this way are stored in the PDB database. “The PDB of my protein” is synonymous with “The structure of my protein”.

77 Retrieving Protein Structures from PDB. All PDB entries have 4-character identifiers: 1CRZ, 2BHL... Sometimes the chain identifier is added: 1CRZA, 1CRZB... To access all PDB entries, go to www.rcsb.org. PDB contains 42,000 entries. PDB contains the structure of 16,000 unique proteins or RNAs. You can download the coordinates and display the structure.
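Codes such as 1CRZA combine an entry identifier with a chain label, and it is often convenient to split them before looking up an entry. The sketch below assumes the classic format: a digit followed by three alphanumeric characters, with an optional single-character chain label appended. The function name is hypothetical and the code does not contact any server.

```python
import re

# Classic PDB identifier: one digit + three alphanumerics, optionally followed by a chain label.
_PDB_CODE = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9]?)$")


def parse_pdb_code(code):
    """Split '1CRZA' into ('1CRZ', 'A'); return (id, None) when no chain is given."""
    match = _PDB_CODE.match(code.strip())
    if match is None:
        raise ValueError(f"not a 4-character PDB code (with optional chain): {code!r}")
    entry_id, chain = match.group(1).upper(), match.group(2)
    return entry_id, (chain or None)


for code in ["1CRZ", "1CRZA", "2BHL"]:
    print(code, parse_pdb_code(code))
```

The chain label is left in its original case, because chain labels within an entry can differ only by case.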

78 www.rcsb.org

79

80 Displaying a PDB Structure. You can use any of the online viewers to display the structure. They will let you rotate the structure, zoom in and out, or color it. PDB files themselves are not human-readable.

81 Predicting the Structure of Your Protein. The bad news: it is very hard to predict protein 3D structures. The good news: similar proteins have similar structures. If your favorite protein has a homologue with a known structure, you can do homology modeling. How? Start with a BLAST (more about that in the next slide).

82 ncbi.nlm.nih.gov/BLAST

83

84 BLASTing PDB for Structures. BLAST your protein against PDB. If you get a very good hit, it means PDB contains a protein similar to yours. Your protein and this hit probably have the same structure.

85 Be Careful! Sometimes only one of the domains contained in your protein has been characterized. If that’s the case, the PDB will only contain this domain. Always check the alignments. Red line = full protein in PDB. Blue line = one domain only in this entry.

86 Structures and Sequences. Highly conserved sequences are often important in the structure. Make a multiple-sequence alignment to identify these important positions. Highly conserved positions are either in the core or important for protein/protein interactions.

87 3D Predictions. If you want to predict the structure of your protein automatically, try Swiss Model. Swiss Model makes the BLAST for you. The program does a bit of homology modeling. The process delivers a new PDB entry. You can access it at swissmodel.expasy.org. Swiss Model gives good results for proteins having homologues in PDB.

88 zhanglab.ccmb.med.umich.edu/I-TASSER/

89

90 3D-BLAST. Use this technique if you have a structure and you want to find other similar structures. Use VAST or DALI to look for proteins having the same 3D shape as yours: www.eb.ac.uk/dali, www.ncbi.nlm.nih/vast.

91 3D Movements. Most proteins need to move to do their job. Predicting protein movement is possible using molecular dynamics. Check out this site: molmolvdb.mbb.yale.edu. Good molecular dynamics requires extremely powerful computers. Don’t expect miracles from standard online resources.

