Power and weakness of data Power: data + software + bioinformatician = answer. Weakness: Data errors. Data poorly understood. Poor software. Never enough.


1 Power and weakness of data Power: data + software + bioinformatician = answer. Weakness: Data errors. Data poorly understood. Poor software. Never enough data. Few bioinformaticians available.

2 Laerte about structures: “Use the Force, Luke” sequence, Gert

3 Signals in Sequences The number of sequences available for analysis rapidly approaches infinity. We need new ways to look at all this information.

4 The First Law: First law of sequence analysis: A conserved residue is important.

5 With thousands of aligned sequences: Second law of sequence analysis: A very conserved residue is very important.

6 Signals in sequences: Conserved, CMA, variable
QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW
Black = conserved. White = variable. Green = correlated mutations (CMA).
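The three signal types in this toy alignment can be picked out with a few lines of Python. This is a minimal sketch, not the method behind the slides: the names `pattern` and `classify_columns` are hypothetical, and the rule "non-conserved columns that share the same substitution pattern count as correlated" is an assumption made for illustration. Real correlated mutation analysis needs many more sequences and a statistical score.

```python
from collections import Counter

ALIGNMENT = [
    "QWERTYASDFGRGH",
    "QWERTYASDTHRPM",
    "QWERTNMKDFGRKC",
    "QWERTNMKDTHRVW",
]

def pattern(column):
    """Map a column to its grouping pattern, e.g. 'YYNN' -> (0, 0, 1, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(res, len(first_seen)) for res in column)

def classify_columns(alignment):
    """Return one letter per column: C = conserved, M = correlated, V = variable."""
    columns = ["".join(col) for col in zip(*alignment)]
    patterns = [pattern(col) for col in columns]
    counts = Counter(patterns)
    n_seqs = len(alignment)
    labels = []
    for col, pat in zip(columns, patterns):
        n_types = len(set(col))
        if n_types == 1:
            labels.append("C")   # one residue type in every sequence
        elif n_types < n_seqs and counts[pat] > 1:
            labels.append("M")   # same substitution pattern seen in another column
        else:
            labels.append("V")   # no shared pattern: plain variability
    return "".join(labels)

print(classify_columns(ALIGNMENT))
```

Columns where every sequence has a different residue are deliberately excluded from the correlated class, because with only four sequences such a pattern carries no grouping information.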

7 Sequence Signals Three types of information from multiple sequence alignments: 1) Conservation 2) Correlation 3) Variability

8 Artefacts Wrong sequence signals can result from: Not enough sequences Too conserved sequences Too variable sequences Over-alignment Over-interpretation

9 Recalcitrant residues

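10 Sequence Entropy: $E = -\sum_{i=1}^{20} p_i \ln(p_i)$, where $p_i$ is the fraction of sequences with residue type $i$ at the alignment position.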
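Per-position sequence entropy in its usual Shannon form (with a leading minus sign, so values are non-negative) can be sketched as follows. The function name `column_entropy` is hypothetical, and residue types that do not occur are simply skipped, which matches the convention that 0 ln(0) = 0.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log) of the residue frequencies in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    # Sum over observed residue types only; absent types contribute nothing.
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Columns taken from a four-sequence toy alignment.
for col in ("QQQQ", "YYNN", "GPKV"):
    print(col, round(abs(column_entropy(col)), 3))
```

A fully conserved column has entropy 0. With all 20 residue types equally frequent the entropy reaches its maximum, ln(20), which is about 3.0.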

11 Sequence Variability Sequence variability is the number of residue types that are present in more than 0.5% of the sequences.
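The variability definition translates directly into a counting rule. The sketch below is illustrative only: the function name `column_variability` and the choice to ignore gap characters are assumptions, since the slide does not say how gaps are treated.

```python
from collections import Counter

def column_variability(column, min_fraction=0.005, gap_chars="-."):
    """Count residue types present in more than min_fraction of the sequences."""
    residues = [r for r in column if r not in gap_chars]  # assumption: gaps ignored
    if not residues:
        return 0
    counts = Counter(residues)
    total = len(residues)
    return sum(1 for n in counts.values() if n / total > min_fraction)

# Hypothetical column from 1000 aligned sequences: 990 Ala, 6 Cys, 4 Asp.
column = "A" * 990 + "C" * 6 + "D" * 4
print(column_variability(column))
```

In this made-up column, Cys (0.6%) passes the 0.5% threshold and Asp (0.4%) does not, so rare residue types that may be sequencing or alignment errors do not inflate the count.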

12 Entropy - Variability Evolution = try everything (and keep what works well) Variability = Chaos (try everything) Entropy = Information (keep what works well)

13 Entropy - Variability Variability is the result of DNA trying everything. Entropy is the protein’s brake on evolutionary speed.

14 Ras Entropy - Variability 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

15 Ras Location 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

16 Protease Entropy - Variability 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

17 Protease Location 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

18 Globin Entropy - Variability GPCR 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

19 Globin Location 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

20 And now for drug design: GPCR 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

21 GPCRs: (Membrane facing amino acids left out) 11 Red 12 Orange 22 Yellow 23 Green 33 Blue

22 Summary Given many sequences: Every residue’s role known. Signaling paths detectable. Two step evolutionary model: First main site, soon after modulator site.

23 Beyond the summary Sequence -> structure -> function is wrong. It should be: Structure -> sequence -> function. And, because active sites are at the surface, conserved residues are at or near the surface.

24 Beyond the summary Why do all TIM-barrel enzymes have the functional residues at the C-terminal side of the strands?

25 Beyond the summary
11 Red: Main site
12 Orange: Around main site
22 Yellow: Core
23 Green: Modulator
Figure labels: Up to 18 residue types, up to 14 residue types, up to 8 residue types, up to 4 residue types; 11, 12, 22, 23, 33.

26 The weakness of data Data errors. Poor software. Data poorly understood. Never enough data. Few bioinformaticians around.

27 The weakness of data Rob Hooft WHAT_CHECK www.cmbi.kun.nl/gv/servers/ www.cmbi.kun.nl/gv/pdbreport/

28 Structure validation Everything that can go wrong, will go wrong, especially with things as complicated as protein structures.

29 Why? Why does a sane (?) human being spend fourteen years searching for twelve million errors in the PDB?

30 Because: All we know about proteins is derived from PDB files. If a template is wrong the model will be wrong. Errors become smaller when you know about them.

31 What do we check? Administrative errors. Crystal-specific errors. NMR-specific errors. Really wrong things. Improbable things. Things worth looking at. Ad hoc things.

32 Error detection Detecting errors is one thing; fixing them is another. We try not to say that the structure is wrong, but to say what is wrong with the structure, and to give hints on how to fix things.

33 How difficult can it be?

34

35 Your best check:

36 Planarity

37 Little things hurt big

38 Improbable things

39 How wrong is wrong?

40 Our errors Four sigma: 12,000 false positives. Administrative errors misunderstood. Improbable is not wrong. Poor data makes errors unavoidable. Bugs.
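Why a four-sigma cutoff still produces many false positives can be seen with a short calculation. This is a sketch under stated assumptions, not the derivation behind the slide's figure: it assumes that correct values are normally distributed, that deviations in both directions are flagged, and the number of checked values used below is hypothetical.

```python
import math

def two_sided_tail(n_sigma):
    """Probability that a normally distributed value lies beyond +/- n_sigma."""
    return math.erfc(n_sigma / math.sqrt(2))

def expected_false_positives(n_values, n_sigma=4.0):
    """Expected number of correct values flagged purely by chance."""
    return n_values * two_sided_tail(n_sigma)

p = two_sided_tail(4.0)
print(f"chance of a correct value beyond 4 sigma: {p:.3e}")

# Hypothetical database size: one million checked values.
print(f"expected false positives per million values: {expected_false_positives(1_000_000):.1f}")
```

The per-value probability is small, roughly 6 in 100,000, but it is multiplied by every value checked, so a database-wide scan will always flag some perfectly correct values.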

41 Contact Probability

42

43 DACA

44 DACA

45 DACA

46 DACA

47 DACA

48 Contact probability box

49 Using contact probability

50 His, Asn, Gln ‘flips’

51 Where are the protons?

52 Hydrogen bond network

53 Hydrogen bond force field

54

55 15% should be flipped

56 Summary Everything that could go wrong has gone wrong. Errors are on a ‘sliding scale’. Error detection can detect a lot, but surely not everything (yet).

57 Beyond the summary, For Drug Design: Forget: High throughput. Forget: Docking. Forget: Structure in absence of many, many sequences. First gather and digest all experimental data.

58 Beyond the summary, For Drug Design: First know your enemy, then defeat it.

59 Thanks to: Laerte Oliveira (Sao Paulo), Florence Horn (San Francisco), Rob Hooft (Delft), Wilma Kuipers (Weesp), Bob Bywater (Copenhagen), Nora vd Wenden (The Hague), Mike Singer (Boston), Ad IJzerman (Leiden), Margot Beukers (Leiden), Amos Bairoch (Geneva), Fabien Campagne (San Diego)

