Published by Audra Octavia Pitts. Modified over 9 years ago.
1
EMBL-EBI Visualization & Data mining
2
Visualisation: the process of representing abstract data to aid in understanding the meaning of the data. Not to be confused with rendering data (drawing pictures). Typically, though, we render data in such a way as to visualise the information within that data.
3
Introduction. Biological data comes from, and is of interest to: Chemists: reaction mechanism, drug design. Biologists: sequence, expression, homology, function. Structural biologists: atomic structure, fold, classification, function. Medicine: clinical effect. Education. Media. Presentation of diverse information to a diverse audience: each has their own point of view (context). Expert = scientist working within their own field of expertise. Non-expert = scientist using data/information outside their field. Novice = non-scientist.
4
Web pages. These are notoriously badly designed, often leaving the information on the site unusable. The front page should load quickly. The main point should appear on the first full screen. Clutter: not logically laid out. Too busy: cannot find the salient point. 8% of men and 0.5% of women are colour blind. Bad text/fonts. Too often it doesn’t work: the user will go somewhere else. The latest whiz-bang stuff only works on the latest browsers. Only works in one browser because it was only tested on one; does not conform to standard HTML. Not just presentation of results. Google is a good design.
5
Asking questions. Biological data is very complex: chemistry, biology, physics, statistics, medicine… Most users will be from a different field. Asking the right question is difficult. The user cannot use the correct terminology. Too many things to query (2000 attributes in MSD). SQL: not suitable for most users. Interface too complex: too many check boxes, widgets, etc. Trying to be too clever. The “Go” button is buried somewhere.
6
Result presentation. Biological data is complex: chemistry, physics, biology, statistics, medicine… Expert users want all the detail, i.e. they want to use a specific method. They want (I hope) the statistical validity of the results. The non-expert wants the best-practice answer returned within their own context. They want comparative analysis with other fields. They want to know the results are valid.
7
Query design. Suitable for text queries. Only one logic (AND or OR). Predefined. Easy to use. Limited scope: 2000 attributes -> 2000 check-boxes! The simple text box design is very common.
8
Query design. Graphical interface. Multiple logic: AND/OR/NOT. Under the user’s control. Slower. Steep learning curve: some users just cannot get it. Intuitive once mastered. Pretty.
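As a rough illustration of "multiple logic under the user's control", AND/OR/NOT can be treated as small composable predicates. This is only a sketch: the helper names (`contains`, `all_of`, `any_of`, `negate`, `run_query`) and the records are made up for the example and are not part of any MSD interface.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def contains(field: str, text: str) -> Predicate:
    """Match records whose `field` contains `text` (case-insensitive)."""
    needle = text.lower()
    return lambda record: needle in record.get(field, "").lower()


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over any number of predicates."""
    return lambda record: all(p(record) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over any number of predicates."""
    return lambda record: any(p(record) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda record: not pred(record)


def run_query(records: Iterable[Record], pred: Predicate) -> List[Record]:
    """Return the records that satisfy the predicate, in input order."""
    return [r for r in records if pred(r)]


# Made-up records, for illustration only.
RECORDS = [
    {"id": "r1", "name": "lysozyme", "method": "X-ray"},
    {"id": "r2", "name": "lysozyme", "method": "NMR"},
    {"id": "r3", "name": "trypsin", "method": "X-ray"},
]

# (name is lysozyme OR trypsin) AND NOT (method is NMR)
query = all_of(
    any_of(contains("name", "lysozyme"), contains("name", "trypsin")),
    negate(contains("method", "NMR")),
)
hits = [r["id"] for r in run_query(RECORDS, query)]
```

A single text box corresponds to one `contains` call joined by one fixed operator; the graphical interface lets the user nest these freely, which is where the learning curve comes from.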
9
Query design. HIS|SER:S/H>C2.0 HIS.ne2:S/S>C2.0 HIS.[n]/T>C2.0 Figurative 2D sketch for 3D query (active sites). Informative: presents the meaning of the question. Slower. Less error prone.
10
YAMGP (yet another molecular graphics program). Many different programs are available: AstexViewer@MSD-EBI, Quanta, Rasmol, MolMol, Chime, O, Spock, Swiss-PDBviewer, Molscript, iMol, Pymol, Chimera, XtalView, Frodo, Bobscript, InsightII, Raster3D, WebLab-viewer, POVRay, Yasara, LigPlot, WebMol, Grasp, Mage, Whatif, VMD.
11
Result visualisation. Multiple types of biological data: textual data; 3D structure; 2D chemical sketches; 1D sequence; node linked; general/derived data; web pages; time; errors/variance. Patented!
12
Visualisation: AstexViewer@MSD-EBI. Lensing, linked views, brushing, picking, flying views, hyperbolic distortion, animation, solid rendering, depth cues, colour, lighting, highlighting, etc. Structure/sequence/data.
13
Visualisation: comparative analysis. Similarity/difference. Data superposition. Attribute display: colour, size… Correlation. Attribute mapping. Sequence colour by structure alignment. Analysis.
14
Animation. Time-dependent display. Reaction chemistry. Visual clues. Expression data. Shown as: rotation, flash on/off, object synchronisation, size, colour… Sound: NO, incredibly annoying.
15
Multidimensional analysis. Comparative analysis on multiple data, e.g. phi, psi, B-value, omega. 1D and 2D are easy. 3D graphs are difficult to see. 4D requires 3D + iso-surfaces. Higher: too busy. Use 2D + multiple properties: X/Y/colour/size/shape… SPOTFIRE is the best known. Interactive bracketing.
16
Visualization: summary. Rendering data is not visualization. Not just the display of results. Huge array of non-specific techniques, and an entire scientific field!
17
Data mining. “Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.” (Hyperdictionary) “True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.” (IBM)
18
Data mining & data analysis. Traditional analysis is “verification-driven analysis”: requires a hypothesis of the desired information (target); requires correct interpretation of the proposed query. Discovery-driven data mining: finds data with common characteristics; results are ideal solutions to discovery; finds results without a previous hypothesis; results have unbiased mean and variance.
19
So what is hypothesis-driven data analysis? Define a target = hypothesis. Search for the target. There are/are not “hits”. Verify/negate the hypothesis. Distribution is centred on the target. “catalytic triad”: text string matching. Atomic coordinates: coordinate superposition. Mathematical graph: graph matching. HIS, ASP, SER: data hierarchy knowledge.
20
Four types of data mining. Creation of predictive models: future data expectation. Link analysis: connections between data objects. Database segmentation: classification. Deviation detection: finding outliers. IBM: white papers.
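Of the four types, deviation detection is the easiest to show in a few lines. The sketch below uses the modified z-score (median and median absolute deviation), which is one common choice and not something the slides prescribe; the function name `find_outliers`, the threshold of 3.5 and the sample values are assumptions made for the example.

```python
import statistics
from typing import List, Sequence


def find_outliers(values: Sequence[float], threshold: float = 3.5) -> List[float]:
    """Return values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a few extreme values do not distort the scale.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be ranked as deviant.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Made-up measurements with one obvious deviation.
sample = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]
outliers = find_outliers(sample)
```

Note that the function only flags that a value deviates; deciding whether it is an error or a discovery still needs knowledge of the data.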
21
So what is this data mining? Given multiple sets of primary data (dependent variables): characters, numbers, function(numbers)… Find anomalies. Too many: numerical occurrence. Data variation: derivatives, singularities… Correlations and clusters: within primary data; with other data (independent variables). Finds new things! But not what it means!
22
E.g. Retail and financial industries are heavily into DM. A well-known US food supermarket chain found a correlation: babies’ nappies, beer, 5pm on Friday. Wife rings husband: “get some nappies for the weekend”. Husband takes the opportunity to buy some beer! You won’t get funding to test this hypothesis!
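A correlation like nappies and beer is usually expressed as "lift": how much more often two items appear together than they would if bought independently. The sketch below is a minimal version with invented baskets; the function name `lift` and the data are assumptions for illustration, not the retailer's method.

```python
from typing import Iterable, Set


def lift(baskets: Iterable[Set[str]], item_a: str, item_b: str) -> float:
    """Lift of (item_a, item_b): P(a and b) / (P(a) * P(b)).

    A value above 1 means the items co-occur more often than chance.
    Returns 0.0 when either item never occurs or there are no baskets.
    """
    baskets = list(baskets)
    total = len(baskets)
    if total == 0:
        return 0.0
    p_a = sum(item_a in b for b in baskets) / total
    p_b = sum(item_b in b for b in baskets) / total
    p_ab = sum(item_a in b and item_b in b for b in baskets) / total
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_ab / (p_a * p_b)


# Invented Friday-evening baskets.
friday_baskets = [
    {"nappies", "beer"},
    {"nappies", "beer"},
    {"milk"},
    {"bread"},
]
nappies_beer_lift = lift(friday_baskets, "nappies", "beer")
```

The number says the items go together; it says nothing about the phone call that explains why.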
23
Self/cross data mining. Most mining software looks for correlations between dependent variables, e.g. rainfall, temperature, cloud cover: it rains when it is cloudy. Free: http://www.cs.waikato.ac.nz/~ml/ Bioinformatics usually involves anomalies within data objects: sequence clusters (sequence fingerprints); local coordinate clusters (active sites); global coordinate clusters (folds).
24
Data mining: not idiot proof. Date of birth and age will give 100% correlation. Authors of a structure submission will be correlated with authors of the primary citation. “Lysozyme” is the most common fold pattern. 36 spellings of E. coli will mask results. Requires representative sets, and statistically valid ones too! Signal/noise ratio is a problem.
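The spelling problem can be reduced by mapping each name to a canonical form before counting. The sketch below is one simple approach, assuming a hand-maintained alias table; `ALIASES`, `canonical_organism` and `count_organisms` are hypothetical names, and the table here covers only the example.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical alias table: normalised key -> canonical organism name.
ALIASES = {
    "ecoli": "Escherichia coli",
    "escherichiacoli": "Escherichia coli",
}


def canonical_organism(name: str) -> str:
    """Map a free-text organism name to its canonical form if known.

    The key drops case, punctuation and whitespace, so "E.Coli",
    "E. coli" and "e coli" all collapse to "ecoli".
    """
    key = re.sub(r"[^a-z]", "", name.lower())
    return ALIASES.get(key, name.strip())


def count_organisms(names: Iterable[str]) -> Counter:
    """Count entries per canonical organism name."""
    return Counter(canonical_organism(n) for n in names)


counts = count_organisms(["E.Coli", "E. coli", "escherichia coli", "Homo sapiens"])
```

Without the normalisation step the three variants would be counted as three rare organisms rather than one common one.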
25
Discovery-driven data mining of the PDB. Analysis of 3-dimensional coordinates. Defined locally, common patterns of atomic interactions: DB segmentation (active sites and common packing features); link analysis (similarity between different functional groups). Defined globally: DB segmentation (common patterns of super-secondary structure); link analysis (common folds in diverse protein families); outlier detection (unique folds).
26
Issues. Systematic “error” propagates as a solution: 300 lysozyme structures return as a strong solution. Results cannot be found below the noise level: need to characterise the noise level; need to improve the signal/noise ratio (S/N) to see information. Target is not biologically defined: it does not give you the biological answer. Results should reproduce known biology. Can give you new results not previously observed.
27
Data selection. Cannot leave in 300 lysozyme structures! Select by sequence similarity at 70% exact alignment. Different “phase space” to select data: remove structures with resolution < 2.5 Å; remove NMR (different statistics); remove pre-1982, etc. Geometrical analysis criteria to check for outliers, using properties that are NOT target parameters of structure solution.
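The non-sequence filters could be sketched as below. Everything here is illustrative: the `Entry` fields and the entry codes are made up, and the resolution rule is read as "discard structures whose resolution value is poorer (numerically larger) than 2.5 Å", which is an interpretation of the slide rather than something it states. The 70% sequence-similarity selection is not covered.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    code: str                    # made-up identifier, not a real PDB code
    method: str                  # e.g. "X-RAY" or "NMR"
    resolution: Optional[float]  # in angstroms; None when not applicable
    year: int                    # deposition year


def select_entries(
    entries: Iterable[Entry],
    max_resolution: float = 2.5,
    min_year: int = 1982,
) -> List[Entry]:
    """Keep non-NMR entries from `min_year` onwards with resolution
    values no larger than `max_resolution` (assumed reading of the rule)."""
    return [
        e
        for e in entries
        if e.method != "NMR"
        and e.resolution is not None
        and e.resolution <= max_resolution
        and e.year >= min_year
    ]


# Invented entries, one per rejection reason plus one that passes.
candidates = [
    Entry("ex01", "X-RAY", 1.8, 1995),  # kept
    Entry("ex02", "NMR", None, 1998),   # removed: NMR
    Entry("ex03", "X-RAY", 3.1, 1999),  # removed: resolution
    Entry("ex04", "X-RAY", 2.0, 1979),  # removed: pre-1982
]
kept = [e.code for e in select_entries(candidates)]
```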
28
Local atomic interactions. Data: function(3D coordinates) = distance; atom names (independent variable); residue names (independent variable). Create a 3D hash table of triplets of distances(*) between “points”. This is the dependent variable. Order = 3.
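One way to picture the 3D hash table is to key every triplet of points by its three pairwise distances, rounded into bins, so that geometrically similar triplets land in the same bucket. This is a sketch under assumptions, not the original implementation: the bin width, the distance cutoff, the sorted key and the name `triplet_table` are all choices made for the example, and atom/residue names are carried along but not used in the key.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Point = Tuple[str, Tuple[float, float, float]]  # (label, (x, y, z))
Key = Tuple[int, int, int]


def triplet_table(
    points: Sequence[Point],
    bin_width: float = 0.5,
    cutoff: float = 6.0,
) -> Dict[Key, List[Tuple[int, int, int]]]:
    """Hash every triplet of points by its three binned pairwise distances.

    Triplets with any distance above `cutoff` are skipped. The three bin
    indices are sorted so the key does not depend on point order.
    """
    table: Dict[Key, List[Tuple[int, int, int]]] = defaultdict(list)
    for i, j, k in combinations(range(len(points)), 3):
        p, q, r = points[i][1], points[j][1], points[k][1]
        dists = (math.dist(p, q), math.dist(p, r), math.dist(q, r))
        if max(dists) > cutoff:
            continue
        b0, b1, b2 = sorted(int(d // bin_width) for d in dists)
        table[(b0, b1, b2)].append((i, j, k))
    return dict(table)


# Two copies of the same made-up triangle, far apart from each other.
example_points = [
    ("A", (0.0, 0.0, 0.0)), ("B", (3.2, 0.0, 0.0)), ("C", (0.0, 4.2, 0.0)),
    ("A", (100.0, 0.0, 0.0)), ("B", (103.2, 0.0, 0.0)), ("C", (100.0, 4.2, 0.0)),
]
table = triplet_table(example_points)
```

Buckets holding many triplets from different structures are the candidate common interactions of order 3.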
29
Local atomic interactions. Merge triplets: any pair of N-fold interactions is an (N+1)-fold interaction if they have (N-1) equivalence. Order = N. Just keep going until no more (N+1)-fold interactions are found. Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40).
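The merge rule can be sketched with plain sets. Here an N-fold interaction is a set of N point identifiers, and "(N-1) equivalence" is simplified to "the two sets share N-1 members"; the original work presumably compares geometry as well, so treat this as an assumption. The function name `merge_interactions` and the input triplets are invented for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def merge_interactions(
    triplets: Iterable[Tuple[int, int, int]],
) -> Dict[int, Set[FrozenSet[int]]]:
    """Grow interactions from order 3 upwards.

    Two order-N sets that share N-1 members are merged into one
    order-(N+1) set. Repeats until a level produces nothing new.
    Returns a mapping from order to the interactions found at that order.
    """
    current: Set[FrozenSet[int]] = {frozenset(t) for t in triplets}
    levels: Dict[int, Set[FrozenSet[int]]] = {}
    order = 3
    while current:
        levels[order] = current
        current = {
            a | b
            for a, b in combinations(current, 2)
            if len(a & b) == order - 1
        }
        order += 1
    return levels


# Four triplets that overlap pairwise, plus one unrelated triplet.
levels = merge_interactions(
    [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (7, 8, 9)]
)
```

In this example the four overlapping triplets collapse into a single order-4 interaction, and the loop stops because one set cannot be paired with anything.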
30
Catalytic quartet
31
Electrostatic interaction. Ligands are found close by rather than associated with the residues.
32
Iron binding site
33
Double disulphide
34
N-linked glycosylation binding site + Spot the non-sugar. This glycosylation site is the same as the active site found in “1a53”: indole-3-glycerol phosphate synthase.
35
Summary. Nearly all bioinformatics is based on hypothesis-driven data analysis. Data mining has lost its meaning within bioinformatics. Discovery-driven data analysis (true data mining): can find unknown dependencies, clusters, outliers; is based on statistical probability; returns distributions unbiased by previous ideas. Information theory may be better for genomes (1D): information is “a numerical measure of the uncertainty of an outcome”. The information content of gene sequences can be defined by the normalised probability of finding “words” within that sequence.
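The last point can be made concrete with Shannon entropy over word frequencies, which is one standard way to turn "normalised probability of finding words" into a number. The slides do not say which measure was intended, so this is an assumed reading; the function name `word_entropy` and the default word length are choices made for the example.

```python
import math
from collections import Counter


def word_entropy(sequence: str, k: int = 3) -> float:
    """Shannon entropy (in bits) of the overlapping k-letter words
    in `sequence`, using observed frequencies as probabilities.

    Returns 0.0 when the sequence is shorter than k.
    """
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # normalised probability of this word
        entropy -= p * math.log2(p)
    return entropy


uniform = word_entropy("ACGT", k=1)      # four equally likely letters
repetitive = word_entropy("AAAAAA", k=1)  # a single repeated letter
```

A sequence made of one repeated letter carries no uncertainty, while four equally likely letters carry two bits per position; real sequences fall in between.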