EMBL-EBI Visualization & Data mining

EMBL-EBI Visualisation
- The process of representing abstract data to aid in understanding the meaning of the data.
- Not to be confused with rendering data (drawing pictures).
- Typically, though, we render data in such a way as to visualise the information within that data.

EMBL-EBI Introduction
- Biological data comes from, and is of interest to:
  - Chemists: reaction mechanism, drug design
  - Biologists: sequence, expression, homology, function
  - Structural biologists: atomic structure, fold, classification, function
  - Medicine: clinical effect
  - Education
  - Media
- Presentation of diverse information to a diverse audience.
- Each has their own point of view (context):
  - Expert = scientist working within their own field of expertise
  - Non-expert = scientist using data/information outside their field
  - Novice = non-scientist

EMBL-EBI Web pages
- Not just presentation of results.
- These are notoriously badly designed, often leaving the information on the site unusable.
- The front page should load quickly.
- The main point should appear on the first full screen.
- Clutter: not logically laid out.
- Too busy: cannot find the salient point.
- 8% of men and 0.5% of women are colour blind.
- Bad text/fonts.
- Too often it doesn't work:
  - The user will go somewhere else.
  - The latest whizz-bang stuff only works on the latest browsers.
  - Only works in one browser, because it was only tested on one.
  - Does not conform to standard HTML.
- Google is a good design.

EMBL-EBI Asking questions
- Biological data is very complex
  - Chemistry, biology, physics, statistics, medicine…
  - Most users will be from a different field
- Asking the right question is difficult
  - The user cannot use the correct terminology
  - Too many things to query (2000 attributes in MSD)
  - SQL: not suitable for most users
- Interface too complex
  - Too many check boxes, widgets etc.
  - Trying to be too clever
  - The "Go" button is buried somewhere

EMBL-EBI Result presentation
- Biological data is complex
  - Chemistry, physics, biology, statistics, medicine…
- Expert users want all the detail
  - i.e. they want to use a specific method
  - They want all the details
  - They want (I hope) the statistical validity of the results
- The non-expert wants the best-practice answer returned within their own context
  - They want comparative analysis with other fields
  - They want to know the results are valid

EMBL-EBI Query design
- The simple text box design is very common
- Suitable for text queries
- Only one logic: AND or OR
- Predefined
- Easy to use
- Limited scope
- 2000 attributes -> 2000 check-boxes!

EMBL-EBI Query design
- Graphical interface
- Multiple logic: AND/OR/NOT
- Under the user's control
- Slower
- Steep learning curve
- Some users just cannot get it
- Intuitive once mastered
- Pretty

EMBL-EBI Query design
- Figurative 2D sketch for 3D query (active sites)
- Informative: presents meaning for the question
- Slower
- Less error prone
- HIS|SER:S/H>C2.0   HIS.ne2:S/S>C2.0   HIS.[n]/T>C2.0

EMBL-EBI YAMGP (yet another molecular graphics program)
- Many different programs are available:
  AstexViewer@MSD-EBI, Quanta, Rasmol, MolMol, Chime, O, Spock, Swiss-PDBviewer, Molscript, iMol, Pymol, Chimera, XtalView, Frodo, Bobscript, InsightII, Raster3D, WebLab-viewer, POVRay, Yasara, LigPlot, WebMol, Grasp, Mage, Whatif, VMD

EMBL-EBI Result visualisation
- Multiple types of biological data
  - Textual data
  - 3D structure
  - 2D chemical sketches
  - 1D sequence
  - Node linked
  - General/derived data
  - Web pages
  - Time
  - Errors/variance
- Patented!

EMBL-EBI Visualisation: AstexViewer@MSD-EBI
- Structure/sequence/data
- Visualisation techniques:
  - Lensing
  - Linked views
  - Brushing
  - Picking
  - Flying views
  - Hyperbolic distortion
  - Animation
  - Solid rendering
  - Depth cues
  - Colour, lighting
  - Highlighting
  - Etc…

EMBL-EBI Visualisation: comparative analysis
- Similarity/difference
  - Data superposition
- Attribute display
  - Colour, size…
- Correlation
  - Attribute mapping
  - Sequence colour by structure alignment

EMBL-EBI Animation
- Time-dependent display
  - Reaction chemistry
- Visual clues
  - Expression data
- Shown as…
  - Rotation
  - Flash
  - On/off
  - Object synchronization
  - Size, colour…
- Sound
  - NO: incredibly annoying

EMBL-EBI Multidimensional analysis
- Comparative analysis on multiple data
  - e.g. phi, psi, B-value, omega
- 1D & 2D are easy
- 3D graphs are difficult to see
- 4D requires 3D + iso-surfaces
- Higher: too busy
- Use 2D + multiple properties
  - SPOTFIRE is the best known
  - Use: X/Y/colour/size/shape…
  - Interactive bracketing

EMBL-EBI Visualization: Summary
- Rendering data is not visualization
- Not just the display of results
- Huge array of non-specific techniques, and an entire scientific field!

EMBL-EBI Data mining
- "Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data." (Hyperdictionary)
- "True data mining software does not just change the presentation, but discovers previously unknown relationships among the data." (IBM)

EMBL-EBI Data mining & Data analysis
- Traditional analysis is "verification-driven analysis"
  - Requires a hypothesis of the desired information (target)
  - Requires correct interpretation of the proposed query
- Discovery-driven data mining
  - Finds data with common characteristics
  - Results are ideal solutions to discovery
  - Finds results without a previous hypothesis
  - Results have unbiased mean and variance

EMBL-EBI So what is hypothesis-driven data analysis?
- Define a target = hypothesis
  - "catalytic triad": text string matching
  - Atomic coordinates: coordinate superposition
  - Mathematical graph: graph matching
  - HIS, ASP, SER: data hierarchy knowledge
- Search for target
- There are/are not "hits"
- Verify/negate hypothesis
- Distribution is centred on target

EMBL-EBI Four types of data mining
- Creation of predictive models: future data expectation
- Link analysis: connections between data objects
- Database segmentation: classification
- Deviation detection: finding outliers
(Source: IBM white papers)

EMBL-EBI So what is this data mining?
- Given multiple sets of primary data (dependent variables)
  - Characters, numbers, function(numbers), …
- Find anomalies
  - Too many: numerical occurrence
  - Data variation: derivatives
  - Singularities
  - …
- Correlations and clusters
  - Within primary data
  - With other data (independent variables)
- Finds new things! But not what it means!

EMBL-EBI Eg
- Retail and financial industries are heavily into DM.
- A well-known US food supermarket chain found a correlation:
  - Babies' nappies
  - Beer
  - 5pm on Friday
- Wife rings husband: "get some nappies for the weekend"
- Husband takes the opportunity to buy some beer!
- You won't get grant funding to test this hypothesis!
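
A correlation of the nappies-and-beer kind can be illustrated with a small co-occurrence count. The sketch below is only an illustration: the baskets are invented, the function name `pair_lift` is hypothetical, and "lift" (observed co-occurrence divided by what independence would predict) is one common measure, not necessarily the one the retailer used.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    lifts = {}
    for (a, b), together in pair_counts.items():
        # lift = P(a and b) / (P(a) * P(b)); above 1 means "more often than chance"
        lifts[(a, b)] = (together / n) / ((item_counts[a] / n) * (item_counts[b] / n))
    return lifts


# Invented Friday-evening baskets, for illustration only.
baskets = [
    {"nappies", "beer"},
    {"nappies", "beer", "crisps"},
    {"bread", "milk"},
    {"beer", "nappies"},
    {"milk", "nappies"},
    {"bread"},
]
lifts = pair_lift(baskets)
print(lifts[("beer", "nappies")])
```

A lift above 1 flags the pair as co-occurring more often than chance. It says nothing about why, which is the point of the slide: the mining finds the pattern, not its meaning.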

EMBL-EBI Self/Cross data mining
- Most mining software looks for correlations between dependent variables
  - Rainfall, temperature, cloud cover
  - It rains when it is cloudy
  - Free: http://www.cs.waikato.ac.nz/~ml/
- Bioinformatics usually involves anomalies within data objects
  - Sequence clusters (sequence fingerprints)
  - Local coordinate clusters (active sites)
  - Global coordinate clusters (folds)

EMBL-EBI Data mining: not idiot proof
- Date of birth and age will give 100% correlation
- Authors for structure submission will be correlated with authors on the primary citation
- "Lysozyme" is the most common fold pattern
- 36 spellings of E. coli will mask results
- Requires representative sets
  - Statistically valid ones too!
- Signal/noise ratio is a problem
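
The spelling problem can be reduced before mining by mapping variant names onto one canonical form. This is a minimal sketch under stated assumptions: the alias table `ALIASES` and the helper names are hypothetical, and a real curation step would need a far larger, curated synonym list.

```python
import re
from collections import Counter


def normalise_name(name):
    """Collapse case, punctuation and whitespace, e.g. 'E.Coli' -> 'ecoli'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Hypothetical alias table: normalised spelling -> canonical organism name.
ALIASES = {
    "ecoli": "Escherichia coli",
    "escherichiacoli": "Escherichia coli",
}


def canonical(name):
    """Return the canonical name if the spelling is known, else the input trimmed."""
    return ALIASES.get(normalise_name(name), name.strip())


raw = ["E.Coli", "E. coli", "e coli", "Escherichia Coli", "ESCHERICHIA COLI", "Homo sapiens"]
counts = Counter(canonical(n) for n in raw)
print(counts)
```

Without this step the five E. coli entries above would be counted as five different organisms, each too rare to show up as a signal.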

EMBL-EBI Discovery-driven data mining of the PDB
- Analysis of 3-dimensional coordinates
- Common patterns of atomic interactions, defined locally
  - DB segmentation: active sites & common packing features
  - Link analysis: similarity between different functional groups
- Defined globally
  - DB segmentation: common patterns of super-secondary structure
  - Link analysis: common folds in diverse protein families
  - Outlier detection: unique folds

EMBL-EBI Issues
- Systematic "error" propagates as a solution
  - 300 lysozyme structures return as a strong solution
- Results cannot be found below the noise level
  - Need to characterise the noise level
  - Need to improve the signal/noise ratio (S/N) to see information
- Target is not biologically defined
  - It does not give you the biological answer
  - Results should reproduce known biology
  - Can give you new results not previously observed

EMBL-EBI Data selection
- Cannot leave in 300 lysozyme structures!
  - Select by sequence similarity at 70% exact alignment
- Different "phase space" to select data
  - Remove structures with resolution < 2.5 Å
  - Remove NMR (different statistics)
  - Remove pre-1982 etc.
- Geometrical analysis criteria to check for outliers
  - Using properties, NOT target parameters of structure solution
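
The selection steps above can be sketched as a simple filter followed by picking one representative per sequence cluster. Everything named here is an assumption made for illustration: the `Entry` record, the precomputed `cluster` id (standing in for the 70% sequence clustering), the invented entries, and in particular the reading of the resolution rule as "discard structures poorer than 2.5 Å", i.e. a numeric value above 2.5. Flip the comparison if the other reading is intended.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entry_id: str                # invented identifier, not a real PDB code
    method: str                  # "X-RAY" or "NMR"
    resolution: Optional[float]  # in Å; None when not applicable
    year: int
    cluster: int                 # assumed precomputed 70% sequence cluster id


def select_representatives(entries, max_resolution=2.5, min_year=1982):
    """Filter entries, then keep the best-resolution entry of each cluster."""
    best = {}
    for e in entries:
        if e.method == "NMR" or e.resolution is None:
            continue  # NMR removed: different statistics
        if e.resolution > max_resolution:
            continue  # assumption: larger value = poorer resolution
        if e.year < min_year:
            continue  # remove pre-1982 entries
        current = best.get(e.cluster)
        if current is None or e.resolution < current.resolution:
            best[e.cluster] = e
    return sorted(best.values(), key=lambda e: e.entry_id)


entries = [
    Entry("E1", "X-RAY", 1.8, 1995, 1),
    Entry("E2", "X-RAY", 1.5, 1999, 1),
    Entry("E3", "NMR", None, 2000, 2),
    Entry("E4", "X-RAY", 3.0, 1998, 3),
    Entry("E5", "X-RAY", 2.0, 1979, 4),
    Entry("E6", "X-RAY", 2.2, 1990, 5),
]
selected = [e.entry_id for e in select_representatives(entries)]
print(selected)
```

Keeping a single entry per cluster is what stops 300 near-identical structures from dominating the statistics.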

EMBL-EBI Local atomic interactions
- Data
  - Function(3D coordinates) = distance
  - Atom names (independent variable)
  - Residue names (independent variable)
- Create a 3D hash table of triplets of distances(*) between "points"
  - This is the dependent variable
  - Order = 3
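
One way to picture the triplet hash table is sketched below. It is not the author's implementation: the distance cutoff, the bin width, the choice to key on the three sorted, binned distances, and the name `triplet_table` are all assumptions made for illustration, and the coordinates are invented.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def triplet_table(points, cutoff=6.0, bin_width=0.5):
    """points: list of (label, (x, y, z)).

    Returns {(bin1, bin2, bin3): [sorted label triples]} for every triple of
    points whose three pairwise distances are all within the cutoff.
    """
    table = defaultdict(list)
    for (la, a), (lb, b), (lc, c) in combinations(points, 3):
        d = sorted((dist(a, b), dist(a, c), dist(b, c)))
        if d[-1] > cutoff:
            continue  # not a local interaction
        key = tuple(int(x // bin_width) for x in d)  # three indices: the "3D" key
        table[key].append(tuple(sorted((la, lb, lc))))
    return table


# Two invented sites with the same geometry, 20 Å apart.
site_a = [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (0.0, 4.2, 0.0))]
site_b = [("HIS", (20.0, 0.0, 0.0)), ("ASP", (23.2, 0.0, 0.0)), ("SER", (20.0, 4.2, 0.0))]
table = triplet_table(site_a + site_b)
print(dict(table))
```

Triplets with the same geometry land in the same bucket, so heavily populated buckets are candidate recurring interactions. Distances close to a bin edge can fall into neighbouring buckets, which a real implementation would have to handle.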

EMBL-EBI Local atomic interactions
- Merge triplets
  - Any pair of N-fold interactions is an (N+1)-fold interaction if they have (N-1) equivalence
  - Order = N
  - Just keep going until no more (N+1)-fold interactions are found
- Time = 8 seconds to find ~2000 interactions (Digital Alpha ES40)
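
The merge rule can be read literally as a set operation: two N-member interactions that share N-1 members combine into one (N+1)-member interaction. The sketch below follows that literal reading. Representing an interaction as a frozenset of point ids is an assumption, as is the function name, and no check is made that the two unshared points are themselves in contact.

```python
from itertools import combinations


def merge_interactions(triplets):
    """Return {order: set of frozensets}, grown from order 3 upwards."""
    levels = {3: {frozenset(t) for t in triplets}}
    n = 3
    while True:
        merged = set()
        for a, b in combinations(levels[n], 2):
            if len(a & b) == n - 1:  # (N-1) members in common
                merged.add(a | b)    # the union has N+1 members
        if not merged:
            return levels            # no more (N+1)-fold interactions
        n += 1
        levels[n] = merged


# Four mutually interacting points (ids 1-4) plus one unrelated triplet.
triplets = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (7, 8, 9)]
levels = merge_interactions(triplets)
print(max(levels), levels[max(levels)])
```

The loop stops as soon as a pass produces nothing new, which must happen because an interaction cannot contain more members than there are points.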

EMBL-EBI Catalytic quartet

EMBL-EBI Electrostatic interaction
- Ligands are found close by rather than associated with the residues

EMBL-EBI Iron binding site

EMBL-EBI Double disulphide

EMBL-EBI N-linked glycosylation binding site
- Spot the non-sugar
- This glycosylation site is the same as the active site found in "1a53": indole-3-glycerol phosphate synthase

EMBL-EBI Summary
- Nearly all bioinformatics is based on hypothesis-driven data analysis
- Data mining has lost its meaning within bioinformatics
- Discovery-driven data analysis (true data mining):
  - Can find unknown dependencies, clusters, outliers
  - Is based on statistical probability
  - Returns distributions unbiased by previous ideas
- Information theory may be better for genomes (1D)
  - "A numerical measure of the uncertainty of an outcome"
  - Information content of gene sequences can be defined by the normalized probability of finding "words" within that sequence
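
The last point can be made concrete with Shannon entropy over the words of a sequence. This is a hedged sketch rather than the method the slides had in mind: taking "words" to be overlapping k-mers, and the function name `word_entropy`, are assumptions.

```python
from collections import Counter
from math import log2


def word_entropy(sequence, k=3):
    """Shannon entropy (bits) of the overlapping k-letter words in a sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(words)
    total = len(words)
    # Each word's normalised probability is its count divided by the total.
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(word_entropy("AAAAAAAA", k=2))  # a single repeated word: no uncertainty
print(word_entropy("ACGT", k=1))      # four equally likely letters
```

A repetitive sequence scores close to zero, while a sequence that uses many words evenly scores high, giving a numerical measure of uncertainty that needs no prior hypothesis about what the sequence encodes.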
