OMPOL – Visualisation of large chemical spaces

Peter Corbett, Colin Batchelor, Alexey Pshenichnov, Valery Tkachenko
Royal Society of Chemistry
ACS Spring 2016, San Diego, CA, March 17th 2016

ChemSpider Synthetic Pages
Compounds, Reaction, Analytical Data, Text and References

Chemical space

Dimensions and complexity of science
What about science and chemistry in particular?

RSC Data Repository

RSC Databases
RSC Compounds, RSC Reactions, RSC Spectra, RSC Crystals, RSC Polymers, RSC Materials, RSC Assays, RSC Algorithms, RSC Models …and on…

Record labels

Visualising Chemical Space
Need to be able to see what sorts of structures are in a collection, how they relate to each other, etc.
Could use something like clustering
Dimensionality Reduction – chemical structures -> fingerprints -> large dimensional space -> small dimensional space
Standard technique – Principal Components Analysis (PCA)

Dimensionality Reduction – First make a molecule-feature matrix
Each row is a fingerprint, each column is a fingerprint bit. In our implementation these are 512-bit similarity fingerprints calculated using Indigo. In this example the fingerprints are made up, and the compounds are just something I got from a PubChem search. (For those who are interested: column 1 = benzene ring, 2 = methyl group, 3 = chlorine, 4 = nitro, 5 = para-substituted benzene, 6 = trichloromethyl group, 7 = thiol, 8 = bromine, …, last = non-aromatic C=C double bond.)

PCA/SVD Do I need

The result
0.209 0.078 -0.368 … 0.030 0.297 0.174 0.509 0.005 0.343 0.514 -0.394 0.172 0.320 -0.034 -0.198 0.228 0.108 -0.791 0.338 0.812 0.151 0.403 -0.281 0.003
<--- Most important Least important --->
Outcome: the original matrix is transformed into another matrix (technically, three matrices, but only one is important here). Each row is a compound, each column represents a “dimension”, and the number says where the compound sits along that dimension. Dimensions are presented in decreasing order of importance, so you can throw out the last dimensions. For a simple plot, you can just take the two most important dimensions.
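To make the transformation concrete, here is a minimal sketch in plain Python. It is not the code behind OMPOL: the fingerprints are made up (five bits instead of 512), all function names are hypothetical, and power iteration with deflation is swapped in for a library PCA/SVD routine purely to keep the sketch free of dependencies.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean, as PCA works on centred data."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

def covariance(centered):
    """Sample covariance matrix of the (already centred) columns."""
    n_rows, n_cols = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
             for j in range(n_cols)] for i in range(n_cols)]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dominant_eigenpair(sym, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its vector."""
    n = len(sym)
    vec = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(sym, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # nothing left to find (matrix is effectively zero)
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(sym, vec)))
    return eigenvalue, vec

def pca(matrix, n_components=2):
    """Return per-compound coordinates and the variance along each component."""
    centered = center_columns(matrix)
    cov = covariance(centered)
    eigenvalues, components = [], []
    for _ in range(n_components):
        value, vec = dominant_eigenpair(cov)
        eigenvalues.append(value)
        components.append(vec)
        # Deflation: remove the component just found, so the next pass
        # converges on the next most important direction.
        size = len(vec)
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    coords = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in centered]
    return coords, eigenvalues

# Made-up fingerprints: one row per compound, one column per fingerprint bit.
fingerprints = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
coords, eigenvalues = pca(fingerprints, n_components=2)
for row in coords:
    print([round(value, 3) for value in row])
```

Each printed row is one compound's position in the two most important dimensions, which is what gets plotted. For real 512-bit fingerprints over tens of thousands of compounds, a library SVD would be the sensible choice.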

Plot on a graph
Note: this plots the top two dimensions – the ones that explain the greatest amount of variance in the data. We could plot other dimensions along the axes instead.

The problem
Need an interactive scatterplot
Web delivery => JavaScript
Need, at minimum, to click, mouseover, pan and zoom
Existing scatterplot libraries, e.g. flot.js, are plentiful and well supported…
…but do not scale well – become slow and unresponsive with ~40,000 data points

The solution
Make your own graph-plotting tool
OMPOL – One Million Points Of Light – an aspiration for scalability
HTML5 Canvas
“Google maps” style drawing
Divide graph into panels
Draw panels as they come onto the screen
Assemble display from pre-drawn panels
Opportunity for better ways of exploring the data
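The panel idea can be sketched independently of the Canvas API. The Python below is an illustration only, not OMPOL's JavaScript: the panel size, the function names and the `PanelCache` class are all assumptions, and "drawing" a panel is reduced to collecting the ids of the points that fall inside it.

```python
import math
from collections import defaultdict

PANEL_SIZE = 256  # assumed panel edge length in pixels

def panel_key(x, y, scale, panel_size=PANEL_SIZE):
    """Map a data-space point to the (column, row) of the panel holding it."""
    return (math.floor(x * scale / panel_size),
            math.floor(y * scale / panel_size))

def build_panel_index(points, scale):
    """Group point ids by panel, so each panel can be drawn on its own."""
    index = defaultdict(list)
    for point_id, (x, y) in enumerate(points):
        index[panel_key(x, y, scale)].append(point_id)
    return index

def visible_panels(left, top, width, height, panel_size=PANEL_SIZE):
    """Keys of the panels that overlap a viewport given in pixels."""
    first_col = math.floor(left / panel_size)
    last_col = math.floor((left + width - 1) / panel_size)
    first_row = math.floor(top / panel_size)
    last_row = math.floor((top + height - 1) / panel_size)
    return [(col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]

class PanelCache:
    """Draw each panel at most once, then reuse the pre-drawn result."""

    def __init__(self, index):
        self.index = index
        self.drawn = {}
        self.draw_count = 0

    def get(self, key):
        if key not in self.drawn:
            # Stand-in for rendering the panel's points to an off-screen image.
            self.drawn[key] = list(self.index.get(key, []))
            self.draw_count += 1
        return self.drawn[key]

# Three made-up points in data space, scaled to a 1000-pixel-wide graph.
points = [(0.1, 0.1), (0.9, 0.9), (0.5, 0.2)]
index = build_panel_index(points, scale=1000)
cache = PanelCache(index)

# A 512x256 viewport at the origin covers two panels.
for key in visible_panels(0, 0, 512, 256):
    cache.get(key)
# Redrawing the same viewport (for example after a small pan) reuses both.
for key in visible_panels(0, 0, 512, 256):
    cache.get(key)
print(cache.draw_count)
```

The point of the design is that the cost of a redraw depends on how many panels are on screen, not on how many data points exist in total.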

Example data
ChEBI: ~50,000 compounds of “Biological Interest”
Has an ontology of compound types

What we’re going to show
Display data from dimensional reduction
Selecting data points, sets of data points
“Narrowing down” a cluster of compounds based on distribution in multiple dimensions
Exporting data
Using name and ontology information to select groups of points

OMPOL. The data set is ChEBI; we did PCA on 512-bit similarity fingerprints (calculated using Indigo), and we’re plotting the first two principal components. But what does it all mean? Let’s mouseover and have a look at a data point.

Mouseover gets you a name and a structure, you can also click on the data point…

That puts a little selection rectangle down, and thus puts the compound in the sidebar, so you can click on links, for example to the ChEBI site

…and get more information that way.

We also have responsive scrolling and zooming – even with very large numbers of data points. In fact the main point of developing our own widget was to have something that would cope with tens of thousands, hundreds of thousands, possibly millions of data points, and still stay responsive. The cluster the mouse is pointing to, let’s look at that. First we scroll the scroll wheel…

…and draw a selection rectangle. This gets all the data points into the side bar.

Next step – we can highlight all of the data points in the selection rectangle. This turns them green, and allows the selection to persist even as other things change. For example, we can break up the nice cluster by changing which variables to plot. Let’s change the x-axis, so instead of showing the 1st principal component, it shows the 3rd…

This is a bit different now. We can see that our cluster of points breaks into two main clusters plus another two side clusters.

We can select a group of data points – maybe one of those side clusters…

We can also select a rectangle, and keep (“retain”) just the highlighted points within that rectangle, such that other highlighted points are unhighlighted, and no new points are highlighted.

Like so.

…like so
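The highlight and retain steps amount to set operations on point ids, combined with a choice of which two principal components to plot. A minimal sketch, with made-up coordinates and hypothetical function names (this is an illustration, not OMPOL's code):

```python
def project(coords, x_dim, y_dim):
    """Pick two dimensions (0-based) to plot for every compound."""
    return [(row[x_dim], row[y_dim]) for row in coords]

def points_in_rectangle(points, x_min, y_min, x_max, y_max):
    """Ids of the points that fall inside a selection rectangle."""
    return {point_id for point_id, (x, y) in enumerate(points)
            if x_min <= x <= x_max and y_min <= y <= y_max}

def highlight(highlighted, selected):
    """Highlighting adds the selected points to the persistent set."""
    return highlighted | selected

def retain(highlighted, selected):
    """Retaining keeps only highlighted points that are also selected."""
    return highlighted & selected

# Made-up coordinates: one row per compound, one column per component.
coords = [
    [0.1, 0.0, 0.9],
    [0.2, 0.1, 0.8],
    [0.1, 0.2, -0.7],
    [0.9, 0.9, 0.0],
]

# Plot components 1 and 2, select the tight cluster and highlight it.
view = project(coords, 0, 1)
highlighted = highlight(set(), points_in_rectangle(view, 0.0, 0.0, 0.3, 0.3))

# Switch the x-axis to component 3: the cluster splits in two.
view = project(coords, 2, 1)
# Retain only the highlighted points in the right-hand part of the new view.
highlighted = retain(highlighted, points_in_rectangle(view, 0.5, -1.0, 1.0, 1.0))
print(sorted(highlighted))
```

Because the highlighted set holds point ids rather than screen positions, it survives a change of axes, which is what makes the narrowing down across several dimensions possible.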

There’s also the option of using other means to select points. For example, ChEBI has an ontology – you can type in the name of an ancestor node, and it can keep following is_a relationships until it finds all of the compounds that share that node as an ancestor. So, to find all of the things that ChEBI thinks are antibiotics, we type in “antibiotics”, and ask OMPOL to shade them:

Et voilà! Of course, “antibiotics” covers a very wide range of structures. Let’s try something more specific, like cephalosporins.
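The ontology lookup described above, following is_a relationships down from an ancestor node, is a graph traversal. Here is a minimal sketch using a breadth-first search; the four is_a pairs are a toy ontology written for this example rather than data taken from ChEBI, and the function names are hypothetical.

```python
from collections import defaultdict, deque

def build_children(is_a_pairs):
    """Invert (child, parent) is_a pairs into a parent -> children map."""
    children = defaultdict(set)
    for child, parent in is_a_pairs:
        children[parent].add(child)
    return children

def descendants(children, ancestor):
    """Everything reachable from the ancestor by following is_a downwards."""
    seen = set()
    queue = deque([ancestor])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # guards against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return seen

# Toy ontology for illustration only.
is_a_pairs = [
    ("cephalosporin", "antibiotic"),
    ("penicillin", "antibiotic"),
    ("cefalexin", "cephalosporin"),
    ("benzene", "hydrocarbon"),
]
children = build_children(is_a_pairs)
print(sorted(descendants(children, "antibiotic")))
```

The returned set of names can then be mapped to data points and shaded, in the same way as a set of points chosen with a selection rectangle.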

Fairly similar results, although there appear to be some more data points. There’s a “main cluster” where the last lot of cephalosporins were; mousing over these data points shows that they’re fairly typical cephalosporins.

Going more towards the top left, there appear to be a few data points that are something to do with cephalosporins – like cepham – not a cephalosporin itself, but it contains the characteristic ring system. But most of these “stray” data points are nothing to do with cephalosporins.

For example “acephenanthrylene” – by coincidence it contains “ceph” but you can see it’s far from the cephalosporins in chemical space. Of course, the first and second principal components might not be the best for visualising cephalosporins vs non-cephalosporins. With a little exploring we can find better principal components…

For example, this data point – AL-321 – doesn’t look much like a hydrocarbon. It has oxygen, nitrogen and sulfur in it!

So if we get just that compound in the sidebar, follow the link to ChEBI…

…and then view the ontology part, then we get to see what’s going on.

Here we go. A “hydrocarbon” is defined as “A compound consisting of hydrogen and carbon only”. So far so good.

How scalable?
Works very nicely with ~50,000 data points and all features
During development, was able to work with 1M and on occasion 10M data points
Only in 2D, didn’t have all features enabled

Conclusion
Interacting with large (tens of thousands to millions of data points) multidimensional data sets is now a definite possibility

Thank you
