
1 why smart data is better than big data Queen Mary University of London
Bayesian Networks: why smart data is better than big data. Bayesian Seminar, 16 October 2015. Norman Fenton, Queen Mary University of London and Agena Ltd. Pleasure to have the opportunity to talk in this series. I tried to get into the first one. I arrived a couple of minutes after 2.00 to find the room so packed that there was not even standing room left. I physically could not get in. That suggests to me there is a huge appetite for people to learn more about Bayesian methods, and what I am going to talk about today is influenced by years of research and practical experience – largely in the area of risk assessment and decision analysis. As a Director of Agena I declare an interest up front, because Agena is in the business of applying Bayesian methods to risk assessment and Agena has an established proprietary BN tool (for which there is a completely free version available).

2 From Bayes to Bayesian networks
Outline: From Bayes to Bayesian networks; Why pure machine learning is insufficient; Applications; Way forward. <CL> I am going to introduce BNs and explain why, due to relatively recent algorithmic breakthroughs, they have become an increasingly popular technique for risk assessment and decision analysis. <CL> I will explain why Bayesian networks ‘learnt’ purely from data – even when ‘big data’ is available – generally do not work well. <CL> I will provide an overview of successful applications (including transport safety, medical, law/forensics, operational risk, and football prediction). What is common to all of these applications is that the Bayesian network models are built using a combination of expert judgment and (often very limited) data. <CL> I will finally give an overview of the challenges ahead and conclusions.

3 From Bayes to Bayesian networks

4 Introducing Bayes. H (Person has disease?) We have a hypothesis H. E (Positive Test?) We get some evidence E. 1 in a 1,000; 100% accurate for those with the disease; 95% accurate for those without. Although I’m assuming most people here know what Bayes’ theorem is, I want to introduce it using the graphical formalism of BNs. <CLICK> We start with some hypothesis H (disease) – for simplicity assume it is Boolean, T or F. <CLICK> We get some evidence about H (e.g. the result of a diagnostic test); again for simplicity assume this outcome is T or F. <CLICK> We have a prior probability for H – say 1/1000 – so here is the prior probability table for H. <CLICK> We also know the probability of the evidence given the hypothesis – this is the test accuracy. Suppose, e.g., the test is always positive if a person has the disease, so P(E|H) = 1, and P(E|not H) = 0.05. So here is its probability table, which you can see is conditioned on the state of H. This, incidentally, is a complete specification of a BN. <CL> But what we want to know is the probability the person has the disease given a positive test. I am sure most people here know the answer, but it is worth pointing out that when this problem was presented to staff and students at Harvard Medical School most said the answer was 95%. What is the probability a person has the disease if they test positive?

5 Waste of time showing this to most people!!!
Bayes Theorem. We have a prior P(H) = 0.001. (Waste of time showing this to most people!!!) We know the (likelihood) values for P(E|H), but we want the posterior P(H|E):

P(H|E) = P(E|H) × P(H) / P(E) = P(E|H) × P(H) / [P(E|H) × P(H) + P(E|not H) × P(not H)] = (1 × 0.001) / (1 × 0.001 + 0.05 × 0.999) = 0.001 / 0.05095 ≈ 0.0196, so P(H|E) ≈ 2%.

In more familiar terms to most people here, I suspect: <CL> we have a prior for H, <CL> we know P(E|H), the likelihood of E, <CL> but what we really want to know is the posterior probability of the hypothesis given the evidence. <CLICK> Bayes’ theorem gives us the necessary formula for this, <CLICK> which gives about 2% – of course very different from the 95% assumed by most doctors. This suggests that Bayes is counterintuitive to most lay people and domain experts. But worse, <CL> showing them the formula and calculations neither makes them understand it nor convinces them the answer is correct. It might be easy for statisticians and for mathematically literate people in this simple case, but for MOST people – and this includes, from my personal experience, highly intelligent barristers, judges and surgeons – this is completely hopeless. And it is no good us arguing that it is not.
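The arithmetic on this slide can be checked in a few lines of Python; this is just my own sketch of the two-node calculation with the slide's numbers, not the AgenaRisk model itself:

```python
# Two-node diagnostic test example from the slide, computed with Bayes' theorem.
prior_h = 0.001          # P(H): 1 in 1,000 people have the disease
p_e_given_h = 1.0        # P(E|H): test is always positive when the disease is present
p_e_given_not_h = 0.05   # P(E|not H): 5% false-positive rate

p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)   # total probability
posterior = p_e_given_h * prior_h / p_e                          # Bayes' theorem
print(round(posterior, 4))   # 0.0196 -- about 2%, not the 95% most doctors assume
```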

6 Imagine 1,000 people To explain it you have to use diagrammatic methods like this.

7 One has the disease

8 But about 5% of the remaining 999 people without the disease test positive. That is about 50 people

9 So about 1 out of 50 who test positive actually have
the disease. That’s about 2%. <CLICK> Of the roughly 51 people who test positive, only one actually has the disease. <CLICK> So (if there is no other information about the patient) there is about a 98% chance that a person who tests positive does not have the disease. <CLICK> That’s very different from the 95% assumed by most medics.
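The same answer falls out of simple counting over the imagined 1,000 people; a small sketch of my own using the slide's numbers:

```python
# Natural-frequency version of the diagnostic test example: count expected outcomes
# in an imagined population of 1,000 people (numbers from the slides).
population = 1000
with_disease = population * 0.001                        # 1 person
true_positives = with_disease * 1.0                      # the test never misses it
false_positives = (population - with_disease) * 0.05     # ~50 of the 999 healthy people

p_disease_given_positive = true_positives / (true_positives + false_positives)
print(round(p_disease_given_positive, 3))                # ~0.02: about 1 in 50
```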

10 A more realistic scenario
Cause 1 Cause 2 This is a Bayesian network. Disease Y Disease Z Disease X Test A Symptom 1 Symptom 2 Test B. The problem is that neither the formulaic approach nor the diagrammatic one scales up. In any realistic scenario, our problem will involve more than just a single unknown hypothesis and piece of evidence. There may be more than one disease which leads to a positive result of Test A. So we introduce a Test B more specific to Disease X, but which also has a separate dependency on Test A. There also might be observable symptoms of Disease X, and some of these may also be more or less likely with the other diseases. Then we might know of some common cause of the diseases, and another which is influenced or caused by the first. This is a Bayesian network. The LACK of arcs represents conditional independence assumptions – so the BN is a simplified version of the full joint probability space over all the variables. That is good for two reasons: 1) having a visual representation improves understanding and communication, and 2) it makes the Bayesian inference simpler. <FINAL CLICK> Unfortunately, despite this, the necessary Bayesian calculations quickly become infeasible. Not only is it almost impossible to do the calculations manually even for small BNs, but the problem of producing an efficient exact algorithm is known to be NP-hard (i.e. intractable) in general. The necessary Bayesian propagation calculations quickly become extremely complex.
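The simplification the missing arcs buy can be stated generally (this is standard BN theory, not anything specific to this particular fragment): the joint distribution factorises into one conditional probability table per node, so each node only needs a table conditioned on its parents rather than on every other variable.

```latex
P(X_1, X_2, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{parents}(X_i)\bigr)
```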

11 Combined Evidence/data
The usual big mistake. Combined Hypothesis. Combined Evidence/data. Now, although (as I will show) that problem has to a large extent been resolved, there are many researchers – including even Bayesian statisticians – who are unaware of these developments, and this is a reason why BNs have so far been relatively under-exploited. <CLICK> The ramifications are that it is very common for researchers to try to solve Bayesian inference problems by collapsing the ‘real model’ into effectively a 2-node model. The results of doing that can be misleading or simply wrong.

12 The Barry George case. This flawed simplification is especially common when dealing with statistical forensic evidence in legal cases. <CL> BG was convicted in 2001 for the murder of TV presenter Jill Dando. <CL> A critical part of the prosecution case was the discovery in BG’s coat pocket of a tiny particle of gunpowder residue that matched that of the gun which killed JD. In 2007 BG’s lawyers successfully used an essentially Bayesian argument to argue that this evidence was ‘NEUTRAL’ and therefore should not have been presented at trial. A retrial was ordered with the gunpowder evidence ruled inadmissible, and BG was found not guilty.

13 Evidence George fired gun
The Barry George case. But the claim that the evidence was neutral was simply an artefact of collapsing a complex BN model <LR> into a 2-node model. <CL><CL> By collapsing it into a 2-node model we can use the LR (likelihood ratio) of the evidence to determine the extent to which the evidence favours the defence hypothesis or the prosecution hypothesis; we don’t even have to assume a prior probability for the hypothesis. But by transforming it, crucial information is lost and incorrect assumptions are made. In this case it was clear that while the evidence was neutral on H in the simple model, it was NOT neutral on H in the full model. I was not involved in the BG case but have published work about it. I have, however, been involved as an expert in several cases involving forensic and statistical evidence, and this type of mistake is common. Only by using BNs can flaws in the forensic scientists’ claims be exposed. This is especially worrying for DNA evidence. CONTINUED AVOIDANCE OF BNS IS SILLY BECAUSE A SOLUTION HAS BEEN AVAILABLE FOR 30 YEARS NOW. George fired gun. Evidence. George fired gun.
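For reference, the likelihood ratio referred to here is the standard one (this definition is general, not specific to the Barry George model):

```latex
\mathrm{LR} \;=\; \frac{P(E \mid H_p)}{P(E \mid H_d)}
\qquad
\begin{cases}
\mathrm{LR} > 1 & \text{evidence favours the prosecution hypothesis } H_p\\
\mathrm{LR} < 1 & \text{evidence favours the defence hypothesis } H_d\\
\mathrm{LR} = 1 & \text{evidence is neutral -- but only for that particular pair of hypotheses}
\end{cases}
```

The point of the full model is that the hypothesis pair against which the LR is computed changes when the model is collapsed, which is exactly how the crucial information gets lost.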

14 Lauritzen and Spiegelhalter
Late 1980s breakthrough. Pearl; Lauritzen and Spiegelhalter. The real breakthrough came in the late 1980s when AI researchers <CL> Pearl and <CL> Lauritzen & Spiegelhalter discovered fast exact Bayesian algorithms that work not for all BNs – that’s impossible – but for a very large class of practical BNs. Since then, increasingly sophisticated and widely available tools that implement these algorithms have become available, meaning that nobody should ever manually do Bayes’ theorem calculations or write their own programs to do complex inference.

15 A Classic BN. As an illustration of a classic BN model in action I can show you the famous Asia model. The idea here is that there is a chest clinic where patients come in with different symptoms and we have to diagnose what is wrong with them. For simplicity all the nodes here are Boolean.

16 Marginals. Before entering evidence in the model, the algorithm computes the marginal probabilities based on the user-defined priors. So, e.g., the marginal for the dyspnoea symptom (shortness of breath) is calculated from the user-defined conditional probability table of this node. The 50% for smoker simply means we provided a prior suggesting that 50% of people who have come to the clinic in the past are smokers. Only 1% had a recent visit to Asia. The marginals tell us that 45% of previous patients had bronchitis. Very few had TB or cancer.

17 As we enter evidence in the model, all the uncertain nodes get updated probability distributions. So if the patient has shortness of breath, the effect is that the chance of bronchitis increases massively to 83%. Although the other two also increase, bronchitis is now much more likely than not to be the problem. Dyspnoea observed.

18 Also non-smoker. If the person is NOT a smoker the probability drops a bit, but bronchitis is still overwhelmingly the most likely.

19 Positive x-ray. So we send the patient for an X-ray and it comes back positive (a bad thing). Although bronchitis is still the most likely, TB and cancer are both up to about 25%.

20 ..but recent visit to Asia
We then find out the patient had a recent visit to Asia and everything changes: TB is now easily the most likely disease. So the power of BNs: explicitly model causal factors; reason from effect to cause and vice versa; ‘explaining away’; overturn previous beliefs; make predictions with incomplete data; combine diverse types of evidence; visible, auditable reasoning. But first-generation tools have significant limitations…
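To make the chest-clinic walkthrough concrete, here is a brute-force sketch in Python. The probability tables are the commonly published Lauritzen and Spiegelhalter (1988) values for the Asia model, assumed here for illustration; they reproduce the 45% bronchitis prior and updates very close to those quoted above. Real BN tools use fast exact propagation algorithms rather than enumerating the joint distribution like this.

```python
# Brute-force sketch of the chest-clinic ("Asia") BN: enumerate the full joint
# distribution and condition on evidence. CPT numbers are the commonly published
# Lauritzen & Spiegelhalter (1988) values, assumed here for illustration.
from itertools import product

def p_asia(a):     return 0.01 if a else 0.99          # recent visit to Asia
def p_smoke(s):    return 0.50                          # smoker prior
def p_tub(t, a):   return (0.05 if t else 0.95) if a else (0.01 if t else 0.99)
def p_lung(l, s):  return (0.10 if l else 0.90) if s else (0.01 if l else 0.99)
def p_bronc(b, s): return (0.60 if b else 0.40) if s else (0.30 if b else 0.70)
def p_either(e, t, l): return 1.0 if e == (1 if (t or l) else 0) else 0.0   # TB or cancer
def p_xray(x, e):  return (0.98 if x else 0.02) if e else (0.05 if x else 0.95)
def p_dysp(d, e, b):
    p_true = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.8, (0, 0): 0.1}[(e, b)]
    return p_true if d else 1.0 - p_true

NAMES = ["asia", "smoke", "tub", "lung", "bronc", "either", "xray", "dysp"]

def joint(a, s, t, l, b, e, x, d):
    return (p_asia(a) * p_smoke(s) * p_tub(t, a) * p_lung(l, s) * p_bronc(b, s) *
            p_either(e, t, l) * p_xray(x, e) * p_dysp(d, e, b))

def posterior(query, evidence):
    """P(query = True | evidence), by summing the joint over consistent worlds."""
    num = den = 0.0
    for world in product([0, 1], repeat=len(NAMES)):
        w = dict(zip(NAMES, world))
        if any(w[k] != v for k, v in evidence.items()):
            continue
        p = joint(*world)
        den += p
        if w[query]:
            num += p
    return num / den

print(posterior("bronc", {}))                    # marginal ~0.45, as on the slide
print(posterior("bronc", {"dysp": 1}))           # dyspnoea: bronchitis now far the most likely
print(posterior("tub", {"dysp": 1, "smoke": 0,   # non-smoker, positive x-ray and a
                        "xray": 1, "asia": 1}))  # recent Asia visit: TB now leads
```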

21 How to develop complex models
Can we really LEARN this kind of model from data? The most obvious limitation of BN tools is that, while they are able to do the calculations in a BN model, they provide minimal support for actually building the BN model. <CL> This is an actual Bayesian network model colleagues in my research group built for risk assessment and risk management of offending behaviour in released prisoners with a serious background of violent behaviour. How do we build a model like this? Building a BN requires us first to build the graph structure and then to define the probability tables for each node. To see how difficult this could be, look at a node with 5 states having 2 parents, each with 5 states. <CL> <CL><CL> So there are 5 times 5 parent state combinations, and each of these has to be defined for each of the 5 states. That’s 125 table entries. Imagine a node with 5 parents. Many people who use BNs assume that the only sensible way to build them is to LEARN both the structure and the tables from data. <CL> But the data requirements for this are huge. Even when vast amounts of data are available, structure learning is largely a waste of time. For table learning it can be fine, but many of the problems we deal with simply do not have the data, and we have to rely at least in part on expert judgment. That has been the focus of our research and applications for several years.
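Spelling out the table-size arithmetic (the 5-parent figure is my own illustration of the obvious extension):

```latex
\underbrace{5}_{\text{child states}} \times \underbrace{5 \times 5}_{\text{parent combinations}} = 125
\qquad\qquad
5 \times 5^{5} = 15\,625 \text{ entries for a 5-state node with five 5-state parents}
```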

22 How to develop complex models
Idioms: Definitional idiom; Cause–consequence idiom; Induction idiom; Measurement idiom. A Bayesian network model for risk assessment and risk management of offending behaviour in released prisoners with a serious background of violent behaviour.

23 How to develop complex models
A Bayesian network model for risk assessment and risk management of offending behaviour in released prisoners with a serious background of violent behaviour. Bayesian net objects

24 How to develop complex models
Ranked nodes. When it comes to building node probability tables, we have developed and implemented the notion of ranked nodes, which makes it very easy to define large tables for an important class of variables.
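A rough sketch of the ranked-node idea as published by Fenton, Neil and Caballero: ordered states are mapped onto the unit interval and each column of the child's table is generated from a doubly truncated Normal whose mean is a weighted average of the parent values. The weights, variance and state names below are made-up illustrations, not values from the prisoner model.

```python
# Sketch of the ranked-node construction: each NPT column is a TNormal(weighted
# mean of parents, variance) truncated to [0,1] and discretised over the child's
# ranked states. Weights, variance and state names here are illustrative only.
import math

STATES = ["very low", "low", "medium", "high", "very high"]   # 5 ranked states
MIDPOINTS = [(i + 0.5) / len(STATES) for i in range(len(STATES))]

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def tnormal_column(mu, variance):
    """Probability of each child state under TNormal(mu, variance) truncated to [0,1]."""
    sigma = math.sqrt(variance)
    z = normal_cdf(1.0, mu, sigma) - normal_cdf(0.0, mu, sigma)   # truncation mass
    n = len(STATES)
    col = []
    for i in range(n):
        lo, hi = i / n, (i + 1) / n
        col.append((normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)) / z)
    return col

def ranked_npt(weights, variance=0.05):
    """Full NPT for a ranked child with two ranked parents: one column per parent combo."""
    npt = {}
    for i, p1 in enumerate(STATES):
        for j, p2 in enumerate(STATES):
            mu = weights[0] * MIDPOINTS[i] + weights[1] * MIDPOINTS[j]
            npt[(p1, p2)] = tnormal_column(mu, variance)
    return npt

# 5 x 5 = 25 columns of 5 entries (125 numbers) generated from just three parameters.
table = ranked_npt(weights=[0.7, 0.3])
print([round(p, 3) for p in table[("high", "very high")]])
```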

25 Static discretisation: marginals
But the most important development is the work on numeric variables that has been pioneered by my colleague Martin Neil, which deals with a critical limitation of the first-generation BN tools: their inability to properly and accurately handle continuous variables. Because the algorithms only apply to discrete nodes, any continuous variables have to be manually discretised. This is not only incredibly time consuming but also very inaccurate, as there is generally no way of knowing in advance which ranges require the finest discretisations. One of the most important applications we worked on was software reliability and defect prediction, which involved model fragments like this. With standard static discretisation this is the kind of result you get. Note how, e.g., we cannot differentiate between an observation of 2,000 and 20,000 KLOC. In 2007 my colleague Martin Neil developed a dynamic discretisation (DD) algorithm, which has been implemented in AgenaRisk and which largely resolves this critical problem. Static discretisation: marginals.
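To see why fixed bins lose information, a toy sketch (the bin edges are invented for illustration; the actual model's discretisation is not shown in the slides):

```python
# Toy illustration of the static-discretisation problem: with bin edges fixed in
# advance, very different KLOC observations can land in the same interval and so
# become indistinguishable to the model. Bin edges here are invented.
import bisect

KLOC_BIN_EDGES = [0, 10, 100, 1000, float("inf")]   # a typical coarse, fixed grid

def bin_index(x, edges=KLOC_BIN_EDGES):
    return bisect.bisect_right(edges, x) - 1

print(bin_index(2_000), bin_index(20_000))   # both fall in the same top bin
```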

26 Dynamic discretisation: marginals
With DD there is no need to do any manual discretisation – the algorithm works on the whole range and dynamically discretises as is necessary based on where most of the probability mass lies. This is the same model using DD (which incidentally can be built in a couple of minutes). Dynamic discretisation: marginals

27 Static discretisation with observations
Now compare what happens when you enter observations. Here is the result with static discretisation when KLOC = 50 and p = 0.2. Static discretisation with observations.

28 Dynamic discretisation with observations
Compared with the far more accurate results with DD. I should point out that AgenaRisk is the only BN tool that has implemented DD. Dynamic discretisation with observations

29 Why pure machine learning is insufficient
What I will now try to explain is why good BN models inevitably require expert judgment to build and cannot be learnt from data alone – no matter how much data you have.

30 A typical data-driven study
Columns: Age, Delay in arrival, Injury type, Brain scan result, Arterial pressure, Pupil dilation, Outcome (death y/n). [The slide shows a table of example patient records with values in these columns.] In a typical data-driven approach we have observations from a large number of patients – in the example here, taken from a study attempting to build a model to predict at-risk patients in A&E with head injuries. We have a bunch of variables representing observable factors about the patient and a record of the outcome. The idea is that we want to use the data to learn a model to help identify the patients most at risk of death. <CL>

31 BN Model learnt purely from data
Nodes: Age, Brain scan result, Injury type, Outcome, Delay in arrival, Arterial pressure, Pupil dilation. What you tend to end up with (and this is based on a published study) is a meaningless, illogical structure with poor predictive accuracy. A BN used in this way is no better than any other pure ML technique.

32 Regression model learnt purely from data
Nodes: Delay in arrival, Brain scan result, Arterial pressure, Pupil dilation, Age, Injury type, Outcome. Of course the classic statistical approach is to build a regression model. This is actually a special case of expert contribution (because there is a prior assumption about the structure). All variables <CL> except outcome <CL> are treated as independent risk factors affecting the dependent outcome variable. <CL> This often produces counterintuitive results, like the outcome being OK for the ‘worst’ combination of risk factors, and the classic 70% maximum classification accuracy.

33 Expert causal BN with hidden explanatory and intervention variables
Nodes: Brain scan result, Arterial pressure, Pupil dilation, Delay in arrival, Injury type, Seriousness of injury, Age, Ability to recover, Outcome, Treatment. What an expert can provide is the following causal and explanatory information. <CL> Delay in arrival and injury type determine the SERIOUSNESS of the injury. <CL> <CL> Arterial pressure, pupil dilation and the brain scan result are symptoms of the seriousness of injury. <CL> Ability to recover is influenced by seriousness and age. <CL> Most crucially, the outcome is influenced not just by your ability to recover but by whether or not you receive treatment. What the learnt model was missing were crucial variables like seriousness of injury and treatment. Especially at-risk patients are of course more likely to get urgent treatment to avoid the worst outcomes; hence the anomalies and inaccuracies of the data-learnt and regression models. By relying on the data available rather than the data that is necessary, I continue to see very poor BN models learnt from data. Such models in fact perform no better than any of the other multitude of ML models, ranging from regression models through to neural networks.

34 Danger of pure data driven decision making: Example of a Bank database on loans
Columns: Customer, Age, Marital status, Employment status, Home owner, Salary, Loan, Defaulted. [The slide shows a bank database of loan customers, including customers 9 and 15, who are under 20 and did not default.] OK, so we might need expert judgment when we have missing data, but with good experimental design and lots of good-quality data we can surely remove the dependency on experts… Because too many people ‘default’ on loans, the bank wants to use machine learning techniques on this database to help decide whether or not to offer credit to new applicants. In other words, they expect to ‘learn’ when to refuse loans on the basis that the customer profile is too ‘risky’. <CL> These are the problem customers. The fundamental problem with such an approach is that it can learn nothing about those customers who were refused credit precisely because the bank decided they were likely to default. Any causal knowledge about such (potential) customers is missing from the data. Suppose, for example, that the bank normally refuses credit to people under 20, unless their parents are existing high-income customers known to a bank manager. <CL> Such special cases (like customers 9 and 15 above) show up in the database and they never default. Any pure data-driven learning algorithm will ‘learn’ that unemployed people under 20 never default – the exact opposite of reality in almost all cases. Pure machine learning will therefore recommend giving credit to the people known to be most likely to default.
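A small simulation of the selection-bias trap described above; all numbers and the decision rule are invented for illustration:

```python
# Toy illustration of the selection-bias trap: the bank's past lending policy removes
# exactly the cases that would reveal the risk, so a naive learner concludes that
# unemployed under-20s never default. All numbers and rules here are invented.
import random

random.seed(0)

def true_default_prob(age, employed):
    # Hypothetical ground truth: young unemployed applicants are very risky.
    return 0.6 if (age < 20 and not employed) else 0.05

records = []
for _ in range(100_000):
    age = random.randint(18, 60)
    employed = random.random() < 0.9
    vetted_exception = random.random() < 0.02     # parents known to the bank manager
    # Past policy: under-20s are refused unless specially vetted, so they never
    # enter the database of granted loans at all.
    if age < 20 and not vetted_exception:
        continue
    defaulted = random.random() < (0.01 if vetted_exception else
                                   true_default_prob(age, employed))
    records.append((age, employed, defaulted))

young_unemployed = [r for r in records if r[0] < 20 and not r[1]]
rate = sum(r[2] for r in young_unemployed) / max(len(young_unemployed), 1)
print(len(young_unemployed), round(rate, 3))   # tiny group, near-zero observed default rate
```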

35 Other examples
Massive databases cannot learn even tiny models. The massive shadow cast by Simpson’s paradox. See: probabilityandlaw.blogspot.co.uk
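As a reminder of what Simpson's paradox looks like in practice, here is a standard textbook example (the often-quoted kidney-stone-treatment counts, not data from any study mentioned in this talk): treatment A has the higher recovery rate within every subgroup, yet treatment B looks better in the aggregate.

```python
# Classic Simpson's paradox: A wins in each severity subgroup, B wins overall,
# because severity determines both which treatment is given and the recovery rate.
groups = {
    #                 (recovered, total) for treatment A,  treatment B
    "small stones": ((81, 87),   (234, 270)),
    "large stones": ((192, 263), (55, 80)),
}

def rate(pair):
    recovered, total = pair
    return recovered / total

total_a = [sum(x) for x in zip(*(a for a, _ in groups.values()))]
total_b = [sum(x) for x in zip(*(b for _, b in groups.values()))]

for name, (a, b) in groups.items():
    print(name, round(rate(a), 2), round(rate(b), 2))     # A higher in each subgroup
print("overall", round(rate(tuple(total_a)), 2),
      round(rate(tuple(total_b)), 2))                     # B higher overall
```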

36 Applications

37 Legal arguments and forensics
As mentioned earlier, we have developed many BNs to capture complex legal evidence – like forensic evidence – and especially to expose many hidden and incorrect assumptions made by both forensic scientists and lawyers. This is from a real (ongoing) case that has exposed fundamental flaws in the way DNA evidence was interpreted in a rape case. The models use basic statistical assumptions from DNA and expert judgment. Legal arguments and forensics.

38 Football prediction overview
We have used BNs extensively in different areas of football analysis, including predicting Premiership results. The models use historical data and expert judgment (this is primarily the work of AC – Anthony Constantinou). This is the high-level view of such a model.

39 Parameter learning from past data
One component learns parameters from previous seasons’ data, but with current adjustments from experts.

40 Game specific information

41 Taking account of fatigue

42 Incorporating recent match data

43 Final prediction

44 Final prediction www.pi-football.com
Constantinou, A., Fenton, N. E. and Neil, M. (2013). "Profiting from an Inefficient Association Football Gambling Market: Prediction, Risk and Uncertainty Using Bayesian Networks". Knowledge-Based Systems, Vol 50.

45 The Royal London Hospital US Army Institute of Surgical Research
Trauma Care Case Study: QM RIM Group, The Royal London Hospital, US Army Institute of Surgical Research. Our case study was to provide decision support for the treatment of lower extremity injuries. The surgeons provided the clinical knowledge and data for the BN models, which were developed by William and Barbaros. There were actually two major models developed: one concerning patient physiology in trauma care, focused on coagulopathy risk, and the other, which also involved the US Army Institute of Surgical Research, concerned with predicting limb viability.

46 Improving on MESS Score method
The motivation for the work was to improve on the state-of-the-art with respect to decision making for amputations. The most prominent existing model was a scoring system based on data about whether amputations were made. So the model is essentially helping to predict whether an amputation was done rather than whether or not it should be done. The model failed to incorporate causal features of the process and failed to take account of the patient’s physiological state.

47 Life Saving: Prediction of Physiological Disorders
Treatment of lower extremities involves multiple decision-making stages, and the priorities in these stages change as the treatment progresses. The first BN model we developed is aimed at the life-saving stage of lower extremity treatment. Surgeons follow a ‘life over limb’ strategy in the treatment of lower extremities: if the patient’s life is in danger, they postpone the definitive reconstruction operations until the patient’s physiology stabilises and the risk of death decreases. An important physiological disorder in this phase is acute traumatic coagulopathy (ATC). We built a BN model that accurately predicts ATC using the observations available in the first 10 minutes of treatment. Our model was validated on three different datasets from three different hospitals and two countries. It had good results in all validations (external and temporal).

48 Limb Saving: Prediction of Limb Viability
Our second model aims to predict the viability of lower extremities after salvage is attempted. The model uses injury and treatment information to predict a non-viable lower extremity and a failed salvage attempt. This model was based on a large systematic review and meta-analysis of the literature, and a large dataset from the USAISR. The dataset contained information about the lower extremity injuries of injured US military personnel. The model’s results were accurate; it outperformed a well-known scoring model and multiple data-driven algorithms. In summary, both of the models built in our case studies made significant contributions to the clinical literature as well. These models provide accurate predictions in two important clinical areas where previous models have failed. Both models analyse the data based on clinical knowledge and evidence from the literature. DATA + KNOWLEDGE.

49 You can actually run the models online at this website

50 This interface hides the underlying model complexity and allows users to enter basic patient information and get updated risk probabilities in real-time.

51

52 Operational Risk

53 Way forward

54 Big Data … or Smart Data? (Slide labels: Big Data, machine learning, causal models, Knowledge, Smart Data.)
There are many who are unaware of the Bayesian developments who feel that the real solution to the problems I have spoken about will come with the advent of big data and increasingly powerful pure machine learning algorithms. I feel very strongly that much of this big data drive is an unnecessary waste of effort. Big data <CL> churned through pure machine learning <CL> more often than not delivers rubbish. <CL> <CL> It is the combination of knowledge <CL> and smart data <CL> which generates causal models that make sense, and the Bayesian approach is the most effective method for this smart data approach.

55 Challenges: Building good models with minimal data; tackling resistance to subjective priors; making BN models easier to use and understand. BAYES-KNOWLEDGE: bayes-knowledge.com

56 Bayesian calculations can and should be done with BN tools
Conclusions: Bayesian calculations can and should be done with BN tools. Some of the most serious limitations of BN tools and algorithms have been resolved. BNs have been used effectively in a range of real-world problems. Most of these BNs involve expert judgment and not just data. <CLICK> Indeed, the subjective approach and Bayes is the only rational way to reason under uncertainty and the only rational way to do risk assessment. <CLICK> BNs in real use have been under-reported; they are not just an academic research tool. <CLICK> Many of the traditional genuine barriers have now been removed. Manual model building has been revolutionised by improvements in tool design and advances in methods for generating tables from minimal user input. The Achilles heel of continuous nodes has essentially been fixed. There are issues of computational complexity, but these are even worse in alternative approaches such as Monte Carlo. So the remaining problems are largely perceptual. To gain trust in Bayes we need visual, non-mathematical arguments. There should NEVER be any need for discussion about the Bayesian calculations, just as there should not be any need to discuss or challenge, say, how a calculator is used to compute a long division. Under no circumstances should we assume that decision-makers can do the calculations or understand the way such calculations are done. I have indicated how BN tools have already been used to some effect. I believe that in 50 years’ time professionals of all types, including those in insurance, law and even medicine, will look back in total disbelief that they could have ignored these available techniques of reasoning about risk for so long.

57 Follow up
Get the papers: eecs.qmul.ac.uk/~norman. Get the book: BayesianRisk.com. Propose a case study for BAYES-KNOWLEDGE: bayes-knowledge.com. Try the free software and models: AgenaRisk.com

