April 30, 2021

Is your focus on prediction weakening your Data Science work?

By Harry Powell

Returning to a more technical focus, let’s talk about the difference between focussing on prediction and focussing on explanation. Could your Data Science work be weakened if you just build predictive models without seeking to explain causal relationships?

To help us consider this, I am delighted to welcome back the very mathematically grounded Harry Powell. As regular readers will know, Harry is the Director of Data & Analytics for Jaguar Land Rover and an experienced data leader who has worked across startups & Financial Services before joining JLR. He has written for us before on leaders improving how they listen to presentations and has been an engaging guest on my podcast.

In this post, Harry muses on the reasons for the commercial impact that his data science team at JLR has achieved. He usefully unpacks the difference between a data science focus on just building predictive models that appear to work and actually explaining the causal relationships. Well worth considering his challenge…

How we are doing Data Science a bit differently at JLR

So this will sound bad, but it is genuine. I have been struggling with something. Why has my team made lots of money when many other data science teams have not? What have we done differently? Is there really anything to learn from it, or have we just been lucky? I have a number of ideas that I will share with you over the coming weeks. They may or may not be right, so I would welcome your thoughts.

The first idea is that I think we are using analytical models in quite a different way to many other data science teams.

Analytical models can be used to predict the outcome of a process or, conversely, to explain what caused that outcome. If you think of it as an equation (y = a + bx), prediction runs Right to Left (start with the information x and predict the outcome y), whereas explanation runs Left to Right (start with the outcome y you want and see what x you can change to get it).
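To make the two directions concrete, here is a minimal sketch in Python using NumPy. The function names (`fit_line`, `predict`, `required_input`) and the toy numbers are illustrative assumptions, not anything from the JLR team. Note that the Left to Right step is only a sensible guide to action if the slope b really does capture the causal effect of x on y.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    b, a = np.polyfit(x, y, deg=1)  # polyfit returns the slope first
    return float(a), float(b)

def predict(a, b, x):
    """Right to Left: given the input x, estimate the outcome y."""
    return a + b * x

def required_input(a, b, y_target):
    """Left to Right: given a target outcome, solve for the x that delivers it.

    Only meaningful if b reflects a causal effect of x on y.
    """
    if b == 0:
        raise ValueError("x has no estimated effect on y, so no x reaches the target")
    return (y_target - a) / b

# Made-up, noise-free data generated from y = 2 + 0.5x (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 0.5 * x

a, b = fit_line(x, y)
y_hat = predict(a, b, 6.0)            # prediction: what outcome follows from x = 6?
x_needed = required_input(a, b, 5.0)  # explanation: what x is needed for y = 5?
```

Both functions use the same two coefficients. The difference is which side of the equation you treat as given, and therefore how much you are relying on the coefficients being right rather than just the fitted values.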

The changing fashion of Data Science versus Analytics

Conventionally analytics was about explaining. Indeed, the original (and proper) use of regression models was never simply to predict, but to validate ideas about how a process worked.

But it has become more fashionable to focus on prediction. Massive steps forward in machine learning have enabled very complex relationships to be modelled. This can offer accurate forecasts of what might happen. These magical predictions are now ubiquitous, e.g. making recommendations, spotting fraud or setting airline ticket prices. With all the focus on prediction, explanation has taken a back seat.

But in some ways prediction is a sign of weakness.

Predictive models are a sign of weakness

You only need to predict an outcome when you can’t control it. Predicting sales volumes or optimal prices is only necessary because customers are a law unto themselves. You can’t make them buy more of your product. So if you want to increase sales your only option is to have a decent guess at what customers will want. Then try to work out your best response.

Because predictions tend to be inaccurate (people are heterogeneous, data is limited), prediction models tend to deliver only marginal gains. Moreover, since they don’t uncover any general principles that can be relied on, they need to be constantly recalibrated to whatever data is available.

But if you can explain what factors are driving outcomes, and if you can exert direct control over them, then you can drive much larger returns. For example, imagine you have built a model of product warranty claims. Rather than predicting the level of claims over time (nice to know maybe, but not very helpful), you can interrogate the factors that drive high warranty rates (for example the amount you spend on fixing issues) and make a change to address the problem.

Explanatory models are needed

So explainable models are not just a nice-to-have helping you to persuade stakeholders to cooperate. They are a valid basis for your intervention, and often the most impactful option.

Of course, the correlation/causation conundrum remains, but it is usually less of a problem in business than you might think. The question is not so much which regressors are influential (the order of the business process will often settle the arrow of causation), but how much to change them to optimise impact.

If you want your model to explain, you have to build it differently. Models built from data alone can only be used for prediction. For prescription or explanation, you need models built by combining your data with knowledge of the process that created that data.
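A small simulation can illustrate why process knowledge changes the answer. The sketch below is an assumption-laden toy, not a real warranty model: `usage` is a hypothetical factor that drives both repair spend and claims, and the true effect of `spend` on `claims` is set to -1. A model built from the two observed columns alone gets the sign of that effect wrong, while a model that uses the knowledge that `usage` sits upstream of both recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (all names and numbers are made up):
# heavy usage pushes up both repair spend and warranty claims,
# while each extra unit of spend actually reduces claims by 1.
usage = rng.normal(size=n)
spend = 2.0 * usage + rng.normal(size=n)
claims = -1.0 * spend + 3.0 * usage + rng.normal(size=n)

def ols(columns, target):
    """Least-squares coefficients, intercept first, then one per column."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Data only: regress claims on spend and read off the slope.
naive_effect = ols([spend], claims)[1]

# Data plus process knowledge: usage causes both, so adjust for it.
adjusted_effect = ols([spend, usage], claims)[1]
```

In this setup the population value of the naive slope is -1 + 3 × 2/5 = 0.2, so the data-only model suggests that spending more increases claims. It may still predict claims tolerably well, but acting on its coefficient would push the business in the wrong direction.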

Model representation matters

In other words, model representation matters. You can’t just throw a bunch of features into a process and expect to have a result that can be used to drive business decisions. What you are interested in is the coefficients of the model, not simply the output. The coefficients are the things that will tell you what actions to take. But coefficients of which variables? If the model is made up of meaningless or unactionable regressors, or if they are constructed in a way that does not reflect the underlying relationships, it doesn’t matter what your accuracy is. You can’t use it for anything but a prediction.

Moreover, from a technical point of view, if your model parameters are not identified (including the rank and order conditions), or your independence assumptions are wrong, you are in trouble. So, it is a bit harder to build knowledge-driven models than purely data-driven models.
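As a rough illustration of what "not identified" can mean in the simplest case, the sketch below checks whether a regression design matrix has full column rank. This is a much weaker check than the rank and order conditions for systems of simultaneous equations, and the variable names (`labour`, `parts`, `total`) are hypothetical. If one regressor is an exact combination of others, the fitted values are unaffected but the individual coefficients are not uniquely determined, so they cannot tell you which lever to pull.

```python
import numpy as np

def is_full_column_rank(design):
    """True if no column is an exact linear combination of the others."""
    design = np.asarray(design, dtype=float)
    return int(np.linalg.matrix_rank(design)) == design.shape[1]

rng = np.random.default_rng(0)
n = 200
labour = rng.normal(size=n)   # hypothetical regressor
parts = rng.normal(size=n)    # hypothetical regressor
total = labour + parts        # exact sum of the other two
ones = np.ones(n)             # intercept column

design_ok = np.column_stack([ones, labour, parts])
design_bad = np.column_stack([ones, labour, parts, total])

ok = is_full_column_rank(design_ok)    # separate effects can be estimated
bad = is_full_column_rank(design_bad)  # coefficients are not unique
```

A purely predictive workflow may never notice the second case, because many libraries will quietly return one of the infinitely many coefficient vectors that fit equally well.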

It is a bit worrying that many data scientists are not very well versed in this. They probably don’t get caught out because the business baseline performance is low. The answer they get from their data-driven models is going to be worse than if a knowledge-driven model were used, but it’s better than the baseline. It’s also a popular approach because it’s a lot faster to build a data-driven model. Until baselines improve or the consequences of error are severe, like death or huge financial loss, practitioners will keep relying on data-driven models.

You can do better, Data Science leaders

But you can make the change, and you are missing out on millions of pounds of value if you don’t.

One book we have found pretty useful is Causal Inference in Statistics: A Primer by Judea Pearl, Madelyn Glymour and Nicholas P. Jewell (as previously recommended by Paul).

The data scientists in my team used it as a book group text and it has completely changed the way they see their role.

Remember, businesses are always interested in shaping outcomes to make the future conform to their goals, so prediction (data-driven) models aren’t sufficient. Behind every prediction question asked by our stakeholders, there’s a prescriptive action they want to take. So, build models that explain and show you what to do to make a difference.

Are you convinced by the need for explanatory models?

Many thanks to Harry for his challenge. Remembering my own Data Mining background (back in the day), I think he makes a good case.

But, what is your experience, dear reader? Do you agree and have you seen success through developing explanatory models? Do you disagree and want to make the case for the power of predictive models even when complex non-linear relationships cannot be fully explained? I look forward to hearing from the data science leadership community on this one.