June 20, 2019

More Data Science methodology options – has much changed? (step 2 of 2)

By Paul Laughlin

Let’s continue our focus on Data Science methodologies. The reason for this focus is the need for more methodical delivery by many Data Science teams.

In the first post of this series, I made the case for having a Data Science methodology and shared 3 popular options. I hope you found those useful, but I’m also conscious that they are all old methodologies.

In that first post, I reviewed the CRISP-DM, KDD & SEMMA methodologies, all of which were created during the heyday of Data Mining. That was before the “AI winter”, a time when exciting things were happening, but largely through using large stats packages or bespoke applications.

Drivers of a fresh approach to Data Science methodologies

Most of the Data Science teams I know use an in-house bespoke methodology. Why have they not just used one of those I shared in my last post? As I’ve chatted with different Data Science leaders, a few themes have emerged in their answers.

Firstly, the way their teams work is different. Many have fully or partially transitioned to agile working. Although this does not preclude more rapid versions of iterative methodologies (like CRISP-DM), it changes the steps. So, a Data Science methodology using language more familiar to those using an Agile Development approach is needed.

Secondly, the tools being used by these teams have changed. Many are less reliant on IT departments than they were 20 years ago & the majority are coding in R or Python. This causes them to break free of methods more focussed on traditional statistical analysis packages, like SEMMA.

Finally, the type of data they work with is much more diverse. Data Scientists are often wrangling both structured and unstructured, internal and external data. Some are working with more complex relationship data, for instance from social media or other digital sources.

The way this data is accessed has also become more diverse. Data Science teams may routinely need to draw data from a variety of database structures (including column and even graph databases). They may also need to use their wider data access to load data into Data Lakes and spend longer on sourcing data than previous generations.

For all those reasons, it is not surprising to see an explosion of more varied options. It is nigh on impossible for me to be comprehensive in this post, but let me share some exemplars that I think typify different approaches to this challenge.

If in doubt, turn to Wikipedia

Another generational difference is the amount of material available online. So, as every student knows, if in doubt first check out the Wikipedia entry.

On the topic of Data Science methodologies, this is a useful exercise. Under an entry for “Data Analysis”, it shares this popular simple overview of a Data Science process. As you can see, it shares some similarities with the simple steps and feedback loops of KDD but has evolved.

To reflect modern Data Science practice, it also recognises the creation of either “data products” or “visualisation” that drives a business decision. It is also encouraging to see that the Exploratory Data Analysis stage (emphasised by the SEMMA method) has been given prominence.

A little Googling reveals the source of this Wikipedia entry is Springboard. In this helpful blog post on KDnuggets, they explain how it is meant to work:

The Data Science Process – KDnuggets

At Springboard, our data students often ask us questions like “what does a Data Scientist do?”. Or “what does a day in the data science life look like?” These questions are tricky. The answer can vary by role and company.

Technology providers continue to shape Data Science methodologies

I hope you can see some benefits to the clean simplicity of the above method. However, it is also quite high-level. It risks being simplistic and failing to highlight all the steps that should be considered.

In my last post, I recalled how SAS Software used to dominate the world of statistical analytics or Data Mining. One of the benefits of their focus was the creation of the SEMMA methodology.

For today’s Data Scientists, the technology behemoths are surely Amazon, Google & Microsoft. They own not only useful toolkits but whole environments (or ecosystems), including data storage and deployment. So, it’s not surprising to see that once again technology providers are taking the lead in shaping methodologies for practitioners.

As just one example of these useful contributions, I’ve shared above a visual summary of Microsoft’s Team Data Science Process Lifecycle. Despite feeling like too long a name, I like the emphasis on both teamwork & a lifecycle for use of data.

Beyond that, the diagram above shows that this methodology could be accused of being too flexible. But it does reflect the need for steps including pipelines for data, feature engineering and intelligent apps. That is a clear evolution with regard to the steps needed today.

If you value the flexibility and more comprehensive nature of this method, then much more detail is available within Microsoft’s Azure support documentation:

The Team Data Science Process lifecycle

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the complete steps that successful projects follow. If you use another data-science lifecycle, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), Knowledge Discovery in Databases (KDD), or your organization’s own custom process, you can still use the task-based TDSP.

Management Consultants don’t want to miss out on this party

Given both the complexity of Data Science options and the speed of change for organisations, it’s not surprising that many are unsure. As ever, management consultants are on hand to step into this gap. It seems almost daily that a new report is available on how to implement Data Science or AI in your business.

However, beyond just the marketing opportunity, some consulting firms are taking this field seriously and partnering with academics. This can be a really helpful step forward in educating today’s leaders and identifying where changes are needed.

One such example is the work of Booz Allen Hamilton. They have partnered with a number of leading Data Scientists, including Kirk Borne, to produce the handy “Data Science Field Guide”. Not only is this a useful tool for educating executives about Data Science, it also includes a methodology.

Because of the way this is visualised, it is hard to do it justice in the graphic above. I recommend downloading the field guide and flicking through the pages to appreciate the visual prompts they provide. It suggests to me that there is further data visualisation work to be done here. Could Data Viz produce a purely visual Data Science methodology?

Field Guide to Data Science

The Second Edition of The Field Guide to Data Science, released in December 2015, features a number of additions and enhancements based on our evolving understanding of how to best use data as a resource. In the Second Edition, you’ll find:

So, which Data Science methodologies are being used?

Having shared all that about how Data Science methodologies have developed, I have to admit there appears to be little research in this area. While searching for research or surveys into which methodologies are most used today, the most recent I could find was a survey updated by KDnuggets in 2014.

However, this slightly older research is still informative. They compare the results with a similar survey which they ran in 2007. Not much has changed, and the surprise for me is that, despite the examples shared in this post, most respondents are still using CRISP-DM.

Apart from individuals’ or organisations’ own methods, KDD and SEMMA are still popular. So, perhaps the Data Science world hasn’t changed that much since my first post recollecting the 1990s?

Here are the full results. Please, do let me know if you discover a more recent source of research on Data Science methodology usage:

CRISP-DM, still the top methodology for analytics, data mining, or data science projects – KDnuggets

Latest KDnuggets Poll asked What main methodology are you using for your analytics, data mining, or data science projects ? Compared to 2007 KDnuggets Poll on Methodology, the results are surprisingly stable. CRISP-DM remains the top methodology for data mining projects, with essentially the same percentage as in 2007 (43% vs 42%).

The end of our brief foray into Data Science methods?

I hope you have found these last two posts useful. I would be very interested in hearing which Data Science method you use.

If you use a Data Science methodology that I have not covered and are convinced it is worth recommending to others, please share.

I’m happy to add a third post in this series with recommended methods from readers. Data Science methodology is definitely still an evolving field.