March 28, 2016

Why data lakes, what you wear & your tools all matter

By Paul Laughlin

Perhaps, like me, you’ve been a latecomer to believing there is anything in buzzwords like Data Lakes?

Along with all the hyperbole surrounding Big Data & Data Science, it’s easy to write off a whole new language because of some over-selling.

It’s also been galling, for those of us who have been working in data analytics for decades prior to it becoming fashionable, to put up with listening to the patently obvious presented in new clothes.

Just because you make your acrostic start with ‘V’s doesn’t mean that the concepts are new. Some of those preaching Big Data as a new opportunity should read Dan Kimball’s works on Data Management & Data Mining from the 1980s.

But, as ever, there is also a danger of an overreaction the other way. In my case, that was a danger of throwing the Big Data / Data Science ‘baby’ out with the proverbial bath-water.

Once I got over my initial skepticism, I found there was in fact much of value to learn from specialists in both Big Data & Data Science. Indeed, despite the risks of inappropriate use of Data Scientists, their progress with deploying Machine Learning has exceeded what I saw achieved back in the ‘Expert Systems’ & AI boom in the 1990s. There have also been benefits to the focus on Big Data, with more businesses identifying & using internal data previously neglected. Interestingly, though, most of the positive profit-creating case studies for Big Data are still focussed on using structured internal data; very few are making real money from using unstructured external data.

So, let’s take a quick peek at a few useful resources available to help with some of those current buzzwords…

Data Lakes

You may have heard the term ‘Data Lake’ coming to prominence. Both Big Data proponents and Data Scientists talk about it as something new & exciting. At first it just sounded like the Data Warehouse rebadged, but I am starting to see that some of the technical understanding & solutions are a useful alternative. Akin to the benefits of unstructured analytics ‘playpens’ within large data warehouse set-ups, Data Lakes provide the opportunity to dump unstructured data in an area to explore analytically prior to any significant IT spend on data feeds/schemas etc. Some also offer use of technology optimised to enable such enquiry & produce schemas as data is read.
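To make the ‘schemas as data is read’ idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the records, field names and the `infer_schema` helper are all made up for this example, and real Data Lake technology does far more than this. The point is simply that nothing is declared up front; the structure is discovered while reading the raw data.

```python
import json
from collections import defaultdict

# Hypothetical raw records, as they might be dumped into a lake
# with no schema agreed in advance (all values invented for illustration).
RAW_LINES = [
    '{"customer_id": 1, "steps": 8200, "device": "fitbit"}',
    '{"customer_id": 2, "steps": "10450"}',
    '{"customer_id": 3, "heart_rate": 61.5, "device": "garmin"}',
]


def infer_schema(lines):
    """Return {field: sorted type names seen}, discovered while reading."""
    seen = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


schema = infer_schema(RAW_LINES)
for field, types in sorted(schema.items()):
    print(field, types)
```

A field that turns up with more than one type (here `steps` arrives as both a number and a string) is exactly the kind of thing worth spotting during exploration, before anyone pays for a proper data feed.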

A useful short comparison has been shared by Tamara Dull, here on the KDnuggets site:

Data Lake vs Data Warehouse: Key Differences – KDnuggets

We hear lot about the data lakes these days, and many are arguing that a data lake is same as a data warehouse. But in reality, they are both optimized for different purposes, and the goal is to use each one for what they were designed to do.

So, it sounds like Data Lakes could genuinely help Data Scientists explore more internal & external, structured & unstructured data – to identify which has value.

Wearables (and the Internet of Things)

What about all the hype surrounding the Internet of Things? Is that just about ridiculously intelligent toasters or fridges self-ordering food you weren’t expecting? No, it seems this fad is also maturing into an industry developing devices that will produce data of real value to analysts. One of the more established fields is wearables. The revolution in self-tracking using Fitbit/Jawbone/Garmin/Nike/Apple et al is well trailed. Now analytics teams in industries like Insurance are catching on to the potential for improved customer experiences, understanding & even risk-reducing behaviour changes.

Vitality and other insurers are proving the potential here, with customers registering their devices and sharing data to achieve status points that offer them benefits. This data provides Vitality with a better understanding of behaviour to both improve their propositions & better model risks/pricing etc. So, here too, it’s not all hype. Beyond the longer mixed experience of insurers using telematics, wearables are proving to offer data of real value to Customer Insight teams.

Insurance Nexus have published the findings of their research with insurers. Although simple & at times simplistic, it’s an interesting temperature check of where the industry is really at with regard to using this technology (still early adopter stage):

Exclusive Survey Results: Insurance Internet of Things | Insurance Nexus

But with all this expanded potentially valuable data, perhaps now accessible via your Data Lake, are you able to analyse it?

The right tools for the job

If you’re struggling there, it’s often to do with not having the right tools for the job or struggling with compatibility between the tools used for different stages. For instance, do your data manipulation (Big Data) tools work well with your tools for data mining or modelling (Data Science)? Ironically, although a lot of information is published on the different data & analytics tools available, I’ve seen very little analysis of this data.

That makes the article published by Bob Hayes, on the CustomerThink blog, even more welcome. Building on previous survey data shared on the KDnuggets hub, Bob uses Principal Component Analysis to explain the relationships between 95 possible tools on the market (he publishes the result of a 13-factor solution). His results make interesting reading, suggesting 13 more common ‘tool groupings’. As well as the to-be-expected IBM-only & SAS-mainly solutions, there are others including up to 10 different Hadoop tools, plus a number of pairings & the simplest solution (just using XLSTAT for Excel).
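For anyone curious how Principal Component Analysis can surface ‘tool groupings’ at all, here is a rough sketch in plain Python. To be clear about the assumptions: the usage table below is invented toy data (not the KDnuggets survey), the four tool names are just placeholders, and the sketch only extracts the first unrotated component by power iteration. It is not a reproduction of Bob’s 13-factor solution, merely the underlying idea that tools used by the same people end up with similar loadings.

```python
import math

# Invented toy data for illustration only (NOT the survey results):
# each row is one hypothetical respondent, each column marks tool usage.
TOOLS = ["Hadoop", "Spark", "SAS", "Excel"]
USAGE = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]


def covariance_matrix(rows):
    """Sample covariance between columns (tools) across rows (respondents)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centred = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix, via power iteration."""
    vec = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(len(vec))) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


loadings = first_component(covariance_matrix(USAGE))
for tool, loading in zip(TOOLS, loadings):
    print(f"{tool:8s} {loading:+.2f}")
```

In this toy table the two Hadoop-style tools load with one sign while SAS and Excel load with the other, which is the kind of pattern that gets read as a ‘grouping’. A real analysis would extract several components and usually rotate them before interpreting.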

If you have a number of potential data & analytics tools ‘in house’ this could be well worth reviewing in detail:

Are You Using the Right Tools for Your Big Data Projects?

Data scientists rely on tools/products/solutions to help them get insights from data. Gregory Piatetsky of KDNuggets conducts an annual survey of data professionals to better understand the different types of tools they use. Here are the results of the 2015 survey.

Hope some of this post’s data & technology reflections have Sparked (excuse the pun) your interest.

However you view Data Lakes, the potential of IoT or the plethora of software available, I hope you’re getting value from your data. Remember, no matter how technically sophisticated your approach, if you’re not learning something that your business can put into action profitably, you won’t be at it for long.

Happy Data Mining!