How classifying & fixing dirty data can help even those using spreadsheets
All too often the topic of fixing dirty data is neglected in the plethora of online media covering AI, data science & analytics.
This is wrong for many reasons. To highlight just one, being confident in the quality of your data is the vital foundation of all analysis. Plus this topic remains relevant for all levels of complexity, from those using spreadsheets to those building complex machine learning models.
So, I was delighted when our latest guest blogger, Tristan Mobbs, offered me this book review of “Between the Spreadsheets“. In his review, Tristan brings to life the lessons from this book & what to expect from the author Susan Walsh. Read on for his highlights & Tristan’s advice on who should read this practical book.
Yes, we are finally talking about Dirty Data
Susan Walsh’s book “Between The Spreadsheets” focuses on classifying and fixing dirty data, a topic that gets less daylight amid the glamour of Artificial Intelligence and Machine Learning. Having followed Susan for a while on LinkedIn, I know she highlights the benefits of making sure your data has its COAT on (Consistent, Organised, Accurate, Trustworthy). So, I was delighted to get her book for Christmas.
Susan takes the time to explain the challenges associated with poor data quality. She also discusses the painful real-world consequences of it. This book explores dirty data in the world of procurement. Susan looks at spend data classification and provides real examples of how she would go about validating and sorting out the dirty data.
Data horror stories to inspire action
Data quality and data validation are often unloved topics in the world of data. They don’t have the shiny appeal of Machine Learning or model development, but they are crucial. Anyone working in the world of data will have encountered an issue with data that had consequences for the company or people involved. In the energy industry, I often saw customers being billed who weren’t actually on supply with the company I worked for. It gets worse when the debt collectors are about to be unleashed.
There are many examples where data errors cause major issues, from Nasa losing their $125 million orbiter due to a unit conversion error to Beverley Council not paying their gas bill for 17 years. The range of impact can vary from looking a bit foolish to extremely costly. “Between The Spreadsheets” highlights the importance of these topics. Susan gives practical examples of steps to ensure your data is Consistent, Organised, Accurate and Trustworthy (COAT).
Cleaning data can be a tedious exercise. Susan’s practical examples guide you through a process that helps make the exercise a little less painful. When the benefits are highlighted too and the horror stories shared, this book helps motivate you to get on and clean up your data.
How can you get started with your dirty data?
We can all spot these errors and clean up our data by following Susan’s guidance, as well as using our own methods and techniques. Often these can be as simple as sorting the data, as this schoolboy found out when he corrected Nasa (yes, them again). Rocket science is easy compared to data validation, apparently.
Throughout this book, Susan provides guidance on how to clean up your data in Excel, sharing tips and tricks such as spotting common misspellings and the risks of replacing data without understanding its context.
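The book works these examples in Excel, but the same find-and-replace idea carries over anywhere. Here is a minimal sketch in Python; the supplier names and misspellings are hypothetical examples, not taken from the book:

```python
# Standardise known misspellings in a supplier column before classifying spend.
# The names and corrections below are invented for illustration.
corrections = {
    "Mircosoft": "Microsoft",
    "Microsft": "Microsoft",
    "I.B.M.": "IBM",
}

suppliers = ["Mircosoft", "IBM", "Microsft", "Dell", "I.B.M."]

# Look each value up in the corrections map; leave it unchanged if absent.
cleaned = [corrections.get(name, name) for name in suppliers]
print(cleaned)  # ['Microsoft', 'IBM', 'Microsoft', 'Dell', 'IBM']
```

Keeping the corrections in an explicit mapping, rather than doing ad-hoc find-and-replace, means the same fixes can be reapplied consistently every time new data arrives — which echoes the book’s point about context: you only replace values you have deliberately decided are wrong.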
Susan also highlights the importance of cleaning your data regularly. If the quality is checked often, each clean-up is a relatively small task; if left, however, the task can become huge, and the longer it is left, the bigger the impact on the business. What dodgy decisions might be being made because of false information in your organisation?
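A regular check is easier to keep up if it is automated. The sketch below shows one hypothetical way to flag rows that would fail a COAT-style review (missing values, inconsistent casing); the field names and rules are my own illustration, not the book’s:

```python
# Hypothetical spend records; two rows contain common quality problems.
rows = [
    {"supplier": "Acme Ltd", "amount": 120.0},
    {"supplier": "", "amount": 99.5},          # missing supplier name
    {"supplier": "acme ltd", "amount": None},  # inconsistent casing, no amount
]

def find_issues(rows):
    """Return (row_index, description) pairs for rows failing basic checks."""
    issues = []
    for i, row in enumerate(rows):
        if not row["supplier"]:
            issues.append((i, "missing supplier"))
        elif row["supplier"] != row["supplier"].title():
            issues.append((i, "inconsistent casing"))
        if row["amount"] is None:
            issues.append((i, "missing amount"))
    return issues

print(find_issues(rows))
# [(1, 'missing supplier'), (2, 'inconsistent casing'), (2, 'missing amount')]
```

Run against fresh data each week or month, a report like this keeps the clean-up small — exactly the "little and often" habit the book encourages.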
Who can benefit from reading this book?
None of the techniques and methods Susan shares is particularly complex; the majority of people working with data should have the skills to execute the advice in this book. By highlighting the issues, Susan will hopefully motivate more people to take an interest in ensuring their data is cleaned on a regular basis.
If you have limited experience in managing and maintaining data, then this book is for you. If you work in Finance, Procurement, Marketing or similar, you deal with data daily but may not have the technical knowledge of a data team. For this reason, Susan’s Excel tips are relatable and easy to implement, so almost anyone can improve the quality of their data using this book as a prompt and guide.
Thanks to Tristan for that review & to Susan for her book. I’d be betraying confidences if I shared specific examples, but I too have seen big organisations make costly mistakes due to data errors. These days it feels like all the focus on advanced analytics & data science is making such neglect even more likely. Thank goodness for the rise of DataOps as a topic. Hopefully, coupled with the emergence of more CDOs, they can ensure dirty data is tackled quickly & often. Let’s all resolve as data, analytics & insight leaders to not forget to question the quality of data being used. It is the very ground we stand on.