Data Science programming languages: (2) Resources for Python
As promised in our previous post on the R programming language, this one will focus on resources for Python.
Although R may have a longer heritage within the Statistics and Data Science community, Python could be described as a more complete programming language.
In my conversations with clients and Data Science leaders, I’ve also heard a number praise Python as much quicker to learn. So, although both languages are proving popular with analytics teams, there is perhaps a choice between the more statistically grounded R and the easier programming in Python.
But, even that distinction is now less clear, as both benefit from the kind of support/resources ecosystem that I mentioned in my post on R.
So, enough introduction. Let me share some resources that I’ve found to help Python coders (and would-be coders). Enjoy diving in, at the risk of getting bitten by the coding bug.
Resources for Python: Learning the language
As I stated for those intending to learn R, I recommend using a well written book. In my own experience, both the benefits of having this resource to refer back to regularly and the convenience of working at your own pace/location can still favour having a textbook.
That said, printed resources on Python abound. Only the other week, I saw a glossy magazine entitled ‘Learn Python’ on my local supermarket’s news stand! That is surely the definition of having gone ‘mass market’. But the quality of the book you choose and the experience of the author are crucial, so it is worth asking for recommendations.
I don’t know Zed Shaw personally, but his book (and supporting resources) was the most frequently recommended amongst those I asked (and searched). With the wonderfully off-putting title of “Learn Python the Hard Way”, it has a great mix of exercises and coding examples, and it ensures you understand the syntax and how it works (rather than just following instructions). Despite the title, it is also well written and a great introduction for someone who has never programmed before.
When you buy Learn Python 3 the Hard Way directly from the author, Zed A. Shaw, you’ll get a professional quality PDF and hours of HD Video, all DRM-free and yours to download. Or, you can read a free sample of Learn Python 3 the Hard Way before you decide.
As a side note, if you are an experienced programmer, just looking to pick up Python programming to add to your repertoire, then I’d recommend “Dive Into Python 3” by Mark Pilgrim.
Resources for Python: Cheatsheets to help you remember
Given our love of infographics and visualisations on this site, I want to repeat this section for Python (as we did for R). The same reasons still stand: as you learn more, memory aids are helpful. Python arguably has an easier to understand syntax, so perhaps cheatsheets are less needed for the basics. However, as the base Python language lacks some of the statistics and data science functions within R, you can easily end up also wanting to remember how to use a number of libraries that extend the language.
As well as providing their own Data Camps to learn R or Python, DataCamp have done a good job of creating cheat sheets to help you remember both some key elements of Python syntax and some of the key libraries you will want to add.
Here is their base Python for Data Science cheat sheet:
This handy one-page reference presents the Python basics that you need to do data science. 54% of the respondents of the latest O’Reilly Data Science Salary Survey indicated that they used Python as a data science tool. This is a small increase in comparison to the results of the 2015 survey, where 51% of the respondents indicated that they used Python.
I’ll share in the next section the cheat sheets for some of the recommended libraries to install, to extend Python to the full Data Science tool you require.
Resources for Python: Some of the libraries you will want to add
To start with, consider that you need to handle data for analytics purposes. Here you are likely to need more scientific computing capabilities than base Python provides, and an alternative data structure to Python’s lists. To get you started on that, a good option is the NumPy library, as this will also extend your data manipulation capabilities.
Here is DataCamp’s cheat sheet as an introduction to NumPy:
Given the fact that it’s one of the fundamental packages for scientific computing, NumPy is one of the packages that you must be able to use and know if you want to do data science with Python.
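To make the difference from plain Python lists concrete, here is a minimal sketch of my own (it is not taken from the cheat sheet). The sales figures and variable names are invented purely for illustration.

```python
import numpy as np

# Hypothetical sample data: units sold on each day of one week.
sales_list = [120, 135, 128, 150, 142, 160, 155]

# With a plain list, element-wise arithmetic needs an explicit loop
# or comprehension.
with_tax_list = [value * 1.2 for value in sales_list]

# A NumPy array applies the same arithmetic to every element at once.
sales = np.array(sales_list)
with_tax = sales * 1.2

# Arrays also carry summary methods and support boolean filtering.
average = sales.mean()
busy_days = sales[sales > 140]

print(average)
print(busy_days)
```

The array version is shorter, and on large datasets it is typically much faster because the loop runs inside NumPy rather than in Python.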
Beyond the improved mathematical functions and structures provided by NumPy, another essential library to install is pandas. This significantly improves the data structures supported and your data manipulation functions for analysis purposes. It also enables you to start exploratory data analysis, including use of summary statistics. Here is a cheat sheet for pandas:
The Pandas library is one of the most preferred tools for data scientists to do data manipulation and analysis, next to matplotlib for data visualization and NumPy, the fundamental library for scientific computing in Python on which Pandas was built.
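As a taste of the data manipulation and summary statistics mentioned above, here is a small sketch of my own using a pandas DataFrame. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical order data, invented for this example.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "units": [10, 15, 12, 9, 20],
    "price": [2.5, 2.0, 2.5, 2.0, 3.0],
})

# Derive a new column from existing ones (element-wise, like NumPy).
df["revenue"] = df["units"] * df["price"]

# Summary statistics (count, mean, std, quartiles) for numeric columns.
summary = df.describe()

# Group rows by a category and aggregate within each group.
by_region = df.groupby("region")["revenue"].sum()

print(summary)
print(by_region)
```

A few lines like these are often enough to begin exploratory data analysis on a new dataset.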
We have spoken a lot about the importance of data visualisation this year, and Python requires additional libraries to improve its data visualisation capabilities. Bokeh provides the data charting and visualisation capabilities you’ll require, as well as the capability to customise interactive visualisations for added sophistication. Here is a cheatsheet for Bokeh:
Bokeh distinguishes itself from other Python visualization libraries such as Matplotlib or Seaborn in the fact that it is an interactive visualization library that is ideal for anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
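For a feel of what an interactive Bokeh plot involves, here is a minimal sketch of my own, assuming a reasonably recent Bokeh release. The data is invented, and I render the plot to an HTML string rather than opening a browser window, so that the example has no side effects.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Hypothetical monthly figures, invented for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 128, 150, 142, 160]

# The tools string chooses which interactive controls appear on the plot.
p = figure(
    title="Monthly sales (illustrative data)",
    x_axis_label="Month",
    y_axis_label="Units sold",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
)
p.line(months, sales, line_width=2)   # trend line
p.scatter(months, sales, size=8)      # a marker on each data point

# Build a standalone HTML page containing the interactive plot.
html = file_html(p, CDN, "Monthly sales")
```

In a notebook or script you would more usually call `show(p)` from `bokeh.plotting`, which opens the plot for you to pan, zoom and hover over.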
In addition, there may be instances where you need the added capabilities of comprehensive data visualisation libraries like Matplotlib or ggplot (especially for ‘small multiples’). But this interview with Bryan Van de Ven (core developer of Bokeh) helps explain the relative strengths of that library:
Beautiful Python Visualizations: An Interview with Bryan Van de Ven, Bokeh Core Developer – KDnuggets
Read this insightful interview with Bokeh’s core developer, Bryan Van de Ven, and gain an understanding of what Bokeh is, when and why you should use it, and what makes Bryan a great fit for helming this project.
The last library cheat sheet I want to share with you (from DataCamp’s impressive collection) is for scikit-learn. It’s a library that supports machine learning, pre-processing and cross-validation algorithms. So, if you’re really interested in using Python for Data Science (rather than only Analytics/Statistics), then you’ll want to install this too. Here is the last of those handy cheat sheets, to jog your memory or guide your learning, a cheatsheet for scikit-learn:
Most of you who are learning data science with Python will have definitely heard already about scikit-learn, the open source Python library that implements a wide variety of machine learning, preprocessing, cross-validation and visualization algorithms with the help of a unified interface.
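To illustrate that unified interface, here is a short sketch of my own that chains a preprocessing step and a model, then cross-validates the pair. The choice of the built-in iris dataset and of logistic regression is mine, purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Preprocessing and model are combined into one estimator, so the scaler
# is re-fitted on the training folds only during cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```

Because every estimator shares the same fit and predict methods, you could swap in a different model without touching the rest of the code.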
In addition to those cheatsheet examples, I’d recommend looking at this worked example of applying Random Forest using Python. As it is the single most popular Data Science algorithm I hear being used by analysts, it should be relevant to most teams. In this helpful worked example, you will also see the libraries that Alex Woods needs to install to execute this code:
Random Forest is a machine learning algorithm used for classification, regression, and feature selection. It’s an ensemble technique, meaning it combines the outputs of many instances of a weaker technique in order to get a stronger result. The weaker technique in this case is a decision tree. Decision trees work by splitting and re-splitting the data by features.
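If you would like a quick feel for the technique before reading that worked example, here is a minimal sketch of my own using scikit-learn. It is not Alex Woods’s code; the dataset, split and parameter values are my assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold back a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.25, random_state=0, stratify=iris.target,
)

# The forest is an ensemble of 100 decision trees, each trained on a
# bootstrap sample of the training rows.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Proportion of held-back rows classified correctly.
accuracy = forest.score(X_test, y_test)

# Relative importance of each feature; this is what makes the
# algorithm useful for feature selection.
importances = dict(zip(iris.feature_names, forest.feature_importances_))

print(accuracy)
print(importances)
```

The importances sum to one, so they can be read as each feature’s share of the forest’s decision making.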
Resources for Python: Join the tribe at PyCon
Finally, for those seeking to continue their learning in Python and see what others are doing, I recommend getting involved in the active community. As with the R community, there are committed fans of this language, happy to share their knowledge and examples for free. There are also a number of member organisations, but the chief one I would recommend is the UK Python Association (UKPA). This is a charity, created with the aim of advancing both education & public benefit in use of Python.
I am delighted to say that their PyCon UK 2017 is once again being held near me, in Cardiff. It will be held at Cardiff City Hall from 26-30 October 2017. The agenda is yet to be confirmed, but that means they are also still open to speakers and input. So, if you have a story to tell or tips to share, explore the link below to have your say:
PyCon UK 2017 is over. Thank you to everybody who supported the conference! Videos of talks are already appearing on YouTube, and links to more post-conference resources will appear here soon. PyCon UK is the annual gathering of the UK Python community and its friends from around the world.
Resources for Python: what has helped you?
I hope that was a useful collection of Python resources to help you. Do you have others to share? If so, please publish those links in the comments box below. If there are particularly popular options, I will then update this post to include them.
Beyond R & Python will be a journey of discovery for me. But I’ve heard that all the ‘cool kids’ are now programming in Julia, or in Scala and other JVM-based languages. So, that may not be my next post, but I will be sharing a wider range of Data Science languages in future posts. Let me know if there are any you want to see covered.