4. Basic Programming¶
Machine dreams hold a special vertigo.
Up to this point in the book I’ve tried hard to avoid using the word “programming” too much because – at least in my experience – it’s a word that can cause a lot of fear. For one reason or another, programming (like mathematics and statistics) is often perceived by people on the “outside” as a black art, a magical skill that can be learned only by some kind of super-nerd. I think this is a shame. It’s certainly true that advanced programming is a very specialised skill: several different skills actually, since there’s quite a lot of different kinds of programming out there. However, the basics of programming aren’t all that hard, and you can accomplish a lot of very impressive things just using those basics.
With that in mind, the goal of this chapter is to discuss a few basic programming concepts and how to apply them in Python. However, before I do, I want to make one further attempt to point out just how non-magical programming really is, via one very simple observation: you already know how to do it. Stripped to its essentials, programming is nothing more (and nothing less) than the process of writing out a bunch of instructions that a computer can understand. To phrase this slightly differently, when you write a computer program, you need to write it in a programming language that the computer knows how to interpret. Python is one such language. Although I’ve been having you type all your commands at the command prompt, and all the commands in this book so far have been shown as if that’s what I were doing, it’s also quite possible (and as you’ll see shortly, shockingly easy) to write a program using these Python commands. In other words, if this is the first time reading this book, then you’re only one short chapter away from being able to legitimately claim that you can program in Python, albeit at a beginner’s level.
Computer programs come in quite a few different forms: the kind of program that we’re mostly interested in from the perspective of everyday data analysis using Python is known as a script. The idea behind a script is that, instead of typing your commands into the Python console one at a time, instead you write them all in a text file, or in a notebook, if you are using Jupyter Notebooks. Then, once you’ve finished writing them and saved the file, you can get Python to execute all the commands at once. In a moment I’ll show you exactly how this is done, but first I’d better explain why you should care.
4.1.1. Why use scripts?¶
Before discussing scripting and programming concepts in any more detail, it’s worth stopping to ask why you should bother. After all, if you look at the Python commands that I’ve used everywhere else this book, you’ll notice that they’re all formatted as if I were typing them at the command line. Outside this chapter you won’t actually see any scripts. Do not be fooled by this. The reason that I’ve done it that way is purely for pedagogical reasons. My goal in this book is to teach statistics and to teach R. To that end, what I’ve needed to do is chop everything up into tiny little slices: each section tends to focus on one kind of statistical concept, and only a smallish number of R functions. As much as possible, I want you to see what each function does in isolation, one command at a time. By forcing myself to write everything as if it were being typed at the command line, it imposes a kind of discipline on me: it prevents me from piecing together lots of commands into one big script. From a teaching (and learning) perspective I think that’s the right thing to do… but from a data analysis perspective, it is not. When you start analysing real world data sets, you will rapidly find yourself needing to write scripts.
To understand why scripts are so very useful, it may be helpful to consider the drawbacks to typing commands directly at the command prompt. The approach that we’ve been adopting so far, in which you type commands one at a time, and R sits there patiently in between commands, is referred to as the interactive style. Doing your data analysis this way is rather like having a conversation … a very annoying conversation between you and your data set, in which you and the data aren’t directly speaking to each other, and so you have to rely on R to pass messages back and forth. This approach makes a lot of sense when you’re just trying out a few ideas: maybe you’re trying to figure out what analyses are sensible for your data, or maybe just you’re trying to remember how the various R functions work, so you’re just typing in a few commands until you get the one you want. In other words, the interactive style is very useful as a tool for exploring your data. However, it has a number of drawbacks:
It’s hard to save your work effectively. You can save the workspace, so that later on you can load any variables you created. You can save your plots as images. And you can even save the history or copy the contents of the R console to a file. Taken together, all these things let you create a reasonably decent record of what you did. But it does leave a lot to be desired. It seems like you ought to be able to save a single file that R could use (in conjunction with your raw data files) and reproduce everything (or at least, everything interesting) that you did during your data analysis.
It’s annoying to have to go back to the beginning when you make a mistake. Suppose you’ve just spent the last two hours typing in commands. Over the course of this time you’ve created lots of new variables and run lots of analyses. Then suddenly you realise that there was a nasty typo in the first command you typed, so all of your later numbers are wrong. Now you have to fix that first command, and then spend another hour or so combing through the R history to try and recreate what you did.
You can’t leave notes for yourself. Sure, you can scribble down some notes on a piece of paper, or even save a Word document that summarises what you did. But what you really want to be able to do is write down an English translation of your R commands, preferably right “next to” the commands themselves. That way, you can look back at what you’ve done and actually remember what you were doing. In the simple exercises we’ve engaged in so far, it hasn’t been all that hard to remember what you were doing or why you were doing it, but only because everything we’ve done could be done using only a few commands, and you’ve never been asked to reproduce your analysis six months after you originally did it! When your data analysis starts involving hundreds of variables, and requires quite complicated commands to work, then you really, really need to leave yourself some notes to explain your analysis to, well, yourself.
It’s nearly impossible to reuse your analyses later, or adapt them to similar problems. Suppose that, sometime in January, you are handed a difficult data analysis problem. After working on it for ages, you figure out some really clever tricks that can be used to solve it. Then, in September, you get handed a really similar problem. You can sort of remember what you did, but not very well. You’d like to have a clean record of what you did last time, how you did it, and why you did it the way you did. Something like that would really help you solve this new problem.
It’s hard to do anything except the basics. There’s a nasty side effect of these problems. Typos are inevitable. Even the best data analyst in the world makes a lot of mistakes. So the chance that you’ll be able to string together dozens of correct R commands in a row are very small. So unless you have some way around this problem, you’ll never really be able to do anything other than simple analyses.
It’s difficult to share your work other people. Because you don’t have this nice clean record of what R commands were involved in your analysis, it’s not easy to share your work with other people. Sure, you can send them all the data files you’ve saved, and your history and console logs, and even the little notes you wrote to yourself, but odds are pretty good that no-one else will really understand what’s going on (trust me on this: I’ve been handed lots of random bits of output from people who’ve been analysing their data, and it makes very little sense unless you’ve got the original person who did the work sitting right next to you explaining what you’re looking at)
Ideally, what you’d like to be able to do is something like this… Suppose you start out with a data set
myrawdata.csv. What you want is a single document – let’s call it
mydataanalysis.R – that stores all of the commands that you’ve used in order to do your data analysis. Kind of similar to the R history but much more focused. It would only include the commands that you want to keep for later. Then, later on, instead of typing in all those commands again, you’d just tell R to run all of the commands that are stored in
mydataanalysis.R. Also, in order to help you make sense of all those commands, what you’d want is the ability to add some notes or comments within the file, so that anyone reading the document for themselves would be able to understand what each of the commands actually does. But these comments wouldn’t get in the way: when you try to get R to run
mydataanalysis.R it would be smart enough would recognise that these comments are for the benefit of humans, and so it would ignore them. Later on you could tweak a few of the commands inside the file (maybe in a new file called
mynewdatanalaysis.R) so that you can adapt an old analysis to be able to handle a new problem. And you could email your friends and colleagues a copy of this file so that they can reproduce your analysis themselves.
In other words, what you want is a script.