Friday, 4 May 2018

Everyone else's data is untidy and in the wrong format

I forget where that quote in the title comes from, but it pops into my head every time I work on any real machine learning problem.

By "real" I mean an actual real-world problem that requires machine learning to find insights into a research question, rather than a tutorial built to explain some aspect of machine learning I might be interested in. A good tutorial will supply you with a dataset to work on and it will be beautifully clean and structured. That's not how real data is.

A data scientist spends 70% of their time wrangling data into something that can be analysed and the other 30% complaining about it.

Although I would never mention it in a professional setting or in a teaching workshop, if the data you are working with has been gathered using Excel, it will be very bad. I would never bring this up with the researchers compiling the dataset. They will get offended and argue their process forever. I'm telling you this because it's true, and it comes from experience: as soon as I get a dataset in ".xlsx" format, I expect the data wrangling to take a good deal more time and to throw up many more problems.

The good news is that Python is a great language to wrangle data with. It has great string processing functions, and Pandas makes working with spreadsheet data quick and convenient. But I'd like to tell you about another tool I've started using for cleaning up datasets: OpenRefine.
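To give a feel for the kind of string processing I mean, here is a minimal sketch using only Python's standard library. The sample data, the column names and the list of "missing" markers are all made up for illustration; your own mess will look different.

```python
import csv
import io
import re

# Hypothetical messy export: inconsistent headers, stray whitespace,
# mixed-case labels and a text marker standing in for a missing value.
RAW = """ Participant ID ,Group , Reaction Time (ms)
 p01 ,Control, 512
P02,control ,n/a
p03 , TREATMENT,498
"""

# Assumed spellings of "no value"; extend this set for your own data.
MISSING = {"", "na", "n/a", "none", "-"}


def clean_header(name):
    """Turn ' Reaction Time (ms)' into 'reaction_time_ms'."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


def clean_value(value):
    """Strip whitespace, lower-case, and map missing markers to None."""
    value = value.strip().lower()
    return None if value in MISSING else value


def clean_rows(text):
    """Parse CSV text and return a list of dicts with cleaned keys and values."""
    reader = csv.reader(io.StringIO(text))
    header = [clean_header(h) for h in next(reader)]
    return [dict(zip(header, (clean_value(v) for v in row))) for row in reader]


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Note that every value is still a string at this point. Converting types is a separate step, and it is usually where the next batch of surprises turns up.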


OpenRefine was originally a Google project that was open sourced. It is a powerful tool for cleaning up messy data and transforming it into clean data. There is a great paper by Hadley Wickham on what tidy data is. You might recognise that name. Hadley Wickham is a famous R developer, but this concept of tidy data is relevant to all data science regardless of what language you use.
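As a rough illustration of the tidy data idea (each variable gets its own column, each observation gets its own row), here is a small sketch in plain Python. The table, the column names and the melt helper are hypothetical; if you work in Pandas you would more likely reach for its built-in melt function.

```python
# Hypothetical "wide" table: one row per participant, one column per session.
wide = [
    {"participant": "p01", "session_1": 512, "session_2": 498},
    {"participant": "p02", "session_1": 530, "session_2": 541},
]


def melt(rows, id_key, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


tidy = melt(wide, "participant", "session", "reaction_time_ms")
for row in tidy:
    print(row)
```

The wide version hides a variable (the session) inside the column headers. The tidy version makes it an ordinary column you can filter and group on.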

OpenRefine is built around a server/client model. That means the processing operations and the visual representation are two separate programs. That concept is harder to explain than to use. If you download OpenRefine and run it, OpenRefine will start a small web server on your computer and open a browser window pointed at the web server's default address. The web server is where the processing happens and the web page is where you interact with your data. If you've used Jupyter Notebooks, you might recognise this arrangement.
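If the server/client split still feels abstract, this toy sketch shows the same shape using Python's standard library. It is not how OpenRefine is implemented; the Handler class and the reply text are invented for the example. One piece of code does the "processing" and a separate piece asks for the result over HTTP.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    """The server half: does the work and sends back a result."""

    def do_GET(self):
        body = b"processed on the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client half: with OpenRefine this role is played by your browser.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
reply = conn.getresponse().read().decode()
conn.close()

server.shutdown()
server.server_close()
print(reply)
```

Because the two halves only talk over a network address, nothing stops the server half from living in a container or on another machine entirely.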

I mention it because this arrangement facilitates the use of Docker containers and cloud computing resources, two things I think all data scientists should be familiar with.

So brace yourself. This is going to get good!


If you run a Windows environment, I suggest you wipe your hard drive and install Linux Mint or Ubuntu and never look back. Yeah, I know. You have to use Windows for reasons...

Again, just like ditching Excel, I would never suggest this in a professional environment or a programming workshop. People get offended and will argue with you. But I am suggesting to you that Linux or OSX are better options. Collectively they are referred to as a "POSIX" environment.

If you use a Mac (and I do), you have a terminal and you're set to go. You will still need to learn Linux, as that is what a Docker container runs, and cloud computing instances usually run Linux too. Most of the time commands will be interchangeable, but there are some differences.

On Linux (this is also what you would use on a cloud instance)

# This script is meant for quick & easy install via:
#   $ curl -fsSL -o
#   $ sh
I have made that last instruction a little vague on purpose. If you are not quite sure how to go about that, you need to do a beginner shell programming tutorial first. The terminal is a powerful tool. You can do a great deal of damage to your system from the terminal, so do the groundwork first rather than running commands you don't understand.

If you are using OSX, or you've ignored my advice and are going to do this on a Windows computer, you can install an application that will let you run Docker on your system: Docker install for OSX or Windows.

Once you have Docker, run this command in a terminal window:

docker run -p 80:3333 spaziodati/openrefine
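This will download a Docker image (it might take a while, so be patient) containing an OpenRefine server and start it in a container. You can then open a browser and type your machine's local address into the address bar; the -p 80:3333 option publishes the server on port 80 of your machine.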
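Since the download and startup can take a while, it can be handy to have a script wait until the server is actually accepting connections before you go looking for it. Here is a minimal sketch; the wait_for_port helper and its default timings are my own invention, not part of Docker or OpenRefine.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on that port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing listening yet (or not reachable): pause, then retry.
            time.sleep(interval)
    return False
```

With the command above you would call wait_for_port("localhost", 80), because -p 80:3333 maps the container's port 3333 onto port 80 of your machine. A successful connection only tells you something is listening there, not that OpenRefine has finished starting up, so treat it as a rough check.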

Bam! An OpenRefine server in a Docker container on your local machine. Running this server on a powerful cloud instance is only a few steps away.

This is how I run my projects: Twitter data scraping and processing with a MongoDB database, neuroimaging analysis, statistical analysis on behavioural data, Flask-based webapps for less technical members of the team to upload data and interact with analyses. I do it all on cloud instances, with a browser-based front end of some sort and servers in Docker containers on the back end.

It might sound complicated, but really it's a robust and flexible approach.

Post your comments and questions. I'll update this post with better instructions if it becomes obvious I've skipped over parts that aren't clear. I'm used to doing this now, so I might be assuming things are obvious when they aren't.

The next post will use OpenRefine to prepare some data for a machine learning process.