Friday 4 May 2018

Everyone else's data is untidy and in the wrong format



I forget where that quote in the title comes from, but it pops into my head every time I work on any real machine learning problem.

By real I mean an actual real-world problem that requires using machine learning to find insights into a research problem rather than a tutorial to explain some aspect of machine learning I might be interested in. A good tutorial will supply you with a dataset to work on and it will be beautifully clean and structured. That's not how real data is.

A data scientist spends 70% of their time wrangling data into something that can be analysed and the other 30% complaining about it.

Although I would never mention it in a professional setting or in a teaching workshop, if the data you are working with has been gathered using Excel, it will be very bad. I would never bring this up with the researchers compiling the dataset; they will get offended and argue for their process forever. I'm telling you because it's true: in my experience, as soon as I'm handed a dataset in ".xlsx" format, I expect the data wrangling to take a good deal more time and to turn up many more problems.

The good news is that Python is a great language to wrangle data with. It has great string processing functions and Pandas makes working with spreadsheet data quick and convenient. But I'd like to tell you about another tool I've started using for cleaning up datasets. OpenRefine.
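Before getting to that tool, here is a minimal sketch of the kind of string clean-up Python makes easy. The messy values and the set of "missing" spellings are invented for illustration; a real dataset will need its own list.

```python
# Hypothetical messy values of the sort a hand-edited spreadsheet produces.
raw_values = ["  Yes", "yes ", "YES", "N/A", "", "no", "No  ", "n.a."]

# Spellings treated as "missing" here; this set is an assumption for the example.
MISSING = {"", "n/a", "n.a.", "na"}


def clean(value):
    """Trim whitespace, lowercase, and map missing-value spellings to None."""
    tidy = value.strip().lower()
    return None if tidy in MISSING else tidy


cleaned = [clean(v) for v in raw_values]
print(cleaned)
```

Eight different spellings collapse to three distinct values (yes, no and missing), which is usually the first step before any analysis.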


OpenRefine




OpenRefine was originally a Google project that was open sourced. It is a powerful tool for cleaning up messy data and transforming it into clean data. There is a great paper by Hadley Wickham on what tidy data is. You might recognise that name. Hadley Wickham is a famous R developer, but this concept of tidy data is relevant to all data science regardless of what language you use.

OpenRefine is built around a server/client model. That means the processing operations and the visual representation are two separate programs. That concept is harder to explain than to use. If you download OpenRefine and run it, OpenRefine will start a small web server on your computer and open a browser window pointed at the web server's address (127.0.0.1:3333 by default). The web server is where the processing happens and the web page is where you interact with your data. If you've used Jupyter Notebooks, you might recognise this arrangement.

I mention it because this arrangement facilitates the use of Docker containers and cloud computing resources. Two things I think all data scientists should be familiar with.

So brace yourself. This is going to get good!

Docker



If you run a Windows environment, I suggest you wipe your hard drive and install Linux Mint or Ubuntu and never look back. Yeah, I know. You have to use Windows for reasons...

Again, just like ditching Excel, I would never suggest this in a professional environment or a programming workshop. People get offended and will argue with you. But I am suggesting to you that Linux or OSX, collectively referred to as "POSIX" environments, are better options.

If you use a Mac (and I do), you have a terminal and you're set to go. You will still need to learn Linux, as that is what a Docker container runs and what cloud computing instances usually run. Most of the time, commands will be interchangeable but there are some differences.

On Linux (this is also what you would use on a cloud instance)

# This script is meant for quick & easy install via:
#   $ curl -fsSL get.docker.com -o get-docker.sh
#   $ sh get-docker.sh
I have made that last instruction a little vague on purpose. If you are not quite sure how to go about it, you need to do a beginner shell programming tutorial. The terminal is a powerful tool. You can do a great deal of damage to your system from the terminal, so do the groundwork first rather than running commands you don't understand.

If you are using OSX, or you've ignored my advice and are going to do this on a Windows computer, you can install an application that will let you run Docker on your system: Docker install for OSX or Windows.

Once you have Docker, run this command in a terminal window

docker run -p 80:3333 spaziodati/openrefine
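This will download a Docker image (it might take a while - be patient) containing an OpenRefine server and start it in a container. The -p 80:3333 option maps port 80 on your machine to port 3333 inside the container, so you can then open a browser and type 0.0.0.0 into the address bar.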
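If the page doesn't load straight away, the server may still be starting. Here is a rough sketch of checking for it from Python using only the standard library. The helper names `server_url` and `is_up` are made up for this example, and the defaults assume the port 80 mapping from the command above.

```python
import urllib.error
import urllib.request


def server_url(host="localhost", port=80):
    """Build the address to open in the browser for a given host and port."""
    return "http://%s:%d/" % (host, port)


def is_up(url, timeout=2.0):
    """Return True if something answers an HTTP request at url with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and HTTP errors all land here.
        return False


if __name__ == "__main__":
    url = server_url()
    print(url, "is up" if is_up(url) else "is not answering yet")
```

For an OpenRefine started directly rather than through Docker, `server_url("127.0.0.1", 3333)` would point at its default address instead.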



Bam! An OpenRefine server in a Docker container on your local machine. Running this server on a powerful cloud instance is only a few steps away.

This is how I run my projects. Twitter data scraping and processing with a MongoDB database, neuroimaging analysis, statistical analysis on behavioural data, Flask based webapps for less technical members of the team to upload data and interact with analyses. I do it all on cloud instances with a browser based front end of some sort of server in docker containers on the backend. 

It might sound complicated but really it's a robust and flexible approach. 

Post your comments and questions. I'll update this post with better instructions if it becomes obvious I've skipped over parts that aren't clear because I'm used to doing this now and I might be assuming things are obvious that aren't.

The next post will be about using OpenRefine to prepare some data for a machine learning process.



Tuesday 5 September 2017

When Wintermute unites with Neuromancer

A few articles about Ethereum have been showing up on my feed lately. Basically, Ethereum is a digital currency like Bitcoin but it has a broader range of uses aside from allowing anonymous currency trading. One of the most interesting features of this new system is the ability to create decentralised applications.



After looking at what is possible with Decentralized Autonomous Organizations (DAO) and considering that we already have advanced AI in use in many areas, I can see huge potential in this new technology but just as obviously, there is the possibility of a decentralised application causing problems.
The central feature of decentralised applications is, for me, also the most worrying. Once the application is created and started, it does not exist on any particular real-world structure; it exists in 'the internet' or, more accurately, in a widely distributed network that is constantly changing. If you've ever used torrenting software to download a movie, you might have a clearer grasp of this concept. The application exists in the connection between participating computers and servers.
This is a wonderful feature because it means the applications are immune to the failure of hardware at any particular site or the interruption to any particular connection. Nothing short of turning off every computer involved in the network would stop the application.





And that is also what makes me cautious.
The application has access to a form of currency. It is possible for the application to use that currency to hire premises and equipment and create more of that currency. The application could even form a company, buy land, apply for planning permits, hire trades people and build premises and then pay staff to run it.



If the application started to do things that we didn't want it to continue doing, there is no off switch, because it doesn't exist on a particular server, in the same way that the internet has no off switch. During the Arab Spring we saw a government try to shut down the internet in its country, and citizens were able to work around that. I think we have reached the point of accepting that the internet cannot be shut down except by extremely severe methods, almost unthinkable methods that would have a devastating effect on human life.



Autonomous software and decentralised applications are an exciting new possibility. The scenario I have described sounds far fetched but Steve Jobs announced the first iPhone on January 9, 2007. It has only been 10 years since the first pocket sized computer was announced and now we have wireless internet, streaming video, driving directions from the cloud linked to GPS positioning. An internet connected fridge or toaster sounded like flying cars 10 years ago. They are available in retail stores now, oh and the flying car is coming too.



My point is that I don't think this is something that our grandchildren will have to deal with. I think we will be dealing with it within the next 5 years. That's right - half the time it took to get from the announcement of the first iPhone to the layer of mobile computing and wireless connectivity that exists today. Let's start talking about it now. Do you think my prediction is too far-fetched?

Have a look at what decentralised apps exist now

Top 10 Disruptive Ethereum Decentralized Apps (DApps) and Projects

7 Cool Decentralized Apps Being Built on Ethereum

Top 5 Ethereum DApps Available Right now

Vladimir Putin and Elon Musk have both made comments about the importance of AI to the world

Putin says the nation that leads in AI ‘will be the ruler of the world’

Competition for AI superiority at national level most likely cause of WW3 imo - Elon Musk

In the few hours after this blog post was posted, there have been 3 times as many views from Russia and Poland as the rest of the world.

Saturday 28 February 2015

NeuralCode -The Rules!

1) You are allowed to be a beginner.

We all start somewhere and not knowing this stuff doesn't make you any worse than someone who does, it just means you haven't learnt it yet. Beginners are actually a valuable resource because we need people to teach. Your understanding of information like programming improves greatly when you explain it to someone else.

2) We have a policy of politeness and inclusiveness.

Just be your wonderful generous self and this one will be easy to follow.

3) Your problem is our problem

It is so much more fun to work on a real problem. Finding the programmatic solution to a real problem feels fantastic! We also learn about other fields of research. The caveat on this one is that we aren't here to do your work for you. We like doing the fun stuff like solving a tricky problem; you still get to do the dull repetitive stuff.

4) You must code

Unfortunately you can't learn coding by joining a Facebook group or liking a post or saving a PDF to 'read later'. Reading articles on coding might teach you a lot about coding but it won't teach you to code. Coding is very much like learning a musical instrument. You must put fingers on keys and type code. Our sessions are live coding sessions - bring a laptop and be prepared to type code.



Databases SQLite3








It will be SQL week here at NeuralCode. If you are a database ninja please share your dbfu. If you're a noob like the rest of us, join in and we'll work this out together.

Databases are neat and cool! Cool because they run lots of data really fast without suffering the sluggish performance of spreadsheets, and neat because fields can be defined very precisely and strictly to maintain data accuracy. A database can have many tables that relate to one another, so relevant information is stored together yet remains available to other tables and to SQL queries.

Databases can also store much more than text and numeric data. Images, video, audio, whole documents and references can all be stored along with the information about them. It is this feature that is really appealing when you are handling lots of complex data related to an experiment. It might just keep your information organised in a manner that makes it useful to others and still useful to you in five years' time.

Databases will save you when that Excel file starts to slow down (each scroll down takes a few seconds) or is so complicated that no-one can remember which parts of the 15 sheets are involved in calculating the values for the 16th. Phil has some databases of word frequency from different sources. They are already sluggish (150,000 entries in the one Excel file we were playing with on Friday). There is some redundancy and even a little error. The aim is to combine all the information into a database and extract the information relevant to the word list Phil will use in his upcoming experiment.
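To make "strictly defined fields" and "tables that relate to one another" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and counts are all invented for illustration and are not Phil's actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# One table of words, one table of frequency counts that point back at the words.
cur.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL UNIQUE)")
# REFERENCES documents the link between the tables; SQLite only enforces it
# when PRAGMA foreign_keys is switched on. The CHECK is always enforced.
cur.execute("""CREATE TABLE frequencies (
                   word_id INTEGER NOT NULL REFERENCES words(id),
                   source  TEXT NOT NULL,
                   freq    INTEGER NOT NULL CHECK (freq >= 0))""")

cur.executemany("INSERT INTO words (id, spelling) VALUES (?, ?)",
                [(1, "shrubbery"), (2, "parrot")])
cur.executemany("INSERT INTO frequencies VALUES (?, ?, ?)",
                [(1, "corpus_a", 12), (1, "corpus_b", 7), (2, "corpus_a", 30)])
conn.commit()

# A strictly defined field refuses bad data instead of silently storing it.
rejected = None
try:
    cur.execute("INSERT INTO frequencies VALUES (2, 'corpus_b', -5)")
except sqlite3.IntegrityError as err:
    rejected = str(err)

# A query can pull related information from both tables at once.
rows = cur.execute("""SELECT w.spelling, SUM(f.freq) AS total
                      FROM words AS w JOIN frequencies AS f ON f.word_id = w.id
                      GROUP BY w.spelling ORDER BY w.spelling""").fetchall()
print(rows)
print("rejected:", rejected)
```

Each spelling is stored once in `words`, which is how a database avoids the redundancy that creeps into spreadsheets.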
If you are using a mac - open a terminal window and type
sqlite3
you will start sqlite3 on the command line. We will progress toward using sqlite3 from within the ipython notebook but now it might be easier to understand in this simple format.

There are two types of commands I'm going to show you. Some start with a dot, like
.help
These dot-commands are how you give instructions to the sqlite3 shell itself. The rest of the commands are in the SQL language, and that is how we create and query the information in our database.
Right now, if you are at the sqlite prompt, you can prettify the output by entering the following
.headers on 
.mode columns
This will cause the output to include headers and align the output in columns. Much nicer and easier to read.

The SQL CREATE TABLE Statement
The CREATE TABLE statement is used to create a table in a database.
Tables are organized into rows and columns; and each table must have a name.
SQL CREATE TABLE Syntax
CREATE TABLE table_name
(
column_name1 data_type(size),
column_name2 data_type(size),
column_name3 data_type(size),
....
);
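As a concrete instance of the syntax above, here is a small sketch that creates the tbl2 table used in the CSV export further down and then asks SQLite to describe it. The original definition of tbl2 isn't shown, so the column types here are an assumption based on its id and word columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# Assumed definition: an integer id and a text word.
conn.execute("CREATE TABLE tbl2 (id INTEGER PRIMARY KEY, word TEXT NOT NULL)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(tbl2)")]
print(columns)
```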
How to write output of an SQL query to a csv
sqlite> .mode csv
sqlite> .separator ,
sqlite> .output test_file_1.csv
sqlite> select * from tbl2;
sqlite> .exit
Psycholinguistic_databases Wintermute$ cat test_file_1.csv
id,word
0,aaron
1,aargh
Psycholinguistic_databases Wintermute$
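The same export can be sketched from Python with the standard csv module. The table is rebuilt in memory here with the two rows shown above so the example is self-contained. It writes to a string buffer; swapping that for a real file is noted in the comment.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl2 (id INTEGER, word TEXT)")
conn.executemany("INSERT INTO tbl2 VALUES (?, ?)", [(0, "aaron"), (1, "aargh")])

cursor = conn.execute("SELECT * FROM tbl2")
# cursor.description holds one entry per result column; item 0 is the column name.
header = [col[0] for col in cursor.description]

# Swap the buffer for open("test_file_1.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)  # a cursor is iterable, one tuple per row

csv_text = buffer.getvalue()
print(csv_text)
```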


import sqlite3
 
conn = sqlite3.connect("mydatabase.db") # or use :memory: to put it in RAM
 
cursor = conn.cursor()
 
# create a table
cursor.execute("""CREATE TABLE albums
                  (title text, artist text, release_date text, 
                   publisher text, media_type text)  """) 


# insert some data
cursor.execute("INSERT INTO albums VALUES ('Glow', 'Andy Hunter', '7/24/2012', 'Xplore Records', 'MP3')")
 
# save data to database
conn.commit()
 
# insert multiple records using the more secure "?" method
albums = [('Exodus', 'Andy Hunter', '7/9/2002', 'Sparrow Records', 'CD'),
          ('Until We Have Faces', 'Red', '2/1/2011', 'Essential Records', 'CD'),
          ('The End is Where We Begin', 'Thousand Foot Krutch', '4/17/2012', 'TFKmusic', 'CD'),
          ('The Good Life', 'Trip Lee', '4/10/2012', 'Reach Records', 'CD')]
cursor.executemany("INSERT INTO albums VALUES (?,?,?,?,?)", albums)
conn.commit()


import sqlite3
 
conn = sqlite3.connect("mydatabase.db")
cursor = conn.cursor()
 
sql = """
UPDATE albums 
SET artist = 'John Doe' 
WHERE artist = 'Andy Hunter'
"""
cursor.execute(sql)
conn.commit()

import sqlite3
 
conn = sqlite3.connect("mydatabase.db")
cursor = conn.cursor()
 
sql = """
DELETE FROM albums
WHERE artist = 'John Doe'
"""
cursor.execute(sql)
conn.commit()


import sqlite3
 
conn = sqlite3.connect("mydatabase.db")
#conn.row_factory = sqlite3.Row
cursor = conn.cursor()
 
sql = "SELECT * FROM albums WHERE artist=?"
cursor.execute(sql, ("Red",))  # parameters are passed as a one-element tuple
print(cursor.fetchall())  # or use fetchone()
 
print("\nHere's a listing of all the records in the table:\n")
for row in cursor.execute("SELECT rowid, * FROM albums ORDER BY artist"):
    print(row)
 
print("\nResults from a LIKE query:\n")
sql = """
SELECT * FROM albums 
WHERE title LIKE 'The%'"""
cursor.execute(sql)
print(cursor.fetchall())
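The commented-out row_factory line in the query example is worth a quick look. With sqlite3.Row switched on, each result can be read by column name instead of by position. A minimal sketch, using an in-memory copy of the albums table with one of the rows from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now behave a bit like dictionaries

conn.execute("""CREATE TABLE albums
                (title text, artist text, release_date text,
                 publisher text, media_type text)""")
conn.execute("INSERT INTO albums VALUES (?,?,?,?,?)",
             ("Until We Have Faces", "Red", "2/1/2011", "Essential Records", "CD"))

row = conn.execute("SELECT * FROM albums WHERE artist=?", ("Red",)).fetchone()

print(row["title"], row["media_type"])  # access by column name
print(row.keys())                       # the column names, in table order
```

This keeps the code readable when a table has many columns and the position of each one is easy to forget.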


How do I import data from an xls file into sqlite3

The xls sheet first needs to be saved as a csv file, and the table must exist before the import. I can create the blp_items table with

CREATE TABLE blp_items
(spelling TEXT, 
lexicality TEXT(1),  
rt REAL,
zscore REAL, 
accuracy REAL, 
rt_sd REAL, 
zscore_sd REAL,
accuracy_sd REAL
);
Then use the .import command
.separator ","
.import blp-items.csv blp_items
Because the table already exists, .import treats the header line of the csv file as an ordinary data row, so it shows up as the first row of the table:

SELECT * FROM blp_items LIMIT 10;

spelling    lexicality  rt          zscore      accuracy    rt_sd       zscore_sd   accuracy_sd
----------  ----------  ----------  ----------  ----------  ----------  ----------  -----------
spelling    lexicality  rt          zscore      accuracy    rt.sd       zscore.sd   accuracy.sd
a/c         N           668.036363  0.32196203  0.76923076  180.088554  1.09726646  0.424052095
aband       N           706.5       0.48159747  0.925       170.172149  0.86149330  0.266746782
abayed      N           689.935483  0.15776501  0.86842105  281.050165  1.02394376  0.342569987
abbear      N           560.583333  -0.3614142  0.95        121.360472  0.52986732  0.220721427
abbears     N           611.410256  -0.1372765  0.975       169.857946  0.89889305  0.158113883
abbens      N           598.289473  -0.3845125  1.0         144.779964  0.59991825  0.0        
abchaim     N           591.763157  -0.1323686  1.0         128.560940  0.65425409  0.0        
abects      N           635.205128  -0.0324051  0.975       155.666002  0.64429156  0.158113883
abeme       N           591.055555  -0.2430494  0.94736842  182.908243  0.93264000  0.226294285

We can delete this stray header row with

DELETE FROM blp_items WHERE lexicality='lexicality';






spelling    lexicality  rt              zscore        accuracy      rt_sd           zscore_sd     accuracy_sd 
----------  ----------  --------------  ------------  ------------  --------------  ------------  ------------
a/c         N           668.0363636364  0.3219620388  0.7692307692  180.0885547226  1.0972664605  0.4240520956
aband       N           706.5           0.4815974773  0.925         170.1721499131  0.8614933071  0.2667467828
abayed      N           689.935483871   0.1577650108  0.8684210526  281.0501658048  1.0239437649  0.3425699875
abbear      N           560.5833333333  -0.361414284  0.95          121.3604725012  0.5298673287  0.2207214279
abbears     N           611.4102564103  -0.137276547  0.975         169.8579465239  0.8988930562  0.158113883 
abbens      N           598.2894736842  -0.384512528  1.0           144.7799649972  0.5999182553  0.0         
abchaim     N           591.7631578947  -0.132368665  1.0           128.5609402685  0.6542540915  0.0         
abects      N           635.2051282051  -0.032405186  0.975         155.6660020156  0.6442915682  0.158113883 
abeme       N           591.0555555556  -0.243049439  0.9473684211  182.9082431079  0.9326400097  0.2262942859
abents      N           690.2564102564  0.4099671341  0.975         207.919356577   0.8366905183  0.1581138
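The same import can also be sketched from Python, skipping the header row on the way in so there is nothing to delete afterwards. The two data rows are copied from the output above and stand in for the real blp-items.csv. The header line is assumed to match the column names seen in the stray first row.

```python
import csv
import io
import sqlite3

# Stand-in for open("blp-items.csv", newline=""); contents assumed as described above.
sample = io.StringIO(
    "spelling,lexicality,rt,zscore,accuracy,rt.sd,zscore.sd,accuracy.sd\n"
    "aband,N,706.5,0.4815974773,0.925,170.1721499131,0.8614933071,0.2667467828\n"
    "abbens,N,598.2894736842,-0.384512528,1.0,144.7799649972,0.5999182553,0.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE blp_items
                (spelling TEXT, lexicality TEXT, rt REAL, zscore REAL,
                 accuracy REAL, rt_sd REAL, zscore_sd REAL, accuracy_sd REAL)""")

reader = csv.reader(sample)
next(reader)  # consume the header line so it is never inserted
conn.executemany("INSERT INTO blp_items VALUES (?,?,?,?,?,?,?,?)", reader)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM blp_items").fetchone()[0]
first = conn.execute(
    "SELECT spelling, rt FROM blp_items ORDER BY spelling LIMIT 1").fetchone()
print(count, first)
```

The csv module hands every value over as text; the REAL columns convert the numeric-looking strings to numbers as they are stored.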
And then dump the table as SQL statements to a file

.output blp_items.sql

.dump blp_items
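From Python, the closest equivalent to .dump is the connection's iterdump() method, which yields the SQL statements needed to rebuild the database. A small sketch with a cut-down, three-column version of the table (shortened here purely to keep the example brief):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blp_items (spelling TEXT, lexicality TEXT, rt REAL)")
conn.execute("INSERT INTO blp_items VALUES ('aband', 'N', 706.5)")
conn.commit()

# iterdump() yields one SQL statement per item, wrapped in a transaction.
dump_lines = list(conn.iterdump())
for line in dump_lines:
    print(line)

# To save it, write the lines to a file, for example:
# with open("blp_items.sql", "w") as f:
#     f.write("\n".join(dump_lines))
```

Note that iterdump() covers the whole database rather than a single table, which is the same here because there is only one table.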









Monday 26 January 2015

How to code: a syllabus



If I were asked today what a researcher (who isn't from a computer science background) should do to learn how to code, I would suggest they just start. It doesn't really matter where you start or in what language. Some languages, like Python, offer more return for the same effort because they are well structured and logical. Python can be used for a huge range of tasks because it is a general-purpose programming language with libraries (add-on functionality) that provide domain-specific functionality. Contrast that with languages designed for a specialist task, like R or MATLAB, which are often difficult to use in other domains such as text handling or negotiating network connections.

import antigravity

def main():
     antigravity.fly()

if __name__ == '__main__':
     main()

 
 

But it doesn't really matter what sort of programming you start with, they will all start you on the path to thinking like a coder. That is - logically, structurally, algorithmically and in a way that breaks a problem down into component pieces that can be solved once and used as a component in many situations after that.
Most languages have the basic parts to assign a data structure to a variable and create a loop to repeat an action a particular number of times and maybe change one component each time through the loop.

words = 'hello world ! we require a shrubbery'.split()
 
for word in words:
    print(word, len(word))
 
 
hello 5
world 5
! 1
we 2
require 7
a 1
shrubbery 9

or

# First create a list of words to use

# The easiest way to do this and avoid putting in all the quotation marks and commas is


words = 'hello world ! we require a shrubbery'.split()
 
# The split() function will split the string at the spaces and return a list. 
 
# Pop will remove a word from the end of the list words, 
# check the length with len() to stop when we have removed all the words
 
while len(words) > 0:
    word = words.pop()
 
# Use string formatting to construct the response and indexing to read 
# the word backwards 
 
    print(word, "reversed is :- %s" % word[::-1])
 
Output: 
 
shrubbery reversed is :- yrebburhs
a reversed is :- a
require reversed is :- eriuqer
we reversed is :- ew
! reversed is :- !
world reversed is :- dlrow
hello reversed is :- olleh

They will also have a way to perform an action based on a decision, like True or False:

# get a response from the user and assign it to the variable answer

answer = input("Would you like a quote from Monty Python, yes or no? ")
 
# make answer all lowercase because 'Yes' and 'yes' are different to Python
 
if answer.lower() == "yes":
 
    # this only gets printed if answer == 'yes' 
     
    print("We require a shrubbery")
    
Output:
Would you like a quote from Monty Python, yes or no? no
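The same decision can be wrapped in a function, which is the "solve it once, reuse it many times" idea from earlier. This is only a sketch: the function name and the refusal message are made up, and the else branch is added here to show what happens when the answer isn't yes.

```python
def monty_reply(answer):
    """Return a quote for 'yes' (in any letter case) and a refusal otherwise."""
    if answer.strip().lower() == "yes":
        return "We require a shrubbery"
    else:
        return "No quote for you"


# The function can now be reused for any number of answers.
for reply in ["yes", "YES", "no"]:
    print(reply, "->", monty_reply(reply))
```

Because the function returns its result instead of printing it, it can be tested or reused in other programs without anyone having to type at a prompt.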




For researchers the Software Carpentry workshops are the best way I've come across to learn about programming. They are presented by domain experts and usually these presenters are not primarily programmers, so they have insight into the difficulties of learning programming from another field.

In the boot camp you will cover shell programming, version control with Git and GitHub, the concept of code testing for reliable and maintainable code and a programming language (either Python, R or MATLAB).

Once you have this overview of the parts necessary for your journey into code, there are a number of great (and free!) resources I would strongly recommend.

"Programming for Everyone" on Coursera is run by the fantastic Dr. Chuck.

After those two steps, I'd suggest coming along to the NeuralCode group or something like it in your area. 




Tuesday 2 December 2014

HTML5 CSS Javascript

The browser is becoming increasingly important in computing. 

This seems to be in line with the rise of cloud based computing and storage. The fracturing of the smartphone market has also made HTML5 apps more attractive for phone application developers. As HTML5 apps can be run in any modern browser and smartphones run those browsers, HTML5 apps are automatically supported on any smartphone OS.

Last year I played with the Bitalino board and was surprised by how responsive their HTML5 based application was with the Bitalino connected by Bluetooth. It looked good and the non-blocking nature of Javascript made the interface very responsive. Python in the back end to do the serial communication and hardware interfacing made for a slick and reliable experience.

With the arrival of the OpenBCI headset imminent, I'd like to know more about the possibilities of coding in the browser.

When I came to the conclusion that programming was a necessary part of neuroscience, I started looking for the perfect language. One that would do all the things I needed so I wouldn't waste time learning inferior languages. After a lot of reading and a little experience, I understand why more experienced programmers don't spend a lot of time answering that question. There is no one best language, they all have their tradeoffs.

Except for Matlab..... Matlab is just bad. ;-)

You will probably end up learning a few languages.

Please excuse my explanations of Javascript, I'm not experienced with it and my knowledge is growing all the time. I'll reread this in a few months and see if I can make it clearer and fix up any mistakes.

Javascript has emerged as the Queen of the browser (it has nothing to do with Java, despite the name) and is probably one of the most widely used languages because of this. It is an event-driven, asynchronous (or non-blocking) language, meaning that instead of each instruction having to finish before the next one starts, Javascript has a system of requests and callbacks that allows other work to be done while waiting for a requested resource (typically disk or network resources).
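The non-blocking idea can be sketched with Python's asyncio module, which uses a similar event loop. To be clear, this is an analogy written in Python rather than Javascript, and the task names and delays are invented. The sleeps stand in for slow disk or network requests.

```python
import asyncio


async def fetch(name, delay):
    """Pretend to wait on a slow resource for `delay` seconds."""
    await asyncio.sleep(delay)  # control goes back to the event loop while waiting
    return name


async def main():
    order = []

    async def run(name, delay):
        # Record each task's name at the moment its wait finishes.
        order.append(await fetch(name, delay))

    # Both waits are started together; neither blocks the other.
    await asyncio.gather(run("slow", 0.2), run("fast", 0.05))
    return order


finished = asyncio.run(main())
print(finished)
```

Although "slow" is requested first, "fast" finishes first, because the program carries on with other work while each request is waiting.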

When I want to learn a programming language, I look around for articles on the nature of the language. Not tutorials but discussions about what it looks, feels and smells like. I try to get a feel for the personality of the language so I have some framework to piece together the information I will acquire. Next I do some online tutorials involving guided coding in a browser, something like Codecademy or Didacto.

UPDATE 1
I have been doing a lot of reading and video watching about web frameworks. One web page and video that really opened my eyes is Pixel Monkey's piece on lightweight web frameworks. In it Andrew Montalenti talks about the connection between HTML, CSS, Javascript and light web frameworks like Flask. Flask is a Python framework that does templating and handles connections. Along with these tools, integration with databases and a lot of nifty editor tools like Emmet are mentioned.

But first a little HTML. Why? Because we want to use Javascript in the browser, and HTML is the markup language that makes a place for the Javascript to live and display its results.

This material is taken from the Codecademy site.

The basic syntax of HTML is

<something>  .......  </something>

With the somethings (tags) defining what goes between these markers (thus the term - markup language).

It can be headers (h1 - h6)

<h2> This is a heading </h2>

Or what we're interested in, a javascript file

<script src="js/all.min.js"></script>
 
 
 
<p> Text can go between paragraph tags </p> 
 
In the opening tag we can also state some attributes. 
In the next line, the 'a' tag denotes links and 
the 'href' is an attribute.
In this case it is a web address and notice it is in quotation marks. 
The text between the tags 'links' will be what the user sees and clicks on.

<a href='http://www.codecademy.com'> links </a>
 
Images use a single tag, with no closing tag, and look a lot like the javascript markup.
Images have an attribute 'src' (short for source) which gives the path to an image file. 
 
 
<img src="picture.png">
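To see tags and attributes the way a program sees them, here is a small sketch using Python's built-in html.parser module. The class name `TagLister` is made up for this example, and the page is just the heading, paragraph and link snippets from above.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


page = """
<h2>This is a heading</h2>
<p>Text can go between paragraph tags</p>
<a href='http://www.codecademy.com'>links</a>
"""

parser = TagLister()
parser.feed(page)
for tag, attributes in parser.tags:
    print(tag, attributes)
```

The tag name says what kind of thing the element is, and the attributes (here only the link's href) carry the extra details.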
 
 

CSS

The styling of the different elements of the html page is defined by the CSS file.

h1{
color:red;
}

The 'h1' part is a selector, the 'color' part is a property and the 'red' is its value. A property:value pair is called a declaration, and the selector together with its declarations is called a CSS rule.
 
 


 
 

Using LaTeX

To get started with LaTeX you first need to install it. Then an IDE is a good idea. It is possible to write raw tex but there are several great packages that make the task a lot easier. I'm going to use texmaker on the mac and this program works on Apple, Microsoft and Linux operating systems. It is also free and open source, of course.

To start a LaTeX document with texmaker, we can use the 'Quick Start' wizard and put in author and title as well as choose some other options. These chosen options end up in the preamble of the created tex document. As most of my writing involves referencing, I usually start up a bibtex file at the same time. The \usepackage{natbib} goes in the preamble and \bibliographystyle{humannat} goes before the \bibliography{path_to_bib}, which goes at the end before \end{document}.


\documentclass[12pt,a4paper]{article}
\usepackage{natbib}


\author{Alistair Walsh}
\title{Neurofeedback Non-Learners and Brain Computer Interface Illiteracy. 
Epidemiology, identifying features and differentiation.}

\begin{document}
 
\maketitle
 
\begin{abstract}

Not everyone who is provided training in neurofeedback can learn cortical control, 
nor is everyone who tries to operate a brain computer interface able to do so. 
This inability is referred to in NFB as NFB non-learning and in BCI as BCI-illiteracy. 
Very little has been written on this area yet it affects an estimated 30\% to 50\%
of participants. It has been suggested that various attributes can be used to identify 
NFB-nonlearning and BCI-illiteracy. These include  high hand dexterity, 
external locus of control, high hypnotisability, high disassociative index, 
and strong reward response. Discovering the  reason for some not learning 
cortical control would be of benefit to those receiving treatment and would 
inform the field on the mechanism behind cortical control. It would also possibly 
improve the techniques of teaching it.
 
\end{abstract}

\section{Introduction}


\citet{Grubler:2014aa}, in a paper involving both patients and 
clinicians, raised the possible ethical concerns.

\citet{Suk:2014aa} estimated the incidence of BCI-illiteracy at around 20\%.

\citet{Ahn:2013aa}, in a large study of 52 people, suggested that high theta and 
low alpha might predict BCI-I.


\bibliographystyle{humannat}
\bibliography{/Users/Wintermute/Dropbox/research_2015/litreview_BCI-illiteracy/BCI-illiteracy}

\end{document}
 
Results in the following Document