Saturday, 27 February 2016

Football (Soccer) Stats Analysis Using Raspberry Pi, MongoDB, Python and R - Setup

I'm currently reading a book called "The Numbers Game" by Chris Anderson and David Sally.  It's a really interesting read for geeks: it applies statistical methods to football (soccer if you're from some parts of the world) and comes up with some counterintuitive claims such as "defense is more important than attack" and "managers don't deliver any benefit to a team".  Overall I suspect it appeals to geeks like me who are secretly jealous of footballers!

Inspired by the book I decided to see if I could do some stats analysis on my Raspberry Pi 2 Model B.  This post is about setting myself up and doing some first trivial analysis.

Step 1 - Get R for my Raspberry Pi
I decided to use R for my analysis simply because I like it, it's open source and I've used it before.

The packaged version of R for Raspbian can be installed with apt-get install r-base but this installs quite an old version of R to which you can't add packages like ggplot2.

Hence I used directions from "teramonagi" on this Stack Overflow page to do the install.

This works and installs version 3.1.2 OK so all credit and thanks to teramonagi.

Step 2 - Get a (Document) Database
Previously I've used MySQL on a Raspberry Pi.  This works OK but it seems that a new-fangled movement in the database world is to move away from relational databases like MySQL and use document/NoSQL databases instead.  I decided to give this a go as I've never used one before!

I did some cursory research that pointed me towards CouchDB, Cassandra and MongoDB as potential NoSQL databases.  I decided to use MongoDB as it seemed to have the easiest install instructions!!

I used this site by Andy Felong for instructions on how to set up MongoDB on a Raspberry Pi, so all credit to Andy and no credit to me.

For those schooled in relational databases, a quick terminology lesson:

Relational Database Term    Document Database Term
------------------------    ----------------------
Database                    Database
Table                       Collection
Row                         Document

Documents are stored in a JSON-like format and are much less rigidly structured than a relational database row: two documents in the same collection don't have to share the same fields.  I used this tutorial to teach me the basics of MongoDB.
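
To make the "less structured" point concrete, here is a minimal Python sketch using only the standard library.  The extra fields on the second document ("ground" and "colours") are made up for illustration and are not part of my footie collection.

```python
import json

# Two documents that could live in the same collection.  The second has
# extra fields the first lacks; those field names are illustrative only.
documents = [
    {"name": "Portsmouth"},
    {"name": "Liverpool", "ground": "Anfield", "colours": ["red"]},
]

# Serialise each document to JSON text, one per line.
for doc in documents:
    print(json.dumps(doc, sort_keys=True))

# A relational table would need the same columns in every row; here the
# set of fields simply differs from document to document.
field_sets = [sorted(doc.keys()) for doc in documents]
print(field_sets)
```

Nothing here talks to MongoDB; it only shows the shape of the data that a collection will accept.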

I specified I wanted to use the default "test" database in MongoDB by typing this in the Mongo command line utility:
> use test

I then added a collection using this command:
> db.createCollection("footie")

I then added documents using these commands:
> db.footie.insert({"name":"Portsmouth"})
> db.footie.insert({"name":"Liverpool"})
> db.footie.insert({"name":"West Ham"})

I could then list all the documents in the collection with:
> db.footie.find()

...which yielded:
{ "_id" : ObjectId("56c89ec46d7d708337299f0b"), "name" : "Portsmouth" }
{ "_id" : ObjectId("56c89ed86d7d708337299f0c"), "name" : "Liverpool" }
{ "_id" : ObjectId("56c9d686d9991d46ab4ad152"), "name" : "West Ham" }
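
As an aside, the "_id" values MongoDB generated are not random noise: the first 4 bytes of an ObjectId hold the creation time in seconds since the Unix epoch.  A small Python sketch, standard library only, that decodes it from the hex string (the function name objectid_created_at is my own invention, not part of any MongoDB tooling):

```python
from datetime import datetime, timezone


def objectid_created_at(hex_id):
    """Return the UTC creation time encoded in a 24-character ObjectId hex string."""
    if len(hex_id) != 24:
        raise ValueError("expected 24 hex characters")
    # The first 4 bytes (8 hex digits) are seconds since the Unix epoch.
    seconds = int(hex_id[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The _id of the Portsmouth document from the listing above.
created = objectid_created_at("56c89ec46d7d708337299f0b")
print(created.isoformat())
```

For the Portsmouth document this works out to 20 February 2016, which is when I was first playing with the collection.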

Step 3 - Getting R to Talk to MongoDB
Someone has helpfully written an R package to do this.  The one I used was "RMongo" and I installed it using:

> install.packages("RMongo")

I then connected to and extracted data from my MongoDB collection by doing:

pi@raspberrypi:~/Documents $ sudo R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: armv7l-unknown-linux-gnueabihf (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RMongo)
Loading required package: rJava
> mg1 <- mongoDbConnect('test')
> print(dbShowCollections(mg1))
[1] "footie"         "system.indexes"
> query <- dbGetQuery(mg1, 'footie', "{'name' : 'Portsmouth'}")
> data1 <- query
> data1
        name                     X_id
1 Portsmouth 56c89ec46d7d708337299f0b

So here we've connected to the MongoDB "test" database, obtained a list of collections and run a query to get data for the team named "Portsmouth".
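
The query string is itself a small document, and a document in the collection matches when it has the same value for every field named in the query.  The Python sketch below imitates that rule on a plain list of dictionaries.  It is only a model of simple equality matching, not how MongoDB implements it, and it ignores query operators such as $gt; the matches function is a hypothetical helper of mine.

```python
def matches(document, query):
    """True if every field in the query equals the same field in the document."""
    return all(key in document and document[key] == value
               for key, value in query.items())


# A stand-in for the footie collection as it looks at this point.
footie = [
    {"name": "Portsmouth"},
    {"name": "Liverpool"},
    {"name": "West Ham"},
]

# Equivalent in spirit to the {'name' : 'Portsmouth'} query.
portsmouth_only = [doc for doc in footie if matches(doc, {"name": "Portsmouth"})]

# An empty query document names no fields, so every document matches.
everything = [doc for doc in footie if matches(doc, {})]

print(portsmouth_only)
print(len(everything))
```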

Step 4 - Getting Python to Talk to MongoDB
I like Python and so want to use it as a method to extract data from files and APIs, process it and load it into the document database.

First I needed the Python module for MongoDB and I installed this by doing:

sudo python -m pip install pymongo

I was then ready to write a Python script to write to the MongoDB database.  The comments in the code below explain what is going on.

#Import the pymongo module
from pymongo import MongoClient

#Connect to the MongoDB server (defaults to localhost, port 27017)
client = MongoClient()

#Select the test database
db = client.test

#Get the footie collection
collection = db.footie

#Get a document from the collection and print it to screen
MyVar = collection.find_one()
print(MyVar)

#This is the document to write.  Use a Python dictionary
MongoDoc = {'name':'Lincoln City'}
print(MongoDoc)    #Just prints it to screen as a check

#Write the document to the footie collection (insert_one needs PyMongo 3.0 or later)
collection.insert_one(MongoDoc)

The result when I run the "find" query in MongoDB is:

> db.footie.find()
{ "_id" : ObjectId("56c89ec46d7d708337299f0b"), "name" : "Portsmouth" }
{ "_id" : ObjectId("56c89ed86d7d708337299f0c"), "name" : "Liverpool" }
{ "_id" : ObjectId("56c9d686d9991d46ab4ad152"), "name" : "West Ham" }
{ "_id" : ObjectId("56ccbe8374fece7f011807b2"), "name" : "Lincoln City" }

...and in R:

> query <- dbGetQuery(mg1, 'footie', "")
> data1 <- query
> data1
          name                     X_id
1   Portsmouth 56c89ec46d7d708337299f0b
2    Liverpool 56c89ed86d7d708337299f0c
3     West Ham 56c9d686d9991d46ab4ad152
4 Lincoln City 56ccbe8374fece7f011807b2
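
Since the plan is to use Python to pull data out of files and load it into the database, a next step might look something like the sketch below.  It is only a sketch under assumptions: the CSV layout (a single "name" column) and the helper names read_team_names and load_teams are hypothetical.  load_teams accepts any object with find_one and insert_one methods, which a PyMongo collection provides, so the block itself runs without a database.

```python
import csv
import io


def read_team_names(csv_file):
    """Yield the non-empty values of the 'name' column from a CSV file object."""
    for row in csv.DictReader(csv_file):
        name = (row.get("name") or "").strip()
        if name:
            yield name


def load_teams(collection, names):
    """Insert one {'name': ...} document per new team; return the names added."""
    added = []
    for name in names:
        # find_one returns None when no document matches the filter,
        # so teams already in the collection are skipped.
        if collection.find_one({"name": name}) is None:
            collection.insert_one({"name": name})
            added.append(name)
    return added


# Stand-in for a file on disk so the sketch runs without any setup.
sample = io.StringIO("name\nLincoln City\nPortsmouth\n")
names = list(read_team_names(sample))
print(names)
```

With the collection object from the Step 4 script, calling load_teams(collection, names) would add only the teams that aren't there already, so running it twice shouldn't create duplicates.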

So I'm now ready to do some hard core football stats analysis.  Here's a (laughably) simple "architectural" diagram of what I've set up: