Saturday 27 February 2016

Football (Soccer) Stats Analysis Using Raspberry Pi, MongoDB, Python and R - Setup

I'm currently reading a book called "The Numbers Game" by Chris Anderson and David Sally.  It's a really interesting read for geeks, applying statistical methods to analyse football (soccer if you're from some parts of the world) and coming up with some counter-intuitive claims such as "defense is more important than attack" and "managers don't deliver any benefit to a team".  Overall I suspect it appeals to geeks like me who are secretly jealous of footballers!

Inspired by the book I decided to see if I could do some stats analysis on my Raspberry Pi 2 Model B.  This post is about setting myself up and doing some first trivial analysis.

Step 1 - Get R for my Raspberry Pi
I decided to use R for my analysis simply because I like it, it's open source and I've used it before.

The packaged version of R for Raspbian can be installed with apt-get install r-base but this installs quite an old version of R to which you can't add packages like ggplot2.

Hence I used directions from "teramonagi" on this Stack Overflow page to do the install:
http://stackoverflow.com/questions/28309891/install-r-3-1-2-on-wheezy-7-8-raspbian-fails

This works and installs version 3.1.2 OK so all credit and thanks to teramonagi.
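
If you want to double-check which version you've ended up with, base R reports it from within the console:

> R.version.string
[1] "R version 3.1.2 (2014-10-31)"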

Step 2 - Get a (Document) Database
Previously I've used MySQL on a Raspberry Pi.  This works OK but it seems that a new-fangled movement in the database world is to move away from relational databases like MySQL and use document/NoSQL databases instead.  I decided to give this a go as I've never used one before!

I did some cursory research that pointed me towards CouchDB, Cassandra and MongoDB as potential document databases.  I decided to use MongoDB as it seemed to have the easiest install instructions!!

I used this site by Andy Felong for instructions on how to setup MongoDB on a Raspberry Pi so all credit to Andy and no credit to me.

For those schooled in relational databases, a quick terminology lesson:

Relational Database Term    Document Database Term
Database                    Database
Table                       Collection
Row                         Document

Documents are structured as JSON and are much less rigidly structured than a relational database row.  I used this tutorial to teach me the basics of MongoDB.

I specified I wanted to use the default "test" database in MongoDB by typing this in the Mongo command line utility:
> use test

I then added a collection using this command:
> db.createCollection("footie")

I then added documents using these commands:
> db.footie.insert({"name":"Portsmouth"})
> db.footie.insert({"name":"Liverpool"})
> db.footie.insert({"name":"West Ham"})

I could then list all the documents in the collection with:
> db.footie.find()

...which yielded:
{ "_id" : ObjectId("56c89ec46d7d708337299f0b"), "name" : "Portsmouth" }
{ "_id" : ObjectId("56c89ed86d7d708337299f0c"), "name" : "Liverpool" }
{ "_id" : ObjectId("56c9d686d9991d46ab4ad152"), "name" : "West Ham" }

Step 3 - Getting R to Talk to MongoDB
Someone has helpfully written an R package to do this.  The one I used was "RMongo" and I installed it using:

> install.packages("RMongo")

I then connected to and extracted data from my MongoDB collection by doing:

pi@raspberrypi:~/Documents $ sudo R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: armv7l-unknown-linux-gnueabihf (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RMongo)
Loading required package: rJava
> mg1 <- mongoDbConnect('test')
> print(dbShowCollections(mg1))
[1] "footie"         "system.indexes"
> query <- dbGetQuery(mg1, 'footie', "{'name' : 'Portsmouth'}")
> data1 <- query
> data1
        name                     X_id
1 Portsmouth 56c89ec46d7d708337299f0b

So here we've connected to the MongoDB "test" database, obtained a list of collections and run a query to get data for the team named "Portsmouth".
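
RMongo can also write documents back from R.  Here's a minimal sketch using its dbInsertDocument function, assuming the same mg1 connection as above and a made-up extra team (which would then show up in subsequent queries); check the RMongo documentation for the exact API in your version:

> # Insert a new document, supplied as a JSON string, into the footie collection
> dbInsertDocument(mg1, 'footie', '{"name" : "Swindon Town"}')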

Step 4 - Getting Python to Talk to MongoDB
I like Python and so want to use it as a method to extract data from files and APIs, process it and load it into the document database.

First I needed the Python module for MongoDB and I installed this by doing:

sudo python -m pip install pymongo

I was then ready to write a Python script to write to the MongoDB database.  The comments in the code below explain what is going on.

#Import the pymongo module
from pymongo import MongoClient

#Connect to the test database
client = MongoClient()

#Create a database object
db = client.test

#Get a collection
collection = db.footie

#Get a document from the database and print it to screen
MyVar = collection.find_one()
print(MyVar)

#This is the document to write.  Use a Python dictionary
MongoDoc = {'name':'Lincoln City'}
print(MongoDoc)    #Just prints it to screen as a check

#Write the document to the footie collection
collection.insert_one(MongoDoc)

The result when I run the "find" query in MongoDB is:

> db.footie.find()
{ "_id" : ObjectId("56c89ec46d7d708337299f0b"), "name" : "Portsmouth" }
{ "_id" : ObjectId("56c89ed86d7d708337299f0c"), "name" : "Liverpool" }
{ "_id" : ObjectId("56c9d686d9991d46ab4ad152"), "name" : "West Ham" }
{ "_id" : ObjectId("56ccbe8374fece7f011807b2"), "name" : "Lincoln City" }

...and in R:

> query <- dbGetQuery(mg1, 'footie', "")
> data1 <- query
> data1
          name                     X_id
1   Portsmouth 56c89ec46d7d708337299f0b
2    Liverpool 56c89ed86d7d708337299f0c
3     West Ham 56c9d686d9991d46ab4ad152
4 Lincoln City 56ccbe8374fece7f011807b2

Conclusion
So I'm now ready to do some hardcore football stats analysis.  Here's a (laughably) simple "architectural" diagram of what I've set up:

[Architecture diagram: Python and R talking to MongoDB on the Raspberry Pi]


Wednesday 17 February 2016

Raspberry Pi 2 Model B Tutorial

Some fellow Geek Dads and wannabe Geek Dads have frequently asked me for more information on setting up and using a Raspberry Pi.

I bought a new Raspberry Pi 2 Model B a few weeks ago and I thought it would be a good opportunity to record a 'how to' video for my geek brethren.

Here's the video:

[Embedded video: Raspberry Pi 2 Model B setup tutorial]

So first I describe the kit list. You'll need:
  • A Raspberry Pi
  • A case (optional)
  • A micro SD memory card. Best to either buy or make a "NOOBS" card.
  • A good quality 2 Amp power supply with a micro USB connector.
  • An Ethernet cable to connect to your home router*
  • A USB mouse and keyboard
  • An HDMI cable or HDMI to VGA adaptor**

* You could also use WiFi but I don't cover this.
** I used an adaptor as my monitor is VGA.

In the video I then show how to connect up all the kit, how to install your first operating system and how to use a few applications.

Happy watching and all comments welcome below!

Sunday 7 February 2016

Correlation of Fitbit API Step and Heart Rate Data and Strava API Data Using Raspberry Pi, Python and R

I was idly scanning some charts from the Fitbit mobile app when I spotted this one and had a geek idea!

[Chart: Fitbit resting heart rate trend over 2015 and early 2016]

This shows my heart rate over the course of 2015 and early 2016, as measured by my Fitbit Charge HR.  I set to wondering why it had the profile shown, i.e. it starts reasonably low in early 2015, rises to a peak in mid 2015, then drops again in late 2015 through to early 2016.  The theory here is that my resting heart rate gets lower as I get fitter.  So why am I getting fitter?

My quest for truth (well the 4 minutes I thought about it) made me recall this chart I created for a previous blog post.

[Chart: monthly Fitbit step counts from the previous post]

This shows my step count over the period and I put down the variation to how much running I was doing at the time.  Not comfortable with such a loose coupling of vaguely recollected half-truths, I set about gathering some data to try and prove (or disprove) that there was a correlation.

So three sources of data to compare and contrast:
  • Fitbit steps data (via Fitbit API).
  • Fitbit heart rate data (via Fitbit API).
  • Strava exercise data (via Strava API).
I have the Fitbit steps data from the previous post.

Part 1 - Compare Step Data with Strava Running Data
I used the same technique to get Strava API data as I used in this post.  You can extract data from the API and put it in an R data frame using these commands:

> library(jsonlite)
> stravadata <- fromJSON('https://www.strava.com/api/v3/activities?access_token=<your key here>&per_page=200&after=1422748800',flatten=TRUE)

Here "1422748800" is the Unix epoch timestamp for 2015-02-01 which I obtained using this site.

Get just running data using this command:
> stravadata_run <- stravadata[grep("Run", stravadata$type), ]

Get just date, name and distance using:
> stravadata_run_simple <- stravadata_run[,c(11,5,6)]

...which yields (abridged):

> stravadata_run_simple
              start_date                                                         name distance
1   2015-02-01T09:25:19Z                                   Old Skool Farley Mount Run  11893.7
6   2015-02-07T09:03:36Z                                  Winchester Parkrun 20150207   5044.2
9   2015-02-15T09:05:19Z        Mildly Moist & Muddy Up Somborne and Farley Mount Run  12290.7
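
As an aside, picking columns by position (c(11,5,6)) breaks if the API ever reorders its fields; selecting by name does the same job and is more robust (a sketch, assuming these are the column names in the flattened data frame):

> stravadata_run_simple <- stravadata_run[, c("start_date", "name", "distance")]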

Now to compare with my monthly step data I need to aggregate this into monthly Strava data (with distance summed).

> stravadata_run_simple$DistanceInt <- as.numeric(stravadata_run_simple$distance)
> stravadata_run_simple$TimePosix <- as.POSIXct(stravadata_run_simple$start_date)
> stravadata_run_simple_agg_sum <- aggregate(list(Distance = stravadata_run_simple$DistanceInt), list(Month = cut(stravadata_run_simple$TimePosix, "month")), sum)

The first two commands above turn the date into a POSIX date object and the distance into a number, whilst the third does the aggregation.

This yields:

> stravadata_run_simple_agg_sum
        Month Distance
1  2015-02-01  34044.6
2  2015-03-01  15302.4
3  2015-04-01  33326.7
4  2015-05-01  51032.4
5  2015-06-01  29983.8
6  2015-07-01  22940.9
7  2015-08-01   4650.9
8  2015-09-01  15166.1
9  2015-10-01  17009.7
10 2015-11-01  34529.1
11 2015-12-01  57782.0
12 2016-01-01  87035.6
13 2016-02-01  10552.2
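
As a quick sanity check on the aggregation, you can total a single month by hand and compare it with the table above; a sketch for February 2015:

> # Pull out just the February 2015 rows and total their distances
> feb <- stravadata_run_simple[format(stravadata_run_simple$TimePosix, "%Y-%m") == "2015-02", ]
> sum(feb$DistanceInt)
[1] 34044.6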

I can easily trim the extra data row for 2016-02-01 using:
> stravadata_run_simple_agg_sum <- stravadata_run_simple_agg_sum[-c(13),]

Which means "remove row 13 and leave all the columns in place".
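
A slightly more robust alternative is to filter on the month value itself, so the command still works if the number of rows changes; a sketch:

> stravadata_run_simple_agg_sum <- stravadata_run_simple_agg_sum[stravadata_run_simple_agg_sum$Month != "2016-02-01", ]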

So now I have this and my monthly total step data which is:

> stepdata_2015_agg_sum
        month  Steps tenthoublocks footsteps
1  2015-02-01 350767       35.0767        35
2  2015-03-01 385209       38.5209        39
3  2015-04-01 385578       38.5578        39
4  2015-05-01 477423       47.7423        48
5  2015-06-01 391484       39.1484        39
6  2015-07-01 375393       37.5393        38
7  2015-08-01 373952       37.3952        37
8  2015-09-01 379701       37.9701        38
9  2015-10-01 417465       41.7465        42
10 2015-11-01 290259       29.0259        29
11 2015-12-01 434621       43.4621        43
12 2016-01-01 442431       44.2431        44

Meaning I can form a data frame with step and run distance data using:

> strava_and_steps <- stravadata_run_simple_agg_sum
> strava_and_steps$Steps <- stepdata_2015_agg_sum$Steps

...yielding (abridged):

> strava_and_steps
        Month Distance  Steps
1  2015-02-01  34044.6 350767
2  2015-03-01  15302.4 385209
3  2015-04-01  33326.7 385578
4  2015-05-01  51032.4 477423
5  2015-06-01  29983.8 391484

I'm not a stats expert but a bit of reading shows that R contains some built-in correlation functions.  The simplest being:

> cor(strava_and_steps$Distance,strava_and_steps$Steps)
[1] 0.4801041

This gives a correlation coefficient of 0.48.  A number of sites, including this one, state the following interpretation of the correlation coefficient.

Value of r                    Strength of relationship
-1.0 to -0.5 or 0.5 to 1.0    Strong
-0.5 to -0.3 or 0.3 to 0.5    Moderate
-0.3 to -0.1 or 0.1 to 0.3    Weak
-0.1 to 0.1                   None or very weak
...so according to this a moderate positive correlation.
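
With only twelve data points it's worth checking whether a coefficient of 0.48 could just be chance.  Base R's cor.test() reports a p-value and confidence interval alongside the coefficient; a quick sketch (output not shown):

> # Pearson correlation test between monthly distance and monthly steps
> cor.test(strava_and_steps$Distance, strava_and_steps$Steps)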

Now to draw a graph to see if I can spot any correlation.  I need to plot two series on the graph and for this I learned about melting using the reshape library and then about plotting melted data.

To melt:
> install.packages("reshape")
> library(reshape)
> melted_data <- melt(strava_and_steps, id=c("Month"))

...which yields:

> melted_data
        Month variable    value
1  2015-02-01 Distance  34044.6
2  2015-03-01 Distance  15302.4
3  2015-04-01 Distance  33326.7
4  2015-05-01 Distance  51032.4
5  2015-06-01 Distance  29983.8
6  2015-07-01 Distance  22940.9
7  2015-08-01 Distance   4650.9
8  2015-09-01 Distance  15166.1
9  2015-10-01 Distance  17009.7
10 2015-11-01 Distance  34529.1
11 2015-12-01 Distance  57782.0
12 2016-01-01 Distance  87035.6
13 2015-02-01    Steps 350767.0
14 2015-03-01    Steps 385209.0
15 2015-04-01    Steps 385578.0
16 2015-05-01    Steps 477423.0
17 2015-06-01    Steps 391484.0
18 2015-07-01    Steps 375393.0
19 2015-08-01    Steps 373952.0
20 2015-09-01    Steps 379701.0
21 2015-10-01    Steps 417465.0
22 2015-11-01    Steps 290259.0
23 2015-12-01    Steps 434621.0
24 2016-01-01    Steps 442431.0


So effectively taking the side-by-side data frame and turning it into a longer list with "variable" defining whether the data element is a Steps value or a Distance value.
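
A handy way to convince yourself that nothing was lost in the melt is to reverse it: reshape's cast() rebuilds the wide, side-by-side frame from the long one.  A sketch (the melted form is what we want for plotting, though):

> # One row per Month, one column per variable; should match strava_and_steps
> cast(melted_data, Month ~ variable)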

This can then be easily plotted using:

> library(ggplot2)
> ggplot(data = melted_data, aes(x = Month, y = value, color = variable)) +
  geom_point()

Which yields:

[Chart: Steps and Distance values plotted together by month]

...which is kind of OK but not that useful, as the Distance values are an order of magnitude lower than the Steps values, meaning their variation is somewhat lost.

I learnt it's better to plot two linked charts using something called facets.

The command is:
> ggplot(data = melted_data, aes(x = Month, y = value, color = variable, group=1)) + facet_grid(variable ~ .,scales="free_y") + geom_line()

Here facet_grid(variable ~ .) means split on "variable" vertically and leave the horizontal axis as is, and scales="free_y" makes the Y-axis scales independent of each other.

[Faceted chart: Distance and Steps by month with independent Y axes]

I like the graph a heck of a lot but can I spot any correlation from it?

  • Months that correlate: May, Jun, Jul, Oct, Dec, Jan 
  • Months that don't correlate: Feb, Mar, Apr, Aug
  • Outliers: Nov as my Fitbit stopped working for part of it so I "lost" steps.

Which probably shows why the statistical correlation was only "moderate".

Part 2 - Compare Heart Rate Data with Strava Running Data
I obtained monthly average resting heart rate data using my Raspberry Pi, the OAuth 2.0 method I explained in this post and the URL below:

https://api.fitbit.com/1/user/-/activities/heart/date/2016-01-31/1y.json 

The data looked like this (abridged):
pi@raspberrypi ~/fitbit $ sudo more 2015_heart.json
{"activities-heart":[{"dateTime":"2015-02-01","value":{"customHeartRateZones":[],"heartRateZones":[{"caloriesOut":1731.78528,"max":90,"min":30,"minutes":1012,"name":"Ou
t of Range"},{"caloriesOut":1178.0111,"max":126,"min":90,"minutes":265,"name":"Fat Burn"},{"caloriesOut":404.61594,"max":153,"min":126,"minutes":32,"name":"Cardio"},{"c
aloriesOut":387.18186,"max":220,"min":153,"minutes":28,"name":"Peak"}],"restingHeartRate":67}},{"dateTime":"2015-02-02","value":{"customHeartRateZones":[],"heartRateZone

After transferring the data to my PC I opened up R and loaded the JSON using this command:

> library(jsonlite)
> heartdata2015 <- fromJSON(file.choose(),flatten=TRUE)

Here file.choose() opens the Windows file chooser dialog so you can select the JSON file.  The data looked messy so I transformed it into a data frame using this:

> heartdata2015_df <- as.data.frame(heartdata2015)

Then I could select the data I wanted (date and heart rate) using this to get just one row:

> heartdata2015_df[c(1),c(1,4)]
  activities.heart.dateTime activities.heart.value.restingHeartRate
1                2015-02-01                                      67

...and could just put this data into its own data frame by doing (i.e. all rows and just columns 1 and 4):

> heartdata2015_df_simple <- heartdata2015_df[,c(1,4)]
> heartdata2015_df_simple
    activities.heart.dateTime activities.heart.value.restingHeartRate
1                  2015-02-01                                      67
2                  2015-02-02                                      67
3                  2015-02-03                                      67
4                  2015-02-04                                      68

Then I could summarise it into monthly average data by doing:

> heartdata2015_df_simple$TimePosix <- as.POSIXct(heartdata2015_df_simple$activities.heart.dateTime)
> heartdata2015_df_simple$HeartInt <- as.integer(heartdata2015_df_simple$activities.heart.value.restingHeartRate)
> heartdata_2015_agg_mean <- aggregate(list(Heart = heartdata2015_df_simple$HeartInt), list(month = cut(heartdata2015_df_simple$TimePosix, "month")), mean)

The first two commands create proper date and integer fields from the JSON-derived data and the third does the aggregation.  This yields (abridged):

> heartdata_2015_agg_mean
        month    Heart
1  2015-02-01 68.14286
2  2015-03-01 69.19355
3  2015-04-01 68.36667

I did a simple line plot using:

> ggplot(data=heartdata_2015_agg_mean, aes(x=month,y=Heart,group=1)) + geom_line()

...which yields this graph:

[Line chart: monthly average resting heart rate, Feb 2015 to Jan 2016]

Even though the time base is different and the Fitbit-derived graph (see top of post) has smoothing applied, the points on this graph do match those on the Fitbit-derived one.

I merged the Strava distance and heart rate data using:

> strava_and_heart <- stravadata_run_simple_agg_sum
> strava_and_heart$Heart <- heartdata_2015_agg_mean$Heart

Yielding:

> strava_and_heart
        Month Distance    Heart
1  2015-02-01  34044.6 68.14286
2  2015-03-01  15302.4 69.19355
3  2015-04-01  33326.7 68.36667
4  2015-05-01  51032.4 67.09677
5  2015-06-01  29983.8 68.83333
6  2015-07-01  22940.9 66.12903
7  2015-08-01   4650.9 69.65217
8  2015-09-01  15166.1 70.66667
9  2015-10-01  17009.7 69.29032
10 2015-11-01  34529.1 68.78261
11 2015-12-01  57782.0 65.93548
12 2016-01-01  87035.6 65.29032

First let's correlate:

> cor(strava_and_heart$Distance,strava_and_heart$Heart)
[1] -0.8061387

Now we're talking!  A strong negative correlation!  As distance goes up, heart rate comes down.  Love it.

So to melt (melted heart - geddit!!):

> melted_heart <- melt(strava_and_heart, id=c("Month"))

> melted_heart
        Month variable       value
1  2015-02-01 Distance 34044.60000
2  2015-03-01 Distance 15302.40000
3  2015-04-01 Distance 33326.70000
4  2015-05-01 Distance 51032.40000
5  2015-06-01 Distance 29983.80000
6  2015-07-01 Distance 22940.90000
7  2015-08-01 Distance  4650.90000
8  2015-09-01 Distance 15166.10000
9  2015-10-01 Distance 17009.70000
10 2015-11-01 Distance 34529.10000
11 2015-12-01 Distance 57782.00000
12 2016-01-01 Distance 87035.60000
13 2015-02-01    Heart    68.14286
14 2015-03-01    Heart    69.19355
15 2015-04-01    Heart    68.36667
16 2015-05-01    Heart    67.09677
17 2015-06-01    Heart    68.83333
18 2015-07-01    Heart    66.12903
19 2015-08-01    Heart    69.65217
20 2015-09-01    Heart    70.66667
21 2015-10-01    Heart    69.29032
22 2015-11-01    Heart    68.78261
23 2015-12-01    Heart    65.93548
24 2016-01-01    Heart    65.29032

...and to facet:

> ggplot(data = melted_heart, aes(x = Month, y = value, color = variable, group=1)) + facet_grid(variable ~ .,scales="free_y") + geom_line()

...yielding this chart:

[Faceted chart: monthly run distance and average resting heart rate with independent Y axes]

The only outlier is July where distance went down and heart rate went down.  Maybe I did a lot of cycling and swimming in July?  Other than that you can pretty much see the correlation.
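
Out of interest, you can check how much the July outlier drags on the figure by excluding its row (row 6) before correlating; a quick sketch (I haven't re-run the numbers here):

> # Drop row 6 (July 2015) from both columns before correlating
> cor(strava_and_heart$Distance[-6], strava_and_heart$Heart[-6])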

Conclusion
  • There is a moderate positive correlation between step count and distance run.
  • There is a strong negative correlation between distance run and average heart rate.