Saturday, 26 March 2016

Instagram API Using Raspberry Pi

Mmmm brunch, my favourite meal of the day (after breakfast, dinner, lunch, tea, high tea, elevensies and supper).

But why a picture of brunch on a serious tech blog like this?  It would seem that young people these days all like Instagram.  Never one to not follow fashion I thought I'd get into Instagram Geek Dad style!  That means learning how to use the Instagram API on my Raspberry Pi and getting access to pictures of brunch from my account.

There's good documentation online, but here's my step-by-step guide on how to do it.  Remember this post is for Geek Dads like me, not IT professionals, so please don't scorn me if you think this is easy meat.

Step 1 - Sign up and Register an Application
Using the Instagram API relies on OAUTH2.0, much like the Fitbit API that I blogged about here.

Step 1 is about logging on, registering your app and getting secret data for your application.

Go to the Instagram developer site and log in using your Instagram account credentials, (i.e. those you get when you sign up as a normal, non-Geek user).  You'll need to provide extra details like why you want to get access to the API.  I put something along the lines of "to test what the API can do with my Raspberry Pi".  I'm sure you can put whatever you like here (within reason).

Select the "Manage Clients" button (top right of the screen) and add details for your specific test client.  I just used this blog's address for the various URL fields requested and my Gmail address for the support email address.  You'll be given a client ID and a client secret; log these for the next steps.

Step 2 - Get the User (You) to Authorise Your Test App
For the next step you need to get the user (you in this case) to authorise you accessing their account with your app.  To do this you simply point a browser at an address like the one below:<YourIDHere>&redirect_uri=

Here the client ID is the one you got when you registered your app, and the scope parameters at the end are described here.  I basically requested all possible scope items for my app.

You may have to login with your Instagram credentials if not already logged in and you will have to click a button marked "Authorise".  This effectively links your Instagram account to the developer application you created.

The browser session will be redirected to the URL you specified as your "Redirect URI" when you signed up and the last segment of the URL is the "code" you use for step 3.  Log this!!
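If you want to script that last bit, here's a minimal sketch (Python 3; the URL and code value are hypothetical, and it assumes the code comes back as a `code` query parameter on the redirect):

```python
from urllib.parse import urlparse, parse_qs

def extract_auth_code(redirect_url):
    """Pull the 'code' query parameter out of the redirect URL."""
    params = parse_qs(urlparse(redirect_url).query)
    return params["code"][0]

# Hypothetical redirect - your redirect URI and code will differ
print(extract_auth_code("http://example.com/cb?code=abc123"))  # abc123
```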

Step 3 - Get an Access Token
Like all the OAUTH2.0 implementations I've seen, you use an access token to authorise your API requests.  However the token seems to be long-lived; there's no periodic requesting of a new access token using a refresh token.  (That said, Instagram warn that you may need to request a new access token at some point).

To get the access token you need to make a HTTP POST request using a bunch of parameters in the message body.  Instagram helpfully gives you the structure of a cURL command you can run on your Raspberry Pi.  Here it is:

curl -F 'client_id=CLIENT_ID' \
     -F 'client_secret=CLIENT_SECRET' \
     -F 'grant_type=authorization_code' \
     -F 'redirect_uri=AUTHORIZATION_REDIRECT_URI' \
     -F 'code=CODE' \

Simply replace the parameters in the above command with the ones you've collected through this process.

The response to this cURL command will be something like this (details changed of course):
{"access_token":"3911234.abcder.;dfjlkpaefjpfj;ldfjljdfljdff","user":{"username":"MyUserName","bio":"A geek, UK\udfdfdfd\dfdff\rrr\34546\454534545\fdfhh I love my girls\u\dfdfdf\ererr\webgh\jjkgkt","website":"http:\/\/","profile_picture":"https:\/\/\/t51.2885-9\/10570229_699866173_1234567_a.jpg","full_name":"The Geek Dad (Geeky Man)","id":"678765"}}

So with the access token at the start of the JSON response, here's a few things you can do on the API.

Using the API
The key thing that Instagram warn about is that your app starts in "Sandbox Mode", meaning restricted access to the API in terms of API calls, number of users etc.  You would then need to register your app for it to be promoted to use the full API.  That's not required for someone who just wants to play.

The API has a set of endpoints that I'll describe with a few examples below.  The documentation on the Instagram developer site here is pretty good.

In the examples below, replace <YourAccessToken> with the one you got from the OAUTH2.0 request above.

Get information about yourself as a user:<YourAccessToken>

This provides the same JSON response as when you requested the access token (step 3 above).

You can also see recent media for you as a user by doing:<YourAccessToken>

This gives you a whole bunch of JSON detailing your most recent Instagram posts.  Within this you can see parameters like creation date, location, the filter used, the caption, number of likes etc.  There is also a bunch of URLs for different size versions of the image (low resolution, thumbnail and standard resolution).  Each of these is shown below.
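Digging those URLs out is a one-liner once the JSON is loaded.  A short sketch - the structure follows the media response described above, but the sample values here are invented:

```python
import json

def standard_res_urls(media_response):
    """Return the standard resolution image URL for each post in the response."""
    return [post["images"]["standard_resolution"]["url"]
            for post in media_response["data"]]

# Invented sample shaped like the recent-media response
sample = json.loads("""
{"data": [{"created_time": "1458900000",
           "likes": {"count": 5},
           "images": {"thumbnail": {"url": "https://example.com/t.jpg"},
                      "low_resolution": {"url": "https://example.com/l.jpg"},
                      "standard_resolution": {"url": "https://example.com/s.jpg"}}}]}
""")
print(standard_res_urls(sample))  # ['https://example.com/s.jpg']
```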




Mmmmm, brunch!

Sunday, 20 March 2016

What to do with 3.5 Million Heart Rate Monitor Readings?

Previously on Paul's Geek Dad blog I've written about how the heart rate monitor readings from my Fitbit Charge HR seem to be showing I'm getting fitter.  Here's the latest chart:

A nice trend but massively aggregated and smoothed.  I decided to play with the data in its rawest format possible to see what I could see.  Fitbit allow you access to their "intraday" data for personal projects if you ask them nicely.  The webpage explaining this is here and what they say is:

Access to the Intraday Time Series for all other uses is currently granted on a case-by-case basis. Applications must demonstrate necessity to create a great user experience. Fitbit is very supportive of non-profit research and personal projects. Commercial applications require thorough review and are subject to additional requirements. Only select applications are granted access and Fitbit reserves the right to limit this access. To request access, email

I've previously used this intraday data to look at my running cadence, using Fitbit API derived data to see whether attempts to change my running style were actually working.  Looking at the Fitbit API documentation I saw that heart rate data could be obtained at sub-minute granularity.  Whoopee!

An example URL to get 1 minute data from the Fitbit API using the OAUTH2.0 method I previously blogged about is:

...which yields this at the start (abridged):

{"activities-heart":[{"dateTime":"2015-03-01","value":{"customHeartRateZones":[],"heartRateZones":[{"caloriesOut":2184.1542,"max":90,"min":30,"minutes":1169,"name":"Out of Range"},{"caloriesOut":891.10584,"max":126,"min":90,"minutes":167,"name":"Fat Burn"},{"caloriesOut":230.65056,"max":153,"min":126,"minutes":23,"name":"Cardio"},{"caloriesOut":133.98084,"max":220,"min":153,"minutes":11,"name":"Peak"}],"restingHeartRate":66}}],"activities-heart-intraday":{"dataset":[{"time":"00:00:00","value":75},{"time":"00:00:15","value":75},{"time":"00:00:30","value":75},{"time":"00:00:45","value":75},{"time":"00:01:00","value":75},{"time":"00:01:15","value":75},{"time":"00:01:30","value":75},{"time":"00:01:45","value":75},{"time":"00:02:00","value":75},{"time":"00:02:15","value":75},{"time":"00:02:30","value":75},{"time":"00:02:45","value":75},{"time":"00:03:00","value":75},{"time":"00:03:15","value":75},{"time":"00:03:30","value":74},{"time":"00:03:40","value":72}

...and then ends...


So it seems it's not actually per-second data (i.e. one measurement per second), but rather a measurement every 10 to 15 seconds.  Which is enough I think!
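A quick back-of-envelope check (assuming one reading every ~12 seconds) shows that rate is consistent with the millions of readings in this post's title:

```python
SECONDS_PER_DAY = 24 * 60 * 60
readings_per_day = SECONDS_PER_DAY // 12   # one reading every ~12 seconds
days_worn = 14 * 30                        # ~14 months with the Charge HR
print(readings_per_day * days_worn)        # 3024000 - around 3 million, the right ballpark
```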

What I wanted was every single sub-one-minute record for the whole time I've had my Fitbit Charge HR (since Jan-15, so ~14 months at the time of writing).  I found that stretching out the time period for which "1sec" data is requested results in it being summarised to daily data.  Hence I needed to write a script to call the API multiple times and log the results.  Bring on Python (my favourite programming language) on my Raspberry Pi 2.

The full script is pasted in below (you'll need to work out your own secret keys etc using my OAUTH2.0 method).  The core of it is my Fitbit OAUTH2.0 API script from before, but I've added elements that take a date range and make one API call per day.  Key elements:

  • Constants "StartDate" and "EndDate" that specify the range of dates to make API calls for.
  • Function "CountTheDays" that computes the number of days between the StartDate and EndDate constants.
  • A for loop that counts down in increments of 1 from the value returned by CountTheDays to 0.  This creates an index that is used for....
  • Function "ComputeADate" that takes the index and turns it back into a date string representing the number of days before EndDate.  This means we step from StartDate to EndDate, making....
  • Call to function "MakeAPICall" to actually make the call.
  • Code to take the API response JSON, strip out the key elements and write to a simple comma separated variable text file.

#Gets the heart rate in per second format, parses it and writes it to file.

import base64
import urllib2
import urllib
import sys
import json
import os
from datetime import datetime, timedelta
import time

#Typical URL for heart rate data.  Date goes in the middle
FitbitURLStart = ""
FitbitURLEnd = "/1d/1sec.json"

#The date fields.  Start date and end date can be altered to deal with the period you want to deal with
StartDate = "2016-03-10"
EndDate = "2016-03-16"

#Use this URL to refresh the access token
TokenURL = ""

#Get and write the tokens from here
IniFile = "/home/pi/fitbit/tokens.txt"

#Here's where we log to
LogFile = "/home/pi/fitbit/persecheartlog.txt"

#From the developer site
OAuthTwoClientID = "<ClientIDHere>"
ClientOrConsumerSecret = "<SecretHere>"

#Some constants defining API error handling responses
TokenRefreshedOK = "Token refreshed OK"
ErrorInAPI = "Error when making API call that I couldn't handle"

#Determine how many days to process for.  First day I ever logged was 2015-01-27
def CountTheDays(FirstDate,LastDate):
  #See how many days there's been between today and my first Fitbit date.
  FirstDt = datetime.strptime(FirstDate,"%Y-%m-%d")    #First Fitbit date as a Python date object
  LastDt = datetime.strptime(LastDate,"%Y-%m-%d")      #Last Fitbit date as a Python date object

  #Calculate difference between the two and return it
  return abs((LastDt - FirstDt).days)

#Produce a date in yyyy-mm-dd format that is n days before the end date to be processed
def ComputeADate(DaysDiff, LastDate):
  #Get today's date
  LastDt = datetime.strptime(LastDate,"%Y-%m-%d")      #Last Fitbit date as a Python date object

  #Compute the difference between now and the day difference parameter passed
  DateResult = LastDt - timedelta(days=DaysDiff)
  return DateResult.strftime("%Y-%m-%d")

#Get the config from the config file.  This is the access and refresh tokens
def GetConfig():
  print "Reading from the config file"

  #Open the file
  FileObj = open(IniFile,'r')

  #Read first two lines - first is the access token, second is the refresh token
  AccToken = FileObj.readline()
  RefToken = FileObj.readline()

  #Close the file
  FileObj.close()

  #See if the strings have newline characters on the end.  If so, strip them
  if (AccToken.find("\n") > 0):
    AccToken = AccToken[:-1]
  if (RefToken.find("\n") > 0):
    RefToken = RefToken[:-1]

  #Return values
  return AccToken, RefToken

def WriteConfig(AccToken,RefToken):
  print "Writing new token to the config file"
  print "Writing this: " + AccToken + " and " + RefToken

  #Delete the old config file if it exists
  if os.path.exists(IniFile):
    os.remove(IniFile)

  #Open and write to the file
  FileObj = open(IniFile,'w')
  FileObj.write(AccToken + "\n")
  FileObj.write(RefToken + "\n")

  #Close the file
  FileObj.close()

#Make a HTTP POST to get a new access token
def GetNewAccessToken(RefToken):
  print "Getting a new access token"

  #Form the data payload
  BodyText = {'grant_type' : 'refresh_token',
              'refresh_token' : RefToken}
  #URL Encode it
  BodyURLEncoded = urllib.urlencode(BodyText)
  print "Using this as the body when getting access token >>" + BodyURLEncoded

  #Start the request
  tokenreq = urllib2.Request(TokenURL,BodyURLEncoded)

  #Add the headers, first we base64 encode the client id and client secret with a : inbetween and create the authorisation header
  tokenreq.add_header('Authorization', 'Basic ' + base64.b64encode(OAuthTwoClientID + ":" + ClientOrConsumerSecret))
  tokenreq.add_header('Content-Type', 'application/x-www-form-urlencoded')

  #Fire off the request
  try:
    tokenresponse = urllib2.urlopen(tokenreq)

    #See what we got back.  If it's this part of  the code it was OK
    FullResponse = tokenresponse.read()

    #Need to pick out the access token and write it to the config file.  Use a JSON manipulation module
    ResponseJSON = json.loads(FullResponse)

    #Read the access token as a string
    NewAccessToken = str(ResponseJSON['access_token'])
    NewRefreshToken = str(ResponseJSON['refresh_token'])
    #Write the access token to the ini file
    WriteConfig(NewAccessToken,NewRefreshToken)

    print "New access token output >>> " + FullResponse
  except urllib2.URLError as e:
    #Getting to this part of the code means we got an error
    print "An error was raised when getting the access token.  Need to stop here"
    print e.code

#This makes an API call.  It also catches errors and tries to deal with them
def MakeAPICall(InURL,AccToken,RefToken):
  #Start the request
  req = urllib2.Request(InURL)

  #Add the access token in the header
  req.add_header('Authorization', 'Bearer ' + AccToken)

  print "I used this access token " + AccToken
  #Fire off the request
  try:
    #Do the request
    response = urllib2.urlopen(req)
    #Read the response
    FullResponse = response.read()

    #Return values
    return True, FullResponse
  #Catch errors, e.g. A 401 error that signifies the need for a new access token
  except urllib2.URLError as e:
    print "Got this HTTP error: " + str(e)
    HTTPErrorMessage = e.read()
    print "This was in the HTTP error message: " + HTTPErrorMessage
    #See what the error was
    if (e.code == 401) and (HTTPErrorMessage.find("Access token invalid or expired") > 0):
      #Refresh the access token using the refresh token, then tell the caller to retry
      GetNewAccessToken(RefToken)
      return False, TokenRefreshedOK
    elif (e.code == 401) and (HTTPErrorMessage.find("Access token expired") > 0):
      GetNewAccessToken(RefToken)
      return False, TokenRefreshedOK
    #Return that this didn't work, allowing the calling function to handle it
    return False, ErrorInAPI

#Main part of the code
#Declare these global variables that we'll use for the access and refresh tokens
AccessToken = ""
RefreshToken = ""

print "Fitbit API Heart Rate Data Getter"

#Get the config
AccessToken, RefreshToken = GetConfig()

#Get the number of days to process for
DayCount = CountTheDays(StartDate,EndDate)

#Open a file to log to
MyLog = open(LogFile,'a')

#Loop for the date range
#Process each one of these days stepping back in the for loop and thus stepping up in time
for i in range(DayCount,-1,-1):
  #Get the date to process
  DateForAPI = ComputeADate(i,EndDate)

  #Say what is going on
  print ("Processing for: " + DateForAPI)

  #Form the URL
  FitbitURL = FitbitURLStart + DateForAPI + FitbitURLEnd

  #Make the API call
  APICallOK, APIResponse = MakeAPICall(FitbitURL, AccessToken, RefreshToken)

  if APICallOK:
    #We got a response, let's deal with it
    ResponseAsJSON = json.loads(APIResponse)

    #Get the date from the JSON response just in case.  Then loop through the JSON getting the HR measurements. 
    JSONDate = str(ResponseAsJSON["activities-heart"][0]["dateTime"])

    #Loop through picking out values and forming a string
    for HeartRateJSON in ResponseAsJSON["activities-heart-intraday"]["dataset"]:
      OutString = JSONDate + "," + str(HeartRateJSON["time"]) + "," + str(HeartRateJSON["value"]) + "\r\n"

      #Write to file
      MyLog.write(OutString)
  else:  #Not sure I'm making best use of this logic.  Can tweak if necessary
    if (APIResponse == TokenRefreshedOK):
      print "Refreshed the access token.  Can go again"
    else:
      print ErrorInAPI

#Close the log file
MyLog.close()


The code does the job; maybe the error handling could be better.  One thing I ran into was that Fitbit rate limit their API to 150 calls per hour.  As I was grabbing nearly 14 months of data I hit the limit and had to wait for the hour to expire before I could re-start the script, (after editing the start and end dates).
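One way to avoid tripping the limit on a long run of calls is to pace them: 150 calls per hour means at most one call every 24 seconds.  A small sketch - the `Throttle` class is my own idea, not part of any Fitbit library:

```python
import time

class Throttle:
    """Spread calls out so no more than per_hour of them happen in an hour."""
    def __init__(self, per_hour=150, sleep=time.sleep):
        self.gap = 3600.0 / per_hour   # 24 seconds for Fitbit's 150/hour limit
        self._sleep = sleep            # injectable so it can be tested quickly

    def call(self, fn, *args):
        result = fn(*args)
        self._sleep(self.gap)          # pause before the next call is allowed
        return result

# Hypothetical usage with the script above:
#   throttle = Throttle()
#   APICallOK, APIResponse = throttle.call(MakeAPICall, FitbitURL, AccessToken, RefreshToken)
```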

The raw data output looks like:

pi@raspberrypi ~/fitbit $ head persecheartlog.txt


pi@raspberrypi ~/fitbit $ tail persecheartlog.txt

...and contained this many measurements:

pi@raspberrypi ~/fitbit $ wc -l persecheartlog.txt
3492490 persecheartlog.txt

So 3.5 million measurements to play with.  Mmmmmmmmmmmmmm.

As I've been doing lots recently, I used R to analyse the data.  I tried this on my Raspberry Pi 2 and, whilst I could load and manipulate the data using my Pi, R kept crashing when I tried to graph the data :-(.  Hence I resorted to using my PC, which is a bit boring but needs must...

Load the CSV file full of heart rate data:
> FitbitHeart1 <- read.csv(file="c:/myfiles/persecheartlog.txt",head=FALSE,sep=",")

Create useful column names:
> colnames(FitbitHeart1) <- c("Date","Time","HeartRate")

Add a Posix style date/time column to help with graphing:
> FitbitHeart1$DateTimePosix <- as.POSIXlt(paste(FitbitHeart1$Date,FitbitHeart1$Time,sep=" "))

And graph (this took about 5 mins to run on my PC, it got a bit hot and the fan went into over-drive)!
> library(ggplot2)
> qplot(DateTimePosix,HeartRate,data=FitbitHeart1,geom=c("point","smooth"),xlab="Date",ylab="Heart Rate (bpm)",main="Fitbit Heart Rate Data")

Yielding this interesting graph:

Hmmm, so this is what 3.5 million points plotted on a graph looks like!  Maybe a dense Norwegian fir tree forest in the dead of night!  I think there's beauty in any graph and, whilst this is one only its Dad can love, I spot:
  • A regression line (blue) which is decreasing and so matches the Fitbit summarised chart and adds further proof that I'm getting fitter.
  • Gaps in the "trees" where my Fitbit has not been working for one reason or another*.
  • The bottom of the dense set of points (the tree roots maybe) nestling at about 50 beats per minute.  Just looking at the graph, there appear to be more of these now than there were a year ago, showing my resting heart rate is decreasing.
  • The "canopy" of the forest at ~125 bpm, meaning my heart generally sits within the range 50 to 125 bpm.
  • Numerous trees peaking above 125 bpm which must be when I exercise.  There's more of these trees now as I do more exercise.
OK, that's the Norwegian forest analogy stretched a bit too far...

So maybe I need to think a bit more as to what to do with 3.5 million heart rate date points.  Something for a future blog post...

(*This was where my Fitbit broke during an upgrade and the lovely people from Fitbit replaced it free-of-charge).

Tuesday, 15 March 2016

Strava API Lap Analysis Using Raspberry Pi, Python and R

I'm training for a Half Marathon at the moment and, without meaning to sound too full of myself, I think I'm getting fitter.  This seems to be borne out by my resting heart rate as measured by my Fitbit Charge HR which, after my previous analysis, continues to get lower:

When out for a long run on Saturday it struck me that, for the same perceived effort, it feels like I'm getting faster in terms of how long each kilometre takes me to run.  As Greg LeMond once said, "it doesn't get any easier, you just go faster".  Hence, when running, I formed a plan to look at the pace stats from my ~2 years' worth of Garmin-gathered Strava data to see how my pace is changing.

For a previous post I described how to get Strava activity data from the Strava API.  After registering for a key, a HTTP GET to an example URL such as:<YourKey>&per_page=200&page=1

...returns a bunch of JSON documents, each of which describes a Strava activity and each of which has a unique ID.  Then, as described in this post, you can get "lap" data for a particular activity with a HTTP GET to a URL like this:<ActvityID>/laps?access_token=<YourKey>

So what is a "lap"?  In its simplest form, you get a lap logged every time you press "Lap" on your stopwatch.  So for an old skool runner, every time you passed a km or mile marker in a race you pressed lap and looked at your watch to see if you were running at your target pace.

These days a modern smartwatch will log every lap for post-analysis and can also be set up to auto-lap on time or distance.  For the vast majority of my runs I have my watch configured to auto-lap every km, so I have a large set of data readily available to me!

As with all good data, there is some messiness in it; specifically some runs where I've chosen to manually log laps, have had the lap function turned off (so the whole run is a single lap) or have a small sub-km distance at the end of the run that is logged as a lap.
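That tidying is straightforward to sketch in Python.  The 200 m threshold below is my own arbitrary choice for discarding stub laps, and the per-km scaling mirrors what I later do in R:

```python
def per_km_pace(elapsed_s, distance_m):
    """Scale a lap's elapsed time to a per-kilometre pace in seconds."""
    return elapsed_s / (distance_m / 1000.0)

def clean_laps(laps, min_distance_m=200):
    """Keep (distance, per-km pace) pairs, dropping tiny end-of-run stub laps."""
    return [(d, per_km_pace(t, d)) for (t, d) in laps if d >= min_distance_m]

# Hypothetical laps as (elapsed seconds, distance metres); the 195 m tail lap is dropped
print(clean_laps([(283, 1000.0), (272, 1000.0), (58, 195.82)]))
```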

So, to analyse the data, I chose to write a Python script on my Raspberry Pi 2 that would:
  • Extract activity data from the Strava API.  It has a limit of 200 activities per page so I had to request multiple pages.
  • Then for each activity, if it was a run, extract lap data from the Strava API.
  • Then log all the lap data, taking into account any anomalies (specifically missing heart rate data), into a file for further analysis.
Here's all the code.  The comments should describe what's going on:

import urllib2
import json

#The base URLs we use for activities and laps.  Fill in the Strava API addresses and your own key
BaseURLActivities = "<ActivitiesURLWithYourKeyHere>&per_page=200&page="
StartURLLaps = "<ActivitiesURLHere>/"
EndURLLaps = "/laps?access_token=<YourKey>"
LapLogFile = "/home/pi/Strava/lap_log_1.txt"

#Open the file to use
MyFile = open(LapLogFile,'w')

#Loop extracting data.  Remember it comes in pages
EndFound = False
LoopVar = 1

#Main loop - Getting all activities
while (EndFound == False):
  #Do a HTTP Get - First form the full URL
  ActivityURL = BaseURLActivities + str(LoopVar)
  StravaJSONData = urllib2.urlopen(ActivityURL).read()
  if StravaJSONData != "[]":   #This checks whether we got an empty JSON response and so should end
    #Now we process the JSON
    ActivityJSON = json.loads(StravaJSONData)

    #Loop through the JSON structure
    for JSONActivityDoc in ActivityJSON:
      #Start forming the string that we'll use for output
      OutStringStem = str(JSONActivityDoc["start_date"]) + "|" + str(JSONActivityDoc["type"]) + "|" + str(JSONActivityDoc["name"]) + "|" + str(JSONActivityDoc["id"]) + "|"
      #See if it was a run.  If so we're interested!!
      if (str(JSONActivityDoc["type"]) == "Run"):
        #Now form a URL and get the laps for this activity and get the JSON data
        LapURL = StartURLLaps + str(JSONActivityDoc["id"]) + EndURLLaps
        LapJSONData = urllib2.urlopen(LapURL).read()

        #Load the JSON to process it
        LapsJSON = json.loads(LapJSONData)

        #Loop through the lap, checking and logging data
        for MyLap in LapsJSON:
          OutString = OutStringStem + str(MyLap["lap_index"]) + "|" + str(MyLap["start_date_local"]) + "|" + str(MyLap["elapsed_time"]) + "|" 
          OutString = OutString + str(MyLap["moving_time"]) + "|" + str(MyLap["distance"]) + "|" + str(MyLap["total_elevation_gain"]) + "|"
          #Be careful with heart rate data, might not be there if I didn't wear a strap!!!
          if "average_heartrate" not in MyLap:
            OutString = OutString + "-1|-1\n"
          else:
            OutString = OutString + str(MyLap["average_heartrate"]) + "|" + str(MyLap["max_heartrate"]) + "\n"
          #Print to screen and write to file
          print OutString
          MyFile.write(OutString)
    #Set up for next loop
    LoopVar += 1
  else:
    #An empty response means there are no more activity pages to process
    EndFound = True

#Close the log file
MyFile.close()

So this created a log file that looked like this:

pi@raspberrypi:~/Strava $ tail lap_log_1.txt
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|8|2014-06-30T08:18:12Z|283|278|1000.0|6.3|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|9|2014-06-30T08:22:52Z|272|271|1000.0|16.2|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|10|2014-06-30T08:27:29Z|295|280|1000.0|18.1|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|11|2014-06-30T08:34:27Z|58|54|195.82|0.0|-1|-1
2014-06-26T11:16:34Z|Run|Smelsmore Loop|158234567|1|2014-06-26T12:16:34Z|2561|2561|8699.8|80.0|-1|-1
2014-06-20T11:09:00Z|Run|Smelsmore Loop|155234567|1|2014-06-20T12:09:00Z|2529|2484|8015.3|80.1|-1|-1
2014-06-16T16:23:19Z|Run|HQ to VW.  Strava was naughty and only caught part of it|154234567|1|2014-06-16T17:23:19Z|640|640|2169.9|39.2|-1|-1
2014-06-10T11:13:31Z|Run|Sunny squelchy Smelsmore|151234567|1|2014-06-10T12:13:31Z|2439|2429|8235.2|83.4|-1|-1
2014-06-03T10:57:58Z|Run|Lost in Donnington|148234567|1|2014-06-03T11:57:58Z|1933|1874|6266.7|86.0|-1|-1
2014-05-24T07:43:52Z|Run|Calf rehab run|144234567|1|2014-05-24T08:43:52Z|2992|2964|9977.4|170.7|-1|-1

Time to analyse the data in R!

First import the data into a data frame:
> StravaLaps1 <- read.csv(file="/home/pi/Strava/lap_log_1.txt",head=FALSE,sep="|")

Add some meaningful column names:
> colnames(StravaLaps1) <- c("ActvityStartDate","Type","Name","ActivityID","LapIndex","LapStartDate","ElapsedTime","MovingTime","Distance","ElevationGain","AveHeart","MaxHeart")

Turn the distance and time values to numbers so we can do some maths on them:
> StravaLaps1$ElapsedTimeNum = as.numeric(StravaLaps1$ElapsedTime)
> StravaLaps1$DistanceNum = as.numeric(StravaLaps1$Distance)

Now calculate the per km pace.  For the laps which were derived from the "auto-lap at 1 km" setting this just means we're dividing the elapsed time for the lap by 1.  Otherwise it scales up (for <1km laps) or down (for >1km laps) as required.
> StravaLaps1$PerKmLapTime <- StravaLaps1$ElapsedTimeNum / (StravaLaps1$DistanceNum / 1000)

 The data comes off the Strava API in reverse chronological order.  Hence to make sure it can be ordered for graphing I need to create a Posix time column, i.e. a column that's interpreted as a date and time, not just text.  To do this I first re-format the date and time using strptime, then turn into Posix.

> StravaLaps1$LapStartDateSimple <- strptime(StravaLaps1$LapStartDate, '%Y-%m-%dT%H:%M:%SZ')
> StravaLaps1$LapStartDatePosix <- as.POSIXlt(StravaLaps1$LapStartDateSimple)

...which gives us data like this:

> head(StravaLaps1[,c(13,14,15,17)])
  MovingTimeNum DistanceNum PerKmLapTime   LapStartDatePosix
1           269        1000          268 2016-03-12 08:55:11
2           263        1000          266 2016-03-12 08:59:44
3           264        1000          267 2016-03-12 09:04:10
4           258        1000          259 2016-03-12 09:08:37
5           271        1000          272 2016-03-12 09:12:56
6           252        1000          255 2016-03-12 09:17:30

Now to draw a lovely graph using ggplot2:
> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps1,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava")

Which gives this:

Now that is an interesting graph!  Each "vertical line" represents a single run with each point being a lap for that run.  A lot of the recent points are between 250 seconds (so 4m10s per km) and 300s (so 5m per km) which is about right.

On the graph you can also see a nice even spread of runs from spring 2014 to early summer 2015.  There was then a gap when I was injured until Sep 2015 when I returned from injury and then Dec 2015 when I started training in earnest.

The regression line is interesting, reaching its minimum point by Autumn 2015 (when I started doing short, fast 5km runs at ~4m10s per km) and then starting to increase again as my distance increased (to ~4m30s per km).

So it was interesting to look at just the most recent data.  To find the start point I scanned back in the data to the point where I started running again after my injury.  This was derived by doing the following command to extract just the first rows of the data frame into a new data frame:
> StravaLaps2 <- StravaLaps1[c(1:423),]

> tail(StravaLaps2[,c(1,3)])
        ActvityStartDate          Name
418 2015-11-10T07:54:51Z   Morning Run
419 2015-11-10T07:54:51Z   Morning Run
420 2015-11-10T07:54:51Z   Morning Run
421 2015-11-05T07:51:20Z Cheeky HQ Run
422 2015-11-05T07:51:20Z Cheeky HQ Run
423 2015-11-05T07:51:20Z Cheeky HQ Run

Where "Cheeky HQ Run" was a short tentative run I did as the first of my "comeback".  A plot using this data and a regression line is shown below:

> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps2,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava - Recent")

Now I REALLY like this graph.  Especially as the regression line shows I am getting faster which was the answer I wanted!  However with a bit less data you can see each run in more detail (each vertical line) and an interesting pattern emerges.

Best to look at this by delving into the data even more and just taking Feb and March data:

> StravaLaps3 <- StravaLaps2[c(1:201),]

> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps3,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava - Feb/Mar 2016")

Taking the run (vertical set of points) on the far right and moving left we see:

  • Long 21k run at a consistent pace so lots of points clustered together.
  • Shorter, hillier run so fewer points and a similar pace.
  • Intervals session so some very fast laps (sub 4 min km) and some slow jogging
  • Long 18k run at a consistent pace but not so nicely packed together as the 21k run

...and so on back in time, with each type of run (long, short and intervals) having its own telltale "fingerprint".  For example the second run from the right is a fast (for me) 5k Parkrun so has a small number of laps at a pretty good (for me) pace.

Overall I really like this data and what Strava, Raspberry Pi, Python and R lets me do with it.  First of all it tells me I'm getting faster which is always good.  Second it has an interesting pattern and each type of run is easily distinguishable which is nice.  Finally it's MY data; I'm playing with and learning about this stuff with my own data which is somehow more fun than using pre-prepared sample data.

Friday, 11 March 2016

First Football (Soccer) Stats Analysis Using Raspberry Pi, Python, MongoDB and R

In my last post I described the setup I'd created on my Raspberry Pi to do football (soccer for some of you) statistics analysis.  Here's an "architecture" diagram:

So in simple terms I:

  • Use Python to gather data from internet sources, parse it and load it into..
  • MongoDB where I store data in document format before...
  • Analysing the data using R and...
  • Presenting the results for you lucky people in Blogger!

So it was time to gather some data, load it and analyse it!

A quick look around the internet showed me a variety of sources, the first of which was:

This is a betting-focussed website but has a set of free CSV files showing results from multiple European leagues going back to the early '90s (older data is more sparse than recent data).  As I say, they offer their data for free, but I urge you to say thanks to them as an ad-funded website in the only way you can, if you see what I mean.

Grabbing a file and looking at it in LibreOffice Calc shows there to be all sorts of interesting data available.  Here's a screenshot showing data for the Belgian Pro league.

So much nice data to play with!!

Step 1- Getting All the Data
Looking at an example URL for one of the CSV files on shows it to be:

So here we see:
  • A base URL -
  • Four digits indicating the year.  Here 1516 means the season 2015-16
  • The league in question.  Here E0 means English Premier League

Hence it's pretty easy to write a Python script to iterate through the years (9394 to 1516) and leagues to grab all the CSV files.  The script is below.  The comments should explain all, but in simple terms it iterates through a list of years (YearList) and, for each year, a list of leagues (LeagueList), forms a URL for a wget command and then renames the downloaded file so all the leagues for all the years can sit in the same folder.

#Downloading a bulk load of Football CSV files from football-data.co.uk
#Example URL format is <BaseURL><Year>/<League>.csv - e.g. 1516/E0.csv for EPL Season 1516
import os
import time

DirForFiles = "/home/pi/Documents/FootballProject/"

#Base URL for the CSV files (fill in the football-data URL shown above)
BaseURL = ""

#Next part is a set of digits that represent the year.  Do these as a list
YearList = ['9394','9495','9596','9697','9798','9899','9900','0001','0102','0203','0304','0405','0506','0607','0708','0809','0910','1011','1112','1213','1314','1415','1516']

#Then the values that are used to represent leagues.  Another List
LeagueList = ['E0','E1','E2','E3','EC','SC0','SC1','SC2','SC3','D1','D2','I1','I2','SP1','SP2','F1','F2','N1','B1','P1','T1','G1']

#Iterate through the years
for TheYear in YearList:
  #Now iterate through the leagues, forming the command to get
  for TheLeague in LeagueList:
    GetCommand = "sudo wget -P " + DirForFiles + " " + BaseURL + TheYear + "/" + TheLeague + ".csv"
    #Also form the name the file will take when downloaded
    FileWhenDownloaded = DirForFiles + TheLeague + ".csv"

    #And the file name to rename to
    RenameFileTo = DirForFiles + TheLeague + "_" + TheYear + ".csv"
    #Run the wget command (-P makes wget save into DirForFiles)
    os.system(GetCommand)

    #Rename the file so all leagues for all years can sit in the same folder
    RenameCommand = "sudo mv " + FileWhenDownloaded + " " + RenameFileTo
    os.system(RenameCommand)

    #Be kind to the site - pause between downloads
    time.sleep(1)

    #print (GetCommand + "|" + FileWhenDownloaded + "|" + RenameFileTo)

A quick ls command shows all the lovely files ready for analysing! 

Step 2 - Loading into MongoDB 
In my last post I covered the basics of loading data into MongoDB using Python. Now it was time to load data downloaded from the football-data site.  I decided to just load one league for one season to have an initial play with the data.

It seems that there are different fields for different leagues and different years.  The field names themselves are consistent; it's just that individual fields are present in some files and absent in others.  The full list of field names is here.

I decided to model the data as one simple JSON document per match, with each field as a key-value pair.  So, for example, a simplified document for the first match of the Belgian league file shown above would look like:

{
  ...
  "AwayTeam":"St Truiden",
  ...
}

There may well be more elegant ways of modelling the data.  As I explore and learn more I'll work this out!
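The one-document-per-match idea can be sketched in a few lines of Python.  The field names and the home team below are illustrative, not taken from the real Belgian file:

```python
# One CSV row becomes one dict (the future MongoDB document): the header
# supplies the keys, the row supplies the values.
header = ["Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG"]
row = ["B1", "08/08/15", "SomeHomeTeam", "St Truiden", "1", "1"]

match_doc = dict(zip(header, row))
print(match_doc["AwayTeam"])  # St Truiden
```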

To prepare MongoDB I created a database called "Footie" and a collection called "Results" using these commands in the Mongo utility:

> use Footie
> db.createCollection("Results")

I wrote the script below to load data for one season and one league (English League 2, 2015/16).

In the script I:

  • Create an object to access MongoDB
  • Open a CSV file to read it
  • Read the first line, remove trailing non-printing characters, split it and use this to form a list (HeaderList) that will make up the key element of the JSON document.
  • Then for each subsequent line, read it, split it into another list (LineList)
  • I then iterate through each list, forming key value pairs and writing them to a Python dictionary, (which is required to write a JSON document to MongoDB).
  • I then write the document to MongoDB!
#Create JSON documents from football results and write them to MongoDB
from pymongo import MongoClient

DirPath = "/home/pi/Documents/FootballProject/"

#Connect to the Footie database
client = MongoClient()
db = client.Footie
#Get a collection
collection = db.Results

#Open the file to process
MyFile = open(DirPath + "E3_1516.csv",'r')

#Read the first line, which is the header line, remove the \r\n at the end and turn it into a list
LineOne = MyFile.readline()
LineOne = LineOne[:-2]
HeaderList = LineOne.split(',')

#Now loop through the file reading lines, creating JSONs and writing them
FileLine = MyFile.readline()
while len(FileLine) > 0:
  #Get rid of the last two characters and put the fields in a list
  FileLine = FileLine[:-2]
  LineList = FileLine.split(',')
  #Form a JSON from these lists; needs to be a Python dictionary
  JSONDict = dict()

  #Loop through both lists and add to the JSON
  for i in range(len(HeaderList)):
    #Interestingly field names in MongoDB can't contain a "." so turn it into a European style ","
    HeaderList[i] = HeaderList[i].replace(".",",")
    JSONDict[HeaderList[i]] = LineList[i]
  #Write the document to the collection
  collection.insert_one(JSONDict)
  print(JSONDict)

  #Setup for next loop
  FileLine = MyFile.readline()

print ("Finished writing JSONs")

#Close the file
MyFile.close()

The net result, as seen in the Mongo shell, being (abridged):

> db.Results.find()
{ "_id" : ObjectId("56ccc42d74fece04d3d5a323"), "BbAHh" : "0.25", "HY" : "1", "BbAH" : "25", "BbMx<2,5" : "1.7", "HTHG" : "0", "HR" : "0", "HS" : "12", "VCA" : "2.5", "BbMx>2,5" : "2.28", "BbMxD" : "3.4", "AwayTeam" : "Luton", "BbAvD" : "3.19", "PSD" : "3.34", "BbAvA" : "2.38", "HC" : "3", "HF" : "12", "Bb1X2" : "44", "BbAvH" : "2.96", "WHD" : "3.2", "Referee" : "G Eltringham", "WHH" : "2.9", "WHA" : "2.5", "IWA" : "2.2", "AST" : "4", "BbMxH" : "3.25", "HTAG" : "0", "BbMxAHA" : "2.14", "IWH" : "2.8", "LBA" : "2.4", "BWA" : "2.15", "BWD" : "3.2", "LBD" : "3.25", "HST" : "4", "PSA" : "2.46", "Date" : "08/08/15", "LBH" : "3.1", "BbAvAHA" : "2.06", "BbAvAHH" : "1.77", "IWD" : "3.1", "AC" : "4", "FTR" : "D", "VCD" : "3.4", "AF" : "15", "VCH" : "3", "FTHG" : "1", "BWH" : "3.1", "AS" : "10", "AR" : "0", "BbAv<2,5" : "1.65", "AY" : "0", "BbAv>2,5" : "2.16", "Div" : "E3", "PSH" : "3.08", "B365H" : "3.2", "HomeTeam" : "Accrington", "B365D" : "3.4", "B365A" : "2.4", "BbMxAHH" : "1.82", "HTR" : "D", "BbOU" : "37", "FTAG" : "1", "BbMxA" : "2.5" }
{ "_id" : ObjectId("56ccc4c674fece05830a706c"), "BbAHh" : "0.25", "HY" : "1", "BbAH" : "25", "BbMx<2,5" : "1.7", "HTHG" : "0", "HR" : "0", "HS" : "12", "VCA" : "2.5", "BbMx>2,5" : "2.28", "BbMxD" : "3.4", "AwayTeam" : "Luton", "BbAvD" : "3.19", "PSD" : "3.34", "BbAvA" : "2.38", "HC" : "3", "HF" : "12", "Bb1X2" : "44", "BbAvH" : "2.96", "WHD" : "3.2", "Referee" : "G Eltringham", "WHH" : "2.9", "WHA" : "2.5", "IWA" : "2.2", "AST" : "4", "BbMxH" : "3.25", "HTAG" : "0", "BbMxAHA" : "2.14", "IWH" : "2.8", "LBA" : "2.4", "BWA" : "2.15", "BWD" : "3.2", "LBD" : "3.25", "HST" : "4", "PSA" : "2.46", "Date" : "08/08/15", "LBH" : "3.1", "BbAvAHA" : "2.06", "BbAvAHH" : "1.77", "IWD" : "3.1", "AC" : "4", "FTR" : "D", "VCD" : "3.4", "AF" : "15", "VCH" : "3", "FTHG" : "1", "BWH" : "3.1", "AS" : "10", "AR" : "0", "BbAv<2,5" : "1.65", "AY" : "0", "BbAv>2,5" : "2.16", "Div" : "E3", "PSH" : "3.08", "B365H" : "3.2", "HomeTeam" : "Accrington", "B365D" : "3.4", "B365A" : "2.4", "BbMxAHH" : "1.82", "HTR" : "D", "BbOU" : "37", "FTAG" : "1", "BbMxA" : "2.5" }

Step 3 - Some Analysis in R
Using R I did some initial analysis of the data.  I thought it would be interesting to ask the all-important question "who is the naughtiest team in League 2?".  The data can help as the following fields are present:

HF = Home Team Fouls Committed
AF = Away Team Fouls Committed 
HY = Home Team Yellow Cards
AY = Away Team Yellow Cards
HR = Home Team Red Cards
AR = Away Team Red Cards

Before using R, I practised the query I wanted to run in the MongoDB shell.  The query is:

> use Footie 
switched to db Footie 
> db.Results.find({},{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0})

Which breaks down as:

> db.Results.find({}, means run a query on the Results collection and provide no filter parameters, i.e. give me everything.

{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0})
means include all the fields flagged with a "1" in the output and exclude the _id field (which is included by default).

...and this yields (abridged):

> db.Results.find({},{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id: 0})
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Luton", "HF" : "12", "AF" : "15", "AR" : "0", "AY" : "0", "HomeTeam" : "Accrington" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Luton", "HF" : "12", "AF" : "15", "AR" : "0", "AY" : "0", "HomeTeam" : "Accrington" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Plymouth", "HF" : "9", "AF" : "7", "AR" : "0", "AY" : "0", "HomeTeam" : "AFC Wimbledon" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Northampton", "HF" : "11", "AF" : "8", "AR" : "0", "AY" : "1", "HomeTeam" : "Bristol Rvs" }
{ "HY" : "1", "HR" : "0", "AwayTeam" : "Newport County", "HF" : "9", "AF" : "12", "AR" : "0", "AY" : "2", "HomeTeam" : "Cambridge" }
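In plain-Python terms, the projection behaves roughly like this (an illustrative sketch, not how Mongo actually implements it):

```python
# Keep only the fields flagged with 1; drop _id when it is set to 0.
def project(doc, projection):
    out = {k: v for k, v in doc.items() if projection.get(k) == 1}
    if projection.get("_id", 1) == 1 and "_id" in doc:
        out["_id"] = doc["_id"]
    return out

# Values taken from the Accrington v Luton document shown earlier
doc = {"_id": "56cc...", "HomeTeam": "Accrington", "AwayTeam": "Luton",
       "HF": "12", "AF": "15", "HY": "1", "AY": "0", "HR": "0", "AR": "0",
       "Referee": "G Eltringham"}
fields = {"HomeTeam": 1, "AwayTeam": 1, "HF": 1, "AF": 1,
          "HY": 1, "AY": 1, "HR": 1, "AR": 1, "_id": 0}

print(project(doc, fields))  # eight fields, no _id, no Referee
```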

To get the same data into an R data frame I did the following to set things up:

> library(RMongo)
Loading required package: rJava
> mg1 <- mongoDbConnect('Footie')
> print(dbShowCollections(mg1))
[1] "Results"        "system.indexes"

...then this to run the query.  

> query <- dbGetQueryForKeys(mg1, 'Results', "{}","{HomeTeam: 1, AwayTeam:1,HF:1,AF:1,HY:1,AY:1,HR:1,AR:1,_id:0}")
> data1 <- query

(Note the use of the dbGetQueryForKeys method which splits the Mongo shell query shown above into two parts).

Which gives this output (abridged):

> data1
          HomeTeam       AwayTeam HF AF HY AY HR AR X_id
1       Accrington          Luton 12 15  1  0  0  0   NA
2    AFC Wimbledon       Plymouth  9  7  1  0  0  0   NA
3      Bristol Rvs    Northampton 11  8  1  1  0  0   NA
4        Cambridge Newport County  9 12  1  2  0  0   NA
5           Exeter         Yeovil  5 10  1  0  0  0   NA

Not sure why I get the X_id column, but I'm sure I can deal with it!

I now need to get this side-by-side data (so home and away team on the same row) into a data frame where there's one row per match per team.

To do this I created one data frame for home teams, one for away teams then merged them.

First for the home team.  Get the home team data (columns 1, 3, 5 and 7) and then rename them to make them consistent when we combine home and away data frames:

> data2 <- data1[,c(1,3,5,7)]
> colnames(data2) <- c("Team","Fouls","Yellows","Reds") 
> head(data2)
           Team Fouls Yellows Reds
1    Accrington    12       1    0
2 AFC Wimbledon     9       1    0
3   Bristol Rvs    11       1    0
4     Cambridge     9       1    0
5        Exeter     5       1    0
6    Hartlepool    12       4    0

Get away team data and rename:

> data3 <- data1[,c(2,4,6,8)]
> colnames(data3) <- c("Team","Fouls","Yellows","Reds")
> head(data3)
            Team Fouls Yellows Reds
1          Luton    15       0    0
2       Plymouth     7       0    0
3    Northampton     8       1    0
4 Newport County    12       2    0
5         Yeovil    10       0    0
6      Morecambe    14       2    0

Merge the two together using the rbind function:

> data4 <- rbind(data2, data3)

> head(data4)
           Team Fouls Yellows Reds
1    Accrington    12       1    0
2 AFC Wimbledon     9       1    0
3   Bristol Rvs    11       1    0
4     Cambridge     9       1    0
5        Exeter     5       1    0
6    Hartlepool    12       4    0

So as a quick check, in the first data frame we had this as the bottom result:

> data1[371,]
    HomeTeam   AwayTeam HF AF HY AY HR AR X_id
371   Yeovil Portsmouth  7 10  1  0  0  1   NA

Now you can see this is split over two rows of our data frame:
> data4[c(371,742),]
           Team Fouls Yellows Reds
371      Yeovil     7       1    0
3711 Portsmouth    10       0    1
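The home/away split just demonstrated amounts to this in plain Python (using the Yeovil v Portsmouth numbers above):

```python
# One "wide" match row (home team, away team, HF, AF, HY, AY, HR, AR)
# becomes two "long" team rows: (team, fouls, yellows, reds).
matches = [("Yeovil", "Portsmouth", 7, 10, 1, 0, 0, 1)]

team_rows = []
for home, away, hf, af, hy, ay, hr, ar in matches:
    team_rows.append((home, hf, hy, hr))   # home team's stats
    team_rows.append((away, af, ay, ar))   # away team's stats

print(team_rows)  # [('Yeovil', 7, 1, 0), ('Portsmouth', 10, 0, 1)]
```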

Now aggregate to get the count per team across the season so far.  Here you first list the values you're aggregating, then what to group them by, then the function (sum in this case):

> data_agg <-aggregate(list(Fouls=data4$Fouls,Yellows=data4$Yellows,Reds=data4$Reds), list(Team=data4$Team), sum)

Which gives us (abridged):

> data_agg
             Team Fouls Yellows Reds
1      Accrington   302      57    3
2   AFC Wimbledon   377      40    3
3          Barnet   331      46    3
4     Bristol Rvs   299      39    2
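What aggregate does here corresponds to a simple group-and-sum, sketched below in Python with made-up numbers (not real match data):

```python
from collections import defaultdict

# Sum fouls/yellows/reds per team, as R's aggregate(..., sum) does.
team_rows = [("Yeovil", 7, 1, 0), ("Portsmouth", 10, 0, 1), ("Yeovil", 5, 2, 0)]

totals = defaultdict(lambda: [0, 0, 0])  # [fouls, yellows, reds]
for team, fouls, yellows, reds in team_rows:
    totals[team][0] += fouls
    totals[team][1] += yellows
    totals[team][2] += reds

print(dict(totals))  # {'Yeovil': [12, 3, 0], 'Portsmouth': [10, 0, 1]}
```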

Not much use, so time to order the data using the "order" function.  There are many ways to order; I selected to order by fouls first, then yellow cards, then red cards.

> agg_sort <- data_agg[order(-data_agg$Fouls,-data_agg$Yellows,-data_agg$Reds),]
> agg_sort
             Team Fouls Yellows Reds
22        Wycombe   420      48    2
13      Mansfield   417      62    5
2   AFC Wimbledon   377      40    3
8     Dag and Red   373      42    2
21      Stevenage   351      56    3
12          Luton   351      53    3
7    Crawley Town   348      52    4
16    Northampton   347      44    5
11  Leyton Orient   338      45    3
19       Plymouth   332      53    0
3          Barnet   331      46    3
17   Notts County   329      56    3
5       Cambridge   329      38    3
23         Yeovil   317      34    3
24           York   316      50    3
18         Oxford   310      46    2
15 Newport County   306      37    2
1      Accrington   302      57    3
9          Exeter   301      38    0
4     Bristol Rvs   299      39    2
14      Morecambe   294      55    2
6        Carlisle   284      39    3
20     Portsmouth   264      28    5
10     Hartlepool   252      45    2
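For the record, R's multi-column descending order corresponds to sorting on a negated tuple key in Python.  A sketch using three of the rows above:

```python
# Rows are (Team, Fouls, Yellows, Reds); sort descending by fouls,
# then yellows, then reds - the equivalent of order(-Fouls, -Yellows, -Reds).
table = [("Luton", 351, 53, 3),
         ("Stevenage", 351, 56, 3),
         ("Wycombe", 420, 48, 2)]

ranked = sorted(table, key=lambda r: (-r[1], -r[2], -r[3]))
print([r[0] for r in ranked])  # ['Wycombe', 'Stevenage', 'Luton']
```

Note how the Stevenage/Luton tie on fouls (351 each) is broken by yellow cards, just as in the R output.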

So there, it's official*, Wycombe Wanderers are the naughtiest team in English League 2.

(*Apologies to fans of Wycombe.  It's just stats geekery and no reflection on your fine football team!).