Tuesday, 15 March 2016

Strava API Lap Analysis Using Raspberry Pi, Python and R

I'm training for a Half Marathon at the moment and, without meaning to sound too full of myself, I think I'm getting fitter.  This seems to be borne out by my resting heart rate as measured by my Fitbit Charge HR which, after my previous analysis, continues to get lower:

When out for a long run on Saturday it struck me that, for the same perceived effort, it feels like I'm getting faster in terms of how long each kilometre takes me to run.  As Greg LeMond once said, "It never gets easier, you just go faster".  Hence, when running, I formed a plan to look at the pace stats from my ~2 years' worth of Garmin-gathered Strava data to see how my pace is changing.

For a previous post I described how to get Strava activity data from the Strava API.  After registering for a key, an HTTP GET to an example URL such as:

https://www.strava.com/api/v3/athlete/activities?access_token=<YourKey>&per_page=200&page=1

...returns a bunch of JSON documents, each of which describes a Strava activity and each of which has a unique ID.  Then, as described in this post, you can get "lap" data for a particular activity with an HTTP GET to a URL like this:

https://www.strava.com/api/v3/activities/<ActivityID>/laps?access_token=<YourKey>

So what is a "lap"?  In its simplest form, you get a lap logged every time you press "Lap" on your stopwatch.  So for an old skool runner, every time you passed a km or mile marker in a race you pressed "Lap" and looked at your watch to see if you were running at your target pace.

These days a modern smartwatch will log every lap for post-analysis and can also be set up to auto-lap on time or distance.  For the vast majority of my runs I have my watch configured to auto-lap every km, so I have a large set of data readily available to me!

As with all good data, there's some messiness in it: specifically some runs where I've chosen to manually log laps, have had the lap function turned off (so the whole run is a single lap), or have a small sub-km distance at the end of the run that is logged as a lap.

So, to analyse the data, I chose to write a Python script on my Raspberry Pi 2 that would:
  • Extract activity data from the Strava API.  The API returns a maximum of 200 activities per page so I had to request multiple pages.
  • Then, for each activity, if it was a run, extract lap data from the Strava API.
  • Then log all the lap data, taking into account any anomalies (specifically missing heart rate data), into a file for further analysis.
Here's all the code.  The comments should describe what's going on:

import urllib2
import json

#The base URLs we use for the activity and lap API calls
BaseURLActivities = "https://www.strava.com/api/v3/athlete/activities?access_token=<YourKey>&per_page=200&page="
StartURLLaps = "https://www.strava.com/api/v3/activities/"
EndURLLaps = "/laps?access_token=<YourKey>"
LapLogFile = "/home/pi/Strava/lap_log_1.txt"

#Open the file to use
MyFile = open(LapLogFile,'w')

#Loop extracting data.  Remember it comes in pages
EndFound = False
LoopVar = 1

#Main loop - Getting all activities
while (EndFound == False):
  #Do a HTTP Get - First form the full URL for this page of activities
  ActivityURL = BaseURLActivities + str(LoopVar)
  StravaJSONData = urllib2.urlopen(ActivityURL).read()
  if StravaJSONData != "[]":   #This checks whether we got an empty JSON response and so should end
    #Now we process the JSON
    ActivityJSON = json.loads(StravaJSONData)

    #Loop through the JSON structure
    for JSONActivityDoc in ActivityJSON:
      #Start forming the string that we'll use for output
      OutStringStem = str(JSONActivityDoc["start_date"]) + "|" + str(JSONActivityDoc["type"]) + "|" + str(JSONActivityDoc["name"]) + "|" + str(JSONActivityDoc["id"]) + "|"
      #See if it was a run.  If so we're interested!!
      if (str(JSONActivityDoc["type"]) == "Run"):
        #Now form a URL for the laps of this activity and get the JSON data
        LapURL = StartURLLaps + str(JSONActivityDoc["id"]) + EndURLLaps
        LapJSONData = urllib2.urlopen(LapURL).read()

        #Load the JSON to process it
        LapsJSON = json.loads(LapJSONData)

        #Loop through the laps, checking and logging data
        for MyLap in LapsJSON:
          OutString = OutStringStem + str(MyLap["lap_index"]) + "|" + str(MyLap["start_date_local"]) + "|" + str(MyLap["elapsed_time"]) + "|"
          OutString = OutString + str(MyLap["moving_time"]) + "|" + str(MyLap["distance"]) + "|" + str(MyLap["total_elevation_gain"]) + "|"
          #Be careful with heart rate data, might not be there if I didn't wear a strap!!!
          if "average_heartrate" not in MyLap:
            OutString = OutString + "-1|-1\n"
          else:
            OutString = OutString + str(MyLap["average_heartrate"]) + "|" + str(MyLap["max_heartrate"]) + "\n"
          #Print to screen and write to file
          print OutString
          MyFile.write(OutString)
    #Set up for next loop iteration
    LoopVar += 1
  else:
    #An empty page means we've got everything
    EndFound = True

#Close the log file
MyFile.close()
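The fiddliest part is building each lap's log line with the optional heart rate fields.  Pulled out on its own as a function (the name lap_to_line is mine, but the field names are the ones in the Strava lap JSON above), that logic looks like this:

```python
def lap_to_line(stem, lap):
    """Turn one Strava lap dict into a pipe-delimited log line.

    'stem' is the activity part of the line (date|type|name|id|
    including the trailing pipe); missing heart rate fields are
    logged as -1 so every line has the same number of fields.
    """
    fields = [lap["lap_index"], lap["start_date_local"],
              lap["elapsed_time"], lap["moving_time"],
              lap["distance"], lap["total_elevation_gain"]]
    if "average_heartrate" in lap:
        fields += [lap["average_heartrate"], lap["max_heartrate"]]
    else:
        fields += [-1, -1]
    return stem + "|".join(str(f) for f in fields) + "\n"
```

Logging -1 rather than leaving the fields out keeps every line the same width, which makes the later import into R trivial.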
So this created a log file that looked like this:

pi@raspberrypi:~/Strava $ tail lap_log_1.txt
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|8|2014-06-30T08:18:12Z|283|278|1000.0|6.3|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|9|2014-06-30T08:22:52Z|272|271|1000.0|16.2|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|10|2014-06-30T08:27:29Z|295|280|1000.0|18.1|-1|-1
2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|11|2014-06-30T08:34:27Z|58|54|195.82|0.0|-1|-1
2014-06-26T11:16:34Z|Run|Smelsmore Loop|158234567|1|2014-06-26T12:16:34Z|2561|2561|8699.8|80.0|-1|-1
2014-06-20T11:09:00Z|Run|Smelsmore Loop|155234567|1|2014-06-20T12:09:00Z|2529|2484|8015.3|80.1|-1|-1
2014-06-16T16:23:19Z|Run|HQ to VW.  Strava was naughty and only caught part of it|154234567|1|2014-06-16T17:23:19Z|640|640|2169.9|39.2|-1|-1
2014-06-10T11:13:31Z|Run|Sunny squelchy Smelsmore|151234567|1|2014-06-10T12:13:31Z|2439|2429|8235.2|83.4|-1|-1
2014-06-03T10:57:58Z|Run|Lost in Donnington|148234567|1|2014-06-03T11:57:58Z|1933|1874|6266.7|86.0|-1|-1
2014-05-24T07:43:52Z|Run|Calf rehab run|144234567|1|2014-05-24T08:43:52Z|2992|2964|9977.4|170.7|-1|-1

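So each line carries twelve pipe-delimited fields: four describing the activity, six describing the lap, and two for heart rate.  A quick sanity check in Python on one of the lines above:

```python
# One line from lap_log_1.txt, split on the pipe delimiter
line = ("2014-06-30T05:39:36Z|Run|Copenhagen Canter|160234567|"
        "8|2014-06-30T08:18:12Z|283|278|1000.0|6.3|-1|-1")
fields = line.split("|")
```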
Time to analyse the data in R!

First import the data into a data frame:
> StravaLaps1 <- read.csv(file="/home/pi/Strava/lap_log_1.txt",header=FALSE,sep="|")

Add some meaningful column names:
> colnames(StravaLaps1) <- c("ActvityStartDate","Type","Name","ActivityID","LapIndex","LapStartDate","ElapsedTime","MovingTime","Distance","ElevationGain","AveHeart","MaxHeart")

Turn the distance and time values to numbers so we can do some maths on them:
> StravaLaps1$ElapsedTimeNum = as.numeric(StravaLaps1$ElapsedTime)
> StravaLaps1$DistanceNum = as.numeric(StravaLaps1$Distance)

Now calculate the per km pace.  For the laps which were derived from the "auto-lap at 1 km" settings this just means we're dividing the elapsed time for the lap by 1.  Otherwise it scales up (for <1km laps) or down (for >1km laps) as required.
> StravaLaps1$PerKmLapTime <- StravaLaps1$ElapsedTimeNum / (StravaLaps1$DistanceNum / 1000)
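The same scaling is easy to sanity-check outside R (per_km_pace is just an illustrative name, not part of the script above):

```python
def per_km_pace(elapsed_seconds, distance_metres):
    """Seconds per km: scales any lap distance to an equivalent 1 km time."""
    return elapsed_seconds / (distance_metres / 1000.0)

# A 1000 m auto-lap is unchanged; the 195.82 m lap at the end of the
# Copenhagen run scales up from 58 s to roughly 296 s per km.
```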

The data comes off the Strava API in reverse chronological order.  Hence, to make sure it can be ordered for graphing, I need to create a POSIX time column, i.e. a column that's interpreted as a date and time rather than just text.  To do this I first parse the date and time using strptime, then turn it into POSIXlt.

> StravaLaps1$LapStartDateSimple <- strptime(StravaLaps1$LapStartDate, '%Y-%m-%dT%H:%M:%SZ')
> StravaLaps1$LapStartDatePosix <- as.POSIXlt(StravaLaps1$LapStartDateSimple)
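The same '%Y-%m-%dT%H:%M:%SZ' format string works with Python's datetime.strptime, which is a handy way to check it against the Strava timestamps, and shows why parsed timestamps (unlike text) sort chronologically:

```python
from datetime import datetime

# The format string matches Strava's UTC timestamps
FMT = "%Y-%m-%dT%H:%M:%SZ"
lap_a = datetime.strptime("2014-06-26T12:16:34Z", FMT)
lap_b = datetime.strptime("2014-06-30T08:18:12Z", FMT)
```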

...which gives us data like this:

> head(StravaLaps1[,c(13,14,15,17)])
  MovingTimeNum DistanceNum PerKmLapTime   LapStartDatePosix
1           269        1000          268 2016-03-12 08:55:11
2           263        1000          266 2016-03-12 08:59:44
3           264        1000          267 2016-03-12 09:04:10
4           258        1000          259 2016-03-12 09:08:37
5           271        1000          272 2016-03-12 09:12:56
6           252        1000          255 2016-03-12 09:17:30

Now to draw a lovely graph using ggplot2 (loaded first with library(ggplot2)):
> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps1,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava")

Which gives this:

Now that is an interesting graph!  Each "vertical line" represents a single run with each point being a lap for that run.  A lot of the recent points are between 250 seconds (so 4m10s per km) and 300s (so 5m per km) which is about right.

On the graph you can also see a nice even spread of runs from spring 2014 to early summer 2015.  There was then a gap when I was injured until Sep 2015 when I returned from injury and then Dec 2015 when I started training in earnest.

The regression line is interesting, reaching its minimum point by Autumn 2015 (when I started doing short, fast 5km runs at ~4m10s per km) and then starting to increase again as my distance increased (to ~4m30s per km).

Next it seemed interesting to look at just the most recent data.  To find the start point I scanned back in the data to the point where I started running again after my injury, then ran the following command to extract just the first rows of the data frame into a new data frame:
> StravaLaps2 <- StravaLaps1[c(1:423),]

> tail(StravaLaps2[,c(1,3)])
        ActvityStartDate          Name
418 2015-11-10T07:54:51Z   Morning Run
419 2015-11-10T07:54:51Z   Morning Run
420 2015-11-10T07:54:51Z   Morning Run
421 2015-11-05T07:51:20Z Cheeky HQ Run
422 2015-11-05T07:51:20Z Cheeky HQ Run
423 2015-11-05T07:51:20Z Cheeky HQ Run

Where "Cheeky HQ Run" was a short tentative run I did as the first of my "comeback".  A plot using this data and a regression line is shown below:

> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps2,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava - Recent")

Now I REALLY like this graph, especially as the regression line shows I'm getting faster, which was the answer I wanted!  However, with a bit less data you can see each run (each vertical line) in more detail, and an interesting pattern emerges.

Best to look at this by delving into the data even more and taking just the Feb and March data:

> StravaLaps3 <- StravaLaps2[c(1:201),]

> qplot(LapStartDatePosix,PerKmLapTime,data=StravaLaps3,geom=c("point","smooth"),ylim=c(200,600),xlab="Date",ylab="KM Pace(s)",main="KM Pace from Strava - Feb/Mar 2016")

Taking the run (vertical set of points) on the far right and moving left we see:

  • A long 21k run at a consistent pace, so lots of points clustered together.
  • A shorter, hillier run, so fewer points at a similar pace.
  • An intervals session, so some very fast laps (sub-4-minute kms) and some slow jogging.
  • A long 18k run at a consistent pace, but not so nicely packed together as the 21k run.

...and so on back in time, with each type of run (long, short and intervals) having its own telltale "fingerprint".  For example the second run from the right is a fast (for me) 5k Parkrun, so has a small number of laps at a pretty good (for me) pace.

Overall I really like this data and what Strava, Raspberry Pi, Python and R let me do with it.  First of all it tells me I'm getting faster, which is always good.  Second it has an interesting pattern, with each type of run easily distinguishable, which is nice.  Finally it's MY data; I'm playing with and learning about this stuff with my own data, which is somehow more fun than using pre-prepared sample data.