2 Jul 2014

R/plyr: ddply - Renaming the grouping/generated column when grouping by date

On Nicole’s recommendation I’ve been having a look at R’s plyr package to see if I could simplify my meetup analysis and I started by translating my code that grouped meetup join dates by day of the week.

To refresh, the code without plyr looked like this:

R library(Rneo4j)
timestampToDate <- function(x) as.POSIXct(x / 1000, origin="1970-01-01")

query = "MATCH (:Person)-[:HAS_MEETUP_PROFILE]->()-[:HAS_MEMBERSHIP]->(membership)-[:OF_GROUP]->(g:Group {name: \"Neo4j - London User Group\"})
         RETURN membership.joined AS joinDate"
meetupMembers = cypher(graph, query)
meetupMembers$joined <- timestampToDate(meetupMembers$joinDate)

dd = aggregate(meetupMembers$joined, by=list(format(meetupMembers$joined, "%A")), function(x) length(x))
colnames(dd) = c("dayOfWeek", "count")

which returns the following:

R > dd
  dayOfWeek count
1    Friday   135
2    Monday   287
3  Saturday    80
4    Sunday   102
5  Thursday   187
6   Tuesday   286
7 Wednesday   211

We need to use plyr’s ddply function which takes a data frame and transforms it into another one.

To refresh, this is what the initial data frame looks like:

R > meetupMembers[1:10,]
       joinDate              joined
1  1.376572e+12 2013-08-15 14:13:40
2  1.379491e+12 2013-09-18 08:55:11
3  1.349454e+12 2012-10-05 17:28:04
4  1.383127e+12 2013-10-30 09:59:03
5  1.372239e+12 2013-06-26 10:27:40
6  1.330295e+12 2012-02-26 22:27:00
7  1.379676e+12 2013-09-20 12:22:39
8  1.398462e+12 2014-04-25 22:41:19
9  1.331734e+12 2012-03-14 14:11:43
10 1.396874e+12 2014-04-07 13:32:26

Most of the examples of using ddply show how to group by a specific 'column' e.g. joined but I want to group by part of the value in that column and eventually came across an example which showed how to do it:

R > ddply(meetupMembers, .(format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  })
  format(joined, "%A") count
1               Friday   135
2               Monday   287
3             Saturday    80
4               Sunday   102
5             Thursday   187
6              Tuesday   286
7            Wednesday   211

Unfortunately the generated column heading for the group by key isn’t very readable and it took me way longer than it should have to work out how to name it as I wanted! This is how you do it:

R > ddply(meetupMembers, .(dayOfWeek=format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  })
  dayOfWeek count
1    Friday   135
2    Monday   287
3  Saturday    80
4    Sunday   102
5  Thursday   187
6   Tuesday   286
7 Wednesday   211

If we want to sort that in descending order by 'count' we can wrap that ddply in another one:

R > ddply(ddply(meetupMembers, .(dayOfWeek=format(joined, "%A")), function(x) {
    count <- length(x$joined)
    data.frame(count = count)
  }), .(count = count* -1))
  dayOfWeek count
1    Monday   287
2   Tuesday   286
3 Wednesday   211
4  Thursday   187
5    Friday   135
6    Sunday   102
7  Saturday    80

From reading a bit about ddply I gather that its slower than using some other approaches e.g. data.table but I’m not dealing with much data so it’s not an issue yet.

Once I got the hang of how it worked ddply was quite nice to work with so I think I’ll have a go at translating some of my other code to use it now.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.