R: data.table - Finding the maximum row
In my continued playing around with the R data.table package I wanted to find the maximum row based on one of the columns, grouped by another column, and then return back the whole row.
We’ll use the following data table to illustrate:
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"),
price = c(10000, 12500, 18000, 245000, 512000, 1000000),
date = c("Day 1", "Day 7", "Day 10", "Day 3", "Day 5", "Day 12"))
> blogDT[, lag.price := c(NA, price[-.N]), by = name]
> blogDT[, diff := price - lag.price]
> blogDT
name price date lag.price diff
1: Property 1 10000 Day 1 NA NA
2: Property 1 12500 Day 7 10000 2500
3: Property 1 18000 Day 10 12500 5500
4: Property 2 245000 Day 3 NA NA
5: Property 2 512000 Day 5 245000 267000
6: Property 2 1000000 Day 12 512000 488000
I wanted to find the biggest difference in 'price' and 'lag.price' grouped by the 'name' column.
If we just want to get the max 'diff' grouped by 'name' it’s quite easy:
> blogDT[!is.na(diff), .(max = max(diff)), keyby = name]
name max
1: Property 1 5500
2: Property 2 488000
However now we’ve lost the information about which 'date' that was on and what the 'price' and 'lag.price' were which is a bit annoying.
If we only want to keep the rows which have the highest 'diff' grouped by 'name', one way to go about this is to add a 'max.diff' column to each row and then filter appropriately. e.g.
> maxDT = blogDT[!is.na(diff)][, max := max(diff), by = name]
> maxDT
name price date lag.price diff max
1: Property 1 12500 Day 7 10000 2500 5500
2: Property 1 18000 Day 10 12500 5500 5500
3: Property 2 512000 Day 5 245000 267000 488000
4: Property 2 1000000 Day 12 512000 488000 488000
> maxDT[diff == max]
name price date lag.price diff max
1: Property 1 18000 Day 10 12500 5500 5500
2: Property 2 1000000 Day 12 512000 488000 488000
I’ve only been playing with data.table for a few days so if there’s a better way do let me know in the comments.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.