R: dplyr - Select 'random' rows from a data frame
Frequently I find myself wanting to take a sample of the rows in a data frame where just taking the head isn’t enough.
Let’s say we start with the following data frame:
data = data.frame(
letter = sample(LETTERS, 50000, replace = TRUE),
number = sample (1:10, 50000, replace = TRUE)
)
And we’d like to sample 10 rows to see what it contains. We’ll start by generating 10 random numbers to represent row numbers using the runif function:
> randomRows = sample(1:length(data[,1]), 10, replace=T)
> randomRows
[1] 8723 18772 4964 36134 27467 31890 16313 12841 49214 15621
We can then pass that list of row numbers into dplyr’s http://rpackages.ianhowson.com/cran/dplyr/man/slice.html function like so:
> data %>% slice(randomRows)
letter number
1 Z 4
2 F 1
3 Y 6
4 R 6
5 Y 4
6 V 10
7 R 6
8 D 6
9 J 7
10 E 2
If we’re using that code throughout our code then we might want to pull out a function like so:
pickRandomRows = function(df, numberOfRows = 10) {
df %>% slice(runif(numberOfRows,0, length(df[,1])))
}
And then call it like so:
> data %>% pickRandomRows()
letter number
1 W 5
2 Y 3
3 E 6
4 Q 8
5 M 9
6 H 9
7 E 10
8 T 2
9 I 5
10 V 4
> data %>% pickRandomRows(7)
letter number
1 V 7
2 N 4
3 W 1
4 N 8
5 G 7
6 V 1
7 N 7
Update
Antonios pointed out via email that we could just make use of the in-built sample_n function which I didn’t know about until now:
> data %>% sample_n(10)
letter number
29771 U 1
48666 T 10
30635 A 1
34865 X 7
20140 A 3
41715 T 10
43786 E 10
18284 A 7
21406 S 8
35542 J 8
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.