R: dplyr - segfault cause 'memory not mapped'
In my continued playing around with web logs in R I wanted to process the logs for a day and see what the most popular URIs were.
I first read in all the lines using the read_lines function in readr and put the vector it produced into a data frame so I could process it using dplyr.
library(readr)
dlines = data.frame(column = read_lines("~/projects/logs/2015-06-18-22-docs"))
In the previous post I showed some code to extract the URI from a log line. I extracted this code out into a function and adapted it so that I could pass in a list of values instead of a single value:
extract_uri = function(log) {
parts = str_extract_all(log, "\"[^\"]*\"")
return(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character))
}
Next I ran the following function to count the number of times each URI appeared in the logs:
library(dplyr)
pages_viewed = dlines %>%
mutate(uri = extract_uri(column)) %>%
count(uri) %>%
arrange(desc(n))
This crashed my R process with the following error message:
segfault cause 'memory not mapped'
I narrowed it down to a problem when doing a group by operation on the 'uri' field and came across this post which suggested that it was handled more cleanly in more recently version of dplyr.
I upgraded to 0.4.2 and tried again:
## Error in eval(expr, envir, enclos): cannot group column uri, of class 'list'
That makes more sense. We’re probably returning a list from extract_uri rather than a vector which would fit nicely back into the data frame. That’s fixed easily enough by unlisting the result:
extract_uri = function(log) {
parts = str_extract_all(log, "\"[^\"]*\"")
return(unlist(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character)))
}
And now when we run the count function it’s happy again, good times!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.