R: Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, : zero non-NA points
I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.
Our dataset has a class imbalance, i.e. most people RSVP 'no' to events. That can make the accuracy score misleading: a model that predicts 'no' every time would still appear highly accurate.
Source: local data frame [2 x 2]

  attended     n
     (dbl) (int)
1        0  1541
2        1    53
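To make the imbalance concrete, here is a minimal sketch that computes the accuracy of a model that always predicts 'no', using the counts above. The variable names are only illustrative.

```r
# Class counts taken from the table above
counts <- c(no = 1541, yes = 53)

# A model that always predicts 'no' is right for every 'no' row
baseline_accuracy <- counts[["no"]] / sum(counts)

round(baseline_accuracy, 3)
# [1] 0.967
```

So roughly 97% accuracy is available without learning anything about who actually attends.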
I up-sampled the data using caret's upSample function (http://www.inside-r.org/packages/cran/caret/docs/upSample) to avoid this:
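library(caret)   # provides upSample
library(dplyr)   # provides %>% and select
attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
# upSample returns the labels in a 'Class' column; copy them back as numeric 0/1
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))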
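For intuition, up-sampling can be sketched in base R as resampling the minority class with replacement until the classes are the same size. This is not caret's implementation, just an illustration of the idea; the toy data frame and every name in it are hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data: 8 non-attendees and 2 attendees
toy <- data.frame(
  rsvp_count = 1:10,
  attended   = c(rep(0, 8), rep(1, 2))
)

# Group row indices by class and find the size of the largest class
rows_by_class <- split(seq_len(nrow(toy)), toy$attended)
target_size   <- max(lengths(rows_by_class))

# Keep every original row, then draw extra rows (with replacement)
# from each smaller class until it reaches the target size
balanced_rows <- unlist(lapply(rows_by_class, function(rows) {
  n_extra <- target_size - length(rows)
  extra   <- rows[sample.int(length(rows), n_extra, replace = TRUE)]
  c(rows, extra)
}))

balanced <- toy[balanced_rows, ]
table(balanced$attended)
```

In this sketch the rows of `balanced` come out grouped by class, so taking a contiguous slice of rows can easily yield a single class.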
I then trained a logistic regression model, but when I tried to plot the ROC curve (to get at the area under the curve) I ran into trouble:
library(ROCR)   # provides prediction and performance
# predicted probabilities for the test set
p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
zero non-NA points
I don’t have any NA values in my data frame, so this message was a bit confusing to start with. As usual, Stack Overflow came to the rescue with the suggestion that I was probably missing positive or negative values for the dependent variable, i.e. 'attended'.
A quick count on the test data frame using dplyr confirmed my mistake:
> test %>% count(attended)
Source: local data frame [1 x 2]

  attended     n
     (dbl) (int)
1        1   582
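A small guard along these lines could catch the problem before the ROC calculation runs. `check_both_classes` is a hypothetical helper written for this sketch, not part of ROCR or dplyr.

```r
# Hypothetical helper: stop early if a label vector has fewer than two classes
check_both_classes <- function(labels) {
  classes <- unique(labels[!is.na(labels)])
  if (length(classes) < 2) {
    stop("labels contain only one class: ", paste(classes, collapse = ", "))
  }
  invisible(TRUE)
}

# Passes silently when both classes are present
check_both_classes(c(0, 1, 1, 0))

# Mimics the test set above: 582 rows, all with attended = 1
result <- tryCatch(
  check_both_classes(rep(1, 582)),
  error = function(e) conditionMessage(e)
)
print(result)
```

Calling something like `check_both_classes(test$attended)` before `prediction` would give a clearer message than the `approxfun` error.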
I’ll have to shuffle the rows of the data frame and then reassign my training and test data frames to work around it.
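A minimal sketch of that workaround in base R might look like the following. The stand-in data frame, the seed, and the 70/30 split ratio are all assumptions for illustration, not values from my actual model.

```r
set.seed(123)  # arbitrary seed so the shuffle is reproducible

# Hypothetical stand-in for the up-sampled data, with rows grouped by class
upsampled <- data.frame(
  feature  = rnorm(200),
  attended = c(rep(0, 100), rep(1, 100))
)

# Shuffle the rows by indexing with a random permutation of row numbers
shuffled <- upsampled[sample(nrow(upsampled)), ]

# Split into training and test sets (70/30 is an assumed ratio)
split_at <- floor(0.7 * nrow(shuffled))
train <- shuffled[seq_len(split_at), ]
test  <- shuffled[(split_at + 1):nrow(shuffled), ]

# Both classes should now show up in the test set
table(test$attended)
```

A stratified split, for example with caret's createDataPartition, would be another way to make sure both classes land in each set.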
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.