Kaggle Digit Recognizer: A feature extraction #fail
I’ve written a few blog posts about our attempts at the Kaggle Digit Recognizer problem, and one thing we haven’t yet tried is feature extraction.
Feature extraction in this context means generating new features to train a classifier with, rather than relying solely on the raw pixel values we were provided.
Every week Jen would try to persuade me that we should try it out, but it wasn’t until I was flicking through the notes from the Columbia Data Science class that it struck home:
The Space between the Data Set and the Algorithm

“Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well. One needs to conduct exploratory data analysis as I’ve emphasized; and conduct feature selection as Will Cukierski emphasized.”
The opening sentence, about going straight from a data set to applying an algorithm, describes exactly what we’ve been doing!
There were some examples of feature extraction on the Kaggle forums, so I thought I’d try creating some other features using R.
I created features for the number of non-zero pixels, the number of pixels with the maximum value of 255, the mean pixel value, and the mean value of the middle pixels of each image.
The code reads like this:
initial <- read.csv("train.csv", header = TRUE)

# Work on the pixel columns only (the first column is the label) so that
# neither the label nor the derived features leak into the calculations
pixels <- initial[, -1]

initial$nonZeros <- apply(pixels, 1, function(row) sum(row != 0))
initial$fullHouses <- apply(pixels, 1, function(row) sum(row == 255))
initial$meanPixels <- rowMeans(pixels)
initial$middlePixels <- rowMeans(pixels[, 200:500]) # a rough slice of the middle of the image
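A quick summary() over the new columns is a handy sanity check before going any further:

# Distributions of the derived features across the training set
summary(initial[, c("nonZeros", "fullHouses", "meanPixels", "middlePixels")])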
I then wrote those features out into a CSV file like so:
newFeatures <- subset(initial, select = c(label, nonZeros, meanPixels, fullHouses, middlePixels))
write.table(newFeatures, file = "feature-extraction.txt", row.names = FALSE, sep = ",")
I then trained a 100-tree random forest using Mahout to see whether or not we could get any sort of accuracy using these features.
Unfortunately, the accuracy on the cross-validation set (10% of the training data) was only 24%, which is pretty useless, so it’s back to the drawing board!
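We ran the forest through Mahout, but for anyone who’d rather stay in R, a rough equivalent of the experiment (a sketch, assuming the randomForest package is installed and using a 90/10 split like ours) would look something like this:

library(randomForest)

features <- read.csv("feature-extraction.txt")
features$label <- as.factor(features$label) # classification rather than regression

# Hold out 10% of the rows as a cross validation set
set.seed(42)
testIndexes <- sample(nrow(features), round(0.1 * nrow(features)))
train <- features[-testIndexes, ]
test <- features[testIndexes, ]

# 100 tree random forest, mirroring the Mahout run
forest <- randomForest(label ~ ., data = train, ntree = 100)
predictions <- predict(forest, test)
mean(predictions == test$label) # accuracy on the held out 10%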
Our next task is to work out whether we can derive features which correlate more strongly with the label values, or whether combining the new features with the existing pixel values has any impact.
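As a first stab at the correlation idea, something like this works against the features file from above (bearing in mind that the label is really categorical, so a linear correlation is only a crude signal):

features <- read.csv("feature-extraction.txt")

# Correlation of each derived feature with the label; values near zero
# suggest the feature carries little signal on its own
sapply(features[, -1], function(column) cor(column, features$label))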
As you can probably tell, I don’t really understand how you should go about extracting features, so if anybody has ideas or papers/articles I can read to learn more, please let me know in the comments!