Mahout: Parallelising the creation of DecisionTrees
A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser Problem. After seeing how long it took to create forests with 500+ trees, I wanted to see if this could be sped up by parallelising the process.
From looking at the DecisionForest code (https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/classifier/df/DecisionForest.java) it seemed like it should be possible to create lots of small forests and then combine them.
After unsuccessfully trying to achieve this by using DecisionForest directly, I decided to just copy all the code from that class into my own version, MultiDecisionForest, which allowed me to do it.
The code to build up the forest ends up looking like this:
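// MultiDecisionForest is my copy of Mahout's DecisionForest
List<Node> trees = new ArrayList<Node>();
// Repeat the next two lines for each small forest that was saved to disk
MultiDecisionForest smallForest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
trees.addAll(smallForest.getTrees());
MultiDecisionForest forest = new MultiDecisionForest(trees);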
We can then use forest to classify values in a test data set and it seems to work reasonably well.
I wanted to avoid putting any threading code in, so I made use of GNU parallel, which is available on Mac OS X via brew install parallel and on Ubuntu by adding the following repository to /etc/apt/sources.list…
deb http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main
deb-src http://ppa.launchpad.net/ieltonf/ppa/ubuntu oneiric main
…followed by an apt-get update and apt-get install parallel.
I then wrote a script to parallelise the creation of the forests:
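#!/bin/bash
# Usage: pass the number of forests to build as the first argument
start=$(date)
startTime=$(date '+%s')
numberOfRuns=$1
# Run build-forest.sh once per number produced by seq, at most 8 at a time
seq 1 "${numberOfRuns}" | parallel -P 8 "./build-forest.sh"
end=$(date)
endTime=$(date '+%s')
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: $((endTime - startTime)) seconds"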
And build-forest.sh, which each job runs:
#!/bin/bash
java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar main.java.MahoutPlaybox
It should be possible to achieve the same thing with the -P option of xargs, but unfortunately I wasn’t able to achieve the same success with that command.
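For comparison, here is a minimal sketch of how the xargs variant could look, assuming an xargs that supports -P (both the GNU and BSD versions do). The stub script is a hypothetical stand-in for build-forest.sh so that the snippet runs on its own, and it has not been tried against the real Mahout job.

```shell
#!/bin/bash
# Hypothetical stand-in for ./build-forest.sh so the sketch runs on its own.
stub=$(mktemp)
echo 'echo "building forest $1"' > "$stub"

run_with_xargs() {
  # -P 8: run up to 8 processes at once; -n 1: pass one run number per call
  seq 1 "$1" | xargs -P 8 -n 1 sh "$stub"
}

# Prints four "building forest N" lines, in whatever order the jobs finish
run_with_xargs 4
```

With the real script, the pipeline would presumably become seq 1 "${numberOfRuns}" | xargs -P 8 -n 1 ./build-forest.sh.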
I hadn’t come across the seq command until today, but it works quite well here as a way of specifying how many times we want to call the script.
I was probably able to achieve about a 30% speed increase when running this on my MacBook Air. There was a greater increase when running on a high-CPU AWS instance, although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.
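One way to get more information about killed jobs is to record the exit status of each run. By shell convention a process terminated by a signal reports 128 plus the signal number, so 137 would point to SIGKILL. The sketch below is only an assumption about how that could be wired in, not something used for the runs described here; run_and_log and the log location are hypothetical names.

```shell
#!/bin/bash
# Hypothetical log location; a real run would probably use a fixed file name.
log=$(mktemp)

# Hypothetical wrapper: run a command, then append its exit status to the log.
run_and_log() {
  run=$1; shift
  status=0
  "$@" || status=$?
  if [ "$status" -gt 128 ]; then
    # 128 + N means the process was terminated by signal N (137 = SIGKILL)
    echo "run ${run}: killed by signal $((status - 128))" >> "$log"
  else
    echo "run ${run}: exit status ${status}" >> "$log"
  fi
}

# Demo with stand-in commands instead of the real java invocation
run_and_log 1 true
run_and_log 2 sh -c 'kill -9 $$'
cat "$log"
```

Saved as a script of its own, a wrapper like this could sit between parallel and build-forest.sh, so that every job leaves a line in the log whether or not it finishes cleanly.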
Sadly, even with a new classifier with a massive number of trees, I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.