Mahout: Using a saved Random Forest/DecisionTree
One of the things I wanted to do while playing around with random forests in Mahout was to save a forest and then use it again, which is something Mahout does cater for.
It was actually much easier than I’d expected. Assuming that we already have a DecisionForest (https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/classifier/df/DecisionForest.java) built, we just need the following code to save it to disk:
int numberOfTrees = 1;
Data data = loadData(...);
DecisionForest forest = buildForest(numberOfTrees, data);
String path = "saved-trees/" + numberOfTrees + "-trees.txt";
// try-with-resources flushes and closes the stream even if write() throws
try (DataOutputStream dos = new DataOutputStream(new FileOutputStream(path))) {
    forest.write(dos);
}
When I looked through the API for a way to load that file back into memory, all the public methods seemed to require Hadoop in some way. I thought that was going to be a problem, as I’m not using it.
For example the signature for DecisionForest.load reads like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
public static DecisionForest load(Configuration conf, Path forestPath) throws IOException { }
As it turns out, though, you can just pass a freshly constructed Configuration and a normal file system path, and the forest will be loaded:
int numberOfTrees = 1;
Configuration config = new Configuration();
Path path = new Path("saved-trees/" + numberOfTrees + "-trees.txt");
DecisionForest forest = DecisionForest.load(config, path);
Much easier than expected!
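To make the save/load round trip a bit more concrete without needing Mahout or Hadoop on the classpath, here is a minimal sketch using only the JDK. It assumes that forest.write follows the usual DataOutput pattern, where fields are written in a fixed order and read back in the same order. TinyModel and all of its methods are hypothetical stand-ins for illustration, not part of Mahout's API.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical stand-in for a model that serialises itself to a DataOutput. */
final class TinyModel {
    private final double[] weights;

    TinyModel(double[] weights) {
        this.weights = weights.clone();
    }

    double[] weights() {
        return weights.clone();
    }

    /** Writes a length prefix followed by each weight. */
    void write(DataOutput out) throws IOException {
        out.writeInt(weights.length);
        for (double w : weights) {
            out.writeDouble(w);
        }
    }

    /** Reads the fields back in exactly the order write() emitted them. */
    static TinyModel read(DataInput in) throws IOException {
        int n = in.readInt();
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            w[i] = in.readDouble();
        }
        return new TinyModel(w);
    }

    /** Saves the model to a file; the stream is closed even if write() fails. */
    static void save(TinyModel model, File file) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            model.write(dos);
        }
    }

    /** Loads a model previously written by save(). */
    static TinyModel load(File file) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return read(dis);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("tiny-model", ".bin");
        file.deleteOnExit();

        TinyModel original = new TinyModel(new double[] {0.5, 1.5, -2.0});
        save(original, file);
        TinyModel loaded = load(file);

        // The loaded copy should carry the same weights as the original.
        System.out.println(Arrays.equals(original.weights(), loaded.weights()));
    }
}
```

The same shape applies to the forest: one method writes the state to a stream, and a matching method rebuilds the object by reading the stream in the same order.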
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.