13 Feb 2012

Reading Code: boilerpipe

I’m a big fan of the iPad application Flipboard, especially it’s ability to filter out the non important content on web pages and just show me the main content so I’ve been looking around at open source libraries which provide that facility.

I came across a quora page where someone had asked how this was done and the suggested libraries were readability, Goose and boilerpipe.

boilerpipe was written by Christian Kohlschütter and has a corresponding paper and video as well.

At a very high level this is my understanding of what the code is doing:

It is based around a pipes/filters architectural style whereby a TextDocument is passed through filters which perform transformations on it. After they’ve all been applied we can retrieve the main content of the article via a method call.

I’ve used the pipes/filters approach when playing around with clojure/F# but the problems I was working on were much smaller than this.

In the code there around about 7 or 8 fields being manipulated so I did sometimes find it difficult to know how fields could end up with certain values which often involved looking at other filters and seeing what they did to the document.

I always thought it should be possible to view each filter completely independently but when there’s state manipulation involved that doesn’t seem to be the case.

Luckily Christian has comments in his code which explain how you might compose the different filters again and why certain filters don’t make sense on their own, only if they’re combined with others.

For example the BlockProximityFusion class, which is used to merge together adjacent text blocks, contains the following comment:

Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.

I suppose the same thing could also have been achieved with some automated tests showing scenarios where different filters are composed.

Christian makes use of the logical OR ("

") operator throughout the code base to ensure that all the filters get executed even if a previous one has successfully made changes to the document.

For example the main entry point into the code is ArticleExtractor which reads like this:

public final class ArticleExtractor extends ExtractorBase {
    public static final ArticleExtractor INSTANCE = new ArticleExtractor();

    public static ArticleExtractor getInstance() {
        return INSTANCE;
    }

    public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
        return TerminatingBlocksFinder.INSTANCE.process(doc)
                | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
                | NumWordsRulesClassifier.INSTANCE.process(doc)
                // cut for brevity
                | ExpandTitleToContentFilter.INSTANCE.process(doc);
    }
}

I noticed a similar thing in the underscore.js code but in that case the '&&' operator was used to execute code on the right hand side only if the expression on the left had been successful.

If we’re not using any libraries that simulate first class collections in Java (totallylazy/http://code.google.com/p/guava-libraries/[Guava] for example) then something like this could also work:

public final class ArticleExtractor extends ExtractorBase {
    ...
    public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
        List<BoilerpipeFilter> filters = asList(TerminatingBlocksFinder.INSTANCE, new DocumentTitleMatchClassifier(doc.getTitle()), ExpandTitleToContentFilter.INSTANCE);
        boolean result = true;

        for (BoilerpipeFilter filter : filters) {
            result = result | filter.process(doc);
        }

        return result;
    }
}

I originally started just browsing the code and thought I roughly understood it before realising I couldn’t explain what it actually did. I therefore changed my approach and started writing some unit tests around it to see what the current behaviour was.

From what I can tell the main algorithm in the code is contained inside NumWordsRulesClassifier where each text block in the document is classified as being either content or non content.

I wrote tests covering all the scenarios in this class and then refactored the code to see if I could make it a bit more expressive. I ended up with this:

private boolean currentBlockHasContent(final TextBlock prev, final TextBlock curr, final TextBlock next) {
    if (fewLinksInCurrentBlock(curr)) {
        if (fewLinksInPreviousBlock(prev)) {
            return curr.getNumWords() > 16 || next.getNumWords() > 15 || prev.getNumWords() > 4;
        } else {
            return curr.getNumWords() > 40 || next.getNumWords() > 17;
        }
    }
    return false;
}

private boolean fewLinksInCurrentBlock(TextBlock curr) {
    return curr.getLinkDensity() <= 0.333333;
}

private boolean fewLinksInPreviousBlock(TextBlock prev) {
    return prev.getLinkDensity() <= 0.555556;
}

The logic is all based around examining the text blocks immediately before and after the current one to work out whether or not it’s likely to be boiler plate content.

The logic around the next/previous text blocks is written quite imperatively and feels like it could be made more concise by using something like F#'s 'Seq.windowed' over the collection but I can’t quite see how at the moment!

You can read more about the algorithm on pages 4-7 of the paper.

From running the code against a few articles I’ve got saved to ReadItLater it does seem to work reasonably well.

Overall...

I haven’t read every single bit of the code base but from what I have read I think boilerpipe is a pretty cool library and the approach to filtering content is neat.

I found it especially useful to be able to read parts of the paper and then go and look at the corresponding code. Often that type of thing remains up to the imagination of the reader from my experience!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.