scikit-learn: Building a multi-class classification ensemble
For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble and found the VotingClassifier, which does exactly that.
We need to predict the probability that a sentence was written by each of three authors, so the VotingClassifier needs to make a 'soft' prediction. If we only needed to know the most likely author we could have it make a 'hard' prediction instead.
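Soft voting averages the class probabilities produced by each classifier, whereas hard voting gives each classifier a single vote for its most likely class. A minimal sketch of the difference, using made-up probabilities for two hypothetical classifiers:

import numpy as np

# Made-up per-author probabilities from two hypothetical classifiers
clf_a = np.array([0.5, 0.3, 0.2])
clf_b = np.array([0.2, 0.6, 0.2])

# Soft voting: average the probability vectors, then take the argmax
soft_probs = (clf_a + clf_b) / 2               # [0.35, 0.45, 0.2] -> class 1
# Hard voting: each classifier votes for its own argmax class
hard_votes = [clf_a.argmax(), clf_b.argmax()]  # [0, 1] -> a tie to break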
We start with two classifiers that generate different n-gram based features: one feeds unigram and bigram counts into a Naive Bayes model, the other feeds unigram counts into a logistic regression model. The code for those is as follows:
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Unigram and bigram counts fed into a Naive Bayes classifier
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# Unigram counts fed into a logistic regression classifier
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
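To see what those n-gram features look like, here's a quick check of the vocabulary the first CountVectorizer builds from a toy sentence (get_feature_names_out needs scikit-learn 1.0 or later; the sentence is just an illustration):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(["the raven tapped"])
print(cv.get_feature_names_out())
# ['raven' 'raven tapped' 'tapped' 'the' 'the raven']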
We can combine those classifiers together like this:
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]

# voting="soft" averages the classifiers' predicted probabilities
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
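Once fitted, mixed_pipe behaves like any other scikit-learn classifier: predict_proba returns one row of per-author probabilities, averaged across the two pipelines. A quick smoke test on toy data (the sentences and author labels here are made up, not the competition data):

# Toy sentences and author labels, purely for illustration
texts = ["the raven tapped", "the shadow crept closer",
         "the sea called me", "once upon a midnight"]
authors = ["EAP", "HPL", "MWS", "EAP"]

mixed_pipe.fit(texts, authors)
print(mixed_pipe.predict_proba(["the raven called"]))
# one row of three probabilities, columns ordered as in mixed_pipe.classes_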
Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane's tutorial.
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

Y_COLUMN = "author"
TEXT_COLUMN = "text"

def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # shuffle=True is needed for random_state to take effect
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    losses = []
    accuracies = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        nlp_pipeline.fit(X_train, y_train)
        # log loss needs probabilities, accuracy needs hard predictions
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)
Let’s run the script:
kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849
Looks good.
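As a reminder of how to read those numbers: log loss rewards well-calibrated probabilities and heavily penalises confident mistakes, which is why it's the competition metric. A toy illustration (the labels and probabilities are made up):

from sklearn import metrics

y_true = ["EAP", "HPL"]
labels = ["EAP", "HPL", "MWS"]

# Confident and correct: low loss (~0.105)
print(metrics.log_loss(y_true, [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]], labels=labels))
# Confident and wrong: much higher loss (~3.0)
print(metrics.log_loss(y_true, [[0.05, 0.9, 0.05], [0.9, 0.05, 0.05]], labels=labels))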
I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearchCV to work that out.