scikit-learn: Building a multi-class classification ensemble
For the Kaggle Spooky Author Identification competition I wanted to combine multiple classifiers into an ensemble and found the VotingClassifier, which does exactly that.
We need to predict the probability that a sentence was written by each of three authors, so the VotingClassifier needs to make a 'soft' prediction. If we only needed to know the most likely author we could have it make a 'hard' prediction instead.
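Soft voting averages the class probabilities produced by each classifier, whereas hard voting gives each classifier a single vote for its most likely class. A minimal sketch of the difference, using made-up probabilities for two hypothetical classifiers:

import numpy as np

# Made-up per-author probabilities from two hypothetical classifiers
clf_a = np.array([0.5, 0.3, 0.2])
clf_b = np.array([0.2, 0.6, 0.2])

# Soft voting: average the probability vectors, then take the argmax
soft_probs = (clf_a + clf_b) / 2               # [0.35, 0.45, 0.2] -> class 1
# Hard voting: each classifier votes for its own argmax class
hard_votes = [clf_a.argmax(), clf_b.argmax()]  # [0, 1] -> a tie to break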
We start with two classifiers that generate different n-gram based features: one feeds unigram and bigram counts into a Naive Bayes model, the other feeds unigram counts into a logistic regression model. The code for those is as follows:
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Unigram and bigram counts fed into a Naive Bayes classifier
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# Unigram counts fed into a logistic regression classifier
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
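To see what those n-gram features look like, here's a quick check of the vocabulary the first CountVectorizer builds from a toy sentence (get_feature_names_out needs scikit-learn 1.0 or later; the sentence is just an illustration):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(["the raven tapped"])
print(cv.get_feature_names_out())
# ['raven' 'raven tapped' 'tapped' 'the' 'the raven']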
We can combine those classifiers together like this:
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]

# voting="soft" averages the classifiers' predicted probabilities
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
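Once fitted, mixed_pipe behaves like any other scikit-learn classifier: predict_proba returns one row of per-author probabilities, averaged across the two pipelines. A quick smoke test on toy data (the sentences and author labels here are made up, not the competition data):

# Toy sentences and author labels, purely for illustration
texts = ["the raven tapped", "the shadow crept closer",
         "the sea called me", "once upon a midnight"]
authors = ["EAP", "HPL", "MWS", "EAP"]

mixed_pipe.fit(texts, authors)
print(mixed_pipe.predict_proba(["the raven called"]))
# one row of three probabilities, columns ordered as in mixed_pipe.classes_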
Now it’s time to test our ensemble. I got the code for the test function from Sohier Dane's tutorial.
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

Y_COLUMN = "author"
TEXT_COLUMN = "text"

def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # shuffle=True is needed for random_state to take effect
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    losses = []
    accuracies = []
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        nlp_pipeline.fit(X_train, y_train)
        # log loss needs probabilities, accuracy needs hard predictions
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))
    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)
Let’s run the script:
kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849
Looks good.
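As a reminder of how to read those numbers: log loss rewards well-calibrated probabilities and heavily penalises confident mistakes, which is why it's the competition metric. A toy illustration (the labels and probabilities are made up):

from sklearn import metrics

y_true = ["EAP", "HPL"]
labels = ["EAP", "HPL", "MWS"]

# Confident and correct: low loss (~0.105)
print(metrics.log_loss(y_true, [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]], labels=labels))
# Confident and wrong: much higher loss (~3.0)
print(metrics.log_loss(y_true, [[0.05, 0.9, 0.05], [0.9, 0.05, 0.05]], labels=labels))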
I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post we’ll look at how to use GridSearchCV to work that out.