Python: Making scikit-learn and pandas play nice
In http://www.markhneedham.com/blog/2013/10/30/kaggle-titanic-python-pandas-attempt/[the last post], where I wrote about Nathan and my attempts at the http://www.kaggle.com/c/titanic-gettingStarted[Kaggle Titanic Problem], I mentioned that our next step was to try out http://scikit-learn.org/stable/tutorial/[scikit-learn], so I thought I should summarise where we've got up to.
We needed to write a classification algorithm to work out whether a person onboard the Titanic survived and luckily scikit-learn has http://scikit-learn.org/stable/supervised_learning.html#supervised-learning[extensive documentation on each of the algorithms].
Unfortunately almost all those examples use http://www.numpy.org/[numpy] data structures and we'd loaded the data using http://pandas.pydata.org/[pandas] and didn't particularly want to switch back!
Luckily it was really easy to get the data into numpy format by calling 'values' on the pandas data structure, something we learnt from http://stackoverflow.com/questions/17682613/how-to-convert-a-pandas-dataframe-subset-of-columns-and-rows-into-a-numpy-array[a reply on Stack Overflow].
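As a quick illustrative check (using the same train.csv file we load below), calling 'values' on a pandas Series or DataFrame hands back the underlying numpy array:

~python
import pandas as pd

train_df = pd.read_csv("train.csv")

# 'values' exposes the numpy array underneath a pandas structure
print(type(train_df["Fare"]))         # pandas Series
print(type(train_df["Fare"].values))  # numpy.ndarray
~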
For example, if we were to wire up an http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html[ExtraTreesClassifier] which worked out survival based on the 'Fare' and 'Pclass' attributes, we could write the following code:

~python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score

train_df = pd.read_csv("train.csv")

et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)

columns = ["Fare", "Pclass"]

labels = train_df["Survived"].values
features = train_df[list(columns)].values

et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
print("{0} -> ET: {1})".format(columns, et_score))
~
To start with we read in the CSV file, which looks like this:

~bash
$ head -n5 train.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
~

Next we create our classifier, which "fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting", i.e. a better version of a http://www.markhneedham.com/blog/2012/10/27/kaggle-digit-recognizer-mahout-random-forest-attempt/[random forest].
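As an aside, if we wanted to compare the two, a hypothetical side-by-side check (not part of the original script) could score a plain random forest on the same `features` and `labels` defined above:

~python
# Hypothetical comparison: a plain random forest on the same features,
# scored the same way as the extra-trees classifier above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf_score = cross_val_score(rf, features, labels, n_jobs=-1).mean()
print("{0} -> RF: {1}".format(columns, rf_score))
~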
On the next line we describe the features we want the classifier to use, then we convert the labels and features into numpy format so we can pass them to the classifier.
Finally we call the http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html[cross_val_score] function, which splits our training data set into training and test components, trains the classifier against the former and checks its accuracy using the latter.
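Under the hood cross_val_score returns one accuracy score per fold (3 folds by default in this version of scikit-learn) and the snippet above just averages them, so we could also look at the individual fold scores:

~python
# One accuracy per fold; .mean() collapses them into the single
# number printed by the script above.
scores = cross_val_score(et, features, labels, n_jobs=-1)
print(scores)         # e.g. 3 per-fold accuracies
print(scores.mean())
~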
If we run this code we'll get roughly the following output:

~bash
$ python et.py

['Fare', 'Pclass'] -> ET: 0.687991021324)
~
This is actually a worse accuracy than we’d get by saying that females survived and males didn’t.
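We could sanity check that baseline with a quick (hypothetical, not from the original script) calculation against the training set, which scores roughly 0.79 for the gender-only rule:

~python
# Accuracy of the rule "all females survived, all males didn't", computed
# on the training set (Sex still holds 'male'/'female' at this point).
gender_prediction = (train_df["Sex"] == "female").astype(int)
print((train_df["Survived"] == gender_prediction).mean())  # roughly 0.786
~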
We can introduce 'Sex' into the classifier by adding it to the list of columns:

~python
columns = ["Fare", "Pclass", "Sex"]
~
If we re-run the code we'll get the following error:

~bash
$ python et.py
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (514, 0))
...
Traceback (most recent call last):
  File "et.py", line 14, in
~
This is a slightly verbose way of telling us that we can't pass non-numeric features to the classifier - in this case 'Sex' has the values 'female' and 'male'. We'll need to replace those values with numeric equivalents:

~python
train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
~
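An equivalent way of doing the same substitution (an alternative, not what the original script used) is pandas' map:

~python
# map replaces each value via the dictionary; any value not in the
# dictionary would become NaN.
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1})
~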
Now if we re-run the classifier we'll get a slightly more accurate prediction:

~bash
$ python et.py

['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
~
The next step is to use the classifier against the test data set, so let's load the data and run the prediction:

~python
test_df = pd.read_csv("test.csv")

et.fit(features, labels)
et.predict(test_df[columns].values)
~
Now if we run that:

~bash
$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 22, in
~
which is the same problem we had earlier! We need to replace the 'male' and 'female' values in the test set too, so we'll pull out a function to do that:

~python
def replace_non_numeric(df):
    df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
    return df
~
Now we'll call that function with our training and test data frames:

~python
train_df = replace_non_numeric(pd.read_csv("train.csv"))
...
test_df = replace_non_numeric(pd.read_csv("test.csv"))
~
If we run the program again:

~bash
$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 26, in
~
There are missing values in the test set, so we'll replace those with average values from our training set using an http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html[Imputer]:

~python
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(features)

test_df = replace_non_numeric(pd.read_csv("test.csv"))

et.fit(features, labels)
print(et.predict(imp.transform(test_df[columns].values)))
~
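To see what the Imputer is actually doing, here's a minimal sketch on made-up data with the same 'mean' strategy: fit learns each column's mean, and transform substitutes that mean for any NaN it finds.

~python
import numpy as np
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1.0, 2.0], [3.0, 4.0]])      # learns column means: 2.0 and 3.0
print(imp.transform([[np.nan, 6.0]]))  # -> [[ 2.  6.]]
~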
If we run that it completes successfully:

~bash
$ python et.py
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
[0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0]
~
The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle.
The type of those values is 'numpy.ndarray', which we can convert to a pandas Series quite easily:

~python
predictions = et.predict(imp.transform(test_df[columns].values))
test_df["Survived"] = pd.Series(predictions)
~
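This works because the new Series gets a default integer index that lines up with the default index pandas gave test_df when it read the CSV. An equivalent, more explicit form would be:

~python
# Pin the Series to test_df's index so the row alignment is unambiguous
test_df["Survived"] = pd.Series(predictions, index=test_df.index)
~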
We can then write the 'PassengerId' and 'Survived' columns to a file:

~python
test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False)
~
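(If you're following along on a more recent pandas release, the 'cols' keyword was since renamed to 'columns', so the equivalent call would be:)

~python
# Newer pandas uses 'columns' instead of 'cols'
test_df.to_csv("foo.csv", columns=['PassengerId', 'Survived'], index=False)
~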
The output file looks like this:

~bash
$ head -n5 foo.csv
PassengerId,Survived
892,0
893,1
894,0
~