Note
This tutorial is intended to be run in an IPython notebook. It is also available as a notebook file here.
Explaining XGBoost predictions on the Titanic dataset
This tutorial will show you how to analyze predictions of an XGBoost classifier (regression for XGBoost and most scikit-learn tree ensembles are also supported by eli5). We will use the Titanic dataset, which is small and does not have too many features, but is still interesting enough.
We are using XGBoost 0.81 and data downloaded from https://www.kaggle.com/c/titanic/data (it is also bundled in the eli5 repo: https://github.com/TeamHG-Memex/eli5/blob/master/notebooks/titanic-train.csv).
1. Training data
Let’s start by loading the data:
import csv
import numpy as np
with open('titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]
[OrderedDict([('PassengerId', '1'),
('Survived', '0'),
('Pclass', '3'),
('Name', 'Braund, Mr. Owen Harris'),
('Sex', 'male'),
('Age', '22'),
('SibSp', '1'),
('Parch', '0'),
('Ticket', 'A/5 21171'),
('Fare', '7.25'),
('Cabin', ''),
('Embarked', 'S')])]
Variable descriptions:
- Age: Age
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- Fare: Passenger Fare
- Name: Name
- Parch: Number of Parents/Children Aboard
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Sex: Sex
- SibSp: Number of Siblings/Spouses Aboard
- Survived: Survival (0 = No; 1 = Yes)
- Ticket: Ticket Number
Next, shuffle the data and separate features from what we are trying to predict: survival.
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])
all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
891 items total, 38.4% true
We do just minimal preprocessing: convert the obviously continuous Age and Fare variables to floats, and SibSp and Parch to integers. Missing Age values are removed.
for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])
2. Simple XGBoost classifier
Let’s first build a very simple classifier with xgboost.XGBClassifier and sklearn.feature_extraction.DictVectorizer, and check its accuracy with 10-fold cross-validation:
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, clf)
def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted
evaluate(pipeline)
Accuracy: 0.823 ± 0.071
There is one tricky bit about the code above: one may be tempted to just pass dense=True to DictVectorizer: after all, in this case the matrices are small. But this is not a great solution, because we would lose the ability to distinguish features that are missing from features that have a zero value.
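To make this concrete, here is a small toy check (a sketch that is not part of the original notebook; the two passenger dicts are invented). With the default sparse output, a missing Age simply has no stored entry, which XGBoost treats as a missing value, while SibSp=0 is kept as an explicit zero; with dense=True both would show up as a plain 0.0 and become indistinguishable:

toy_vec = DictVectorizer()  # sparse output by default
toy = toy_vec.fit_transform([{'SibSp': 0, 'Age': 30.0}, {'SibSp': 0}])
print(toy_vec.get_feature_names())  # ['Age', 'SibSp']
print(toy.toarray())                # dense view: the missing Age looks like a real zero
print(toy[0].nnz, toy[1].nnz)       # 2 1 - nnz counts stored values, and the second row stores no Age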
3. Explaining weights
In order to calculate a prediction, XGBoost sums predictions of all its trees. The number of trees is controlled by the n_estimators argument and is 100 by default. Each tree is not a great predictor on its own, but by summing across all trees, XGBoost is able to provide a robust estimate in many cases. Here is one of the trees:
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names
0:[Sex=female<-9.53674316e-07] yes=1,no=2,missing=1
    1:[Age<13] yes=3,no=4,missing=4
        3:[SibSp<2] yes=7,no=8,missing=7
            7:leaf=0.145454556
            8:leaf=-0.125
        4:[Fare<26.2687492] yes=9,no=10,missing=9
            9:leaf=-0.151515156
            10:leaf=-0.0727272779
    2:[Pclass=3<-9.53674316e-07] yes=5,no=6,missing=5
        5:[Fare<12.1750002] yes=11,no=12,missing=12
            11:leaf=0.0500000007
            12:leaf=0.175193802
        6:[Fare<24.8083496] yes=13,no=14,missing=14
            13:leaf=0.0365591422
            14:leaf=-0.151999995
We see that this tree checks the Sex, Age, Pclass, Fare and SibSp features. leaf gives the decision of a single tree, and these decisions are summed over all trees in the ensemble.
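As a quick sanity check of this picture (a sketch, not part of the original tutorial; it assumes the default binary:logistic objective with base_score=0.5, whose contribution in margin space is zero), we can verify that the predicted probability is just the logistic sigmoid of the summed leaf values, i.e. of the raw margin:

from scipy.special import expit  # logistic sigmoid
from xgboost import DMatrix

X_valid = vec.transform(valid_xs)
margin = booster.predict(DMatrix(X_valid), output_margin=True)  # sum of leaf values over all trees
print(np.allclose(expit(margin), clf.predict_proba(X_valid)[:, 1], atol=1e-5))  # expect True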
Let’s check feature importances with eli5.show_weights():
from eli5 import show_weights
show_weights(clf, vec=vec)
Weight | Feature |
---|---|
0.4278 | Sex=female |
0.1949 | Pclass=3 |
0.0665 | Embarked=S |
0.0510 | Pclass=2 |
0.0420 | SibSp |
0.0417 | Cabin= |
0.0385 | Embarked=C |
0.0358 | Ticket=1601 |
0.0331 | Age |
0.0323 | Fare |
0.0220 | Pclass=1 |
0.0143 | Parch |
0 | Name=Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) |
0 | Name=Roebling, Mr. Washington Augustus II |
0 | Name=Rosblom, Mr. Viktor Richard |
0 | Name=Ross, Mr. John Hugo |
0 | Name=Rush, Mr. Alfred George John |
0 | Name=Rouse, Mr. Richard Henry |
0 | Name=Ryerson, Miss. Emily Borie |
0 | Name=Ryerson, Miss. Susan Parker "Suzette" |
… 1972 more … |
There are several different ways to calculate feature importances. By default, “gain” is used, that is the average gain of the feature when it is used in trees. Other types are “weight” - the number of times a feature is used to split the data, and “cover” - the average coverage of the feature. You can select the method with the importance_type argument.
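For example, to rank features by the number of splits instead of the default gain (assuming your eli5 version forwards importance_type to the XGBoost explainer, as recent versions do):

show_weights(clf, vec=vec, importance_type='weight')

The resulting ranking will in general differ from the gain-based table above.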
Now we know that the two most important features are Sex=female and Pclass=3, but we still don’t know how XGBoost decides what prediction to make based on their values.
4. Explaining predictions
To get a better idea of how our classifier works, let’s examine individual predictions with eli5.show_prediction():
from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)
y=1 (probability 0.566, score 0.264) top features
Contribution? | Feature | Value |
---|---|---|
+1.673 | Sex=female | 1.000 |
+0.479 | Embarked=S | Missing |
+0.070 | Fare | 7.879 |
-0.004 | Cabin= | 1.000 |
-0.006 | Parch | 0.000 |
-0.009 | Pclass=2 | Missing |
-0.009 | Ticket=1601 | Missing |
-0.012 | Embarked=C | Missing |
-0.071 | SibSp | 0.000 |
-0.073 | Pclass=1 | Missing |
-0.147 | Age | 19.000 |
-0.528 | <BIAS> | 1.000 |
-1.100 | Pclass=3 | 1.000 |
Weight means how much each feature contributed to the final prediction across all trees. The idea for weight calculation is described in http://blog.datadive.net/interpreting-random-forests/; eli5 provides an independent implementation of this algorithm for XGBoost and most scikit-learn tree ensembles.
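To see this additivity programmatically, here is a small sketch (not part of the original tutorial; it relies on the structure of the Explanation object returned by eli5.explain_prediction): the bias plus all per-feature contributions should add up to the score shown in the table header above.

from eli5 import explain_prediction

expl = explain_prediction(clf, valid_xs[1], vec=vec)
fw = expl.targets[0].feature_weights
print(sum(w.weight for w in fw.pos) + sum(w.weight for w in fw.neg))  # ≈ 0.264, the score above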
Here we see that the classifier thinks it’s good to be a female, but bad to travel third class. Some features have “Missing” as value (we are passing show_feature_values=True to view the values): that means the feature was missing, so in this case it’s good to not have embarked in Southampton. This is where our decision to go with sparse matrices comes in handy - we still see that Parch is zero, not missing.
It’s possible to show only the features that are present using the feature_filter argument: it’s a function that accepts a feature name and value, and returns True for features that should be shown:
no_missing = lambda feature_name, feature_value: not np.isnan(feature_value)
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)
y=1 (probability 0.566, score 0.264) top features
Contribution? | Feature | Value |
---|---|---|
+1.673 | Sex=female | 1.000 |
+0.070 | Fare | 7.879 |
-0.004 | Cabin= | 1.000 |
-0.006 | Parch | 0.000 |
-0.071 | SibSp | 0.000 |
-0.147 | Age | 19.000 |
-0.528 | <BIAS> | 1.000 |
-1.100 | Pclass=3 | 1.000 |
5. Adding text features
Right now we treat the Name field as categorical, like other text features. But in this dataset each name is unique, so XGBoost does not use this feature at all, because it’s such a poor discriminator: it’s absent from the weights table in section 3.
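This is easy to verify directly (a quick check that is not in the original notebook; it reuses the fitted classifier and vectorizer from section 2): none of the one-hot encoded Name=... columns should have nonzero importance.

feature_names = np.array(vec.get_feature_names())
name_mask = np.array([name.startswith('Name=') for name in feature_names])
# first number: how many Name=... columns exist; second: how many of them XGBoost ever used in a split
print(name_mask.sum(), (clf.feature_importances_[name_mask] > 0).sum())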
But Name still might contain some useful information. We don’t want to guess how to best pre-process it and what features to extract, so let’s use the most general character ngram vectorizer:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, clf2)
evaluate(pipeline2)
Accuracy: 0.839 ± 0.081
In this case the pipeline is more complex and we slightly improved our result, but the improvement is not significant. Let’s look at feature importances:
show_weights(clf2, vec=vec2)
Weight | Feature |
---|---|
0.3138 | Name__ Mr. |
0.0821 | All__Pclass=3 |
0.0443 | Name__sso |
0.0294 | All__Sex=female |
0.0212 | Name__lia |
0.0205 | All__Fare |
0.0203 | All__Ticket=1601 |
0.0197 | All__Embarked=S |
0.0187 | Name__ Ma |
0.0177 | All__Cabin= |
0.0172 | Name__ Mar |
0.0168 | Name__s, |
0.0160 | Name__ Mr |
0.0157 | Name__son |
0.0138 | Name__ne |
0.0137 | Name__ber |
0.0136 | All__SibSp |
0.0136 | Name__e, |
0.0134 | All__Pclass=1 |
0.0125 | All__Embarked=C |
… 2072 more … |
We see that now there are a lot of features that come from the Name field (in fact, a classifier based on Name alone gives about 0.79 accuracy). Name features listed in this way are not very informative; they make more sense when we check out predictions. We hide missing features here because text features produce a lot of them, and they are not very interesting:
from IPython.display import display
for idx in [4, 5, 7, 37, 81]:
    display(show_prediction(clf2, valid_xs[idx], vec=vec2,
                            show_feature_values=True, feature_filter=no_missing))
y=1 (probability 0.771, score 1.215) top features
Contribution? | Feature | Value |
---|---|---|
+0.995 | Name: Highlighted in text (sum) | |
+0.347 | All__Fare | 17.800 |
+0.236 | All__Sex=female | 1.000 |
+0.109 | All__Age | 18.000 |
-0.029 | All__Cabin= | 1.000 |
-0.069 | All__Parch | 0.000 |
-0.150 | All__Embarked=S | 1.000 |
-0.215 | All__SibSp | 1.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: Arnold-Franchi, Mrs. Josef (Josefine Franchi)
y=0 (probability 0.905, score -2.248) top features
Contribution? | Feature | Value |
---|---|---|
+0.948 | Name: Highlighted in text (sum) | |
+0.539 | <BIAS> | 1.000 |
+0.387 | All__Parch | 0.000 |
+0.221 | All__Age | 45.000 |
+0.071 | All__Cabin= | 1.000 |
+0.037 | All__SibSp | 0.000 |
-0.067 | All__Pclass=1 | 1.000 |
-0.492 | All__Fare | 26.550 |
Name: Romaine, Mr. Charles Hallace ("Mr C Rolmane")
y=0 (probability 0.941, score -2.762) top features
Contribution? | Feature | Value |
---|---|---|
+1.946 | All__SibSp | 8.000 |
+0.942 | All__Fare | 69.550 |
+0.678 | All__Pclass=3 | 1.000 |
+0.539 | <BIAS> | 1.000 |
+0.160 | All__Parch | 2.000 |
+0.074 | All__Embarked=S | 1.000 |
+0.029 | All__Cabin= | 1.000 |
-0.669 | Name: Highlighted in text (sum) | |
Name: Sage, Master. Thomas Henry
y=1 (probability 0.679, score 0.750) top features
Contribution? | Feature | Value |
---|---|---|
+0.236 | All__Sex=female | 1.000 |
+0.226 | All__Fare | 7.879 |
+0.141 | Name: Highlighted in text (sum) | |
+0.010 | All__SibSp | 0.000 |
-0.029 | All__Cabin= | 1.000 |
-0.041 | All__Parch | 0.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: Mockler, Miss. Helen Mary "Ellie"
y=1 (probability 0.660, score 0.663) top features
Contribution? | Feature | Value |
---|---|---|
+0.236 | All__Sex=female | 1.000 |
+0.161 | All__Fare | 23.250 |
+0.158 | Name: Highlighted in text (sum) | |
+0.152 | All__Embarked=Q | 1.000 |
+0.010 | All__SibSp | 2.000 |
-0.029 | All__Cabin= | 1.000 |
-0.069 | All__Parch | 0.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: McCoy, Miss. Agnes
Text features from the Name field are highlighted directly in text, and the sum of weights is shown in the weights table as “Name: Highlighted in text (sum)”.
Looks like the name classifier tried to infer both gender and status from the title: “Mr.” is bad because women are saved first, and it’s better to be “Mrs.” (married) than “Miss.”. The name classifier is also picking up some parts of names and surnames, especially endings, perhaps as a proxy for social status. It’s especially bad to be “Mary” if you are from the third class.