Note
This tutorial is intended to be run in an IPython notebook. It is also available as a notebook file here.
Explaining XGBoost predictions on the Titanic dataset
This tutorial will show you how to analyze predictions of an XGBoost classifier (regression for XGBoost and most scikit-learn tree ensembles are also supported by eli5). We will use the Titanic dataset, which is small and does not have too many features, but is still interesting enough.
We are using XGBoost 0.81 and data downloaded from https://www.kaggle.com/c/titanic/data (it is also bundled in the eli5 repo: https://github.com/TeamHG-Memex/eli5/blob/master/notebooks/titanic-train.csv).
1. Training data
Let’s start by loading the data:
import csv
import numpy as np
with open('titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]
[OrderedDict([('PassengerId', '1'),
('Survived', '0'),
('Pclass', '3'),
('Name', 'Braund, Mr. Owen Harris'),
('Sex', 'male'),
('Age', '22'),
('SibSp', '1'),
('Parch', '0'),
('Ticket', 'A/5 21171'),
('Fare', '7.25'),
('Cabin', ''),
('Embarked', 'S')])]
Variable descriptions:
- Age: Age
- Cabin: Cabin
- Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- Fare: Passenger Fare
- Name: Name
- Parch: Number of Parents/Children Aboard
- Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Sex: Sex
- SibSp: Number of Siblings/Spouses Aboard
- Survived: Survival (0 = No; 1 = Yes)
- Ticket: Ticket Number
Next, shuffle the data and separate features from what we are trying to predict: survival.
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])
all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
891 items total, 38.4% true
We do just minimal preprocessing: convert the obviously continuous Age and Fare variables to floats, and SibSp and Parch to integers. Missing Age values are removed.
for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])
2. Simple XGBoost classifier
Let’s first build a very simple classifier with xgboost.XGBClassifier and sklearn.feature_extraction.DictVectorizer, and check its accuracy with 10-fold cross-validation:
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, clf)
def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted
evaluate(pipeline)
Accuracy: 0.823 ± 0.071
There is one tricky bit about the code above: one may be tempted to just pass dense=True to DictVectorizer: after all, in this case the matrices are small. But this is not a great solution, because we would lose the ability to distinguish features that are missing from features that have a zero value.
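To make this concrete, here is a small toy check (a sketch that is not part of the original notebook; the two passenger dicts are invented). With the default sparse output, a missing Age simply has no stored entry, which XGBoost treats as a missing value, while SibSp=0 is kept as an explicit zero; with dense=True both would show up as a plain 0.0 and become indistinguishable:

toy_vec = DictVectorizer()  # sparse output by default
toy = toy_vec.fit_transform([{'SibSp': 0, 'Age': 30.0}, {'SibSp': 0}])
print(toy_vec.get_feature_names())  # ['Age', 'SibSp']
print(toy.toarray())                # dense view: the missing Age looks like a real zero
print(toy[0].nnz, toy[1].nnz)       # 2 1 - nnz counts stored values, and the second row stores no Age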
3. Explaining weights
In order to calculate a prediction, XGBoost sums predictions of all its trees. The number of trees is controlled by the n_estimators argument and is 100 by default. Each tree is not a great predictor on its own, but by summing across all trees, XGBoost is able to provide a robust estimate in many cases. Here is one of the trees:
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names
0:[Sex=female<-9.53674316e-07] yes=1,no=2,missing=1
    1:[Age<13] yes=3,no=4,missing=4
        3:[SibSp<2] yes=7,no=8,missing=7
            7:leaf=0.145454556
            8:leaf=-0.125
        4:[Fare<26.2687492] yes=9,no=10,missing=9
            9:leaf=-0.151515156
            10:leaf=-0.0727272779
    2:[Pclass=3<-9.53674316e-07] yes=5,no=6,missing=5
        5:[Fare<12.1750002] yes=11,no=12,missing=12
            11:leaf=0.0500000007
            12:leaf=0.175193802
        6:[Fare<24.8083496] yes=13,no=14,missing=14
            13:leaf=0.0365591422
            14:leaf=-0.151999995
We see that this tree checks the Sex, Age, Pclass, Fare and SibSp features. leaf gives the decision of a single tree, and these decisions are summed over all trees in the ensemble.
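As a quick sanity check of this picture (a sketch, not part of the original tutorial; it assumes the default binary:logistic objective with base_score=0.5, whose contribution in margin space is zero), we can verify that the predicted probability is just the logistic sigmoid of the summed leaf values, i.e. of the raw margin:

from scipy.special import expit  # logistic sigmoid
from xgboost import DMatrix

X_valid = vec.transform(valid_xs)
margin = booster.predict(DMatrix(X_valid), output_margin=True)  # sum of leaf values over all trees
print(np.allclose(expit(margin), clf.predict_proba(X_valid)[:, 1], atol=1e-5))  # expect True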
Let’s check feature importances with eli5.show_weights():
from eli5 import show_weights
show_weights(clf, vec=vec)
Weight | Feature |
---|---|
0.4278 | Sex=female |
0.1949 | Pclass=3 |
0.0665 | Embarked=S |
0.0510 | Pclass=2 |
0.0420 | SibSp |
0.0417 | Cabin= |
0.0385 | Embarked=C |
0.0358 | Ticket=1601 |
0.0331 | Age |
0.0323 | Fare |
0.0220 | Pclass=1 |
0.0143 | Parch |
0 | Name=Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) |
0 | Name=Roebling, Mr. Washington Augustus II |
0 | Name=Rosblom, Mr. Viktor Richard |
0 | Name=Ross, Mr. John Hugo |
0 | Name=Rush, Mr. Alfred George John |
0 | Name=Rouse, Mr. Richard Henry |
0 | Name=Ryerson, Miss. Emily Borie |
0 | Name=Ryerson, Miss. Susan Parker "Suzette" |
… 1972 more … |
There are several different ways to calculate feature importances. By default, “gain” is used, that is the average gain of the feature when it is used in trees. Other types are “weight” - the number of times a feature is used to split the data, and “cover” - the average coverage of the feature. You can select the method with the importance_type argument.
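For example, to rank features by the number of splits instead of the default gain (assuming your eli5 version forwards importance_type to the XGBoost explainer, as recent versions do):

show_weights(clf, vec=vec, importance_type='weight')

The resulting ranking will in general differ from the gain-based table above.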
Now we know that the two most important features are Sex=female and Pclass=3, but we still don’t know how XGBoost decides what prediction to make based on their values.
4. Explaining predictions
To get a better idea of how our classifier works, let’s examine individual predictions with eli5.show_prediction():
from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)
y=1 (probability 0.566, score 0.264) top features
Contribution? | Feature | Value |
---|---|---|
+1.673 | Sex=female | 1.000 |
+0.479 | Embarked=S | Missing |
+0.070 | Fare | 7.879 |
-0.004 | Cabin= | 1.000 |
-0.006 | Parch | 0.000 |
-0.009 | Pclass=2 | Missing |
-0.009 | Ticket=1601 | Missing |
-0.012 | Embarked=C | Missing |
-0.071 | SibSp | 0.000 |
-0.073 | Pclass=1 | Missing |
-0.147 | Age | 19.000 |
-0.528 | <BIAS> | 1.000 |
-1.100 | Pclass=3 | 1.000 |
Weight means how much each feature contributed to the final prediction across all trees. The idea for weight calculation is described in http://blog.datadive.net/interpreting-random-forests/; eli5 provides an independent implementation of this algorithm for XGBoost and most scikit-learn tree ensembles.
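To see this additivity programmatically, here is a small sketch (not part of the original tutorial; it relies on the structure of the Explanation object returned by eli5.explain_prediction): the bias plus all per-feature contributions should add up to the score shown in the table header above.

from eli5 import explain_prediction

expl = explain_prediction(clf, valid_xs[1], vec=vec)
fw = expl.targets[0].feature_weights
print(sum(w.weight for w in fw.pos) + sum(w.weight for w in fw.neg))  # ≈ 0.264, the score above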
Here we see that the classifier thinks it’s good to be a female, but bad to travel third class. Some features have “Missing” as value (we are passing show_feature_values=True to view the values): that means the feature was missing, so in this case it’s good to not have embarked in Southampton. This is where our decision to go with sparse matrices comes in handy - we still see that Parch is zero, not missing.
It’s possible to show only the features that are present using the feature_filter argument: it’s a function that accepts a feature name and value, and returns True for features that should be shown:
no_missing = lambda feature_name, feature_value: not np.isnan(feature_value)
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)
y=1 (probability 0.566, score 0.264) top features
Contribution? | Feature | Value |
---|---|---|
+1.673 | Sex=female | 1.000 |
+0.070 | Fare | 7.879 |
-0.004 | Cabin= | 1.000 |
-0.006 | Parch | 0.000 |
-0.071 | SibSp | 0.000 |
-0.147 | Age | 19.000 |
-0.528 | <BIAS> | 1.000 |
-1.100 | Pclass=3 | 1.000 |
5. Adding text features
Right now we treat the Name field as categorical, like other text features. But in this dataset each name is unique, so XGBoost does not use this feature at all, because it’s such a poor discriminator: it’s absent from the weights table in section 3.
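This is easy to verify directly (a quick check that is not in the original notebook; it reuses the fitted classifier and vectorizer from section 2): none of the one-hot encoded Name=... columns should have nonzero importance.

feature_names = np.array(vec.get_feature_names())
name_mask = np.array([name.startswith('Name=') for name in feature_names])
# first number: how many Name=... columns exist; second: how many of them XGBoost ever used in a split
print(name_mask.sum(), (clf.feature_importances_[name_mask] > 0).sum())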
But Name still might contain some useful information. We don’t want to guess how to best pre-process it and what features to extract, so let’s use the most general character ngram vectorizer:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, clf2)
evaluate(pipeline2)
Accuracy: 0.839 ± 0.081
In this case the pipeline is more complex and we slightly improved our result, but the improvement is not significant. Let’s look at feature importances:
show_weights(clf2, vec=vec2)
Weight | Feature |
---|---|
0.3138 | Name__ Mr. |
0.0821 | All__Pclass=3 |
0.0443 | Name__sso |
0.0294 | All__Sex=female |
0.0212 | Name__lia |
0.0205 | All__Fare |
0.0203 | All__Ticket=1601 |
0.0197 | All__Embarked=S |
0.0187 | Name__ Ma |
0.0177 | All__Cabin= |
0.0172 | Name__ Mar |
0.0168 | Name__s, |
0.0160 | Name__ Mr |
0.0157 | Name__son |
0.0138 | Name__ne |
0.0137 | Name__ber |
0.0136 | All__SibSp |
0.0136 | Name__e, |
0.0134 | All__Pclass=1 |
0.0125 | All__Embarked=C |
… 2072 more … |
We see that now there are a lot of features that come from the Name field (in fact, a classifier based on Name alone gives about 0.79 accuracy). Name features listed in this way are not very informative; they make more sense when we check out predictions. We hide missing features here because text features produce a lot of them, and they are not very interesting:
from IPython.display import display
for idx in [4, 5, 7, 37, 81]:
    display(show_prediction(clf2, valid_xs[idx], vec=vec2,
                            show_feature_values=True, feature_filter=no_missing))
y=1 (probability 0.771, score 1.215) top features
Contribution? | Feature | Value |
---|---|---|
+0.995 | Name: Highlighted in text (sum) | |
+0.347 | All__Fare | 17.800 |
+0.236 | All__Sex=female | 1.000 |
+0.109 | All__Age | 18.000 |
-0.029 | All__Cabin= | 1.000 |
-0.069 | All__Parch | 0.000 |
-0.150 | All__Embarked=S | 1.000 |
-0.215 | All__SibSp | 1.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: Arnold-Franchi, Mrs. Josef (Josefine Franchi)
y=0 (probability 0.905, score -2.248) top features
Contribution? | Feature | Value |
---|---|---|
+0.948 | Name: Highlighted in text (sum) | |
+0.539 | <BIAS> | 1.000 |
+0.387 | All__Parch | 0.000 |
+0.221 | All__Age | 45.000 |
+0.071 | All__Cabin= | 1.000 |
+0.037 | All__SibSp | 0.000 |
-0.067 | All__Pclass=1 | 1.000 |
-0.492 | All__Fare | 26.550 |
Name: Romaine, Mr. Charles Hallace ("Mr C Rolmane")
y=0 (probability 0.941, score -2.762) top features
Contribution? | Feature | Value |
---|---|---|
+1.946 | All__SibSp | 8.000 |
+0.942 | All__Fare | 69.550 |
+0.678 | All__Pclass=3 | 1.000 |
+0.539 | <BIAS> | 1.000 |
+0.160 | All__Parch | 2.000 |
+0.074 | All__Embarked=S | 1.000 |
+0.029 | All__Cabin= | 1.000 |
-0.669 | Name: Highlighted in text (sum) | |
Name: Sage, Master. Thomas Henry
y=1 (probability 0.679, score 0.750) top features
Contribution? | Feature | Value |
---|---|---|
+0.236 | All__Sex=female | 1.000 |
+0.226 | All__Fare | 7.879 |
+0.141 | Name: Highlighted in text (sum) | |
+0.010 | All__SibSp | 0.000 |
-0.029 | All__Cabin= | 1.000 |
-0.041 | All__Parch | 0.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: Mockler, Miss. Helen Mary "Ellie"
y=1 (probability 0.660, score 0.663) top features
Contribution? | Feature | Value |
---|---|---|
+0.236 | All__Sex=female | 1.000 |
+0.161 | All__Fare | 23.250 |
+0.158 | Name: Highlighted in text (sum) | |
+0.152 | All__Embarked=Q | 1.000 |
+0.010 | All__SibSp | 2.000 |
-0.029 | All__Cabin= | 1.000 |
-0.069 | All__Parch | 0.000 |
-0.539 | <BIAS> | 1.000 |
-0.932 | All__Pclass=3 | 1.000 |
Name: McCoy, Miss. Agnes
Text features from the Name field are highlighted directly in text, and the sum of weights is shown in the weights table as “Name: Highlighted in text (sum)”.
Looks like the name classifier tried to infer both gender and status from the title: “Mr.” is bad because women are saved first, and it’s better to be “Mrs.” (married) than “Miss.”. The name classifier is also picking up some parts of names and surnames, especially endings, perhaps as a proxy for social status. It’s especially bad to be “Mary” if you are from the third class.