Note
This tutorial can be run as an IPython notebook.
TextExplainer: debugging black-box text classifiers¶
While eli5 supports many classifiers and preprocessing methods, it can’t support them all.
If a library is not supported by eli5 directly, or the text processing pipeline is too complex for eli5, eli5 can still help: it provides an implementation of the LIME algorithm (Ribeiro et al., 2016), which can explain predictions of arbitrary classifiers, including text classifiers. eli5.lime can also help when it is hard to get an exact mapping between model coefficients and text features, e.g. if dimension reduction is involved.
Example problem: LSA+SVM for 20 Newsgroups dataset¶
Let’s load the “20 Newsgroups” dataset and create a text processing pipeline which is hard to debug using conventional methods: an SVM with an RBF kernel trained on LSA features.
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
lsa = make_pipeline(vec, svd)
clf = SVC(C=150, gamma=2e-2, probability=True)
pipe = make_pipeline(lsa, clf)
pipe.fit(twenty_train.data, twenty_train.target)
pipe.score(twenty_test.data, twenty_test.target)
0.89014647137150471
The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.
This is what the pipeline returns for a document - it is pretty sure the first message in test data belongs to sci.med:
def print_prediction(doc):
    y_pred = pipe.predict_proba([doc])[0]
    for target, prob in zip(twenty_train.target_names, y_pred):
        print("{:.3f} {}".format(prob, target))
doc = twenty_test.data[0]
print_prediction(doc)
0.001 alt.atheism
0.001 comp.graphics
0.995 sci.med
0.004 soc.religion.christian
TextExplainer¶
Such pipelines are not supported by eli5 directly, but one can use eli5.lime.TextExplainer to debug the prediction - to check what was important in the document to make this decision.
Create a TextExplainer instance, then pass the document to explain and a black-box classifier (a function which returns probabilities) to the fit() method, then check the explanation:
import eli5
from eli5.lime import TextExplainer
te = TextExplainer(random_state=42)
te.fit(doc, pipe.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)
y=alt.atheism (probability 0.000, score -9.663) top features
Contribution? | Feature |
---|---|
-0.360 | <BIAS> |
-9.303 | Highlighted in text (sum) |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=comp.graphics (probability 0.000, score -8.503) top features
Contribution? | Feature |
---|---|
-0.210 | <BIAS> |
-8.293 | Highlighted in text (sum) |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=sci.med (probability 0.996, score 5.826) top features
Contribution? | Feature |
---|---|
+5.929 | Highlighted in text (sum) |
-0.103 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
y=soc.religion.christian (probability 0.004, score -5.504) top features
Contribution? | Feature |
---|---|
-0.342 | <BIAS> |
-5.162 | Highlighted in text (sum) |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Why it works¶
The explanation makes sense - we expect a reasonable classifier to take the highlighted words into account. But how can we be sure this is how the pipeline works, not just a nice-looking lie? A simple sanity check is to remove or change the highlighted words, to confirm that they change the outcome:
import re
doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)
print_prediction(doc2)
0.065 alt.atheism
0.145 comp.graphics
0.376 sci.med
0.414 soc.religion.christian
Predicted probabilities changed a lot indeed.
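Note that the regular expression above matches substrings anywhere, so a pattern like "pain" would also hit a longer word such as "painting". A minimal sketch of a stricter variant that removes only whole words is shown here; remove_words is a hypothetical helper written for this illustration, not part of eli5:

```python
import re

def remove_words(text, words):
    """Delete whole-word, case-insensitive occurrences of `words` from `text`."""
    # \b anchors keep the pattern from matching inside longer words
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

example = "Pain relief helped; painting did not ease the pain."
print(remove_words(example, ["pain"]))
```

The result could then be passed to print_prediction in the same way as doc2.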
And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). The explanation we saw is for this white-box classifier.
This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward:
- generate distorted versions of the text;
- predict probabilities for these distorted texts using the black-box classifier;
- train another classifier (one of those eli5 supports) which tries to predict the output of the black-box classifier on these texts.
The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.
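The idea of probing a black box near one document can be sketched in a few lines of plain Python. This is a deliberately cruder stand-in, not eli5's implementation: instead of training a white-box classifier, it compares the mean black-box output over samples that keep a word with the mean over samples that drop it. The names sample_near, word_effects and toy_black_box are hypothetical, and the toy model is an assumption made for the example:

```python
import random
import re

def sample_near(text, n_samples, rng):
    """Generate distorted copies of `text` by dropping random words."""
    tokens = re.findall(r"\w+", text.lower())
    samples = []
    for _ in range(n_samples):
        # keep each token with probability 0.5
        samples.append([t for t in tokens if rng.random() < 0.5])
    return tokens, samples

def word_effects(text, black_box, n_samples=2000, seed=0):
    """Per word: mean black-box output with the word minus without it."""
    rng = random.Random(seed)
    tokens, samples = sample_near(text, n_samples, rng)
    scores = [black_box(" ".join(s)) for s in samples]
    effects = {}
    for word in set(tokens):
        with_w = [p for s, p in zip(samples, scores) if word in s]
        without = [p for s, p in zip(samples, scores) if word not in s]
        if with_w and without:
            effects[word] = sum(with_w) / len(with_w) - sum(without) / len(without)
    return effects

def toy_black_box(doc):
    # hypothetical model: high probability whenever "kidney" is present
    return 0.9 if "kidney" in doc else 0.1

effects = word_effects("my kidney stones hurt a lot", toy_black_box)
print(max(effects, key=effects.get))
```

A trained white-box classifier, as used by TextExplainer, can additionally account for interactions between features and produce calibrated probabilities, which this difference of means cannot.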
Generated samples (distorted texts) are available in the samples_ attribute:
print(te.samples_[0])
As my kidney , isn' any
can .
Either they , be ,
to .
, - tech to mention ' had kidney
and , .
By default TextExplainer generates 5000 distorted texts (use the n_samples argument to change the amount):
len(te.samples_)
5000
The trained vectorizer and white-box classifier are available as the vec_ and clf_ attributes:
te.vec_, te.clf_
(CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\b\w+\b', tokenizer=None, vocabulary=None), SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='log', n_iter=5, n_jobs=1, penalty='elasticnet', power_t=0.5, random_state=<mtrand.RandomState object at 0x10e1dcf78>, shuffle=True, verbose=0, warm_start=False))
Should we trust the explanation?¶
Ok, this sounds fine, but how can we be sure that this simple text classification pipeline approximated the black-box classifier well?
One way to do that is to check the quality on a held-out dataset (which is also generated). TextExplainer does that by default and stores metrics in the metrics_ attribute:
te.metrics_
{'mean_KL_divergence': 0.020120624088861134, 'score': 0.98625304704899297}
- ‘score’ is an accuracy score weighted by cosine distance between the generated sample and the original document (i.e. texts which are closer to the example are more important). Accuracy shows how good the ‘top 1’ predictions are.
- ‘mean_KL_divergence’ is the mean Kullback–Leibler divergence over all target classes; it is also weighted by distance. KL divergence shows how well the probabilities are approximated; 0.0 means a perfect match.
In this example both accuracy and KL divergence are good; it means our white-box classifier usually assigns the same labels as the black-box classifier on the dataset we generated, and its predicted probabilities are close to those predicted by our LSA+SVM pipeline. So it is likely (though not guaranteed, we’ll discuss it later) that the explanation is correct and can be trusted.
When working with LIME (e.g. via TextExplainer) it is always a good idea to check these scores. If they are not good then you can tell that something is not right.
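To make the two metrics more concrete, here is a rough sketch of how a weighted accuracy and a weighted mean KL divergence could be computed from black-box and white-box probabilities. The function names, the direction of the KL divergence and the exact weighting are assumptions made for illustration; eli5's own implementation may differ in detail:

```python
import math

def weighted_accuracy(p_black, p_white, weights):
    """Share of samples (by weight) where both models pick the same top class."""
    agree = [
        w for pb, pw, w in zip(p_black, p_white, weights)
        if pb.index(max(pb)) == pw.index(max(pw))
    ]
    return sum(agree) / sum(weights)

def weighted_mean_kl(p_black, p_white, weights, eps=1e-12):
    """Weighted mean of KL(black-box || white-box) over samples."""
    total = 0.0
    for pb, pw, w in zip(p_black, p_white, weights):
        # eps guards against log(0) when the white-box probability is zero
        kl = sum(b * math.log((b + eps) / (q + eps)) for b, q in zip(pb, pw) if b > 0)
        total += w * kl
    return total / sum(weights)

# two generated samples; the first is closer to the original document
p_black = [[0.9, 0.1], [0.2, 0.8]]
p_white = [[0.8, 0.2], [0.6, 0.4]]
weights = [1.0, 0.5]
print(weighted_accuracy(p_black, p_white, weights))
print(weighted_mean_kl(p_black, p_white, weights))
```

In this toy input the models disagree only on the less similar sample, so the disagreement is penalized less than it would be in an unweighted accuracy.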
Let’s make it fail¶
By default TextExplainer uses a very basic text processing pipeline: Logistic Regression trained on bag-of-words and bag-of-bigrams features (see the te.clf_ and te.vec_ attributes). This limits the set of black-box classifiers it can explain: because the text is seen as a “bag of words/ngrams”, the default white-box pipeline can’t distinguish e.g. between the same word at the beginning of the document and at the end of the document. Bigrams help to alleviate the problem in practice, but not completely.
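A small illustration of this limitation, using only the standard library (the helper names bag_of_words and bag_of_bigrams and the two example sentences are made up for this sketch): two texts with the same words in a different order have identical bags of words, while their bigrams differ.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring their positions."""
    return Counter(text.lower().split())

def bag_of_bigrams(text):
    """Count pairs of adjacent words."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a = "the pain came before the medication"
b = "the medication came before the pain"

print(bag_of_words(a) == bag_of_words(b))      # same unigram features
print(bag_of_bigrams(a) == bag_of_bigrams(b))  # bigrams capture some word order
```

Bigrams only see adjacent words, so a change of order that keeps every adjacent pair intact would still be invisible to them.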
Black-box classifiers which use features like “text length” (not directly related to tokens) can also be hard to approximate using the default bag-of-words/ngrams model.
This kind of failure is usually detectable though - the scores will be poor (low accuracy, high KL divergence). Let’s check it on a completely synthetic example - a black-box classifier which assigns a class based on whether the document length is odd or even and on the presence of the word ‘medication’.
import numpy as np
def predict_proba_len(docs):
    # nasty predict_proba - the result is based on document length,
    # and also on a presence of "medication"
    proba = [
        [0, 0, 1.0, 0] if len(doc) % 2 or 'medication' in doc else [1.0, 0, 0, 0]
        for doc in docs
    ]
    return np.array(proba)
te3 = TextExplainer().fit(doc, predict_proba_len)
te3.show_prediction(target_names=twenty_train.target_names)
y=sci.med (probability 0.989, score 4.466) top features
Contribution? | Feature |
---|---|
+4.576 | Highlighted in text (sum) |
-0.110 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
TextExplainer correctly figured out that ‘medication’ is important, but failed to account for the “len(doc) % 2” condition, so the explanation is incomplete. We can detect this failure by looking at the metrics - they are poor (accuracy is lower and KL divergence is much higher than before):
te3.metrics_
{'mean_KL_divergence': 0.3312922355257879, 'score': 0.79050673156810314}
If (a big if…) we suspect that whether the document length is even or odd is important, it is possible to customize TextExplainer to check this hypothesis.
To do that, we need to create a vectorizer which returns both an “is odd” feature and bag-of-words features, and pass this vectorizer to TextExplainer. This vectorizer should follow the scikit-learn API. The easiest way is to use FeatureUnion - just make sure all transformers joined by FeatureUnion have get_feature_names() methods.
from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
class DocLength(TransformerMixin):
    def fit(self, X, y=None):  # some boilerplate
        return self
    def transform(self, X):
        return [
            # note that we needed both positive and negative
            # feature - otherwise for linear model there won't
            # be a feature to show in a half of the cases
            [len(doc) % 2, not len(doc) % 2]
            for doc in X
        ]
    def get_feature_names(self):
        return ['is_odd', 'is_even']
vec = make_union(DocLength(), CountVectorizer(ngram_range=(1,2)))
te4 = TextExplainer(vec=vec).fit(doc[:-1], predict_proba_len)
print(te4.metrics_)
te4.explain_prediction(target_names=twenty_train.target_names)
{'mean_KL_divergence': 0.024826114773734968, 'score': 1.0}
y=sci.med (probability 0.996, score 5.511) top features
Contribution? | Feature |
---|---|
+8.590 | countvectorizer: Highlighted in text (sum) |
-0.043 | <BIAS> |
-3.037 | doclength__is_even |
countvectorizer: as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less
Much better! It was a toy example, but the idea stands - if you think something could be important, add it to the mix as a feature for TextExplainer.
Let’s make it fail, again¶
Another possible issue is the dataset generation method. Not only should feature extraction be powerful enough, but the auto-generated texts should also be diverse enough.
TextExplainer removes random words by default, so by default it can’t e.g. provide a good explanation for a black-box classifier which works on the character level. Let’s try to use TextExplainer to explain a classifier which uses char ngrams as features:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec_char = HashingVectorizer(analyzer='char_wb', ngram_range=(4,5))
clf_char = SGDClassifier(loss='log')
pipe_char = make_pipeline(vec_char, clf_char)
pipe_char.fit(twenty_train.data, twenty_train.target)
pipe_char.score(twenty_test.data, twenty_test.target)
0.88082556591211714
This pipeline is supported by eli5 directly, so in practice there is no need to use TextExplainer for it. We’re using this pipeline as an example - it is possible to check the “true” explanation first, without using TextExplainer, and then compare the results with the TextExplainer results.
eli5.show_prediction(clf_char, doc, vec=vec_char,
targets=['sci.med'], target_names=twenty_train.target_names)
y=sci.med (probability 0.565, score -0.037) top features
Contribution? | Feature |
---|---|
+0.943 | Highlighted in text (sum) |
-0.980 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
TextExplainer produces a different result:
te = TextExplainer(random_state=42).fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
{'mean_KL_divergence': 0.020247299052285436, 'score': 0.92434669226497945}
y=sci.med (probability 0.576, score 0.621) top features
Contribution? | Feature |
---|---|
+0.972 | Highlighted in text (sum) |
-0.351 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Scores look OK but not great; the explanation kind of makes sense at first sight, but we know that the classifier works in a different way.
To explain such black-box classifiers we need to change both dataset generation method (change/remove individual characters, not only words) and feature extraction method (e.g. use char ngrams instead of words and word ngrams).
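The difference between word-level and character-level dataset generation can be sketched with one small function that treats whatever a regular expression matches as a token. This is only an illustration under simplifying assumptions (tokens are deleted rather than replaced, and mask_tokens is a hypothetical name); it is not how eli5's samplers are implemented:

```python
import random
import re

def mask_tokens(text, token_pattern, max_replace, rng, replacement=""):
    """Replace up to `max_replace` randomly chosen regex matches in `text`."""
    matches = list(re.finditer(token_pattern, text))
    if not matches:
        return text
    n = rng.randint(1, min(max_replace, len(matches)))
    # edit from the end of the string so earlier offsets stay valid
    chosen = sorted(rng.sample(matches, n), key=lambda m: m.start(), reverse=True)
    for m in chosen:
        text = text[:m.start()] + replacement + text[m.end():]
    return text

text = "kidney stones hurt"
rng = random.Random(0)
print(mask_tokens(text, r"\w+", 1, rng))  # word-level: a whole word disappears
print(mask_tokens(text, r"\S", 3, rng))   # char-level: a few characters disappear
```

Word-level masking never produces texts such as "kidny", so a black-box classifier that reacts to char ngrams is probed only coarsely by it.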
TextExplainer has an option (char_based=True) to use char-based sampling and a char-based classifier. If this makes a more powerful explanation engine, why not always use it?
te = TextExplainer(char_based=True, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
{'mean_KL_divergence': 0.22136004391576117, 'score': 0.55669450678688481}
y=sci.med (probability 0.366, score -0.003) top features
Contribution? | Feature |
---|---|
+0.199 | Highlighted in text (sum) |
-0.202 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Hm, the result looks worse. TextExplainer correctly detected that only the first part of the word “medication” is important, but the result is noisy overall, and the scores are bad. Let’s try it with more samples:
te = TextExplainer(char_based=True, n_samples=50000, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
{'mean_KL_divergence': 0.060019833958355841, 'score': 0.86048000626542609}
y=sci.med (probability 0.630, score 0.800) top features
Contribution? | Feature |
---|---|
+1.018 | Highlighted in text (sum) |
-0.219 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
It is getting closer, but still not there yet. The problem is that it is much more resource intensive - you need a lot more samples to get non-noisy results. Here explaining a single example took more time than training the original pipeline.
Generally speaking, to do an efficient explanation we should make some assumptions about the black-box classifier, such as:
- it uses words as features and doesn’t take word position into account;
- it uses words as features and takes word positions into account;
- it uses word ngrams as features;
- it uses char ngrams as features, positions don’t matter (i.e. an ngram means the same everywhere);
- it uses arbitrary attention over the text characters, i.e. every part of the text could potentially be important for a classifier on its own;
- it is important to have a particular token at a particular position, e.g. “third token is X”, and if we delete 2nd token then prediction changes not because 2nd token changed, but because 3rd token is shifted.
Depending on assumptions we should choose both dataset generation method and a white-box classifier. There is a tradeoff between generality and speed.
Simple bag-of-words assumptions allow for fast sample generation, and just a few hundred samples could be enough to get OK quality if the assumption is correct. But such generation methods / models will fail to explain a more complex classifier properly (they could still provide an explanation which is useful in practice though).
On the other hand, allowing each character to be important is a more powerful method, but it can require a lot of samples (maybe hundreds of thousands) and a lot of CPU time to get non-noisy results.
What’s bad about this kind of failure (wrong assumption about the black-box pipeline) is that it could be impossible to detect the failure by looking at the scores. Scores could be high because generated dataset is not diverse enough, not because our approximation is good.
The takeaway is that it is important to understand the “lenses” you’re looking through when using LIME to explain a prediction.
Customizing TextExplainer: sampling¶
TextExplainer uses MaskingTextSampler or MaskingTextSamplers instances to generate texts to train on. MaskingTextSampler is the main text generation class; MaskingTextSamplers provides a way to combine multiple samplers in a single object with the same interface.
A custom sampler instance can be passed to TextExplainer if we want to experiment with sampling. For example, let’s try a sampler which replaces no more than 3 characters in the text (default is to replace a random number of characters):
from eli5.lime.samplers import MaskingTextSampler
sampler = MaskingTextSampler(
    # Regex to split text into tokens.
    # "." means any single character is a token, i.e.
    # we work on chars.
    token_pattern='.',
    # replace no more than 3 tokens
    max_replace=3,
    # by default all occurrences of a token are replaced;
    # with bow=False only the token at a given position is replaced.
    bow=False,
)
samples, similarity = sampler.sample_near(doc)
print(samples[0])
As I recal from my bout with kidney stones, there isn't any
medication that can do anything about them except relieve the ain.
Either thy pass, or they have to be broken up with sound, or they have
to be extracted surgically.
When I was in, the X-ray tech happened to mention that she'd had kidney
stones and children, and the childbirth hurt less.
te = TextExplainer(char_based=True, sampler=sampler, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
{'mean_KL_divergence': 0.71042368337755823, 'score': 0.99933430578588944}
y=sci.med (probability 0.958, score 2.434) top features
Contribution? | Feature |
---|---|
+2.430 | Highlighted in text (sum) |
+0.005 | <BIAS> |
as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.
Note that the accuracy score is near perfect, but KL divergence is bad. It means this sampler was not very useful: most generated texts were “easy” in the sense that most (or all?) of them should still be classified as sci.med, so it was easy to get good accuracy. But because the generated texts were not diverse enough, the classifier hasn’t learned anything useful; it’s having a hard time predicting the probability output of the black-box pipeline on a held-out dataset.
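A tiny numeric example shows how the two metrics can disagree. The probabilities below are made up for illustration: both models pick the same top class, so accuracy counts the sample as correct, yet the predicted distributions are far apart.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

black_box = [0.95, 0.05]  # hypothetical black-box output
white_box = [0.55, 0.45]  # hypothetical white-box approximation

same_label = black_box.index(max(black_box)) == white_box.index(max(white_box))
print(same_label, kl_divergence(black_box, white_box))
```

This is why it helps to look at both numbers: accuracy only checks the ‘top 1’ label, while KL divergence is sensitive to how confident each model is.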
By default TextExplainer uses a mix of several sampling strategies which seems to work OK for token-based explanations. But a good sampling strategy which works for many real-world tasks could be a research topic in itself. If you’ve got some experience with it we’d love to hear from you - please share your findings in the eli5 issue tracker (https://github.com/TeamHG-Memex/eli5/issues)!
Customizing TextExplainer: classifier¶
In one of the previous examples we already changed the vectorizer TextExplainer uses (to take additional features into account). It is also possible to change the white-box classifier - for example, use a small decision tree:
from sklearn.tree import DecisionTreeClassifier
te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)
te5.fit(doc, pipe.predict_proba)
print(te5.metrics_)
te5.show_weights()
{'mean_KL_divergence': 0.037836554598348969, 'score': 0.9838155527960798}
Weight | Feature |
---|---|
0.5461 | kidney |
0.4539 | pain |
How to read it: “kidney <= 0.5” means “word ‘kidney’ is not in the document” (we’re explaining the original LSA+SVM pipeline again).
So according to this tree, if “kidney” is not in the document and “pain” is not in the document then the probability of a document belonging to sci.med drops to 0.65. If at least one of these words remains, the sci.med probability stays 0.9+.
print("both words removed:")
print_prediction(re.sub(r"(kidney|pain)", "", doc, flags=re.I))
print("\nonly 'pain' removed:")
print_prediction(re.sub(r"pain", "", doc, flags=re.I))
both words removed:
0.013 alt.atheism
0.022 comp.graphics
0.894 sci.med
0.072 soc.religion.christian
only 'pain' removed:
0.002 alt.atheism
0.004 comp.graphics
0.979 sci.med
0.015 soc.religion.christian
As expected, after removing both words the probability of sci.med decreased, though not as much as our simple decision tree predicted (to 0.9 instead of 0.64). Removing pain had exactly the effect predicted - the probability of sci.med became 0.98.