Storage of Sensitive Data in a Mechanism without Access Control Affecting scikit-learn package, versions [,1.5.0)


Severity

Recommended: medium

CVSS assessment made by Snyk's Security Team.

Threat Intelligence

Exploit Maturity: Proof of concept
EPSS: 0.04% (12th percentile)

  • Snyk ID: SNYK-PYTHON-SCIKITLEARN-7217830
  • published: 7 Jun 2024
  • disclosed: 6 Jun 2024
  • credit: Kemal Tugrul

Introduced: 6 Jun 2024

CVE-2024-5206
CWE-921

How to fix?

Upgrade scikit-learn to version 1.5.0 or higher.
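
As a quick way to tell whether an environment falls in the affected range [,1.5.0), the following is a minimal sketch using only the standard library. The helper names release_tuple and is_affected are hypothetical, and the parser looks only at the leading numeric release segment, so pre-release and post-release suffixes are ignored.

```python
import re
from importlib import metadata

# First release outside the affected range [,1.5.0), per this advisory.
FIXED_VERSION = (1, 5, 0)


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '1.4.1.post1' -> (1, 4, 1).

    Hypothetical helper: suffixes such as 'rc1' or '.post1' are ignored.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_affected(version: str) -> bool:
    """Return True when the release segment is below 1.5.0."""
    return release_tuple(version) < FIXED_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        print("scikit-learn is not installed")
    else:
        status = "affected" if is_affected(installed) else "not affected"
        print(f"scikit-learn {installed}: {status}")
```

Because suffixes are ignored, a release candidate of 1.5.0 would be reported as not affected; a stricter check would need a full version parser.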

Overview

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

Affected versions of this package are vulnerable to Storage of Sensitive Data in a Mechanism without Access Control due to the unexpected storage of all tokens present in the training data within the stop_words_ attribute. An attacker can access sensitive information, such as passwords or keys, by exploiting this behavior.

PoC

Limiting the vocabulary is a very common setting, which is why the library provides it. The expected behaviour is that the object stores the frequent tokens and discards the rest after fitting. In both theory and practice, the vectorizer needs only the vocabulary; the remaining tokens are not needed and should be discarded.

While the object correctly forms the required vocabulary, it stores the rest of the tokens in the stop_words_ attribute. It therefore retains every unique token that was passed to the fitting operation. This is demonstrated below:

# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd
# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world", "hello", "world", "secretkey", "password123"
        ])
    )
    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'world': 1}
    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}
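
The split between kept and discarded tokens can be modelled in plain Python. This is a simplified sketch, not scikit-learn's implementation: the function limit_vocabulary is hypothetical, tokenization is a lowercase whitespace split, and ties are broken by first appearance. It only illustrates that the discarded set is exactly the part of the training data that is not needed after fitting.

```python
from collections import Counter


def limit_vocabulary(documents, max_features):
    """Toy model of vocabulary limiting (hypothetical, not the library's code).

    Returns the kept vocabulary and the set of discarded tokens, which is the
    kind of data affected versions retain in stop_words_.
    """
    counts = Counter(token for doc in documents for token in doc.lower().split())
    # Keep the most frequent terms; ties are broken by first appearance.
    kept = [term for term, _ in counts.most_common(max_features)]
    # Assign indices in alphabetical order of the kept terms.
    vocabulary = {term: index for index, term in enumerate(sorted(kept))}
    removed_terms = set(counts) - set(kept)
    return vocabulary, removed_terms


docs = ["hello", "world", "hello", "world", "secretkey", "password123"]
vocabulary, removed_terms = limit_vocabulary(docs, max_features=2)
print(vocabulary)             # kept terms, needed for transform
print(sorted(removed_terms))  # discarded terms, not needed after fitting
```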

The example below shows that storing tokens in the stop_words_ attribute is unnecessary. Nullifying the attribute gives the same results:

# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd
# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world", "hello", "world", "secretkey", "password123"
        ])
    )
    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'world': 1}
    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}

    # Wiping out the stop_words_ attribute does not change the behaviour
    print(vectorizer.transform(["hello world"]).toarray())  # [[0.70710678 0.70710678]]
    vectorizer.stop_words_ = None
    assert vectorizer.stop_words_ is None
    print(vectorizer.transform(["hello world"]).toarray())  # [[0.70710678 0.70710678]]
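
Where an immediate upgrade is not possible, the nullification shown above could be applied before a fitted vectorizer is serialized or shared. The following is a minimal sketch of that idea, not an official mitigation: scrub_stop_words and dumps_scrubbed are hypothetical helper names, and the helper is duck-typed, so it works on any object exposing a stop_words_ attribute.

```python
import pickle


def scrub_stop_words(estimator):
    """Set stop_words_ to None if the object carries a non-empty value.

    Hypothetical helper; mutates the object in place and returns it.
    """
    if getattr(estimator, "stop_words_", None) is not None:
        estimator.stop_words_ = None
    return estimator


def dumps_scrubbed(estimator) -> bytes:
    """Pickle the estimator after removing the discarded tokens."""
    return pickle.dumps(scrub_stop_words(estimator))
```

For example, calling dumps_scrubbed(vectorizer) on the fitted vectorizer from the PoC would produce a pickle without the discarded tokens. A vectorizer nested inside a larger object, such as a pipeline, would need to be scrubbed individually, since this sketch does not walk nested estimators.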
