Upgrade scikit-learn to version 1.5.0 or higher.
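As a quick sanity check of the environment you are auditing, the sketch below verifies that the installed release is at or above the fixed version. It assumes the third-party `packaging` library is available (it usually ships alongside pip); the 1.5.0 threshold is the fixed version named above.

```python
# Minimal version check against the fixed release (1.5.0, per the remediation above).
# Assumes the third-party `packaging` library is installed in the environment.
import sklearn
from packaging.version import Version

installed = Version(sklearn.__version__)
if installed < Version("1.5.0"):
    print(f"scikit-learn {installed} is affected; upgrade to 1.5.0 or higher")
else:
    print(f"scikit-learn {installed} includes the fix")
```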
scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.
Affected versions of this package are vulnerable to Storage of Sensitive Data in a Mechanism without Access Control due to the unexpected storage of all tokens present in the training data within the `stop_words_` attribute. An attacker can access sensitive information, such as passwords or keys, by exploiting this behavior.
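The advisory does not spell out a specific access path, but one illustrative scenario (an assumption for demonstration, not part of the advisory) is a fitted vectorizer that is serialized with `pickle` and stored or shared without access control; every infrequent training token, secrets included, travels with the artifact:

```python
# Illustrative sketch (assumed scenario): a fitted TfidfVectorizer is pickled
# on an affected scikit-learn version and later opened by someone who only
# has read access to the serialized artifact.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# The model owner fits on data that happens to contain secret tokens.
vectorizer = TfidfVectorizer(max_features=2).fit(
    ["hello world", "hello world", "secretkey password123"]
)
blob = pickle.dumps(vectorizer)  # e.g. written to shared storage

# Anyone who can read the blob can recover the discarded (infrequent) tokens.
restored = pickle.loads(blob)
print(restored.stop_words_)  # e.g. {'secretkey', 'password123'}
```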
Limiting the vocabulary is a very common setting, and the library provides it. The expected behaviour is that the object stores the most frequent tokens and discards the rest after the fitting process. In theory and in practice, the vectorizer only needs the vocabulary; the remaining tokens serve no purpose and should be discarded.
While the object correctly forms the required vocabulary, it also stores the remaining tokens in the `stop_words_` attribute, i.e. every unique token that was passed in the fitting operation. This is demonstrated below:
```python
# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd

# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world",
            "hello", "world",
            "secretkey", "password123",
        ])
    )

    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'world': 1}

    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}
```
It is demonstrated below that the storage in the `stop_words_` attribute is unnecessary. Nullifying the attribute gives the same results:
```python
# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd

# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == '__main__':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world",
            "hello", "world",
            "secretkey", "password123",
        ])
    )

    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'world': 1}

    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}

    # Wiping out the stop_words_ attribute does not change the behaviour
    print(vectorizer.transform(["hello world"]).toarray())
    # [[0.70710678 0.70710678]]

    vectorizer.stop_words_ = None
    assert vectorizer.stop_words_ is None

    print(vectorizer.transform(["hello world"]).toarray())
    # [[0.70710678 0.70710678]]
```
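Since the transformation results are unchanged once the attribute is wiped, a possible stop-gap for environments that cannot upgrade yet (a sketch based on the behaviour shown above, not an officially documented workaround) is to clear `stop_words_` after fitting and before the model is serialized or shared:

```python
# Sketch of a stop-gap on affected versions: drop the leaked tokens
# before the fitted vectorizer is pickled or shared.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2).fit(
    ["hello world", "hello world", "secretkey password123"]
)

# Clear the attribute; as demonstrated above, transform() is unaffected.
vectorizer.stop_words_ = None

blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)
assert restored.stop_words_ is None
print(restored.transform(["hello world"]).toarray())
# [[0.70710678 0.70710678]]
```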