Storage of Sensitive Data in a Mechanism without Access Control in scikit-learn | CVE-2024-5206

Q: How to fix?

Upgrade scikit-learn to version 1.5.0 or higher.

Threat Intelligence

Proof of Concept

0.03% (6^th percentile)

Do your applications use this vulnerable package?

In a few clicks we can analyze your entire application and see what components are vulnerable in your application, and suggest you quick fixes.

Test your applications

Snyk IDSNYK-PYTHON-SCIKITLEARN-7217830
published7 Jun 2024
disclosed6 Jun 2024
creditKemal Tugrul

Report a new vulnerability Found a mistake?

Introduced: 6 Jun 2024

CVE-2024-5206 (opens in a new tab) CWE-921 (opens in a new tab)

How to fix?

Upgrade scikit-learn to version 1.5.0 or higher.

Overview

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

Affected versions of this package are vulnerable to Storage of Sensitive Data in a Mechanism without Access Control due to the unexpected storage of all tokens present in the training data within the stop_words_ attribute. An attacker can access sensitive information, such as passwords or keys, by exploiting this behavior.

PoC

Limiting vocabulary is a very common setting hence provided by the library. The expected behaviour is that the object stores the frequent tokens, and discards the rest after the fitting process. In theory and practice, the vectorizer only needs the vocabulary and the rest of the possible tokens will be simply non needed, hence should be discarded.

While the object correctly forms the required vocabulary, it stores the rest of the tokens in the `stop_words_ attribute. Therefore stores the entire unique tokens that have been passed in the fitting operation. Below it's demonstrated this:

# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd
# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if name == 'main':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world", "hello", "world", "secretkey", "password123"
        ])
    )
    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'server': 1}
    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}

It is demonstrated below that the storage in the stop_words_ attribute is unnecessary. Nullifying the attribute will give the same results:

# ╰─$ pip freeze | grep pandas
# pandas==2.2.1
import pandas as pd
# ╰─$ pip freeze | grep scikit-learn
# scikit-learn==1.4.1.post1
from sklearn.feature_extraction.text import TfidfVectorizer

if name == 'main':
    # Fitting the vectorizer will save every token presented
    vectorizer = TfidfVectorizer(
        max_features=2,
        # min_df=2/6  # Same results occur with different ways of limiting the vocabulary
    ).fit(
        pd.Series([
            "hello", "world", "hello", "world", "secretkey", "password123"
        ])
    )
    # Expected storage for frequent tokens
    print(vectorizer.vocabulary_)  # {'hello': 0, 'server': 1}
    # Unexpected data leak
    print(vectorizer.stop_words_)  # {'password123', 'secretkey'}
# Wiping-out the stop_words_ attribute does not change the behaviour
print(vectorizer.transform([&quot;hello world&quot;]).toarray())  # [[0.70710678 0.70710678]]
vectorizer.stop_words_ = None
assert vectorizer.stop_words_ is None
print(vectorizer.transform([&quot;hello world&quot;]).toarray())  # [[0.70710678 0.70710678]]

References

GitHub Commit

CVSS Base Scores

version 3.1

Attack Vector (AV)
Network
Attack Complexity (AC)
High
Privileges Required (PR)
Low
User Interaction (UI)
None

Scope (S)
Unchanged

Confidentiality (C)
High
Integrity (I)
None
Availability (A)
None

Storage of Sensitive Data in a Mechanism without Access Control Affecting scikit-learn package, versions [,1.5.0)

Severity