Python KMeans Clustering ValueError: Handling Data Type Errors

Running KMeans clustering with Python's scikit-learn library can raise a ValueError. A typical traceback looks like this:

ValueError                                Traceback (most recent call last)
Cell In[138], line 2
      1 tool = KMeans(n_clusters=4)
----> 2 data['cluster'] = tool.fit_predict(data)
      3 #print(data['cluster'])
      4 data['cluster']=data['cluster'].astype('category')

File D:\anaconda\lib\site-packages\sklearn\cluster\_kmeans.py:1033, in _BaseKMeans.fit_predict(self, X, y, sample_weight)
   1010 def fit_predict(self, X, y=None, sample_weight=None):
   1011     """Compute cluster centers and predict cluster index for each sample.
   1012 
   1013     Convenience method; equivalent to calling fit(X) followed by
   (...)
   1031         Index of the cluster each sample belongs to.
   1032     """
-> 1033     return self.fit(X, sample_weight=sample_weight).labels_

File D:\anaconda\lib\site-packages\sklearn\cluster\_kmeans.py:1417, in KMeans.fit(self, X, y, sample_weight)
   1390 """Compute k-means clustering.
   1391 
   1392 Parameters
   (...)
   1413     Fitted estimator.
   1414 """
   1415 self._validate_params()
-> 1417 X = self._validate_data(
   1418     X,
   1419     accept_sparse="csr",
   1420     dtype=[np.float64, np.float32],
   1421     order="C",
   1422     copy=self.copy_x,
   1423     accept_large_sparse=False,
   1424 )
   1426 self._check_params_vs_input(X)
   1428 random_state = check_random_state(self.random_state)

File D:\anaconda\lib\site-packages\sklearn\base.py:546, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    544     raise ValueError("Validation should be done on X, y or both.")
    545 elif not no_val_X and no_val_y:
--> 546     X = check_array(X, input_name="X", **check_params)
    547     out = X
    548 elif no_val_X and not no_val_y:

File D:\anaconda\lib\site-packages\sklearn\utils\validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877         array = xp.astype(array, dtype, copy

Cause:

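KMeans requires numeric input. The call tool.fit_predict(data) passes the entire DataFrame, and the traceback ends inside check_array at the point where the input is cast to a floating-point dtype. Any column that cannot be cast may trigger the ValueError, for example a string column, or a data['cluster'] column left with category dtype by an earlier run of the same cell.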
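To find which columns are likely responsible, it helps to inspect the dtypes before calling fit_predict. The sketch below uses a small hypothetical DataFrame (the column names and values are assumptions, not the data from the traceback) and pandas' select_dtypes to separate numeric from non-numeric columns.

```python
import pandas as pd

# Hypothetical frame standing in for `data`: two numeric features plus a text column
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4],
    'feature2': [5.0, 6.0, 7.0, 8.0],
    'label': ['A', 'B', 'C', 'D'],
})

# Columns with a non-numeric dtype are the likely culprits
non_numeric_cols = data.select_dtypes(exclude='number').columns.tolist()
numeric_cols = data.select_dtypes(include='number').columns.tolist()

print(data.dtypes)
print('non-numeric columns:', non_numeric_cols)
print('numeric columns:', numeric_cols)
```

Any column reported as non-numeric needs to be encoded or excluded before the data is handed to KMeans.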

Solution:

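Preprocess the data: Before running KMeans, make sure every input column has a numeric dtype. pd.to_numeric converts columns whose values look like numbers (for example the strings '1' and '2.5'); it cannot turn arbitrary labels such as 'A' or 'B' into meaningful numbers.
Use suitable features: If the data contains non-numeric features, apply feature engineering first to turn them into numeric features, for example with one-hot encoding, or leave them out of the clustering input.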
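As an illustration of the one-hot encoding option, the following is a minimal sketch rather than a recommended pipeline. The 'color' column, its values, and the choice of two clusters are assumptions made for the example; pd.get_dummies is used for the encoding.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: two numeric features and one categorical feature
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 6, 7, 8, 9, 10],
    'color': ['red', 'blue', 'red', 'green', 'blue', 'green'],
})

# Replace 'color' with one indicator column per category;
# dtype=float keeps every resulting column numeric
encoded = pd.get_dummies(data, columns=['color'], dtype=float)

# All columns are now numeric, so the whole frame can be clustered
tool = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = tool.fit_predict(encoded)

print(encoded.columns.tolist())
print(labels)
```

Because KMeans is distance-based, the indicator columns and the original features share one scale here only by coincidence of the toy data; scaling the features first is often worth considering.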

Code example:

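import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({'feature1': [1, 2, 3, 4], 'feature2': [5, 6, 7, 8], 'cluster': ['A', 'B', 'C', 'D']})

# pd.to_numeric only parses values that look like numbers. The labels 'A'..'D'
# do not, so errors='coerce' turns the whole 'cluster' column into NaN.
data['cluster'] = pd.to_numeric(data['cluster'], errors='coerce')

# Cluster on the numeric feature columns only; the labels overwrite 'cluster'
tool = KMeans(n_clusters=4)
data['cluster'] = tool.fit_predict(data[['feature1', 'feature2']])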

Notes:

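The errors='coerce' argument replaces every value that cannot be parsed as a number with NaN. KMeans does not accept NaN, so a coerced column still has to be filled, dropped, or excluded before clustering.
Pass only the feature columns to be clustered to fit_predict, for example data[['feature1', 'feature2']], rather than the whole DataFrame.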
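The behaviour of errors='coerce' can be seen in a small sketch. The two Series below are made-up examples: one holds numeric-looking strings, the other plain labels.

```python
import pandas as pd

# Hypothetical inputs: strings that look like numbers, and plain labels
numeric_strings = pd.Series(['1', '2.5', '3'])
labels = pd.Series(['A', 'B', 'C'])

# Numeric-looking strings are parsed into floats
converted = pd.to_numeric(numeric_strings, errors='coerce')

# Labels cannot be parsed, so every value becomes NaN
coerced = pd.to_numeric(labels, errors='coerce')

print(converted.tolist())
print(coerced.isna().all())
```

This is why pd.to_numeric suits columns that were read in as text but really contain numbers, while genuine categories need an encoding step instead.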

Restricting the input to numeric columns in this way resolves the ValueError and lets KMeans clustering run successfully.