Low silhouette coefficient

I am doing k-means clustering on a dataset of sales values for articles.

Each article has 52 sales values (one per week). I am trying to automatically calculate the optimal number of clusters for any unknown dataset.

I tried two criteria: the elbow method and the silhouette coefficient.

For the silhouette coefficient I got, for 1 to 20 clusters, values from 0.059 to 0.117, which is (in my opinion) extremely low (I have heard that values around 0.7 are normal).

For the elbow method I used the inertia_ attribute (the sum of squared distances) of the fitted KMeans and appended it to a list for each k (also from 1 to 20). I got values between 21782 for k=1 and 15323 for k=20.
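The scan itself looks roughly like this (simplified; X stands for the preprocessed feature matrix described below, and the silhouette loop has to start at k=2 since the coefficient is undefined for a single cluster):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)                          # elbow criterion
    silhouettes.append(silhouette_score(X, km.labels_))   # silhouette criterion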

Now I am not really sure how to interpret these values. Is my data not that separable?

EDIT: Thank you for the answer. As this seems to be a problem of data/preprocessing, here is how I process the data:

import numpy as np
import pandas as pd

# load the raw sales data, indexed by date
data = pd.read_csv('/home/dev/Desktop/TD_DM.csv', parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
# one column per article, one row per date
data_extracted = data.groupby(['DATE','ARTICLENO'])['QUANTITY'].sum().unstack()
print(data_extracted.index)
data_extracted = data_extracted.fillna(value=np.nan)
#data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2], errors="coerce")
# aggregate to weekly sales
data_resampled = data_extracted.resample('W-MON', label='left', loffset=pd.DateOffset(days=1)).sum()
print(data_resampled)

Here is what the printed data_resampled looks like:

2017-12-19           13.0         2600.0            0.0            0.0   
2017-12-26           28.0         2840.0            0.0            0.0   
2018-01-02           34.0         4840.0            0.0            0.0   
2018-01-09           35.0         6140.0            0.0            0.0   
2018-01-16            6.0         5800.0            0.0            0.0   
2018-01-23            3.0         5980.0            0.0            0.0   
2018-01-30            0.0         6100.0            0.0            0.0   
2018-02-06           24.0         5020.0            0.0            0.0   
2018-02-13           60.0         6380.0            0.0            0.0   
2018-02-20           47.0         6220.0            0.0            0.0   
2018-02-27           73.0         5460.0            0.0            0.0   
2018-03-06           69.0         5780.0            0.0            0.0   
2018-03-13           33.0         5520.0            0.0            0.0   
2018-03-20           36.0         5540.0            0.0            0.0   
2018-03-27           27.0         5360.0            0.0            0.0   
2018-04-03           28.0         4920.0            0.0            0.0   
2018-04-10           31.0         5520.0            0.0            0.0   
2018-04-17           46.0         5660.0            1.0           21.0   
2018-04-24           26.0         5040.0           18.0           40.0   
2018-05-01           52.0         5540.0           18.0           40.0   
2018-05-08           36.0         5440.0            3.0           26.0   
2018-05-15           36.0         5720.0            5.0           18.0   
2018-05-22           52.0         4360.0            0.0           22.0   
2018-05-29           52.0         4760.0            0.0           18.0   

The column headers (cut off here) would be the corresponding article numbers.

The next step is to locate one full year:

data_extracted = data_resampled.loc['2016-01-01' : '2016-12-31']

Then I start the preprocessing to remove columns with too many NaNs or zeros:

max_nan_count = 5
# there are headers at the 'bottom' too, so remove them
data_extracted = data_extracted.iloc[:, :-1]
# drop columns (articles) with more than max_nan_count missing weeks
data_extracted = data_extracted.drop(data_extracted.columns[data_extracted.apply(lambda col: col.isnull().sum() > max_nan_count)], axis=1)
# week-over-week percentage change per article
data_pct_change = data_extracted.astype(float).pct_change(axis=0).replace([np.inf, -np.inf], np.nan).fillna(0)
# drop articles whose percentage changes sum to exactly 0 (no usable signal)
data_pct_change.drop([col for col, val in data_pct_change.sum().iteritems() if val == 0], axis=1, inplace=True)
print(data_pct_change)

And this is what the percentage change looks like:

2016-01-03       0.000000       0.000000       0.000000       0.000000   
2016-01-10       0.284091       0.062500       0.181548       0.252427   
2016-01-17       0.011799       0.117647       0.110831       0.211886   
2016-01-24       0.008746       0.807018       0.092971       0.140725   
2016-01-31      -0.020231      -0.411003       0.056017      -0.186916   
2016-02-07      -0.014749      -0.087912       0.033399      -0.098851   
2016-02-14       0.218563       0.138554       0.136882       0.229592   
2016-02-21      -0.233415      -0.343915      -0.322742      -0.296680   
2016-02-28       0.448718       0.661290       0.535802       0.439528   
2016-03-07      -0.057522      -0.048544      -0.107717       0.012295   
2016-03-14       0.009390      -0.030612       0.234234      -0.062753   
2016-03-21      -0.039535       0.000000      -0.068613       0.032397   
2016-03-28       0.232446       0.210526       0.153605       0.165272   
2016-04-04       0.001965       0.008696      -0.077446       0.028725   
2016-04-11      -0.133333      -0.185345      -0.148748      -0.219895   
2016-04-18       0.108597       0.174603       0.216263       0.199105   
2016-04-25       0.091837      -0.207207      -0.085349       0.016791   
2016-05-02      -0.052336       0.454545       0.149300      -0.023853   
2016-05-09       0.218935       0.000000       0.036536       0.058271   
2016-05-16       0.190939       0.210938      -0.080940       0.156306 

Now I have an additional normalization step:

from sklearn import preprocessing

# scale each column (article) to unit L2 norm, then transpose so rows are articles
normalized_modeling_data = preprocessing.normalize(data_pct_change, norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data, columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()

This normalized_modeling_data is then used for the k-means clustering. The resulting clusters actually look quite logical and reliable to me, so is there any mistake in my code?
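The clustering call itself is not shown above; it is essentially the following sketch (the value of n_clusters is just a placeholder for whatever the scan suggests):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# rows of normalized_modeling_data are articles after the transpose above
kmeans = KMeans(n_clusters=5, random_state=0).fit(normalized_modeling_data)
print(silhouette_score(normalized_modeling_data, kmeans.labels_))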

Topic: scikit-learn, python, k-means

Category: Data Science


Using the elbow method, you can also determine the number of clusters quantitatively and automatically (as opposed to judging the bend by eye) if you introduce a quantity called the "elbow strength". Essentially, it is based on the derivative of the elbow plot, with some additional information-enhancing tricks. More details about the elbow strength can be found in the supplementary information of the following publication:

https://iopscience.iop.org/article/10.1088/2632-2153/abd87c
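If you only need a quick heuristic and do not want to implement the full method, a crude stand-in (not the elbow strength itself) is to pick the k at the sharpest bend, i.e. the largest discrete second derivative of the inertia curve:

import numpy as np

# inertias[i] is the k-means inertia for k = i + 1 (a monotonically decreasing curve)
def crude_elbow(inertias):
    curvature = np.diff(inertias, n=2)     # discrete second derivative of the elbow plot
    return int(np.argmax(curvature)) + 2   # shift the index back to the corresponding k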


The silhouette values are definitely very bad. Most likely the data is not suitable for the clustering method you chose, or the silhouette coefficient is not an appropriate measure for it (although you probably used k-means, which silhouette works fine with). Improve your preprocessing of the data!

Inertia values cannot be compared across data sets because they are highly data-dependent. If you scale your data by a factor of 10, your inertia values will be 100x larger. Since we don't have your data, we have no chance of interpreting these values: they could be huge or tiny.
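For a concrete illustration of that scale dependence (on synthetic data, not yours):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 52)
km_small = KMeans(n_clusters=3, random_state=0).fit(X)
km_large = KMeans(n_clusters=3, random_state=0).fit(10 * X)
print(km_large.inertia_ / km_small.inertia_)   # ~100: inertia is a sum of *squared* distances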
