Low silhouette coefficient
I am running k-means clustering on a dataset of sales values for articles.
Each article has 52 sales values (one per week). I am trying to automatically determine the optimal number of clusters for any unknown dataset.
I tried two criteria: the elbow method and the silhouette coefficient.
For the silhouette coefficient, I got values from 0.059 to 0.117 for 1 to 20 clusters, which seems extremely low to me (I have heard that values around 0.7 are normal).
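For reference, a silhouette scan of this kind could look like the minimal sketch below. It is not the original code from the run described above: the data is a synthetic stand-in (`X_demo`), and the helper name `silhouette_by_k` and the settings `n_init=10` and `random_state=0` are assumptions for illustration. scikit-learn's `silhouette_score` needs at least two clusters, so the loop starts at k=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=0):
    """Return {k: mean silhouette coefficient} for each k in k_values."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # Mean silhouette over all samples; the value lies in [-1, 1].
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in (assumption): 200 "articles" with 52 weekly values each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 52))

silhouette_scores = silhouette_by_k(X_demo, range(2, 21))
best_k = max(silhouette_scores, key=silhouette_scores.get)
print(best_k, silhouette_scores[best_k])
```

On structureless data like this stand-in, the scores would be expected to stay close to zero for every k, so a uniformly low curve is a hint that the points do not form compact, well-separated groups in the chosen feature space.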
For the elbow method, I used the inertia_ attribute (sum of squared distances) of the fitted k-means model and appended it to a list for each iteration (also for 1 to 20 clusters). I got values between 21782 for k=1 and 15323 for k=20.
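The elbow computation described above can be sketched roughly as follows. Again, the data is a synthetic stand-in and the helper name `inertia_by_k` is hypothetical. The last lines only redo the arithmetic on the two inertia values reported above: going from 1 to 20 clusters removes about 30% of the total squared distance.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values, random_state=0):
    """Return the k-means inertia (within-cluster sum of squares) per k."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in (assumption): 200 "articles" with 52 weekly values each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 52))

k_values = list(range(1, 21))
inertias = inertia_by_k(X_demo, k_values)

# Relative inertia reduction between k=1 and k=20 for the stand-in data.
demo_drop = 1 - inertias[-1] / inertias[0]

# Same ratio for the values reported in the question (21782 and 15323).
reported_drop = 1 - 15323 / 21782
print(round(reported_drop, 3))
```

Plotting `inertias` against `k_values` gives the usual elbow curve. A curve that falls slowly and smoothly, without a visible bend, is consistent with low silhouette values.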
Now I am not really sure how to interpret these values. Is my data not that separable?
EDIT: Thank you for the answer. As this seems to be a problem with the data or the preprocessing, here is how I process the data:
import numpy as np
import pandas as pd
from sklearn import preprocessing
data = pd.read_csv('/home/dev/Desktop/TD_DM.csv', parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE','ARTICLENO'])['QUANTITY'].sum().unstack()
data_extracted = data_extracted.fillna(value=np.nan)
#data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2], errors="coerce")
data_resampled = data_extracted.resample('W-MON', label='left', loffset=pd.DateOffset(days=1)).sum()
Here is what the printed data_resampled looks like:
2017-12-19 13.0 2600.0 0.0 0.0
2017-12-26 28.0 2840.0 0.0 0.0
2018-01-02 34.0 4840.0 0.0 0.0
2018-01-09 35.0 6140.0 0.0 0.0
2018-01-16 6.0 5800.0 0.0 0.0
2018-01-23 3.0 5980.0 0.0 0.0
2018-01-30 0.0 6100.0 0.0 0.0
2018-02-06 24.0 5020.0 0.0 0.0
2018-02-13 60.0 6380.0 0.0 0.0
2018-02-20 47.0 6220.0 0.0 0.0
2018-02-27 73.0 5460.0 0.0 0.0
2018-03-06 69.0 5780.0 0.0 0.0
2018-03-13 33.0 5520.0 0.0 0.0
2018-03-20 36.0 5540.0 0.0 0.0
2018-03-27 27.0 5360.0 0.0 0.0
2018-04-03 28.0 4920.0 0.0 0.0
2018-04-10 31.0 5520.0 0.0 0.0
2018-04-17 46.0 5660.0 1.0 21.0
2018-04-24 26.0 5040.0 18.0 40.0
2018-05-01 52.0 5540.0 18.0 40.0
2018-05-08 36.0 5440.0 3.0 26.0
2018-05-15 36.0 5720.0 5.0 18.0
2018-05-22 52.0 4360.0 0.0 22.0
2018-05-29 52.0 4760.0 0.0 18.0
The column headers would be the corresponding article number.
The next step is to select one full year:
data_extracted = data_resampled.loc['2016-01-01' : '2016-12-31']
Then I start the preprocessing to remove columns with too many NaNs or zeros:
max_nan_count = 5
#there are headers at the 'bottom' too, so remove them
data_extracted = data_extracted.iloc[:, :-1]
data_extracted = data_extracted.drop(data_extracted.columns[data_extracted.apply(lambda col: col.isnull().sum() > max_nan_count)], axis=1)
data_pct_change = data_extracted.astype(float).pct_change(axis=0).replace([np.inf, -np.inf], np.nan).fillna(0)
data_pct_change.drop([col for col, val in data_pct_change.sum().items() if val == 0], axis=1, inplace=True)
And this is what the percentage change looks like:
2016-01-03 0.000000 0.000000 0.000000 0.000000
2016-01-10 0.284091 0.062500 0.181548 0.252427
2016-01-17 0.011799 0.117647 0.110831 0.211886
2016-01-24 0.008746 0.807018 0.092971 0.140725
2016-01-31 -0.020231 -0.411003 0.056017 -0.186916
2016-02-07 -0.014749 -0.087912 0.033399 -0.098851
2016-02-14 0.218563 0.138554 0.136882 0.229592
2016-02-21 -0.233415 -0.343915 -0.322742 -0.296680
2016-02-28 0.448718 0.661290 0.535802 0.439528
2016-03-07 -0.057522 -0.048544 -0.107717 0.012295
2016-03-14 0.009390 -0.030612 0.234234 -0.062753
2016-03-21 -0.039535 0.000000 -0.068613 0.032397
2016-03-28 0.232446 0.210526 0.153605 0.165272
2016-04-04 0.001965 0.008696 -0.077446 0.028725
2016-04-11 -0.133333 -0.185345 -0.148748 -0.219895
2016-04-18 0.108597 0.174603 0.216263 0.199105
2016-04-25 0.091837 -0.207207 -0.085349 0.016791
2016-05-02 -0.052336 0.454545 0.149300 -0.023853
2016-05-09 0.218935 0.000000 0.036536 0.058271
2016-05-16 0.190939 0.210938 -0.080940 0.156306
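To make the behaviour of the percentage-change step concrete, here is a small sketch on a made-up frame (the column names `A`, `B`, `C` and all values are invented for illustration). It applies the same chain of `pct_change`, `replace`, `fillna`, and the sum-based column drop as the code above.

```python
import numpy as np
import pandas as pd

# Made-up weekly quantities (assumption): rows are weeks, columns are articles.
weekly = pd.DataFrame({
    "A": [0.0, 10.0, 15.0, 0.0, 0.0],   # contains zeros
    "B": [4.0, 4.0, 4.0, 4.0, 4.0],     # constant sales
    "C": [10.0, 15.0, 7.5, 7.5, 7.5],   # varies, but changes cancel out
})

# Week-over-week relative change; a zero in the previous week yields inf or NaN.
pct = weekly.astype(float).pct_change(axis=0)
cleaned = pct.replace([np.inf, -np.inf], np.nan).fillna(0)

# Drop every column whose changes sum to exactly zero.
zero_sum_cols = [col for col, val in cleaned.sum().items() if val == 0]
cleaned = cleaned.drop(columns=zero_sum_cols)

print(zero_sum_cols)
print(cleaned)
```

In this toy case the jump from 0 to 10 in column `A` becomes 0 after the `inf` replacement, so that increase is lost. Column `C` is dropped together with the constant column `B`, because its changes of +0.5 and -0.5 sum to zero even though the series is not flat.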
Now I have an additional normalization step:
normalized_modeling_data = preprocessing.normalize(data_pct_change, norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data, columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()
This normalized_modeling_data is what I use for the k-means clustering. The clustering result itself looks very logical and reliable, so is there a mistake in my code?
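As a sanity check on what the normalization step produces, the following sketch uses a made-up 52-week frame with four hypothetical articles. It applies the same `preprocessing.normalize(..., norm='l2', axis=0)` call followed by a transpose, then inspects the shape and the length of each resulting row.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Made-up percentage changes (assumption): 52 weeks x 4 hypothetical articles.
rng = np.random.default_rng(0)
pct_demo = pd.DataFrame(
    rng.normal(scale=0.2, size=(52, 4)),
    columns=["A", "B", "C", "D"],
)

# axis=0 scales each column (one article) to unit L2 norm.
normalized = preprocessing.normalize(pct_demo, norm="l2", axis=0)

# After the transpose, each row is one article with 52 weekly features.
article_rows = normalized.transpose()
row_norms = np.linalg.norm(article_rows, axis=1)

print(article_rows.shape)
print(row_norms)
```

With this setup every article vector has length 1, so k-means compares only the shape of the weekly change profile and not its magnitude.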