DBSCAN results from a pickled model do not match the original clustering results in pandas

I've built a DBSCAN clustering model. The labels it produces during training do not match the labels I get after loading the model from the pickle files.

I am clustering the WT column based on the HD and MC columns.

data = HD,MC
Target = WT

In the dataframe below, the first record is assigned to cluster 0.

But when I run the same record through the model loaded from the 'pkl' file, the predicted result is [-1].

Dataframe:

      HD         MC             WT         Cluster
      200        Other          4.5        0
      150        Pep            5.6        0
      100        Pla            35         -1
      50         Same           15         0

Code:

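 from collections import Counter
 import pandas as pd
 from sklearn import preprocessing
 from sklearn.cluster import DBSCAN

 # Encode the categorical MC column as integers
 le = preprocessing.LabelEncoder()
 df['MC encoded'] = le.fit_transform(df['MC'])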

 col_1 = ['HD','MC encoded']
 data = df[col_1]
 col_2 = ['WT']
 target = df[col_2]
 data = data.fillna(value=0)


 model = DBSCAN(eps=1, min_samples=20).fit(data)
 outliers_df = pd.DataFrame(data)
 print(Counter(model.labels_))

 x = model.fit_predict(target)
 print(Counter(x))

Result:

  Counter({-1: 604, 0: 142, 1: 83, 9: 36, 2: 27, 7: 26, 10: 26, 8: 24, 4: 23, 5: 23, 3: 22, 11: 21, 6: 20, 12: 20, 13: 20})
  Counter({0: 1093, -1: 24})

Code:

  df["Cluster"] = x

  filename1 = '/model.pkl'
  model_df = open(filename1, 'wb')
  pickle.dump(model,model_df)
  model_df.close()

  output = open('/MC.pkl', 'wb')
  pickle.dump(le, output)
  output.close()

  with open('model.pkl', 'rb') as file:  
     pickle_model = pickle.load(file)


  pkl_file = open('MC.pkl', 'rb')
  le_mc = pickle.load(pkl_file) 
  pkl_file.close()


  def testing(HD, MC, WT):
      test = {'HD': [HD], 'MC': [MC], 'WT': [WT]}
      test = pd.DataFrame(test)
      test['MC_encoded'] = le_mc.transform(test['MC'])
      pred_val = pickle_model.fit_predict(test[['HD', 'MC_encoded']])
      print(pred_val)
      return pred_val

  pred_val = testing(200, 'Other', 4.5)

Result:

    [-1]



Without looking at anything else:

pred_val = pickle_model.fit_predict(test[['HD','MC_encoded']])

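By calling fit_predict() you are refitting your pickle_model on your test data, which here is a single sample. With min_samples=20, one row can never become a core point, so DBSCAN labels it as noise and returns [-1] no matter what the original model found. Note that scikit-learn's DBSCAN has no separate predict() method, so you cannot simply swap fit_predict() for predict(). To label a new point without refitting, you have to compare it against the fitted model yourself, for example by checking whether it lies within eps of one of the core samples stored in components_. Also note that your training code calls model.fit_predict(target) after the first fit, so the model you pickle is fitted on WT rather than on HD and MC.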
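A rough sketch of labeling new points without refitting is given here, assuming scikit-learn's DBSCAN, which exposes the fitted core samples as components_ and their indices as core_sample_indices_. The helper name dbscan_predict and the toy data are hypothetical and not part of scikit-learn or the question. It assigns a new point the label of its nearest core sample if that sample is within eps, and -1 otherwise. This is one common workaround, not an official API.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_predict(model, X_new):
    """Hypothetical helper: label each new point with the cluster of its
    nearest core sample if that sample is within eps, otherwise -1 (noise)."""
    X_new = np.asarray(X_new, dtype=float)
    labels = np.full(len(X_new), -1, dtype=int)
    if len(model.components_) == 0:
        return labels  # no core samples were found, so everything is noise
    # Cluster labels of the core samples, aligned with model.components_
    core_labels = model.labels_[model.core_sample_indices_]
    for i, point in enumerate(X_new):
        dists = np.linalg.norm(model.components_ - point, axis=1)
        nearest = np.argmin(dists)
        if dists[nearest] <= model.eps:
            labels[i] = core_labels[nearest]
    return labels


# Toy data (made up for illustration): two tight blobs
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2)),
])
model = DBSCAN(eps=0.5, min_samples=5).fit(X_train)

# Two points near the blobs and one far away from both
predicted = dbscan_predict(model, [[0.0, 0.0], [5.0, 5.0], [20.0, 20.0]])
print(predicted)

# For comparison: refitting on a single row, as fit_predict() does in the
# question, always yields noise because one row cannot reach min_samples=20
single_label = DBSCAN(eps=1, min_samples=20).fit_predict([[200.0, 0.0]])
print(single_label)  # [-1]
```

The same helper could be applied to the unpickled model, provided the model was fitted on the same columns (HD and the encoded MC) that are passed in at prediction time.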


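It is also worth checking how the pickle files are loaded. Note that '/MC.pkl' holds the fitted LabelEncoder, not a pandas dataframe, so df_pickle = pd.read_pickle('/MC.pkl') would simply return the encoder object. It is a convenient one-liner for loading, but make sure the path you load from is the same one you saved to ('/MC.pkl' versus 'MC.pkl' in your code).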
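For reference, a minimal round trip for the encoder might look like the sketch below. It assumes the four MC values shown in the question's dataframe and writes to a temporary directory (an assumption made here so that the save and load paths are guaranteed to match).

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Fit an encoder on the MC categories shown in the question
le = LabelEncoder().fit(['Other', 'Pep', 'Pla', 'Same'])

# Use one path variable for both saving and loading
path = os.path.join(tempfile.mkdtemp(), 'MC.pkl')
with open(path, 'wb') as f:
    pickle.dump(le, f)
with open(path, 'rb') as f:
    le_restored = pickle.load(f)

# The restored object is a LabelEncoder with the same classes
print(type(le_restored).__name__)
print(list(le_restored.classes_))
encoded_other = le_restored.transform(['Other'])
print(encoded_other)
```

If the restored encoder produces the same codes as the original, the mismatch in cluster labels is not caused by the pickling step.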
