DBSCAN on textual and numerical columns

I have a dataset which has two columns:

title      price
sentence1  12
sentence2  13

I have used doc2vec to convert the sentences into vectors of size 100 as below:

import multiprocessing

import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn import utils
from tqdm import tqdm

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

all_content = []
j=0

for title in query_result['title_clean'].values:
    all_content.append(LabeledSentence1(title,[j]))
    j+=1
print("Number of texts processed:", j)

cores = multiprocessing.cpu_count()

d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, 
min_count=2, sample = 0, workers=cores, alpha=0.025, 
min_alpha=0.001)

d2v_model.build_vocab([x for x in tqdm(all_content)])

all_content  = utils.shuffle(all_content)

d2v_model.train(all_content,total_examples=len(all_content), epochs=30)

So d2v_model.docvecs.doctag_syn0 returns the vectors of all the sentences (in gensim 4.x the same matrix is available as d2v_model.dv.vectors).

I now want to perform clustering using DBSCAN, but since the other column, price, is numeric, I am having trouble fitting the final data to the model. My problem is similar to one described on Stack Overflow: one column holds an array of size 100 per row while the other column is a plain number, so when I run DBSCAN on the combined data I get the same error.
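The error can be reproduced with a hypothetical miniature of such data (the array values and prices below are made up for illustration):

```python
import numpy as np

# Hypothetical miniature of the combined data: each row pairs a 100-d
# doc2vec vector (stored as a single object) with a plain price number.
rows = [(np.random.rand(100), 12), (np.random.rand(100), 13)]
data = np.array(rows, dtype=object)

# The result has dtype=object rather than a flat 2-D float matrix, which
# is why scikit-learn estimators reject it with "setting an array element
# with a sequence".
print(data.dtype)
```

The fix, in general, is to expand the per-row arrays into ordinary columns so the whole input is one numeric 2-D matrix.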

Is there a smart way to handle such cases and combine the doc2vec output with other numeric columns to prepare the data for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:

clf = DBSCAN(eps=0.5, min_samples=10)
clf.fit(both_numeric_categical_columns)
labels = clf.labels_.tolist()

cluster1 = query_result_mini.copy()

cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)

Tags: doc2vec, word-embeddings, dbscan, categorical-data, clustering

Category: Data Science


You did not mention which package you are using. If you are using scikit-learn, sklearn.pipeline.FeatureUnion concatenates the results of multiple transformer objects.

Something like this:

from sklearn.cluster       import DBSCAN
from sklearn.compose       import ColumnTransformer
from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler

# FeatureUnion only accepts (name, transformer) pairs; to also select
# columns by name, use ColumnTransformer, which likewise concatenates each
# transformer's output. Note that a gensim Doc2Vec model is not itself a
# scikit-learn transformer, so it must be wrapped in one first
# (Doc2VecTransformer below is such a hypothetical wrapper).
pipeline = Pipeline([('feats', ColumnTransformer([
                        ('doc2vec', Doc2VecTransformer(d2v_model), 'title_clean'),
                        ('numeric', StandardScaler(), ['price'])
                    ])),
                    ('cluster', DBSCAN())
                    ])
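Since the document vectors are already computed, an alternative that avoids writing a custom transformer is to stack them with the scaled price column using numpy and fit DBSCAN on the result. A minimal sketch with random stand-in data (the shapes and DBSCAN parameters mirror the question; the values are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(20, 100))         # stand-in for d2v_model's document vectors
price    = rng.uniform(10, 20, size=(20, 1))  # stand-in for the 'price' column

# Scale the price so it is comparable in magnitude to the embedding
# dimensions, then concatenate column-wise into one numeric matrix.
X = np.hstack([doc_vecs, StandardScaler().fit_transform(price)])

clf = DBSCAN(eps=0.5, min_samples=10)
labels = clf.fit_predict(X)
print(X.shape, labels.shape)
```

With real data, X would be np.hstack([d2v_model.dv.vectors, scaled_price]); eps usually needs tuning, since adding the price dimension changes pairwise distances.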
