Should I use ordinal encoding or one-hot encoding with DBSCAN for content clustering on websites?

I want to group the preparation steps on recipe websites into a single cluster so that I can distinguish them from the rest of the page.

To achieve this, I extracted the DOM path (e.g. body-div-div-table-tr ....) of each text node on the page and one-hot encoded the paths before running the DBSCAN clustering algorithm.
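For readers who want to reproduce the setup, here is a minimal sketch of what it might look like with scikit-learn. The DOM paths, `eps` and `min_samples` values are made up for illustration, and it assumes that each complete path string is treated as one category. Under that assumption, any two different paths are exactly the same distance apart (sqrt(2)), no matter how many elements they share.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import OneHotEncoder

# Hypothetical DOM paths, one per text node (not taken from a real page).
dom_paths = [
    "body-div-div-ol-li",
    "body-div-div-ol-li",
    "body-div-div-ol-li",
    "body-div-div-ol-li-span",  # same step list, but wrapped in one extra tag
    "body-div-nav-ul-li",
    "body-div-nav-ul-li",
    "body-div-nav-ul-li",
]

# Assumption: the whole path string is one category, so each distinct
# path becomes its own binary column.
X = OneHotEncoder().fit_transform(np.array(dom_paths).reshape(-1, 1)).toarray()

# eps and min_samples are illustrative values only.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

for path, label in zip(dom_paths, labels):
    print(label, path)
```

In this sketch the path with the extra `span` ends up as noise (label -1) instead of joining the other preparation steps, which matches the behaviour described in the question.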

My hope was that DBSCAN would group not only 100% identical DOM paths into one cluster, because sometimes one preparation step is, e.g., in a tag and the others are not. But even though I varied the epsilon and MinPoints parameters a lot, it does not recognize them all as one cluster.

My question: is one-hot encoding perhaps the wrong approach, because a DOM path is not purely categorical but rather somewhat ordinal? The more path elements two DOM paths have in common, the more likely it is that they belong to the same cluster.
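To make that intuition concrete, here is a small sketch of one possible way to count the elements two paths have in common, starting from the root. The helper name `shared_prefix_length` and the example paths are hypothetical, and counting only the shared leading elements is just one of several ways to define "common elements".

```python
def shared_prefix_length(path_a: str, path_b: str, sep: str = "-") -> int:
    """Count how many leading path elements two DOM paths share."""
    count = 0
    for a, b in zip(path_a.split(sep), path_b.split(sep)):
        if a != b:
            break  # stop at the first element that differs
        count += 1
    return count


# Hypothetical paths: the first pair differs only by one trailing tag.
print(shared_prefix_length("body-div-div-ol-li", "body-div-div-ol-li-span"))
print(shared_prefix_length("body-div-div-ol-li", "body-div-nav-ul-li"))
```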

Topic one-hot-encoding feature-engineering feature-scaling dbscan feature-selection

Category Data Science


Based on the information you provide, I suggest trying both and comparing them with a cluster-quality metric such as the silhouette score or the Davies-Bouldin index. That gives you an objective way to compare the two preprocessing techniques: keep the one with the better score (a higher silhouette score, or a lower Davies-Bouldin index), since it produces the better-formed clusters.
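A rough sketch of such a comparison with scikit-learn could look like this. The DOM paths, the helper `score_clustering` and the `eps`/`min_samples` values are all assumptions made for illustration. The sketch also assumes that DBSCAN noise points (label -1) should be left out before scoring, and it returns `None` when the metrics cannot be computed.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical DOM paths, one per text node.
dom_paths = np.array([
    "body-div-div-ol-li",
    "body-div-div-ol-li",
    "body-div-div-ol-li",
    "body-div-div-ol-li-span",
    "body-div-nav-ul-li",
    "body-div-nav-ul-li",
    "body-div-nav-ul-li",
]).reshape(-1, 1)


def score_clustering(X, eps=0.5, min_samples=2):
    """Run DBSCAN on X and score the result, ignoring noise points."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = labels != -1  # assumption: noise is excluded from the scores
    n_clusters = len(set(labels[mask]))
    # Both metrics need at least 2 clusters and more samples than clusters.
    if n_clusters < 2 or n_clusters >= mask.sum():
        return None
    return {
        "silhouette": silhouette_score(X[mask], labels[mask]),          # higher is better
        "davies_bouldin": davies_bouldin_score(X[mask], labels[mask]),  # lower is better
    }


encodings = {
    "one_hot": OneHotEncoder().fit_transform(dom_paths).toarray(),
    "ordinal": OrdinalEncoder().fit_transform(dom_paths),
}

results = {name: score_clustering(X) for name, X in encodings.items()}
for name, scores in results.items():
    print(name, scores)
```

On real data you would run this over a range of `eps` and `min_samples` values for each encoding and compare the resulting scores.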
