Question about Similarity vs Dissimilarity Matrix

Right now, I'm working on building a similarity or dissimilarity matrix for a set of data points, to be used with a clustering algorithm. If I want to use one of the many clustering algorithms available in R, such as the K-Medoids algorithm, does it require a similarity matrix or a dissimilarity matrix as its parameter?

What's the difference between the two?

If I use the Gower distance from the daisy function in R, does it output a similarity or a dissimilarity matrix?

Also, let's assume that I have $n$ features and they are all categorical (this is just an example). I have a custom distance function: when comparing two data points $G$ and $H$, I use the formula $$\sum_{i=1}^{n} X_i$$ where $X_i = 1$ if feature $i$ of $G$, $G_i$, and feature $i$ of $H$, $H_i$, are equal to each other, and $X_i = 0$ otherwise. So, $$X_i=1$$ if and only if $G_i=H_i$, for each of the $n$ categorical features. Will this result in a similarity or a dissimilarity matrix?

Also, as mentioned above, if I want to use one of the many clustering algorithms available in R, such as the K-Medoids algorithm, does it require a similarity or a dissimilarity matrix as its parameter?

In general, does the similarity or dissimilarity matrix get used for these?

Topic distance similarity k-means clustering bigdata

Category Data Science


In many machine learning packages, a dissimilarity matrix (that is, a matrix of pairwise distances) is the input to clustering routines, and sometimes to semi-supervised models.

However, the real parameter is the type of distance. You need to tune the distance type just as you tune k in k-means, choosing it according to your business objective.

See https://en.wikipedia.org/wiki/Distance for distance types. Additionally, in some cases correlation is used as a similarity.
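To illustrate why the distance type behaves like a tunable parameter, here is a minimal Python sketch (not R, and not taken from any particular package). The helper `dissimilarity_matrix` and the three sample points are made up for illustration. It builds two dissimilarity matrices from the same data and shows that the nearest neighbour of the first point changes with the distance type.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two numeric points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def dissimilarity_matrix(points, dist):
    # Hypothetical helper: full pairwise matrix for a chosen distance function.
    return [[dist(p, q) for q in points] for p in points]

# Made-up sample data.
points = [(0.0, 0.0), (3.0, 3.0), (0.0, 5.0)]

d_euclid = dissimilarity_matrix(points, euclidean)
d_manhat = dissimilarity_matrix(points, manhattan)

# Under Euclidean distance, point 1 is closer to point 0 than point 2 is;
# under Manhattan distance the order flips.
print(d_euclid[0][1] < d_euclid[0][2])
print(d_manhat[0][1] < d_manhat[0][2])
```

Since a clustering algorithm only sees the matrix, a different distance type can lead to different clusters on the same data.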


A similarity is larger if the objects are more similar.

A dissimilarity is larger if the objects are less similar.

This sounds trivial, but if you get the sign wrong, you suddenly search for the worst rather than the best solution...
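Applied to the matching count from the question, a small Python sketch (function names and sample records are invented for illustration) shows which direction that score goes. Counting equal features gives a value that grows as the objects become more alike, so it is a similarity. Subtracting it from the number of features gives the number of mismatches, which is a dissimilarity.

```python
def match_similarity(g, h):
    # Number of features on which the two records agree (the question's sum of X_i).
    return sum(1 for a, b in zip(g, h) if a == b)

def mismatch_dissimilarity(g, h):
    # Number of features on which the two records differ: n minus the matches.
    return len(g) - match_similarity(g, h)

# Made-up categorical records with n = 3 features.
G = ("red", "small", "round")
H = ("red", "large", "round")
K = ("blue", "large", "square")

print(match_similarity(G, G), match_similarity(G, H), match_similarity(G, K))
print(mismatch_dissimilarity(G, G), mismatch_dissimilarity(G, H), mismatch_dissimilarity(G, K))
```

Identical records get the largest match count and a mismatch count of zero, which is the behaviour expected of a similarity and a dissimilarity respectively.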

It's easy to see that a distance is always a dissimilarity.

K-medoids could be implemented for similarities, but I am not aware of any implementation that does not expect the data to be a dissimilarity. With many implementations it may be fine to simply pass -similarity, because all they care about is minimizing a sum of dissimilarities, which can trivially be shown to be equivalent to maximizing the sum of similarities.
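The equivalence can be checked on a toy case. The Python sketch below is a simplified single-cluster version of the medoid choice, not a full k-medoids implementation. The function names and the similarity matrix are invented for illustration. It picks the medoid once by maximizing the summed similarity and once by minimizing the summed negated similarity.

```python
def best_medoid_by_dissimilarity(d):
    # Index of the object with the smallest total dissimilarity to all objects.
    n = len(d)
    return min(range(n), key=lambda m: sum(d[i][m] for i in range(n)))

def best_medoid_by_similarity(s):
    # Index of the object with the largest total similarity to all objects.
    n = len(s)
    return max(range(n), key=lambda m: sum(s[i][m] for i in range(n)))

# Made-up symmetric similarity matrix for three objects.
similarity = [
    [3, 2, 0],
    [2, 3, 1],
    [0, 1, 3],
]

# Negating every entry turns "larger is more alike" into "smaller is more alike".
negated = [[-v for v in row] for row in similarity]

print(best_medoid_by_similarity(similarity))
print(best_medoid_by_dissimilarity(negated))
```

Both calls select the same object. Whether a particular library accepts negative entries or a non-zero diagonal is something to check in its documentation. If it does not, subtracting each similarity from its maximum possible value is a common alternative that keeps the entries non-negative.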
