Clustering with sets as values

I have gathered a large amount of qualitative data and am now looking to cluster it so as to make sense of it. For this, I am using Biolab's Orange.

In my data, specific values may co-occur in a given feature, or they may not. I am wondering how I could cluster the data (either in Orange or other software), where values that co-occur would be seen as two values, rather than one string.

To make matters clearer, imagine I have a feature X, with the possible values A and B. The values can occur in the following way: A, B, A and B.

Question: how can I cluster my data without "A and B" being treated as a separate string, but rather "A" and "B" co-occurring?

Topic orange3 orange clustering

Category Data Science


There are plenty of well-established methods for this.

Read up on the Jaccard index, which measures the similarity of two sets as the size of their intersection divided by the size of their union. I don't much like the current Wikipedia article, because I find its computer vision example unhelpful. I think the discussion should rather be based on the original biological species use case.
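As a rough illustration of how the Jaccard index applies to the question's feature X, here is a minimal sketch in plain Python. The helper names `parse_values` and `jaccard_similarity`, and the assumption that co-occurring values are joined by the literal string " and ", are mine and not part of the original question.

```python
def parse_values(cell, sep=" and "):
    """Split a cell such as 'A and B' into the set {'A', 'B'}.

    The separator is an assumption about how co-occurrence is encoded.
    """
    return {part.strip() for part in cell.split(sep) if part.strip()}


def jaccard_similarity(a, b):
    """Return |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The three possible contents of feature X from the question.
records = ["A", "B", "A and B"]
sets = [parse_values(r) for r in records]

# Pairwise Jaccard distance (1 - similarity) between all records.
distances = [[1.0 - jaccard_similarity(s, t) for t in sets] for s in sets]

for record, row in zip(records, distances):
    print(record, row)
```

With this measure, "A and B" sits halfway between "A" and "B" instead of being an unrelated third category. A distance matrix like this can then be handed to any clustering method that accepts precomputed distances, such as hierarchical clustering.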


It would appear that you have a data preprocessing task. You could replace feature X with two columns, 'A Occurrence' and 'B Occurrence'. If a value occurs for a record, its column contains a 1 for that record; if it does not occur, the column contains a 0.

Example:

X       | A Occurrence | B Occurrence
A       | 1            | 0
B       | 0            | 1
A and B | 1            | 1
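The indicator columns can be built programmatically. The following is a minimal sketch in plain Python, assuming the values in a cell are joined by " and "; the function name `to_indicators` is hypothetical and chosen only for this example.

```python
def to_indicators(cells, sep=" and "):
    """Turn cells like 'A and B' into 0/1 indicator rows, one column per value.

    The separator is an assumption about how co-occurrence is encoded.
    """
    value_sets = [
        {part.strip() for part in cell.split(sep) if part.strip()}
        for cell in cells
    ]
    # Every distinct value becomes one indicator column, in sorted order.
    vocabulary = sorted(set().union(*value_sets))
    rows = [[1 if value in s else 0 for value in vocabulary] for s in value_sets]
    return vocabulary, rows


vocabulary, rows = to_indicators(["A", "B", "A and B"])

print(" | ".join(f"{v} Occurrence" for v in vocabulary))
for row in rows:
    print(" | ".join(str(flag) for flag in row))
```

The resulting 0/1 columns can be saved as a table and loaded into the clustering tool in place of the original string feature.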
