High-Performance Classification or Similarity Algorithm for Mixed Data Types?

I have a database holding 10-ish features that describe different breeds of dogs. They are mostly categorical features, but some provide ranges of values. Here's a demo representation of the database, showing the mixture:

|Breed|Min_Height|Max_Height|Min_Weight|Max_Weight|sub_cat|is_friendly|
|-----|----------|----------|----------|----------|-------|-----------|
|Dober|20        |20        |40        |52        |sport  |FALSE      |
|Pood |15        |25        |35        |45        |water  |TRUE       |
...

As you can see, the data is mixed and the ranges have some overlap from entry to entry.

Say I receive an input of:

|height|weight|sub_cat|is_friendly|
|------|------|-------|-----------|
|16    |43    |water  |TRUE       |

I need to find the 5 most similar breeds in the database, and give the user a probability for each of those 5.

The pain points are the ranges, the way they overlap from entry to entry, and the mixed-datatype nature of the problem. Gower's Distance caught my eye, but the ranges are throwing me off.

I thought about collapsing each range to its mean and computing the similarity between the input and those means, but for min_height and max_height in the example database above, both entries have the same mean (20)! So that won't work.

Something that allows me to assign weights to features would be nice too (e.g., we are confident in the accuracy of the input weight value but less so in the height, so the similarity calculation should favor weight).

How should I approach this classification problem?

Tags: supervised-learning, classification, python, similarity, clustering



You have many options. One would be to train a classification model where you encode the categorical features with one-hot encoding and let the algorithm do the rest.
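
A minimal sketch of that route, assuming scikit-learn is available. Since a classifier needs point-valued training rows and the database stores one range per breed, this sketch samples synthetic points uniformly inside each breed's ranges; that sampling step, the column names, and the choice of a k-NN classifier are all illustrative assumptions, not a prescribed method:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy version of the breed table from the question.
db = pd.DataFrame({
    "Breed":       ["Dober", "Pood"],
    "Min_Height":  [20, 15], "Max_Height": [20, 25],
    "Min_Weight":  [40, 35], "Max_Weight": [52, 45],
    "sub_cat":     ["sport", "water"],
    "is_friendly": [False, True],
})

# A classifier needs point-valued rows, so expand each breed's ranges
# by uniform sampling (an assumption, not the only way to handle ranges).
rng = np.random.default_rng(0)
n = 200
train = pd.concat(
    [pd.DataFrame({
        "height": rng.uniform(b.Min_Height, b.Max_Height, n),
        "weight": rng.uniform(b.Min_Weight, b.Max_Weight, n),
        "sub_cat": b.sub_cat,
        "is_friendly": b.is_friendly,
        "Breed": b.Breed,
    }) for _, b in db.iterrows()],
    ignore_index=True,
)

# One-hot encode the categoricals, scale the numerics, and fit any
# classifier that supports predict_proba; k-NN is one convenient choice.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["height", "weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sub_cat", "is_friendly"]),
])
model = Pipeline([("pre", pre), ("clf", KNeighborsClassifier(n_neighbors=15))])
model.fit(train.drop(columns="Breed"), train["Breed"])

# The input row from the question.
query = pd.DataFrame([{"height": 16, "weight": 43,
                       "sub_cat": "water", "is_friendly": True}])
proba = model.predict_proba(query)[0]
top5 = sorted(zip(model.classes_, proba), key=lambda t: -t[1])[:5]
print(top5)
```

`predict_proba` gives a score per breed, so the 5 most similar breeds and their probabilities fall out directly.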

With similarity measures, I would probably consider some kind of weighted similarity. The simple option is to calculate a similarity score for the numerical features, then combine it with the binary similarity of the categorical features. For example, if sub_cat and is_friendly are as important as the dimensions, you could take the simple mean:

final_sim_score = 0.33 * dim_score + 0.33 * sub_cat_score + 0.33 * friendly_score

Where sub_cat_score and friendly_score are 0 (different value) or 1 (same value).
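
A minimal sketch of that weighted combination, assuming the range features are scored by the query point's distance to the interval (one way around the midpoint problem the question raises, not the only option). The weights, the `scale` normalizers, and the toy data are all illustrative:

```python
import pandas as pd

# Same toy breed table as above.
db = pd.DataFrame({
    "Breed":       ["Dober", "Pood"],
    "Min_Height":  [20, 15], "Max_Height": [20, 25],
    "Min_Weight":  [40, 35], "Max_Weight": [52, 45],
    "sub_cat":     ["sport", "water"],
    "is_friendly": [False, True],
})

query = {"height": 16, "weight": 43, "sub_cat": "water", "is_friendly": True}

# Feature weights (the asker's example: trust weight more than height).
# They sum to 1 and are purely illustrative.
W = {"height": 0.15, "weight": 0.35, "sub_cat": 0.25, "is_friendly": 0.25}

def range_sim(x, lo, hi, scale):
    """Similarity of a point to an interval: 1 inside the range, then
    decaying linearly with distance to the nearest endpoint. `scale`
    (roughly the spread of the feature) is a tuning knob."""
    gap = max(lo - x, x - hi, 0.0)
    return max(0.0, 1.0 - gap / scale)

scores = {}
for _, b in db.iterrows():
    scores[b.Breed] = (
        W["height"] * range_sim(query["height"], b.Min_Height, b.Max_Height, scale=10)
        + W["weight"] * range_sim(query["weight"], b.Min_Weight, b.Max_Weight, scale=20)
        + W["sub_cat"] * (query["sub_cat"] == b.sub_cat)              # 0 or 1
        + W["is_friendly"] * (query["is_friendly"] == b.is_friendly)  # 0 or 1
    )

# Rank, then normalize the top-5 scores into pseudo-probabilities.
top5 = sorted(scores.items(), key=lambda t: -t[1])[:5]
total = sum(s for _, s in top5)
print([(breed, round(s / total, 3)) for breed, s in top5])
```

Note that normalizing the top-5 scores this way yields similarity-derived pseudo-probabilities, not calibrated ones; if you need real probabilities, the classifier route above is the better fit.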
