Mixed Data Type Classification / Neighbor Algorithm
Here is a hypothetical, simplified dataframe for my problem. The real data would be low-dimensional (around 20 features); this sample contains made-up information about a few dog breeds:
Breed | Min_Weight | Max_Weight | Min_Height | Max_Height | is_friendly | grp |
---|---|---|---|---|---|---|
Husky | 10 | 20 | 30 | 35 | True | working |
Poodle | 8 | 17 | 15 | 30 | False | terrier |
The algorithm would receive some information about a dog and would need to identify the k closest dog breeds based on that input. It needs to be high-performance.
Example: the algorithm receives an unknown breed with the following data:
Weight | Height | is_friendly | grp |
---|---|---|---|
18 | 23 | 1 | terrier |
Returns: the k closest breeds from our sample dataframe, along with a closeness score for each.
What sort of algorithm or model makes sense here, given multiple types of variables, ranges (min and max height; I am guessing I will need to generate data to fill in these ranges), and Boolean values?
Also, is there a way to weight certain characteristics? For example, we are confident in the measurement of the unknown dog's weight, so it should have more influence when choosing a breed; we are not confident about the height, so it should have less. How should I approach this problem?
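To make the desired behavior concrete, here is a rough sketch of the kind of interface I have in mind. It is not a claim that this is the right approach. It is loosely in the spirit of a Gower-style distance: each feature contributes a dissimilarity between 0 and 1, and the contributions are combined as a weighted average. The function names (`range_gap`, `mixed_distance`, `k_closest`), the `weights` values, and the `scales` values are all made up for illustration. The assumption that a measurement inside a breed's min/max range counts as distance 0 is also mine.

```python
# Sample rows copied from the table above.
BREEDS = [
    {"Breed": "Husky", "Min_Weight": 10, "Max_Weight": 20,
     "Min_Height": 30, "Max_Height": 35, "is_friendly": True, "grp": "working"},
    {"Breed": "Poodle", "Min_Weight": 8, "Max_Weight": 17,
     "Min_Height": 15, "Max_Height": 30, "is_friendly": False, "grp": "terrier"},
]


def range_gap(value, low, high, scale):
    """Return 0 if value lies in [low, high]; otherwise the distance to the
    nearest bound divided by scale, capped at 1 (assumed convention)."""
    if low <= value <= high:
        return 0.0
    gap = low - value if value < low else value - high
    return min(gap / scale, 1.0)


def mixed_distance(query, breed, weights, scales):
    """Weighted average of per-feature dissimilarities, each in [0, 1]."""
    parts = {
        "weight": range_gap(query["Weight"], breed["Min_Weight"],
                            breed["Max_Weight"], scales["weight"]),
        "height": range_gap(query["Height"], breed["Min_Height"],
                            breed["Max_Height"], scales["height"]),
        # Boolean and categorical features: 0 on a match, 1 on a mismatch.
        "is_friendly": 0.0 if bool(query["is_friendly"]) == breed["is_friendly"] else 1.0,
        "grp": 0.0 if query["grp"] == breed["grp"] else 1.0,
    }
    total_weight = sum(weights[name] for name in parts)
    return sum(weights[name] * parts[name] for name in parts) / total_weight


def k_closest(query, breeds, weights, scales, k):
    """Brute-force scan: score every breed, return the k smallest distances."""
    scored = sorted((mixed_distance(query, b, weights, scales), b["Breed"]) for b in breeds)
    return [(name, dist) for dist, name in scored[:k]]


# The unknown dog from the example above.
query = {"Weight": 18, "Height": 23, "is_friendly": 1, "grp": "terrier"}

# Hypothetical confidence weights: trust the weight measurement, distrust height.
weights = {"weight": 3.0, "height": 0.5, "is_friendly": 1.0, "grp": 1.0}

# Assumed normalizers: the overall span of each numeric column in the sample.
scales = {"weight": 12.0, "height": 20.0}

result = k_closest(query, BREEDS, weights, scales, k=2)
for name, dist in result:
    print(name, round(dist, 3))
```

Raising or lowering an entry in `weights` is what I mean by giving a characteristic more or less influence. Whether a linear scan like this is fast enough, or whether something smarter is needed, is part of what I am asking.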
Topic k-nn machine-learning-model classification algorithms clustering
Category Data Science