EM clustering with missing and misspelled data
I am working on a project that requires clustering unlabeled records. Each record contains personal information such as name, DOB, height, sex, etc. Records that belong to the same person need to end up in the same cluster. Here is some sample data:
+-------------+------------+------------+
| Field       | Record1    | Record2    |
+-------------+------------+------------+
| First Name  | 'Harry'    | 'Harry'    |
| Middle Name | 'Jay'      | 'J'        |
| Last Name   | 'Potter'   | 'Potter'   |
| DOB Month   | 1          | 1          |
| DOB Day     | 1          | 1          |
| DOB Year    | 1993       | 1993       |
| Dr License  | 'A1234567' | Null       |
| Sex         | 1          | 1          |
| Address     | 'Hogwarts' | 'Hagwarts' |
+-------------+------------+------------+
Those two records should be clustered in the same group.
I am still in the early phase of the project. I want to try EM clustering with a Gaussian mixture model, but I do not know whether I should preprocess the string fields (and if so, how), or how to deal with the Null values.
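To make the question concrete, here is a minimal sketch of one preprocessing step I could imagine, not something I have settled on. Instead of feeding raw strings to a mixture model, it compares a pair of records field by field and produces a numeric similarity per field, keeping a missing value as `None` rather than guessing it. The names `field_similarity` and `comparison_vector` are hypothetical helpers, the dictionary keys are my own choice, and using `difflib.SequenceMatcher` as the string similarity is an assumption; this is a pairwise-comparison approach, which is not the same as running a Gaussian mixture directly on the records.

```python
from difflib import SequenceMatcher

# The two sample records from the table above; Null is represented as None.
record1 = {
    "first_name": "Harry", "middle_name": "Jay", "last_name": "Potter",
    "dob_month": 1, "dob_day": 1, "dob_year": 1993,
    "dr_license": "A1234567", "sex": 1, "address": "Hogwarts",
}
record2 = {
    "first_name": "Harry", "middle_name": "J", "last_name": "Potter",
    "dob_month": 1, "dob_day": 1, "dob_year": 1993,
    "dr_license": None, "sex": 1, "address": "Hagwarts",
}


def field_similarity(a, b):
    """Return a similarity in [0, 1], or None if either value is missing."""
    if a is None or b is None:
        return None  # keep missingness explicit instead of imputing
    if isinstance(a, str) and isinstance(b, str):
        # Case-insensitive fuzzy match so small misspellings still score high.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Non-string fields (dates, sex code): exact agreement only.
    return 1.0 if a == b else 0.0


def comparison_vector(r1, r2):
    """Compare two records field by field."""
    return {field: field_similarity(r1[field], r2.get(field)) for field in r1}


features = comparison_vector(record1, record2)
for field, score in features.items():
    print(f"{field:12s} {score}")
```

With this representation the misspelled address still gets a high score and the missing license shows up as `None`, so the open question becomes how a model fitted with EM should treat those `None` entries.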
Topic expectation-maximization clustering machine-learning
Category Data Science