EM clustering with missing and misspelled data

I am currently working on a project that requires me to cluster unlabeled input records. The records contain personal information such as name, DOB, height, sex, etc. We need to group all records that refer to the same person into one cluster. Here is some sample data:

+-------------+------------+------------+
| Field       | Record1    | Record2    |
+-------------+------------+------------+
| First Name  | 'Harry'    | 'Harry'    |
| Middle Name | 'Jay'      | 'J'        |
| Last Name   | 'Potter'   | 'Potter'   |
| DOB Month   | 1          | 1          |
| DOB Day     | 1          | 1          |
| DOB Year    | 1993       | 1993       |
| Dr License  | 'A1234567' | Null       |
| Sex         | 1          | 1          |
| Address     | 'Hogwarts' | 'Hagwarts' |
+-------------+------------+------------+

These two records should end up in the same cluster.
I am still in the early phase of the project. I want to try EM clustering with a Gaussian mixture model, but I do not know how to preprocess the strings (or whether I should at all), or how to deal with the Null values.

Topic expectation-maximization clustering machine-learning

Category Data Science


Convert your key-value table to wide format (one row per record), dummy-code the categorical columns so that they become numeric 0/1 columns, then calculate the distance matrix using the Jaccard distance.

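# Assumes my_data is a data frame in wide format (one row per record)
# and that the 'dummies' package is installed.
library(dummies)

# Dummy-code (one-hot encode) the categorical columns
dummy_dat <- dummy.data.frame(my_data)

# method = "binary" gives the Jaccard distance on 0/1 columns
jaccard_dist <- dist(dummy_dat, method = "binary")

# Distance matrix
jaccard_dist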
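
Once a distance matrix like this exists, one way to turn it into groups is hierarchical clustering. The following is only a minimal sketch under stated assumptions, not part of the answer above: the data frame `toy`, the helper `one_hot` (written in base R so that no extra package is needed), the average linkage, and the cut height `0.5` are all made up for illustration.

```r
# Toy records (hypothetical); rows 1 and 2 are meant to be the same person.
toy <- data.frame(
  first  = c("Harry", "Harry", "Ron"),
  middle = c("Jay", "J", "Bilius"),
  last   = c("Potter", "Potter", "Weasley"),
  sex    = c("1", "1", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 indicator column per distinct value.
# A missing value (NA) sets no indicator in its field.
one_hot <- function(df) {
  blocks <- lapply(names(df), function(nm) {
    vals <- as.character(df[[nm]])
    lv <- sort(unique(vals[!is.na(vals)]))
    m <- outer(vals, lv, "==")        # rows = records, cols = values
    m[is.na(m)] <- FALSE              # NA matches nothing
    storage.mode(m) <- "integer"
    colnames(m) <- paste(nm, lv, sep = "=")
    m
  })
  do.call(cbind, blocks)
}

onehot <- one_hot(toy)

# Jaccard distance on the indicator columns
jaccard_dist <- dist(onehot, method = "binary")

# Average-linkage hierarchical clustering, cut at an assumed height
hc <- hclust(jaccard_dist, method = "average")
groups <- cutree(hc, h = 0.5)
print(groups)
```

In this toy example the two 'Harry' rows differ only in the middle name, so they fall below the cut height and share a group. Note that exact-match indicators treat 'Hogwarts' and 'Hagwarts' as unrelated values, so a misspelling counts as a full mismatch in that field rather than as a near match.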

Gaussian mixture modeling is not appropriate for text.

Text just is not Gaussian.

Choose appropriate distribution assumptions instead.
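
One possible reading of this advice, offered as an assumption rather than as the answerer's method: instead of modeling the raw strings, compare two records field by field and model the resulting agree / disagree / missing outcomes, which are discrete and therefore a more natural fit for a Bernoulli or categorical mixture than for a Gaussian one. The sketch below only builds such a comparison vector for the two sample records. The helper `field_agrees` and the tolerance `tol = 0.2` are hypothetical; `adist` is the edit-distance function from base R's `utils` package.

```r
# The two records from the question as named character vectors (NA = Null).
rec1 <- c(first = "Harry", middle = "Jay", last = "Potter",
          license = "A1234567", address = "Hogwarts")
rec2 <- c(first = "Harry", middle = "J", last = "Potter",
          license = NA, address = "Hagwarts")

# Hypothetical helper: compare one field of two records.
# Returns NA when either value is missing; otherwise TRUE when the
# edit distance, divided by the longer string's length, is at most
# `tol` (an assumed tolerance for misspellings).
field_agrees <- function(a, b, tol = 0.2) {
  if (is.na(a) || is.na(b)) return(NA)
  d <- utils::adist(a, b)[1, 1]   # Levenshtein distance
  d / max(nchar(a), nchar(b)) <= tol
}

# One outcome per field: TRUE (agree), FALSE (disagree), NA (missing)
agreement <- mapply(field_agrees, rec1, rec2)
print(agreement)
```

With these assumed settings the misspelled address still counts as agreeing (one edit in eight characters), the missing driver's license stays `NA` instead of being forced into a match or a mismatch, and 'Jay' versus 'J' counts as a disagreement, which shows that abbreviations would need their own rule.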
