EM clustering with missing and misspelled data

roy

July 19, 2018, 18:02

I am working on a project that requires clustering unlabeled records. Each record contains personal information such as name, date of birth (DOB), height, and sex. Records that belong to the same person should end up in the same cluster. Here is some sample data:

+------------------------------------------+
|                 Record1      Record2     |
+------------------------------------------+
| First Name      'Harry'      'Harry'     |
| Middle Name     'Jay'        'J'         |
| Last Name       'Potter'     'Potter'    |
| DOB Month       1            1           |
| DOB Day         1            1           |
| DOB Year        1993         1993        |
| Driver License  'A1234567'   Null        |
| Sex             1            1           |
| Address         'Hogwarts'   'Hagwarts'  |
+------------------------------------------+

These two records should be placed in the same cluster.
I am still at the beginning of the project. I would like to try EM clustering with a Gaussian mixture model, but I do not know whether or how I should preprocess the string fields, or how to handle the Null values.

Topic expectation-maximization clustering machine-learning

Category Data Science

knb · Accepted Answer · June 19, 2018, 16:39

Convert your key-value table to wide format (one row per record) and one-hot (dummy) encode the categorical columns so that they become numeric 0/1 indicators. Then calculate the distance matrix using the Jaccard distance, which compares records by the indicators they share.

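# my_data is assumed to be a data frame in wide format (one row per record).
library(dummies)

# One-hot (dummy) encode the categorical columns as 0/1 indicator columns.
dummy_dat <- dummy.data.frame(my_data)

# dist() with method = "binary" returns the Jaccard distance between rows.
jaccard_dist <- dist(dummy_dat, method = "binary")

# Print the distance matrix.
jaccard_dist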
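
For readers who want to see what the encoding and the distance actually compute, here is a minimal sketch of the same idea in plain Python rather than R. It is not the code from this answer. The field names, the third record, and the choice to let a Null contribute no indicator at all are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical records modeled on the sample table; None stands for Null.
# The third record is made up purely to have a clearly different person.
records = [
    {"first": "Harry", "middle": "Jay", "last": "Potter", "dob": (1, 1, 1993),
     "license": "A1234567", "sex": 1, "address": "Hogwarts"},
    {"first": "Harry", "middle": "J", "last": "Potter", "dob": (1, 1, 1993),
     "license": None, "sex": 1, "address": "Hagwarts"},
    {"first": "Ron", "middle": None, "last": "Weasley", "dob": (3, 1, 1980),
     "license": "B7654321", "sex": 1, "address": "The Burrow"},
]


def one_hot(record):
    """Return the set of (field, value) indicators that are 'on'.

    Assumption: a missing value (None) switches on no indicator.
    """
    return {(field, value) for field, value in record.items() if value is not None}


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets of indicators."""
    union = a | b
    if not union:
        return 0.0  # two empty records are treated as identical
    return 1.0 - len(a & b) / len(union)


encoded = [one_hot(r) for r in records]
for i, j in combinations(range(len(records)), 2):
    print(i, j, round(jaccard_distance(encoded[i], encoded[j]), 3))
```

On this toy data the two Harry records share 4 of 9 indicators, so their distance is 5/9, which is smaller than either distance to the third record. Note that exact-match indicators treat 'Hogwarts' and 'Hagwarts', or 'Jay' and 'J', as completely different values, so misspellings still need separate handling.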

Has QUIT--Anony-Mousse · Accepted Answer · June 18, 2018, 20:03

Gaussian mixture modeling is not appropriate to use on text.

Text just is not Gaussian.

Choose appropriate distribution assumptions instead.
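
One possible reading of this advice, offered only as an illustration and not necessarily what the answer had in mind, is to stop modeling the raw text and instead model whether each field agrees between a pair of records. The sketch below, in Python, fits a two-component mixture of independent Bernoulli variables with EM, in the spirit of Fellegi-Sunter style record linkage. The function name, the starting values, and the toy agreement vectors are all hypothetical. Missing comparisons (None) are simply skipped in the likelihood.

```python
def fit_bernoulli_mixture(vectors, n_iter=50, eps=1e-6):
    """EM for a two-component mixture of independent Bernoulli fields.

    vectors: one list per record pair, with 1 (field agrees),
             0 (field disagrees) or None (field missing in either record).
    Returns (p_match, m, u, resp) where m[j] = P(agree on j | same person),
    u[j] = P(agree on j | different people), and resp holds the posterior
    match probabilities from the final E-step.
    """
    n_fields = len(vectors[0])
    p_match = 0.5
    m = [0.9] * n_fields  # assumed starting guess, not estimated from data
    u = [0.1] * n_fields  # assumed starting guess, not estimated from data

    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    def likelihood(vec, probs):
        out = 1.0
        for x, p in zip(vec, probs):
            if x is None:
                continue  # a missing comparison carries no evidence
            out *= p if x == 1 else 1.0 - p
        return out

    def e_step():
        resp = []
        for vec in vectors:
            a = p_match * likelihood(vec, m)
            b = (1.0 - p_match) * likelihood(vec, u)
            resp.append(a / (a + b))
        return resp

    for _ in range(n_iter):
        resp = e_step()
        # M-step: weighted agreement rates, counting only observed fields.
        p_match = clamp(sum(resp) / len(resp))
        for j in range(n_fields):
            num_m = den_m = num_u = den_u = 0.0
            for vec, r in zip(vectors, resp):
                if vec[j] is None:
                    continue
                num_m += r * vec[j]
                den_m += r
                num_u += (1.0 - r) * vec[j]
                den_u += 1.0 - r
            if den_m > 0:
                m[j] = clamp(num_m / den_m)
            if den_u > 0:
                u[j] = clamp(num_u / den_u)
    return p_match, m, u, e_step()


# Hypothetical agreement vectors for (first name, last name, DOB).
pairs = [
    [1, 1, 1], [1, 1, None], [1, 1, 1], [1, 0, 1],
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, None], [1, 0, 0], [0, 0, 0],
]
p_match, m, u, resp = fit_bernoulli_mixture(pairs)
print(round(p_match, 3), [round(r, 3) for r in resp])
```

The posterior probabilities score record pairs, not records, so a further step (for example grouping records connected by high-probability pairs) would still be needed to obtain clusters. Agreement could also be defined with a string-similarity threshold instead of exact equality to tolerate misspellings.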
