Clustering with hierarchical data dependencies

I am currently looking into how to cluster data with hierarchical dependencies. As an example, I would like to cluster cities so that cities whose inhabitants have similar characteristics end up in the same group. As input data, I have characteristics such as the age, weight, height and sex of the inhabitants. Each city is therefore modeled by a vector of counts:

 ______________                                          _      _  
                 number of people aged 20 years old     |  x_1   | 
                 number of people aged 21 years old     |  x_2   | 
    age                                                 |        | 
                                                        |        | 
                                                        |        | 
 ______________  number of people aged 79 years old     |  x_k   | 
                 number of people of weight of 55kg     |        | 
                 number of people of weight of 56kg     |        | 
                                                        |        | 
    weight                                              |        | 
                number of people of weight of 100kg     |        | 
 ______________ number of people of weight of 111kg     |        | 
                number of people of height of 1.55m     |        | 
                number of people of height of 1.56m     |        | 
    height                                              |        | 
                                                        |        | 
                number of people of height of 2.02m     |        | 
 ______________ number of people of height of 2.03m     |        | 
    sex         number of male inhabitants              |        | 
 ______________ number of female inhabitants            |_ x_n  _| 
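
For concreteness, here is a minimal sketch of how such a vector could be assembled from per-inhabitant records. The bin ranges are read off the diagram above; the function name `city_vector`, the tuple layout, the choice to store heights in centimetres, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

# Bin definitions taken from the ranges shown in the diagram.
AGES = range(20, 80)        # 20 .. 79 years
WEIGHTS = range(55, 112)    # 55 .. 111 kg
HEIGHTS = range(155, 204)   # 1.55 .. 2.03 m, stored as centimetres (assumption)
SEXES = ("male", "female")


def city_vector(inhabitants):
    """Concatenate one histogram per attribute into a single count vector.

    `inhabitants` is a list of (age, weight_kg, height_cm, sex) tuples
    (a hypothetical record layout).
    """
    age_counts = Counter(p[0] for p in inhabitants)
    weight_counts = Counter(p[1] for p in inhabitants)
    height_counts = Counter(p[2] for p in inhabitants)
    sex_counts = Counter(p[3] for p in inhabitants)
    # A Counter returns 0 for values that never occur, so empty bins are fine.
    return (
        [age_counts[a] for a in AGES]
        + [weight_counts[w] for w in WEIGHTS]
        + [height_counts[h] for h in HEIGHTS]
        + [sex_counts[s] for s in SEXES]
    )


# Hypothetical toy city with three inhabitants.
toy_city = [(20, 55, 155, "female"), (21, 80, 180, "male"), (20, 80, 203, "male")]
vec = city_vector(toy_city)
```

Each attribute occupies its own contiguous block of the vector, which is exactly the block structure the question is concerned about.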

If I use k-means on this vector, the input dimensions are not independent: there is a strong correlation between neighbouring ages, neighbouring heights, and so on. Moreover, it seems illogical to me to spread a single variable, such as age, over many separate dimensions.

I'm not sure whether there are methods designed for this kind of problem, or whether it is just a matter of formulating the data differently.

Topic unsupervised-learning clustering machine-learning



Your data is currently organized as counts, so you'll need a distance metric that is designed for count data. One example is the chi-square distance.
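
Several distances go by the name "chi-square", and it is not stated here which one is meant. One common symmetric form, often used to compare histograms, is written out below for two cities with count vectors $x$ and $y$. Converting the counts to proportions first, so that city size does not dominate, is an assumption of this sketch; terms with $p_i + q_i = 0$ are taken to be zero.

$$
d_{\chi^2}(p, q) = \frac{1}{2} \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{p_i + q_i},
\qquad p_i = \frac{x_i}{\sum_{j} x_j}, \quad q_i = \frac{y_i}{\sum_{j} y_j}
$$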

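After picking a distance metric, you can pick a clustering algorithm that accepts it. Standard k-means is tied to Euclidean distance, so an algorithm that works from pairwise distances, such as hierarchical clustering, is a more natural fit.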
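
As a rough illustration of these two steps, the sketch below pairs a symmetric chi-square distance with a naive average-linkage agglomerative clustering, written in plain Python. The specific distance variant, the normalization to proportions, the choice of average linkage, the function names, and the toy data are all assumptions. In practice a library implementation of hierarchical clustering would be a better choice than this quadratic-memory loop.

```python
from itertools import combinations


def chi2_distance(x, y):
    """Symmetric chi-square distance between two count vectors.

    Counts are converted to proportions first (an assumption of this
    sketch), so two cities with the same mix but different sizes are at
    distance 0. Both vectors must contain at least one non-zero count.
    """
    sx, sy = sum(x), sum(y)
    total = 0.0
    for a, b in zip(x, y):
        p, q = a / sx, b / sy
        if p + q > 0:  # skip bins that are empty in both cities
            total += (p - q) ** 2 / (p + q)
    return 0.5 * total


def average_linkage(vectors, n_clusters, dist=chi2_distance):
    """Naive agglomerative clustering; returns a list of index lists."""
    clusters = [[i] for i in range(len(vectors))]
    # Cache all pairwise distances between individual points.
    d = {
        frozenset((i, j)): dist(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(d[frozenset((i, j))] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: a < b, so index a is unaffected
    return clusters


# Hypothetical toy data: four cities, three age bins each.
cities = [
    [90, 10, 0],    # young city
    [900, 100, 0],  # same mix as city 0, ten times larger
    [0, 20, 80],    # old city
    [5, 15, 80],    # old city
]
result = average_linkage(cities, n_clusters=2)
```

On this toy input the two young cities end up in one cluster and the two old cities in the other, regardless of the difference in population size.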
