Find changes in variables between two states

I have a dataframe like this:

dframe <- structure(list(c(60, 91, 377, 419, 893, 905), c(-0.6647, -0.0275000000000001, 
-0.6311, 0.1328, -0.4559, -1.0208), c(-1.6964, -1.3851, -1.1428, 
-1.4191, -1.2979, -1.441), c(4.1104, 2.998, 3.4623, 1.9545, 3.5166, 
3.9912), c(-1.6663, -1.0789, -1.6608, -1.0137, -1.4022, -1.6189
), c(0.902, 0.5417, 0.2651, -0.4998, 0.72, 1.0902), c(0.061, 
-0.1321, -0.6613, -0.9655, -0.3879, -0.3222), c(0.6573, -1.8156, 
-1.1072, -1.6147, -1.7412, -0.8048), c(-1.6561, 3.3495, 3.1694, 
4.7327, 3.7275, 3.0135), c(0.2499, -1.5437, -1.3843, -1.8279, 
-1.487, -1.133), c(1.1265, 0.2224, 0.5074, 0.9983, 0.4906, 0.3672
), structure(c(3, 1, 3, 1, 1, 3), label = "TwoStep Cluster Number", labels = c(`Outlier Cluster` = -1), class = "haven_labelled"), 
    structure(c(2, 3, 1, 3, 3, 1), label = "TwoStep Cluster Number", labels = c(`Outlier Cluster` = -1), class = "haven_labelled")), .Names = c("id", 
"colA", "colB", "colC", "colD", "colE", "colA_new", "colB_new", 
"colC_new", "colD_new", "colE_new", NA, NA), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

dframe

   id    colA    colB   colC    colD    colE colA_new colB_new colC_new colD_new colE_new NA NA
1  60 -0.6647 -1.6964 4.1104 -1.6663  0.9020   0.0610   0.6573  -1.6561   0.2499   1.1265  3  2
2  91 -0.0275 -1.3851 2.9980 -1.0789  0.5417  -0.1321  -1.8156   3.3495  -1.5437   0.2224  1  3
3 377 -0.6311 -1.1428 3.4623 -1.6608  0.2651  -0.6613  -1.1072   3.1694  -1.3843   0.5074  3  1
4 419  0.1328 -1.4191 1.9545 -1.0137 -0.4998  -0.9655  -1.6147   4.7327  -1.8279   0.9983  1  3
5 893 -0.4559 -1.2979 3.5166 -1.4022  0.7200  -0.3879  -1.7412   3.7275  -1.4870   0.4906  1  3
6 905 -1.0208 -1.4410 3.9912 -1.6189  1.0902  -0.3222  -0.8048   3.0135  -1.1330   0.3672  3  1

id is unique. For every pair of variables (colA and colA_new, colB and colB_new, etc.), I want to find how the scores change between the original column and the new column.

How can I model it?

Topic market-basket-analysis markov-process statistics

Category Data Science


Before doing anything more advanced, I would check the correlation of each old value / new value pair. If a pair shows a linear relationship, or some other clear correlation, the change is easy to model. If it does not, and the "new" value of each column depends on the other columns in a nontrivial way, you may need more complicated methods.
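As a minimal sketch of that first check, the R snippet below computes the correlation and the per-row change for each old/new pair. It uses only the colA and colB pairs copied from the question, and it assumes every "new" column is named with a `_new` suffix. The names `pair_cor` and `changes` are introduced here for illustration. With six rows the correlations are only a rough indication.

```r
# Subset of the question's data: two old/new column pairs.
dframe <- data.frame(
  id       = c(60, 91, 377, 419, 893, 905),
  colA     = c(-0.6647, -0.0275, -0.6311, 0.1328, -0.4559, -1.0208),
  colB     = c(-1.6964, -1.3851, -1.1428, -1.4191, -1.2979, -1.4410),
  colA_new = c(0.0610, -0.1321, -0.6613, -0.9655, -0.3879, -0.3222),
  colB_new = c(0.6573, -1.8156, -1.1072, -1.6147, -1.7412, -0.8048)
)

# Assumption: each "new" column is the old column name plus "_new".
new_cols <- grep("_new$", names(dframe), value = TRUE)
old_cols <- sub("_new$", "", new_cols)

# Pearson correlation between each old column and its "_new" counterpart.
pair_cor <- mapply(function(old, new) cor(dframe[[old]], dframe[[new]]),
                   old_cols, new_cols)
print(pair_cor)

# Per-row change (new minus old) for each pair.
changes <- dframe[new_cols] - dframe[old_cols]
names(changes) <- paste0(old_cols, "_change")
print(cbind(id = dframe$id, changes))
```

A scatter plot of each pair, for example `plot(dframe$colA, dframe$colA_new)`, is a useful companion to the correlation, since a single number can hide a nonlinear pattern.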

Such a problem is called multivariate regression analysis, and if you have a large enough dataset, you can try using a neural network. I would use the old values of each variable as the inputs and the new ones as the outputs. A fully connected neural network, with its size depending on the size of your dataset, could be a good first step. To be honest, though, I don't have much experience with such models and don't know of any tutorials available online. Another option is to train a separate regression model for each output. In that setting, you can look at how each output correlates with the different inputs and perhaps eliminate the useless ones.
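To illustrate the regression route without a neural network, here is a hedged sketch using base R's `lm()`. It swaps in a plain linear model, which is an assumption about the form of the relationship, and it again uses only the colA and colB pairs from the question. Six rows are far too few for a trustworthy fit, so treat this as a template for a larger dataset. The names `fit` and `fit_A` are illustrative.

```r
# Subset of the question's data: two old/new column pairs.
dframe <- data.frame(
  colA     = c(-0.6647, -0.0275, -0.6311, 0.1328, -0.4559, -1.0208),
  colB     = c(-1.6964, -1.3851, -1.1428, -1.4191, -1.2979, -1.4410),
  colA_new = c(0.0610, -0.1321, -0.6613, -0.9655, -0.3879, -0.3222),
  colB_new = c(0.6573, -1.8156, -1.1072, -1.6147, -1.7412, -0.8048)
)

# Multivariate linear regression: cbind() on the left-hand side fits
# both "_new" outputs against all old columns in one call.
fit <- lm(cbind(colA_new, colB_new) ~ colA + colB, data = dframe)
print(coef(fit))  # one column of coefficients per output

# Separate model for a single output; its coefficient table helps to
# spot inputs that contribute little and could be dropped.
fit_A <- lm(colA_new ~ colA + colB, data = dframe)
print(summary(fit_A)$coefficients)
```

For ordinary least squares, the joint fit and the separate per-output fits give the same coefficients, so the choice between them is mostly a matter of convenience.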

By the way, what are the last two columns titled NA? If they carry information, I would add them to the input layer.
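One way to find out what the unnamed columns hold is to read their attributes. The sketch below rebuilds the two columns exactly as they appear in the question's `dput()` output and prints the stored label. The names `cluster_1` and `cluster_2` are made up here, since the original columns have no names.

```r
# The two unnamed columns, copied from the question's dput() output.
cluster_1 <- structure(c(3, 1, 3, 1, 1, 3),
                       label = "TwoStep Cluster Number",
                       labels = c(`Outlier Cluster` = -1),
                       class = "haven_labelled")
cluster_2 <- structure(c(2, 3, 1, 3, 3, 1),
                       label = "TwoStep Cluster Number",
                       labels = c(`Outlier Cluster` = -1),
                       class = "haven_labelled")

# The "label" attribute describes what the column contains.
print(attr(cluster_1, "label"))

# Cross-tabulate the two columns; as.numeric() drops the attributes.
tab <- table(first = as.numeric(cluster_1), second = as.numeric(cluster_2))
print(tab)
```

Giving the columns real names, for example with `names(dframe)[12:13] <- c("cluster_1", "cluster_2")`, would make them easier to use as model inputs.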
