How to do target encoding when data has repeated rows?
How can I do encoding for a category when data has repeated rows? Can I do target encoding? Or Is there another encoding I can use?
I want to figure how to include a categorical variable in a model to predict a numerical variable Y. Because I am working with some legislative data, my challenge is my category code is over 4000 unique values, those values that cannot be easily grouped(*), and they can have repeats. In fact, anecdotally I know that both that code is a strong predictor of Y and repeats in a code per row will have a stronger impact on Y.
My attempt at a solution: In fact, I tried to do this myself thinking that maybe something as simply as mean target times the count of each code (Table B * Table C). However, I am not sure if this would be valid type of encoding.
A = sample legislative data to predict ethics variable Y, the values have been anonymized
+-----+-----------+------------+
| row | Category | Y variable |
+-----+-----------+------------+
| 2 | 15-11-2 | 370 |
| 2 | 15-11-106 | 370 |
| 2 | 15-5-44 | 370 |
| 2 | 15-11-41 | 370 |
| 3 | 15-7-23 | 363 |
| 3 | 15-7-23 | 363 |
| 3 | 15-11-37 | 363 |
| 3 | 15-11-37 | 363 |
| 3 | 15-5-21 | 363 |
| 6 | 15-11-106 | 287 |
| 6 | 15-5-21 | 287 |
| 7 | 16-5-70 | 931 |
| 7 | 15-5-23 | 931 |
| 7 | 15-5-41 | 931 |
| 9 | 15-11-7 | 336 |
| 9 | 15-2-30 | 336 |
| 9 | 40-11-76 | 336 |
| 9 | 15-2-2 | 336 |
| 9 | 15-11-131 | 336 |
| 9 | 15-11-131 | 336 |
| 10 | 15-2-30 | 350 |
| 10 | 15-11-131 | 350 |
+-----+-----------+------------+
B = The Mean Values for the unique rows
+-----------+-----------------------+
| Category | Average of Y variable |
+-----------+-----------------------+
| 15-11-106 | 334.714 |
| 15-11-2 | 298.994 |
| 15-11-37 | 251.931 |
| 15-11-41 | 399.210 |
| 15-11-7 | 288.021 |
| 15-2-2 | 234.533 |
| 15-2-30 | 303.127 |
| 15-5-21 | 317.999 |
| 15-5-23 | 252.109 |
| 15-5-41 | 300.882 |
| 15-5-44 | 499.405 |
| 15-5-70 | 277.306 |
| 15-7-23 | 244.976 |
| 40-11-76 | 261.938 |
| 15-11-131 | 319.799 |
+-----------+-----------------------+
Table C The counts of each category
+-----+-----------+-------------------+
| row | Category | Count of Category |
+-----+-----------+-------------------+
| 2 | 15-11-106 | 1 |
| 2 | 15-11-2 | 1 |
| 2 | 15-11-41 | 1 |
| 2 | 15-5-44 | 1 |
| 3 | 15-11-37 | 2 |
| 3 | 15-5-21 | 1 |
| 4 | 15-7-23 | 2 |
| 4 | 16-8-18 | 1 |
| 5 | 16-7-1 | 1 |
| 6 | 16-8-2 | 1 |
| 6 | 15-11-106 | 1 |
| 7 | 15-5-21 | 1 |
| 7 | 15-5-23 | 1 |
| 8 | 15-5-41 | 1 |
| 8 | 16-5-70 | 1 |
| 8 | 16-11-37 | 1 |
| 8 | 16-7-21 | 1 |
| 9 | 15-11-131 | 2 |
| 9 | 15-11-7 | 1 |
| 9 | 15-2-2 | 1 |
| 9 | 15-2-30 | 1 |
| 9 | 40-11-76 | 1 |
| 10 | 15-11-131 | 1 |
| 10 | 15-2-30 | 1 |
| 10 | 40-11-76 | 1 |
| | | |
+-----+-----------+-------------------+
Footnote: (*) What I mean by "the category code cannot be easily grouped" is that 15-11-2 and 15-11-1 cannot be group under 15-11 or 15. I have also tried clustering.
Topic categorical-encoding feature-engineering encoding
Category Data Science