How to do target encoding when data has repeated rows?

How can I do encoding for a category when data has repeated rows? Can I do target encoding? Or Is there another encoding I can use?

I want to figure how to include a categorical variable in a model to predict a numerical variable Y. Because I am working with some legislative data, my challenge is my category code is over 4000 unique values, those values that cannot be easily grouped(*), and they can have repeats. In fact, anecdotally I know that both that code is a strong predictor of Y and repeats in a code per row will have a stronger impact on Y.

My attempt at a solution: In fact, I tried to do this myself thinking that maybe something as simply as mean target times the count of each code (Table B * Table C). However, I am not sure if this would be valid type of encoding.

A = sample legislative data to predict ethics variable Y, the values have been anonymized

+-----+-----------+------------+
| row | Category  | Y variable |
+-----+-----------+------------+
|   2 | 15-11-2   |        370 |
|   2 | 15-11-106 |        370 |
|   2 | 15-5-44   |        370 |
|   2 | 15-11-41  |        370 |
|   3 | 15-7-23   |        363 |
|   3 | 15-7-23   |        363 |
|   3 | 15-11-37  |        363 |
|   3 | 15-11-37  |        363 |
|   3 | 15-5-21   |        363 |
|   6 | 15-11-106 |        287 |
|   6 | 15-5-21   |        287 |
|   7 | 16-5-70   |        931 |
|   7 | 15-5-23   |        931 |
|   7 | 15-5-41   |        931 |
|   9 | 15-11-7   |        336 |
|   9 | 15-2-30   |        336 |
|   9 | 40-11-76  |        336 |
|   9 | 15-2-2    |        336 |
|   9 | 15-11-131 |        336 |
|   9 | 15-11-131 |        336 |
|  10 | 15-2-30   |        350 |
|  10 | 15-11-131 |        350 |
+-----+-----------+------------+

B = The Mean Values for the unique rows

+-----------+-----------------------+
| Category  | Average of Y variable |
+-----------+-----------------------+
| 15-11-106 | 334.714               |
| 15-11-2   | 298.994               |
| 15-11-37  | 251.931               |
| 15-11-41  | 399.210               |
| 15-11-7   | 288.021               |
| 15-2-2    | 234.533               |
| 15-2-30   | 303.127               |
| 15-5-21   | 317.999               |
| 15-5-23   | 252.109               |
| 15-5-41   | 300.882               |
| 15-5-44   | 499.405               |
| 15-5-70   | 277.306               |
| 15-7-23   | 244.976               |
| 40-11-76  | 261.938               |
| 15-11-131 | 319.799               |
+-----------+-----------------------+

Table C The counts of each category

+-----+-----------+-------------------+
| row | Category  | Count of Category |
+-----+-----------+-------------------+
|   2 | 15-11-106 |                 1 |
|   2 | 15-11-2   |                 1 |
|   2 | 15-11-41  |                 1 |
|   2 | 15-5-44   |                 1 |
|   3 | 15-11-37  |                 2 |
|   3 | 15-5-21   |                 1 |
|   4 | 15-7-23   |                 2 |
|   4 | 16-8-18   |                 1 |
|   5 | 16-7-1    |                 1 |
|   6 | 16-8-2    |                 1 |
|   6 | 15-11-106 |                 1 |
|   7 | 15-5-21   |                 1 |
|   7 | 15-5-23   |                 1 |
|   8 | 15-5-41   |                 1 |
|   8 | 16-5-70   |                 1 |
|   8 | 16-11-37  |                 1 |
|   8 | 16-7-21   |                 1 |
|   9 | 15-11-131 |                 2 |
|   9 | 15-11-7   |                 1 |
|   9 | 15-2-2    |                 1 |
|   9 | 15-2-30   |                 1 |
|   9 | 40-11-76  |                 1 |
|  10 | 15-11-131 |                 1 |
|  10 | 15-2-30   |                 1 |
|  10 | 40-11-76  |                 1 |
|     |           |                   |
+-----+-----------+-------------------+

Footnote: (*) What I mean by "the category code cannot be easily grouped" is that 15-11-2 and 15-11-1 cannot be group under 15-11 or 15. I have also tried clustering.

Topic categorical-encoding feature-engineering encoding

Category Data Science


Yes, you can do target encoding. watch this video from @2:18 to @3:50. Hope this helps.

Update: Table B is enough to encode, you have mean values of y among each category. If a responce has like this [15-11-106 ,15-11-2], then you can replace those values with them [334.714, 298.994], now you can apply any statistic to get final encoding value, whether it could be mean/mode/median etc... of those values.

The mean target times the count of each code (Table B * Table C) is nothing but the sum of the target variable for each unique category(instead of mean of target variable you are going to choose sum of target variable). I don't think it's a good strategy.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.