Regression with a feature which has its own depth

Question

ffxx68

March 26, 2021, 15:52

I'm relatively new to ML and statistical analysis, and I'm working with a dataset structured like this:

person_id, pay, task, hours
1, 560, A, 3
1, 560, B, 5
2, 650, A, 7
3, 520, C, 6
3, 520, A, 2
...

Here, person 1 is paid a total of 560 for performing task A for 3 hours and task B for 5 hours; person 2 is paid 650 for 7 hours of task A; person 3 is paid 520 for 6 hours of task C and 2 hours of task A; and so on.

I'd like to fit a regression where X is the set of (task, hours) pairs and Y is the per-person pay, but I haven't yet figured out how to approach such a problem. I'd prefer a toolbox based on Python and scikit-learn, but a general discussion would be useful as well.

Equivalently, the data can be written in nested form:

person_id, pay, tasks
[1, 560, [[A, 3], [B, 5]]]
[2, 650, [[A, 7]]]
[3, 520, [[C, 6], [A, 2]]]
...

Here person_id is a high-cardinality identifier that can safely be ignored, and the Y label is pay. The tasks feature (X) has its own structure: each entry has a fixed shape (2 dimensions here, task and hours), but the number of entries per person is not predetermined, although it is limited (maybe 5-10 possible different tasks). I can't see how to fit such a structured feature into a regression scheme. Should I flatten tasks out by giving every possible value its own column (A hours, B hours, C hours, etc.), or is a more general approach possible?
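To make the flattening option concrete, here is a minimal sketch of what I have in mind, assuming pandas and scikit-learn. The variable names are mine, and LinearRegression is only a placeholder model. With the five sample rows above the fit is underdetermined, so the coefficients mean nothing; the point is only the shape of X.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The sample rows from the question, in long format (one row per person/task).
rows = [
    (1, 560, "A", 3),
    (1, 560, "B", 5),
    (2, 650, "A", 7),
    (3, 520, "C", 6),
    (3, 520, "A", 2),
]
df = pd.DataFrame(rows, columns=["person_id", "pay", "task", "hours"])

# One row per person, one column per task; a task a person never did gets 0 hours.
X = df.pivot_table(
    index="person_id", columns="task", values="hours", aggfunc="sum", fill_value=0
)

# Pay is repeated on every row of a person, so take the first value per person
# and align it with the rows of X.
y = df.groupby("person_id")["pay"].first().reindex(X.index)

# Placeholder model: each coefficient would be read as pay per hour of a task.
model = LinearRegression().fit(X, y)
print(X)
```

Each person becomes a single fixed-length row, whatever the number of tasks they performed.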

Moreover, this is a simplified version of my problem, kept small for the sake of the description. The real tasks structure could include more dimensions, in which case the number of flattened task features would quickly explode if every possible combination needed its own column.
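One way I could imagine limiting that growth, sketched below under a strong assumption, is to encode each task dimension separately, weight it by hours, and sum per person, so the column count grows with the sum of the category counts rather than their product. The `location` column is purely hypothetical and invented for this illustration, and the approach assumes the dimensions contribute to pay additively (no interaction between task and location), which may not hold.

```python
import pandas as pd

# Hypothetical extension of the sample data: "location" is an invented
# second task dimension, used only to illustrate the idea.
rows = [
    (1, 560, "A", "office", 3),
    (1, 560, "B", "remote", 5),
    (2, 650, "A", "remote", 7),
    (3, 520, "C", "office", 6),
    (3, 520, "A", "office", 2),
]
df = pd.DataFrame(rows, columns=["person_id", "pay", "task", "location", "hours"])

# One-hot encode each dimension on its own (3 task columns + 2 location columns),
# instead of one column per (task, location) combination (up to 3 * 2 = 6).
onehot = pd.get_dummies(df[["task", "location"]], dtype=float)

# Weight each indicator by the hours of that row, then sum per person.
weighted = onehot.mul(df["hours"], axis=0)
X = weighted.groupby(df["person_id"]).sum()
print(X)
```

The resulting X has one row per person and could be passed to any scikit-learn regressor in the same way as a plain per-task pivot.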

Any help welcome and appreciated! Thanks

Topic feature-construction regression feature-extraction feature-selection

Category Data Science
