Classification with feature not available at time of model creation

Question

EricA

December 27, 2021, 03:08

I have a problem statement: predict the probability of a task being solved from multiple features, e.g. when the task was created, the time needed to work on it, etc. A dummy snippet is below:

task_id  date_time_open  time_needed  day_created  time_created  status

aa       12/09/2019      20 hrs       Tuesday      3 pm          done
cc       17/10/2019      4 hrs        Friday       10 pm         not_done

I know I can run a classification model to identify the class. However, things get complicated when I add a time dimension, since the data set then gains an extra feature that strongly affects the status.

Suppose the tasks were scanned at 7 pm and a new feature, status_7pm, was added:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status

aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         done        not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

The tasks were then scanned again at a fixed interval of 1 hour, and each scan added a new feature to the data:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_8pm  status

aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         not_done    not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

In my understanding, the final prediction of status (resolved / unresolved, i.e. done / not_done) should be based on features including status_7pm and status_8pm.

What should the data structure for training such a classification model look like in order to generate a prediction at 9 pm for, say, sample task ff?

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status_8pm  status

ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done    not_done

I assume the classification model should be trained on all of status_1, status_2, ..., status_8pm to classify status. Or would the model be retrained in memory every time a new hourly status column is added?

Topic time prediction classification

Category Data Science

neal · Accepted Answer · March 13, 2021, 12:37

There are several different ways to formulate the "probability of solving a task" problem, as either classification or regression. Each formulation requires you to transform your data a little so that it matches what the model is trying to do. Here are some ideas:

Predicting how long a new task will take overall

For model training, you would train only on tasks that are completed and remove tasks that are still being worked on. The reasoning is that tasks that are still not_done do not, by definition, have a known total duration yet. Whenever a new task is created, you can then predict how long it will take overall.

For example, Task A is created at 6pm and you predict it needs 2 hours. Then, at 8pm, the system should expect it to be 'done' (or close!).

An alternative regression problem here is to predict how much more work a task needs, given how much it has been worked on already. In this case you are predicting things like "this task needs 15 more minutes" instead of "this task will take 45 minutes overall."
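The two regression targets described above could be derived roughly as follows. This is only a sketch under assumptions: the `Task` class and its `opened_hour` / `closed_hour` fields are hypothetical names (hours on some common clock), not columns from the question's data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    opened_hour: int            # hypothetical: hour the task was created
    closed_hour: Optional[int]  # hypothetical: None while still not_done


def total_duration_targets(tasks):
    """One row per completed task: (task_id, total hours it took)."""
    return [
        (t.task_id, t.closed_hour - t.opened_hour)
        for t in tasks
        if t.closed_hour is not None  # drop tasks still being worked on
    ]


def remaining_work_targets(tasks):
    """One row per completed task and elapsed hour:
    (task_id, hours_elapsed, hours_remaining)."""
    rows = []
    for t in tasks:
        if t.closed_hour is None:
            continue  # total duration unknown, so no target yet
        total = t.closed_hour - t.opened_hour
        for elapsed in range(total):
            rows.append((t.task_id, elapsed, total - elapsed))
    return rows


# Task A: created at 6pm (18), done at 8pm (20). Task B: still open.
tasks = [Task("A", 18, 20), Task("B", 19, None)]
print(total_duration_targets(tasks))
print(remaining_work_targets(tasks))
```

The first function gives the target for "how long will this task take overall", the second for "how much more work does it need given the time already spent".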

Predicting whether the task will be completed in the next hour

Since you are scanning tasks every hour, it sounds like you may be interested in knowing "will this task be finished by the time the next scan happens?" which means reformulating the problem into a binary classification one. For each task, you can have several training examples:

Task A was opened at 7pm, not finished at 8pm (0)
Task A was opened at 7pm, was finished at 9pm (1)
And so on

Then, at 8pm, you predict for all not_done tasks to see which ones should be done by 9pm.
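One possible way to build those per-scan training examples is sketched below. The function name, its arguments, and the integer-hour clock are assumptions made for illustration, not part of the original data.

```python
def next_scan_examples(task_id, opened_hour, closed_hour, last_scan_hour):
    """Build one training row per hourly scan at which the task was still open.

    Hours are integers on a hypothetical common clock; closed_hour is None
    while the task is still not_done. The label is 1 if the task is done by
    the following scan and 0 otherwise.
    """
    rows = []
    # Stop before last_scan_hour: the scan after it has not happened yet,
    # so its label would be unknown.
    for scan_hour in range(opened_hour, last_scan_hour):
        if closed_hour is not None and closed_hour <= scan_hour:
            break  # already done at this scan, nothing left to predict
        done_by_next = closed_hour is not None and closed_hour <= scan_hour + 1
        rows.append({
            "task_id": task_id,
            "hours_open": scan_hour - opened_hour,
            "label": int(done_by_next),
        })
    return rows


# Task A: opened at 7pm (19), finished at 9pm (21); scans observed up to 10pm.
task_a = next_scan_examples("A", opened_hour=19, closed_hour=21, last_scan_hour=22)
# Task B: opened at 8pm (20), still not_done at the last scan.
task_b = next_scan_examples("B", opened_hour=20, closed_hour=None, last_scan_hour=22)
for row in task_a + task_b:
    print(row)
```

With this layout the model is trained once on all rows. At prediction time you build the same kind of row for each task that is still not_done, so no retraining is needed when a new scan arrives.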

lcrmorin · Accepted Answer · February 16, 2020, 10:50

It seems the simplest way to go would be to build one row per task at each time step after 'time created', with 'status_n-1' and 'status_n' as columns. That lets you handle the notion of time in rows rather than in an ever-growing set of columns. You might also want to:

Deal with time relatively: instead of considering the status at a given hour of the day, you probably want to work with the status as a function of time since the task was created.
Deal with task weighting one way or another, since longer tasks will get more rows. You may add weights to your model, based on 1/(expected length).
Add some features: to me it is unclear which features you use for prediction. As is, you seem to be trying to predict time of completion based on start time and expected length. You won't learn much beyond which tasks take longer than expected, plus some daily effects. I think this can be achieved more efficiently and more clearly with simple statistics.
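The row-per-time-step layout with relative time and per-task weights might look like the sketch below. It assumes `statuses[k]` holds the status observed k hours after creation and uses the question's time_needed as the expected length; the function and field names are hypothetical.

```python
def rows_per_time_step(task_id, expected_hours, statuses):
    """Expand one task into one row per hourly step after creation.

    statuses[k] is assumed to be the status observed k hours after the
    task was created, so statuses[0] is the status at creation.
    """
    # Every row of a task shares the weight 1 / (expected length), so that
    # long tasks, which produce more rows, do not dominate training.
    weight = 1.0 / expected_hours
    rows = []
    for step in range(1, len(statuses)):
        rows.append({
            "task_id": task_id,
            "hours_since_creation": step,       # relative, not clock, time
            "status_prev": statuses[step - 1],  # status_n-1, used as a feature
            "status_now": statuses[step],       # status_n, the target
            "weight": weight,
        })
    return rows


rows = rows_per_time_step("cc", expected_hours=4,
                          statuses=["not_done", "not_done", "done"])
for row in rows:
    print(row)
```

The `weight` column could then be passed to whatever per-sample weighting your model supports, for example the `sample_weight` argument that many scikit-learn estimators accept in `fit`.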
