What's the best way to do classification based on two given datasets (annual data and daily data)?

I want to do binary classification based on two given datasets. One contains annual statistics for each company and has the label I need to predict, like this:

    company_id | year | annual sales | something else... | label
    0          | 2017 |  2000320     |   ...             |   0
    0          | 2018 |  4002530     |   ...             |   0
    0          | 2019 |  800050      |   ...             |   1
    1          | 2017 |  1024380     |   ...             |   1
    1          | 2018 |  7085521     |   ...             |   0
    1          | 2019 |  4525252     |   ...             |   0
    2          | 2017 |  25258770    |   ...             |   0
    2          | 2018 |  95402000    |   ...             |   1
    2          | 2019 |  8605200     |   ...             |   0

The other dataset contains daily statistics for each company:

    company_id | year | date(MM-dd) | daily sales  | something else... 
    0          | 2017 | 12-02       | 5210         |   ...             
    0          | 2017 | 12-03       | 3542         |   ...             
    0          | 2017 | 12-04       | 8575         |   ...             
    0          | 2017 | 12-06       | 1254         |   ...             
    0          | 2017 | ...         | ...          |   ...             
    0          | 2018 | 12-01       | 1352         |   ...   
    0          | 2018 | 12-02       | 4856         |   ... 
    0          | 2018 | ...         | ...          |   ...           
    0          | 2019 | 12-01       | 4583         |   ...  
    0          | 2019 | ...         | ...          |   ...            
    1          | 2017 | 12-01       | 5210         |   ...   
    1          | 2017 | ...         | ...          |   ...            
    1          | 2018 | 12-01       | 5202         |   ...   
    1          | 2018 | ...         | ...          |   ...           
    1          | 2019 | 12-01       | 8675         |   ...       
    1          | 2019 | ...         | ...          |   ...       

What's the best way to fully utilize both datasets to predict the label for each company?

Or is there a related topic I could read up on? I am happy to do some searching on that.

I am considering left-joining the annual dataset onto the daily dataset, but this results in many rows sharing the same values for the annual features, and the size of the dataset grows dramatically.
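
For concreteness, here is a minimal pandas sketch of the join I have in mind (the file names are placeholders, and the column names from the tables above are assumed to be normalized to snake_case):

    import pandas as pd

    # Placeholder file names; columns follow the tables above
    annual_df = pd.read_csv("annual.csv")  # company_id, year, annual_sales, ..., label
    daily_df = pd.read_csv("daily.csv")    # company_id, year, date, daily_sales, ...

    # Every daily row picks up its company's annual features, so each
    # (company_id, year) pair repeats the same annual values ~365 times
    joined = daily_df.merge(annual_df, on=["company_id", "year"], how="left")
    print(len(daily_df), "->", len(joined))  # row count stays at daily granularity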

Topics: finance, classification, data-mining

Category: Data Science


Your use case isn't entirely clear, but if I may make some assumptions:

  • The company_id in both tables refers to the same company (so ID 0 is the same company in both tables).
  • You already have a good handle on feature engineering and know which algorithm to use for your final classification, but I am going to treat it as if I were modeling it myself.

If I were you, I would use two parallel models (see the sketch after this list):

  • One model takes the annual input data and passes it through a stack of dense/feedforward layers (after the necessary preprocessing).
  • The second model takes an entire year of daily data for one company and preprocesses it into a form that can be fed into an LSTM/RNN/GRU. The daily data is essentially a sequence of time steps that add up to a year (like words in a sequence forming a sentence and, more importantly, a context). Then pass the RNN output through a dense layer or two.
  • Finally, concatenate the outputs of both branches (using add/mean/simple concatenation) and pass the result through a dense layer or two with a softmax (or, for a binary label, a single sigmoid unit) at the end. Because the branches are trained together, backpropagation will assign appropriate weights to each. Depending on the variety in the data, however, one branch might end up with very little weight, so you should also validate by training the two models separately and checking that the combined model outperforms each part on its own.
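
Here is a minimal sketch of this two-branch architecture using Keras (assuming TensorFlow is installed); the feature counts, layer sizes, and sequence length are illustrative assumptions, not values from the question:

    from tensorflow import keras
    from tensorflow.keras import layers

    N_ANNUAL_FEATURES = 8   # assumed number of annual features after preprocessing
    N_DAYS = 365            # daily sequences padded/truncated to a fixed length
    N_DAILY_FEATURES = 4    # assumed number of features per day

    # Branch 1: annual features through dense layers
    annual_in = keras.Input(shape=(N_ANNUAL_FEATURES,), name="annual")
    a = layers.Dense(32, activation="relu")(annual_in)
    a = layers.Dense(16, activation="relu")(a)

    # Branch 2: one year of daily data as a sequence through an LSTM
    daily_in = keras.Input(shape=(N_DAYS, N_DAILY_FEATURES), name="daily")
    d = layers.Masking(mask_value=0.0)(daily_in)  # skip zero-padded days
    d = layers.LSTM(32)(d)
    d = layers.Dense(16, activation="relu")(d)

    # Merge both branches and classify; a single sigmoid unit is used here,
    # which is equivalent to a 2-unit softmax for a binary label
    merged = layers.Concatenate()([a, d])
    x = layers.Dense(16, activation="relu")(merged)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[annual_in, daily_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()

When fitting, pass the two inputs as a list matching the order of the model's inputs, e.g. model.fit([X_annual, X_daily], y, ...), where X_daily has shape (n_samples, N_DAYS, N_DAILY_FEATURES).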

Since the daily dataset does not contain labels, you could aggregate the daily data to annual granularity and then do the join. This sounds like a (binary) classification problem, which can be tackled with methods such as logistic regression. You will, however, have to handle the missing values caused by the left join; one option is imputing them. Alternatively, just do an inner join if the missingness is random and without patterns (e.g. companies of a certain type aren't missing data more often than other types) and there is enough non-missing data left.
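
A hedged sketch of this aggregate-then-join approach with pandas and scikit-learn; the choice of aggregate features, the file names, and the snake_case column names are assumptions for illustration:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    annual_df = pd.read_csv("annual.csv")  # company_id, year, annual_sales, ..., label
    daily_df = pd.read_csv("daily.csv")    # company_id, year, date, daily_sales, ...

    # Collapse daily rows to one row per (company_id, year)
    daily_agg = (
        daily_df.groupby(["company_id", "year"])["daily_sales"]
        .agg(daily_mean="mean", daily_std="std", daily_max="max", n_days="count")
        .reset_index()
    )

    # Left join the aggregates onto the labelled annual data
    df = annual_df.merge(daily_agg, on=["company_id", "year"], how="left")

    # Median-impute (company_id, year) pairs that had no daily rows;
    # the feature columns are assumed numeric here
    feature_cols = [c for c in df.columns if c not in ("company_id", "year", "label")]
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df["label"], test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))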
