How to deal with attributes that can vary arbitrarily for each sample?

Question

Jardeson Barbosa

October 25, 2018, 20:09

Let's say we are trying to classify cars into five different categories. For this, we have a lot of samples described by color, brand, model, year of manufacture and so on. For instance, imagine something like this:

cars

+-----+-------+---------+-------------+-----+---------------------+
| id  | color | brand   | model       | ... | year of manufacture |
+-----+-------+---------+-------------+-----+---------------------+
| ... | ...   | ...     | ...         | ... | ...                 |
| 319 | Black | Ferrari | Dino 246 GT | ... | 1967                |
| 320 | Gray  | Ferrari | 250 GTE     | ... | 1960                |
| 321 | Red   | Ford    | Mustang     | ... | 1969                |
| 322 | Black | Jaguar  | E-Type      | ... | 1961                |
| ... | ...   | ...     | ...         | ... | ...                 |
+-----+-------+---------+-------------+-----+---------------------+

In this context, we also have additional attributes that we know should be relevant when classifying these cars. These attributes live in another database (or table) and describe the relevant historical changes made to each car. For simplicity, let's assume that each modification is associated with a code (there are also other related attributes that may be relevant, such as the date of the modification and its estimated cost).

cars_changes

+-----+--------+-------------+-----------------+------------+-------+
| id  | car_id | change_code | description     | date       | cost  |
+-----+--------+-------------+-----------------+------------+-------+
| ... | ...    | ...         | ...             | ...        | ...   |
| 17  | 319    | CLR-93AA    | New painting    | 2009-11-18 | 800   |
| 18  | 319    | ENG-77TS    | Change engine   | 2011-06-04 | 3,000 |
| 19  | 319    | GAS-19BV    | Add gas as fuel | 2016-02-23 | 1,739 |
| 17  | 319    | CLR-93AA    | New painting    | 2017-09-18 | 1,100 |
| 20  | 321    | CLR-92BD    | New painting    | 2012-03-17 | 930   |
| 21  | 321    | GAS-19BV    | Add gas fuel    | 2016-05-11 | 1,385 |
| ... | ...    | ...         | ...             | ...        | ...   |
+-----+--------+-------------+-----------------+------------+-------+

Note that:
- a car has not necessarily had any changes;
- there is no maximum number of changes that a car may have had;
- the same change (with the same code) can occur many times on the same car;
- the set of possible changes is a high-cardinality categorical variable (about two thousand distinct codes).
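To make these properties concrete, here is a small sketch in plain Python that uses only the rows visible in the cars_changes sample. The variable names (`car_ids`, `changes`, `codes_per_car`, `changes_per_car`) are illustrative assumptions, not part of any existing code.

```python
from collections import Counter, defaultdict

# Car ids visible in the cars sample.
car_ids = [319, 320, 321, 322]

# (car_id, change_code, date, cost) rows visible in the cars_changes sample.
changes = [
    (319, "CLR-93AA", "2009-11-18", 800),
    (319, "ENG-77TS", "2011-06-04", 3000),
    (319, "GAS-19BV", "2016-02-23", 1739),
    (319, "CLR-93AA", "2017-09-18", 1100),
    (321, "CLR-92BD", "2012-03-17", 930),
    (321, "GAS-19BV", "2016-05-11", 1385),
]

# Count how often each change code occurs for each car.
codes_per_car = defaultdict(Counter)
for car_id, code, _date, _cost in changes:
    codes_per_car[car_id][code] += 1

# Number of changes per car; cars without any change get 0.
changes_per_car = {
    cid: sum(codes_per_car.get(cid, Counter()).values()) for cid in car_ids
}

print(changes_per_car)
print(codes_per_car[319]["CLR-93AA"])
```

In this sample, cars 320 and 322 have no changes at all, car 319 has four, and the code CLR-93AA occurs twice on car 319, so the history of a car is a variable-length list rather than a fixed set of columns.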

Also, suppose there is already a classifier based on the first table, and our goal is to improve its classification using the information from the second table. Given that the most important information for this task is in the second table, how can we use it appropriately for this problem?
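To illustrate what "using the second table" could mean in practice, the sketch below shows one possible framing, not necessarily the best one: collapse each car's variable-length history into a fixed-length vector (number of changes, total cost, and a count per change code) that could be appended to the features from the first table. The helper `change_features` and the four-code `vocabulary` are hypothetical; the real vocabulary would have about two thousand codes, and this encoding ignores the order and dates of the changes.

```python
from collections import Counter

# Hypothetical tiny vocabulary; the question mentions about 2,000 codes.
vocabulary = ["CLR-92BD", "CLR-93AA", "ENG-77TS", "GAS-19BV"]

# (car_id, change_code, cost) rows taken from the cars_changes sample.
changes = [
    (319, "CLR-93AA", 800),
    (319, "ENG-77TS", 3000),
    (319, "GAS-19BV", 1739),
    (319, "CLR-93AA", 1100),
    (321, "CLR-92BD", 930),
    (321, "GAS-19BV", 1385),
]


def change_features(car_id, changes, vocabulary):
    """Return [n_changes, total_cost, count per code in vocabulary order]."""
    rows = [(code, cost) for cid, code, cost in changes if cid == car_id]
    counts = Counter(code for code, _ in rows)
    total_cost = sum(cost for _, cost in rows)
    return [len(rows), total_cost] + [counts[code] for code in vocabulary]


# One fixed-length vector per car, including cars with no changes.
features = {
    cid: change_features(cid, changes, vocabulary)
    for cid in (319, 320, 321, 322)
}

for cid, vec in features.items():
    print(cid, vec)
```

Every car gets a vector of the same length, including cars with no changes (all zeros). With the full vocabulary most entries would be zero, so a sparse representation would probably be needed.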

Topic spatial-transformer preprocessing

Category Data Science
