"ValueError: Data cardinality is ambiguous" in model ensemble with 2 different inputs

I am trying a simple model ensemble with 2 different input datasets and 1 output. I want to get predictions for one dataset and hope the model will extract some useful features from the second one. I get an error:

ValueError: Data cardinality is ambiguous:
  x sizes: 502, 1002
  y sizes: 502
Make sure all arrays contain the same number of samples.

But I want it to fit on the smaller dataset.

The architecture is like this:

Code:

import pandas as pd
from keras.models import Model
from keras.layers import Input, Dense, Flatten
from keras.layers.merge import concatenate
from keras.utils import plot_model

train_1d_X = pd.read_csv('train1d_x.csv').values
train_12h_X = pd.read_csv('train12h_x.csv').values
train_1d_y = pd.read_csv('train1d_y.csv').values
train_12h_y = pd.read_csv('train12h_y.csv').values

#model 1d
input_1d = Input(shape=train_1d_X.shape)
dense_1d_1 = Dense(16, activation='relu')(input_1d)

#model 12h
input_12h = Input(shape=train_12h_X.shape)
dense_12h_1 = Dense(16, activation='relu')(input_12h)

#merge
merge = concatenate([dense_1d_1, dense_12h_1], axis=1)

hidden1 = Dense(32, activation='relu')(merge)
output = Dense(1, activation='linear')(hidden1)

model = Model(inputs=[input_1d, input_12h], outputs=[output])

model.compile(loss='mse', optimizer='adam')
model.fit([train_1d_X, train_12h_X], train_1d_y, epochs=10, verbose=2)

Topic: ensemble, keras, tensorflow, regression, deep-learning

Category: Data Science


Some solutions are better worked out when we understand the underlying layers. In the field of NLP, which relies on recurrent neural networks, varying input sizes are ideally allowed. Even convolutional neural networks, which are commonly used for images, can allow varying input sizes (see the research discussion on dealing with varying inputs).

You have two choices: either downsize your input samples to a fixed size, or upsample your input samples so both inputs match. For upsizing, you can pad the sequences with zeros to ensure both inputs have a fixed size (the data cardinality issue can be resolved by using pad_sequences); a rough sketch of the padding idea follows.
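This is only an illustration with placeholder array names and sizes; Keras's pad_sequences serves a similar purpose for variable-length sequences. Here the smaller input is zero-padded along the sample axis so both inputs report the same cardinality:

import numpy as np

# Placeholder arrays standing in for the two inputs (502 vs. 1002 samples).
small = np.random.rand(502, 8)
large = np.random.rand(1002, 8)

# Zero-pad the smaller array along the sample axis so both arrays
# contain the same number of samples.
pad_rows = large.shape[0] - small.shape[0]
small_padded = np.pad(small, ((0, pad_rows), (0, 0)), mode='constant')

print(small_padded.shape, large.shape)  # (1002, 8) (1002, 8)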

For CNN models, the neural network graph for multiple inputs is as shown in the linked diagram: neural network graph for multiple inputs.

There is also a code sample for a multiple-input CNN, as mentioned; a rough sketch along those lines follows.
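This sketch is not the linked sample; it is a minimal two-input CNN in the Keras functional API (all shapes, layer sizes, and data are placeholders, and import paths may differ between Keras versions):

import numpy as np
from keras.models import Model
from keras.layers import Input, Conv1D, Flatten, Dense, concatenate

# Branch 1: its own input and a small Conv1D stack.
input_a = Input(shape=(64, 1))
flat_a = Flatten()(Conv1D(8, 3, activation='relu')(input_a))

# Branch 2: a second input with its own Conv1D stack.
input_b = Input(shape=(64, 1))
flat_b = Flatten()(Conv1D(8, 3, activation='relu')(input_b))

# Merge the branches and map to a single regression output.
merged = concatenate([flat_a, flat_b])
output = Dense(1, activation='linear')(merged)

model = Model(inputs=[input_a, input_b], outputs=output)
model.compile(loss='mse', optimizer='adam')

# Note: both inputs must still contain the same number of samples.
x_a = np.random.rand(100, 64, 1)
x_b = np.random.rand(100, 64, 1)
y = np.random.rand(100, 1)
model.fit([x_a, x_b], y, epochs=1, verbose=0)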

Do take a look at the links mentioned above for a better understanding, and make your call on the best approach to solving your problem.


I haven't used tensorflow much lately (more of a pytorch guy), but my interpretation of that error is that your X tensor has two dimensions and your y tensor only has 1, so it's not clear which dimension of X is supposed to align with y. Try adding an empty dimension to your labels and see if that fixes the issue:

import tensorflow as tf

train_1d_y = tf.expand_dims(train_1d_y, 1)

You cannot feed your network two inputs with different numbers of samples, and this also does not make sense.

You have 2 inputs with 502 and 1002 samples (you have said you also want to extract features from your second dataset). Let's say the batch size is 1 for the sake of simplicity, so the model takes one sample from each input at a time and moves it through the layers.


Problem:

Now, given that you have 502 and 1002 samples, the question is which of them should be selected as an input pair. For example, which sample in your second dataset is the first sample in your first dataset associated with?

Reason:

Creating input pairs is the reason the model expects inputs with the same number of samples: it assumes the first sample in your first dataset is associated with the first sample in your second dataset, and so on.


Solution:

So, you should take a subset of your second dataset such that each of its samples corresponds to the sample at the same position in your first dataset. Take care of your sample order: if you shuffle the first dataset, you should shuffle the second dataset in the same order (see the sketch below).
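A minimal NumPy sketch of that idea, using placeholder arrays and assuming the first 502 rows of the larger dataset already correspond one-to-one to the rows of the smaller one:

import numpy as np

# Placeholder arrays standing in for the two datasets and the labels.
train_1d_X = np.random.rand(502, 8)
train_12h_X = np.random.rand(1002, 8)
train_1d_y = np.random.rand(502, 1)

# Keep only as many samples of the larger dataset as the smaller one has,
# so both inputs have the same cardinality.
train_12h_X = train_12h_X[:len(train_1d_X)]

# Shuffle all arrays with the same permutation so the pairing
# between samples (and labels) is preserved.
perm = np.random.permutation(len(train_1d_X))
train_1d_X, train_12h_X, train_1d_y = train_1d_X[perm], train_12h_X[perm], train_1d_y[perm]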
