How to extract embeddings of categorical variables

Question

user62198

April 20, 2022, 20:06

I am a little confused about encoding categorical variables. There are other posts and blog posts on this topic, but none of them addresses the problem I am facing.

I have a dataset with mixed variables (i.e., numerical as well as categorical). Some of the categorical variables have a lot of categories (close to 100), so instead of one-hot encoding I am looking into using embeddings.

My goal is to extract the embeddings of the categorical variables, combine them with my numerical variables to form the design/training matrix, and then run an autoencoder on it (the reason for the autoencoder is to detect anomalous data).

I have put together a code snippet (see below), but what confuses me is how to extract the embeddings so that I can combine them with my numerical variables. I need to know because when new data shows up, I need the embeddings to run it through prediction and isolate any odd data point from its reconstruction error.

from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout
from keras.layers import Embedding

inputs = []
embeddings = []

# for categorical variables
for cat in cat_model_vars:
    inp = Input(shape=(1,))
    inputs.append(inp)
    emb = Embedding(cat_sizes[cat]+1, cat_embedding_size[cat], input_length=1, name=f'{cat}_embedding')(inp)
    emb = Reshape(target_shape=(cat_embedding_size[cat],))(emb)
    embeddings.append(emb)

# for continuous/numerical variables

inp_num = Input(shape=(len(num_model_vars),),name='cont_vars')
emb_num = Dense(10)(inp_num)
inputs.append(inp_num)
embeddings.append(emb_num)

# concatenate the embeddings, this is going to be input to our Autoencoder model
output = Concatenate()(embeddings)

#%%

# build a stacked autoencoder below, using `output` as both its input and its reconstruction target

en_x = Dense(50, activation = 'relu')(output)
en_x = Dense(32, activation = 'relu')(en_x)
en_x = Dense(16, activation = 'relu')(en_x)
en_x = Dense(4, activation = 'relu')(en_x)

de_x = Dense(16, activation = 'relu')(en_x)
de_x = Dense(32, activation = 'relu')(de_x)
de_x = Dense(50, activation = 'relu')(de_x)

output = Dense(25, activation = 'relu')(de_x)

stacked_ae_model = Model(inputs, output)
stacked_ae_model.compile(loss='mean_squared_error',optimizer='Adam',
                         metrics=['mse','mape'])

Need to extract the embeddings that are used as input to the stacked autoencoder (not done yet).

Now I will have to run the training and prediction. But I am not sure how to extract the embeddings so that I can use those for training.

Any help would be greatly appreciated.

Thanks and regards.

Topic data-science-model embeddings python

Category Data Science

mirik · Accepted Answer · July 23, 2021, 16:15

If you have a model with multiple inputs, you have to feed each input separately, for example by passing the dataframe to the model as a dictionary that maps each input name to its column of values. Split the dataframe into two dataframes, categorical and numerical.
Convert the categorical dataframe into a dictionary with train_df.to_dict('list') and merge the numerical dataframe into it as a single multidimensional feature named "cont_vars".
You should also name every input after its column:

for cat in cat_model_vars:
   inp = Input(shape=(1,), name=cat)

inp_num = Input(shape=(len(num_model_vars),),name='cont_vars')

Update dataframe (not tested):

# copy to avoid pandas' SettingWithCopyWarning when adding the new column
train_df2 = train_df[cat_model_vars].copy()
train_df2['cont_vars'] = train_df[num_model_vars].values.tolist()
train_df = train_df2

Example of how to convert pandas dataframes to tensors:

import tensorflow as tf

# BATCH_SIZE and labels are assumed to be defined elsewhere
test_data = tf.data.Dataset.from_tensor_slices(test_df.to_dict('list')).batch(BATCH_SIZE).cache()
train_data = tf.data.Dataset.from_tensor_slices((train_df.to_dict('list'), labels)).shuffle(10000).batch(BATCH_SIZE, drop_remainder=True).cache()
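
To make the shape of that dictionary input concrete, here is a small self-contained sketch with a made-up toy dataframe (the column names `color`, `brand`, `x1`, `x2` are hypothetical, not from the question). Each categorical column becomes its own key, and the numerical columns are packed into one 2-D array under `cont_vars`, matching the input names used above.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy data: two integer-coded categorical columns, two numeric columns.
df = pd.DataFrame({
    "color": [1, 4, 2, 0],
    "brand": [0, 2, 1, 1],
    "x1": [0.1, 0.2, 0.3, 0.4],
    "x2": [1.0, 2.0, 3.0, 4.0],
})
cat_model_vars = ["color", "brand"]
num_model_vars = ["x1", "x2"]

# One key per categorical input, holding that column's values.
features = df[cat_model_vars].to_dict("list")
# All numeric columns go under a single key as a (n_rows, n_numeric) array.
features["cont_vars"] = df[num_model_vars].to_numpy(dtype="float32")

# Slicing a dict yields dict-shaped elements; batching stacks them per key.
dataset = tf.data.Dataset.from_tensor_slices(features).batch(2)

first_batch = next(iter(dataset))
batch_shapes = {name: tuple(t.shape) for name, t in first_batch.items()}
print(batch_shapes)
```

Keras matches the dictionary keys against the `name` given to each `Input`, which is why the inputs need to be named after the columns.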

Finally, train and predict the model:

model.fit(train_data, epochs=100)
# test_data is already batched, so batch_size is not passed to predict
y_pred_test = model.predict(test_data)
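
As for extracting the embeddings themselves, here is a minimal sketch, assuming each `Embedding` layer and the `Concatenate` layer were given names. The toy columns (`color`, `brand`), their sizes, and the layer name `ae_input` are made up for illustration and are not from the question. It shows two options: reading a learned lookup table with `get_layer(...).get_weights()`, and building a sub-model that maps raw inputs to the concatenated vector that feeds the autoencoder. The weights here are untrained; in practice you would run this after `fit`.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical toy setup (names and sizes are illustrative only).
cat_sizes = {"color": 5, "brand": 3}           # number of codes per column
cat_embedding_size = {"color": 2, "brand": 2}  # embedding width per column
num_model_vars = ["x1", "x2", "x3"]

inputs, embeddings = [], []
for cat, size in cat_sizes.items():
    inp = layers.Input(shape=(1,), dtype="int32", name=cat)
    emb = layers.Embedding(size + 1, cat_embedding_size[cat],
                           name=f"{cat}_embedding")(inp)
    emb = layers.Reshape((cat_embedding_size[cat],))(emb)
    inputs.append(inp)
    embeddings.append(emb)

inp_num = layers.Input(shape=(len(num_model_vars),), name="cont_vars")
inputs.append(inp_num)
embeddings.append(inp_num)

# Naming the concatenation makes it easy to look up later.
concat = layers.Concatenate(name="ae_input")(embeddings)
hidden = layers.Dense(4, activation="relu")(concat)
output = layers.Dense(7)(hidden)  # 2 + 2 + 3 = width of the concatenated vector
model = keras.Model(inputs, output)

# Option 1: the lookup table of one categorical variable.
# Row k is the embedding vector of category code k.
color_table = model.get_layer("color_embedding").get_weights()[0]

# Option 2: a sub-model that returns the concatenated autoencoder input.
extractor = keras.Model(model.inputs, model.get_layer("ae_input").output)

batch = {
    "color": np.array([[1], [4]], dtype="int32"),
    "brand": np.array([[0], [2]], dtype="int32"),
    "cont_vars": np.random.default_rng(0).random((2, 3)).astype("float32"),
}
features = extractor.predict(batch, verbose=0)
print(color_table.shape, features.shape)
```

For new data, the full model applies the embedding lookup internally, so the extractor is only needed when you want the concatenated vectors themselves, for example as the reference when computing the reconstruction error.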
