How to extract embeddings of categorical variables
I am a little confused about encoding categorical variables. There are other posts and blog posts on this topic, but none of them addresses the problem I am facing.
I have a dataset with mixed variables (i.e., numerical as well as categorical). Some of the categorical variables have a lot of categories (close to 100), so instead of one-hot encoding them, I am looking into using embeddings.
My goal is to learn embeddings for the categorical variables, extract them, and combine them with my numerical variables to form the design/training matrix. I then want to run an autoencoder on that matrix in order to detect anomalous data.
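To make the assembly step concrete, here is a minimal NumPy sketch. It assumes the lookup tables have already been learned and pulled out of the model (in Keras, the weight matrix of an `Embedding` layer has one row per category code). The variable names `city` and `product`, the table sizes, and the helper `build_design_matrix` are all hypothetical and only serve as illustration.

```python
import numpy as np

# Hypothetical lookup tables: one row per category code.
# Random values stand in for weights that would normally be learned.
rng = np.random.default_rng(0)
embedding_tables = {
    "city": rng.normal(size=(4, 2)),     # 4 categories -> 2-dim vectors
    "product": rng.normal(size=(6, 3)),  # 6 categories -> 3-dim vectors
}


def build_design_matrix(cat_codes, numeric, tables):
    """Look up each integer category code in its table and append the numeric columns."""
    parts = [tables[name][codes] for name, codes in cat_codes.items()]
    parts.append(numeric)
    return np.hstack(parts)


# Three toy rows: integer-encoded categories plus two numerical columns.
cat_codes = {"city": np.array([0, 3, 1]), "product": np.array([5, 2, 2])}
numeric = np.array([[0.1, 10.0], [0.4, 12.5], [0.9, 7.0]])

X = build_design_matrix(cat_codes, numeric, embedding_tables)
print(X.shape)  # (3, 7): 2 + 3 embedding columns, then 2 numeric columns
```

The same function can be reused on new data, as long as the new rows are integer-encoded with the same category-to-code mapping that was used during training.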
I have put together a code snippet (see below), but what confuses me is how to extract the embeddings so that I can put them together with my numerical variables. I need this because when new data shows up, I have to map it to the same embeddings before running prediction, so that I can isolate odd data points from their reconstruction error.
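For reference, the reconstruction-error step mentioned above could look roughly like the following sketch. It assumes the autoencoder's inputs and reconstructions are available as NumPy arrays. The function names and the 0.99 quantile threshold are arbitrary assumptions for illustration, not a recommended setting.

```python
import numpy as np


def reconstruction_errors(X, X_hat):
    """Mean squared error per row between inputs and their reconstructions."""
    return np.mean((X - X_hat) ** 2, axis=1)


def flag_anomalies(errors, train_errors, quantile=0.99):
    """Flag rows whose error exceeds a chosen quantile of the training errors."""
    threshold = np.quantile(train_errors, quantile)
    return errors > threshold, threshold


# Toy numbers: errors seen on training data, then two new rows.
train_errors = np.array([0.01, 0.02, 0.015, 0.03, 0.02])
X_new = np.array([[1.0, 2.0], [1.0, 2.0]])
X_new_hat = np.array([[1.1, 1.9], [3.0, 0.0]])  # second row is reconstructed badly

errors = reconstruction_errors(X_new, X_new_hat)
flags, threshold = flag_anomalies(errors, train_errors)
print(flags)  # the second row is flagged
```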
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout
from keras.layers import Embedding  # public import path for the Embedding layer
inputs = []
embeddings = []
# for categorical variables
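for cat in cat_model_vars:
    # one integer-coded input per categorical variable
    inp = Input(shape=(1,))
    inputs.append(inp)
    # the layer is named after the variable so it can be looked up later
    emb = Embedding(cat_sizes[cat] + 1, cat_embedding_size[cat], input_length=1, name=f'{cat}_embedding')(inp)
    # drop the length-1 sequence axis: (batch, 1, size) -> (batch, size)
    emb = Reshape(target_shape=(cat_embedding_size[cat],))(emb)
    embeddings.append(emb)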
# for continuous/numerical variables
inp_num = Input(shape=(len(num_model_vars),),name='cont_vars')
emb_num = Dense(10)(inp_num)
inputs.append(inp_num)
embeddings.append(emb_num)
# concatenate the embeddings, this is going to be input to our Autoencoder model
output = Concatenate()(embeddings)
#%%
# build a stacked autoencoder below, using the concatenated output as both its input and its target
en_x = Dense(50, activation = 'relu')(output)
en_x = Dense(32, activation = 'relu')(en_x)
en_x = Dense(16, activation = 'relu')(en_x)
en_x = Dense(4, activation = 'relu')(en_x)
de_x = Dense(16, activation = 'relu')(en_x)
de_x = Dense(32, activation = 'relu')(de_x)
de_x = Dense(50, activation = 'relu')(de_x)
output = Dense(25, activation = 'relu')(de_x)
stacked_ae_model = Model(inputs, output)
stacked_ae_model.compile(loss='mean_squared_error', optimizer='Adam',
                         metrics=['mse', 'mape'])
# TODO: extract the embeddings that are used as input to the stacked autoencoder (not done yet)
Now I have to run the training and prediction, but I am not sure how to extract the embeddings so that I can use them for training.
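One possible way to get at those values, sketched here on a deliberately tiny and untrained toy model rather than the model above: give the `Concatenate` layer a name and build a second `Model` that stops at that layer. Because both models share the same layers, the second one returns exactly the vector that feeds the autoencoder. The names `color`, `color_embedding`, `ae_input` and `feature_model` are made up for this sketch, and `Flatten` is used in place of `Reshape` purely for brevity.

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Embedding, Flatten

# Toy setup (hypothetical): one categorical variable with codes 0..5, two numerical columns.
cat_in = Input(shape=(1,), dtype="int32", name="color")
num_in = Input(shape=(2,), name="cont_vars")
cat_emb = Flatten()(Embedding(5 + 1, 3, name="color_embedding")(cat_in))
merged = Concatenate(name="ae_input")([cat_emb, num_in])  # width 3 + 2 = 5

# Tiny autoencoder body whose output has the same width as `merged`.
hidden = Dense(2, activation="relu")(merged)
recon = Dense(5, activation="linear")(hidden)
full_model = Model([cat_in, num_in], recon)

# Sub-model sharing the same layers and weights: raw inputs -> concatenated vector.
feature_model = Model(full_model.inputs, full_model.get_layer("ae_input").output)

codes = np.array([[0], [3], [5]])
nums = np.random.rand(3, 2).astype("float32")
features = feature_model.predict([codes, nums], verbose=0)  # shape (3, 5)

# The lookup table itself: one row per category code.
table = full_model.get_layer("color_embedding").get_weights()[0]  # shape (6, 3)
```

In this sketch the weights are still at their random initial values. After training `full_model`, the same `feature_model` and `table` would reflect the learned embeddings, and new data could be passed through `feature_model.predict` in the same way.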
Any help would be greatly appreciated.
Thanks and regards.