Anomaly detection without any knowledge about structure
My code needs to handle structured data whose structure I know little about at development time. I know the samples follow a schema that can be nested, and that the leaves contain basic primitives such as floats, integers and categoricals. I will have access to the schema at training time. I want to train a model that can detect anomalies or, more directly, decide whether or not a sample comes from the original data distribution.
One way would be to just flatten the data and train an isolation forest. My intuition is that this breaks down with highly correlated features. Another approach would be to build a neural network architecture that mirrors some of the structure and primitive types of the given schema. You could train an autoencoder and measure reconstruction error, which I feel would handle correlated features a lot better because the encoder-decoder network captures these correlations. A possible issue is how to define this reconstruction error when the features have different primitive types.
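To make the reconstruction-error issue concrete, here is a minimal sketch of one possible mixed-type error: squared error in standardized units for numeric features, plus the negative log-likelihood of the observed category for categorical features. The function name, the schema format (a dict of column name to dtype string) and the equal weighting of the two kinds of terms are all assumptions for illustration; in practice the weighting would probably need tuning.

```python
import math


def mixed_reconstruction_error(sample, reconstruction, schema, stats):
    """Sum a per-feature loss chosen by dtype (hypothetical helper).

    sample:         dict feature -> observed value
    reconstruction: dict feature -> predicted float (numeric features) or
                    dict category -> probability (categorical features)
    schema:         dict feature -> "float" | "int" | "categorical"
    stats:          dict numeric feature -> (mean, std) from training data
    """
    total = 0.0
    for name, dtype in schema.items():
        if dtype in ("float", "int"):
            _, std = stats[name]
            scale = std if std > 0 else 1.0
            # Squared error in standardized units, so columns with
            # different ranges contribute on a comparable scale.
            total += ((sample[name] - reconstruction[name]) / scale) ** 2
        elif dtype == "categorical":
            # Negative log-likelihood of the observed category; the floor
            # avoids log(0) for categories given zero probability.
            p = reconstruction[name].get(sample[name], 0.0)
            total += -math.log(max(p, 1e-12))
        else:
            raise ValueError(f"unsupported dtype: {dtype}")
    return total


# Made-up example values, purely for illustration.
schema = {"amount": "float", "count": "int", "country": "categorical"}
stats = {"amount": (100.0, 20.0), "count": (5.0, 2.0)}
sample = {"amount": 110.0, "count": 5, "country": "NL"}

close = {"amount": 108.0, "count": 5.0, "country": {"NL": 0.9, "DE": 0.1}}
far = {"amount": 60.0, "count": 9.0, "country": {"NL": 0.05, "DE": 0.95}}

err_close = mixed_reconstruction_error(sample, close, schema, stats)
err_far = mixed_reconstruction_error(sample, far, schema, stats)
print(err_close < err_far)
```

A sample would then be flagged as anomalous when its error exceeds some threshold chosen on held-out training data, for example a high percentile of the training errors.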
Does anyone have any ideas on this or possibly papers that handle related challenges?
EDIT:
I will clarify some things, sorry for being vague. When designing the algorithm I don't know the structure of the dataset, but by the time the model is trained the structure is known. The fact that the schema could be nested is not so relevant; let's ignore that part and assume we have a flat, tabular dataset where we know the dtypes. These dtypes are likely to be mixed, so I'm looking for a way to reason about anomalies in such a dataset.
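As a rough illustration of "structure unknown at design time, known at training time", the sketch below builds an encoder from the schema when training starts: numeric columns are standardized and categorical columns are one-hot encoded. The names `fit_encoder` and `encode` and the dtype strings are hypothetical, and mapping an unseen category to an all-zero block is just one possible choice.

```python
import math


def fit_encoder(rows, schema):
    """Learn per-column encoding parameters from flat training rows.

    rows:   list of dicts, one per sample
    schema: dict column -> "float" | "int" | "categorical"
    """
    params = {}
    for col, dtype in schema.items():
        values = [row[col] for row in rows]
        if dtype in ("float", "int"):
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Fall back to 1.0 for constant columns to avoid division by zero.
            params[col] = ("numeric", mean, math.sqrt(var) or 1.0)
        elif dtype == "categorical":
            params[col] = ("categorical", sorted(set(values)))
        else:
            raise ValueError(f"unsupported dtype: {dtype}")
    return params


def encode(row, params):
    """Turn one row into a flat list of floats using the fitted parameters."""
    vec = []
    for col, spec in params.items():
        if spec[0] == "numeric":
            _, mean, std = spec
            vec.append((row[col] - mean) / std)
        else:
            _, categories = spec
            # One-hot block; a category unseen in training gives all zeros.
            vec.extend(1.0 if row[col] == c else 0.0 for c in categories)
    return vec


# Made-up example data, purely for illustration.
schema = {"price": "float", "qty": "int", "color": "categorical"}
rows = [
    {"price": 1.0, "qty": 2, "color": "red"},
    {"price": 3.0, "qty": 4, "color": "blue"},
]
params = fit_encoder(rows, schema)
encoded = encode(rows[0], params)
unseen = encode({"price": 2.0, "qty": 3, "color": "green"}, params)
print(encoded)
```

The resulting vectors could be passed to any detector that expects numeric input, such as an isolation forest or an autoencoder.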
Topic anomaly
Category Data Science