Data preprocessing framework/library alternatives

I am currently working on some Python machine learning projects that will soon be deployed to production. Our team therefore wants to do this properly, following MLOps principles.

Specifically, I am researching the data preprocessing step and how to implement it so that it is robust against training-serving skew. I've considered TensorFlow Transform, which runs the defined preprocessing steps once and generates a graph artifact that can be reused after training. A downside of using it, however, would be the need to stick with TensorFlow data formats. Is there any good alternative?

The only similar frameworks/libraries I've found so far are Keras preprocessing layers and sklearn preprocessing pipelines. I have searched many sites and blogs but still haven't found a discussion of this kind.
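To make the sklearn option concrete, here is a minimal sketch of how a fitted pipeline could play the role of a reusable preprocessing artifact. The toy data, the choice of `StandardScaler` and `LogisticRegression`, and the file name `pipeline.joblib` are assumptions made purely for illustration; this is not a recommendation of a specific setup.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data, invented for this illustration.
X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y_train = np.array([0, 0, 1, 1])

# Preprocessing and model live in one object, so the scaler statistics
# learned during training are the same ones applied at serving time.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Training side: persist the fitted pipeline as a single artifact.
artifact_path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, artifact_path)

# Serving side: load the artifact and predict on raw, unscaled inputs.
served = joblib.load(artifact_path)
X_new = np.array([[2.5, 205.0]])
prediction = served.predict(X_new)
```

Because joblib artifacts are pickle-based, they should only be loaded from trusted sources, and ideally with the same library versions that were used to create them.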

Topic mlops tensorflow preprocessing python machine-learning

Category Data Science


It is crucial to measure, as accurately as possible, the final result achieved with preprocessing.

Therefore, there are many different options, depending on the datasets and on the algorithms/models.

For instance, some models need data normalization, while others need a logarithmic or other transformation to improve the final results. Sometimes missing values may call for a range of uncertainty. Sometimes NA values can be replaced by outlier values. Categorical data can be transformed into binary or scaled values.
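As a rough sketch of how such per-column choices can be expressed, the example below uses scikit-learn's `ColumnTransformer`. The column names and values are hypothetical, and median imputation is used here only as a simple stand-in for handling missing values; the right choices depend on the dataset and model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are invented.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [30_000.0, 52_000.0, 1_000_000.0, 41_000.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

preprocess = ColumnTransformer(
    [
        # Missing values: fill with the median, then normalize.
        ("age", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["age"]),
        # Skewed values: apply log(1 + x), then normalize.
        ("income", Pipeline([
            ("log", FunctionTransformer(np.log1p)),
            ("scale", StandardScaler()),
        ]), ["income"]),
        # Categorical data: one binary column per category; categories
        # unseen during fit become all zeros instead of raising an error.
        ("city", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

# Fit on training data once; reuse the fitted object for new data.
features = preprocess.fit_transform(df)

new_rows = pd.DataFrame({"age": [np.nan], "income": [45_000.0], "city": ["Rome"]})
new_features = preprocess.transform(new_rows)
```

The fitted `preprocess` object can be placed in front of a model inside a `Pipeline`, so that the same transformations are applied during training and prediction.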

There are plenty of books on data preprocessing, but they are mainly general-purpose.

Consequently, I recommend focusing on the algorithms/models you want to apply and adapting the preprocessing techniques accordingly. If you give more information about the algorithms or models, it will be possible to give you more hints about the related preprocessing techniques.
