ML for data processing. What are the options?

I am currently working on improving a stage of a data processing pipeline. The source data has a large number of fields and is normalized into a simpler entity. As a result, in many cases a destination field's value may be copied from any of several input fields, depending on the context.

My idea was to predict a binary source-destination matrix that associates each possible source field with each possible destination field.
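To make the target concrete, here is a minimal sketch of how one record's mapping could be encoded as such a matrix, with one row per source field and one column per destination field. The field names and both helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical field names, for illustration only.
SOURCE_FIELDS = ["cust_name", "full_name", "email_addr", "phone_1"]
DEST_FIELDS = ["name", "email", "phone"]


def mapping_to_matrix(mapping, sources=SOURCE_FIELDS, dests=DEST_FIELDS):
    """Encode {destination: source} as a binary matrix (rows = sources)."""
    matrix = [[0] * len(dests) for _ in sources]
    for dest, source in mapping.items():
        matrix[sources.index(source)][dests.index(dest)] = 1
    return matrix


def matrix_to_mapping(matrix, sources=SOURCE_FIELDS, dests=DEST_FIELDS):
    """Decode a binary matrix back into {destination: source}."""
    mapping = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value:
                mapping[dests[j]] = sources[i]
    return mapping


# One record where "name" is copied from "full_name" and "email" from
# "email_addr"; "phone" has no source in this record.
example = mapping_to_matrix({"name": "full_name", "email": "email_addr"})
```

A model would then be trained to output one such matrix per input record, given whatever context features are available.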

I was wondering: has this problem been tackled before? Is there anything in the scientific literature worth noting?

Topic matrix data preprocessing data-cleaning machine-learning

Category Data Science


What you are describing could be modeled as a bipartite graph: one set of nodes (the source fields) connects to another set of nodes (the destination fields). It then becomes a bipartite graph matching problem.
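As a rough illustration of the matching view, the sketch below finds a maximum bipartite matching with the classic augmenting-path approach. It assumes each source field feeds at most one destination field, which may not hold in the pipeline described in the question; the candidate lists and field names are hypothetical.

```python
def max_bipartite_matching(candidates):
    """candidates: {destination: [source fields that could feed it]}.

    Returns {destination: source}, using each source at most once.
    """
    owner = {}  # source field -> destination field currently matched to it

    def try_assign(dest, visited):
        for source in candidates[dest]:
            if source in visited:
                continue
            visited.add(source)
            # Take a free source, or re-route its current owner elsewhere.
            if source not in owner or try_assign(owner[source], visited):
                owner[source] = dest
                return True
        return False

    for dest in candidates:
        try_assign(dest, set())
    return {dest: source for source, dest in owner.items()}


# Hypothetical candidate edges between destination and source fields.
candidates = {
    "name": ["full_name", "cust_name"],
    "contact": ["full_name"],
    "email": ["email_addr"],
}
matching = max_bipartite_matching(candidates)
```

Here "name" first takes "full_name", then gives it up to "contact" and falls back to "cust_name", so all three destinations end up matched. If the edges carry scores, a weighted assignment method would be the natural next step.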

As the two node sets grow, scoring every source-destination pair quickly becomes too expensive; approximate nearest neighbor search might then be useful for narrowing down the candidates.
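One common way to do approximate nearest neighbor search is locality-sensitive hashing with random hyperplanes. The sketch below is only an illustration of that idea: it assumes each field has already been turned into a numeric vector somehow (the vectors and field names here are made up), and it compares a query only against fields that land in the same hash bucket.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplane normals; the seed is fixed for repeatability."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(p * v for p, v in zip(plane, vector)) >= 0 for plane in planes
    )


def build_index(vectors, planes):
    """Group field names by signature so similar vectors tend to collide."""
    buckets = defaultdict(list)
    for name, vector in vectors.items():
        buckets[signature(vector, planes)].append(name)
    return buckets


def candidates_for(query, buckets, planes):
    """Return only the fields sharing the query's bucket (approximate)."""
    return buckets.get(signature(query, planes), [])


# Made-up 3-dimensional vectors standing in for source field representations.
source_vectors = {
    "full_name": [0.9, 0.1, 0.0],
    "cust_name": [0.8, 0.2, 0.1],
    "email_addr": [0.0, 0.1, 0.9],
}
planes = make_hyperplanes(dim=3, n_planes=4)
buckets = build_index(source_vectors, planes)
shortlist = candidates_for([0.9, 0.1, 0.0], buckets, planes)
```

The shortlist can miss true neighbors that fall in a different bucket, which is the trade-off for not scanning every field; using several hash tables is the usual way to reduce such misses.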
