Correct approach to usage of class labels in cell imaging data

As part of a group project at university, we are given a series of videos of cell cultures over a 24-hour period. A number of these cells (the "knockout" cells) have had a particular gene removed, which is often absent or mutated in malignancy. We are using a blob detection algorithm to identify the cell centers and radii, and further processing to match cells frame-to-frame and build up individual paths, which we then use to calculate various features. We are aiming to train a binary classifier that can identify the potentially cancerous cells given one of these path-derived feature vectors.
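For concreteness, here is a minimal sketch of the detection and matching steps, assuming Python with scikit-image's Laplacian-of-Gaussian blob detector; the function names, parameters, and the greedy matching below are illustrative rather than our actual pipeline.

```python
# Illustrative sketch only; the real pipeline's detector and tracker may differ.
import numpy as np
from skimage.feature import blob_log

def detect_cells(frame):
    """Return an array of (row, col, radius) blobs found in a 2D float frame."""
    blobs = blob_log(frame, min_sigma=3, max_sigma=15, threshold=0.05)
    blobs[:, 2] *= np.sqrt(2)  # blob_log returns sigma; blob radius is roughly sigma * sqrt(2)
    return blobs

def match_frames(prev_blobs, next_blobs, max_dist=20.0):
    """Greedy nearest-centre matching between consecutive frames,
    a simplification of the frame-to-frame path building."""
    matches, used = [], set()
    for i, (r, c, _) in enumerate(prev_blobs):
        dists = [np.hypot(r - r2, c - c2) if j not in used else np.inf
                 for j, (r2, c2, _) in enumerate(next_blobs)]
        if dists and min(dists) <= max_dist:
            j = int(np.argmin(dists))
            used.add(j)
            matches.append((i, j))  # blob i in frame t continues as blob j in frame t+1
    return matches
```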

Our training data consists of immunofluorescence-tagged videos, where the knockout cells are tagged red and the normal or "control" cells tagged green. These are our "labels" so to speak, and we use our blob detection algorithm twice, once on the red channel and once on the green channel, to separate the two classes in the data. Our test data will consist of grey-scale videos where this tagging isn't present.
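As a sketch of how the labelled training examples are currently produced (reusing the hypothetical `detect_cells` helper above, and assuming the frames are RGB arrays with red in channel 0 and green in channel 1):

```python
def detect_labelled_cells(colour_frame):
    """Run the detector once per colour channel; the channel provides the class label."""
    red, green = colour_frame[..., 0], colour_frame[..., 1]
    knockout_blobs = detect_cells(red)    # red-tagged   -> class 1 ("knockout")
    control_blobs = detect_cells(green)   # green-tagged -> class 0 ("control")
    return ([(blob, 1) for blob in knockout_blobs]
            + [(blob, 0) for blob in control_blobs])
```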

My teammate has claimed that separating the red and green channels in the training data is "unfair" since, as well as discriminating between the two classes, the tagging sometimes makes it easier to distinguish between individual cells. This is because cells can overlap significantly with one another, making the individual centers somewhat ambiguous. Where a knockout cell and a control cell overlap, separating the two channels removes this ambiguity. Since the separation of the colour channels is only possible in the training data, it may be argued that our "labels" are not only identifying which class an individual cell belongs to, but are also implicitly adding extra information to the training data by sometimes removing this overlap ambiguity.

Here's an illustration of what I'm talking about:

If we take the red and green channels separately, the blob detection algorithm can easily recover the two cells; otherwise, it sees just one. In this way, the colour labels implicitly provide extra information to the blob detection.
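The effect can be reproduced on synthetic data (again assuming scikit-image's LoG detector, which may differ from our actual detector): two heavily overlapping Gaussian "cells" are typically found as a single blob in the combined image, but as one blob each when the channels are examined separately.

```python
import numpy as np
from skimage.feature import blob_log

# Two synthetic Gaussian "cells" whose centres are closer than their radii.
yy, xx = np.mgrid[0:100, 0:100]
gauss = lambda cy, cx, s: 0.5 * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * s ** 2))

red = gauss(50, 46, 8)     # knockout cell (red channel)
green = gauss(50, 54, 8)   # control cell (green channel), overlapping it
combined = red + green     # roughly what a grey-scale frame would contain

kwargs = dict(min_sigma=5, max_sigma=15, threshold=0.1)
print(len(blob_log(combined, **kwargs)))                             # typically 1 merged blob
print(len(blob_log(red, **kwargs)), len(blob_log(green, **kwargs)))  # 1 and 1
```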

My question is: is it okay to exploit this extra information provided by the labels by separating the two channels as part of the training process?

EDIT: Added illustration.

Tags: training, labels, classifier, binary



Giving the algorithm features during training that are not present during testing is commonly called [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)). Data leakage results in high performance during training and greatly reduced performance during testing. Since the goal of machine learning is usually high performance at test time, you should only train the model with features that will also be present during testing.

In your case, train the model with only grey-scale videos.
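One possible way to follow this advice while still getting labels out of the fluorescence tagging, sketched under the same assumptions as above (scikit-image and the hypothetical `detect_cells` helper): run the detector on a grey-scale version of each training frame, exactly as at test time, and consult the colour channels only afterwards to assign a class label to each detected blob.

```python
import numpy as np
from skimage.color import rgb2gray

def detect_and_label(colour_frame):
    """Detect on grey-scale (no colour information reaches the detector),
    then use the colour only to label each detection."""
    grey = rgb2gray(colour_frame)      # same kind of input the test-time detector sees
    blobs = detect_cells(grey)
    labelled = []
    for row, col, radius in blobs:
        r = colour_frame[int(row), int(col), 0]
        g = colour_frame[int(row), int(col), 1]
        labelled.append((np.array([row, col, radius]), 1 if r > g else 0))  # 1 = knockout
    return labelled
```

Any overlap ambiguity then affects training and testing in the same way, so the classifier is never trained on detections it could not obtain at test time.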
