Organizing datasets, dataset version control, MLOps and other questions

I am currently looking into how to structure data and workflows for my end-to-end ML pipeline.

I have several requirements, and ideally I am looking for one platform that covers all of them:

  1. Visualize and organize multiple datasets, ideally with something like the Kaggle dataset web interface
  2. Explore datasets to quickly spot errors in the data, biases in annotations, etc.
  3. Annotate images and potentially point clouds
  4. Commenting functionality for all features
  5. Keep track of who annotated what and on what date
  6. Dataset version control to keep track of changes to annotations, newly added images, etc., with options for tags like production or release
  7. Be able to log and tag specific trained models (production, etc.)
  8. Be able to organize training and prediction experiments
  9. Have traceability on which training runs or models used which dataset version
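
To make the dataset versioning and traceability requirements concrete, here is a minimal sketch of the kind of record I have in mind. It is not taken from any of the platforms mentioned here: the function names `dataset_version` and `run_record` are hypothetical, and it assumes the dataset (images plus annotation files) lives in a single local folder.

```python
import hashlib
from pathlib import Path


def dataset_version(root: Path) -> str:
    """Return a short content hash over every file under root.

    Any change to an image or an annotation file yields a different hash,
    so the hash can serve as a dataset version identifier.
    """
    digest = hashlib.sha256()
    # Sort paths so the hash does not depend on filesystem ordering.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def run_record(run_name: str, dataset_root: Path, tags: list[str]) -> dict:
    """Build a record tying one training run to the dataset version it used."""
    return {
        "run": run_name,
        "dataset_version": dataset_version(dataset_root),
        "tags": tags,  # e.g. ["production"] or ["release"]
    }
```

Storing one such record per training run would already answer "which model was trained on which data", but it covers none of the visualization, annotation, or commenting needs, which is why I am looking for a platform rather than a script.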

I have been able to find individual platforms that solve parts of these problems, but not a single end-to-end platform.

Visualizing datasets and annotating: Remo

Pros:

  • Can visualize multiple datasets
  • Annotation is possible in the web interface

Cons:

  • Data has to be uploaded to the platform instead of just being linked from where it is already stored
  • No commenting or discussions about annotations
  • Not possible to version-control data

Image annotation and data versioning: V7Labs

Pros:

  • Possible to annotate images
  • Possible to comment
  • Possible to track and version datasets

Cons:

  • Pricey

Logging experiments, etc.: Weights and Biases

Pros:

  • Easy way of keeping track of experiments
  • Tracks datasets and matches them with experiments and trained models

Cons:

  • Pricey

It is going to be very expensive if I have to subscribe to multiple platforms, and time-consuming to keep the tracking of data and annotations connected between the annotation tool and an experiment-tracking MLOps tool like WandB.

Is there a magical tool that I have missed?
