Loading models from an external source

I have a 500 MB model that I am committing to Git. That is really bad practice, since the repository will grow huge as newer model versions are added, and it will slow down every build and deployment.

I thought of using a separate repository that contains all the models and then fetching them at runtime.

Does anybody know a clean approach or alternative?

Topic data-engineering machine-learning-model python

Category Data Science


I have also run into this problem several times, so I created modelstore, an open-source Python library that aims to simplify best practices around versioning, storing, and downloading models across different cloud storage providers.

The modelstore library unifies versioning and saving an ML model into a single upload() command, and also provides a download() function to retrieve that model from storage. Here is (broadly) what it looks like; full documentation is available:

import os

from modelstore import ModelStore

# To save the model in S3
model_store = ModelStore.from_aws_s3(os.environ["AWS_BUCKET_NAME"])

model, optim = train()  # Replace with your code

# Here's a PyTorch example - the library currently supports 9 different ML frameworks
model_store.pytorch.upload(
    "my-model-domain",
    model=model,
    optimizer=optim,
)

The upload() command creates a tar archive containing your model and some metadata about it, and uploads it to a specific path in your storage.
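The exact archive layout is internal to modelstore, but the general idea of bundling a model file with a metadata JSON can be sketched with Python's standard tarfile module (the file names and metadata fields here are illustrative, not modelstore's actual format):

```python
import json
import os
import tarfile
import tempfile


def archive_model(model_file, metadata, out_path):
    """Bundle a model file and a metadata JSON into one gzipped tar archive.

    Illustrative only - not modelstore's actual internal format.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Write the metadata next to the model before archiving
        meta_path = os.path.join(tmp, "metadata.json")
        with open(meta_path, "w") as f:
            json.dump(metadata, f)

        with tarfile.open(out_path, "w:gz") as tar:
            tar.add(model_file, arcname=os.path.basename(model_file))
            tar.add(meta_path, arcname="metadata.json")
    return out_path
```

Keeping the metadata inside the archive means a downloaded model is self-describing: you can inspect which framework and version produced it without a separate lookup.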

You can later download the latest model by using:

model_path = model_store.download(
    local_path="/path/to/download/to",  # Replace with a path
    domain="my-model-domain",
)

Note: there are alternatives such as MLflow's artifact storage, which is great if you can set up and maintain a tracking server.


In most cases, you would use an object-storage service such as Amazon S3 or Google Cloud Storage, which are designed for storing and retrieving large files.

You would then update your code to retrieve the model directly from that storage. Whether the download needs to be done on every run or only once (caching the model locally for future runs) should be decided based on your specific needs.
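A minimal sketch of that download-once pattern, assuming S3 and boto3 (the bucket name, key, and paths below are hypothetical, and the S3 client is passed in so the caching logic stays testable):

```python
import os


def fetch_model(bucket, key, local_path, s3_client):
    """Download the model from S3 only if it isn't already cached locally."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        # boto3 S3 clients expose download_file(bucket, key, filename)
        s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage (assumes AWS credentials are configured in the environment):
# import boto3
# model_path = fetch_model("my-model-bucket", "models/v3/model.pt",
#                          "/tmp/models/model.pt", boto3.client("s3"))
```

The existence check makes repeated runs cheap; if you publish new model versions, include the version in the key and local path so a new version triggers a fresh download.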
