VGG16 (PyTorch) training issues using a very large dataset vs. a smaller dataset
I've constructed a simple VGG16 model from the original Simonyan & Zisserman paper for use in a DBT (Digital Breast Tomosynthesis) data challenge. As a starting point, I chose to generate the 3x244x244 image patches in a pre-processing step (loading the DICOMs on the fly was quite slow) and save them to disk. For a 3-label classification setup with ~2000 image patches for training, this produces decent results (~80% accuracy). As part of refinement, more patches were added along with several data augmentations, resulting in ~25x the amount of training data. The model's output was also changed to binary classification (no cancer vs. cancer).
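The dataset class is roughly like the sketch below (simplified; the directory layout, `.pt` file format, and label-in-filename convention here are just illustrative of how I read the pre-saved patches back from disk):

```python
import glob
import torch
from torch.utils.data import Dataset

class PatchDataset(Dataset):
    """Loads pre-processed 3x244x244 patches that were saved to disk as .pt tensors."""

    def __init__(self, patch_dir):
        # One file per patch; the label is encoded in the filename (illustrative convention).
        self.paths = sorted(glob.glob(f"{patch_dir}/*.pt"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        patch = torch.load(path)                   # tensor of shape (3, 244, 244)
        label = 1.0 if "_cancer" in path else 0.0  # binary label: cancer vs. no cancer
        return patch, torch.tensor(label)
```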
When training on this much larger dataset, I expected possibly worse accuracy (it gets ~4% training accuracy), but the real issue is that training never seems to get through the data in a single epoch. I'm using a DataLoader with a train/validation/test split of (0.7, 0.2, 0.1). What is an effective strategy for handling this amount of training data? Is the extremely large dataset simply churning through the first epoch for hours? I'm using Google Colab Pro, so I have to be mindful of my GPU time.
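For reference, a simplified sketch of how I split and load the data (I've used `random_split` here for illustration; the batch size, worker count, and logging interval are placeholders, not my exact settings):

```python
import torch
from torch.utils.data import DataLoader, random_split

# full_dataset is the PatchDataset sketched above (~25x the original ~2000 patches).
full_dataset = PatchDataset("patches/")
n = len(full_dataset)
n_train, n_val = int(0.7 * n), int(0.2 * n)
n_test = n - n_train - n_val
train_set, val_set, test_set = random_split(
    full_dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)

# Multiple workers and pinned memory to keep the Colab GPU fed; printing every
# few hundred batches is how I check whether the first epoch is still churning.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          num_workers=2, pin_memory=True)

for i, (patches, labels) in enumerate(train_loader):
    if i % 100 == 0:
        print(f"batch {i}/{len(train_loader)}")
    # ... forward pass, loss, backward pass, optimizer step ...
```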