Using large CNNs (e.g., ResNet) in convolutional autoencoders for image representation learning
I am confused about which CNNs are generally used inside autoencoder architectures for learning image representations. Is it more common to use a large existing network such as ResNet or VGG as the encoder, or do most people write their own smaller networks? What are the pros and cons of each approach?

If a large network like ResNet or VGG is used as the encoder, does the decoder have to mirror the encoder's steps in reverse, or can a simpler decoding network be used?

I am having a hard time finding papers in which people describe which networks they use inside their autoencoders. Any help would be greatly appreciated. Thank you!
Topic vgg16 representation cnn autoencoder computer-vision
Category Data Science