Information bottleneck and deep neural network

I learned about the information bottleneck view of deep learning. But in a nutshell, what does this tell us?

I don't see what role depth plays in this approach, as long as it is larger than 2 or 3. Is there a rigorous theory, or just some hypotheses and heuristic explanations of deep neural nets?

I saw the author's talk on YouTube. But, probably due to my own ignorance, I don't really get what the main point and its implications are. I can see a lot of explanations of the graphs in the video, but honestly, I don't get it.

Any comments, suggestions, or opinions will be much appreciated.

Topic information-theory deep-learning neural-network

Category Data Science


Current statistical learning theory treats a learning algorithm as a "black box", analysing its inputs versus its outputs. Moreover, it is often criticised for a lack of non-vacuous bounds (despite the non-vacuous bound proved by Dziugaite and Roy).

Information Bottleneck Theory (IBT) brings an information-theoretic perspective to the learning problem, allowing us to analyse what happens during training using information measures. When you do that, IBT predicts a transition between two distinct phases of training: a fitting phase, where the model rapidly fits the data; and a compression phase, where the model forgets irrelevant information about the dataset, which helps it avoid overfitting.
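To make "information measures" concrete, here is a minimal sketch (my own illustration, not Tishby's code) of the binning-style mutual-information estimate typically used to place a layer's activations on the information plane. The toy "representations" `t_noisy` and `t_compressed` are hypothetical stand-ins for an early, high-information layer and a later, compressed layer.

```python
import numpy as np

def mutual_information(x, y, bins=30):
    """Binning-based estimate of I(X;Y) in bits from paired samples.

    Discretises both variables into a 2-D histogram and applies
    I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                    # joint distribution
    px = pxy.sum(axis=1, keepdims=True)          # marginal of x
    py = pxy.sum(axis=0, keepdims=True)          # marginal of y
    nz = pxy > 0                                 # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# A "fitting-phase" representation: x plus heavy noise keeps little info.
t_noisy = x + rng.normal(scale=2.0, size=x.size)

# A "compression-phase" representation: a 1-bit code of x.
t_compressed = np.sign(x)

print(mutual_information(x, t_noisy))       # small, positive
print(mutual_information(x, t_compressed))  # close to 1 bit, never above it
```

Note that binning estimators like this are biased and sensitive to the bin count; much of the Saxe et al. criticism mentioned below turns on exactly how such estimates behave for continuous activations.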

It is not a proven, rigorous theory, as Tishby himself admits (see his talk at the DeepMath 2020 conference), and this lack of rigour has provoked a great deal of criticism (see Saxe et al., "On the Information Bottleneck Theory of Deep Learning"; they are not alone in their criticism). A more rigorous approach that can be seen as within the realm of IBT is taken by Stefano Soatto and Alessandro Achille's research group in California (see "Emergence of Invariance and Disentangling in Deep Representations").

Still, it is an emerging field where rigour is being built. The interesting aspect of IBT is that it gives new meaning (a narrative) to what happens during training. In this narrative there is no generalisation paradox (see Zhang, Bengio, et al., "Understanding Deep Learning Requires Rethinking Generalization"), because what matters is not the number of parameters in a model but the amount of information it retains about the training dataset.
