Intuitively, why do Non-monotonic Activations Work?

The swish/SiLU activation is very popular, and many would argue it has dethroned ReLU. However, it is non-monotonic, which seems to go against popular intuition (at least on this site: example 1, example 2).
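For concreteness, here is a minimal NumPy sketch (my own, not from the paper) of what "non-monotonic" means in this case: swish(x) = x·sigmoid(βx) dips below zero and has a local minimum near x ≈ −1.28 (for β = 1) before rising again, unlike ReLU or sigmoid, which never decrease.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: f(x) = x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

# Sample negative inputs to show the non-monotonic "dip":
# swish decreases until roughly x ~= -1.28, then increases toward 0.
xs = np.linspace(-5.0, 0.0, 501)
ys = swish(xs)
print("minimum value:", ys.min())            # ~ -0.278
print("achieved near x =", xs[ys.argmin()])  # ~ -1.28
```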

In the swish paper, the justification the authors give is that non-monotonicity increases expressivity and improves gradient flow... [and] may also provide some robustness to different initializations and learning rates.

The authors provide a figure to back up this claim, but at best the argument for non-monotonicity is vague (at worst it's a misleading way of saying we have no clue).

Does anyone have a better justification as to why non-monotonicity works in swish's favor?

Tags: activation-function, deep-learning, neural-network, machine-learning

