How to derive the Evidence Lower Bound in the paper "Zero-Shot Text-to-Image Generation"?
Can someone share the derivation of the Evidence Lower Bound in this paper?
Zero-Shot Text-to-Image Generation
The overall procedure can be viewed as maximizing the evidence lower bound (ELB) (Kingma & Welling, 2013; Rezende et al., 2014) on the joint likelihood of the model distribution over images x, captions y, and the tokens z for the encoded RGB image. We model this distribution using the factorization ${p_{\theta,\psi}(x, y, z) = p_\theta(x | y, z)\, p_\psi(y, z)}$, which yields the lower bound: ${\ln p_{\theta,\psi}(x, y) \geq \mathbb{E}_{z \sim q_\phi(z | x)}\left(\ln p_\theta(x | y, z) - \beta D_{KL}(q_\phi(y, z | x), p_\psi(y, z))\right)}$
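
For reference, here is a sketch of how I would expect the bound to come about, assuming the standard variational argument (importance weighting with the encoder followed by Jensen's inequality). It also assumes that $q_\phi(y, z | x)$ is shorthand for the encoder $q_\phi(z | x)$ paired with the observed caption $y$. That reading is my assumption; the quoted passage does not spell it out.

$$
\begin{aligned}
\ln p_{\theta,\psi}(x, y)
&= \ln \sum_z p_\theta(x | y, z)\, p_\psi(y, z) \\
&= \ln \mathbb{E}_{z \sim q_\phi(z | x)} \left[ \frac{p_\theta(x | y, z)\, p_\psi(y, z)}{q_\phi(z | x)} \right] \\
&\geq \mathbb{E}_{z \sim q_\phi(z | x)} \left[ \ln p_\theta(x | y, z) \right]
 - \mathbb{E}_{z \sim q_\phi(z | x)} \left[ \ln \frac{q_\phi(z | x)}{p_\psi(y, z)} \right]
\end{aligned}
$$

Under that reading, the second expectation plays the role of $D_{KL}(q_\phi(y, z | x), p_\psi(y, z))$, and $\beta = 1$ gives the usual bound. It can also be written as $D_{KL}(q_\phi(z | x), p_\psi(z | y)) - \ln p_\psi(y)$, which is non-negative when $y$ is discrete, so any $\beta \geq 1$ would still give a valid (looser) lower bound. Is this the derivation the authors have in mind?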