How does the character convolution work in ELMo?
While reading the original ELMo paper (https://arxiv.org/pdf/1802.05365.pdf), I was stumped by the following line:
The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers (Srivastava et al., 2015) and a linear projection down to a 512 representation.
The Srivastava citation only seems to cover the highway-layer concept. So, what exactly happens before the biLSTM layer(s) in ELMo? As I understand it, one-hot encoded character vectors (so, essentially 'raw text') are passed through convolutional filters and then a linear projection? How should I think about the input and output dimensions at each stage? I get the feeling that a detailed explanation may once have existed on allennlp.org (or in their GitHub repo), but has since been deemed outdated or unnecessary and removed.
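To make the question concrete, here is a minimal PyTorch sketch of how I currently picture the pre-biLSTM pipeline. The specific numbers (a 16-dim character embedding instead of raw one-hot vectors, a 262-character vocabulary, at most 50 characters per token, filter widths 1-7 with counts summing to 2048, ReLU activation) are my reading of the published bilm-tf configuration, not something spelled out in the paper, so please correct anything that's wrong:

```python
import torch
import torch.nn as nn

class CharCNNEncoderSketch(nn.Module):
    # (width, n_filters) pairs guessed from the bilm-tf options.json; counts sum to 2048
    FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

    def __init__(self, n_chars=262, char_dim=16, out_dim=512):
        super().__init__()
        # Character IDs -> small learned embedding (equivalent to one-hot x embedding matrix)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # One Conv1d per n-gram width, sliding over the character axis of each token
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, kernel_size=width)
            for width, n_filters in self.FILTERS
        )
        n_total = sum(n for _, n in self.FILTERS)  # 2048
        # Two highway layers; each Linear produces the gate and the nonlinear part together
        self.highway = nn.ModuleList(nn.Linear(n_total, 2 * n_total) for _ in range(2))
        # Final linear projection down to the 512-dim token representation
        self.proj = nn.Linear(n_total, out_dim)

    def forward(self, char_ids):
        # char_ids: (batch, n_tokens, max_chars) integer character IDs, e.g. max_chars=50
        b, t, c = char_ids.shape
        x = self.char_emb(char_ids.reshape(b * t, c))   # (b*t, max_chars, char_dim)
        x = x.transpose(1, 2)                           # (b*t, char_dim, max_chars)
        # Convolve over character n-grams, max-pool over positions, then activate
        pooled = [torch.relu(conv(x).max(dim=-1).values) for conv in self.convs]
        x = torch.cat(pooled, dim=-1)                   # (b*t, 2048)
        # Highway: y = g * relu(W_H x) + (1 - g) * x, with g = sigmoid(W_T x)
        for hw in self.highway:
            pre = hw(x)
            gate = torch.sigmoid(pre[:, : x.size(1)])
            nonlin = torch.relu(pre[:, x.size(1):])
            x = gate * nonlin + (1 - gate) * x
        x = self.proj(x)                                # (b*t, 512)
        return x.reshape(b, t, -1)                      # context-insensitive token reps
```

If this picture is roughly right, each token's character IDs of shape (batch, n_tokens, 50) map to a context-insensitive (batch, n_tokens, 512) representation, and it is this output that feeds the biLSTM.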
I hope the question makes sense.
Topic: allennlp, convolutional-neural-network, nlp
Category: Data Science