How does the character convolution work in ELMo?
While reading the original ELMo paper (https://arxiv.org/pdf/1802.05365.pdf), I was stumped by the following line:
The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers (Srivastava et al., 2015) and a linear projection down to a 512 representation.
The Srivastava citation only seems to cover the highway-layer concept. So, what exactly happens before the biLSTM layer(s) in ELMo? As I understand it, one-hot encoded character vectors (so, essentially 'raw text') are passed through convolutional filters and then a linear projection? How should I think about the input and output dimensions at each stage? I get the feeling that a detailed explanation may once have existed on allennlp.org (or in their GitHub repo), but has since been deemed outdated or unnecessary and removed.
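To make the question concrete, here is a minimal PyTorch sketch of how I currently picture the pre-biLSTM pipeline. The specific numbers (a 16-dim character embedding instead of raw one-hot vectors, a 262-character vocabulary, at most 50 characters per token, filter widths 1-7 with counts summing to 2048, ReLU activation) are my reading of the published bilm-tf configuration, not something spelled out in the paper, so please correct anything that's wrong:

```python
import torch
import torch.nn as nn

class CharCNNEncoderSketch(nn.Module):
    # (width, n_filters) pairs guessed from the bilm-tf options.json; counts sum to 2048
    FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

    def __init__(self, n_chars=262, char_dim=16, out_dim=512):
        super().__init__()
        # Character IDs -> small learned embedding (equivalent to one-hot x embedding matrix)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # One Conv1d per n-gram width, sliding over the character axis of each token
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, kernel_size=width)
            for width, n_filters in self.FILTERS
        )
        n_total = sum(n for _, n in self.FILTERS)  # 2048
        # Two highway layers; each Linear produces the gate and the nonlinear part together
        self.highway = nn.ModuleList(nn.Linear(n_total, 2 * n_total) for _ in range(2))
        # Final linear projection down to the 512-dim token representation
        self.proj = nn.Linear(n_total, out_dim)

    def forward(self, char_ids):
        # char_ids: (batch, n_tokens, max_chars) integer character IDs, e.g. max_chars=50
        b, t, c = char_ids.shape
        x = self.char_emb(char_ids.reshape(b * t, c))   # (b*t, max_chars, char_dim)
        x = x.transpose(1, 2)                           # (b*t, char_dim, max_chars)
        # Convolve over character n-grams, max-pool over positions, then activate
        pooled = [torch.relu(conv(x).max(dim=-1).values) for conv in self.convs]
        x = torch.cat(pooled, dim=-1)                   # (b*t, 2048)
        # Highway: y = g * relu(W_H x) + (1 - g) * x, with g = sigmoid(W_T x)
        for hw in self.highway:
            pre = hw(x)
            gate = torch.sigmoid(pre[:, : x.size(1)])
            nonlin = torch.relu(pre[:, x.size(1):])
            x = gate * nonlin + (1 - gate) * x
        x = self.proj(x)                                # (b*t, 512)
        return x.reshape(b, t, -1)                      # context-insensitive token reps
```

If this picture is roughly right, each token's character IDs of shape (batch, n_tokens, 50) map to a context-insensitive (batch, n_tokens, 512) representation, and it is this output that feeds the biLSTM.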
I hope the question makes sense.
Topic: allennlp, convolutional-neural-network, nlp
Category: Data Science