How does Wav2Vec 2.0 feed output from Convolutional Feature Encoder as input to the Transformer Context Network

I was reading the Wav2Vec 2.0 paper and trying to understand the model architecture, but I have trouble understanding how raw audio inputs of variable length can be fed through the model, in particular how the output of the Convolutional Feature Encoder is passed to the Transformer Context Network.

During fine-tuning (from what I have read), the raw audio inputs within a batch are padded to the length of the longest input in that batch, but the padded length can still differ across batches. This implies that the output of the Convolutional Feature Encoder will also have varying lengths across batches.
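
For concreteness, this is roughly how I picture the padding step, a minimal sketch with made-up waveform lengths (the actual HuggingFace processor handles this, but the idea is the same):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    # Three hypothetical raw waveforms of different lengths (16 kHz samples).
    waveforms = [torch.randn(16000), torch.randn(24000), torch.randn(12000)]

    # Pad to the longest waveform in this batch -> shape (3, 24000).
    batch = pad_sequence(waveforms, batch_first=True)
    print(batch.shape)  # torch.Size([3, 24000])

    # A different batch could just as well end up padded to, say, 40000 samples.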

However, the Transformer Context Network has a fixed input dimension; the BASE Wav2Vec 2.0 model uses a transformer with model dimension 768. This means that the output of the Convolutional Feature Encoder must somehow be manipulated into dimension 768 before it can be fed into the Transformer.

How is this manipulation done? HuggingFace's Wav2Vec2 model (see below) shows that there is a Wav2Vec2FeatureProjection layer between the Convolutional Feature Encoder (a.k.a. Wav2Vec2FeatureExtractor) and the Transformer Context Network (a.k.a. Wav2Vec2Encoder). Wav2Vec2FeatureProjection contains a linear layer that takes an input of dimension 512 and produces an output of dimension 768. How is the input dimension of 512 determined when the raw inputs can have varying lengths across batches?

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureExtractor(
      (conv_layers): ModuleList(
        ...
        (6): Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): Wav2Vec2Encoder(
      (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
        (conv): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
        (padding): Wav2Vec2SamePadLayer()
      )
    ...
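
To make the question concrete, here is roughly how I am inspecting the shapes. This is only a sketch; I am assuming facebook/wav2vec2-base-960h and the module names shown in the printout above:

    import torch
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

    # One second and two seconds of (random) 16 kHz "audio".
    one_sec = torch.randn(1, 16000)
    two_sec = torch.randn(1, 32000)

    with torch.no_grad():
        feat_short = model.feature_extractor(one_sec)  # (1, 512, 49)
        feat_long = model.feature_extractor(two_sec)   # (1, 512, 99)
    print(feat_short.shape, feat_long.shape)

    # Apply the projection's Linear(512, 768) to the transposed features
    # (the full Wav2Vec2FeatureProjection also applies LayerNorm and Dropout).
    with torch.no_grad():
        proj = model.feature_projection.projection(feat_long.transpose(1, 2))
    print(proj.shape)  # (1, 99, 768)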


The key is that the 768-dimensional vector fed to the transformer is the representation of a single timestep, not of the whole audio input... let me explain.

  • You start with a variable-length raw audio input.
  • This is passed through a temporal CNN (the feature encoder), which gives you outputs the paper calls $z_1$ to $z_T$, where $T$ is the number of timesteps for that particular audio input and therefore differs from input to input across a batch.
  • Each $z_t$ is a fixed-size feature vector (512-dimensional in the BASE model) no matter how long the audio is; the Wav2Vec2FeatureProjection layer maps each one to a 768-dimensional vector, so the transformer receives a sequence of $T$ 768-dim vectors per input (see the sketch after this list).
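
To see that the transformer itself does not care about $T$, here is a minimal sketch with a generic nn.TransformerEncoderLayer standing in for the wav2vec 2.0 context network (only the model dimension of 768 and the 12 attention heads match the BASE configuration, the rest is a stand-in):

    import torch
    import torch.nn as nn

    # Stand-in for one block of the context network.
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)

    # Two "utterances" with different numbers of CNN output frames T.
    x1 = torch.randn(1, 49, 768)   # T = 49
    x2 = torch.randn(1, 99, 768)   # T = 99

    print(layer(x1).shape)  # torch.Size([1, 49, 768])
    print(layer(x2).shape)  # torch.Size([1, 99, 768])
    # The layer's weights only depend on the 768 feature dimension, not on T.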

As you can imagine, in practice you also need to pass an attention mask to the transformer, so that it knows what $T$ is for each of the inputs in your batch (i.e., which timesteps are real and which are padding).
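
Here is a sketch of how that mask can be derived from the unpadded audio lengths, assuming the BASE feature encoder's kernel widths (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2) from the paper (HuggingFace does an equivalent computation internally from the conv config):

    import torch

    KERNELS = (10, 3, 3, 3, 3, 2, 2)  # feature-encoder kernel widths (BASE)
    STRIDES = (5, 2, 2, 2, 2, 2, 2)   # feature-encoder strides (BASE)

    def num_frames(num_samples: int) -> int:
        """Number of CNN output timesteps T for a waveform of num_samples."""
        length = num_samples
        for k, s in zip(KERNELS, STRIDES):
            length = (length - k) // s + 1
        return length

    # Unpadded lengths of the waveforms in one batch.
    sample_lengths = [16000, 32000, 24000]
    frame_lengths = [num_frames(n) for n in sample_lengths]  # [49, 99, 74]

    # Boolean mask for the transformer: True = real frame, False = padding.
    max_T = max(frame_lengths)
    mask = torch.arange(max_T)[None, :] < torch.tensor(frame_lengths)[:, None]
    print(frame_lengths, mask.shape)  # [49, 99, 74] torch.Size([3, 99])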
