How does Wav2Vec 2.0 feed the output of the Convolutional Feature Encoder into the Transformer Context Network?
I was reading the Wav2Vec 2.0 paper to understand the model architecture, but I have trouble seeing how raw audio inputs of variable length can be fed through the model, especially from the Convolutional Feature Encoder to the Transformer Context Network.
During fine-tuning (from what I have read), the raw audio inputs within a batch are padded to the length of the longest input in that batch, but the padded length can still differ from one batch to the next. This implies that the output of the Convolutional Feature Encoder will have a different length for different batches.
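To make the varying lengths concrete, here is a minimal sketch of how I understand the number of encoder output frames to depend on the number of input samples. It assumes 16 kHz audio, unpadded convolutions, and the kernel widths (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2) stated in the paper; the function name `num_output_frames` is my own and not part of any library.

```python
# Kernel widths and strides of the seven conv layers, as stated in the paper.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Frames left after the conv stack (assumes no padding in any layer)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# Assumption: 16 kHz sampling rate, so 16000 samples per second.
print(num_output_frames(16000))  # 49 frames for 1 second of audio
print(num_output_frames(32000))  # 99 frames for 2 seconds of audio
```

So a batch padded to 1 second and a batch padded to 2 seconds give encoder outputs with 49 and 99 time steps respectively, which is the variation I am asking about.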
However, the Transformer Context Network has a fixed input dimension: the BASE Wav2Vec 2.0 model uses a Transformer with model dimension 768. This means that the output of the Convolutional Feature Encoder must somehow be manipulated to have dimension 768 before it can be fed into the Transformer.
How is this done? HuggingFace's Wav2Vec2 model (printed below) shows a Wav2Vec2FeatureProjection layer between the Convolutional Feature Encoder (Wav2Vec2FeatureExtractor) and the Transformer Context Network (Wav2Vec2Encoder). Wav2Vec2FeatureProjection contains a linear layer with input dimension 512 and output dimension 768. How is the input dimension 512 determined when the raw inputs can have varying lengths across batches?
Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureExtractor(
      (conv_layers): ModuleList(
        ...
        (6): Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): Wav2Vec2Encoder(
      (pos_conv_embed): Wav2Vec2PositionalConvEmbedding(
        (conv): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
        (padding): Wav2Vec2SamePadLayer()
      )
      ...
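One reading of the printout that I would like confirmed: the 512 might be the number of output channels of the last Conv1d (512, 512, ...) rather than anything tied to the time axis, in which case the linear layer would be applied to each time step separately. Below is a minimal pure-Python sketch of that reading. The random weights, `project_frames`, and `fake_encoder_output` are all hypothetical stand-ins of mine, not HuggingFace code.

```python
import random

CONV_CHANNELS = 512  # out_channels of the last Conv1d in the printout
MODEL_DIM = 768      # out_features of the projection in the printout

random.seed(0)
# Hypothetical random weights standing in for the learned projection.
weight = [[random.gauss(0.0, 0.02) for _ in range(CONV_CHANNELS)]
          for _ in range(MODEL_DIM)]
bias = [0.0] * MODEL_DIM


def project_frames(frames):
    """Apply the same 512 -> 768 linear map to every frame independently."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]


def fake_encoder_output(num_frames):
    """Stand-in for the conv encoder output: num_frames vectors of size 512."""
    return [[random.random() for _ in range(CONV_CHANNELS)]
            for _ in range(num_frames)]


short = project_frames(fake_encoder_output(3))
longer = project_frames(fake_encoder_output(7))
print(len(short), len(short[0]))    # 3 768
print(len(longer), len(longer[0]))  # 7 768
```

Under this reading the weight matrix never depends on the number of frames, so sequences of any length would pass through the same 512-to-768 projection. Is that what actually happens?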
Topic speech-to-text
Category Data Science