My native language is a regional language spoken by few people. I have some assignments in a machine learning course and I was thinking about doing some natural language processing on my native language, but I don't know where to start since there is almost no research on this language (no corpus, no research papers, ...) and I'm new to machine learning. I want to start doing everything from the bottom and I want to do …
I'm working on an NMT model in which the input and the target sentences are from the same language (but the grammar differs). I'm planning to pre-train and use BERT since I'm working on a small dataset and a low-resource language. So is it possible to feed BERT into the seq2seq encoder/decoder?
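One way to try this, assuming the Hugging Face transformers library is available, is to warm-start both sides of a seq2seq model from a pre-trained BERT checkpoint via EncoderDecoderModel; the checkpoint name below is only a placeholder for whatever BERT you pre-train:

    # A minimal sketch: warm-start a seq2seq model from pre-trained BERT weights.
    # The checkpoint name is a placeholder; swap in your own pre-trained BERT.
    from transformers import BertTokenizer, EncoderDecoderModel

    checkpoint = "bert-base-multilingual-cased"
    tokenizer = BertTokenizer.from_pretrained(checkpoint)

    # Both encoder and decoder are initialized from BERT; the decoder additionally
    # gets cross-attention layers and is trained autoregressively.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id

    src = tokenizer("source sentence in one grammar", return_tensors="pt")
    tgt = tokenizer("target sentence in the other grammar", return_tensors="pt")

    # Standard seq2seq training signal: cross-entropy on the target tokens.
    loss = model(input_ids=src.input_ids,
                 attention_mask=src.attention_mask,
                 labels=tgt.input_ids).loss
    print(loss)

The decoder copy of BERT gets cross-attention layers added and is fine-tuned autoregressively, so the pre-trained weights are reused on both the encoder and the decoder side.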
I was wondering how useful the encoder's hidden state is for an attention network. When I looked into the structure of an attention model, this is what a model generally looks like: x: the input; h: the encoder's hidden state, which feeds forward into the next encoder hidden state; s: the decoder's hidden state, which takes a weighted sum of all the encoder hidden states as input and feeds forward into the next decoder hidden state; y: the output. With a process …
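For reference, a minimal NumPy sketch of the weighted sum the question describes, using Bahdanau-style additive scoring as an illustrative choice (the dimensions are made up):

    # Minimal additive-attention sketch in NumPy (illustrative, not a full model).
    import numpy as np

    T, d = 6, 8                       # number of encoder steps, hidden size
    h = np.random.randn(T, d)         # encoder hidden states h_1 .. h_T
    s = np.random.randn(d)            # current decoder hidden state s_{t-1}

    W_h = np.random.randn(d, d)       # learned projection of encoder states
    W_s = np.random.randn(d, d)       # learned projection of decoder state
    v = np.random.randn(d)            # learned scoring vector

    # e_i = v^T tanh(W_h h_i + W_s s): one relevance score per encoder step
    e = np.tanh(h @ W_h + s @ W_s) @ v

    # alpha = softmax(e): attention weights that sum to 1 over encoder steps
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    # context = sum_i alpha_i * h_i: the weighted sum fed into the decoder
    context = alpha @ h
    print(alpha.shape, context.shape)   # (6,), (8,)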
I am trying to implement early stopping for my model, where I am performing machine translation using seq2seq with attention. I am mostly used to writing my own models in steps, something like this:

    for activation in activations:
        for layer1 in layers1:
            for optimizer in optimizers:
                # define model
                model_vanilla_lstm = Sequential()
                model_vanilla_lstm.add(LSTM(layer1, activation=activation, input_shape=(n_step, n_features)))
                model_vanilla_lstm.add(Dense(1))
                # compile model
                model_vanilla_lstm.compile(optimizer=optimizer, loss='mse')
                # Early Stopping
                earlyStop = EarlyStopping(monitor="val_loss", mode='min', patience=5)
                # fit model
                history = model_vanilla_lstm.fit(X, y, epochs=epoch, validation_data=(X_test, dataset_test['Close']), verbose=1, callbacks=[earlyStop])
                # Summary of the model …
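If the seq2seq-with-attention model is trained with a hand-written loop instead of model.fit, the same patience logic can be spelled out explicitly. A minimal sketch, assuming a PyTorch-style model and that train_one_epoch and evaluate are the user's own functions returning the training and validation loss:

    # Hand-rolled early stopping for a custom training loop (sketch only).
    # `train_one_epoch`, `evaluate`, `model`, the data loaders and `max_epochs`
    # are assumed to exist already.
    import math

    patience = 5
    best_val_loss = math.inf
    epochs_without_improvement = 0
    best_state = None

    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model, train_loader, optimizer)
        val_loss = evaluate(model, val_loader)
        print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}  # keep best weights
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # no improvement for `patience` epochs
                print("early stopping triggered")
                break

    model.load_state_dict(best_state)  # restore the best checkpoint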
What is the bits-per-character (bpc) metric that is used to measure model quality on the text8 and enwik8 datasets? I encountered the term bpc in the Transformer-XL paper here. How does it differ from perplexity as a metric?
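For context, both metrics are monotone transforms of the same average cross-entropy; a small worked example with a made-up loss value (note that bpc is measured per character on text8/enwik8, whereas perplexity is usually reported per word or per token):

    # bpc and perplexity are both derived from average cross-entropy (made-up number).
    import math

    nll_nats_per_char = 0.83                      # average negative log-likelihood per character, in nats

    bpc = nll_nats_per_char / math.log(2)         # bits per character = cross-entropy in base 2
    ppl_per_char = math.exp(nll_nats_per_char)    # character-level perplexity

    print(f"bpc        = {bpc:.3f}")              # ~1.198
    print(f"perplexity = {ppl_per_char:.3f}")     # ~2.293
    print(math.isclose(ppl_per_char, 2 ** bpc))   # True: perplexity = 2 ** bpc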
The original seq2seq paper reversed the input sequence and cited multiple reasons for doing so. See: Why does LSTM performs better when the source target is reversed? (Seq2seq) But when using attention, is there still any benefit to doing this? I imagine since the decoder has access to the encoder hidden states at each time step, it can learn what to attend to and the input can be fed in the original order.
Each year, the Workshop on Statistical Machine Translation (WMT) holds a conference that focuses on new tasks, papers, and findings in the field of machine translation. Let's say we are talking about the parallel News Commentary dataset. There is a News Commentary set in WMT14, WMT15, WMT16 and so on. How much does the dataset differ between conference editions? Is this documented somewhere?
I am trying to compare A: the Transformer-based architecture for neural machine translation (NMT) from the Attention Is All You Need paper, with B: an architecture based on bidirectional LSTMs in the encoder coupled with a unidirectional LSTM in the decoder, which attends to all the encoder hidden states, creates a weighted combination, and uses this along with the (unidirectional) decoder LSTM output to produce the final output word. My question is what might be the advantages of architecture A …
I am trying to build a translation model in PyTorch. Following this post on PyTorch, I downloaded the Multi30k dataset and the spaCy models for English and German.

    python -m spacy download en
    python -m spacy download de

    import torchtext
    import torch
    from torchtext.data.utils import get_tokenizer
    from collections import Counter
    from torchtext.vocab import Vocab, build_vocab_from_iterator
    from torchtext.utils import download_from_url, extract_archive
    import io

    url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
    train_urls = ('train.de.gz', 'train.en.gz')
    val_urls = ('val.de.gz', 'val.en.gz')
    test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')
    train_filepaths = [extract_archive(download_from_url(url_base + …
I am using the Bahdanau attention layer in TensorFlow for time-series prediction, although conceptually it is similar to NLP applications. This is what the minimal example code for a single layer looks like:

    import tensorflow as tf

    dim = 7
    Tq = 5   # Number of future time steps to predict
    Tv = 13  # Number of historic lag timesteps to consider
    batch_size = 2**4

    query = tf.random.uniform(shape=(batch_size, Tq, dim))
    value = tf.random.uniform(shape=(batch_size, Tv, dim))
    key = tf.random.uniform(shape=value.shape)

    layer = tf.keras.layers.AdditiveAttention(use_scale=True, causal=True)
    output, score = layer(inputs=[query, value, key], return_attention_scores=True)

The score obtained in the last line seems to be …
What's the general tradeoff between choosing BPE vs WordPiece Tokenization? When is one preferable to the other? Are there any differences in model performance between the two? I'm looking for a general overall answer, backed up with specific examples.
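As a concrete illustration, both schemes can be trained side by side with the Hugging Face tokenizers library; the toy corpus and vocabulary size below are made up:

    # Train a BPE and a WordPiece tokenizer on the same toy corpus (sketch;
    # corpus and vocab size are made up) to compare their outputs.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE, WordPiece
    from tokenizers.trainers import BpeTrainer, WordPieceTrainer
    from tokenizers.pre_tokenizers import Whitespace

    corpus = ["low lower lowest", "new newer newest", "wide wider widest"]

    bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
    bpe_tok.pre_tokenizer = Whitespace()
    bpe_tok.train_from_iterator(corpus, BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))

    wp_tok = Tokenizer(WordPiece(unk_token="[UNK]"))
    wp_tok.pre_tokenizer = Whitespace()
    wp_tok.train_from_iterator(corpus, WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

    print(bpe_tok.encode("newest widest").tokens)
    print(wp_tok.encode("newest widest").tokens)   # WordPiece marks word-internal pieces with '##'

The visible difference is mostly in how subword pieces are selected and marked; whether one yields better downstream model performance is exactly what the question asks about.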
I'm currently analysing the paper Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation (Post and Vilar, 2018): https://arxiv.org/abs/1804.06609. I have trouble understanding how the data is processed. For example, the paper talks about beams, banks and hypotheses, and I have no idea what these terms mean. How would you describe these terms, and are there any tutorial sources you would recommend for understanding dynamic beam allocation?
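As a rough anchor for the terminology: a hypothesis is a partial translation with a running score, and the beam is the fixed-size set of best hypotheses kept at each decoding step. A toy sketch of plain beam search (not the paper's dynamic beam allocation; next_token_logprobs is a hypothetical stand-in for one step of the NMT decoder):

    # Plain beam search sketch to make "beam" and "hypothesis" concrete.
    # This is NOT the paper's dynamic beam allocation; `next_token_logprobs`
    # is a hypothetical stand-in for one step of the NMT decoder.
    import math

    def beam_search(next_token_logprobs, beam_size=4, max_len=20, eos="</s>"):
        # A hypothesis is a partial translation plus its running log-probability;
        # the beam is the set of the `beam_size` best hypotheses kept at each step.
        beam = [([], 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, score in beam:
                if tokens and tokens[-1] == eos:
                    candidates.append((tokens, score))            # finished hypotheses carry over
                    continue
                for tok, logp in next_token_logprobs(tokens):
                    candidates.append((tokens + [tok], score + logp))
            beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        return beam

    # toy "model": always proposes the same three continuations
    toy = lambda tokens: [("the", math.log(0.5)), ("cat", math.log(0.3)), ("</s>", math.log(0.2))]
    for tokens, score in beam_search(toy, beam_size=2, max_len=3):
        print(round(score, 3), tokens)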
After reading the paper Attention Is All You Need, I have two questions: 1. What is the need for a multi-head attention mechanism? The paper says: "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions." My understanding is that it helps with anaphora resolution. For example: "The animal didn't cross the street because it was too ..... (tired/wide)". Here "it" can refer to the animal or the street depending on the last word. …
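For reference, a toy NumPy sketch of multi-head self-attention with made-up dimensions; each head's separate query/key/value projections are what the quoted sentence means by "different representation subspaces":

    # Minimal multi-head self-attention sketch in NumPy (illustrative shapes only).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    T, d_model, n_heads = 5, 16, 4          # sequence length, model dim, number of heads
    d_k = d_model // n_heads                # each head works in its own 4-dim subspace

    x = np.random.randn(T, d_model)         # token representations
    heads = []
    for _ in range(n_heads):
        # each head has its own learned projections, i.e. its own "representation subspace"
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        attn = softmax(Q @ K.T / np.sqrt(d_k))      # (T, T) attention pattern for this head
        heads.append(attn @ V)                      # (T, d_k)

    W_o = np.random.randn(d_model, d_model)
    out = np.concatenate(heads, axis=-1) @ W_o      # heads are concatenated and mixed
    print(out.shape)                                # (5, 16)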
I am working on a project on neural machine translation in the English-Irish domain. I am not an expert and have researched entirely on my own for a technology exhibition, so apologies if my question is simple. I am trying to parse all of my English corpus into constituency trees. Of course, the format of a sentence when using the Stanford Parser is something like: (ROOT (S (NP (VBG cohabiting) (NNS partners)) (VP (MD can) (VP (VB make) (NP (NP …
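One possible way to batch-parse the corpus into that bracketed format is via NLTK's CoreNLPParser client, assuming a Stanford CoreNLP server is already running locally (the sentences below are stand-ins for the corpus):

    # Sketch: parse English sentences into Stanford-style constituency trees.
    # Assumes a Stanford CoreNLP server is already running on localhost:9000.
    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url='http://localhost:9000')

    sentences = ["Cohabiting partners can make a will.",
                 "The exhibition opens next week."]        # stand-ins for the real corpus

    for sent in sentences:
        tree = next(parser.raw_parse(sent))
        # one bracketed tree per line, in the (ROOT (S ...)) format shown above
        print(tree.pformat(margin=10**6))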
I am trying to train an NMT model where the source side is romanized text of Asian languages from social media and the target side is English. Note that since Roman script is not native to these languages, the romanizations people use to type on the Internet are very personal and hence a bit noisy, but easily intelligible to native speakers. The following is an example of writing a Hindi sentence in different ways: Vaise bhi mere paas jo bhi hai …
I am wondering, do we really need <unk> tokens? Why do we limit our vocabulary? Is it for speed? Accuracy? If we disable all limitations, what do you predict happens?
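To make the question concrete, here is a toy sketch of what a frequency cutoff does (the corpus and threshold are made up): every word type below the cutoff collapses into <unk>, which is what keeps the embedding and softmax matrices small; without the limit there is one row per observed type, including typos and rare names.

    # Sketch of how a vocabulary cutoff produces <unk> tokens (toy corpus, made-up cutoff).
    from collections import Counter

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    min_freq = 2                                     # words seen fewer times become <unk>

    counts = Counter(corpus)
    vocab = {w for w, c in counts.items() if c >= min_freq}

    encoded = [w if w in vocab else "<unk>" for w in corpus]
    print(vocab)      # {'the', 'sat', 'on'}
    print(encoded)    # rare words ('cat', 'mat', 'dog', 'rug') collapse to <unk>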
I am trying to make a model that is capable of translating a sentence into a new and better form. I would like the model to change the tone and also give it some character. I am using this in my web app UI, simply letting users see a new description each time they refresh the page. For example, "You are logged out" -> "Looks like you have logged out". Something of that sort; any ideas on this?
I have been working on a project where we are trying to convert a PSD (Adobe Photoshop) file to HTML for web applications as well as to a layout XML for Android. We worked our way to generating a basic skeletal HTML/XML but hit a wall for complex scenarios such as identifying separate divs and components. Our initial approach was to standardize the PSD and get metadata about each component from the PSD, but due to its limitations we could only add …
I can't see how BERT makes predictions without using a decoder unit, which was part of all models before it, including Transformers and standard RNNs. How are output predictions made in the BERT architecture without a decoder? How does it do away with decoders completely?
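A small sketch of where BERT's predictions come from, assuming the Hugging Face transformers API: the encoder's final hidden states are passed through a masked-language-model head (a projection onto the vocabulary) at each position, so no autoregressive decoder is needed:

    # Sketch: BERT's masked-LM head is a projection from encoder outputs to the
    # vocabulary, so predictions need no autoregressive decoder
    # (assumes the Hugging Face transformers library).
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                       # (batch, seq_len, vocab_size)

    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    predicted_id = logits[0, mask_pos].argmax(-1).item()
    print(tokenizer.convert_ids_to_tokens(predicted_id))      # encoder output -> vocab logits -> token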