Model Joint Probability of N Words Appearing Together in a Sentence
Assume that we have a large corpus of texts to train with. Given N words as input, I want to model the joint probability $p(x_1, x_2, ..., x_N)$ of these words appearing together in a sentence. More specifically, the N words are not required to be ordered or contiguous, and words other than given words can appear in the sentence. There is no restriction on the number of times each of N words can appear in the sentence. I did some research and below are some possible directions.
Popular RNN models like LSTM offer conditional probability $p(x_t | x_{t-1}, ..., x_0)$, which may shed light on the joint probability I am modeling. However, RNN models take sequential inputs and require the N given words to be ordered. Also, if I am to use RNN models, I need to use Markov chain rule to calculate the joint probability, the N words would need to be continuous, and no words other than given words would be allowed.
Topic models like LSA(Latent Semantic Analysis) or LDA(Latent Dirichlet Allocation) may help me find topics from corpus, and model joint probability based on topics that each of N words belong to. Words from same topic are assigned higher joint probability. These models require me to manually set the number of topics. As I am not really familiar with these models, I am not sure how good they will perform.
A neural network alternative to LDA is a Restricted Boltzmann Machine, where topics are learnt as hidden neurons. The input to this RBM would be a vector the size of my whole dictionary. Each entry is either 1(present), 0(not present), or -1(don’t know). During training the probability $p(h|v_{train})$ is learnt, where $h$ is my learnt topics and $v$ are training data. At test time, the test input $v_{test}$ will have all N given words’ value equal to 1, while all other words equal to -1. In this way we can learn the probability $p(v_{test}|h)$.
I feel like this is a really fundamental problem in language modeling, but have not yet been able to find recent research on this. Could anyone please give any pointers on this problem? Does any of the above proposed solutions look feasible? Any help is appreciated!
Topic rnn rbm lda nlp machine-learning
Category Data Science