I am thinking about using a Hierarchical Dirichlet Process (HDP) to model a patent dataset. I've seen that HDP uses a base distribution and assumes that every topic is drawn from that base distribution. The problem is: first, I'm wondering what the main results of the HDP procedure are (in the case of LDA we obtain two matrices, document-topic and topic-word, that we can use to construct word clouds and graphs, but in this case I'm not sure about the results) and what is the exact …
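To make the question concrete, here is a minimal sketch of the outputs I would expect, assuming gensim's HdpModel (the documents here are toy placeholders, not my patent data):

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy stand-in for the patent corpus (placeholder data).
docs = [["wireless", "antenna", "signal"],
        ["battery", "charge", "lithium"],
        ["antenna", "battery", "circuit"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

hdp = HdpModel(corpus=corpus, id2word=dictionary)

# Topic-word weights for the topics HDP instantiated
# (the analogue of LDA's topic-word matrix, but with the
# number of topics inferred rather than fixed).
print(hdp.show_topics(num_topics=5, formatted=True))

# Per-document topic weights (the analogue of LDA's
# document-topic matrix).
print(hdp[corpus[0]])
```

Is it correct that these two outputs play the same role as LDA's two matrices, just with the topic count inferred from the data?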
In short, the question is: I have two sets of words per document, and I would like to extract two sets of topics per document, one per word set. To be more precise: a document $d$ can be modelled as the union of two sets of words (WordSetA, WordSetB), where $\text{WordSetA} \cup \text{WordSetB}$ contains all words in $d$. The goal is to find two sets of topics corresponding to these word sets (TopicSetA and TopicSetB), where TopicSetA is a mixture of …
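The simplest baseline I can think of is fitting two independent LDA models, one per word set, though that ignores any coupling between the two sets. A minimal sketch of that baseline, assuming gensim and placeholder data:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder documents already split into the two word sets.
docs_a = [["court", "ruling", "appeal"], ["judge", "court", "verdict"]]
docs_b = [["merger", "stock", "shares"], ["profit", "stock", "market"]]

def fit_lda(docs, num_topics=2):
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    return model, corpus

lda_a, corpus_a = fit_lda(docs_a)  # TopicSetA is learned from WordSetA only
lda_b, corpus_b = fit_lda(docs_b)  # TopicSetB is learned from WordSetB only

# Per-document mixtures over each topic set.
print(lda_a.get_document_topics(corpus_a[0]))
print(lda_b.get_document_topics(corpus_b[0]))
```

Is there a model that does this jointly instead of with two separate runs?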
I am trying to figure out the default $\alpha$ and $\eta$ values used by Mallet LDA, but there is not a lot of information on this. I did find a couple of answers, with no proper references, saying that the symmetric $\alpha$ can be calculated as 5.0/num_topics. Why is that? Why can't I use 1.0/num_topics to calculate the symmetric $\alpha$, just as in standard LDA? Can someone please help me understand and link me to references? Thanks in advance.
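For concreteness, here is the arithmetic I mean, assuming the 5.0/num_topics claim about Mallet is accurate (that is exactly the part I would like a reference for):

```python
# Per-topic symmetric alpha under the two conventions I've seen,
# for an example run with 20 topics.
num_topics = 20

# Reportedly Mallet's default: a total alpha mass of 5.0 spread
# symmetrically over the topics (unverified, hence this question).
alpha_mallet = 5.0 / num_topics    # 0.25

# The 1/K convention used in standard LDA implementations.
alpha_standard = 1.0 / num_topics  # 0.05

print(alpha_mallet, alpha_standard)
```

So the practical question is why the total prior mass would be 5.0 rather than 1.0.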
I have come across Latent Dirichlet Allocation (LDA) on multiple occasions while reading about sentiment analysis and recommender systems. Where can I find good reading material that explains the concept in depth, ideally one that works through an example?
On the Wikipedia Dirichlet process page, regarding the connection between the Chinese restaurant process and the Dirichlet process, it states the following: "If one associates draws from the base measure H with every table, the resulting distribution over the sample space S is a random sample of a Dirichlet process." What does it mean to associate draws from the base measure H with every table? It doesn't make any sense to me.
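To show where I get stuck, here is a minimal simulation of the construction as I understand it, with H chosen (by me, purely for illustration) to be a standard normal:

```python
import random

alpha = 1.0          # concentration parameter
n_customers = 10
random.seed(0)

table_counts = []    # number of customers seated at each table
table_draws = []     # the draw from H "associated with" each table

samples = []
for n in range(n_customers):
    # New table with probability alpha / (n + alpha),
    # existing table k with probability count_k / (n + alpha).
    if random.random() < alpha / (n + alpha):
        table_counts.append(1)
        table_draws.append(random.gauss(0.0, 1.0))  # one draw from H = N(0, 1)
        k = len(table_counts) - 1
    else:
        # Pick an existing table proportionally to its occupancy.
        r = random.uniform(0, n)
        k, acc = 0, table_counts[0]
        while r > acc:
            k += 1
            acc += table_counts[k]
        table_counts[k] += 1
    # The customer's sampled value is its table's associated draw from H.
    samples.append(table_draws[k])

print(samples)  # repeated values = the discrete atoms of the DP sample
```

Is the "association" just the `table_draws` list here, i.e. every table gets one value from H and all its customers share that value?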
I have two sets of topics obtained from two different sets of newspaper articles. In other words, Cluster_1 = $\{x_1, x_2, \ldots, x_n\}$ includes the main topics of the 'X' newspaper set and Cluster_2 = $\{y_1, y_2, \ldots, y_n\}$ includes the main topics of the 'Y' newspaper set. Now I want to find topics in the two sets that are similar/related by considering their attributes, as in the example below. Example 1, **X1 in Cluster_1** is mostly …
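What I have tried so far is pairwise cosine similarity between the topics' word-weight vectors; a sketch with made-up toy vectors, assuming each topic is represented over a shared vocabulary:

```python
import math

# Toy topic-word weight vectors over a shared 4-word vocabulary
# (made-up numbers, standing in for my real topic models).
cluster_1 = {"x1": [0.6, 0.3, 0.1, 0.0], "x2": [0.0, 0.1, 0.4, 0.5]}
cluster_2 = {"y1": [0.5, 0.4, 0.1, 0.0], "y2": [0.1, 0.0, 0.5, 0.4]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Pairwise similarity between topics across the two clusters.
for xi, u in cluster_1.items():
    for yj, v in cluster_2.items():
        print(xi, yj, round(cosine(u, v), 3))
```

Is there a more principled measure than cosine similarity for matching topics across corpora?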
One needs to provide LDA with a predefined number of latent topics. Let's say I have a text corpus in which I hypothesize there are 10 major topics, each composed of 10 minor subtopics. My objective is to be able to define proximity between documents. 1) How do you estimate the number of topics in practice? Empirically? With another method, like the Hierarchical Dirichlet Process (HDP)? 2) Do you build several models? For major and minor topics …
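For question 1, the "empirical" approach I have in mind is a coherence sweep over candidate topic counts; a minimal sketch assuming gensim and placeholder texts:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Placeholder tokenized corpus (stands in for the real one).
texts = [["economy", "market", "stock"],
         ["match", "goal", "league"],
         ["economy", "budget", "tax"],
         ["league", "season", "team"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit one model per candidate K and score topic coherence;
# the K with the best coherence is the "empirical" estimate.
for k in (2, 5, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())
```

Is this sweep the standard practice, or is HDP preferred precisely because it avoids the sweep?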