How would you describe cluster 2 from this output of a run of the EM program?

My description: Cluster 2 consists of 9511 instances; the age is around 42 (ranging between 29.7207 and 54.5257). Considering age, Cluster 2 is very well separated from Cluster 1, with a distance of 18.9513. Cluster 2 and Cluster 0, on the other hand, are very close: their centroids are within a distance of around 0.8248. What else could be added?
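For context, separation like this can also be quantified directly from a fitted mixture model. A minimal scikit-learn sketch (not the tool that produced the output above; the data below are synthetic stand-ins) that prints the pairwise distances between cluster means:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(30, 3, 300),   # cluster near 30
                       rng.normal(31, 3, 300),   # overlapping cluster
                       rng.normal(50, 3, 300)])  # well-separated cluster

gmm = GaussianMixture(n_components=3, random_state=0).fit(ages.reshape(-1, 1))
means = gmm.means_
for i in range(len(means)):
    for j in range(i + 1, len(means)):
        d = np.linalg.norm(means[i] - means[j])
        print(f"distance between cluster {i} and cluster {j} means: {d:.4f}")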
Category: Data Science

Is there a Gaussian Mixture Model for data with opposing pairs?

I have a classification problem with data that comes in pairs. A pair consists of two datapoints (A,B) or (B,A), each datapoint containing 20 features. After receiving about 30 pairs, my goal is to separate the A and B classes with a GMM based on feature similarity. For each datapoint, it is not known beforehand to which class it belongs, but it is known that it is of the opposite class from the other datapoint in its pair. Is there any …
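Not a dedicated pairwise-constrained GMM, but one simple way to use the constraint is a post-hoc assignment: fit an ordinary two-component GMM, then within each pair give the component-0 label to the member with the higher posterior for component 0 and the opposite label to its partner. A rough sketch on synthetic pairs (the real input would be the ~30 observed pairs):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pairs = [(rng.normal(0, 1, 20), rng.normal(3, 1, 20)) for _ in range(30)]

X = np.array([p for pair in pairs for p in pair])        # 60 x 20 feature matrix
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
post = gmm.predict_proba(X).reshape(len(pairs), 2, 2)    # (pair, member, component)

labels = []
for p in post:
    # enforce "opposite classes within a pair": component 0 goes to the
    # member that is more confident about it
    a_first = p[0, 0] >= p[1, 0]
    labels.append(("A", "B") if a_first else ("B", "A"))
print(labels[:5])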
Category: Data Science

How to derive Evidence Lower Bound in the paper "Zero-Shot Text-to-Image Generation"?

Can someone share the derivation of the Evidence Lower Bound in this paper: Zero-Shot Text-to-Image Generation? The overall procedure can be viewed as maximizing the evidence lower bound (ELB) (Kingma & Welling, 2013; Rezende et al., 2014) on the joint likelihood of the model distribution over images x, captions y, and the tokens z for the encoded RGB image. We model this distribution using the factorization $p_{\theta,\psi}(x, y, z) = p_\theta(x \mid y, z)\,p_\psi(y, z)$, which yields the lower bound: …
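For reference, the generic step that produces a bound of this form is Jensen's inequality applied to a variational posterior. Writing $q_\phi(z \mid x)$ for an assumed variational distribution over the image tokens (my notation, not necessarily the paper's), the marginal log likelihood is bounded by

$$\ln p_{\theta,\psi}(x, y) = \ln \sum_{z} p_\theta(x \mid y, z)\, p_\psi(y, z) = \ln \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid y, z)\, p_\psi(y, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\ln p_\theta(x \mid y, z) + \ln p_\psi(y, z) - \ln q_\phi(z \mid x)\right],$$

where the first term on the right is the reconstruction term and the remaining two combine into a KL-style regulariser of $q_\phi$ against the prior.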
Category: Data Science

Which latent variable model is better for finding a hidden variable?

Currently, I am exploring the concept of latent variables for regression-type datasets. I have gone through the literature on a few of the methods and models used to find latent variables, such as EM algorithms, partial least squares regression, latent semantic analysis, mixed-effect models (linear and nonlinear), HMMs, and there are many more! For example, the head of the volume DataFrame is:

       length     width      volume
0    1.395702  4.822958   40.821677
1    5.761620  9.912682  242.571731
2    3.444930  2.111199   18.904144
3    6.236642  7.609429  425.838818
4    7.270517  1.106117   39.883937

In …
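As one concrete point of comparison (my own synthetic example, not the DataFrame above): partial least squares extracts latent scores from the features that covary most with the response, so if a single hidden factor drives both the features and the target, the first PLS score tends to recover that factor.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
t = rng.normal(size=200)                         # hidden factor (never observed)
length = 5 + t + 0.3 * rng.normal(size=200)
width  = 4 + 2 * t + 0.3 * rng.normal(size=200)
volume = 50 + 30 * t + rng.normal(size=200)

X = np.column_stack([length, width])
pls = PLSRegression(n_components=1).fit(X, volume)
latent = pls.transform(X).ravel()                # one latent score per row
print(abs(np.corrcoef(latent, t)[0, 1]))         # close to 1 if the factor is recovered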
Category: Data Science

Active learning with mixture model cluster assignments - am I injecting bias here?

Suppose I have a dataset of people's phone numbers and heights, and I'm interested in learning the parameters $p_{girl}$, $p_{boy}=1-p_{girl}$, $\mu_{boy}$, $\mu_{girl}$, and overall $\sigma$ governing the distribution of people's heights. I don't have labels for boys or girls yet, but if I really want to, I can call the phone number and ask whether the person is a boy or a girl. Procedure: Fit a Gaussian mixture model to heights via EM. Assign the greater of the $\mu$s to be …
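For the first step of that procedure, a minimal sketch (synthetic heights, with a tied covariance so there is a single shared sigma as in the parameterisation above) of fitting the two-component mixture by EM and reading off the weight and mean estimates:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(165, 7, 500),   # unlabeled "girls"
                          rng.normal(178, 7, 500)])  # unlabeled "boys"

gmm = GaussianMixture(n_components=2, covariance_type="tied",
                      random_state=0).fit(heights.reshape(-1, 1))
print(gmm.weights_)        # estimates of p_girl and p_boy (component order is arbitrary)
print(gmm.means_.ravel())  # the two mu estimates; the larger one would be called mu_boy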
Category: Data Science

How to find the feature regions where each label is most expected when using decision trees?

Given a decision tree for classification, for example this one: how do I find the feature domain (petal and sepal width and length) where a sample of each class would most likely occur in the feature space? It is clear here that for Setosa it is when petal length is less than or equal to 2.45. However, I am confused about how to think about more complex cases. For example, let's take Versicolor: I am hesitating between 2 …
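One way to make those regions explicit is to read them off the fitted tree itself: every leaf corresponds to a conjunction of threshold conditions on the features. A scikit-learn sketch on the iris data (assumed to be similar to the tree in the question) that prints the region leading to each leaf together with its majority class:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
t = clf.tree_

def walk(node, conditions):
    if t.children_left[node] == -1:                       # leaf node
        klass = iris.target_names[np.argmax(t.value[node])]
        print(" AND ".join(conditions) or "(root)", "->", klass)
        return
    name = iris.feature_names[t.feature[node]]
    thr = t.threshold[node]
    walk(t.children_left[node],  conditions + [f"{name} <= {thr:.2f}"])
    walk(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])

walk(0, [])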
Category: Data Science

E-step for EM algorithm for document clustering

I have code for the E-step of the EM algorithm for document clustering, in the hard-EM version. I'm trying to implement the E-step for the soft-EM algorithm. Here is my code for hard-EM:

E.step <- function(gamma, model, counts){
  N <- dim(counts)[2] # number of documents
  K <- dim(model$mu)[1]
  for (n in 1:N){
    for (k in 1:K){
      gamma[n,k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    logZ = logSum(gamma[n,])
    gamma[n,] = gamma[n,] - logZ
  }
  gamma <- exp(gamma)
  return (gamma)
…
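Not R, but for comparison here is a short Python sketch of a soft E-step for a multinomial mixture over documents, assuming counts is a word-by-document count matrix, log_rho the log mixture weights, and log_mu the log word probabilities (all names are mine). The hard-EM E-step would instead collapse each row of responsibilities to its argmax.

import numpy as np
from scipy.special import logsumexp

def e_step_soft(counts, log_rho, log_mu):
    # counts: (W, N) term-document counts; log_rho: (K,); log_mu: (K, W)
    # unnormalised log responsibility: log rho_k + sum_w counts[w, n] * log mu[k, w]
    log_gamma = log_rho[:, None] + log_mu @ counts                # (K, N)
    log_gamma -= logsumexp(log_gamma, axis=0, keepdims=True)      # normalise per document
    return np.exp(log_gamma).T                                    # (N, K), rows sum to 1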
Category: Data Science

N-Gram Linear Smoothing

In slide 61 of the NLP text, to smooth the n-gram probabilities we need to find the lambdas that maximize the probability of a held-out set, written in terms of M(λ1, λ2, ... λ_k). What does this notation mean? Also, it says that "One way is to use the EM algorithm, an iterative learning algorithm that converges on locally optimal λs". Can someone refer me to a good example? Say the training text is "Sam I am Sam I do …
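As a small worked example of EM for the interpolation weights (toy numbers of my own, not from the slides): each held-out token gets a probability under each component model; the E-step computes how responsible each model is for each token, and the M-step sets the new lambdas to the average responsibilities.

import numpy as np

p_uni = np.array([0.10, 0.05, 0.20, 0.08])   # unigram prob of each held-out token
p_bi  = np.array([0.30, 0.01, 0.25, 0.40])   # bigram prob of the same tokens
lam = np.array([0.5, 0.5])                   # initial lambda_uni, lambda_bi

for _ in range(50):
    # E-step: posterior responsibility of each model for each token
    joint = np.vstack([lam[0] * p_uni, lam[1] * p_bi])      # (2, N)
    resp = joint / joint.sum(axis=0, keepdims=True)
    # M-step: new lambdas are the average responsibilities
    lam = resp.mean(axis=1)

print(lam)   # the locally optimal interpolation weights for this held-out set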
Category: Data Science

Gaussian Mixture Models Clustering

When using the EM algorithm for Gaussian Mixture Models (GMM), in the E-step we use every datapoint x in the training dataset, and then calculate and update the "weight" and the parameters of each cluster's Gaussian distribution (M-step). I have read that we do this until it converges. I am a little confused here. Does that mean it loops through the whole training dataset X every time in "one step" of the EM algorithm? Or does "one step" correspond to …
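In the usual formulation, "one step" (one iteration) of EM means a full E-step over every point in X followed by a single M-step, and that whole cycle repeats until convergence. A minimal 1-D sketch of one such iteration on synthetic data:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

# E-step: responsibilities for EVERY point in the dataset
dens = np.vstack([w[k] * norm.pdf(X, mu[k], np.sqrt(var[k])) for k in range(2)])
resp = dens / dens.sum(axis=0, keepdims=True)          # (2, N)

# M-step: one update of all parameters using those responsibilities
Nk = resp.sum(axis=1)
w = Nk / len(X)
mu = resp @ X / Nk
var = (resp * (X[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk
print(w, mu, var)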
Category: Data Science

Best Python library for training using a Hidden Markov model with Gaussian mixtures

I would like to train my data using HMM-GMM (the Baum-Welch approach with Gaussian mixture emissions) to find the parameters best suited to my data. Note: my data is continuous, not discrete. I tried hmmlearn from scikit-learn, but I believe it does not support a continuous HMM-GMM model; when I tried it with discrete data, it works fine. I tried to use pomegranate, but I cannot understand the documentation, and I am also not sure whether it …
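For what it's worth, hmmlearn does ship a GMMHMM class for continuous observations (Gaussian mixture emissions per hidden state, trained with Baum-Welch). A minimal sketch, assuming a recent hmmlearn release and using synthetic data as a stand-in for the real continuous observations:

import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # 500 frames, 3 continuous features
lengths = [250, 250]                     # two separate sequences concatenated in X

model = GMMHMM(n_components=4, n_mix=2, covariance_type="diag",
               n_iter=100, random_state=0)
model.fit(X, lengths)

print(model.transmat_)                   # learned state transition matrix
print(model.score(X[:250]))              # log-likelihood of the first sequence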
Category: Data Science

Does feature normalization improve performance of Hidden Markov Models?

For training a Hidden Markov Model (HMM) on a multivariate, continuous time series, is it preferable to scale the data somehow? Some pre-processing steps may be: normalize to 0-mean and unit variance; scale to the [-1, 1] interval; scale to the [0, 1] interval. With neural networks, the rationale behind scaling is to get an "un-squished" error surface that is easier to navigate in. HMMs use the Baum-Welch algorithm, which is a variation on the Expectation Maximization (EM) algorithm, to learn parameters. Is …
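A minimal sketch of the first option, z-scoring a multivariate series before fitting a Gaussian-emission HMM. Standardisation is an invertible affine transform, so a Gaussian HMM can in principle represent the same model either way; the question is mainly about numerics and initialisation.

import numpy as np
from sklearn.preprocessing import StandardScaler
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
X = rng.normal(loc=[0, 100], scale=[1, 50], size=(1000, 2))   # features on very different scales

X_scaled = StandardScaler().fit_transform(X)                  # 0-mean, unit-variance per feature
model = GaussianHMM(n_components=3, n_iter=50, random_state=0).fit(X_scaled)
print(model.means_)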
Category: Data Science

Can Expectation Maximization estimate truth and confusion matrix from multiple noisy sources?

Suppose we have $m$ sources, each of which noisily observe the same set of $n$ independent events from the outcome set $\{A,B,C\}$. Each source has a confusion matrix, for example for source $i$: $$C_i = \begin{bmatrix} 0.98 & 0.01 & 0.07 \\ 0.01 & 0.97 & 0.00 \\0.01 & 0.02 & 0.93\end{bmatrix} $$ where each column relates to the truth, and each row relates to the observation. E.g., if the true event is $B$ then source $i$ will get it …
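This is essentially the Dawid & Skene setup, and EM can indeed treat the true outcome of each event as the latent variable. A compact sketch on synthetic data (confusion matrices indexed as [observation, truth] to match the convention above; the posterior is initialised from a majority vote so EM does not start at a symmetric fixed point):

import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 500, 3
truth = rng.integers(0, K, n)
obs = np.where(rng.random((m, n)) < 0.85, truth,           # sources right ~85% of the time
               rng.integers(0, K, (m, n)))

post = np.zeros((n, K))                                     # majority-vote initialisation
for i in range(m):
    post[np.arange(n), obs[i]] += 1
post /= post.sum(axis=1, keepdims=True)

for _ in range(50):
    # M-step: class prior and per-source confusion matrices from expected counts
    pi = post.mean(axis=0)
    C = np.empty((m, K, K))                                 # C[i, o, t] = P(source i says o | truth t)
    for i in range(m):
        for o in range(K):
            C[i, o] = post[obs[i] == o].sum(axis=0) + 1e-6  # smoothing avoids log(0)
        C[i] /= C[i].sum(axis=0, keepdims=True)
    # E-step: posterior over the true outcome of each event
    log_post = np.log(pi)[None, :] + sum(np.log(C[i, obs[i], :]) for i in range(m))
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

print((post.argmax(axis=1) == truth).mean())                # fraction of events recovered correctly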
Category: Data Science

How to interpret the means of the output clusters for expectation-maximization?

I am trying to cluster data using scikit-learn's expectation-maximization. I created two different data sets from normal distributions, which I have shown in the graph below. The mean of each distribution is:

Mean of distr-1: 0.0037523503071361197
Mean of distr-2: -0.4384554574756237

But after I run EM using scikit-learn, I get the means as follows:

Mean after EM: [[-0.12327634  0.39188704]
               [-1.31191255 -4.4292102 ]]

How am I supposed to interpret these means? I am trying to create …
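One thing worth checking is the shape of the result: scikit-learn's GaussianMixture stores means_ with one row per component and one column per feature, so fitting the two samples together as a single one-dimensional mixture should give a (2, 1) array. A sketch with synthetic stand-ins for the two distributions:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 1000),
                    rng.normal(-0.44, 1.0, 1000)]).reshape(-1, 1)   # shape (N, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.means_)        # shape (2, 1): one mean per component for the single feature
print(gmm.weights_)      # mixing proportions of the two components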
Category: Data Science

Code or Package to cluster sequences (or time series) of different lengths based on HMM?

Is there any existing code or package in Python, R, Java, Matlab, or Scala that implements the sequence clustering algorithms in either of the following two papers? 1) 'Clustering Sequences with Hidden Markov Models' by Padhraic Smyth (1997): https://papers.nips.cc/paper/1217-clustering-sequences-with-hidden-markov-models.pdf The paper gives a probabilistic model-based approach to clustering sequences (or time series), using hidden Markov models (HMMs). 2) 'Visual Cluster Exploration of Web Clickstream Data' by Jishang Wei, Zeqian Shen, Neel Sundaresan, Kwan-Liu Ma (2012): http://www.cs.tufts.edu/comp/250VIS/papers/VAST2012-ClickStream.pdf The paper is quite …
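Not a packaged implementation, but a rough Python sketch in the spirit of Smyth (1997): fit a small HMM to each sequence, build a symmetrised log-likelihood distance matrix, and cluster it hierarchically. Sequences here are synthetic and of different lengths; hmmlearn and scipy are assumed available.

import numpy as np
from hmmlearn.hmm import GaussianHMM
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
seqs = [rng.normal(0, 1, (rng.integers(40, 80), 1)) for _ in range(10)] + \
       [rng.normal(5, 1, (rng.integers(40, 80), 1)) for _ in range(10)]

models = [GaussianHMM(n_components=2, n_iter=20, random_state=0).fit(s) for s in seqs]

n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            # per-sample negative log-likelihood of sequence j under model i
            D[i, j] = -models[i].score(seqs[j]) / len(seqs[j])
D = (D + D.T) / 2                                            # symmetrise

Z = linkage(D[np.triu_indices(n, 1)], method="average")      # condensed distance matrix
print(fcluster(Z, 2, criterion="maxclust"))                  # cluster label per sequence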
Category: Data Science

Statistical machine translation word alignment for FR-ENG and ENG-FR: what is p(e) and p(f)?

I'm currently trying to implement this paper, but am struggling to understand some of the math. I'm pretty sure I understand how to implement the E-step, but I'm confused about how to compute the M-step. It says just before section 3.1 that $p_1(x, z; \theta_1) = p(e)p(a, f|e; \theta_1)$, and then the same for $p_2$ but with $e$ and $f$ swapped. The second part of this makes sense to me, but what is $p(e)$ or $p(f)$? …
Category: Data Science

EM clustering with missing and misspelled data

I am currently working on a project that requires me to cluster unlabeled input. The records contain personal information such as name, DOB, height, sex, etc. We need to cluster records for the same person into one group; here is the sample data:

+------------------------------------+
|               Record1    Record2   |
+------------------------------------+
| First Name    'Harry'    'Harry'   |
| Middle Name   'Jay'      'J'       |
| Last Name     'Potter'   'Potter'  |
| DOB Month     1          1         |
| DOB Day       1          1         |
| DOB …
Category: Data Science

Does K-Means' objective function imply the distance metric is Euclidean?

The objective/loss function of the K-Means algorithm is to minimize the sum of squared distances; written in math form, it looks like this: $$J(X,Z) = \min\ \sum_{z\in \text{Clusters}}\sum_{x \in \text{data}}\|x-z\|^2$$ If we have a different distance metric, for instance cosine (I realize there's a conversion between cosine and Euclidean, but let's forget it for now), Manhattan, etc., does it mean we will have a different loss function? That is, the traditional K-Means procedure based on expectation maximization won't work, right? Because …
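Right, a different metric changes the update step as well, not just the assignment step. A small numpy sketch of the Manhattan-distance case, where the cost-minimising centre update is the coordinate-wise median (essentially k-medians) rather than the mean:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
centers = X[rng.choice(len(X), 2, replace=False)]

for _ in range(20):
    # assignment step: nearest centre under the L1 (Manhattan) distance
    d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)   # (N, k)
    assign = d.argmin(axis=1)
    # update step: the coordinate-wise median minimises the summed L1 distance
    centers = np.array([np.median(X[assign == k], axis=0) for k in range(2)])

print(centers)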
Category: Data Science

How to compare the performance of different number of mixing components for EM algorithm?

I am reading about the EM (Expectation-Maximization) algorithm in a machine learning book. In the closing remarks of the chapter, the authors mention that we cannot decide the "optimality" of the number of components (the number of Gaussian distributions in the mixture) based on each model's final log likelihood, since models with more parameters will inevitably describe the data better. Therefore, my questions are: 1) How do we compare the performance of models using different numbers of components? 2) …
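One standard answer to 1) is to refit the mixture for a range of component counts and compare a penalised criterion such as BIC or AIC, which add a complexity penalty to the log likelihood. A scikit-learn sketch on synthetic data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (300, 2))])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1), round(gmm.aic(X), 1))   # lower is better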
Category: Data Science

Hidden Markov Models: Linking states to labels after EM training

The tl;dr version first: I have the following problem. I implemented Baum-Welch for ergodic HMMs. I do it like this: I pass the model two numbers, C1 and C2, and it builds a fully connected state machine with C1 states and C2 emissions. I map all tokens from my training data onto the range [0, C2), and each label the HMM is supposed to assign to a token during inference onto [0, C1). Then the HMM goes ahead and does Baum …
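A common way to link states to labels afterwards is to decode a small labelled sample with the trained model and give each hidden state the label it co-occurs with most often. A sketch of that mapping step (the arrays here are stand-ins; the real ones would come from Viterbi-decoding the labelled tokens):

import numpy as np
from collections import Counter

def map_states_to_labels(states, gold_labels, n_states):
    # states, gold_labels: aligned 1-D arrays from a small labelled sample
    mapping = {}
    for s in range(n_states):
        labels_for_s = gold_labels[states == s]
        mapping[s] = (Counter(labels_for_s.tolist()).most_common(1)[0][0]
                      if len(labels_for_s) else None)
    return mapping

states = np.array([0, 0, 1, 1, 2, 2, 2])   # decoded hidden states
gold   = np.array([3, 3, 5, 5, 5, 7, 7])   # gold labels for the same tokens
print(map_states_to_labels(states, gold, n_states=3))   # {0: 3, 1: 5, 2: 7}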
Category: Data Science
