I was working with a data set on which I wanted to solve non-negative least squares (NNLS) while obtaining a sparse model. After a bit of experimenting, I found that the following loss function worked best for me: $$\min_{x \geq 0} ||Ax-b|| + \lambda_1||x||_2^2+\lambda_2||x||_1^2$$ where the squared L2 penalty was implemented by adding white noise with a standard deviation of $\sqrt{\lambda_1}$ to $A$ (which can be shown to be equivalent to ridge regression …
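For reference, here is a minimal sketch of how both penalties can be folded into a plain NNLS solve by row augmentation instead of the noise trick, assuming the data-fit term is squared; the function name is hypothetical and `scipy.optimize.nnls` does the constrained solve:

```python
import numpy as np
from scipy.optimize import nnls

def sparse_nnls(A, b, lam1, lam2):
    """Sketch: min_{x>=0} ||Ax - b||^2 + lam1*||x||_2^2 + lam2*||x||_1^2.
    - lam1*||x||_2^2  <->  append sqrt(lam1)*I rows to A and zeros to b (exact ridge form).
    - lam2*||x||_1^2  <->  for x >= 0, ||x||_1 = 1^T x, so the penalty is a single
      extra row of sqrt(lam2)*ones with target 0.
    """
    m, n = A.shape
    A_aug = np.vstack([A,
                       np.sqrt(lam1) * np.eye(n),
                       np.sqrt(lam2) * np.ones((1, n))])
    b_aug = np.concatenate([b, np.zeros(n + 1)])
    x, _ = nnls(A_aug, b_aug)
    return x
```

The augmentation reproduces the squared penalties exactly, whereas adding white noise to $A$ only matches the ridge term in expectation.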
I have been reading about weight sparsity and activity sparsity with regard to convolutional neural networks. Weight sparsity I understood as having more trainable weights be exactly zero, which would essentially mean having fewer connections, allowing for a smaller memory footprint and quicker inference on test data. Additionally, it would help against overfitting (which I understand in terms of smaller weights leading to simpler models/Ockham's razor). From what I understand now, activity sparsity is analogous in that it would lead …
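A small sketch of how the two notions differ in practice, assuming Keras-style L1 regularization (the layer sizes and the 1e-4 strength are illustrative only): a penalty on the weights pushes individual connections to zero, while a penalty on the outputs pushes activations to zero.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Weight sparsity: L1 on the kernel drives individual weights to exactly zero.
weight_sparse = layers.Dense(64, activation='relu',
                             kernel_regularizer=regularizers.l1(1e-4))

# Activity sparsity: L1 on the layer output drives most activations to zero per input.
activity_sparse = layers.Dense(64, activation='relu',
                               activity_regularizer=regularizers.l1(1e-4))
```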
I am trying to train an autoencoder for dimensionality reduction and, hopefully, for anomaly detection. My data specifications are as follows: unlabeled, 1 million data points, 9 features. I am trying to reduce it to 2 compressed features so I can have better visualization for clustering. My autoencoder is as follows, where latent_dim = 2 and input_dim = 9:

class Autoencoder(tf.keras.Model):
    def __init__(self, latent_dim, input_dim):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        self.input_dim = input_dim
        self.dropout_factor = 0.5
        self.encoder = Sequential([
            # Dense(16, activation='relu', …
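Since the model is cut off above, here is a hedged completion for reference: a small symmetric 9 → 2 → 9 autoencoder. The layer widths, the linear output layer, and the omission of dropout are assumptions, not the asker's actual architecture.

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

class Autoencoder(tf.keras.Model):
    def __init__(self, latent_dim=2, input_dim=9):
        super().__init__()
        self.encoder = Sequential([
            Dense(16, activation='relu', input_shape=(input_dim,)),
            Dense(8, activation='relu'),
            Dense(latent_dim),            # 2-D bottleneck used for visualization
        ])
        self.decoder = Sequential([
            Dense(8, activation='relu'),
            Dense(16, activation='relu'),
            Dense(input_dim),             # linear reconstruction of the 9 features
        ])

    def call(self, x):
        return self.decoder(self.encoder(x))

# model = Autoencoder()
# model.compile(optimizer='adam', loss='mse')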
Sparse methods such as LASSO contain a parameter $\lambda$ which is associated with the minimization of the $l_1$ norm. The higher the value of $\lambda$ ($>0$), the more coefficients are shrunk to zero. What is unclear to me is how this method decides which coefficients to shrink to zero. If $\lambda = 0.5$, does it mean that those coefficients whose values are less than or equal to 0.5 will become zero? So in other words, whatever …
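As a worked special case (not the general algorithm), assume the columns of $X$ are orthonormal and the objective is $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$; then each LASSO coefficient is the soft-thresholded OLS coefficient: $$\hat\beta_j^{\text{lasso}} = \operatorname{sign}\bigl(\hat\beta_j^{\text{OLS}}\bigr)\,\max\bigl(|\hat\beta_j^{\text{OLS}}| - \lambda,\ 0\bigr).$$ So with $\lambda = 0.5$ it is the coefficients whose OLS estimates have absolute value at most 0.5 that become exactly zero, while the survivors are shrunk toward zero by 0.5; the rule acts on the fitted estimates, not as a fixed cutoff applied to the final coefficients.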
I didn't get the meaning of, or the difference between, sparse and dense corpora in this sentence: "the reason is that Skip-gram works better over sparse corpora like Twitter and NIPS, while CBOW works better over dense corpora".
I have a 2M x 2000 sparse matrix where rows represent items and columns represent dimensions. I want to understand whether there are meaningful clusters in the data, and I started to explore the dimensions to transform and normalise the data. Of the 2000 attributes of an item, many are strongly correlated (rho > .5). Are there clustering techniques that handle correlated attributes well automatically, without having to remove them manually?
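One common way to sidestep manual removal is to project the correlated columns onto a smaller set of orthogonal components before clustering. A minimal sketch with scikit-learn, assuming a scipy sparse input matrix `X`; `n_components` and `n_clusters` are placeholders to tune, not recommendations:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.pipeline import make_pipeline

# TruncatedSVD works directly on sparse input and collapses redundant,
# correlated columns into orthogonal components; MiniBatchKMeans scales
# to the 2M rows better than plain k-means.
pipeline = make_pipeline(
    TruncatedSVD(n_components=100, random_state=0),
    Normalizer(),                        # optional: unit-length rows for k-means
    MiniBatchKMeans(n_clusters=20, random_state=0),
)
# labels = pipeline.fit_predict(X)       # X: 2M x 2000 scipy.sparse matrix
```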
I have 3D structure data of molecules. I represented the atoms as points in a 100*100*100 grid and applied a Gaussian blur to counter the sparseness (nearly all of the grid cells contain zeros). I am trying to build an autoencoder to get a meaningful "molecule structure to vector" encoder. My current approach is to use convolutional and max-pooling layers, then a flatten layer and a few dense layers to get a vector representation. Then I reshape and increase the dimension again …
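A minimal sketch of that kind of 3D conv autoencoder for a 100x100x100x1 density grid, using strided convolutions in place of max-pooling so the decoder shapes mirror the encoder exactly; the channel counts, the 128-D code size, and the large dense layers are assumptions, not the asker's actual model:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 128
inputs = layers.Input(shape=(100, 100, 100, 1))
x = layers.Conv3D(4, 3, strides=2, padding='same', activation='relu')(inputs)    # -> 50^3 x 4
x = layers.Conv3D(8, 3, strides=2, padding='same', activation='relu')(x)         # -> 25^3 x 8
x = layers.Flatten()(x)
z = layers.Dense(latent_dim, name='molecule_vector')(x)                          # structure-to-vector code

x = layers.Dense(25 * 25 * 25 * 8, activation='relu')(z)                         # heavy; fine for a sketch
x = layers.Reshape((25, 25, 25, 8))(x)
x = layers.Conv3DTranspose(4, 3, strides=2, padding='same', activation='relu')(x)  # -> 50^3 x 4
outputs = layers.Conv3DTranspose(1, 3, strides=2, padding='same')(x)               # -> 100^3 x 1

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
```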
I have just learned about a general framework in constrained optimization called "proximal gradient optimization". It is interesting that the $\ell_0$ "norm" is also associated with a proximal operator. Hence, one can apply the iterative hard-thresholding algorithm to get a sparse solution of $$\min \Vert Y-X\beta\Vert_F + \lambda \Vert \beta \Vert_0$$ If so, why are people still using $\ell_1$? If you can just get the result by non-convex optimization directly, why are people still using LASSO? I …
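For concreteness, a short sketch of iterative hard thresholding, assuming a vector response and a squared data-fit term $\frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_0$ (not the unsquared Frobenius form written above):

```python
import numpy as np

def iterative_hard_thresholding(X, Y, lam, n_iter=500):
    """Sketch of IHT for min_beta 0.5*||Y - X beta||_2^2 + lam*||beta||_0.
    The prox of t*lam*||.||_0 zeroes every coordinate with |v_i| <= sqrt(2*t*lam)."""
    n, p = X.shape
    t = 1.0 / np.linalg.norm(X, 2) ** 2              # step = 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        v = beta - t * (X.T @ (X @ beta - Y))        # gradient step on the smooth part
        v[np.abs(v) <= np.sqrt(2 * t * lam)] = 0.0   # hard-thresholding prox of the l0 penalty
        beta = v
    return beta
```

Unlike the soft-thresholding prox of the $\ell_1$ penalty, this hard-thresholding step comes from a non-convex objective, so the iterates are only guaranteed to reach a local solution.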