How to keep only significant weights in an ANN

My weights are stored in a two-dimensional matrix. Row i refers to node i in the preceding layer, and the columns in that row are the neurons that node i is connected to. I only want to keep some nodes. How do I pick the 3 largest weights and store them in a separate array while keeping track of which neuron each one belonged to? Moreover, is it tested in theory that some weights contribute more than the others?

Topic neural-network

Category Data Science

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (Frankle and Carbin, ICLR 2019) is a seminal paper arguing that a dense, randomly initialized network contains a much smaller subnetwork that, when trained in isolation from its original initialization, can match the accuracy of the full network. They describe their approach to pruning a neural network in the paper: the weights with the smallest magnitudes are removed. However, they are keeping the best weights, not the best nodes.
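As a rough illustration of pruning by weight magnitude, here is a minimal sketch of a single pruning step. It is not the paper's full procedure, which prunes iteratively and resets the surviving weights to their initial values before retraining. The function name `magnitude_prune`, the `sparsity` parameter, and the example matrix are my own assumptions for illustration.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes.

    Returns the pruned matrix and a boolean mask (True = weight kept).
    """
    weights = np.asarray(weights, dtype=float)
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        # Flat indices of the n_prune smallest-magnitude entries
        prune_idx = np.argsort(np.abs(weights).ravel(), kind="stable")[:n_prune]
        mask[prune_idx] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

# Hypothetical 2x3 weight matrix
W = np.array([[0.5, -0.05, 0.2],
              [-0.8, 0.01, 0.3]])
pruned, mask = magnitude_prune(W, sparsity=0.5)
print(pruned)
print(mask)
```

The mask is worth keeping around: if you retrain after pruning, you would multiply the weights by it after every update so that the removed connections stay removed.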

The important factor affecting a neuron's output is not the absolute size of the weights, but the balance of weights (i.e. how each input is weighted relative to the other inputs).

I suggest that if you’re looking to keep the most important weights, you take the approach in the above paper. If you want to keep the best neurons, look at the relative importance of each neuron by looking at the relative weight assigned to it by neurons in the following layer.
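One possible way to turn "relative weight assigned by the following layer" into a number is sketched below. It assumes the layout from the question (row i = node i in the preceding layer, column j = neuron j in the following layer). For each following neuron it computes the share of that neuron's total absolute incoming weight that comes from node i, then averages those shares over the following neurons. The scoring rule, the name `neuron_importance`, and the example matrix are assumptions of mine, not an established method.

```python
import numpy as np

def neuron_importance(weights):
    """Score each preceding-layer node by its average share of the
    absolute incoming weight of every neuron in the following layer."""
    abs_w = np.abs(np.asarray(weights, dtype=float))
    col_totals = abs_w.sum(axis=0, keepdims=True)
    # Share of column j's total that comes from row i (0 where a column is all zeros)
    shares = np.divide(abs_w, col_totals,
                       out=np.zeros_like(abs_w), where=col_totals > 0)
    return shares.mean(axis=1)

# Hypothetical matrix: 3 nodes in the preceding layer, 2 neurons in the next
W = np.array([[3.0, 1.0],
              [0.5, -1.0],
              [-0.5, 2.0]])
importance = neuron_importance(W)

# Keep the 2 highest-scoring nodes, remembering their original indices
keep = np.sort(np.argsort(-importance)[:2])
W_kept = W[keep, :]
print(importance, keep)
```

Because the scores are shares, they add up to 1 across the nodes whenever no column is all zeros, which makes them easy to compare between layers of different sizes.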

Yes, some weights contribute more than others, but how are you going to get the significance of each neuron and its weights?

I believe there exists a simple way to learn a weight matrix in which only the significant weights remain, for any ANN architecture: train your model with L1 regularization. L1 regularization adds a penalty proportional to the sum of the absolute values of the weights. This pushes the weights of unimportant features toward zero and leaves the larger values on the most significant input features, so the weight matrix stays sparse.

Built-in feature selection [1]: This is frequently mentioned as a useful property of the L1-norm, which the L2-norm does not have. It is actually a result of the L1-norm tending to produce sparse coefficients (explained below). Suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that “the other 90 predictors are useless in predicting the target values”. The L2-norm produces non-sparse coefficients, so it does not have this property.

L1 regularization explained [2]:

In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
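To make the sparse versus diffuse contrast concrete, here is a small sketch that fits a linear model with an L1 penalty and with an L2 penalty using proximal gradient descent. Everything in it is an assumption for illustration: the tiny dataset is made up (its columns are orthogonal so the result is easy to check by hand), and `proximal_gradient` is a hypothetical helper, not a library function. A real network would use the regularization options of your training framework instead.

```python
import numpy as np

def proximal_gradient(X, y, lam, penalty, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 plus an L1 or L2 penalty of strength lam."""
    n, d = X.shape
    # Step size = 1 / Lipschitz constant of the gradient of the squared-error term
    step = n / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        v = w - step * grad
        if penalty == "l1":
            # Soft-thresholding: anything within step*lam of zero becomes exactly 0
            w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
        elif penalty == "l2":
            # Penalty (lam/2)||w||^2: uniform shrinkage, never exactly 0
            w = v / (1.0 + step * lam)
        else:
            raise ValueError("penalty must be 'l1' or 'l2'")
    return w

# Made-up data: only features 0 and 2 actually drive the target
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0],
              [1.0, 1.0, -1.0, -1.0],
              [1.0, -1.0, -1.0, 1.0]])
w_true = np.array([2.0, 0.0, -1.5, 0.0])
noise = np.array([0.1, -0.1, 0.1, 0.1])
y = X @ w_true + noise

w_l1 = proximal_gradient(X, y, lam=0.1, penalty="l1")
w_l2 = proximal_gradient(X, y, lam=0.1, penalty="l2")
print("L1:", w_l1, "non-zero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zero:", np.count_nonzero(w_l2))
```

With the L1 penalty the two noise features end up at exactly zero, while the L2 penalty only shrinks them, which is the behaviour both quotes describe.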

I'll address your last question first:

is it tested in theory that some weights contribute more than the others?

When I was learning about NNs, I thought it would be better to have a really large number of nodes in the first hidden layer and then reduce the number over subsequent hidden layers. It is conceivable that many of the weights in this configuration approach zero, but that observation misses two main ideas. First, the values sent into the activation function are additive, so even a small weight can still make a difference if its input is large, whereas a weight of exactly 0 contributes nothing, since 0 times anything is 0. Second, the nodes are not assigned a specific responsibility, so I assume that, unless you start with exactly the same initial weights, a node may learn different features every time you train the model.
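A quick back-of-the-envelope check of the first point, with made-up numbers: the weight that looks negligible ends up contributing most of the value that reaches the activation function, because its input is large.

```python
# Hypothetical neuron with two inputs (all values invented for illustration)
weights = [0.9, 0.01]
inputs = [1.0, 500.0]

# Each input's contribution to the pre-activation sum is weight * input
contributions = [w * x for w, x in zip(weights, inputs)]
pre_activation = sum(contributions)

print(contributions)   # the 0.01 weight contributes about 5, the 0.9 weight only 0.9
print(pre_activation)
```

So ranking weights by magnitude alone only makes sense when the inputs are on comparable scales, for example after normalization.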

Subsequently, I have come to the conclusion that, if many of the weights are approaching zero, I probably have too many nodes in my hidden layer. This does not necessarily take the activation function into account, but it is how I approach model architecture.

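Having covered that, you could just convert the matrix to a sparse matrix, sort its entries by value, and delete everything below your cutoff. A coordinate-style sparse format stores each value together with its row and column index, so you keep track of which neuron each weight belongs to. I believe this would accomplish what you asked.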
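For the specific case in the question (the 3 largest weights per node, with their neuron indices), a dense NumPy version may be simpler than going through a sparse matrix. This is a minimal sketch under two assumptions of mine: "largest" means largest in absolute value, and row i holds the outgoing weights of node i as described in the question. The function name `top_k_per_row` and the example matrix are made up.

```python
import numpy as np

def top_k_per_row(weights, k=3):
    """Return (values, column_indices) of the k largest-magnitude weights
    in each row, ordered from largest to smallest magnitude."""
    weights = np.asarray(weights, dtype=float)
    # Sorting -|w| ascending puts the largest magnitudes first in every row
    order = np.argsort(-np.abs(weights), axis=1, kind="stable")[:, :k]
    values = np.take_along_axis(weights, order, axis=1)
    return values, order

# Hypothetical 2x5 weight matrix: row i = node i in the preceding layer
W = np.array([[0.1, -0.9, 0.3, 0.05, 0.7],
              [0.4, 0.2, -0.6, 0.8, -0.1]])
values, cols = top_k_per_row(W, k=3)

# Optional: a matrix of the same shape with everything else set to zero
pruned = np.zeros_like(W)
np.put_along_axis(pruned, cols, values, axis=1)

print(values)  # kept weights, one row per node
print(cols)    # cols[i, j] is the neuron that values[i, j] connects to
```

If you want the largest signed values rather than the largest magnitudes, drop the `np.abs` call. `cols` is the separate bookkeeping array the question asks for.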

However, I would ask: why do you want to save just three of the weights? It is inconceivable to me how reducing the size of the weight matrices is of any benefit. If you can get the same performance after doing so, I think you should address your model architecture. I am not aware of any specific research that would address whether this is an acceptable practice, but it just seems like a waste of effort to train a model and then throw out a majority of the model. Simplify the model.


