I'll address your last question first:
> is it tested in theory that some weights contribute more than the others?
When I was learning about NNs, I thought it would be better to have a very large number of nodes in the first hidden layer and then reduce the size over subsequent hidden layers. It is conceivable that many of the weights in such a configuration end up close to zero, but that intuition misses two main ideas: first, the values fed into the activation function are summed, so even a small weight can still make a difference if its input is large, whereas a weight of exactly zero contributes nothing; second, nodes are not assigned a specific responsibility, so unless you start with exactly the same initial weights, a given node may learn different features every time you train the model.
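As a minimal sketch of the first point (the numbers are made up for illustration), a small but nonzero weight can still produce a meaningful pre-activation value when its input is large, while a zero weight contributes nothing no matter what the input is:

```python
import numpy as np

# Pre-activation for a single node: z = w . x + b.
x = np.array([1000.0, 1000.0, 1000.0])      # large inputs (illustrative values)
w_small = np.array([0.001, 0.002, 0.0015])  # small but nonzero weights
w_zero = np.zeros(3)                        # weights that are exactly zero
b = 0.0

z_small = w_small @ x + b   # 0.001*1000 + 0.002*1000 + 0.0015*1000 = 4.5
z_zero = w_zero @ x + b     # always 0.0

print(z_small, z_zero)      # 4.5 0.0
```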
Since then, I have come to the conclusion that if many of the weights are close to zero, I probably have too many nodes in that hidden layer. This does not take the choice of activation function into account, but it is how I approach model architecture.
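If you want to apply that diagnostic to your own network, one rough way is to look at the fraction of weights whose magnitude falls below some small threshold. This is only a sketch: it assumes you have already pulled the weight matrix out as a NumPy array (for example via Keras's `model.get_weights()`), and the threshold is arbitrary.

```python
import numpy as np

def near_zero_fraction(W, threshold=1e-3):
    """Fraction of entries in W with magnitude below the threshold."""
    W = np.asarray(W)
    return np.mean(np.abs(W) < threshold)

# Placeholder weight matrix standing in for a trained layer's weights.
W = np.random.randn(784, 512) * 0.01
print(f"{near_zero_fraction(W):.1%} of weights below threshold")
```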
Having covered that: you could convert the weight matrix to a sparse matrix, sort its entries by absolute value, and drop everything below your cutoff. I believe that would accomplish what you asked.
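Here is a sketch of that thresholding idea, assuming the weights are available as a NumPy array (the matrix and cutoff below are placeholders): zero out the small-magnitude entries, then store what remains in a SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse

W = np.random.randn(512, 256) * 0.05   # placeholder weight matrix
cutoff = 0.02                          # placeholder magnitude cutoff

W_pruned = np.where(np.abs(W) >= cutoff, W, 0.0)  # drop small-magnitude weights
W_sparse = sparse.csr_matrix(W_pruned)            # keep only the nonzero entries

print(W_sparse.nnz, "weights kept out of", W.size)
```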
However, I would ask why you want to keep just three of the weights. I do not see how shrinking the weight matrices after the fact is of much benefit. If you can get the same performance after discarding most of the weights, I think you should revisit your model architecture. I am not aware of any specific research on whether this is an acceptable practice, but it seems like a waste of effort to train a model and then throw most of it away. Simplify the model instead.
HTH