Updating Weights Using Updates on Related Data

Suppose $$ x=Ay $$

Here $x$ is $M\times 1$, $y$ is $N \times 1$, and $A$ is $M\times N$.

We have the data $x$ and would like to know what $y$ is.

However, the matrix $A$ is too large to compute a pseudo-inverse, so we would like to approximate $A^{-1}$ using machine learning, since that approach can be parallelized.

For parallelization, we divide the given problem into blocks: $$ x^l = A^l y $$ where $x = [x^1, x^2,\dots,x^L]^T$ and, similarly, $A = [A^1,A^2,\dots,A^L]^T$

So, $x^l$ is $\frac{M}{L}\times 1$ and $A^l$ is $\frac{M}{L}\times N$
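
For concreteness, here is a minimal NumPy sketch of this block partition; the sizes $M$, $N$, $L$ and the random $A$ and $y$ are assumptions purely for illustration.

```python
import numpy as np

# Assumed illustrative sizes: M measurements, N unknowns, L blocks (M divisible by L)
M, N, L = 400, 50, 4
rng = np.random.default_rng(0)

A = rng.standard_normal((M, N))   # full system matrix, M x N
y = rng.standard_normal((N, 1))   # unknown vector (known here only to simulate x)
x = A @ y                         # observed data, M x 1

# Partition x and A into L row blocks: x^l is (M/L) x 1, A^l is (M/L) x N
x_blocks = np.split(x, L, axis=0)
A_blocks = np.split(A, L, axis=0)

assert x_blocks[0].shape == (M // L, 1)
assert A_blocks[0].shape == (M // L, N)
```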

Each block can thus be arranged as $$y = W^l x^l$$ and the $W^l$ can be trained using gradient descent.

$$ W^l_{i+1} = W^l_{i} + \eta\left(y_i-W^l_ix^l_i\right){x_i^l}^T $$

Here, $W^l$ is $N \times \frac{M}{L}$ and $W = [ W^1,W^2,\dots,W^L ] $
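
As a minimal sketch, the per-block update above could be run as follows in NumPy, assuming we can generate training pairs $(y_i, x_i)$ from the model $x = Ay$; the sizes, step size $\eta$, and iteration count are assumptions for illustration.

```python
import numpy as np

# Assumed illustrative sizes and training setup
M, N, L = 400, 50, 4
eta, num_iters = 1e-3, 2000       # assumed step size and iteration count
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))   # system matrix (used here only to generate training pairs)
A_blocks = np.split(A, L, axis=0)

# One weight matrix per block: W^l is N x (M/L)
W_blocks = [np.zeros((N, M // L)) for _ in range(L)]

for i in range(num_iters):
    # Assumed training data: draw y_i and form the corresponding block measurements x^l_i
    y_i = rng.standard_normal((N, 1))
    for l in range(L):
        x_l_i = A_blocks[l] @ y_i
        # W^l_{i+1} = W^l_i + eta * (y_i - W^l_i x^l_i) (x^l_i)^T
        err = y_i - W_blocks[l] @ x_l_i
        W_blocks[l] += eta * err @ x_l_i.T

# Stack the blocks as W = [W^1, W^2, ..., W^L], which is N x M
W = np.hstack(W_blocks)
assert W.shape == (N, M)
```

In the actual setting each block would only see its own $x^l$; the shared $A$ appears in this sketch only to simulate the training pairs.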

Here we can apply this update directly for every $l^\mathrm{th}$ block. However, since diffusion techniques converge to a lower mean-square deviation (MSD), I would like to employ them.

However, $W^l$ and $W^k$ are not directly related, so diffusion techniques cannot be applied to them directly; combining them would require matrix operations. And if we could have afforded inverse matrix operations, we would not have moved to gradient descent in the first place.

Could you suggest a way to employ diffusion techniques here without needing to apply inverse operations?

