Confusion with L2 Regularization in Back-propagation

In very simple language, this is L2 regularization: $Loss_R = Loss_N + \sum w_i^2$, where $Loss_N$ is the loss without regularization and $Loss_R$ is the loss with regularization. When implementing [Ref], we simply add the derivative of the new penalty to the current weight delta: $dw = dw_N + constant \cdot w$, where $dw_N$ is the weight delta without regularization. What I think - L2 regularization is achieved with the last step only, i.e. the weight is penalized. My question is - Why do we then add …
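A minimal NumPy sketch of what those two formulas look like in one training loop (the names lam, lr and the linear model are my own assumptions for illustration): the penalty enters the reported loss as $\sum w_i^2$ and enters the update as the extra $constant \cdot w$ term.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    w = np.zeros(3)
    lam, lr = 0.01, 0.1          # assumed L2 strength and learning rate

    for _ in range(100):
        err = X @ w - y
        loss_N = 0.5 * np.mean(err ** 2)          # loss without regularization
        loss_R = loss_N + lam * np.sum(w ** 2)    # Loss_R = Loss_N + penalty
        dw_N = X.T @ err / len(y)                 # weight delta without regularization
        dw = dw_N + 2 * lam * w                   # dw = dw_N + constant * w
        w -= lr * dw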
Category: Data Science

ML, Statistics and Mathematics

I have just started getting my hands wet in ML, and every time I try delving deeper into the concepts/code, I face the challenge of the mathematics and its cryptic notation. Coming from a Computer Science background, I do understand a bit of it, but the majority goes over my head. Say, for example, the formulae below from this page - I try and really want to understand them but somehow get confused and give up every time. Can you please suggest how to start with …
Category: Data Science

How would I check the validity of covariates in my linear model on several hundred datasets?

I have this linear model with predictors that I need to prove are statistically significant and that satisfy the necessary lm assumptions. I know that for a single dataset I can use various LM tests, but the problem is I have several hundred datasets which cannot be combined. Coefficients may be different for each dataset, but I just need to prove (or disprove) that the covariates can be used for lm across all models. I'm assuming I shouldn't run tests on each LM …
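One possible starting point (a sketch only, not a substitute for a proper multiple-testing strategy): fit the same specification on every dataset and collect the per-covariate p-values, then inspect their distribution across datasets. Here datasets is assumed to be a list of (X, y) pairs.

    import numpy as np
    import statsmodels.api as sm

    def covariate_pvalues(datasets):
        pvals = []
        for X, y in datasets:
            results = sm.OLS(y, sm.add_constant(X)).fit()
            pvals.append(results.pvalues)   # one p-value per coefficient (incl. intercept)
        return np.vstack(pvals)             # rows = datasets, columns = coefficients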
Category: Data Science

Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent could be used to minimize the binary cross-entropy loss. The typical description of SGD that I can find online is: $\theta = \theta - \eta *\nabla_{\theta}J(\theta,x^{(i)},y^{(i)})$ where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set. Specifically, the $(i)$ indicates that it is the i-th observation from the training set. For binary cross-entropy loss, I am using …
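A small sketch of what one such update looks like in code, under the assumption that the model is logistic regression, $h_\theta(x)=\sigma(\theta^\top x)$ (in that case the BCE gradient simplifies to $(\sigma(\theta^\top x^{(i)})-y^{(i)})\,x^{(i)}$):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_step(theta, x_i, y_i, eta=0.1):
        """One SGD update for binary cross-entropy with h(x) = sigmoid(theta . x)."""
        p = sigmoid(theta @ x_i)
        grad = (p - y_i) * x_i       # gradient of BCE w.r.t. theta for this single example
        return theta - eta * grad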
Category: Data Science

Derivative of MSE Cost Function

The gradient descent update is: $\theta_{t+1}=\theta_t-a\frac{\partial}{\partial \theta_j}J(\theta)$. But specifically about the partial derivative of the $J$ cost function (Mean Squared Error): consider that $h_\theta(x)=\theta_0+\theta_1x$. Then
$$\begin{aligned}\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{\partial}{\partial\theta_j}\frac{1}{2}(h_{\theta}(x)-y)^2\\ &=2\cdot\frac{1}{2}(h_{\theta}(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_{\theta}(x)-y)\\ &= (h_{\theta}(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i-y\right)\\ &= (h_{\theta}(x)-y)\,x_j\end{aligned}$$
It's not clear to me how $x_j$ is calculated: $\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i-y\right) = x_j$. Can anyone help me …
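The step in question is just term-by-term differentiation: every summand except the $j$-th one is constant with respect to $\theta_j$, and $y$ and the $x_i$ do not depend on $\theta_j$ either:
$$\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i - y\right) = \sum_{i=0}^{n} x_i\,\frac{\partial \theta_i}{\partial\theta_j} - 0 = x_j,\qquad \text{since } \frac{\partial \theta_i}{\partial\theta_j}=\begin{cases}1 & i=j\\ 0 & i\neq j.\end{cases}$$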
Category: Data Science

Is it possible for the (Cross Entropy) test loss to increase for a few epochs while the test accuracy also increases?

I came across the question stated in the title: when training a model with the cross-entropy loss function, is it possible for the test loss to increase for a few epochs while the test accuracy also increases? I think that it should be possible, as the cross-entropy loss is a measure of the "distance" between a one-hot encoded vector and my model's predicted probabilities, not a direct measure of my model's accuracy. But I was unable to find …
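A tiny, hand-constructed illustration (the probabilities are invented, not from a real model) showing that both can rise at once: between the two "epochs" one more example crosses the 0.5 threshold correctly, but the model becomes very confidently wrong on the misclassified example, so the mean loss goes up.

    import numpy as np

    def bce(y, p):
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 1, 0])
    p_epoch1 = np.array([0.45, 0.45, 0.10])   # accuracy 1/3, loss ~0.57
    p_epoch2 = np.array([0.55, 0.55, 0.95])   # accuracy 2/3, loss ~1.40

    for p in (p_epoch1, p_epoch2):
        acc = np.mean((p > 0.5) == y)
        print(f"accuracy={acc:.2f}  loss={bce(y, p):.2f}")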
Category: Data Science

bias variance decomposition for classification problem

It is given that MSE = bias$^2$ + variance. I can see the mathematical relationship between MSE, bias, and variance. However, how do we understand the mathematical intuition of bias and variance for classification problems (we can't have MSE for classification tasks)? I would like some help with the intuition and with understanding the mathematical basis for bias and variance for classification problems. Any formula or derivation would be helpful.
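One way to build intuition empirically (a sketch only, roughly in the spirit of the 0-1-loss decomposition of Domingos, 2000, and not the only possible definition): retrain the classifier on many resampled training sets, call the majority vote the "main prediction", measure bias as how often the main prediction is wrong and variance as how often individual models disagree with the main prediction.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    X_test, y_test = X[1500:], y[1500:]
    preds = []
    rng = np.random.default_rng(0)
    for _ in range(50):                                      # bootstrap resamples of the training part
        idx = rng.integers(0, 1500, size=1500)
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        preds.append(clf.predict(X_test))
    preds = np.array(preds)                                  # shape (50, n_test)
    main_pred = (preds.mean(axis=0) > 0.5).astype(int)       # majority vote
    bias = np.mean(main_pred != y_test)                      # main prediction wrong
    variance = np.mean(preds != main_pred)                   # disagreement with main prediction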
Category: Data Science

Decision boundary in a classification task

I have 1000 data points from the bivariate normal distribution $\mathcal{N}$ with mean $(0,0)$ and variances $\sigma_1^2=\sigma_2^2=10$, with the covariances being $0$. There are also 20 more points from another bivariate normal distribution with mean $(15,15)$, variances $\sigma_1^2=\sigma_2^2=1$, and with the covariances again being $0$. I used the least squares method to calculate the parameters of the decision boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2=0$, that is $$\theta = (X^T X)^{-1}(X^Ty)$$ where $y$ is a column matrix with …
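A minimal sketch of that setup in NumPy; since the excerpt cuts off before defining $y$, the 0/1 label coding below is an assumption (±1 would work just as well and shifts the boundary constant).

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(loc=0,  scale=np.sqrt(10), size=(1000, 2))   # majority class
    B = rng.normal(loc=15, scale=1.0,          size=(20, 2))    # minority class
    X = np.hstack([np.ones((1020, 1)), np.vstack([A, B])])      # column of ones for theta_0
    y = np.concatenate([np.zeros(1000), np.ones(20)])           # assumed 0/1 labels
    theta = np.linalg.solve(X.T @ X, X.T @ y)                   # (X^T X)^{-1} X^T y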
Category: Data Science

When is the sum of models the model of the sum?

The response variable in a regression problem, $Y$, is modeled using a data matrix $X$. In notation, this means: $Y \sim X$. However, $Y$ can be separated out into different components that can be modeled independently: $$Y = Y_1 + Y_2 + Y_3$$ Under what conditions would $M$, the overall prediction, have better or worse performance than $M_1 + M_2 + M_3$, the sum of individual models? To provide more background, the model used is a GBM. I was surprised …
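One way to probe the question empirically is to fit both variants on the same split and compare held-out error. The sketch below uses sklearn's GradientBoostingRegressor and entirely synthetic components $Y_1,Y_2,Y_3$ (invented for illustration); the real comparison would of course use the actual data and GBM implementation.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    Y1 = X[:, 0] ** 2 + rng.normal(scale=0.1, size=5000)
    Y2 = np.sin(X[:, 1]) + rng.normal(scale=0.1, size=5000)
    Y3 = X[:, 2] * X[:, 3] + rng.normal(scale=0.1, size=5000)
    Y = Y1 + Y2 + Y3

    X_tr, X_te, i_tr, i_te = train_test_split(X, np.arange(5000), random_state=0)

    M = GradientBoostingRegressor().fit(X_tr, Y[i_tr])                       # one model on Y
    parts = [GradientBoostingRegressor().fit(X_tr, Yk[i_tr]) for Yk in (Y1, Y2, Y3)]

    err_joint = np.mean((M.predict(X_te) - Y[i_te]) ** 2)
    err_sum = np.mean((sum(m.predict(X_te) for m in parts) - Y[i_te]) ** 2)  # M1 + M2 + M3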
Category: Data Science

using logsumexp in softmax

I saw this in somebody's code as an alternative approach to implementing the softmax, in order to avoid underflow from dividing by large numbers: $\text{softmax} = e^{(\text{matrix} - \text{logsumexp}(\text{matrix}))} = e^{\text{matrix}} / \text{sumexp}(\text{matrix})$

    logsumexp = scipy.special.logsumexp(matrix, axis=-1, keepdims=True)
    softmax = np.exp(matrix - logsumexp)

I understand that when you take the log of an equation that uses division you then subtract, i.e. log(1/2) = log(1) - log(2). However, in the implementation of the code above, shouldn't they also log the matrix in …
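A runnable version of that snippet, using scipy.special.logsumexp; the input values below are chosen only so that a naive np.exp(matrix) / np.exp(matrix).sum() would overflow, while the subtract-then-exponentiate form stays stable.

    import numpy as np
    from scipy.special import logsumexp

    def stable_softmax(matrix):
        # matrix - logsumexp(matrix) is log(softmax); exponentiating recovers the softmax
        return np.exp(matrix - logsumexp(matrix, axis=-1, keepdims=True))

    matrix = np.array([[1000.0, 1001.0, 1002.0]])   # naive exp() overflows here
    print(stable_softmax(matrix))                    # ~[[0.090, 0.245, 0.665]]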
Category: Data Science

Growth Edge in Link Prediction

I have 2 CSV files representing edges in a social network in 2 consecutive generations. I am trying to predict future edges. My initial thought is to train a linear regression on the first generation with some indicators, like the Adar index or the cosine similarity between the nodes of the edge I am trying to predict. I cannot run all the possible combinations between 2 nodes, so I was wondering how many edges I need to add between 2 generations? Is …
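For the indicator-scoring step described above, networkx already provides the Adamic-Adar index; a minimal sketch on a toy graph (the real first-generation graph would be loaded from the CSV file, and on a large graph you would score a sample of candidate pairs rather than all non-edges):

    import networkx as nx

    G1 = nx.karate_club_graph()                     # stand-in for the first-generation graph
    candidate_pairs = list(nx.non_edges(G1))        # in a large graph, sample instead
    aa_scores = {(u, v): p for u, v, p in nx.adamic_adar_index(G1, candidate_pairs)}
    top10 = sorted(aa_scores.items(), key=lambda kv: kv[1], reverse=True)[:10]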
Category: Data Science

How to propagate the error delta in backpropagation in convolutional neural networks (CNN)?

My CNN has the following structure: Output neurons: 10; Input matrix (I): 28x28; Convolutional layer (C): 3 feature maps with a 5x5 kernel (output dimension is 3x24x24); Max pooling layer (MP): size 2x2 (output dimension is 3x12x12); Fully connected layer (FC): 432x10 (3*12*12=432, the max pooling output flattened and vectorized). After making the forward pass, I calculate the error delta in the output layer as: $\delta^L = (a^L-y) \odot \sigma'(z^L)$ (1), where $a^L$ is the predicted value and $z^L$ the dot product …
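A sketch of the first backward steps under the architecture described above; the sigmoid output activation, the one-hot target, and all numerical values are assumptions for illustration only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W_fc = rng.normal(scale=0.1, size=(432, 10))   # FC weights, 432x10 as in the question
    mp_out = rng.random((3, 12, 12))               # max-pool output from the forward pass
    z_L = mp_out.reshape(432) @ W_fc               # pre-activations of the 10 output neurons
    a_L = sigmoid(z_L)
    y = np.eye(10)[3]                              # example one-hot target

    delta_L = (a_L - y) * sigmoid(z_L) * (1 - sigmoid(z_L))   # eq. (1)
    delta_mp = (W_fc @ delta_L).reshape(3, 12, 12)            # delta at the max-pool output
    # next steps (not shown): route each delta_mp entry to the argmax position of its 2x2
    # pooling window, then back-propagate through the convolution to reach the kernels/input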
Category: Data Science

Geometric Deep Learning - G-Smoothing operator on polynomials

(Note: my question revolves around a problem stated in the following lecture video: https://youtu.be/ERL17gbbSwo?t=413) Hi, I hope this is the right forum for this kind of question. I'm currently following the geometric deep learning lectures from geometricdeeplearning.com and find the topics fascinating. As I want to really dive in, I also wanted to follow up on the questions they pose to the students. In particular, my question revolves around creating invariant functions using the G-Smoothing operator (to enforce invariance, …
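For a concrete (toy) picture of G-smoothing, assuming the usual definition of averaging a function over the group orbit: here a non-invariant polynomial is averaged over the permutation group $S_2$, and the smoothed function gives the same value on both orderings of the input.

    import numpy as np
    from itertools import permutations

    def g_smooth(f, x):
        """Average f over all permutations of the coordinates of x (G = S_n)."""
        perms = list(permutations(range(len(x))))
        return np.mean([f(x[list(p)]) for p in perms])

    f = lambda x: x[0] * x[1] ** 2        # example polynomial (assumed), not invariant
    x = np.array([2.0, 3.0])
    print(g_smooth(f, x), g_smooth(f, x[::-1]))   # equal: the smoothed f is S_2-invariant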
Category: Data Science

Structured policies in dynamic programming: solving a toy example

I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days and every night need to pick a restaurant to get dinner at. The qualities of the restaurants are iid according to a distribution $F$ (assume support on $[0,1]$). The goal is to maximize the sum of the qualities of the restaurants that you get dinner at over the $N$ days. Every day you need to choose whether you go …
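A backward-induction sketch, under one common reading of the (cut-off) prompt that is an assumption on my part: each day you either return to the best restaurant found so far (quality $b$) or try a new one, whose quality is a fresh draw from $F$, taken here as Uniform[0,1] and discretized on a grid.

    import numpy as np

    N = 10
    grid = np.linspace(0, 1, 201)        # discretized "best quality found so far"
    q = grid                             # possible new draws (Uniform[0,1] on the same grid)
    V = np.zeros(len(grid))              # value after the final day

    for day in range(N, 0, -1):
        V_new = np.empty_like(V)
        for i, b in enumerate(grid):
            exploit = b + V[i]                                           # revisit best known
            explore = np.mean(q + V[np.maximum(i, np.arange(len(q)))])   # try a new restaurant
            V_new[i] = max(exploit, explore)
        V = V_new

    print(V[0])   # expected total quality starting with no restaurant known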
Category: Data Science

Does a t-test require the standard deviation or variance for its calculation?

This might be a novice question, but the main difference between a t-test and a z-test, as I was able to understand it, is that the z-test calculation requires the SD value of the sample whereas in a t-test we do not have the SD, apart from differences at high and low sample sizes. But when calculating the t-test value, the formula requires the SD value as well. So what is the difference between a t-test and a z-test? Can someone please clear this up?
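One way to see the distinction numerically: the t-statistic uses the sample SD (estimated from the data, with $n-1$ degrees of freedom for the reference distribution), whereas the z-statistic uses a known population SD and the normal distribution. The numbers below are simulated for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.2, scale=2.0, size=12)          # small sample

    # t-test: sample SD enters the formula
    t_stat = (x.mean() - 5.0) / (x.std(ddof=1) / np.sqrt(len(x)))
    print(t_stat, stats.ttest_1samp(x, 5.0).statistic)   # same number

    # z-test: the known population SD (here 2.0) enters instead
    z_stat = (x.mean() - 5.0) / (2.0 / np.sqrt(len(x)))
    print(z_stat)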
Category: Data Science

Efficient Searching for a basis of information as a hyperparameter in a large possible hyperparameter space

I have a set of inputs, let's call them 'I', that can be fed through a complicated group of functions to produce/calculate a wide variety of outputs (let's call them 'O'). I want to find a subset of outputs (let's call them 'O-prime') within 'O' that contain sufficient information to form a basis in order to find/reconstruct a point in the 'I'-space accurately. In other words I want to pick 'O-prime' such that I am able to uniquely identify any …
Category: Data Science

Strategies for complicated inverse function approximation

I have a dataset G. There is a complicated set of mathematical functions I can use to calculate the values 'W' for any given point in G: f(G) $\rightarrow$ W. To the best of my knowledge these functions f are not analytically invertible in closed form, so I want to use machine learning to attempt the inverse problem, i.e. to calculate/approximate the value of a point in G for any given point in W: f$^{-1}$(W) $\rightarrow$ G. I am assuming here …
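A common baseline for this setup is to sample the G-space, run the forward functions, and fit a regressor on the reversed pairs (W as input, G as target). The sketch below uses an invented toy f and sampling range, and a small MLP as the inverse approximator; it only makes sense where f is (at least locally) one-to-one on the sampled domain.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def f(G):                                            # stand-in for the real forward map
        return np.column_stack([G[:, 0] + np.sin(G[:, 1]), G[:, 0] * G[:, 1]])

    rng = np.random.default_rng(0)
    G_samples = rng.uniform(-2, 2, size=(20000, 2))      # sample the G-space
    W_samples = f(G_samples)                             # forward evaluations

    inverse_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    inverse_model.fit(W_samples, G_samples)              # learn W -> G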
Category: Data Science

Confused on Naive Bayes classifier

In the last part of Andrew Ng's lectures about Gaussian Discriminant Analysis and the Naive Bayes classifier, I am confused as to how Andrew Ng derived $(2^n) - 1$ features for the Naive Bayes classifier. First off, what does he mean by features in the context he was describing? I initially thought that the features were characteristics of our random vector, $x$. I know that the total number of possibilities for $x$ is $2^n$, but I do not understand how he was …
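For context on where that count usually comes from (on one common reading, the $2^n - 1$ refers to free parameters of a fully general distribution over $x$, rather than to input features): a distribution over $x \in \{0,1\}^n$ assigns one probability to each of the $2^n$ possible vectors, and those probabilities must sum to one, so
$$\sum_{x \in \{0,1\}^n} P(x) = 1 \quad\Longrightarrow\quad 2^n - 1 \ \text{free parameters},$$
which is exactly the exponential blow-up that the Naive Bayes independence assumption avoids.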
Category: Data Science

Reinforcement Learning - PPO: Why do so many implementations calculate the returns using the GAE? (Mathematical reason)

There are so many PPO implementations that use GAE and do the following:

    def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
        values = values + [next_value]
        gae = 0
        returns = []
        for step in reversed(range(len(rewards))):
            delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
            gae = delta + gamma * tau * masks[step] * gae
            returns.insert(0, gae + values[step])
        return returns

    ...
    advantage = returns - values
    ...
    critic_loss = (returns - value).pow(2).mean()

Source: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb, I …
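For reference, the recursion in compute_gae implements the GAE estimator (with tau playing the role of $\lambda$), and adding values[step] back turns it into the $\lambda$-return:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l\ge 0} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$
$$\text{returns}[t] = \hat{A}_t^{\mathrm{GAE}} + V(s_t),$$
which is why the same returns array can serve both as the critic's regression target and, after subtracting values, as the advantage.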
Category: Data Science
