Confusion with L2 Regularization in Back-propagation

In very simple language, this is L2 regularization: $Loss_R = Loss_N + \sum w_i^2$, where $Loss_N$ is the loss without regularization and $Loss_R$ is the loss with regularization. When implementing [Ref], we simply add the derivative of the new penalty to the current weight delta: $dw = dw_N + constant \cdot w$, where $dw_N$ is the weight delta without regularization. What I think - L2 regularization is achieved with the last step only, i.e. the weight is penalized. My question is - Why do we then add …
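A minimal NumPy sketch of what those two formulas look like in one training loop (the names lam, lr and the linear model are my own assumptions for illustration): the penalty enters the reported loss as $\sum w_i^2$ and enters the update as the extra $constant \cdot w$ term.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    w = np.zeros(3)
    lam, lr = 0.01, 0.1          # assumed L2 strength and learning rate

    for _ in range(100):
        err = X @ w - y
        loss_N = 0.5 * np.mean(err ** 2)          # loss without regularization
        loss_R = loss_N + lam * np.sum(w ** 2)    # Loss_R = Loss_N + penalty
        dw_N = X.T @ err / len(y)                 # weight delta without regularization
        dw = dw_N + 2 * lam * w                   # dw = dw_N + constant * w
        w -= lr * dw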
Category: Data Science

ML, Statistics and Mathematics

I have just started getting my hands wet in ML, and every time I try delving deeper into the concepts/code, I face the challenge of the mathematics and its cryptic notation. Coming from a Computer Science background, I do understand a bit of it, but the majority goes over my head. Say, for example, the formulae below from this page - I try and really want to understand them but somehow get confused and give up every time. Can you please suggest how to start with …
Category: Data Science

How would I check the validity of covariates in my linear model on several hundred datasets?

I have this linear model with predictors that I need to prove are statistically significant and that satisfy the necessary lm assumptions. I know that for a single dataset I can use various LM tests, but the problem is I have several hundred datasets which cannot be combined. Coefficients may be different for each dataset, but I just need to prove (or disprove) that the covariates can be used for lm across all models. I'm assuming I shouldn't run tests on each LM …
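One possible starting point (a sketch only, not a substitute for a proper multiple-testing strategy): fit the same specification on every dataset and collect the per-covariate p-values, then inspect their distribution across datasets. Here datasets is assumed to be a list of (X, y) pairs.

    import numpy as np
    import statsmodels.api as sm

    def covariate_pvalues(datasets):
        pvals = []
        for X, y in datasets:
            results = sm.OLS(y, sm.add_constant(X)).fit()
            pvals.append(results.pvalues)   # one p-value per coefficient (incl. intercept)
        return np.vstack(pvals)             # rows = datasets, columns = coefficients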
Category: Data Science

Understanding SGD for Binary Cross-Entropy loss

I'm trying to describe mathematically how stochastic gradient descent could be used to minimize the binary cross-entropy loss. The typical description of SGD that I can find online is: $\theta = \theta - \eta *\nabla_{\theta}J(\theta,x^{(i)},y^{(i)})$ where $\theta$ is the parameter to optimize the objective function $J$ over, and $x$ and $y$ come from the training set. Specifically, the $(i)$ indicates that it is the i-th observation from the training set. For binary cross-entropy loss, I am using …
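A small sketch of what one such update looks like in code, under the assumption that the model is logistic regression, $h_\theta(x)=\sigma(\theta^\top x)$ (in that case the BCE gradient simplifies to $(\sigma(\theta^\top x^{(i)})-y^{(i)})\,x^{(i)}$):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgd_step(theta, x_i, y_i, eta=0.1):
        """One SGD update for binary cross-entropy with h(x) = sigmoid(theta . x)."""
        p = sigmoid(theta @ x_i)
        grad = (p - y_i) * x_i       # gradient of BCE w.r.t. theta for this single example
        return theta - eta * grad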
Category: Data Science

Derivative of MSE Cost Function

The gradient descent update is: $\theta_{t+1}=\theta_t-a\frac{\partial}{\partial \theta_j}J(\theta)$. But specifically about the partial derivative of the $J$ cost function (Mean Squared Error): consider that $h_\theta(x)=\theta_0+\theta_1x$. Then
$$\begin{aligned}\frac{\partial}{\partial\theta_j}J(\theta) &= \frac{\partial}{\partial\theta_j}\frac{1}{2}(h_{\theta}(x)-y)^2\\ &=2\cdot\frac{1}{2}(h_{\theta}(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_{\theta}(x)-y)\\ &= (h_{\theta}(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i-y\right)\\ &= (h_{\theta}(x)-y)\,x_j\end{aligned}$$
It's not clear to me how $x_j$ is calculated: $\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i-y\right) = x_j$. Can anyone help me …
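The step in question is just term-by-term differentiation: every summand except the $j$-th one is constant with respect to $\theta_j$, and $y$ and the $x_i$ do not depend on $\theta_j$ either:
$$\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^{n}\theta_i x_i - y\right) = \sum_{i=0}^{n} x_i\,\frac{\partial \theta_i}{\partial\theta_j} - 0 = x_j,\qquad \text{since } \frac{\partial \theta_i}{\partial\theta_j}=\begin{cases}1 & i=j\\ 0 & i\neq j.\end{cases}$$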
Category: Data Science

Is it possible for the (Cross Entropy) test loss to increase for a few epochs while the test accuracy also increases?

I came across the question stated in the title: when training a model with the cross-entropy loss function, is it possible for the test loss to increase for a few epochs while the test accuracy also increases? I think that it should be possible, as the cross-entropy loss is a measure of the "distance" between a one-hot encoded vector and my model's predicted probabilities, not a direct measure of my model's accuracy. But I was unable to find …
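A tiny, hand-constructed illustration (the probabilities are invented, not from a real model) showing that both can rise at once: between the two "epochs" one more example crosses the 0.5 threshold correctly, but the model becomes very confidently wrong on the misclassified example, so the mean loss goes up.

    import numpy as np

    def bce(y, p):
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 1, 0])
    p_epoch1 = np.array([0.45, 0.45, 0.10])   # accuracy 1/3, loss ~0.57
    p_epoch2 = np.array([0.55, 0.55, 0.95])   # accuracy 2/3, loss ~1.40

    for p in (p_epoch1, p_epoch2):
        acc = np.mean((p > 0.5) == y)
        print(f"accuracy={acc:.2f}  loss={bce(y, p):.2f}")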
Category: Data Science

bias variance decomposition for classification problem

It is given that MSE = bias$^2$ + variance. I can see the mathematical relationship between MSE, bias, and variance. However, how do we understand the mathematical intuition of bias and variance for classification problems (we can't have MSE for classification tasks)? I would like some help with the intuition and with understanding the mathematical basis for bias and variance for classification problems. Any formula or derivation would be helpful.
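One way to build intuition empirically (a sketch only, roughly in the spirit of the 0-1-loss decomposition of Domingos, 2000, and not the only possible definition): retrain the classifier on many resampled training sets, call the majority vote the "main prediction", measure bias as how often the main prediction is wrong and variance as how often individual models disagree with the main prediction.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    X_test, y_test = X[1500:], y[1500:]
    preds = []
    rng = np.random.default_rng(0)
    for _ in range(50):                                      # bootstrap resamples of the training part
        idx = rng.integers(0, 1500, size=1500)
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        preds.append(clf.predict(X_test))
    preds = np.array(preds)                                  # shape (50, n_test)
    main_pred = (preds.mean(axis=0) > 0.5).astype(int)       # majority vote
    bias = np.mean(main_pred != y_test)                      # main prediction wrong
    variance = np.mean(preds != main_pred)                   # disagreement with main prediction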
Category: Data Science

Decision boundary in a classification task

I have 1000 data points from the bivariate normal distribution $\mathcal{N}$ with mean $(0,0)$ and variances $\sigma_1^2=\sigma_2^2=10$, with the covariances being $0$. There are also 20 more points from another bivariate normal distribution with mean $(15,15)$, variances $\sigma_1^2=\sigma_2^2=1$, and with the covariances again being $0$. I used the least squares method to calculate the parameters of the decision boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2=0$, that is $$\theta = (X^T X)^{-1}(X^Ty)$$ where $y$ is a column matrix with …
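A minimal sketch of that setup in NumPy; since the excerpt cuts off before defining $y$, the 0/1 label coding below is an assumption (±1 would work just as well and shifts the boundary constant).

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(loc=0,  scale=np.sqrt(10), size=(1000, 2))   # majority class
    B = rng.normal(loc=15, scale=1.0,          size=(20, 2))    # minority class
    X = np.hstack([np.ones((1020, 1)), np.vstack([A, B])])      # column of ones for theta_0
    y = np.concatenate([np.zeros(1000), np.ones(20)])           # assumed 0/1 labels
    theta = np.linalg.solve(X.T @ X, X.T @ y)                   # (X^T X)^{-1} X^T y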
Category: Data Science

When is the sum of models the model of the sum?

The response variable in a regression problem, $Y$, is modeled using a data matrix $X$. In notation, this means: $Y \sim X$. However, $Y$ can be separated out into different components that can be modeled independently: $$Y = Y_1 + Y_2 + Y_3$$ Under what conditions would $M$, the overall prediction, have better or worse performance than $M_1 + M_2 + M_3$, the sum of individual models? To provide more background, the model used is a GBM. I was surprised …
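One way to probe the question empirically is to fit both variants on the same split and compare held-out error. The sketch below uses sklearn's GradientBoostingRegressor and entirely synthetic components $Y_1,Y_2,Y_3$ (invented for illustration); the real comparison would of course use the actual data and GBM implementation.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    Y1 = X[:, 0] ** 2 + rng.normal(scale=0.1, size=5000)
    Y2 = np.sin(X[:, 1]) + rng.normal(scale=0.1, size=5000)
    Y3 = X[:, 2] * X[:, 3] + rng.normal(scale=0.1, size=5000)
    Y = Y1 + Y2 + Y3

    X_tr, X_te, i_tr, i_te = train_test_split(X, np.arange(5000), random_state=0)

    M = GradientBoostingRegressor().fit(X_tr, Y[i_tr])                       # one model on Y
    parts = [GradientBoostingRegressor().fit(X_tr, Yk[i_tr]) for Yk in (Y1, Y2, Y3)]

    err_joint = np.mean((M.predict(X_te) - Y[i_te]) ** 2)
    err_sum = np.mean((sum(m.predict(X_te) for m in parts) - Y[i_te]) ** 2)  # M1 + M2 + M3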
Category: Data Science

using logsumexp in softmax

I saw this in somebody's code as an alternative approach to implementing the softmax, in order to avoid underflow from dividing by large numbers: $\text{softmax} = e^{(\text{matrix} - \text{logsumexp}(\text{matrix}))} = e^{\text{matrix}} / \text{sumexp}(\text{matrix})$

    logsumexp = scipy.special.logsumexp(matrix, axis=-1, keepdims=True)
    softmax = np.exp(matrix - logsumexp)

I understand that when you take the log of an equation that uses division you then subtract, i.e. log(1/2) = log(1) - log(2). However, in the implementation of the code above, shouldn't they also log the matrix in …
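A runnable version of that snippet, using scipy.special.logsumexp; the input values below are chosen only so that a naive np.exp(matrix) / np.exp(matrix).sum() would overflow, while the subtract-then-exponentiate form stays stable.

    import numpy as np
    from scipy.special import logsumexp

    def stable_softmax(matrix):
        # matrix - logsumexp(matrix) is log(softmax); exponentiating recovers the softmax
        return np.exp(matrix - logsumexp(matrix, axis=-1, keepdims=True))

    matrix = np.array([[1000.0, 1001.0, 1002.0]])   # naive exp() overflows here
    print(stable_softmax(matrix))                    # ~[[0.090, 0.245, 0.665]]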
Category: Data Science

Growth Edge in Link Prediction

I have 2 CSV files representing edges in a social network in 2 consecutive generations. I am trying to predict future edges. My initial thought is to train a linear regression on the first generation with some indicators, like the Adar index or the cosine similarity between the nodes of the edge I am trying to predict. I cannot run all the possible combinations between 2 nodes, so I was wondering how many edges I need to add between 2 generations? Is …
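For the indicator-scoring step described above, networkx already provides the Adamic-Adar index; a minimal sketch on a toy graph (the real first-generation graph would be loaded from the CSV file, and on a large graph you would score a sample of candidate pairs rather than all non-edges):

    import networkx as nx

    G1 = nx.karate_club_graph()                     # stand-in for the first-generation graph
    candidate_pairs = list(nx.non_edges(G1))        # in a large graph, sample instead
    aa_scores = {(u, v): p for u, v, p in nx.adamic_adar_index(G1, candidate_pairs)}
    top10 = sorted(aa_scores.items(), key=lambda kv: kv[1], reverse=True)[:10]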
Category: Data Science

How to propagate the error delta in backpropagation in convolutional neural networks (CNN)?

My CNN has the following structure: Output neurons: 10; Input matrix (I): 28x28; Convolutional layer (C): 3 feature maps with a 5x5 kernel (output dimension is 3x24x24); Max pooling layer (MP): size 2x2 (output dimension is 3x12x12); Fully connected layer (FC): 432x10 (3*12*12=432, the max pooling output flattened and vectorized). After making the forward pass, I calculate the error delta in the output layer as: $\delta^L = (a^L-y) \odot \sigma'(z^L)$ (1), where $a^L$ is the predicted value and $z^L$ the dot product …
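A sketch of the first backward steps under the architecture described above; the sigmoid output activation, the one-hot target, and all numerical values are assumptions for illustration only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W_fc = rng.normal(scale=0.1, size=(432, 10))   # FC weights, 432x10 as in the question
    mp_out = rng.random((3, 12, 12))               # max-pool output from the forward pass
    z_L = mp_out.reshape(432) @ W_fc               # pre-activations of the 10 output neurons
    a_L = sigmoid(z_L)
    y = np.eye(10)[3]                              # example one-hot target

    delta_L = (a_L - y) * sigmoid(z_L) * (1 - sigmoid(z_L))   # eq. (1)
    delta_mp = (W_fc @ delta_L).reshape(3, 12, 12)            # delta at the max-pool output
    # next steps (not shown): route each delta_mp entry to the argmax position of its 2x2
    # pooling window, then back-propagate through the convolution to reach the kernels/input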
Category: Data Science

Geometric Deep Learning - G-Smoothing operator on polynomials

(Note: my question revolves around a problem stated in the following lecture video: https://youtu.be/ERL17gbbSwo?t=413) Hi, I hope this is the right forum for this kind of question. I'm currently following the geometric deep learning lectures from geometricdeeplearning.com and find the topics fascinating. As I want to really dive in, I also wanted to follow up on the questions they pose to the students. In particular, my question revolves around creating invariant functions using the G-Smoothing operator (to enforce invariance, …
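For a concrete (toy) picture of G-smoothing, assuming the usual definition of averaging a function over the group orbit: here a non-invariant polynomial is averaged over the permutation group $S_2$, and the smoothed function gives the same value on both orderings of the input.

    import numpy as np
    from itertools import permutations

    def g_smooth(f, x):
        """Average f over all permutations of the coordinates of x (G = S_n)."""
        perms = list(permutations(range(len(x))))
        return np.mean([f(x[list(p)]) for p in perms])

    f = lambda x: x[0] * x[1] ** 2        # example polynomial (assumed), not invariant
    x = np.array([2.0, 3.0])
    print(g_smooth(f, x), g_smooth(f, x[::-1]))   # equal: the smoothed f is S_2-invariant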
Category: Data Science

Structured policies in dynamic programming: solving a toy example

I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days and every night need to pick a restaurant to get dinner at. The qualities of the restaurants are iid according to a distribution $F$ (assume support on $[0,1]$). The goal is to maximize the sum of the qualities of the restaurants that you get dinner at over the $N$ days. Every day you need to choose whether you go …
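A backward-induction sketch, under one common reading of the (cut-off) prompt that is an assumption on my part: each day you either return to the best restaurant found so far (quality $b$) or try a new one, whose quality is a fresh draw from $F$, taken here as Uniform[0,1] and discretized on a grid.

    import numpy as np

    N = 10
    grid = np.linspace(0, 1, 201)        # discretized "best quality found so far"
    q = grid                             # possible new draws (Uniform[0,1] on the same grid)
    V = np.zeros(len(grid))              # value after the final day

    for day in range(N, 0, -1):
        V_new = np.empty_like(V)
        for i, b in enumerate(grid):
            exploit = b + V[i]                                           # revisit best known
            explore = np.mean(q + V[np.maximum(i, np.arange(len(q)))])   # try a new restaurant
            V_new[i] = max(exploit, explore)
        V = V_new

    print(V[0])   # expected total quality starting with no restaurant known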
Category: Data Science

Does a t-test require the standard deviation or variance for its calculation?

This might be a novice question, but the main difference between a t-test and a z-test, as I was able to understand it, is that the z-test calculation requires the SD value of the sample whereas in a t-test we do not have the SD, apart from differences at high and low sample sizes. But when calculating the t-test value, the formula requires the SD value as well. So what is the difference between a t-test and a z-test? Can someone please clear this up?
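One way to see the distinction numerically: the t-statistic uses the sample SD (estimated from the data, with $n-1$ degrees of freedom for the reference distribution), whereas the z-statistic uses a known population SD and the normal distribution. The numbers below are simulated for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.2, scale=2.0, size=12)          # small sample

    # t-test: sample SD enters the formula
    t_stat = (x.mean() - 5.0) / (x.std(ddof=1) / np.sqrt(len(x)))
    print(t_stat, stats.ttest_1samp(x, 5.0).statistic)   # same number

    # z-test: the known population SD (here 2.0) enters instead
    z_stat = (x.mean() - 5.0) / (2.0 / np.sqrt(len(x)))
    print(z_stat)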
Category: Data Science

Efficient Searching for a basis of information as a hyperparameter in a large possible hyperparameter space

I have a set of inputs, let's call them 'I', that can be fed through a complicated group of functions to produce/calculate a wide variety of outputs (let's call them 'O'). I want to find a subset of outputs (let's call them 'O-prime') within 'O' that contain sufficient information to form a basis in order to find/reconstruct a point in the 'I'-space accurately. In other words I want to pick 'O-prime' such that I am able to uniquely identify any …
Category: Data Science

Strategies for complicated inverse function approximation

I have a dataset G. There is a complicated set of mathematical functions I can use to calculate the values 'W' for any given point in G: f(G) $\rightarrow$ W. To the best of my knowledge these functions f are not analytically invertible in closed form, so I want to use machine learning to attempt the inverse problem, i.e. to calculate/approximate the value of a point in G for any given point in W: f$^{-1}$(W) $\rightarrow$ G. I am assuming here …
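A common baseline for this setup is to sample the G-space, run the forward functions, and fit a regressor on the reversed pairs (W as input, G as target). The sketch below uses an invented toy f and sampling range, and a small MLP as the inverse approximator; it only makes sense where f is (at least locally) one-to-one on the sampled domain.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def f(G):                                            # stand-in for the real forward map
        return np.column_stack([G[:, 0] + np.sin(G[:, 1]), G[:, 0] * G[:, 1]])

    rng = np.random.default_rng(0)
    G_samples = rng.uniform(-2, 2, size=(20000, 2))      # sample the G-space
    W_samples = f(G_samples)                             # forward evaluations

    inverse_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    inverse_model.fit(W_samples, G_samples)              # learn W -> G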
Category: Data Science

Confused on Naive Bayes classifier

In the last part of Andrew Ng's lectures about Gaussian Discriminant Analysis and the Naive Bayes classifier, I am confused as to how Andrew Ng derived $(2^n) - 1$ features for the Naive Bayes classifier. First off, what does he mean by features in the context he was describing? I initially thought that the features were characteristics of our random vector, $x$. I know that the total number of possibilities for $x$ is $2^n$, but I do not understand how he was …
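For context on where that count usually comes from (on one common reading, the $2^n - 1$ refers to free parameters of a fully general distribution over $x$, rather than to input features): a distribution over $x \in \{0,1\}^n$ assigns one probability to each of the $2^n$ possible vectors, and those probabilities must sum to one, so
$$\sum_{x \in \{0,1\}^n} P(x) = 1 \quad\Longrightarrow\quad 2^n - 1 \ \text{free parameters},$$
which is exactly the exponential blow-up that the Naive Bayes independence assumption avoids.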
Category: Data Science

Reinforcement Learning - PPO: Why do so many implementations calculate the returns using the GAE? (Mathematical reason)

There are so many PPO implementations that use GAE and do the following:

    def compute_gae(next_value, rewards, masks, values, gamma=0.99, tau=0.95):
        values = values + [next_value]
        gae = 0
        returns = []
        for step in reversed(range(len(rewards))):
            delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
            gae = delta + gamma * tau * masks[step] * gae
            returns.insert(0, gae + values[step])
        return returns

    ...
    advantage = returns - values
    ...
    critic_loss = (returns - value).pow(2).mean()

Source: https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb, I …
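For reference, the recursion in compute_gae implements the GAE estimator (with tau playing the role of $\lambda$), and adding values[step] back turns it into the $\lambda$-return:
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l\ge 0} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$
$$\text{returns}[t] = \hat{A}_t^{\mathrm{GAE}} + V(s_t),$$
which is why the same returns array can serve both as the critic's regression target and, after subtracting values, as the advantage.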
Category: Data Science
