I have a regression problem I'm trying to build a model for: predicting sales per person (>= 0) depending on some variables. I'm running different model types and gave deep neural networks a try. The loss functions I'm using are mean squared error and mean absolute error (or sometimes a mix). I often run into this issue, though: despite MSE and MAE being optimized, I end up with a very strong bias in the prediction, e.g. sum(training_all_predictions) / …
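A minimal sketch of how one might quantify that kind of bias, with hypothetical y_true / y_pred arrays standing in for the training targets and predictions:

import numpy as np

# Hypothetical stand-ins for the training-set targets and predictions.
y_true = np.array([120.0, 0.0, 340.0, 80.0, 510.0])
y_pred = np.array([150.0, 20.0, 360.0, 110.0, 540.0])

# A sum ratio far from 1 and a mean signed error far from 0 both point
# to a systematic over- or under-prediction rather than random noise.
sum_ratio = y_pred.sum() / y_true.sum()
mean_signed_error = (y_pred - y_true).mean()
print(f"sum ratio: {sum_ratio:.3f}, mean signed error: {mean_signed_error:.2f}")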
I'm confused about the difference between "ethics" and "bias" when those concepts are discussed in the context of Machine Learning (ML). In my understanding, an ethical issue in ML is pretty much exactly the same thing as "bias": say, the model discriminates against people of color, and this is the same as saying that the model is biased. In short, "an ethical issue is always a bias, but it is not necessarily true that a bias is always an ethical issue". Is this …
I have this simple model, which tries to predict the constant $[1, 1, \dots, 1, 0, \dots, 0]$ vector regardless of input. I found that the model predicts it successfully if trained on input in the $[0,10]$ range; however, the model's predictions are always $[0, \dots, 0]$ vectors if it is trained on input in the $[750, 770]$ range. I was thinking the model should converge to high bias weights and still be able to predict the constant vector even for larger training inputs. Maybe anyone can advise what …
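One likely culprit, sketched below on the assumption that the inputs are fed in raw, is the input scale itself; standardizing makes the $[750, 770]$ range look like the $[0,10]$ case to the network:

import numpy as np

# Hypothetical inputs in the problematic range.
X_raw = np.random.uniform(750, 770, size=(256, 4))

# Standardize per feature: zero mean, unit variance. With raw inputs
# near 760, first-layer pre-activations start huge and gradients can
# vanish or explode; after this rescaling they start near zero.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)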
It is given that: $$\text{MSE} = \text{Bias}^2 + \text{Variance}$$ I can see the mathematical relationship between MSE, bias, and variance. However, how do we understand the mathematical intuition of bias and variance for classification problems (we can't use MSE for classification tasks)? I would like some help with the intuition and with understanding the mathematical basis for bias and variance in classification problems. Any formula or derivation would be helpful.
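For 0-1 loss there is a standard analogue (following Domingos, 2000). With $y^*(x)$ the optimal prediction and $\bar y(x)$ the main prediction (the majority vote of $\hat y_D(x)$ over training sets $D$), one can define $$\mathrm{Bias}(x) = \mathbb{1}\big[\bar y(x) \neq y^*(x)\big], \qquad \mathrm{Var}(x) = P_D\big(\hat y_D(x) \neq \bar y(x)\big),$$ and for a noise-free two-class problem the expected 0-1 loss decomposes as $$\mathbb{E}_D\big[\mathbb{1}[\hat y_D(x) \neq y^*(x)]\big] = \mathrm{Bias}(x) + c\,\mathrm{Var}(x),$$ with $c = 1$ on unbiased points and $c = -1$ on biased points, so variance can even reduce the loss where the model is already biased.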
I have trained an XGBClassifier to classify text issues to the rightful assignee (a simple 50-way classification). The source from which I am fetching the data also provides a datetime object giving the timestamp at which the issue was created. Logically, a person who has recently worked on an issue (say 2 weeks ago) should be a better suggestion than (another) person who worked on a similar issue 2 years ago. That is, if there are two examples from …
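One way to encode that recency preference, sketched here with hypothetical toy data and an assumed half-life, is to pass per-example time-decay weights to XGBClassifier.fit via sample_weight:

import numpy as np
from xgboost import XGBClassifier

# Hypothetical toy features/labels standing in for the text issues.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(4, 8))
y_train = np.array([0, 1, 0, 1])

# Days since each training issue was created (2 weeks ... 2 years).
issue_age_days = np.array([14, 730, 90, 365])
half_life = 180.0  # assumption: an example's weight halves every ~6 months

# Exponential decay: recent issues weigh close to 1, old ones much less.
sample_weight = 0.5 ** (issue_age_days / half_life)

model = XGBClassifier(n_estimators=20)
model.fit(X_train, y_train, sample_weight=sample_weight)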
I wonder how to check whether the protected variables in a fairness setting are encoded in the other (non-protected) features, or whether they are not sufficiently correlated with the target variable, so that adding them does not improve prediction (classification) performance. If there is a Python tutorial showing that, it would be useful. Regards,
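A common check for the first question, sketched here on synthetic data, is to see how well the protected attribute itself can be predicted from the non-protected features; an AUC well above 0.5 suggests it is encoded in them:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: X_other are the non-protected features and
# `protected` is a binary protected attribute leaked through feature 0.
rng = np.random.default_rng(0)
X_other = rng.normal(size=(500, 6))
protected = (X_other[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

auc = cross_val_score(LogisticRegression(), X_other, protected,
                      cv=5, scoring='roc_auc').mean()
print(f"proxy-check AUC: {auc:.3f}  (about 0.5 means not recoverable)")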
I read this and have an ambiguity. I am trying to understand how to calculate the derivative of the loss w.r.t. the bias. In that question, we have this definition: np.sum(dz2, axis=0, keepdims=True) Then in Casper's comment, he said that the derivative of $L$ (loss) w.r.t. $b$ is the sum of the rows: $$ \frac{\partial L}{\partial Z} \times \mathbf{1} = \begin{bmatrix} . &. &. \\ . &. &. \end{bmatrix} \begin{bmatrix} 1\\ 1\\ 1\\ \end{bmatrix} $$ But actually, using axis=0, is it not …
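A quick NumPy check of what axis=0 actually sums, with an illustrative 2x3 matrix: summing over axis=0 collapses the sample (row) axis, which in matrix form is $\mathbf{1}^T \frac{\partial L}{\partial Z}$ rather than $\frac{\partial L}{\partial Z}\,\mathbf{1}$:

import numpy as np

# Illustrative dL/dZ with shape (n_samples, n_units) = (2, 3).
dZ = np.array([[1., 2., 3.],
               [4., 5., 6.]])

# axis=0 sums down the columns, one result per bias unit: [[5., 7., 9.]].
db_axis0 = np.sum(dZ, axis=0, keepdims=True)

# Matrix form matching axis=0: the ones vector hits the sample axis.
db_matmul = np.ones((1, 2)) @ dZ

assert np.allclose(db_axis0, db_matmul)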
I have this code:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)
from mlxtend.evaluate import bias_variance_decomp
print(y_train.min(), y_train.max(), y_test.min(), y_test.max())
# for your understanding of the data: 7283 517924 11510 450000
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    model, X_train, y_train.ravel(), X_test, y_test.ravel(),
    loss='mse', random_seed=1)
print('Average expected loss: %.3f' % avg_expected_loss)
print('Average bias: %.3f' % avg_bias)
print('Average variance: %.3f' % avg_var)

The result is:

Average expected loss: 542162695.679
Average bias: 529311955.129
Average variance: 12850740.550

To me, these values …
Suppose I have an input matrix $\mathbf X\in \mathbb R^{(D+1)\times N}$, where $N$ is the number of samples, $D$ is the dimension of an input vector $x$, and the extra dimension is for the bias, where all bias entries are $1$. If I want to normalize all inputs by subtracting the mean and dividing by the standard deviation, how should I handle the bias entries? Should they stay the same as $1$?
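One common convention, sketched below on hypothetical data, is to standardize only the $D$ feature rows and leave the bias row untouched; a constant row has zero standard deviation, so it cannot be standardized anyway:

import numpy as np

# Hypothetical (D+1) x N design matrix: D=3 feature rows plus a ones row.
D, N = 3, 5
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=10.0, scale=4.0, size=(D, N)),
               np.ones((1, N))])

# Standardize each feature row across samples; skip the bias row.
mu = X[:D].mean(axis=1, keepdims=True)
sigma = X[:D].std(axis=1, keepdims=True)
X[:D] = (X[:D] - mu) / sigma

print(X[-1])  # still [1. 1. 1. 1. 1.]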
I was wondering if I can visualize, with an example, the fact that for all points $x$ on the separating hyperplane, the following equation holds: $$w^T x+w_0=0\quad\quad\quad \text{... equation (1)}$$ Here, $w$ is a weight vector and $w_0$ is a bias term (related to the perpendicular distance $|w_0|/\lVert w\rVert$ of the separating hyperplane from the origin) defining the separating hyperplane. I was trying to visualize this in 2D space. In 2D, the separating hyperplane is nothing but the decision boundary. So, I took the following example: $w=[1\quad 2], …
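A small plotting sketch of equation (1) in 2D, using $w = [1\ \ 2]$ from the example and a hypothetical $w_0 = -4$ (the actual value is cut off above):

import numpy as np
import matplotlib.pyplot as plt

w = np.array([1.0, 2.0])
w0 = -4.0  # hypothetical bias term; the original value is truncated

# Points on the boundary satisfy w @ x + w0 = 0, i.e.
# x2 = -(w[0] * x1 + w0) / w[1].
x1 = np.linspace(-2.0, 6.0, 100)
x2 = -(w[0] * x1 + w0) / w[1]

plt.plot(x1, x2, label=r'$w^T x + w_0 = 0$')
# w is normal to the boundary; draw it from the on-line point (2, 1).
plt.quiver(2.0, 1.0, w[0], w[1], angles='xy', scale_units='xy', scale=1)
plt.axis('equal')
plt.legend()
plt.show()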
I am reading the paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" (here is the pdf). On page 6, we read: "Step 1: Identify gender subspace. Inputs: word sets W, defining sets D_1, ..., D_m." However, the paper, both before and after this statement, does not say what these defining sets are. Can anyone give me a definition or description of these sets? Thank you.
I'm currently working on a binary classification problem. My training dataset is rather small, with only 1000 elements. (I don't know if it is relevant: my problem is similar to the "spam filtering" problem, where an item can also be "likely" to be categorized as spam, but I simplified it into a black-or-white issue and use the probability given by the models to assign a likelihood score.) Among those 1000 elements, 70% are from class 1 …
Based on the Deep Learning book: $$\mathrm{MSE} = \mathbb{E}\big[(\hat\theta_m - \theta)^2\big] = \mathrm{Bias}(\hat\theta_m)^2 + \mathrm{Var}(\hat\theta_m)$$ where $m$ is the number of samples in the training set, $\theta$ is the actual parameter, and $\hat\theta_m$ is the estimated parameter. I can't get to the second equation. Further, I don't understand how the first expression is obtained. Note: $\mathrm{Bias}(\hat\theta_m) = \mathbb{E}[\hat\theta_m] - \theta$. Also, how are bias and variance evaluated in classification?
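For the missing step: add and subtract $\mathbb{E}[\hat\theta_m]$ inside the square and expand; the cross term vanishes because $\mathbb{E}\big[\hat\theta_m - \mathbb{E}[\hat\theta_m]\big] = 0$: $$\mathbb{E}\big[(\hat\theta_m - \theta)^2\big] = \mathbb{E}\big[(\hat\theta_m - \mathbb{E}[\hat\theta_m])^2\big] + 2\big(\mathbb{E}[\hat\theta_m] - \theta\big)\,\mathbb{E}\big[\hat\theta_m - \mathbb{E}[\hat\theta_m]\big] + \big(\mathbb{E}[\hat\theta_m] - \theta\big)^2 = \mathrm{Var}(\hat\theta_m) + \mathrm{Bias}(\hat\theta_m)^2.$$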
I often read that the amount of data needed to train a deep learning algorithm and obtain a model that generalizes is much higher than for, e.g., a support vector machine. This makes sense, because of the huge number of parameters in a deep learning approach, which potentially leads to overfitting. However: are there any systematic studies on this? Do deep learning approaches really need more data? Best regards, Gesetzt
I have a regression model with a train MAPE of 6% and a test MAPE of 15%. This appears to me to be a clear case of overfitting. But can I still use this model, assuming a 15% error is not a bad number after all? Is there a flaw in this thinking?
This topic confuses me. In the literature and articles, when talking about bias and variance in machine learning, specifically in cross-validation, do they refer to high bias (underfitting) and high variance (overfitting) in the model? Or do they refer to the bias and variance of the predictions obtained across the iterations of cross-validation? How should each case be handled?
I am building some ML models (RF, RNN, MLP) to predict a time series value 'y' based on features 'X', not on the time series 'y' itself. My question concerns the bias I might be introducing, since I am doing a simple random train-test split for the fit and evaluation process; I am therefore using data from different days (past and future) rather than splitting by time. Is it valid for this prediction process, or even that I am not …
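A time-ordered alternative to the random split, sketched with scikit-learn's TimeSeriesSplit on toy data, trains only on the past and validates on the future, avoiding the look-ahead leakage described here:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical feature matrix ordered by time (oldest rows first).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every test fold lies strictly after its training fold in time.
    print("train:", train_idx, "test:", test_idx)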
My goal is to calculate backpropagation (especially the backpropagation of the bias). For example, X, W and B are Python NumPy arrays, such as [[0,0],[0,1]], [[5,5,5],[10,10,10]] and [1,2,3] respectively. And suppose dL/dY is [[1,2,3],[4,5,6]]. How do I calculate dL/dB? The answer should be [5, 7, 9]. Why is it calculated that way?
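A worked check with the exact values from the question: since the same B is broadcast onto every row of X @ W, each of the N rows of dL/dY contributes to dL/dB, so the gradient sums over the batch axis:

import numpy as np

X = np.array([[0, 0], [0, 1]])
W = np.array([[5, 5, 5], [10, 10, 10]])
B = np.array([1, 2, 3])

Y = X @ W + B          # forward pass: B is added to every sample (row)
dY = np.array([[1, 2, 3], [4, 5, 6]])  # given dL/dY

# B appears once per sample, so its gradient accumulates across rows.
dB = np.sum(dY, axis=0)
print(dB)  # [5 7 9]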
I am trying to understand this learning curve of a classification problem, but I am not sure what to infer. I believe that I have overfitting, but I cannot be sure. There is a very low training loss that is very slightly increasing upon adding training examples, and a gradually decreasing validation loss (without flattening) upon adding training examples. However, I do not see any gap at the end of the lines, something that can usually be found in an overfitting model. On the other hand, …
I am trying to predict the next 10 days by looking at the last 60 days, so I tried to implement an LSTM layer. Before jumping into the question, I want to clarify a few points. Firstly, this is a Multiple Parallel Input and Multi-Step Output problem, as described in the link. I collected the data of the last 5 years of all funds available in my country from this address. I refined my data as much as possible. Of …
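A minimal Keras sketch of a Multiple Parallel Input, Multi-Step Output LSTM under assumed shapes (60 days in, 10 days out, 5 hypothetical parallel fund series; the real feature count depends on the collected data):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_in, n_out, n_feat = 60, 10, 5         # assumed window sizes / series count
X = np.random.rand(100, n_in, n_feat)   # dummy input windows
y = np.random.rand(100, n_out, n_feat)  # dummy multi-step targets

model = keras.Sequential([
    layers.LSTM(64, input_shape=(n_in, n_feat)),
    layers.Dense(n_out * n_feat),
    layers.Reshape((n_out, n_feat)),    # one 10-step forecast per series
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=2, verbose=0)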