I have built a LightGBM-based machine learning model on data of molecules of two classes. The class distribution is as follows: class 0 has 5933 data points and class 1 has 4696. The train and test accuracy I get on this data is around 87% and 82% respectively, and the roc_auc_score is around 81.5%. But when I try to evaluate model performance on an entirely new dataset, which the model has never seen before, with class labels 0 and 1 both having 94 …
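A minimal, self-contained sketch of this kind of evaluation setup, assuming the scikit-learn-style LightGBM API; the synthetic data below only stands in for the molecular features, and the external-set scoring is indicated in a comment:

```python
# Sketch only: synthetic data stands in for the molecular descriptors;
# X_external / y_external in the final comment are hypothetical names.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=10629, n_features=50,
                           weights=[0.56, 0.44], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)

print("train acc:", accuracy_score(y_train, model.predict(X_train)))
print("test  acc:", accuracy_score(y_test, model.predict(X_test)))
print("test  AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# The external set (94 molecules per class) would be scored the same way:
# roc_auc_score(y_external, model.predict_proba(X_external)[:, 1])
```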
My task is the following: to input drug combinations and output renal-failure-related symptoms from the drug combinations. Both the drug combinations and the renal-failure-related symptoms are represented as one-hot encodings (for example, someone getting symptom 1 and symptom 3 out of a total of 4 symptoms is represented as [1,0,1,0]). So far, I have run the data through the following models and they have produced this interesting graph. The left-hand graph depicts the training and validation loss of the …
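A minimal sketch of one way such a multi-label setup can be wired together (scikit-learn's MLPClassifier accepts a binary indicator matrix as the target); the random matrices below are only placeholders for the real one-hot drug and symptom data:

```python
# Sketch: random data stands in for the real one-hot matrices.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))   # one-hot drug combinations (20 drugs assumed)
Y = rng.integers(0, 2, size=(1000, 4))    # one-hot symptoms, e.g. [1,0,1,0]

X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, random_state=0)

# MLPClassifier handles multi-label targets (binary indicator matrices) natively
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_tr, Y_tr)
print("validation Hamming loss:", hamming_loss(Y_va, clf.predict(X_va)))
```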
Given a time-series prediction with a Recurrent Neural Network (doesn't matter if LSTM/GRU/...), a forecast might look like this: to_predict (orange) was fed to the model, predicted (purple) is the forecast resulting from the RNN model and correct (dashed blue) is how it should have been forecasted correctly. As can be seen, to_predict (as well as all the training data) is quite "spiky", while the forecast is much smoother. The smoothness is presumably the result of the model's architecture etc.; anyhow, my …
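For reference, a minimal sketch of the kind of one-step-ahead RNN forecaster being described (PyTorch; the noisy sine series, architecture and hyperparameters are illustrative assumptions). The MSE objective in particular pulls the forecast toward the conditional mean, which is one common source of the smoothing effect:

```python
# Sketch: noisy sine stands in for the "spiky" series; model and training are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
t = torch.arange(0, 400, dtype=torch.float32)
series = torch.sin(0.1 * t) + 0.3 * torch.randn_like(t)   # spiky signal

def make_windows(x, window=20):
    X = torch.stack([x[i:i + window] for i in range(len(x) - window)])
    return X.unsqueeze(-1), x[window:].unsqueeze(-1)

X, y = make_windows(series)

class Forecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.rnn(x)
        return self.head(out[:, -1])           # predict the next value

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                          # MSE favours the smooth conditional mean

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```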
Stats newbie here. I have a small dataset of 646 samples that I've trained a reasonably performant model on (~99% test and validation accuracy). To complicate things a little, the classes are somewhat unbalanced. It's a binary classification problem. Here are my confusion matrices: on training data [[387 1] [1 73]], on testing data [[74 1] [0 10]], on validation data [[85 1] [0 13]]. Training specificity: 0.986; testing specificity: 0.909; validation specificity: 0.928. My thoughts …
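A short sketch of how specificity is typically derived from a scikit-learn-style confusion matrix (rows = true class, columns = predicted class); note that which class is treated as "positive" changes the resulting number, which is worth checking whenever hand-computed values look inconsistent. The matrix below is the test matrix from the question:

```python
# Sketch: layout and choice of positive class are assumptions to illustrate the computation.
import numpy as np

cm = np.array([[74, 1],
               [0, 10]])           # rows = true class, cols = predicted class (sklearn order)

tn, fp, fn, tp = cm.ravel()         # with class 1 treated as the positive class
specificity = tn / (tn + fp)        # true-negative rate
sensitivity = tp / (tp + fn)        # true-positive rate (recall)
print(f"specificity = {specificity:.3f}, sensitivity = {sensitivity:.3f}")
```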
Say I have an instance space with 4 features and I know that a decision tree with 8 nodes can represent the target function I want to learn. I want to give an upper bound on the size of the sample set needed in order to achieve a true error of at most x%. I found this theorem in a textbook. Let $\mathcal{H}$ be a hypothesis class and let $\epsilon, \delta > 0$. If a training set $S$ of …
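In case it helps frame the question: if the theorem being quoted is the standard sample-complexity bound for a finite hypothesis class in the realizable case (an assumption here, since the statement is truncated), it is usually applied by taking $\mathcal{H}$ to be the finite set of decision trees with at most 8 nodes over the 4 features and requiring
$$|S| \;\ge\; \frac{\ln|\mathcal{H}| + \ln(1/\delta)}{\epsilon},$$
so that with probability at least $1-\delta$ every hypothesis consistent with $S$ has true error at most $\epsilon$; the remaining work is then counting, or upper-bounding, $|\mathcal{H}|$.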
If one is training a basic FFNN (feed-forward neural network), one would apply regularizations like dropout, L1, L2 and Gaussian noise, so that the model is robust and gives better results on unseen data. But my question is: once the model gives fairly good results, isn't it advisable to remove the regularizations and then train the model again for some time, so that its predictions are more accurate?
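A minimal PyTorch sketch of the procedure being asked about: train with dropout and L2 (weight decay), then switch both off and fine-tune briefly at a lower learning rate. The model, data and schedules are placeholders, and whether this is actually advisable is exactly the open question:

```python
# Sketch of the two-phase procedure; data, architecture and step counts are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 10), torch.randn(512, 1)          # stand-in data

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(steps, lr, weight_decay):
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)  # weight_decay ~ L2
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

train(steps=200, lr=1e-3, weight_decay=1e-4)               # phase 1: with dropout + L2

for m in model.modules():                                   # phase 2: disable dropout ...
    if isinstance(m, nn.Dropout):
        m.p = 0.0
train(steps=50, lr=1e-4, weight_decay=0.0)                  # ... and fine-tune without L2
```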
I'm using PyTorch Geometric to train a graph convolutional network for a regression-over-nodes problem (the graph models physical phenomena in a network of sensors; the network of sensors is actually a network of measurements distributed across the power grid (powers, currents, voltages), and the goal of the GNN is to predict some unmeasured variables in the graph). In the training dataset there are graphs with different topologies (i.e. different edge_index tensors), each of which has input and label tensors, which …
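A minimal PyTorch Geometric sketch of node-level regression over graphs whose edge_index differs per graph, with the DataLoader batching them into disjoint unions; graph sizes, features and the two-layer GCN are illustrative assumptions:

```python
# Sketch: random graphs stand in for the sensor-network data.
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv

def random_graph(num_nodes, num_edges, num_feats=4):
    edge_index = torch.randint(0, num_nodes, (2, num_edges))
    x = torch.randn(num_nodes, num_feats)                   # measured quantities per node
    y = torch.randn(num_nodes, 1)                           # unmeasured target per node
    return Data(x=x, edge_index=edge_index, y=y)

dataset = [random_graph(n, 3 * n) for n in (10, 15, 20, 12)]  # different topologies
loader = DataLoader(dataset, batch_size=2, shuffle=True)       # batches are disjoint unions

class GCN(torch.nn.Module):
    def __init__(self, in_feats=4, hidden=32):
        super().__init__()
        self.conv1 = GCNConv(in_feats, hidden)
        self.conv2 = GCNConv(hidden, 1)
    def forward(self, data):
        h = torch.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(5):
    for batch in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(batch), batch.y)
        loss.backward()
        opt.step()
```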
I had this question during an interview that I wasn't able to answer, even after researching on the internet. Which of the following can affect an artificial neural network's ability to generalize: absence of bias, learning bias, size of the output layer, or the number of examples? And please, can you explain a little why? Thank you.
So I have this data, let's say of size (2000, 11), and I want to perform a binary classification based on these eleven features. There is a class imbalance between the two categories, so I balance the classes using random oversampling to ensure the classifier will generalize. The data is split into a train and a test set, then I create a pipeline with the following steps: (StandardScaler(), SelectKBest(f_classif), SVC()). Then I use GridSearchCV on the training set with cross validation of …
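A sketch of the described setup with the oversampler placed inside an imbalanced-learn Pipeline, so that it is refit on each CV training fold rather than once before the split (the parameter grid and the synthetic data are illustrative):

```python
# Sketch: synthetic data stands in for the (2000, 11) dataset; grid values are examples only.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=11, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),   # applied only to each training fold
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("svc", SVC()),
])
grid = GridSearchCV(pipe,
                    param_grid={"select__k": [5, 8, 11], "svc__C": [0.1, 1, 10]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```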
In NLP, which type of model (generative or discriminative) is more sensitive to the amount of data needed to generalize well? Any references? Is this related to the way the two types capture the data probability (joint vs. conditional probability)?
I currently have a data set shown below: 0.35535 0.32226 0.35594 0.38433 0.32773 0.34685 0.35475 0.37606 0.42278 0.34502 0.45573 0.54538 0.35488 0.40833 0.43780 0.48279 0.34622 0.32314 0.36684 0.41292 0.32893 0.35636 0.36386 0.38715 0.35892 0.33035 0.41856 0.47302 0.33769 0.37625 0.38597 0.42510 0.32681 0.31423 0.35694 0.38962 0.32438 0.34359 0.34893 0.36110 0.31092 0.30892 0.32405 0.33759 0.31260 0.31992 0.32202 0.33002 I am trying to use machine learning to find a formula $f(n) = k$ where $n$ would be each of these data points and …
This question is based on the following intuition: to my understanding, adversarial attacks work because the model is stuck in a local minimum, and the adversarial attack finds this with gradient descent. Could this be used to train a neural network that is able to generalize better? This way the model would be trained on exactly the examples it completely misunderstands. Intuitively it feels like a teacher trying to find where the student misunderstood the topic and then correcting it …
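What is being described sounds close to adversarial training; a minimal PyTorch sketch using FGSM-style perturbations, where each batch is augmented with the inputs perturbed in the direction that increases the loss (model, data and epsilon are placeholders):

```python
# Sketch of FGSM-style adversarial training; data and model are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    x_adv = x.clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()      # step in the loss-increasing direction

for epoch in range(20):
    x_adv = fgsm(X, y)                                     # "examples it misunderstands"
    opt.zero_grad()
    loss = loss_fn(model(X), y) + loss_fn(model(x_adv), y) # clean + adversarial loss
    loss.backward()
    opt.step()
```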
I have an imbalanced dataset (2:1 ratio) with about 60 patients and 80 features. I performed Recursive Feature Elimination (RFE) and stratified cross validation to reduce the features to 15, and I get an AUC of 0.9 with logistic regression and/or SVM. I don't fully trust the AUC I got, because I think it will not generalize correctly given such a small positive class. So I was thinking of oversampling (K-means + PCA) the minority class and re-running the …
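One sanity check that may be relevant here (an assumption about the setup, not a claim about it) is keeping the feature elimination inside the cross-validation loop, since selecting the 15 features on the full 60-patient dataset before CV tends to inflate the AUC. A sketch with synthetic stand-in data:

```python
# Sketch: RFE nested inside stratified CV; the synthetic data mimics 60 patients, 80 features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=60, n_features=80, n_informative=10,
                           weights=[0.67, 0.33], random_state=0)

pipe = Pipeline([
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", np.round(aucs, 2), "mean:", aucs.mean().round(2))
```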
Is it always possible to generalize an overfitted model? I know there are ways to handle overfitting, but can there be scenarios where overfitting cannot be handled in machine learning?
I'm referring to the Kaggle feature creation exercise. The data frame contains a column (MSSubClass) that contains these unique values: 'One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer', 'One_Story_PUD_1946_and_Newer', 'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer', 'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel', 'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages', 'Two_Family_conversion_All_Styles_and_Ages', 'One_and_Half_Story_Unfinished_All_Ages', 'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages', 'One_Story_with_Finished_Attic_All_Ages', 'PUD_Multilevel_Split_Level_Foyer', 'One_and_Half_Story_PUD_All_Ages', and they generalize the values into the following values: 'One', 'Two', 'Split', 'Duplex', 'PUD' (by splitting off the first word). Is this kind of generalization needed if I only use random forests as my algorithm to make predictions? It seems this kind of generalization loses …
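For concreteness, a small pandas sketch of the grouping step the exercise performs (the new column name MSClass and the stub frame here are assumptions):

```python
# Sketch: a stub frame with a few of the listed MSSubClass values.
import pandas as pd

df = pd.DataFrame({"MSSubClass": ["One_Story_1946_and_Newer_All_Styles",
                                  "Two_Story_1946_and_Newer",
                                  "Split_Foyer",
                                  "Duplex_All_Styles_and_Ages",
                                  "One_Story_PUD_1946_and_Newer"]})

# keep only the first underscore-separated token
df["MSClass"] = df["MSSubClass"].str.split("_", n=1).str[0]
print(df["MSClass"].unique())    # ['One' 'Two' 'Split' 'Duplex']
```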
I'm implementing a random forest for a 6-class classification and witnessing a strange phenomenon. I have 10 percent of my set sectioned off as a pseudo validation set. I'm training on 50 percent of the training items (the training items being 90 percent of the whole set) per tree, randomly selected. Now my OOB error is almost the mirror image of my validation error. I'm using an averaged F1 error (i.e. the average of the F1 error per class). As more trees are …
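A sketch of one way to put the OOB estimate and the validation estimate on the same macro-averaged F1 scale (scikit-learn; synthetic 6-class data, with 50% of the training items per tree as in the question):

```python
# Sketch: synthetic data; max_samples=0.5 mimics drawing 50% of training items per tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.10, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_samples=0.5, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

oob_pred = np.argmax(rf.oob_decision_function_, axis=1)     # OOB prediction per training sample
print("OOB        macro F1:", f1_score(y_tr, oob_pred, average="macro"))
print("validation macro F1:", f1_score(y_val, rf.predict(X_val), average="macro"))
```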
Training data: $\mathcal{T} = \{(2,1),(3,2),(4,6),(0,0),(1,1)\}$. You already computed a predictor for the output using linear regression by least squares, where you used the first 3 samples as training samples: $f(X) = -4.5 + 2.5X$. Approximate the generalization error using the validation set approach, i.e. on the remaining validation set. How I started: $$\text{Error} = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance},$$ $$EGE(f, x_0) = \sigma^2_{\varepsilon} + \big[E_{\mathcal{T}}\big(f_{\mathcal{T}}(x_0)\big) - f_{\text{exact}}(x_0)\big]^2 + E_{\mathcal{T}}\Big(f_{\mathcal{T}}(x_0) - E_{\mathcal{T}}\big(f_{\mathcal{T}}(x_0)\big)\Big)^2.$$ How to …
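A sketch of how the validation-set estimate is usually computed, assuming squared-error loss (natural here, since the predictor was fitted by least squares): plug the two held-out points $(0,0)$ and $(1,1)$ into the fitted $f$ and average the squared residuals,
$$\widehat{\mathrm{Err}}_{\text{val}} = \tfrac{1}{2}\Big[\big(0 - f(0)\big)^2 + \big(1 - f(1)\big)^2\Big] = \tfrac{1}{2}\Big[(0-(-4.5))^2 + (1-(-2))^2\Big] = \tfrac{20.25 + 9}{2} = 14.625,$$
with no bias/variance decomposition needed for this estimate.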
In my current research project I'm using the Deep Q-learning algorithm. The setup is as follows: I'm training the model (using Deep Q-learning) on a static dataset made up of experiences extracted from N levels of a given game. Then, I want to use the trained model to solve M new levels of the same game, i.e., I want to test the generalization ability of the agent on new levels of the same game. Currently, I have managed to find …
I have an industrial dataset containing labeled machine data for fault classification (3 classes: 1 OK class, 2 fault classes). The problem is that I have few (~16) different machines, thus I am currently having instance-shift problems: the accuracy on the training set is perfect, but validation on held-out instances fails. As background information, the machine data is time series, from which I extracted statistical (domain-specific) features (14 in total). These features are my dataset for classification. I tried different model …
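One way to make the validation reflect the instance shift is to group the cross-validation by machine, so every validation fold contains only machines unseen during training; a sketch with synthetic stand-in data:

```python
# Sketch: random data stands in for the 14 statistical features per sample; 16 machines assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_machines, samples_per_machine = 16, 40
X = rng.normal(size=(n_machines * samples_per_machine, 14))      # 14 statistical features
y = rng.integers(0, 3, size=n_machines * samples_per_machine)    # 3 classes: OK + 2 faults
groups = np.repeat(np.arange(n_machines), samples_per_machine)   # machine id per sample

cv = LeaveOneGroupOut()                                          # hold out one machine per fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=groups, cv=cv)
print("per-machine accuracy:", np.round(scores, 2))
```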