As far as I know, mini-batches can be used to reduce the variance of the gradient estimate, but I am also wondering whether we can achieve the same effect by using a decreasing step size with only a single sample in each iteration. Can we compare their convergence rates?
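A minimal sketch of the two schemes on a toy least-squares problem (the data, the batch size of 32, and the 1/(1+t) step-size schedule are all assumptions for illustration; this is not a convergence-rate proof):

```python
# Compares (a) mini-batch SGD with a constant step size against
# (b) single-sample SGD with a decreasing step size, on toy least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    # gradient of the mean squared error restricted to the rows in idx
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

# (a) mini-batch SGD, constant step size
w = np.zeros(5)
for t in range(2000):
    idx = rng.integers(0, len(X), size=32)
    w -= 0.05 * grad(w, idx)
print("mini-batch error:", np.linalg.norm(w - w_true))

# (b) single-sample SGD, decreasing step size eta_t = 0.05 / (1 + t / 100)
w = np.zeros(5)
for t in range(2000):
    idx = rng.integers(0, len(X), size=1)
    w -= (0.05 / (1 + t / 100)) * grad(w, idx)
print("single-sample error:", np.linalg.norm(w - w_true))
```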
Suppose we have a dataset of two classes (0 and 1) divided into over 12k mini-batches, where the first half of the dataset (over 6k mini-batches) belongs to class 0 and the other half belongs to class 1. What will happen if a model is trained on this dataset without shuffling the samples?
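A small illustration (my own construction, with an assumed batch size of 100) of what the batch composition looks like with and without shuffling when the data are sorted by class:

```python
import numpy as np

# 12k samples sorted by class: first half class 0, second half class 1
labels = np.array([0] * 6000 + [1] * 6000)

def class1_fraction_per_batch(order, batch_size=100):
    batches = order.reshape(-1, batch_size)
    return np.array([labels[b].mean() for b in batches])

ordered = np.arange(len(labels))
shuffled = np.random.default_rng(0).permutation(len(labels))

print(class1_fraction_per_batch(ordered)[:3])    # [0. 0. 0.] -> every early batch is pure class 0
print(class1_fraction_per_batch(ordered)[-3:])   # [1. 1. 1.] -> every late batch is pure class 1
print(class1_fraction_per_batch(shuffled)[:3])   # roughly 0.5 in each batch
```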
If I do online learning in a setting where I have a HUGE amount of data, is that faster than doing mini-batch learning (even if I optimize my batch size for GPU use, that is, use a multiple of 32 examples per mini-batch)? Details: I have 12600 time series examples, each with 24 time steps, and each time step has 972196 binary labels. This is a multilabel problem. Assuming float32 numbers, loading the entire dataset should require about 1095 GB …
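A quick check of that memory estimate (assuming 4 bytes per float32 value):

```python
examples, time_steps, labels_per_step = 12600, 24, 972196
bytes_total = examples * time_steps * labels_per_step * 4      # float32 = 4 bytes
print(bytes_total / 2**30)                                     # ~1095 GiB, matching the question
```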
What is the technical name for a "batch element" in machine learning? Given a batch of data (size: batchSize*numberOfFeatures), what is the technical name used to refer to an element within the batch (data[batchElementIndex,:])?
I am reading this paper on session-based recommenders with RNNs: https://arxiv.org/abs/1511.06939. During the training phase, the authors apply what they call "session-parallel mini-batches," as depicted in the image below. What is not clear to me is how they take items from different sessions and feed them into the network while maintaining a separate hidden state for each session. The only explanation I could come up with is to maintain as many networks as there are parallel sessions, and use one network …
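A hedged sketch of one possible reading: a single GRU whose hidden state has one row per active session, with a row reset to zero when its session ends and a new session takes that slot. The GRUCell, sizes, and reset logic below are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

n_items, emb_dim, hidden_dim, n_parallel = 1000, 32, 64, 3
embed = nn.Embedding(n_items, emb_dim)
gru = nn.GRUCell(emb_dim, hidden_dim)

h = torch.zeros(n_parallel, hidden_dim)      # one hidden-state row per active session

def step(h, item_ids, session_ended):
    """item_ids: LongTensor [n_parallel]; session_ended: BoolTensor [n_parallel]."""
    h = gru(embed(item_ids), h)                                          # one update for all parallel sessions
    h = torch.where(session_ended.unsqueeze(1), torch.zeros_like(h), h)  # reset finished slots
    return h

h = step(h, torch.tensor([5, 17, 42]), torch.tensor([False, True, False]))
```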
Vowpal Wabbit (VW) uses online normalization as explained here [1]. When running VW with multiple workers, workers synchronize their models with an AllReduce at the end of each epoch. Is it possible or is there any code/paper that explores the idea of doing online learning with multiple workers in a parameter server setting? [1] https://arxiv.org/abs/1305.6646
I have a loss function that's a weighted cross-entropy loss for binary classification:

```python
import numpy as np
from tensorflow.keras import backend as K

def BinaryCrossEntropy_weighted(y_true, y_pred, class_weight):
    y_true = y_true.astype(np.float64)                       # np.float is deprecated; use float64
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    first_term = class_weight[1] * y_true * K.log(y_pred + K.epsilon())
    second_term = class_weight[0] * (1.0 - y_true) * K.log(1.0 - y_pred + K.epsilon())
    loss = -K.mean(first_term + second_term, axis=0)
    return loss
```

And when I run this:

```python
loss = BinaryCrossEntropy_weighted(np.array(y), np.array(predict), class_weight)
```

I got the output `<tf.Tensor: shape=(1,), dtype=float64, numpy=array([0.16916199])>`. If one can observe carefully, can …
I'm creating mini-batches to put into a CNN. Is it best to try to get an even mix of classes into each mini-batch (Scenario 1), or can/should this be a random assortment of my classes (Scenario 2)? Scenario 1: I have 2 classes and a mini-batch size of 32, and I try to have 16 samples from each class in each mini-batch. Scenario 2: Same as 1, but the samples in each mini-batch are randomly distributed. So …
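A small sketch (toy labels, my own helper) of how Scenario-1-style balanced mini-batches could be drawn:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)                     # toy binary labels
idx0, idx1 = np.where(labels == 0)[0], np.where(labels == 1)[0]

def balanced_batch(batch_size=32):
    half = batch_size // 2
    picks = np.concatenate([rng.choice(idx0, half, replace=False),
                            rng.choice(idx1, half, replace=False)])
    rng.shuffle(picks)
    return picks                                             # indices to feed to the CNN

batch = balanced_batch()
print(np.bincount(labels[batch]))                            # [16 16]
```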
I am trying to make a linear model that predicts user preferences and that can be trained in mini-batches, so that it can be trained incrementally. I think sklearn's partial_fit method would work well for this, allowing me to train the linear model as the data comes in gradually. The question I have is whether it is possible to have the model gradually forget the data it was trained on in the past. For example, if for a few …
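A minimal sketch of that incremental setup, assuming SGDRegressor and synthetic mini-batches; how quickly older batches are "forgotten" is governed mainly by the learning-rate schedule, since past data survive only through the current weights:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)    # constant rate keeps recent batches influential

rng = np.random.default_rng(0)
for step in range(100):                                      # data arriving in mini-batches
    X_batch = rng.normal(size=(32, 10))
    y_batch = X_batch @ np.arange(10) + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)                      # updates the weights in place
```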
I have made a convolutional neural network from scratch in Python to classify the MNIST handwritten digits (centered). It is composed of a single convolutional layer with eight 3x3 kernels, a 2x2 max-pooling layer, and a 10-node dense layer with softmax as the activation function. I am using cross-entropy loss and SGD. When I train the network on the whole training set for a single epoch with a batch size of 1, I get 95% accuracy. However, when …
Please read the question completely before you mark it as duplicate. I was trying to understand the syntax of using an LSTM in PyTorch, and I came across the following in the PyTorch docs: h_0: tensor of shape $(D * \text{num\_layers}, N, H_{out})$ containing the initial hidden state for each element in the batch. Defaults to zeros if (h_0, c_0) is not provided. where: \begin{aligned} N ={} & \text{batch size} \\ L ={} & \text{sequence length} \\ D ={} & 2 \text{ …
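A small sketch (toy sizes of my own) confirming the documented h_0 shape on an nn.LSTM; with bidirectional=True the D factor is 2:

```python
import torch
import torch.nn as nn

N, L, H_in, H_out, num_layers = 4, 7, 10, 20, 2
lstm = nn.LSTM(input_size=H_in, hidden_size=H_out, num_layers=num_layers,
               bidirectional=True, batch_first=False)        # bidirectional -> D = 2

x = torch.randn(L, N, H_in)                                  # (sequence, batch, features)
h0 = torch.zeros(2 * num_layers, N, H_out)                   # (D * num_layers, N, H_out)
c0 = torch.zeros(2 * num_layers, N, H_out)

output, (hn, cn) = lstm(x, (h0, c0))
print(output.shape, hn.shape)        # torch.Size([7, 4, 40]) torch.Size([4, 4, 20])
```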
The following multiple-choice question is provided in the "Exam Readiness: AWS Certified Machine Learning - Specialty" document. The correct answer has been marked in the document, but I am not able to understand why this option is correct. Question: "A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist …
I have a question regarding the intuition behind RMSprop. As shown in the lecture video of the Deep Learning Specialization by Andrew Ng, RMSprop helps to reduce the oscillation (along the vertical axis, $b$, as in the example figure) and speeds up convergence toward the minimum by taking longer steps along the horizontal axis. This is achieved by updating our weights as: $$w := w - \alpha\frac{d_{w}}{\sqrt{S_{dw}}}$$ $$b := b - \alpha\frac{d_{b}}{\sqrt{S_{db}}}$$ where $\alpha$ is the learning rate. So, if initially $W$ is small so that $\sqrt{S_{dw}}$ is small, …
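A toy sketch of that update, with made-up gradients (small and steady along $w$, large and oscillating along $b$) to make the intuition concrete:

```python
import numpy as np

w, b = 0.0, 0.0
S_dw, S_db = 0.0, 0.0
beta, alpha, eps = 0.9, 0.01, 1e-8

for t in range(100):
    dw, db = 0.1, 2.0 * (-1) ** t                 # assumed gradients: flat in w, oscillating in b
    S_dw = beta * S_dw + (1 - beta) * dw ** 2     # running averages of the squared gradients
    S_db = beta * S_db + (1 - beta) * db ** 2
    w -= alpha * dw / (np.sqrt(S_dw) + eps)       # step is relatively long where gradients are small
    b -= alpha * db / (np.sqrt(S_db) + eps)       # and damped where they are large and oscillating

print(w, b)   # w moves steadily; b barely drifts despite its large raw gradients
```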
Suppose I have two datasets, $X$ and $Y$, of different sizes. I am training two networks together: one takes inputs $x\in X$ and the other takes inputs $y\in Y$. The two networks share parameters and are therefore trained together. Are there any guidelines on how to choose the batch sizes for the samples from $X$ versus those from $Y$? That is, should the batches from $X$ have the same size as the batches from $Y$? In general, the two …
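One possible arrangement, sketched with assumed shapes, a shared linear trunk, and batch sizes 32 from $X$ and 8 from $Y$; it only shows the mechanics of mixing two batch sizes in a single step, not a recommendation for how to choose them:

```python
import torch
import torch.nn as nn

shared = nn.Linear(16, 8)                                     # shared parameters
net_x = nn.Sequential(shared, nn.ReLU(), nn.Linear(8, 1))     # network for samples from X
net_y = nn.Sequential(shared, nn.ReLU(), nn.Linear(8, 1))     # network for samples from Y

params = {p for m in (net_x, net_y) for p in m.parameters()}  # the set dedups the shared weights
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

batch_x, target_x = torch.randn(32, 16), torch.randn(32, 1)   # batch of 32 from X
batch_y, target_y = torch.randn(8, 16), torch.randn(8, 1)     # batch of 8 from Y

opt.zero_grad()
loss = loss_fn(net_x(batch_x), target_x) + loss_fn(net_y(batch_y), target_y)
loss.backward()
opt.step()
```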
In deep learning model training, inputs are generally passed in batches. For example, when training a deep learning model with a [512]-dimensional input feature vector and, say, a batch size of 4, we pass a [4, 512]-dimensional input. I am curious about the logical significance of passing the same input after flattening it across the batch and channel dimensions, i.e. as [2048]. Logically the locality structure will be destroyed, but will it significantly speed up my implementation? And can it …
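A small illustration (toy tensors) of what that flattening does to the shapes:

```python
import torch

batch = torch.randn(4, 512)        # 4 examples, 512 features each
flat = batch.reshape(1, 2048)      # one "example" with 2048 features; example boundaries are gone
print(batch.shape, flat.shape)     # torch.Size([4, 512]) torch.Size([1, 2048])
```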
I am using both mini-batch SGD (with momentum) and Adam for training a region proposal network. The library used is Keras. The batch size in both cases is 5 and the initial learning rate is 0.01. The learning-rate decay schedule is also the same for both optimizers. The RPN classification loss steadily decreases with SGD with momentum but diverges with Adam. The performance of SGD with momentum is noticeably better after about 500 epochs. Given that everything …
I used mini-batch gradient descent to train the model, but I am unable to get a proper loss graph; it always comes out as a straight line. I know there is something wrong, but would anyone be able to guide me?

```python
from sklearn import metrics

error = []
for epoch in range(epochs):
    for i in range(0, x_train.shape[0], minibatch_size):
        # note: slicing with "i:i + minibatch_size - 1" (as originally written) silently drops
        # the last row of every mini-batch, because the end index of a slice is already exclusive
        x_mini = x_train[i:i + minibatch_size, :]
        y_mini = y_train[i:i + minibatch_size, :]
        # feed forward
        # layer 1
        in1 = x_mini @ w1 + b1
        out1 = sigmoid(in1)
        …
```
What's the proper way to train the algorithm with bigger batches or otherwise speed it up? The Rainbow algorithm is a deep Q-learning, reinforcement-learning algorithm with two neural networks that I would like to speed up or scale up during training. You can read the paper here. Training is fairly slow because the observations have to be converted to tensors and a model update has to be performed after each step. It's kind of a special and unique model, so I hope …
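A generic, hedged sketch (plain DQN-style replay sampling, not the asker's Rainbow code): one common speed-up is to sample a whole batch of transitions and convert the observations to a tensor in a single call, rather than converting them one by one at every step:

```python
import numpy as np
import torch

replay_obs = np.random.rand(1_000, 4, 84, 84).astype(np.float32)   # toy replay buffer of observations

def sample_batch(batch_size=32):
    idx = np.random.randint(0, len(replay_obs), size=batch_size)
    return torch.from_numpy(replay_obs[idx])      # one conversion for the whole batch

batch = sample_batch()
print(batch.shape)                                # torch.Size([32, 4, 84, 84])
```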
In a neural network multilayer perceptron, I understand that the main difference between stochastic gradient descent (SGD) and gradient descent (GD) lies in how many samples are used during training. That is, SGD iteratively chooses one sample, performs a forward pass, and then backpropagates to adjust the weights, as opposed to GD, where backpropagation starts only after the forward pass has been computed for the entire set of samples. My question is: when gradient descent (or even mini-batch …
For a neural network, the weight update equation is: $$W_i := W_i - \eta\,\frac{\partial L}{\partial W_i}$$ However, there are millions of such weights $W_i$. If I am interested in capturing how much each weight/connection $W_i$ changes compared to the other weights, I am using the summed absolute magnitude of the gradient for each weight $W_i$: $$\sum_{t=1}^{k} \left|\frac{\partial L^{(t)}}{\partial W_i}\right|$$ where you are summing the absolute magnitude of the gradients over the entirety of the $k$ training iterations, and the number of training iterations is $k$ = (training dataset size) / (batch size). After computing this summation …
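A hedged PyTorch sketch (placeholder model, data, and optimizer) of accumulating that per-weight sum of absolute gradients over $k$ iterations:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                                     # placeholder network
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# one accumulator tensor per parameter, same shape as the parameter
abs_grad_sum = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

k = 100                                                      # e.g. dataset_size // batch_size
for step in range(k):
    x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))   # placeholder mini-batch
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for name, p in model.named_parameters():
        abs_grad_sum[name] += p.grad.abs()                   # |gradient| accumulated per weight
    opt.step()

print({name: t.sum().item() for name, t in abs_grad_sum.items()})
```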