Online vs minibatch training for speed

If I do online learning in a setting where I have a HUGE amount of data, is that faster than doing minibatch learning (even if I optimize my batch size for GPU use, that is, use a multiple of 32 examples per minibatch)? Details: I have 12600 time series examples, each with 24 time steps, and each time step has 972196 binary labels. This is a multilabel problem. Assuming float32 numbers: loading the entire dataset should take about 1095 GB …
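A quick back-of-the-envelope check of that figure, assuming a dense float32 array with one value per binary label:

    examples, steps, labels = 12600, 24, 972196
    total_bytes = examples * steps * labels * 4    # 4 bytes per float32 value
    print(total_bytes / 2**30)                     # ~1095 GiB, matching the estimate above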
Category: Data Science

How are session-parallel mini-batches used for training RNNs for session-based recommender tasks?

I am reading this paper on session-based recommenders with RNNs: https://arxiv.org/abs/1511.06939. During the training phase, the authors apply what they call "session-parallel mini-batches," as depicted in the image below. What is not clear to me is how they take items from different sessions and feed them into the network while maintaining separate hidden states for each session. The explanation that I could come up with is maintaining as many networks as the number of parallel sessions, and using one network …
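Not the paper's actual implementation, but a minimal sketch of how a single network can serve many sessions at once: the hidden state is a (slots, hidden) matrix with one row per parallel session, and when a session ends only that row is zeroed and the slot is refilled with the next unused session. The names sessions, batch_slots and item_embedding, and the toy data, are assumptions for illustration.

    import torch
    import torch.nn as nn

    batch_slots = 3                                  # sessions processed in parallel
    hidden_size, num_items = 8, 100
    item_embedding = nn.Embedding(num_items, hidden_size)
    cell = nn.GRUCell(hidden_size, hidden_size)      # one network, shared by all slots

    sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]   # toy sessions of item ids

    hidden = torch.zeros(batch_slots, hidden_size)   # one hidden-state row per slot
    active = list(range(batch_slots))                # which session each slot is serving
    cursor = [0] * batch_slots                       # position inside each active session
    next_session = batch_slots

    with torch.no_grad():                            # forward-only illustration
        while any(s is not None for s in active):
            items = torch.tensor([sessions[s][cursor[i]] if s is not None else 0
                                  for i, s in enumerate(active)])
            hidden = cell(item_embedding(items), hidden)   # one step for every slot at once
            for i, s in enumerate(active):
                if s is None:
                    continue
                cursor[i] += 1
                if cursor[i] >= len(sessions[s]) - 1:      # session exhausted (last item is target only)
                    hidden[i] = 0.0                        # reset only this slot's hidden state
                    if next_session < len(sessions):
                        active[i], cursor[i] = next_session, 0
                        next_session += 1
                    else:
                        active[i] = None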
Category: Data Science

Vowpal Wabbit Online Normalization -- Possible to parallelize?

Vowpal Wabbit (VW) uses online normalization as explained here [1]. When running VW with multiple workers, workers synchronize their models with an AllReduce at the end of each epoch. Is it possible or is there any code/paper that explores the idea of doing online learning with multiple workers in a parameter server setting? [1] https://arxiv.org/abs/1305.6646
Category: Data Science

Why does a neural network need the loss as a scalar?

I have a loss function that's a weighted cross entropy loss for binary classification:

    def BinaryCrossEntropy_weighted(y_true, y_pred, class_weight):
        y_true = y_true.astype(np.float)
        y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
        first_term = class_weight[1] * y_true * K.log(y_pred + K.epsilon())
        second_term = class_weight[0] * (1.0 - y_true) * K.log(1.0 - y_pred + K.epsilon())
        loss = -K.mean(first_term + second_term, axis=0)
        return loss

And when I run this:

    loss = BinaryCrossEntropy_weighted(np.array(y), np.array(predict), class_weight)

I got the output <tf.Tensor: shape=(1,), dtype=float64, numpy=array([0.16916199])>. If one can observe carefully, can …
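A small sketch (toy values of my own, not the asker's data) of where the shape (1,) comes from: K.mean(..., axis=0) reduces only the first axis, whereas reducing over all axes yields a true scalar:

    import numpy as np
    from tensorflow.keras import backend as K

    per_example = K.constant(np.array([[0.2], [0.1], [0.4]]))   # shape (3, 1)
    print(K.mean(per_example, axis=0).shape)   # (1,)  -- what the question observes
    print(K.mean(per_example).shape)           # ()    -- a scalar loss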
Category: Data Science

Should mini-batches contain an even mix of classes or can this be random?

I'm creating mini-batches to put into a CNN. Is it best to try and get an even mix of classes into each mini-batch (Scenario 1), or can this/should this be a random assortment of my classes (Scenario 2)? Scenario 1: I have 2 classes and a mini-batch size of 32. I should try and have 16 samples from each class in each mini-batch. Scenario 2: Same as 1, but I have a random distribution of samples in each mini-batch. So …
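A minimal sketch (toy labels, assumed names) of the two scenarios for a binary problem:

    import numpy as np

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)     # toy binary labels
    batch_size = 32

    # Scenario 1: stratified mini-batch, 16 indices drawn from each class
    idx0 = rng.choice(np.where(labels == 0)[0], batch_size // 2, replace=False)
    idx1 = rng.choice(np.where(labels == 1)[0], batch_size // 2, replace=False)
    balanced_batch = np.concatenate([idx0, idx1])

    # Scenario 2: purely random mini-batch, class mix left to chance
    random_batch = rng.choice(len(labels), batch_size, replace=False)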
Category: Data Science

Short-term memory for online/incremental training of a linear model

I am trying to make a linear model that predicts user preferences and that can be trained in mini-batches, so that it can be trained incrementally. I think sklearn's partial_fit function would work well for this, allowing me to train the linear model as the data comes in gradually. The question I have is whether it is possible to have the model gradually forget the data it was trained on in the past. For example, if for a few …
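A minimal sketch of the incremental setup being described, using sklearn's SGDRegressor.partial_fit on mini-batches as they arrive; how strongly older batches are "forgotten" is then governed by the learning-rate schedule rather than by any built-in forgetting switch. The toy data and coefficients are assumptions.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(learning_rate="constant", eta0=0.01)
    rng = np.random.default_rng(0)

    for _ in range(100):                      # mini-batches arriving over time
        X = rng.normal(size=(32, 5))
        y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=32)
        model.partial_fit(X, y)               # update the weights on the newest batch only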
Category: Data Science

Why are mini-batches degrading my conv net MNIST classifier?

I have made a convolutional neural network from scratch in Python to classify the MNIST handwritten digits (centered). It is composed of a single convolutional layer with 8 3x3 kernels, a 2x2 max-pool layer, and a 10-node dense layer with softmax as the activation function. I am using cross-entropy loss and SGD. When I train the network on the whole training set for a single epoch with a batch size of 1, I get 95% accuracy. However, when …
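One thing worth checking in a from-scratch implementation (a guess at a common cause, not a diagnosis of the asker's code) is whether the per-sample gradients are averaged rather than summed over the mini-batch; summing effectively multiplies the learning rate by the batch size. A toy NumPy illustration with made-up gradient arrays:

    import numpy as np

    rng = np.random.default_rng(0)
    learning_rate, batch_size = 0.01, 32
    w = rng.normal(size=(8, 3, 3))                             # e.g. the 8 conv kernels
    grads_per_sample = [rng.normal(size=(8, 3, 3)) for _ in range(batch_size)]

    batch_grad = np.mean(grads_per_sample, axis=0)             # average, NOT sum, over the batch
    w -= learning_rate * batch_grad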
Category: Data Science

Hidden state dimensions in PyTorch LSTM

Please read the question completely before you mark it as a duplicate. I was trying to understand the syntax of using an LSTM in PyTorch, and I came across the following in the PyTorch docs: h_0: tensor of shape $(D * \text{num\_layers}, N, H_{out})$ containing the initial hidden state for each element in the batch. Defaults to zeros if (h_0, c_0) is not provided. where: \begin{aligned} N ={} & \text{batch size} \\ L ={} & \text{sequence length} \\ D ={} & 2 \text{ …
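A short runnable check of those shapes (sizes chosen arbitrarily):

    import torch
    import torch.nn as nn

    N, L, H_in, H_out, num_layers = 4, 7, 10, 16, 2
    D = 2                                          # because bidirectional=True below
    lstm = nn.LSTM(input_size=H_in, hidden_size=H_out,
                   num_layers=num_layers, bidirectional=True)

    x   = torch.randn(L, N, H_in)                  # (L, N, H_in) with batch_first=False
    h_0 = torch.zeros(D * num_layers, N, H_out)    # the shape quoted from the docs
    c_0 = torch.zeros(D * num_layers, N, H_out)

    output, (h_n, c_n) = lstm(x, (h_0, c_0))
    print(output.shape)                            # (L, N, D * H_out)
    print(h_n.shape)                               # (D * num_layers, N, H_out)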
Category: Data Science

Tuning Batch size and Learning rate in neural net

The following multiple-choice question is provided in the "Exam Readiness: AWS Certified Machine Learning - Specialty" document. The correct answer has been marked in the document, but I am not able to understand why this option is correct. Question: "A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist …
Category: Data Science

RMSprop in weight update - what if vertical slopes small and horizontal slopes large?

I have a question regarding the intuition behind RMSprop. As shown in the lecture video of the Deep Learning Specialization by Andrew Ng, RMSprop helps to reduce the oscillation (along the vertical axis, the parameter $b$ in the example figure) and to speed up convergence toward the minimum by taking longer steps along the horizontal axis. This is achieved by updating our weights as: $$w := w - \frac{d_{w}}{\sqrt{S_{dw}}}$$ $$b := b - \frac{d_{b}}{\sqrt{S_{db}}}$$ So, if initially $W$ is small so $\sqrt{S_{dw}}$ is small, …
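For reference, the full RMSprop update as it is usually written in that course (keeping the question's notation, and including the learning rate $\alpha$, the decay factor $\beta$ and a small $\varepsilon$ for numerical stability) is:

$$S_{dw} = \beta S_{dw} + (1-\beta)\, d_w^{2}, \qquad S_{db} = \beta S_{db} + (1-\beta)\, d_b^{2}$$

$$w := w - \alpha \frac{d_w}{\sqrt{S_{dw}} + \varepsilon}, \qquad b := b - \alpha \frac{d_b}{\sqrt{S_{db}} + \varepsilon}$$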
Category: Data Science

Minibatches when training on two datasets of different size

Suppose I have two datasets, $X$ and $Y$, of different sizes. I am training two networks together: one takes inputs $x\in X$, and the other takes inputs $y\in Y$. The two networks share parameters and therefore are trained together. Are there some guidelines on how to choose the batch sizes for the samples from $X$ vs. those from $Y$? That is, should the batches from $X$ have the same size as the batches from $Y$? In general, the two …
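A minimal sketch (toy arrays, assumed names) of one common arrangement: draw one batch from each dataset per step, with the batch sizes chosen so that an epoch over $X$ roughly lines up with an epoch over $Y$:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 8))      # larger dataset
    Y = rng.normal(size=(2500, 8))       # smaller dataset

    batch_x = 64
    batch_y = max(1, round(batch_x * len(Y) / len(X)))   # here 16, so the two cycle together

    for step in range(len(X) // batch_x):
        xb = X[rng.choice(len(X), batch_x, replace=False)]
        yb = Y[rng.choice(len(Y), batch_y, replace=False)]
        # ... forward each network on its batch, combine the losses, update shared parameters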
Category: Data Science

Sequential batch processing vs parallel batch processing?

In deep-learning model training, a batch of inputs is generally passed. For example, for training a deep learning model with a [512]-dimensional input feature vector and, say, a batch size of 4, we mainly pass a [4, 512]-dimensional input. I am curious what the logical significance is of passing the same input after flattening it across the batch and channel dimensions into [2048]. Logically the locality structure will be destroyed, but will it significantly speed up my implementation? And can it …
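A small shape check of the two layouts being compared (toy tensors, assumed sizes):

    import numpy as np

    batch, features = 4, 512
    x = np.random.randn(batch, features)          # the usual [4, 512] batched input

    W = np.random.randn(features, 256)
    out_batched = x @ W                           # one matmul, shape (4, 256); per-example structure kept

    x_flat = x.reshape(-1)                        # flattened to [2048]
    # a weight matrix for the flattened view would need shape (2048, ...),
    # mixing all four examples together and discarding the per-example locality
    print(x.shape, out_batched.shape, x_flat.shape)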
Category: Data Science

Minibatch SGD performs better than Adam for Region proposal network training

I am using both minibatch SGD (with momentum) and Adam for training a region proposal network. The library used is Keras. The batch size in both cases is 5 and the initial learning rate is 0.01. The learning rate decay schedule is also the same for both optimizers. The RPN classification loss steadily reduces in the case of SGD with momentum but diverges in the case of Adam. The performance of SGD with momentum is noticeably better after about 500 epochs. Given that everything …
Category: Data Science

How do I get the loss function graph?

I used mini-batch gradient descent to train the model, but I am unable to get a proper loss graph. The loss graph always shows up as a straight line. I know there is something wrong; would anyone be able to guide me?

    from sklearn import metrics

    error = []
    for epoch in range(epochs):
        for i in range(0, x_train.shape[0], minibatch_size):
            x_mini = x_train[i:i + minibatch_size-1, :]
            y_mini = y_train[i:i + minibatch_size-1, :]
            # feed forward
            # layer 1
            in1 = x_mini @ w1 + b1
            out1 = sigmoid(in1)
            …
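Not the asker's network, but a self-contained toy version of the same loop showing the two things a proper loss curve usually needs: slicing with i:i + minibatch_size (the -1 in the original silently drops one sample per batch) and appending a loss value every epoch before plotting. The linear "model" and the data are stand-ins.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x_train = rng.normal(size=(200, 3))                  # toy data standing in for the real set
    y_train = x_train @ np.array([[1.0], [-2.0], [0.5]])
    w = rng.normal(size=(3, 1)) * 0.1
    epochs, minibatch_size, lr = 50, 32, 0.01

    error = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        for i in range(0, x_train.shape[0], minibatch_size):
            x_mini = x_train[i:i + minibatch_size, :]    # note: no "-1" in the slice
            y_mini = y_train[i:i + minibatch_size, :]
            pred = x_mini @ w                            # toy linear layer in place of the real network
            grad = x_mini.T @ (pred - y_mini) / len(x_mini)
            w -= lr * grad
            epoch_loss += np.mean((pred - y_mini) ** 2)
        error.append(epoch_loss)                         # one point per epoch gives an actual curve

    plt.plot(error)
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.show()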
Category: Data Science

Can the 'Rainbow Algorithm' be scaled up and sped up?

What's the proper way to train the algorithm with bigger batches or otherwise speed it up? The 'Rainbow' algorithm is a deep Q-learning reinforcement learning algorithm with two neural networks that I would like to speed up or scale up during training. You can read the paper here. Training is fairly slow because the observations have to be converted to tensors and the model has to be updated after each step. It's kind of a special and unique model, so I hope …
Category: Data Science

How backpropagation through gradient descent represents the error after each forward pass

In a neural-network multilayer perceptron, I understand that the main difference between stochastic gradient descent (SGD) and gradient descent (GD) lies in how many samples are used for each update during training. That is, SGD iteratively chooses one sample, performs a forward pass, and then backpropagates to adjust the weights, as opposed to GD, where backpropagation starts only after the forward pass has been computed for all samples. My question is: when gradient descent (or even mini-batch …
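A compact sketch of the difference in update timing on a toy least-squares problem (the data and learning rate are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5])
    w = np.zeros(3)
    lr = 0.01

    # GD: one forward pass over ALL samples, then a single weight update from the averaged gradient
    grad = X.T @ (X @ w - y) / len(X)
    w -= lr * grad

    # SGD: forward pass and immediate update for each sample in turn
    for xi, yi in zip(X, y):
        grad_i = xi * (xi @ w - yi)      # the error of this one forward pass drives the update
        w -= lr * grad_i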
Category: Data Science

Compare rate of change for multiple objects/weights

For a neural network, the weight update equation is the standard gradient-descent step (written out below). However, there are millions of such weights $W_i$. If I am interested in capturing how much each weight/connection $W_i$ is changing compared to the other weights, I am using the absolute-magnitude gradient summation for each weight $W_i$ (also written out below), where you sum the absolute magnitudes of the gradients over the entirety of $k$ training iterations; the number of training iterations $k$ = train dataset size / batch size. After computing this summation …
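Written out (with $L$ the loss and $\eta$ the learning rate, symbols introduced here for concreteness), the per-weight update and the proposed accumulation over $k$ iterations would be:

$$W_i := W_i - \eta\, \frac{\partial L}{\partial W_i}$$

$$G_i = \sum_{t=1}^{k} \left| \frac{\partial L}{\partial W_i} \right|_{t}$$

where $G_i$ is just a label for the summed gradient magnitude that gets compared across weights.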
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.