Is the cross-entropy loss important at all, given that during backpropagation only the softmax probability and the one-hot vector are relevant?

Is the cross-entropy loss (CEL) important at all, given that during backpropagation (BP) only the softmax (SM) probability and the one-hot vector are relevant? When applying BP, the derivative of the CEL with respect to the softmax input is simply the difference between the output probability (SM) and the one-hot encoded vector. To me it seems that the CEL value itself, however sophisticated, plays no role in learning. I'm expecting a fallacy in my reasoning, so could somebody please help me out?
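A minimal numerical sketch (my own illustration, not part of the question) that checks the claim: for softmax combined with cross-entropy, the gradient of the loss with respect to the logits is exactly the softmax output minus the one-hot target.

    import numpy as np

    def softmax(z):
        z = z - z.max()                      # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def cross_entropy(z, y):
        # y is a one-hot vector, z are the logits
        return -np.sum(y * np.log(softmax(z)))

    z = np.array([2.0, -1.0, 0.5])
    y = np.array([0.0, 1.0, 0.0])

    analytic = softmax(z) - y                # the claimed gradient: p - y

    # central-difference check of dL/dz
    eps = 1e-6
    numeric = np.array([
        (cross_entropy(z + eps * np.eye(3)[i], y) -
         cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
        for i in range(3)
    ])

    print(np.allclose(analytic, numeric))    # True

The simple form p - y is what comes out of differentiating the cross-entropy loss through the softmax, so the loss still defines the objective being minimized even though its value never appears explicitly in the update.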
Category: Data Science

Sample variance matrix normal distribution in R

I'm trying to perform a multinomial logistic regression in R using the Metropolis-Hastings algorithm, with a matrix normal distribution as the proposal. I'm using the function rmatrixnorm() from the package LaplacesDemon to sample from the proposal distribution. I followed this strategy since I need a vector of parameters $\underline{\beta_{k}}$, with $k=1,\dots,K$ (the number of classes involved in the classification). At the end of the Monte Carlo iterations, my procedure retrieves the sample mean and the sample covariance of the posterior …
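A small sketch, in Python rather than R, of drawing a proposal from a matrix normal distribution; scipy.stats.matrix_normal plays the role that rmatrixnorm() from LaplacesDemon plays in the question, and the dimensions and covariances below are made up for illustration.

    import numpy as np
    from scipy.stats import matrix_normal

    K, p = 3, 4                              # hypothetical: classes x predictors
    mean = np.zeros((K, p))                  # current state of the chain
    rowcov = 0.1 * np.eye(K)                 # covariance across classes
    colcov = 0.1 * np.eye(p)                 # covariance across predictors

    # one Metropolis-Hastings proposal: a K x p matrix of coefficients beta_k
    beta_proposal = matrix_normal.rvs(mean=mean, rowcov=rowcov, colcov=colcov)
    print(beta_proposal.shape)               # (3, 4)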
Category: Data Science

CNN Eliminate Wrong Results

I extracted images of human faces from videos, but the pipeline also captured images without faces. I wrote a CNN for emotion classification. For clear pictures, the softmax output in the last layer concentrates on one class; for example, in a photo that is clearly happy, the happy class gets a probability of about 0.95. But if there is no face in the picture, the probability disperses across classes, e.g. 0.3 and 0.2. …
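One common way to handle such inputs, shown here only as a sketch with a made-up threshold, is to reject a prediction when the softmax output is too flat to be trusted:

    import numpy as np

    def predict_with_rejection(probs, threshold=0.6):
        """Return the predicted class, or None if the softmax output is too flat."""
        probs = np.asarray(probs)
        if probs.max() < threshold:
            return None                      # probably no face / unreliable input
        return int(probs.argmax())

    print(predict_with_rejection([0.95, 0.02, 0.01, 0.01, 0.01]))  # 0
    print(predict_with_rejection([0.30, 0.20, 0.20, 0.15, 0.15]))  # None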
Category: Data Science

Backpropagation with log likelihood cost function and softmax activation

In the online book on neural networks by Michael Nielsen, in chapter 3, he introduces a cost function called the log-likelihood cost, defined as $$ C = -\ln(a_y^L) $$ Suppose we have 10 output neurons. When backpropagating the error, only the gradient w.r.t. the $y^{th}$ output neuron is non-zero and all others are zero. Is that right? If so, how is equation (81) below true? $$\frac{\partial C}{\partial b_j^L} = a_j^L - y_j $$ I'm getting the expression …
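A short derivation sketch (assuming the softmax output layer that Nielsen pairs with this cost, with $z_j^L$ the weighted input to output neuron $j$) of why the gradient is non-zero for every $j$, not only $j = y$:
$$ a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}, \qquad C = -\ln a_y^L = -z_y^L + \ln\sum_k e^{z_k^L} $$
$$ \frac{\partial C}{\partial z_j^L} = -\delta_{jy} + \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} = a_j^L - y_j, \qquad \frac{\partial C}{\partial b_j^L} = \frac{\partial C}{\partial z_j^L}\,\frac{\partial z_j^L}{\partial b_j^L} = a_j^L - y_j, $$
since $\partial z_j^L / \partial b_j^L = 1$. Only the first term $-\delta_{jy}$ vanishes for $j \neq y$; the second term, which comes from the softmax normalization, does not.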
Category: Data Science

Multiclass Classification with Decision Trees: Why do we calculate a score and apply softmax?

I'm trying to figure out why, when using decision trees for multi-class classification, it is common to calculate a score and apply softmax, instead of just averaging the terminal nodes' probabilities. Let's say our model is two trees. A terminal node of tree 1 has example 14 in a node with 20% class 1, 60% class 2, and 20% class 3. A terminal node of tree 2 has example 14 in a node with 100% class …
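A small numpy sketch of the two aggregation schemes being contrasted, with made-up leaf values: the first simply averages the leaf class frequencies, the second sums per-class raw scores (as gradient-boosted trees do) and then applies softmax.

    import numpy as np

    # leaf outputs for the same example from two trees (made-up numbers)
    leaf_probs = np.array([[0.2, 0.6, 0.2],     # tree 1: class frequencies in the leaf
                           [1.0, 0.0, 0.0]])    # tree 2
    leaf_scores = np.array([[0.1, 0.8, 0.1],    # per-class raw scores, as in boosting
                            [1.5, -0.2, -0.3]])

    # scheme 1: average the leaf probabilities
    averaged = leaf_probs.mean(axis=0)

    # scheme 2: sum the scores across trees, then softmax
    s = leaf_scores.sum(axis=0)
    softmaxed = np.exp(s - s.max()) / np.exp(s - s.max()).sum()

    print(averaged, softmaxed)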
Category: Data Science

using logsumexp in softmax

I saw the following in somebody's code as an alternative way of implementing the softmax, meant to avoid numerical problems caused by dividing by very large exponentials; the idea is that softmax = exp(matrix - logsumexp(matrix)) = exp(matrix) / sumexp(matrix):

    logsumexp = scipy.special.logsumexp(matrix, axis=-1, keepdims=True)
    softmax = np.exp(matrix - logsumexp)

I understand that when you take the log of an expression involving division you subtract, i.e. log(1/2) = log(1) - log(2). However, in the implementation above, shouldn't they also log the matrix in …
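A runnable sketch of that trick on a toy matrix (assuming numpy and scipy are available); the stabilized version agrees with the naive one wherever the naive exponentials don't overflow.

    import numpy as np
    from scipy.special import logsumexp

    matrix = np.array([[1.0, 2.0, 3.0],
                       [1000.0, 1001.0, 1002.0]])   # second row overflows a naive exp

    # naive softmax: np.exp(1000) overflows to inf, giving nan after the division
    naive = np.exp(matrix) / np.exp(matrix).sum(axis=-1, keepdims=True)

    # log-sum-exp trick: exp(x - logsumexp(x)) == exp(x) / sum(exp(x))
    lse = logsumexp(matrix, axis=-1, keepdims=True)
    stable = np.exp(matrix - lse)

    print(naive[0], stable[0])   # the two agree on the first row
    print(naive[1], stable[1])   # naive row is nan, stable row is fine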
Category: Data Science

Difference in performance Sigmoid vs. Softmax

For the same binary image classification task, if in the final layer I use 1 node with the sigmoid activation function and the binary_crossentropy loss function, then training goes through pretty smoothly (92% accuracy after 3 epochs on validation data). However, if I change the final layer to 2 nodes and use the softmax activation function with the sparse_categorical_crossentropy loss function, then the model doesn't seem to learn at all and gets stuck at 55% accuracy (the ratio of the negative class). …
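A minimal Keras sketch of the two output heads being compared; the backbone, layer sizes, and input shape are placeholders, not the asker's architecture.

    import tensorflow as tf

    def make_model(two_node_head: bool):
        base = [tf.keras.layers.Flatten(input_shape=(64, 64, 3)),   # placeholder backbone
                tf.keras.layers.Dense(128, activation="relu")]
        if two_node_head:
            # 2 nodes + softmax, integer labels 0/1, sparse categorical cross-entropy
            head = tf.keras.layers.Dense(2, activation="softmax")
            loss = "sparse_categorical_crossentropy"
        else:
            # 1 node + sigmoid, labels 0/1, binary cross-entropy
            head = tf.keras.layers.Dense(1, activation="sigmoid")
            loss = "binary_crossentropy"
        model = tf.keras.Sequential(base + [head])
        model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
        return model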
Category: Data Science

Dot product for similarity in word to vector computation in NLP

In NLP, when computing word vectors (word2vec) we try to maximize $\log P(o \mid c)$, where $P(o \mid c)$ is the probability that $o$ is an outside (context) word given that $c$ is the center word: $$P(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{T} \exp(u_w^{T} v_c)}$$ Here $u_o$ is the word vector for the outside word, $v_c$ is the word vector for the center word, and $T$ is the number of words in the vocabulary. The equation above is a softmax, and the dot product of $u_o$ and $v_c$ acts as a score, which should be higher the better. If words $o$ and $c$ are closer then their dot product should …
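A tiny numpy sketch of that softmax over dot-product scores, with random vectors standing in for trained embeddings:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 8                         # toy vocabulary size and embedding dimension
    U = rng.normal(size=(T, d))         # outside-word vectors u_w
    v_c = rng.normal(size=d)            # center-word vector

    scores = U @ v_c                    # dot products u_w . v_c
    scores -= scores.max()              # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()

    o = 2                               # index of some outside word
    print(probs[o], np.log(probs[o]))   # P(o|c) and log P(o|c)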
Category: Data Science

Train a model when the input can contain a smaller set of output options along with the correct output

I have service order lines used to charge customers, and each line needs to be assigned to an actual product. If the customer has only one product, all lines are set to that product. But if there are many products, a trained employee currently does the matching of each line to one product, e.g.:

    enforcement fees -----------> backup YYY transmitter
    text message fees ----------> main XXX transmitter installation
    installation fee -----------> main XXX transmitter installation

I can train all the kinds …
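One way to picture this setup (my assumption, since the question is truncated) is to score each order line against every catalog product but normalize with a softmax restricted to the products the customer actually has:

    import numpy as np

    def masked_softmax(scores, candidate_mask):
        """Softmax over only the customer's own products."""
        scores = np.where(candidate_mask, scores, -np.inf)   # rule out other products
        scores = scores - scores[candidate_mask].max()       # numerical stability
        e = np.exp(scores)
        return e / e.sum()

    # hypothetical scores for one order line against 6 catalog products
    scores = np.array([1.2, -0.3, 0.7, 2.1, 0.0, -1.5])
    candidate_mask = np.array([True, False, True, True, False, False])  # customer owns 3

    print(masked_softmax(scores, candidate_mask))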
Category: Data Science

How to calculate the temperature variable in softmax (Boltzmann) exploration

Hi, I am developing a reinforcement learning agent for a continuous state / discrete action space. I am trying to use Boltzmann/softmax exploration as the action selection strategy. My action space is of size 5000. My implementation of Boltzmann exploration:

    def get_action(state, episode, temperature=1):
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        prob_act = np.empty(len(q_values[0]))
        for i in range(len(prob_act)):
            prob_act[i] = np.exp(q_values[0][i] / temperature)
        # element-wise division by the denominator (the sum of the numerators)
        prob_act = np.true_divide(prob_act, sum(prob_act))
        action_q_value = np.random.choice(q_values[0], p=prob_act)
        action_keys = np.where(q_values[0] == action_q_value)
        action_key …
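A compact, numerically stabilized sketch of the same idea; it samples the action index directly, which also sidesteps the np.where lookup at the end of the question's function (that lookup can match several actions when Q-values are tied). The Q-values here are random stand-ins for model.predict(state_encod)[0].

    import numpy as np

    def boltzmann_action(q_values, temperature=1.0):
        """Sample an action index from a temperature-scaled softmax over Q-values."""
        logits = np.asarray(q_values, dtype=float) / temperature
        logits -= logits.max()                   # avoid overflow in exp
        probs = np.exp(logits)
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    q = np.random.randn(5000)                    # stand-in for the model's Q-values
    print(boltzmann_action(q, temperature=0.5))  # lower temperature -> greedier choice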
Category: Data Science

neural network binary classification: Softmax, LogSoftmax, and loss function

I am building a binary classifier where the class I want to predict is present only <2% of the time. I am using PyTorch. The last layer could be LogSoftmax or Softmax: self.softmax = nn.Softmax(dim=1) or self.softmax = nn.LogSoftmax(dim=1). My questions: Should I use Softmax, since it provides outputs that sum to 1 and lets me check performance at various probability thresholds? Is that understanding correct? If I use Softmax, can I then use cross_entropy loss? This seems to …
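A small PyTorch sketch of the two standard pairings, shown with toy tensors: nn.CrossEntropyLoss applies LogSoftmax internally and therefore expects raw logits, while nn.NLLLoss expects the output of LogSoftmax.

    import torch
    import torch.nn as nn

    logits = torch.randn(4, 2)            # raw outputs of the last linear layer
    targets = torch.tensor([0, 1, 0, 0])  # class indices

    # option A: feed raw logits to CrossEntropyLoss (it applies LogSoftmax itself)
    loss_a = nn.CrossEntropyLoss()(logits, targets)

    # option B: apply LogSoftmax explicitly and use NLLLoss
    log_probs = nn.LogSoftmax(dim=1)(logits)
    loss_b = nn.NLLLoss()(log_probs, targets)

    print(torch.isclose(loss_a, loss_b))  # True: the two are equivalent

    # probabilities for thresholding / PR curves can be recovered either way
    probs = torch.softmax(logits, dim=1)  # equals log_probs.exp()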
Category: Data Science

classification using LogSoftmax vs Softmax and calculating precision-recall curve?

In binary classification we could get the final output using LogSoftmax or Softmax. With Softmax we get results that add up to 1. I understand that LogSoftmax penalizes wrong classifications more heavily and has a few other mathematical advantages. I have a binary classification problem with class 1 occurring very rarely (<2% of the time). My question: if I am using a probability cutoff of 0.5 (predicting class 1 if the probability is above 0.5) with Softmax, will I get the same …
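A short sketch of the relationship between the two outputs (standard PyTorch behaviour, toy numbers): exponentiating the LogSoftmax output recovers the Softmax probabilities, so a cutoff of 0.5 on the probability corresponds to a cutoff of log(0.5) on the log-probability and yields the same decisions.

    import torch

    logits = torch.tensor([[2.0, -1.0],
                           [0.2, 0.4]])

    probs = torch.softmax(logits, dim=1)
    log_probs = torch.log_softmax(logits, dim=1)

    print(torch.allclose(probs, log_probs.exp()))          # True
    print(probs[:, 1] > 0.5)                               # same decisions ...
    print(log_probs[:, 1] > torch.log(torch.tensor(0.5)))  # ... with this cutoff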
Category: Data Science

Distilling the knowledge of a binary cross entropy with sigmoid function model to a softmax model

I have a complex CNN architecture that uses binary cross-entropy and a sigmoid function for classification. However, due to hardware constraints I would like to compress my model using knowledge distillation, and unfortunately most papers deal with knowledge distillation using two models with softmax and sparse categorical cross-entropy for distilling the knowledge of the larger network. I'd like to know if it is possible to use a complex model that uses binary cross-entropy and a sigmoid function for activation …
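One identity worth sketching here (a standard fact, not something stated in the truncated question): a sigmoid over a single logit z is exactly a softmax over the two logits [0, z], so a sigmoid/BCE teacher's output can be rewritten as two-class soft targets for a softmax student.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = 1.3                                     # hypothetical teacher logit
    p = sigmoid(z)                              # teacher probability of the positive class

    two_class_targets = np.array([1.0 - p, p])  # soft targets for a softmax student
    print(np.allclose(two_class_targets, softmax(np.array([0.0, z]))))  # True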
Category: Data Science

Using SVM as final layer in Convolutional Neural Network

I am working on the implementation of a hybrid CNN-SVM, where I define the use of the SVM in the last layer of the CNN, as shown in this code:

    # Flattening
    cnn.add(tf.keras.layers.Flatten())
    # Full connection
    cnn.add(tf.keras.layers.Dense(units=128, activation='relu'))
    cnn.add(Dense(4, kernel_regularizer=tf.keras.regularizers.l2(0.01), activation='softmax'))
    cnn.compile(optimizer='adam', loss='squared_hinge', metrics=['accuracy'])

In the case of the plain CNN (without adding the SVM), we can define the last part of the CNN as below:

    def calculate_softmax(data):
        result = np.exp(data)
        return result / result.sum()   # normalize so the outputs sum to 1

    softmax = calculate_softmax(temp)
    prediction = softmax.argmax()

where …
Topic: softmax cnn svm
Category: Data Science

Derivative of a custom loss function with the logistic function

I have a custom loss function with $\mu, p, o, u, v$ as variables, where $\sigma$ is the logistic function, and I need to differentiate this loss function. Because there are multiple variables in the loss function, do I need to use the softmax function, which is the generalization of the logistic function? $$L = -\frac{1}{N}\sum_{i,j \in S} a_j \big\{ y_{i,j}\log[\sigma(\mu + p_i + o_j + u^{T}_{i} v_{j})] + (1 - y_{i,j})\log[1 - \sigma(\mu + p_i + o_j + u^{T}_{i} v_{j})] \big\}$$ As far as I understand, it is a multivariate …
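A short worked step (my own sketch, writing $z_{i,j} = \mu + p_i + o_j + u_i^{T} v_j$ for the argument of $\sigma$) for the derivative of one term, using only the chain rule through the logistic function:
$$ \frac{\partial}{\partial z}\Big[ y\log\sigma(z) + (1-y)\log(1-\sigma(z)) \Big] = y\,(1-\sigma(z)) - (1-y)\,\sigma(z) = y - \sigma(z), $$
so each term contributes $\partial L/\partial z_{i,j} = -\tfrac{a_j}{N}\,\big(y_{i,j} - \sigma(z_{i,j})\big)$, and the derivatives with respect to $\mu, p_i, o_j, u_i, v_j$ follow by the chain rule through $z_{i,j}$.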
Category: Data Science

How to prove Softmax Numerical Stability?

I was playing around with the softmax function and experimenting with its numerical stability. If we shift the exponents in the numerator and denominator by the same value, the output of the softmax stays constant (see the picture below, where -Smax is added to every exponent). I cannot figure out how to prove this shift invariance (although I read that it is true). Can anyone help me with the proof?
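A one-line proof sketch of that shift invariance: for any constant $c$ (in particular $c = S_{\max} = \max_k s_k$),
$$ \mathrm{softmax}(s - c)_i = \frac{e^{s_i - c}}{\sum_k e^{s_k - c}} = \frac{e^{-c}\,e^{s_i}}{e^{-c}\sum_k e^{s_k}} = \frac{e^{s_i}}{\sum_k e^{s_k}} = \mathrm{softmax}(s)_i. $$
Subtracting the maximum makes every exponent non-positive, so $e^{s_i - c}$ cannot overflow, which is why this particular shift is used for numerical stability.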
Category: Data Science

Can a single label be a vector/matrix in a neural network and not a scalar?

My training data consists of individual sentences, and each sentence has a few labels (say 10), each of which has a discrete score from 1-10 -- so in essence, a single training example has a label that is not a scalar but rather a matrix/vector of shape (10,10) or (1,10*10). Can a softmax adjust the weights in accordance with a label that is itself a matrix/vector? I'm looking to fine-tune a model that has this capability. Thanks.
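One way to picture this (a sketch under my assumption that each of the 10 labels is an independent 10-way choice) is to have the model output a (10, 10) block of logits and apply softmax plus cross-entropy along the last axis, one row per label:

    import torch
    import torch.nn.functional as F

    n_labels, n_scores = 10, 10
    logits = torch.randn(1, n_labels, n_scores)          # model output for one sentence
    target = torch.randint(0, n_scores, (1, n_labels))   # score index (0-9) per label

    # softmax/cross-entropy applied per label: fold the labels into the batch dimension
    loss = F.cross_entropy(logits.view(-1, n_scores), target.view(-1))
    print(loss)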
Category: Data Science

Is there a Softmax-like transformation with scale-invariance and linearity?

At the moment I'm using XGBoost to generate a prediction of probabilities with a custom objective-function to build something like an expert system. To do so I need to transform the raw XGBoost predictions into a probability distribution, where every value lies in the range from 0 to 1 and they all sum up to 1. Naturally you start out with the Softmax transformation. But as it turns out this function has some significant drawbacks for this kind of application. …
Category: Data Science

Cross-entropy loss explanation

Suppose I build a neural network for classification. The last layer is a dense layer with Softmax activation. I have five different classes to classify. Suppose for a single training example, the true label is [1 0 0 0 0] while the predictions be [0.1 0.5 0.1 0.1 0.2]. How would I calculate the cross entropy loss for this example?
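A worked sketch for exactly those numbers: with a one-hot target, the cross-entropy reduces to minus the log of the probability assigned to the true class.

    import numpy as np

    y_true = np.array([1, 0, 0, 0, 0])
    y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

    # cross-entropy: -sum_i y_i * log(p_i); only the true class contributes
    loss = -np.sum(y_true * np.log(y_pred))
    print(loss)     # -log(0.1) ≈ 2.3026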
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.