A question about activation functions in my neural network project

I want to implement a neural network model using scikit-learn, and I want to know which activation function I should use. I have 10 input variables and one output. All variables are floats (positive), and the output is a percentage (0 to 100). My model is not linear in the output variable, so I'll create a regression model with one hidden layer.
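For reference, here is a minimal sketch (my own, with made-up data and an illustrative hidden-layer size) of how such a model could look in scikit-learn: the hidden layer uses ReLU, MLPRegressor's output activation is always the identity, and the percentage target is simply rescaled to [0, 1] and back.

```python
# Minimal sketch: one-hidden-layer regression on a percentage target with scikit-learn.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.rand(200, 10)           # 10 positive float inputs (dummy data)
y_pct = np.random.rand(200) * 100     # target as a percentage in [0, 100]
y = y_pct / 100.0                     # scale target to [0, 1] for training

# ReLU hidden activation is a common default; the output layer is always linear.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                 max_iter=2000, random_state=0),
)
model.fit(X, y)

pred_pct = np.clip(model.predict(X), 0, 1) * 100   # back to a percentage
```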
Category: Data Science

Intuitively, why do Non-monotonic Activations Work?

The swish/SiLU activation is very popular, and many would argue it has dethroned ReLU. However, it is non-monotonic, which seems to go against popular intuition (at least on this site: example 1, example 2). Reading the swish paper, the justification that the authors give is that non-monotonicity "increases expressivity and improves gradient flow... [and] may also provide some robustness to different initializations and learning rates." The authors provide an image to back up this claim, but at best this argument …
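For concreteness, swish is usually written $\mathrm{swish}(x) = x \cdot \sigma(x)$; the short sketch below (mine, not from the question) just evaluates it on a grid to show the non-monotonic dip for negative inputs.

```python
# Minimal sketch: evaluate swish/SiLU on a grid and locate its dip.
# swish(x) = x * sigmoid(x) decreases and then increases again for negative x,
# reaching a minimum of about -0.28 near x ≈ -1.28 — that is the non-monotonic part.
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)
y = swish(x)
print("min value:", y.min(), "at x =", x[y.argmin()])
```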
Category: Data Science

Is it possible to implement a vectorized version of a Maxout activation function?

I want to implement an efficient and vectorized Maxout activation function using Python NumPy. Here is the paper in which "Maxout Network" was introduced (by Goodfellow et al). For example, if k = 2: def maxout(x, W1, b1, W2, b2): return np.maximum(np.dot(W1.T,x) + b1, np.dot(W2.T, x) + b2) where x is an N*D matrix. Suppose k is an arbitrary value (say 5). Is it possible to avoid for loops when calculating each wx + b? I couldn't come up with any …
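One common way to vectorize this (a sketch under my own shape conventions, not the paper's or the asker's code) is to stack the k weight matrices into a single 3-D array and reduce with a max over the stacking axis, so no Python loop over k is needed.

```python
# Sketch of a vectorized Maxout: W has shape (k, D, M), b has shape (k, M),
# x has shape (N, D). All k affine maps are computed in one einsum, then the
# element-wise max is taken over the k pieces.
import numpy as np

def maxout(x, W, b):
    # z[j] = x @ W[j] + b[j]  ->  z has shape (k, N, M)
    z = np.einsum('nd,kdm->knm', x, W) + b[:, None, :]
    return z.max(axis=0)                 # shape (N, M)

# Tiny usage example with an arbitrary k
N, D, M, k = 4, 3, 2, 5
rng = np.random.default_rng(0)
out = maxout(rng.normal(size=(N, D)),
             rng.normal(size=(k, D, M)),
             rng.normal(size=(k, M)))
print(out.shape)   # (4, 2)
```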
Category: Data Science

Relationship between Sigmoid and Gaussian Distribution

I was reading this article where I came across the following statement in the context of "Why do we use sigmoid activation function in Neural Nets?": The assumption of a dependent variable to follow a sigmoid function inherently assumes a Gaussian distribution for the independent variable which is a general distribution we see for a lot of randomly occurring events and this is a good generic distribution to start with. Could someone elaborate on this relationship between the two?
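One standard way to make a connection like this precise (the textbook two-class Gaussian argument; not necessarily exactly what the linked article meant) is that if the class-conditional distributions of $x$ are Gaussians with equal variance, the posterior probability of a class is exactly a sigmoid of a linear function of $x$:

```latex
% Two classes C_1, C_2 with equal priors and Gaussian class-conditionals
% p(x \mid C_k) = \mathcal{N}(x \mid \mu_k, \sigma^2). By Bayes' rule,
\[
P(C_1 \mid x)
 = \frac{p(x \mid C_1)}{p(x \mid C_1) + p(x \mid C_2)}
 = \frac{1}{1 + \exp\!\big(-a(x)\big)}
 = \sigma\big(a(x)\big),
\]
\[
a(x) = \ln \frac{p(x \mid C_1)}{p(x \mid C_2)}
     = \frac{\mu_1 - \mu_2}{\sigma^2}\, x + \frac{\mu_2^2 - \mu_1^2}{2\sigma^2},
\]
% i.e. a Gaussian assumption on the inputs makes the class probability a sigmoid
% of a linear combination of the inputs.
```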
Category: Data Science

Why an activation function is not needed during the runtime of a Word2Vec model

In the Word2Vec trainable model, there are two different weight matrices: the matrix $W$ from the input-to-hidden layer and the matrix $W'$ from the hidden-to-output layer. Referring to this article, I understand that the reason we have the matrix $W'$ is basically to compensate for the lack of an activation function in the output layer. As an activation function is not needed during runtime, there is no activation function in the output layer. But we need to update the input-to-hidden layer weight matrix $W$ through …
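As a rough sketch of the architecture under discussion (a minimal skip-gram-style forward pass of my own, not the article's code): the hidden layer is just a row lookup in $W$, so there is nothing non-linear to apply there, and at inference time the rows of $W$ are used directly as embeddings.

```python
# Minimal sketch of a skip-gram-style forward pass with a one-hot input.
# The "hidden layer" is a plain lookup of a row of W (a linear operation); the
# softmax over h @ W_prime only matters during training. At runtime the rows of
# W themselves serve as the word vectors, so no activation is applied.
import numpy as np

V, D = 10, 4                        # vocabulary size, embedding size (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))         # input-to-hidden weights
W_prime = rng.normal(size=(D, V))   # hidden-to-output weights

def forward(word_idx):
    h = W[word_idx]                         # hidden layer: just a row of W
    scores = h @ W_prime                    # output scores over the vocabulary
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()           # softmax used only for training

embedding_of_word_3 = W[3]                  # what is actually used at runtime
```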
Category: Data Science

Multioutput Neural Network for function approximation

I am trying to extend the example here to be capable of handling multiple outputs for function approximations import numpy as np # helps with the math import random as r import plotly.graph_objects as go # full data set x = np.linspace(0, np.pi, 100) y = np.sin(x) # input data p = 1/2 # fraction of data to use in training N = int(len(x)*p) # number of data points in the full set that corresponds to above fraction idx = …
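Since the question is cut off above, here is only a generic sketch (my own, with made-up shapes and targets) of the change that usually matters when going multi-output from scratch: give the output weight matrix one column per output and sum the squared error over outputs.

```python
# Generic sketch: a one-hidden-layer NumPy network with several outputs.
# Structurally, W2 and b2 simply get one column per output, so the forward
# pass returns a vector per sample.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, np.pi, 100).reshape(-1, 1)                 # shape (N, 1)
Y = np.column_stack([np.sin(x[:, 0]), np.cos(x[:, 0])])       # two targets, (N, 2)

n_hidden, n_out = 16, Y.shape[1]
W1, b1 = rng.normal(size=(1, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

def forward(x):
    h = np.tanh(x @ W1 + b1)           # hidden layer
    return h, h @ W2 + b2              # linear multi-output layer, shape (N, n_out)

lr = 0.01
for _ in range(2000):                  # plain batch gradient descent on MSE
    h, pred = forward(x)
    err = pred - Y                             # (N, n_out)
    dh = (err @ W2.T) * (1 - h ** 2)           # backprop through tanh
    W2 -= lr * h.T @ err / len(x);  b2 -= lr * err.mean(axis=0)
    W1 -= lr * x.T @ dh / len(x);   b1 -= lr * dh.mean(axis=0)
```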
Category: Data Science

Activation Functions in Haykin's Neural Networks: A Comprehensive Foundation

In Haykin's Neural Networks: A Comprehensive Foundation, the piecewise-linear function is one of the described activation functions. It is described with $\varphi(v) = 1$ for $v \ge \tfrac{1}{2}$, $\varphi(v) = v$ for $-\tfrac{1}{2} < v < \tfrac{1}{2}$, and $\varphi(v) = 0$ for $v \le -\tfrac{1}{2}$, and the book shows a corresponding plot (not reproduced here). I don't really understand how this is correct, since the value shown in the graph in the region $-0.5 < v < 0.5$ is not $v$ but $v + 0.5$. Am I misunderstanding something, or is there a mistake?
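For what it's worth, the discrepancy is easy to see numerically; below is a small sketch (mine) comparing the formula as printed with the function the plot appears to show, namely the usual "hard sigmoid" $\max(0, \min(1, v + \tfrac{1}{2}))$.

```python
# Sketch: the printed piecewise-linear definition vs. the curve in the plot.
import numpy as np

def phi_as_printed(v):
    # 1 for v >= 1/2, v for -1/2 < v < 1/2, 0 for v <= -1/2
    return np.where(v >= 0.5, 1.0, np.where(v > -0.5, v, 0.0))

def phi_as_plotted(v):
    # the plot ramps from 0 at v = -1/2 up to 1 at v = +1/2, i.e. clip(v + 1/2, 0, 1)
    return np.clip(v + 0.5, 0.0, 1.0)

v = np.array([-1.0, -0.5, 0.0, 0.25, 0.5, 1.0])
print(phi_as_printed(v))   # [0.   0.   0.   0.25 1.   1.  ] -> jumps at v = ±1/2
print(phi_as_plotted(v))   # [0.   0.   0.5  0.75 1.   1.  ] -> continuous ramp
```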
Category: Data Science

Why does using tanh worsen accuracy so much?

I was testing how different hyperparameters would change the output of my multilayer perceptron for a regression problem checkpoint = keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True) # Initialising the ANN model = Sequential() # Adding the input layer and the first hidden layer model.add(Dense(32, activation = 'relu', input_dim = X_train.shape[1])) # Adding the second hidden layer model.add(Dense(units = 8, activation = 'relu')) # Adding the output layer model.add(Dense(units = 1)) optimizer = keras.optimizers.Adam(learning_rate=0.01) model.compile(optimizer=optimizer, loss='mean_squared_error') # Fitting the ANN to the Training set history …
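The question is truncated above, but one common cause of this behaviour is that tanh saturates for inputs far from zero, so unscaled features stall learning far more with tanh than with ReLU. Below is a sketch of how one might check this (my own code, not the asker's): the same small regression MLP is trained with each activation on standardized inputs.

```python
# Sketch: the same tiny regression MLP with 'relu' vs 'tanh' hidden activations,
# fitted on standardized inputs. tanh saturates for |z| >> 1, so without input
# scaling its gradients can vanish and the loss can look far worse than with ReLU.
import numpy as np
from tensorflow import keras
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 5) * 100.0            # deliberately large-scale features
y = X.sum(axis=1) + np.random.randn(500)

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance

def build(activation):
    model = keras.Sequential([
        keras.layers.Dense(32, activation=activation, input_dim=X.shape[1]),
        keras.layers.Dense(8, activation=activation),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.Adam(0.01), loss='mean_squared_error')
    return model

for act in ('relu', 'tanh'):
    hist = build(act).fit(X_std, y, epochs=50, verbose=0)
    print(act, 'final MSE:', hist.history['loss'][-1])
```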
Category: Data Science

Is it wrong to use Glorot Initialization with ReLU Activation?

I'm reading that Keras' default initialization is glorot_uniform. However, all of the tutorials I see use relu activation as the go-to for hidden layers, yet I do not see them specifying the initialization for those layers as He. Would it be better for these relu layers to use He instead of Glorot? As seen in O'Reilly's Hands-On Machine Learning with Scikit-Learn & TensorFlow:

| initialization | activation                    |
+----------------+-------------------------------+
| glorot         | none, tanh, logistic, softmax |
| he             | …                             |
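For reference, switching a layer to He initialization in Keras is just a keyword argument (a minimal sketch assuming the tf.keras API; layer sizes are illustrative):

```python
# Sketch: explicitly pairing ReLU layers with He initialization in Keras.
# Dense layers default to glorot_uniform; kernel_initializer overrides that.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu',
                       kernel_initializer='he_normal', input_shape=(10,)),
    keras.layers.Dense(32, activation='relu',
                       kernel_initializer='he_normal'),
    keras.layers.Dense(1),            # linear output keeps the default Glorot
])
model.compile(optimizer='adam', loss='mse')
```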
Category: Data Science

Why does using the hyperbolic tangent or the sigmoid as the activation function on the last layer give the same accuracy?

The problem: I'm making a simple multilayer perceptron (MLP) in Keras that has to do binary classification on some float-type data. Each data point is a group of three float values (e.g. 32.01, -10.23, -1.01) and is labelled with the value 0 or 1. Every time I run the training process, the validation accuracy and validation loss settle at the same value after a few training epochs, like 5 or 6. The problem is the …
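The question is cut off, but when comparing the two output activations it usually matters how the labels are encoded. Here is a sketch of the two setups (mine rather than the asker's code): sigmoid pairs with {0, 1} labels and binary cross-entropy, while tanh outputs live in (-1, 1) and would need labels mapped to {-1, 1} with a suitable loss to be compared fairly.

```python
# Sketch: the two output-layer configurations being compared.
from tensorflow import keras

def make_model(last_activation):
    return keras.Sequential([
        keras.layers.Dense(16, activation='relu', input_shape=(3,)),
        keras.layers.Dense(1, activation=last_activation),
    ])

# Sigmoid head: outputs in (0, 1), labels stay as 0/1.
sig = make_model('sigmoid')
sig.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Tanh head: outputs in (-1, 1); map labels 0/1 -> -1/1 and use e.g. MSE,
# since binary cross-entropy expects probabilities. Accuracy would then need
# a custom threshold at 0 rather than 0.5.
tanh = make_model('tanh')
tanh.compile(optimizer='adam', loss='mse')
# y_tanh = 2 * y - 1   # label remapping for the tanh head
```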
Category: Data Science

Why is ReLU used as an activation function?

Activation functions are used to introduce non-linearities into the linear output of the form w * x + b in a neural network, which I am able to understand intuitively for activation functions like sigmoid. I understand the advantages of ReLU, such as avoiding dead neurons during backpropagation. However, I am not able to understand why ReLU is used as an activation function if its output is linear. Doesn't the whole point of being the activation function get defeated …
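A quick way to see that ReLU is not linear, despite being piecewise linear (a small sketch of my own, not part of the question): a linear function must satisfy f(a + b) = f(a) + f(b), and ReLU breaks that; composing ReLUs also produces genuinely non-linear shapes such as |x|.

```python
# Sketch: ReLU is piecewise linear but not linear.
import numpy as np

relu = lambda x: np.maximum(0, x)

a, b = 2.0, -3.0
print(relu(a + b), relu(a) + relu(b))   # 0.0 vs 2.0 -> f(a+b) != f(a)+f(b)

# A one-hidden-layer ReLU "network" with fixed weights computes |x|,
# which no single linear function w*x + b can represent:
x = np.linspace(-2, 2, 5)
print(relu(x) + relu(-x))               # [2. 1. 0. 1. 2.] == |x|
```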
Category: Data Science

Activation maps positive even before activation

I was looking at the activation maps of vgg19 in PyTorch. I found that all the values of the maps are positive even before I applied the ReLU. This seems very strange to me... If this is correct (could it be that I did not use the register_forward_hook method correctly?), why would one then apply ReLU at all? This is my code to produce this: import torch import torchvision import torchvision.models as models import torchvision.transforms as transforms from torchsummary import summary …
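Since the asker's hook code is cut off, here is only a generic sketch of how pre-ReLU activations are usually captured with register_forward_hook (hooking the Conv2d modules rather than the ReLU modules); the module filter and the storage dict are my own choices.

```python
# Sketch: capture pre-activation conv outputs in torchvision's vgg19.
# Hooks on Conv2d modules record their raw outputs, which should contain
# negative values; hooks placed on the ReLU modules would not.
import torch
import torchvision.models as models

model = models.vgg19().eval()
pre_relu = {}

def save_output(name):
    def hook(module, inputs, output):
        pre_relu[name] = output.detach()
    return hook

for name, module in model.features.named_children():
    if isinstance(module, torch.nn.Conv2d):
        module.register_forward_hook(save_output(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for name, fmap in pre_relu.items():
    print(name, 'min:', fmap.min().item())   # negative minima expected here
```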
Category: Data Science

Should output data scaling correspond to the activation function's output?

I am building an LSTM with Keras, which has an activation parameter in the layer. I have read that the scaling of the output data should match the activation function's output range. For example, tanh outputs values between -1 and 1, therefore the output training (and testing) data should be scaled to values between -1 and 1; so if the activation function is a sigmoid, the output data should be scaled to values between 0 and 1. Does this hold for all …
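For concreteness, matching the target scaling to the output activation usually just means choosing the scaler's range (a minimal sketch using scikit-learn's MinMaxScaler; the variable names are mine):

```python
# Sketch: scale the target to the range of the chosen output activation.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.random.rand(100, 1) * 50.0                    # some raw regression target

scaler_tanh = MinMaxScaler(feature_range=(-1, 1))    # for a tanh output layer
scaler_sigm = MinMaxScaler(feature_range=(0, 1))     # for a sigmoid output layer

y_tanh = scaler_tanh.fit_transform(y)
y_sigm = scaler_sigm.fit_transform(y)

# After predicting, invert the same scaler to get back to the original units:
# y_pred_original = scaler_tanh.inverse_transform(y_pred)
```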
Category: Data Science

Applying activation on part of the layer in Keras

Context I am trying to implement the YOLO algorithm in Keras. What I have so far is the following network: i = Input(shape=(image_height,image_width, image_channels)) rescaled = Rescaling(1./255)(i) x = Conv2D(16, (1, 1))(rescaled) x = Conv2D(32, (3, 3))(x) x = LeakyReLU(alpha=0.3)(x) x = MaxPooling2D(pool_size=(2, 2))(x) x = Conv2D(16, (3, 3))(x) x = Conv2D(32, (3, 3))(x) x = LeakyReLU(alpha=0.3)(x) x = MaxPooling2D(pool_size=(2, 2))(x) x = Flatten()(x) x = Dense(256, activation='sigmoid')(x) x = Dense(grid_width * grid_height * anchor_number * (5 + class_count))(x) x …
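The question is cut off, but for the "activation on part of the layer" part, one common pattern (a sketch under my own assumptions about which slice should be squashed, not the asker's exact head) is to wrap the split in a Lambda layer so that, for example, the box/objectness values get a sigmoid while the class scores stay linear:

```python
# Sketch: apply a sigmoid to only part of the last axis of a tensor in Keras.
# Here the first 5 values per anchor (x, y, w, h, objectness) are squashed with
# a sigmoid while the remaining class scores are left linear.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Lambda, Reshape

grid_h, grid_w, anchors, n_classes = 7, 7, 2, 3   # illustrative values

def split_activation(t):
    boxes = tf.sigmoid(t[..., :5])    # activated slice
    classes = t[..., 5:]              # untouched slice
    return tf.concat([boxes, classes], axis=-1)

# ... x is the flattened backbone output from the question ...
def yolo_head(x):
    x = Dense(grid_h * grid_w * anchors * (5 + n_classes))(x)
    x = Reshape((grid_h, grid_w, anchors, 5 + n_classes))(x)
    return Lambda(split_activation)(x)
```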
Category: Data Science

How to prove Softmax Numerical Stability?

I was playing around with the softmax function and experimented with the numerical stability of softmax. If we shift the exponent in the numerator and denominator by the same value, the output of the softmax stays constant (in the example I tried, $-S_{max}$ was added to every exponent; the accompanying picture is not reproduced here). I cannot figure out how to prove this numerical stability property (although I read that it is true). Can anyone help me with the proof?
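The proof is a one-line algebraic identity; here is a sketch of the standard argument:

```latex
% For any constant c (typically c = \max_j x_j), the shift cancels:
\[
\mathrm{softmax}(x - c)_i
  = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}}
  = \frac{e^{x_i}\, e^{-c}}{e^{-c} \sum_j e^{x_j}}
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \mathrm{softmax}(x)_i .
\]
% Choosing c = \max_j x_j makes every exponent non-positive, so e^{x_i - c} \le 1
% and overflow is avoided, which is why the shifted form is numerically stable.
```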
Category: Data Science
