Need a random process/distribution where I can pass a certain level of bias for producing an outcome

This is my first question here, so please let me know if I'm not being clear. My objective: a startup sportsbook wants to test its algorithm to see how it manages game lines for incoming bets placed on a particular game. For example, as bets come in for a particular team, the algorithm checks the book to see if it can cover, and when the book is lopsided it adjusts the line/odds, giving the other team more favorable odds to balance the book …
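A minimal sketch of one way to simulate this, assuming a single bias parameter (here the hypothetical name p_team_a) controls the probability that an incoming bet backs team A; numpy's weighted sampling lets the algorithm shift that probability as the line moves:

import numpy as np

rng = np.random.default_rng(42)

def simulate_bets(p_team_a=0.5, n_bets=1_000):
    # each entry is the team an incoming bet backs; p_team_a is the bias knob
    return rng.choice(["A", "B"], size=n_bets, p=[p_team_a, 1 - p_team_a])

bets = simulate_bets(p_team_a=0.65)   # a lopsided flow of bets
print((bets == "A").mean())           # ~0.65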
Category: Data Science

Splitting train/test sets by an identifier?

I know sklearn has train_test_split() to split a dataset into a train and a test set. But I read that if your dataset is updated regularly, even a fixed random seed won't help: the split is recomputed over the changed rows, so each update produces a different train/test partition. Over time, your ML algorithms will have seen the whole dataset, defeating the purpose of the train/test split because the models end up training on data that was previously used for testing. The book I'm reading (Hands-On …
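One common fix (and, if I recall correctly, the one that book proposes) is to make test-set membership a deterministic function of a stable row identifier rather than a random draw, so a row never migrates between sets as the dataset grows. A minimal sketch, assuming each row carries a stable identifier:

from zlib import crc32

def in_test_set(identifier, test_ratio=0.2):
    # hash the id and compare the hash fraction against the ratio;
    # the decision is stable across dataset updates
    return crc32(str(identifier).encode()) / 2**32 < test_ratio

# ids: your stable identifiers (column name is up to you)
train_ids = [i for i in ids if not in_test_set(i)]
test_ids = [i for i in ids if in_test_set(i)]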
Category: Data Science

Create a random table with a given p-value on a chi-square independence test

I want to randomly create a table of data that has a predefined p-value and chi-square statistic under a chi-square test of independence. For example, this table would have a p-value of 1 on a chi-square independence test: [[25, 25], [25, 25]]. Trying out some random values, I see that [[50, 0], [30, 20]] has a p-value of 2.02E-6 and a chi-square statistic of 22.56. But how would I do it the other way around? I have a given p-value of 0.05, for example, and from that I want to get …
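One way to invert the problem, sketched for the special case of a symmetric 2x2 table with all margins fixed at n: the chi-square statistic then reduces to 8*(a - n/2)**2/n with one degree of freedom, so the target statistic from scipy's inverse survival function can be solved for the one free cell a directly (rounding a to an integer makes the resulting p-value only approximate):

import numpy as np
from scipy.stats import chi2, chi2_contingency

def table_for_p(target_p, n=50):
    # invert the target p-value into a target chi-square statistic (df=1),
    # then solve 8*(a - n/2)**2 / n == target_stat for the free cell a
    target_stat = chi2.isf(target_p, df=1)
    a = int(round(n / 2 + np.sqrt(target_stat * n / 8)))
    return np.array([[a, n - a], [n - a, a]])

table = table_for_p(0.05)
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(table, stat, p)   # p lands near 0.05, up to the integer rounding of a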
Category: Data Science

Why should the initialization of weights and bias be chosen around 0?

I read this: "To train our neural network, we will initialize each parameter $W^{(l)}_{ij}$ and each $b^{(l)}_{i}$ to a small random value near zero (say, according to a $\text{Normal}(0, \epsilon^2)$ distribution for some small $\epsilon$, say 0.01)" from the Stanford deep learning tutorials, in the 7th paragraph of the Backpropagation Algorithm page. What I don't understand is: why should the initialization of the weights or biases be around 0?
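For concreteness, a two-line numpy sketch of the recipe the tutorial describes (the layer sizes here are arbitrary); note the standard deviation is $\epsilon$, so the variance is $\epsilon^2$:

import numpy as np

rng = np.random.default_rng(0)
eps = 0.01
fan_in, fan_out = 64, 32                           # arbitrary layer sizes
W = rng.normal(0.0, eps, size=(fan_out, fan_in))   # W ~ Normal(0, eps^2)
b = rng.normal(0.0, eps, size=fan_out)             # b ~ Normal(0, eps^2)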
Category: Data Science

Cannot clone object <keras.wrappers.scikit_learn.KerasRegressor object at 0x7fdc9c3ba550>

Trying to hyper-tune an ANN, but I get an error when calling fit (grid1.fit(X_train, y_train)). Below is the code:

def create_model(dropout_rate, weight_constraint, optimizer, init, layers, activation):
    model = Sequential()
    model.add(Dense(nodes, input_dim=171, kernel_initializer=init,
                    activation='relu', kernel_constraint=maxnorm(weight_constraint)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer=init, activation='relu'))
    # note: this should use the optimizer argument, not the global optimizers list
    model.compile(loss='mse', optimizer=optimizer, metrics=['mean_absolute_error'])
    return model

model = KerasRegressor(build_fn=create_model, verbose=0)

# hyperparameters
layers = [[50], [50, 20], [50, 30, 15], [70, 45, 15, 5]]
optimizers = ['rmsprop', 'adam']
dropout_rate = [0.1, 0.2, 0.3, 0.4]
init = ['glorot_uniform', 'normal', 'uniform']
epochs = [150, 500]
batches = [5, 10, 20]
weight_constraint = [1, 2, 3]
param_dist = dict(optimizer=optimizers, …
Category: Data Science

How to choose the random seed?

I understand this question can be strange, but how do I pick the final random_seed for my classifier? Below is an example. It uses the SGDClassifier from sklearn on the iris dataset, and GridSearchCV to find the best random_state:

from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

parameters = {'random_state': [1, 42, 999, 123456]}
sgd = SGDClassifier(max_iter=20, shuffle=True) …
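For what it's worth, a common sanity check is not to tune the seed but to measure how much the score moves across seeds; a sketch using the same iris/SGD setup:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
scores = [cross_val_score(SGDClassifier(max_iter=20, shuffle=True,
                                        random_state=seed), X, y, cv=5).mean()
          for seed in range(10)]
print(f"mean={np.mean(scores):.3f} +/- {np.std(scores):.3f} across seeds")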
Category: Data Science

What is the objective that is optimized with Random Search?

I have recently learned about Random Search (sklearn.model_selection.RandomizedSearchCV in Python) and have been thinking about the theory behind the optimization process. In particular, my question is: given that one performs Random Search on a certain algorithm (say, a random forest), what are the best hyperparameters based on? More specifically, in what sense are they the "best" hyperparameters for the model? Do they maximize the accuracy of the model? If not, what (performance) criterion is optimized? Or is it entropy/Gini?
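For reference, the criterion RandomizedSearchCV maximizes is whatever its scoring argument names (falling back to the estimator's default score when omitted), evaluated as the mean over cross-validation folds. A minimal sketch on synthetic data:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 12)},
    n_iter=20,
    scoring="accuracy",   # the objective: mean CV accuracy of each sampled config
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)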
Category: Data Science

Is shuffling data really necessary for training?

I don't mean the case where the data, if sampled sequentially, would have labels like [1111122223333]. In that case the network learns to predict everything as 1, then everything as 2, and so on, and it cannot learn anything useful. What I mean is: assume you have the ImageNet 2012 dataset and you shuffle it once, so the labels and images are in a shuffled order. Since the dataset is huge, can the network really remember the previous epoch's predictions and overfit? Or: I shuffle the data 5 …
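For clarity, the two regimes being compared, sketched with a stand-in array dataset: shuffling once fixes a single order forever, while reshuffling draws a fresh permutation each epoch:

import numpy as np

X, y = np.arange(100).reshape(50, 2), np.arange(50)   # stand-in dataset
batch_size = 8
for epoch in range(5):
    perm = np.random.default_rng(epoch).permutation(len(X))  # fresh order per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # ... one gradient step on (xb, yb)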
Category: Data Science

How to compute modulo of a hash?

Let's say I have a set of users in my database whose IDs are GUIDs. I use xxhash to generate a fixed-length hash for each value, so that I can then "bucketize" them and do random sampling with the help of the modulo function. That said, if I have a hash such as 367b50760441849e, I want to be able to use hash % 20 == 0 to randomly pick 5% of the population …
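A minimal sketch of the arithmetic, assuming Python and the xxhash package (the function and variable names here are made up for illustration): parse the hex digest into an integer, then reduce it modulo the bucket count:

import xxhash

print(int("367b50760441849e", 16) % 20)   # parse the hex digest, then take the modulo

def bucket(guid: str, n_buckets: int = 20) -> int:
    # xxh64 exposes the digest directly as a 64-bit unsigned integer
    return xxhash.xxh64(guid.encode()).intdigest() % n_buckets

# keep roughly 5% of users: those landing in 1 of the 20 buckets
sample = [u for u in user_guids if bucket(u) == 0]   # user_guids: your id list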
Category: Data Science

RL Sutton book, initial estimate of q*(a) for 10 arm testbed

The Sutton book does not mention what the initial estimate of q*(a) is before the first reward is received. In this code repo, which seems to accompany the book (Sutton code repo), it is initialized to 0, per the snippet below:

def __init__(self, kArm=10, epsilon=0., initial=0., stepSize=0.1,
             sampleAverages=False, UCBParam=None, gradient=False,
             gradientBaseline=False, trueReward=0.):

But the explanation for Figure 2.1, which shows the distribution of rewards for the 10 arms of the bandit, says: "Figure 2.1: An example bandit problem …
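For context, the testbed itself is easy to reproduce from the book's description; a sketch with q*(a) drawn from Normal(0, 1), rewards drawn from Normal(q*(a), 1), and estimates started at 0 to match the repo's initial=0. default:

import numpy as np

rng = np.random.default_rng(0)
k = 10
q_true = rng.normal(0.0, 1.0, size=k)   # q*(a) ~ Normal(0, 1), per Figure 2.1
q_est = np.zeros(k)                     # initial estimates, matching initial=0.
counts = np.zeros(k)

a = int(rng.integers(k))                # pull some arm
r = rng.normal(q_true[a], 1.0)          # reward ~ Normal(q*(a), 1)
counts[a] += 1
q_est[a] += (r - q_est[a]) / counts[a]  # sample-average update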
Category: Data Science

What is the most efficient method for hyperparameter optimization in scikit-learn?

An overview of the hyperparameter optimization process in scikit-learn is here. Exhaustive grid search will find the optimal set of hyperparameters within the grid; the downside is that it is slow. Random search is faster than grid search but has unnecessarily high variance. There are also additional strategies in other packages, including scikit-optimize, auto-sklearn, and scikit-hyperband. What is the most efficient (finding reasonably performant parameters quickly) method for hyperparameter optimization in scikit-learn? Ideally, I would like working code …
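One cheap middle ground worth knowing about, sketched below under the assumption of scikit-learn >= 0.24: successive halving, a Hyperband-style strategy that ships with scikit-learn itself, spends little budget on bad candidates and progressively more on promising ones:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (activates the class)
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=1_000, random_state=0)
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": randint(2, 12), "min_samples_split": randint(2, 20)},
    factor=3,        # each round keeps the best ~1/3 of candidates with more budget
    cv=5,
    random_state=0,
).fit(X, y)
print(search.best_params_)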
Category: Data Science

Why would one crossvalidate the random state number?

Still learning about machine learning, I've stumbled across a Kaggle kernel (link) that I cannot understand. Here are lines 72 and 73:

parameters = {
    'solver': ['lbfgs'],
    'max_iter': [1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000],
    'alpha': 10.0 ** -np.arange(1, 10),
    'hidden_layer_sizes': np.arange(10, 15),
    'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
}
clf = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)

As you can see, the random_state parameter is being tested across 10 values. What is the point of doing this? If one model performs better with some random_state, does it make any sense to use this particular parameter on …
Category: Data Science

How to label train_data?

I have an assignment with four files: 1) train_data.csv: the training file, containing two fields (text, id). 2) train_label.csv: the label file, containing two fields (id, label). 3) test_data.csv: the test file, containing two fields (text, id). 4) sample_submission.csv: the file that needs to be submitted. This should clearly be multilabel classification, but whenever I try to identify labels in the train data, no labels show up. How can I remove noise from train_data? Any type …
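Going by the file descriptions above, the labels live in a separate file keyed by id, so a minimal pandas sketch (assuming the shared column is literally named id) would attach them to the training text:

import pandas as pd

train_data = pd.read_csv("train_data.csv")    # columns: text, id
train_label = pd.read_csv("train_label.csv")  # columns: id, label
train = train_data.merge(train_label, on="id", how="inner")
print(train[["text", "label"]].head())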
Category: Data Science

Epoch greedy algorithm for contextual bandits

I'm reading the following paper on the epoch-greedy algorithm for the contextual bandit problem, and I have two questions: http://hunch.net/~jl/projects/interactive/sidebandits/bandit.pdf I'm unsure how they've used the Bernstein inequality on page 6 to conclude $\mu_{n}(\mathcal{H},1) \leq c^{-1} \sqrt{k \ln(m)/n}$. Could someone please elaborate on this? Bernstein's inequality seems to bound, with high probability, the deviation of a sum of random variables from its mean, whereas the regret bound $\mu_{n}(\mathcal{H},1)$ is defined as the expected regret from the empirically best …
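For reference, one standard form of Bernstein's inequality (the paper may apply a slightly different variant) for independent zero-mean random variables $X_1,\dots,X_n$ with $|X_i| \le M$ is

$$\Pr\!\left(\sum_{i=1}^{n} X_i > t\right) \;\le\; \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^{n} \mathbb{E}[X_i^2] + Mt/3}\right),$$

and presumably the paper applies it to the empirical reward estimates, then converts the high-probability deviation bound into a bound on the expectation $\mu_{n}(\mathcal{H},1)$ by integrating over the failure probability.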
Category: Data Science

Testing Multi-Arm Bandits on Historical Data

Suppose I want to test a multi-arm bandit algorithm in the contextual setting on a set of historical data. For simplicity, let's assume there are only two arms A and B and suppose the rewards are binary. Furthermore, suppose I have a data set where users were shown one of the two arms and I have a record of the rewards. What would be the best approach to simulating the scenario of running the algorithm online? I was thinking of …
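One standard approach is replay-style offline evaluation: step through the log and count only the rounds where the algorithm happens to pick the same arm the log shows. A sketch, where policy.choose and policy.update are hypothetical method names for whatever bandit implementation is under test:

import numpy as np

def replay_evaluate(policy, logged_rounds):
    # logged_rounds: iterable of (context, shown_arm, reward) from the historical data
    matched = []
    for context, shown_arm, reward in logged_rounds:
        if policy.choose(context) == shown_arm:   # only matched rounds are usable
            matched.append(reward)
            policy.update(context, shown_arm, reward)
    return np.mean(matched) if matched else float("nan")

Note that this estimate is unbiased when the logged arms were chosen uniformly at random; if the logging policy was biased, some form of importance weighting is needed.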
Category: Data Science

Multi-arm bandit problem for bernoulli reward distribution

Suppose that in the multi-arm bandit problem I know my rewards are distributed as $0$ or $1$, i.e. according to a Bernoulli distribution, rather than merely satisfying the condition that they lie in the range $[0,1]$. Does anyone know if we can do better with our confidence bounds under this restricted condition? In particular, how does the upper confidence bound algorithm change, and what is the corresponding upper bound on the expected regret? Can someone provide links to a paper or a set …
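For the Bernoulli case specifically, the refinement I'm aware of is KL-UCB (Garivier and Cappé, 2011), which replaces the Hoeffding-style confidence box with a binomial-KL confidence set and asymptotically matches the Lai-Robbins lower bound, roughly $\sum_{a:\Delta_a>0} \Delta_a \ln(T)/\mathrm{kl}(\mu_a, \mu^*)$ regret. A sketch of the per-arm index computed by bisection:

import math

def bernoulli_kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, pulls, t):
    # largest q >= p_hat with pulls * kl(p_hat, q) <= log(t), found by bisection
    budget = math.log(max(t, 2))
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# each round t, pull the arm maximizing kl_ucb_index(mean_reward[a], pulls[a], t)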
Category: Data Science

HOW TO: Deep Neural Network weight initialization

Given a difficult learning task (e.g. high dimensionality, inherent data complexity), deep neural networks become hard to train. To ease many of the problems one might:
- normalize and handpick quality data
- choose a different training algorithm (e.g. RMSprop instead of gradient descent)
- pick a cost function with steeper gradients (e.g. cross-entropy instead of MSE)
- use a different network structure (e.g. convolutional layers instead of feedforward)
I have heard that there are clever ways to initialize better weights. For example, you can choose …
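For what the truncated sentence is likely getting at, the two best-known schemes are Glorot/Xavier initialization, which scales weights by the fan-in and fan-out to keep activation variance roughly constant across layers, and He initialization, which is tailored to ReLU units. A numpy sketch of both:

import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    # Glorot/Xavier: limit chosen so activation variance stays roughly constant
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # He et al.: variance 2/fan_in compensates for ReLU zeroing half the inputs
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1 = glorot_uniform(784, 256)
W2 = he_normal(256, 128)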
Category: Data Science

Interpreting the results of randomized PCA in scikit-learn

I'm using scikit-learn to do a genome-wide association study with a feature vector of about 100K SNPs. My goal is to tell the biologists which SNPs are "interesting". RandomizedPCA really improved my models, but I'm having trouble interpreting the results. Can scikit-learn tell me which features are used in each component?
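Yes, at least in newer scikit-learn versions, where RandomizedPCA became PCA(svd_solver='randomized'): the fitted components_ array has shape (n_components, n_features), so each row gives every SNP's loading on that component. A sketch, with X standing in for the samples-by-SNPs matrix:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
pca.fit(X)   # X: n_samples x ~100K SNP columns (your data)

# rank SNPs by the magnitude of their loading on the first component
top = np.argsort(np.abs(pca.components_[0]))[::-1][:20]
print("20 SNPs with the largest absolute loading on component 1:", top)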
Category: Data Science
