Ethical consequences of non-deterministic learning processes?

Most advanced supervised learning techniques are non-deterministic by construction. The final output of the model usually depends on some random parts of the learning process (random weight initialization for neural networks, or variable selection and splits for gradient boosted trees). This phenomenon can be observed by plotting the predictions for a given random seed against the predictions for another seed: the predictions are usually correlated but don't coincide exactly. Generally speaking, this is often not a problem. When trying …
Category: Data Science
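
The seed-to-seed comparison described in the question above can be sketched in a few lines. This is only an illustration under assumptions that are not in the original: the data is synthetic, the model is a small linear regressor trained by SGD (standing in for a neural network), and the names make_data, train_sgd and pearson are hypothetical helpers. The seed controls both the weight initialization and the order in which samples are visited.

```python
import random

def make_data(n=200, seed=0):
    """Fixed synthetic regression data: y = 2*x1 - x2 + noise (assumed, for illustration)."""
    rng = random.Random(seed)
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    y = [2 * x1 - x2 + rng.gauss(0, 0.1) for x1, x2 in X]
    return X, y

def train_sgd(X, y, seed, epochs=5, lr=0.05):
    """Fit a linear model by SGD; the seed sets the initial weights and the sample order."""
    rng = random.Random(seed)
    w1, w2, b = (rng.gauss(0, 1) for _ in range(3))
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x1, x2 = X[i]
            err = w1 * x1 + w2 * x2 + b - y[i]
            w1 -= lr * err * x1
            w2 -= lr * err * x2
            b -= lr * err
    # Return predictions on the training inputs.
    return [w1 * x1 + w2 * x2 + b for x1, x2 in X]

def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

X, y = make_data()
preds_a = train_sgd(X, y, seed=1)   # same data, different training seed
preds_b = train_sgd(X, y, seed=2)

correlation = pearson(preds_a, preds_b)
max_gap = max(abs(p - q) for p, q in zip(preds_a, preds_b))
print(f"correlation={correlation:.4f}, largest disagreement={max_gap:.4f}")
```

Plotting preds_a against preds_b would give the scatter the question describes: points close to the diagonal but not exactly on it.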

Capping labels negatively impacts business metric

I have a deep neural network model with an integer label to predict. The label is heavily skewed, so we cap the labels at some value (say, the 90th percentile). When we build and run the model, it performs well in general. But an online experiment shows degradation in business metrics for the fraction of users that have high-value labels. If we don't cap the label, the business metrics get skewed for users with a low number of activities. …
Category: Data Science
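
For readers unfamiliar with the capping step mentioned in the question above, here is a minimal sketch, assuming NumPy and invented data. The geometric label distribution is only a stand-in for a skewed integer label; nothing here comes from the asker's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed integer labels (e.g. activity counts); not from the question.
labels = rng.geometric(p=0.05, size=10_000)

cap = np.percentile(labels, 90)      # 90th percentile used as the cap
capped = np.minimum(labels, cap)     # every label above the cap is replaced by the cap

affected = labels > cap              # the high-value rows whose training target changed
print(f"cap={cap:.0f}, share of labels capped={affected.mean():.3f}")
print(f"mean label before={labels.mean():.2f}, after={capped.mean():.2f}")
```

The affected mask makes the trade-off visible: those rows are exactly the high-value users for whom the model never sees the true target.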

Model for predicting duration based on categorical data

I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these basic steps: …
Category: Data Science

How can I export the best classifier from my code to a model for real future usage?

# Read the CSV file
df = pd.read_csv('processed.csv', header=0, engine='python')

# Pre-processing the data
# Define X,Y features
X = df.drop('Class', axis=1)
Y = df['Class']

# prepare configuration for cross validation test harness
seed = 3

# prepare models
models = [('LR', LogisticRegression()),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = …
Category: Data Science

How do I select the "best" unsupervised machine learning algorithm to cluster my specific dataset?

I want to cluster a dataset without prior knowledge of the correct number of clusters. For different algorithms (e.g. k-means, GMM) I can iterate through different values and try to find the best solution for any given algorithm (e.g. elbow curve, silhouette coefficient). But I get very different results, as expected with different algorithms: k-means is good for spherical clusters, density-based approaches for totally different cluster shapes. Now the actual question: How do I select the "best" unsupervised machine learning …
Category: Data Science

Same validation accuracy, different train accuracy for two neural networks models

I'm performing emotion classification on the FER2013 dataset. I'm trying to measure the performance of different models, and when I tried ImageDataGenerator with a model I had already used, I came up with the following situation:

Model without data augmentation got: train_accuracy = 0.76, val_accuracy = 0.70
Model with data augmentation got: train_accuracy = 0.86, val_accuracy = 0.70

As you can see, validation accuracy is the same in both models, but train accuracy is significantly different. In this case: Should I go with …
Category: Data Science

Optimizing decision threshold on model with oversampled/imbalanced data

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share and get an opinion on whether I'm heading in the right direction or maybe missed something.

1. Split Train/Test/Val
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample training folds)
3. Scoring metric will be AUC as training set is …
Category: Data Science

Neural Network for solving these linear algebra problems

Intro: There are several questions on this site about whether or not machine learning can solve specific problems. The answer (in my words) seems to be: "Yes, trivially, if you choose a model to learn your specific problem, but you sometimes may choose a model that can't represent/approximate the correct hypothesis." I would like to choose a neural network model where, a priori, all I know is that the input is a "linear algebra" kind of function. The Problem: I …
Category: Data Science

Retrieve user features in real time from UserId for prediction

Let's say I'm building an app like Uber and I want to predict the user's most likely destination based on the user's past history, current latitude/longitude, and time/date. Here is the proposed architecture: let's say I have a pre-trained model hosted as a service. The part I'm struggling with is how to get the user features from the database in real time, given the RiderID, for use by the prediction service (an XGBoost model). I'm guessing a lookup in …
Category: Data Science

Propensity model with Only Positive Data

Is it possible to build a propensity model (i.e., the likelihood that a user will buy an item) using only positive values? For example, I have a bunch of data about Customers (people that bought stuff) and Users (people that haven't bought stuff yet). I want to get the likelihood that a User becomes a Customer. It seems that the only way to do so is to train a model using the data of Customers, therefore using only positive values.
Category: Data Science

Neural Network - Sparsity of collaborative based filtering and modelling the prediction problem

I'm fairly new to machine learning and, for that matter, neural networks, but for the past couple of days I decided to take a stab at a fairly classical and practical machine learning problem: recommendation systems. Apologies if this is an unnecessarily broad question, but I found it hard to find resources answering this particular question. My main question is, how do you even model the problem (or what directions/advice is there on how …
Category: Data Science

Model works on TF 2.3 but not on 2.6 (model.predict_classes removed?)

I am writing a project that classifies the date codes on a pack. I have developed a pipeline that works as intended on my PC: I trained the model on my computer and ran the classification script (TF 2.3), and it works well. I copied the files over to my Raspberry Pi 4, which is 64-bit and runs TF 2.6. The script runs OK, but I am not getting the same output; in fact, it returns the same character for each contour: [ '[','[','[','[','[','[','[','[','[','['] …
Category: Data Science
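
On the title question above: Sequential.predict_classes was deprecated and then removed in TensorFlow 2.6, and the usual replacement is to take the argmax of model.predict yourself. The sketch below uses a hand-written NumPy array as a stand-in for the output of model.predict, and the charset lookup is hypothetical. Whether this explains the repeated '[' output in the question cannot be determined from the excerpt.

```python
import numpy as np

# Stand-in for `model.predict(batch)`: one row of class probabilities per contour.
# The values are invented for illustration.
probs = np.array([
    [0.05, 0.90, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
])

# Multi-class (softmax) output: equivalent of the removed predict_classes.
class_ids = np.argmax(probs, axis=-1)

# Hypothetical lookup from class index to character.
charset = ["0", "1", "2"]
decoded = [charset[i] for i in class_ids]
print(class_ids, decoded)

# A binary model with a single sigmoid unit would be thresholded instead.
sigmoid_out = np.array([[0.2], [0.8]])
binary_ids = (sigmoid_out > 0.5).astype("int32")
print(binary_ids.ravel())
```

If the decoding step on the Pi indexes a string representation of the prediction rather than the array itself, every contour could map to the same character, so it may be worth printing the raw output of model.predict on both machines first.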

How to choose Recursive Feature Elimination parameters

In my project I have more than 900 features, and I thought of using the Recursive Feature Elimination (RFE) algorithm to reduce the dimensionality of my problem (in order to improve accuracy). But I can't figure out how to choose the RFE parameters (the estimator and the number of features to select). Should I use model selection techniques in this case as well? Do you have any advice?
Category: Data Science

Is data leakage giving me misleading results? Independent test set says no!

TLDR: I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …
Category: Data Science

Regression model for continuous dependent variable and count independent variables

I am currently learning R and I am relatively inexperienced in the field. Hope I can get some advice from you guys! I am working on a project where I have to estimate the average processing time of different work items (tasks). I have the following panel data: my sample size is n=2000 individual workers, and T=10 (each time interval is a four-week period). Independent variables: 51 different work items. I have count data for each work item (# …
Category: Data Science

ML: Classification Model Comparison

I have a dataset that I need to use for classification, and I want to compare the performance of different classification models. Let's assume I want to look at logistic regression (with different cut-off points) and KNN. Is there anything problematic if I proceed as follows: Split the data into training and validation sets (and a test set for the performance evaluation of the winning model). Train a logistic regression model and a KNN classification model on the training set. I …
Category: Data Science

Is there any way to explicitly measure the complexity of a Machine Learning Model in Python?

I'm interested in model debugging, and one of the points it mentions is to compare your model with a "less complex" one to check whether the more complex model performs substantially better than the simpler one. This raises my questions: Suppose you have an ensemble model and a linear model for a classification task. It seems natural to think that the ensemble model is more complex than the linear model. What would it be …
Category: Data Science

Probabilistic Machine Learning model to match spatial data

I have spatial data from multiple sources. This data consists of ID, lat, long, and time. My goal is that, given a new lat-long, the model returns (preferably with a probability) the data points that match the new lat-long. This matching should be based on the features (such as lat, long, timestamp). I could only think of clustering, i.e., cluster the dataset and try to predict which cluster the new data belongs to. The drawback is that if …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.