model-selection

Ethical consequences of non-deterministic learning processes?

lcrmorin

May 31, 2022 13:59

Most advanced supervised learning techniques are non-deterministic by construction. The final output of the model usually depends on some random parts of the learning process (random weight initialization for neural networks, or variable selection / splits for gradient boosted trees). This phenomenon can be observed by plotting the predictions for a given random seed against the predictions for another seed: the predictions are usually correlated but don't coincide exactly. Generally speaking, this is often not a problem. When trying …

Topic: ethical-ai methodology model-selection

Category: Data Science
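
To make the seed effect in the question above concrete, here is a minimal sketch in plain Python. It assumes a toy 1-D regression fitted with a few epochs of SGD; the data, `make_data`, `train_sgd`, and all hyper-parameters are made up for illustration and are not the asker's setup.

```python
import random

def make_data(n=20, data_seed=42):
    """Toy 1-D regression data: y = 2x + 1 plus a little Gaussian noise."""
    rng = random.Random(data_seed)
    xs = [i / n for i in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def train_sgd(xs, ys, seed, epochs=5, lr=0.1):
    """Fit y ~ w*x + b with plain SGD; the seed drives init and sample order."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

def predict(model, xs):
    w, b = model
    return [w * x + b for x in xs]

xs, ys = make_data()
preds_a = predict(train_sgd(xs, ys, seed=0), xs)
preds_b = predict(train_sgd(xs, ys, seed=1), xs)
preds_a_again = predict(train_sgd(xs, ys, seed=0), xs)

# Same data, same algorithm, different seed: the predictions disagree slightly.
max_gap = max(abs(a - b) for a, b in zip(preds_a, preds_b))
print(f"largest disagreement between seeds: {max_gap:.4f}")
```

Re-running with the same seed reproduces the predictions exactly, so fixing and recording the seed makes a given result repeatable, although it does not remove the dependence on an arbitrary choice.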

Capping labels negatively impacts business metric

aghd

May 29, 2022 15:12

I have a deep neural network model with an integer label to predict. The label is heavily skewed, so we cap the labels at some value (let's say the 90th percentile). When we build and run the model, it performs well in general. But an online experiment shows a degradation in business metrics for the fraction of users that have high-value labels. If we don't cap the label, the business metrics get skewed for users with a low number of activities. …

Topic: model-selection deep-learning machine-learning

Category: Data Science
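
The capping step described in the question above can be sketched as follows. This is only an illustration, assuming "cap at the 90th percentile" means clipping the upper tail; `cap_labels` and the sample labels are hypothetical.

```python
import statistics

def cap_labels(labels, pct=90):
    """Cap labels at the given percentile (clips the upper tail only)."""
    # With n=100, statistics.quantiles returns the 1st..99th percentile cut points.
    cut_points = statistics.quantiles(labels, n=100, method="inclusive")
    cap = cut_points[pct - 1]
    return [min(y, cap) for y in labels], cap

labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up, heavily skewed labels
capped, cap = cap_labels(labels)

# Share of rows whose training target no longer matches the true label.
affected = sum(y > cap for y in labels) / len(labels)
print(f"cap={cap}, share of capped rows={affected:.0%}")
```

Every label above the cap becomes the same target value, so a model trained on the capped labels has no signal to tell high-value users apart, which may be related to the degradation seen for that group.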

Model for predicting duration based on categorical data

Kadin

May 28, 2022 17:05

I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these basic steps: …

Topic: model-selection python predictive-modeling categorical-data

Category: Data Science

Population stability Index vs Population Accuracy Index

PavanKumar

May 23, 2022 06:04

Can anyone explain to me the difference between the Population Stability Index (PSI) and the Population Accuracy Index (PAI)?

Topic: model-selection statistics machine-learning

Category: Data Science
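
As background for the question above, PSI is commonly computed as the sum over bins of (actual - expected) * ln(actual / expected), where both arguments are bin proportions of the same variable in two samples. The sketch below covers only PSI and leaves PAI aside; the function name, the `eps` guard, and the example proportions are assumptions made for illustration.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are bin proportions over the same bins
    (each summing to 1). `eps` guards against empty bins and is an
    assumption of this sketch, not part of the definition.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # hypothetical score distribution at training time
current = [0.10, 0.20, 0.30, 0.40]   # hypothetical distribution in production
print(round(psi(baseline, current), 4))
```

The index is zero when the two distributions match bin for bin and grows as they drift apart, so it describes a shift in the population rather than the accuracy of a model's predictions.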

How can I export the best classifier from my code to a model for real future usage?

Tempu

May 17, 2022 13:40

    # Read the CSV file
    df = pd.read_csv('processed.csv', header=0, engine='python')

    # Pre-processing the data
    # Define X,Y features
    X = df.drop('Class', axis=1)
    Y = df['Class']

    # prepare configuration for cross validation test harness
    seed = 3

    # prepare models
    models = [('LR', LogisticRegression()),
              ('LDA', LinearDiscriminantAnalysis()),
              ('KNN', KNeighborsClassifier()),
              ('CART', DecisionTreeClassifier()),
              ('NB', GaussianNB()),
              ('SVM', SVC())]

    # evaluate each model in turn
    results = []
    names = []
    scoring = 'accuracy'
    for name, model in models:
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = …

Topic: model-selection scikit-learn classification predictive-modeling machine-learning

Category: Data Science

How do I select the "best" unsupervised machine learning algorithm to cluster my specific dataset?

Alex

May 7, 2022 07:00

I want to cluster a dataset without prior knowledge of the correct number of clusters. For different algorithms (e.g. k-means, GMM...) I can iterate through different values and try to find the best solution for any given algorithm (e.g. elbow curve, silhouette coefficient etc.). But I get very different results - as expected with different algorithms. K-means is good for spherical clusters, density-based approaches for totally different cluster shapes. Now the actual question: How do I select the "best" unsupervised machine learning …

Topic: unsupervised-learning model-selection algorithms data-mining

Category: Data Science
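
One way to put the outputs of different algorithms on a common scale, as the question above asks, is to score each labeling of the same points with a single internal index. The sketch below uses the mean silhouette coefficient in plain Python; the points, the two candidate labelings, and the names `algo_a` / `algo_b` are made up, and the function assumes at least two clusters and no fully coincident data.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for one clustering of `points`."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of p's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(p, m) for k, m in clusters.items() if k != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
candidates = {
    "algo_a": [0, 0, 1, 1],  # hypothetical output of one algorithm
    "algo_b": [0, 1, 0, 1],  # hypothetical output of another
}
best = max(candidates, key=lambda name: silhouette(points, candidates[name]))
print(best)
```

The silhouette coefficient itself tends to favor compact, convex clusters, so a ranking like this can be biased against density-based results; it is a starting point rather than a neutral judge.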

Same validation accuracy, different train accuracy for two neural network models

sneaky_lobster

May 6, 2022 05:06

I'm performing emotion classification on the FER2013 dataset. I'm trying to measure the performance of different models, and when I tried ImageDataGenerator with a model I had already used, I ran into the following situation. Model without data augmentation: train_accuracy = 0.76, val_accuracy = 0.70. Model with data augmentation: train_accuracy = 0.86, val_accuracy = 0.70. As you can see, validation accuracy is the same in both models, but train accuracy is significantly different. In this case: Should I go with …

Topic: data-augmentation model-selection accuracy neural-network

Category: Data Science

Optimizing decision threshold on model with oversampled/imbalanced data

rayven1lk

May 5, 2022 03:01

I'm working on developing a model with a highly imbalanced dataset (0.7% minority class). To remedy the imbalance, I was going to oversample using algorithms from the imbalanced-learn library. I had a workflow in mind which I wanted to share, to get an opinion on whether I'm heading in the right direction or have missed something.

1. Split Train/Test/Val
2. Set up a pipeline for GridSearch and optimize hyper-parameters (the pipeline will only oversample training folds)
3. Scoring metric will be AUC as training set is …

Topic: grid-search smote model-selection cross-validation

Category: Data Science
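
For the threshold part of the workflow above, a minimal sketch in plain Python: it assumes the threshold is chosen on a validation set that has not been oversampled, and it uses F1 as the criterion, which is a choice of this sketch rather than something stated in the question. The scores and labels are invented.

```python
def f1_at(threshold, scores, labels):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Try every distinct score as a cut-off and keep the one with the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))

# Hypothetical validation scores and true labels; this set is NOT oversampled.
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]

threshold = best_threshold(val_scores, val_labels)
print(threshold, round(f1_at(threshold, val_scores, val_labels), 3))
```

Choosing the cut-off on data with the original class ratio matters because oversampling shifts the predicted probabilities, so a threshold tuned on resampled data may not transfer to production traffic.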

Neural Network for solving these linear algebra problems

user135222

May 1, 2022 15:23

Intro

There are several questions on this site about whether or not machine learning can solve specific problems. The answer (in my words) seems to be: "Yes, trivially, if you choose a model to learn your specific problem, but you sometimes may choose a model that can't represent/approximate the correct hypothesis." I would like to choose a neural network model where, a priori, all I know is that the input is a "linear algebra" kind of function.

The Problem

I …

Topic: linear-algebra machine-learning-model model-selection neural-network

Category: Data Science

Retrieve user features in real time from UserId for prediction

rohan23

April 30, 2022 04:07

Let's say I'm building an app like Uber and I want to predict the user's most likely destination based on the user's past history, current latitude/longitude, and time/date. Here is the proposed architecture: let's say I have a pre-trained model hosted as a service. The part I'm struggling with is, how do I get the user features from the database in real time from the RiderID to be used by the prediction service (XGBoost model)? I'm guessing a lookup in …

Topic: model-selection ensemble-modeling apache-spark predictive-modeling machine-learning

Category: Data Science

Propensity model with Only Positive Data

QuantNoob

April 28, 2022 11:04

Is it possible to build a propensity model (i.e., the likelihood that a user will buy an item) using only positive values? For example, I have a bunch of data about Customers (people that bought stuff) and Users (people that haven't bought stuff yet). I want to get the likelihood that a User becomes a Customer. It seems that the only way to do so is to train a model using the data of Customers, therefore using only positive values.

Topic: model-selection predictive-modeling

Category: Data Science

Must feature selection and model selection share the same ratio between development set and test set?

Giorgio Martinez

April 24, 2022 22:08

As the title says: after I have performed feature selection, is it mandatory to keep the same ratio (between development set and test set) in model selection?

Topic: feature-engineering model-selection cross-validation feature-extraction feature-selection

Category: Data Science

Neural Network - Sparsity of collaborative based filtering and modelling the prediction problem

q.Then

April 24, 2022 04:04

I'm fairly new to machine learning and for that matter, neural networks, but for the past couple of days I decided to take a stab at a fairly classical and practical problem of neural networks/machine learning which is recommendation systems. Apologies if this is an unnecessarily broad question, but I found it hard to read up on resources answering this particular question. My main question is, how do you even model the problem (or what directions/advice is there on how …

Topic: model-selection neural-network recommender-system

Category: Data Science

Model works on TF 2.3 but not on 2.6 (model.predict_classes removed?)

Tam

April 19, 2022 06:03

I am writing a project that classifies the date codes on a pack. I have developed a pipeline that works as intended on my PC: I trained the model on my computer and ran the classification script (TF 2.3), and it works well. I copied the files over to my Raspberry Pi 4, which is 64-bit and runs TF 2.6. The script runs OK, but I am not getting the same output; in fact, it returns the same character for each contour. [ '[','[','[','[','[','[','[','[','[','['] …

Topic: tensorflow model-selection

Category: Data Science
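
On the title's side question: `Sequential.predict_classes` was removed in TensorFlow 2.6, and the usual replacement for a softmax output is `np.argmax(model.predict(x), axis=-1)`. The sketch below shows the same argmax step in plain Python on made-up probabilities; the helper name and the values are hypothetical, and it does not by itself explain why the Raspberry Pi run returns the same character for every contour.

```python
def predict_classes(probabilities):
    """Index of the highest-scoring class for each row of model outputs."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Hypothetical softmax outputs for three contours over four character classes.
probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.8],
]
class_ids = predict_classes(probs)
print(class_ids)
```

If the class indices are then mapped back to characters, it is worth checking that the mapping is indexed with these integers rather than with a string representation of the list.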

How to choose Recursive Feature Elimination parameters

Giorgio Martinez

April 15, 2022 20:26

In my project I have >900 features, and I thought of using the Recursive Feature Elimination algorithm to reduce the dimensionality of my problem (in order to improve the accuracy). But I can't figure out how to choose the RFE parameters (the estimator and the number of features to select). Should I use model selection techniques in this case as well? Do you have any advice?

Topic: rfe model-selection dimensionality-reduction

Category: Data Science

Is data leakage giving me misleading results? Independent test set says no!

PeMADS

April 14, 2022 08:07

TLDR: I evaluated a classification model using 10-fold CV with data leakage between the training and test folds. The results were great. I then fixed the data leakage and the results were garbage. I then tested the model on an independent new dataset and the results were similar to the evaluation performed with data leakage. What does this mean? Was my data leakage not relevant? Can I trust my model evaluation and report that performance? Extended version: I'm developing …

Topic: model-evaluations data-leakage overfitting model-selection machine-learning

Category: Data Science

Regression model for continuous dependent variable and count independent variables

Henry Fung

April 13, 2022 14:04

I am currently learning R and I am relatively inexperienced in the field. Hope I can get some advice from you guys! I am working on a project where I have to estimate the average processing time of different work items (tasks). I have the following panel data: My sample size is n=2000 individual workers, and T=10 (each time interval is a four week period) Independent variables: 51 different work items. I have count data for each work item (# …

Topic: model-selection regression r

Category: Data Science

ML: Classification Model Comparison

espressionist

April 10, 2022 18:07

I have a dataset that I need to use for classification, and I want to compare the performance of different classification models. Let's assume I want to look at logistic regression (with different cut-off points) and KNN. Is there anything problematic if I proceed as follows: Split the data into training and validation data (and a test set for the performance evaluation of the winning model). Train a logistic regression model and a KNN classification model on the training set. I …

Topic: model-selection logistic-regression classification

Category: Data Science

Is there any way to explicitly measure the complexity of a Machine Learning Model in Python

Multivac

April 10, 2022 14:22

I'm interested in model debugging, and one of the points that it mentions is to compare your model with a "less complex" one to check whether the performance of the more complex model is substantially better than that of the simpler one. This raises my questions: Suppose you have an ensemble model and a linear model for a classification task. "It seems natural to think that the ensemble model is more complex than the linear model." What would it be …

Topic: model-selection python predictive-modeling r machine-learning

Category: Data Science

Probabilistic Machine Learning model to match spatial data

ajroot

April 9, 2022 02:03

I have spatial data from multiple sources. This data consists of ID, lat, long, and time. My goal is that, given a new lat-long, the model returns (preferably with a probability) the data points that match the new lat-long. This matching should be based on the features (such as lat, long, timestamp). I could only think of clustering, i.e., cluster the dataset and try to predict which cluster the new data belongs to. The drawback is that if …

Topic: probability model-selection geospatial

Category: Data Science
