Ethical consequences of non-deterministic learning processes?

Most advanced supervised learning techniques are non-deterministic by construction. The final output of the model usually depends on some random parts of the learning process (random weight initialization for Neural Networks, or variable selection / splits for Gradient Boosted Trees). This phenomenon can be observed by plotting the predictions for a given random seed against the predictions for another seed: the predictions are usually correlated but don't coincide exactly. Generally speaking, this is often not a problem. When trying …
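A minimal sketch of how this can be observed, assuming scikit-learn and a synthetic dataset (the model, seeds and subsample setting are arbitrary choices made here to make the randomness visible):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data; only the learner's random_state changes between the two runs
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred_a = GradientBoostingRegressor(subsample=0.5, random_state=1).fit(X_train, y_train).predict(X_test)
pred_b = GradientBoostingRegressor(subsample=0.5, random_state=2).fit(X_train, y_train).predict(X_test)

# The two prediction vectors are highly correlated but do not coincide exactly
print(np.corrcoef(pred_a, pred_b)[0, 1])
print(np.max(np.abs(pred_a - pred_b)))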
Category: Data Science

Can I compare two models trained on different but similar datasets to help find differences between the two datasets?

I have a multivariate dataset that contains A and B. I want to see if there are differences between the A and B samples. I currently have two ideas on how to do this, but I am not sure if they are valid: (1) train a model on A's samples and separately train a model on B's samples, then compare the regression coefficients; (2) train a model with A's samples and compare the errors of a holdout of A's and all of …
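A rough sketch of the first idea (comparing regression coefficients), with synthetic stand-ins for the A and B samples in which only the effect of one feature differs:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_df(slope_x1):
    X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
    y = slope_x1 * X["x1"] + 2.0 * X["x2"] + rng.normal(scale=0.1, size=500)
    return X.assign(y=y)

df_a, df_b = make_df(1.0), make_df(3.0)   # B differs from A only in the effect of x1

def fit_coefs(df, target="y"):
    X = df.drop(columns=[target])
    return pd.Series(LinearRegression().fit(X, df[target]).coef_, index=X.columns)

coefs_a, coefs_b = fit_coefs(df_a), fit_coefs(df_b)
print(pd.DataFrame({"A": coefs_a, "B": coefs_b, "diff": coefs_a - coefs_b}))

Whether such coefficient differences are meaningful still needs a significance check (for example against their standard errors), which this sketch leaves out.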
Category: Data Science

Organization method for sharing research within a company

Currently we are trying to organize a methodology for how different teams can share their projects with other teams. These projects can be papers, code, pptx files, or views on anything. Is there a known scheme, such as a data lake, or anything else that could be useful to our company for this? We recently found that two teams were creating the same project without knowing it. I am open to papers or examples that have already worked in real life.
Topic: methodology
Category: Data Science

How can I learn and apply the scientific method in machine learning?

Rigor Theory. I wish to learn the scientific method and how to apply it in machine learning. Specifically, how to verify that a model has captured the pattern in the data, and how to rigorously reach conclusions based on well-justified empirical evidence. Verification in Practice. My colleagues in both academia and industry tell me that measuring the accuracy of the model on test data is sufficient, but I am not convinced such criteria are enough. Data Science Books. I have picked up multiple data …
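One concrete tool that goes beyond a single test-set accuracy is a permutation test: compare the model's cross-validated score against scores obtained on label-shuffled copies of the data. A minimal sketch using scikit-learn's permutation_test_score (the dataset and model here are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# A small p-value means the observed score is unlikely if the labels carried no signal
score, perm_scores, p_value = permutation_test_score(
    model, X, y, cv=5, n_permutations=100, scoring="accuracy", random_state=0
)
print(f"accuracy={score:.3f}, permutation p-value={p_value:.3f}")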
Category: Data Science

Handling gaps in regression model

I'm facing a regression problem where I'm supposed to predict the delay of some trains. There is one peculiarity, however: a train is not considered delayed until it is more than 10 minutes late (its delay is 0 otherwise). Therefore, the distribution of the target looks like a normal distribution but with a peak at 0. I tried different approaches to solve the problem. First approach: I fitted some regressors on the raw data, but there are a lot of predictions in …
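One approach sometimes used for this kind of zero-inflated target is a two-stage ("hurdle") model: a classifier for delayed vs. not delayed, then a regressor fitted only on the delayed trains. A rough sketch with synthetic stand-ins for the features and the delay target:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in: most trains have delay 0, the rest a positive delay above 10 minutes
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
latent = 10 + 8 * X[:, 0] + rng.normal(scale=5, size=3000)
y = np.where(latent > 10, latent, 0.0)

clf = RandomForestClassifier(random_state=0).fit(X, y > 0)            # stage 1: delayed at all?
reg = RandomForestRegressor(random_state=0).fit(X[y > 0], y[y > 0])   # stage 2: how long, given delayed

# Combined prediction: P(delayed) times the predicted delay given a delay
pred = clf.predict_proba(X)[:, 1] * reg.predict(X)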
Category: Data Science

Splitting sentiment analysis training data into x-train and y-train for an RNN?

Suppose I have a dataset of comments from users across multiple websites, such that in each row there are two comments, one considered more 'negative' and one more 'positive', as indicated by their placement in the 'negative' and 'positive' columns. If I were to preprocess and vectorize the data, how would I split this into x-train and y-train data for a categorical crossentropy RNN? I thought at first to have my x-train be tuples of …
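One common way to frame this (not necessarily the only one) is to flatten each pair into two labeled examples, so that every comment becomes one x sample with a 0/1 class; a minimal Keras-style sketch with hypothetical column names and toy data:

import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

df = pd.DataFrame({
    "negative": ["terrible site, never again", "slow and buggy"],
    "positive": ["great experience overall", "fast and clean design"],
})

# Flatten pairs: each comment becomes one sample, labeled 0 (negative) or 1 (positive)
texts = df["negative"].tolist() + df["positive"].tolist()
labels = [0] * len(df) + [1] * len(df)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)
y_train = to_categorical(labels, num_classes=2)  # one-hot labels for categorical crossentropy

The samples should also be shuffled together before training so the two classes are not presented in separate blocks.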
Category: Data Science

Explaining the logic behind the pipeline method for cross-validation on imbalanced datasets

Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html There is an explanation of how to use from imblearn.pipeline import make_pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage. Here I copy the code used in the notebook linked by the article:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
rf = RandomForestClassifier(n_estimators=100, random_state=13)
imba_pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, …
Category: Data Science

Regressing over tiny floats with Neural Networks

I am trying to regress over very small floats, of the magnitude [1e-2, 9e-3]; they're mostly in this range. Using a simple MSE (mean squared error) loss and backpropagating against it does not lead to very good results. The network usually gets the answer in the right neighbourhood but fails to achieve even decent precision. This suggests MSE penalizes small differences too lightly. I tried checking some articles and results published by people, but they don't seem to yield …
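One thing often worth trying in this situation (it may or may not resolve the asker's case) is rescaling the target so its values are of order 1 before applying MSE, and inverting the transform at prediction time, for example with scikit-learn's TransformedTargetRegressor:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = 1e-2 * (X[:, 0] + 0.5 * X[:, 1]) + 1e-3 * rng.normal(size=2000)  # tiny-magnitude targets

# Standardize y for fitting and invert the scaling when predicting, so the
# loss and its gradients are not dominated by the tiny absolute scale of y
model = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
    transformer=StandardScaler(),
)
model.fit(X, y)
print(model.predict(X[:5]), y[:5])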
Category: Data Science

Methodology for parallelising linked data?

If I have some form of data in which each item can have inherent links to all other data in the set, but I wish to parallelise the work in order to reduce computation time or to reduce the size of any particular piece of data currently being worked on, is there a methodology for splitting it into chunks without reducing the validity of the data? For example, assume I have a grid of crime across the whole of a country. …
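A common pattern for spatially linked data like this is to split the grid into tiles but give each tile a halo (overlap) of neighbouring cells, so that computations near a tile's edge still see the linked context; a minimal numpy sketch (tile and halo sizes are arbitrary):

import numpy as np

def tiles_with_halo(grid, tile=100, halo=10):
    # Yield (row_slice, col_slice, view); the view includes a halo of
    # neighbouring cells, clipped at the grid boundary
    n_rows, n_cols = grid.shape
    for r in range(0, n_rows, tile):
        for c in range(0, n_cols, tile):
            rs = slice(max(r - halo, 0), min(r + tile + halo, n_rows))
            cs = slice(max(c - halo, 0), min(c + tile + halo, n_cols))
            yield rs, cs, grid[rs, cs]

# Example: a country-wide grid of crime counts processed tile by tile
crime_grid = np.random.default_rng(0).poisson(0.3, size=(1000, 1000))
for rs, cs, chunk in tiles_with_halo(crime_grid):
    pass  # each chunk could be dispatched to a separate worker process

How large the halo must be depends on how far the links in the data actually reach; if every point can influence every other point, chunking without some approximation will lose information.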
Category: Data Science

Can identifiers be used to train a model?

I recently participated in a machine learning competition where we were asked to decide whether a rider should accept a course or not (~2k riders and ~140k courses). It turned out that some of the winners used the rider's identifier (an integer unique to each rider) in their features, which was discarded in the default notebook, and it greatly improved their score. Is this legitimate? Can identifiers be used to train a model?
Category: Data Science

Predictive modeling when output affects future input

Assume I have a model that predicts the number of ice creams sold in a store. The model is trained on data from the last 5 years, keeping the last year as a validation set, and has produced very good results. We now put the model into production so that the CFO can create an estimate for the upcoming year's budget. The CFO now looks at the prediction for May, say 2000 ice creams, and thinks "Ooh... …
Category: Data Science

How to extract feature insights to change a classifier's decision?

I don't know if my question is specific enough, but here's what I mean. Suppose we have the high school grades of students who attended a Computer Science degree and whether or not they succeeded (given a certain criterion). I want to create an "adviser" which, given high school grades, points out which features (grades) don't fit (for example, are below a certain "important" range) so they can be adjusted to reach the objective (doing well in a Computer Science degree). Is this possible? …
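What is described sounds close to a counterfactual explanation: find the smallest change to a feature that flips the prediction. A crude brute-force sketch of that idea (the model, the 0-20 grade scale and the data are hypothetical stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in: three high-school grades (0-20 scale) -> success in the CS degree
rng = np.random.default_rng(0)
grades = rng.uniform(0, 20, size=(500, 3))
success = (0.5 * grades[:, 0] + 0.3 * grades[:, 1] + 0.2 * grades[:, 2] > 9).astype(int)
clf = LogisticRegression().fit(grades, success)

def minimal_increase(clf, student, feature, upper=20.0, step=0.5):
    # Smallest increase of one grade that flips the prediction to "success", if any
    for value in np.arange(student[feature], upper + step, step):
        candidate = student.copy()
        candidate[feature] = value
        if clf.predict(candidate.reshape(1, -1))[0] == 1:
            return round(value - student[feature], 1)
    return None

student = np.array([6.0, 8.0, 10.0])
for f in range(3):
    print(f"grade {f}: needs +{minimal_increase(clf, student, f)}")

Dedicated libraries such as DiCE or Alibi implement this kind of counterfactual search more carefully, including changing several features at once.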
Category: Data Science

Payment data prediction at test time

I have clients' payment data. I want to predict the probability of customers paying late, with the target classes being 0-30 days, 30-60 days, 60-90 days, and 90+ days, based on this paper. The features I have are as follows: Amount, Payment Terms, Diff in days, Paid Invoices bef order, Paid invoices late, Ratio of paid inv which were late, Sum of inv bef order, Sum inv late, Ratio of outs inv, Avg days late outs, Target. 14298.0 …
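In terms of framing, this is a standard multiclass probability prediction; a minimal sketch of that framing with a random forest (column names abbreviated and the data synthetic):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the invoice features listed above
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "amount": rng.lognormal(8, 1, 2000),
    "payment_terms": rng.choice([14, 30, 60], 2000),
    "ratio_paid_late": rng.uniform(0, 1, 2000),
    "avg_days_late_outstanding": rng.uniform(0, 60, 2000),
})
y = rng.choice(["0-30", "30-60", "60-90", "90+"], 2000, p=[0.5, 0.25, 0.15, 0.1])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Probability of each lateness bucket for unseen invoices at test time
print(pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_).head())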
Category: Data Science

Is removing poorly predicted data points a valid approach?

I'm getting my feet wet with data science and machine learning. Please bear with me as I try to explain my problem. I haven't been able to find anything about this method, but I suspect there's a simple name for what I'm trying to do. To give an idea about my level of knowledge, I've had some basic statistics training in my university studies (social sciences), and I work as a programmer at the moment. So when it comes to …
Category: Data Science

Alternatives to CRISP-DM for solo projects

I am wondering if there is a data science methodology/model for working that is less prescriptive and detailed than CRISP-DM, but still more of a framework than something generic such as Agile or kanban. My motivation for asking is that I have a number of projects where I will be the only one working on them and the outputs are clear to me; however, I would still like to ensure there is some rigour, process and, to an extent, credibility …
Category: Data Science

Custom thresholds on categorical classification

When assessing a binary classification task, it is possible to search for a particular threshold in order to get a better score on some metric (F1, recall, etc.) through numerous methods. Unfortunately, it looks like this method cannot be applied to a categorical classification task (more than two classes). I've thought about training a simple classifier (SVC, logistic regression, ..., tree) on top of an already trained model in order to find the best thresholds to apply to its outputs to maximize the similarity of results. My proposed workflow is to train …
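One simple way to extend threshold tuning beyond two classes (not necessarily the best one) is to scale each class's predicted probability by a per-class weight before taking the argmax, and to search those weights on a validation set; a rough sketch:

import numpy as np
from itertools import product
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=2000).fit(X_train, y_train).predict_proba(X_val)

# Grid-search one multiplicative weight per class; prediction = argmax of weighted probabilities
best_weights, best_score = None, -1.0
for w in product(np.linspace(0.5, 2.0, 7), repeat=proba.shape[1]):
    pred = np.argmax(proba * np.array(w), axis=1)
    score = f1_score(y_val, pred, average="macro")
    if score > best_score:
        best_weights, best_score = w, score
print("best per-class weights:", best_weights, "macro-F1:", round(best_score, 3))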
Category: Data Science

Reliability of human-level evaluation of the interpretability quality of a model

Christoph Molnar, in his book Interpretable Machine Learning, writes that Human level evaluation (simple task) is a simplified application level evaluation. The difference is that these experiments are not carried out with the domain experts, but with laypersons. This makes experiments cheaper (especially if the domain experts are radiologists) and it is easier to find more testers. An example would be to show a user different explanations and the user would choose the best one. (Chapter = Interpretability, section …
Category: Data Science

Machine Learning in Practice

I worked on a machine learning project where we dealt with relatively small data sets. I noticed that the way that we tried to increase performance was basically to try out a bunch of different models with different hyperparameters, try out a bunch of different sets of features, etc. Basically, it seemed like we approached the problem fairly randomly and that we had no real theoretical basis for anything we tried. This disillusioned me a fair amount and made me …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.