Ethical consequences of non-deterministic learning processes?

Most advanced supervised learning techniques are non-deterministic by construction. The final output of the model usually depends on some random parts of the learning process (random weight initialization for Neural Networks, or variable selection / splits for Gradient Boosted Trees). This phenomenon can be observed by plotting the predictions for a given random seed against the predictions for another seed: the predictions are usually correlated but don't coincide exactly. Generally speaking, this is often not a problem. When trying …
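A minimal sketch of how this can be observed, assuming scikit-learn and a synthetic dataset (the model, seeds and subsample setting are arbitrary choices made here to make the randomness visible):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data; only the learner's random_state changes between the two runs
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred_a = GradientBoostingRegressor(subsample=0.5, random_state=1).fit(X_train, y_train).predict(X_test)
pred_b = GradientBoostingRegressor(subsample=0.5, random_state=2).fit(X_train, y_train).predict(X_test)

# The two prediction vectors are highly correlated but do not coincide exactly
print(np.corrcoef(pred_a, pred_b)[0, 1])
print(np.max(np.abs(pred_a - pred_b)))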
Category: Data Science

Can I compare two models trained on different but similar datasets to help find differences between the two datasets?

I have a multivariate dataset that contains A and B. I want to see if there are differences between the A and B samples. I currently have two ideas on how to do this, but I am not sure if they are valid: (1) train a model on A's samples and separately train a model on B's samples, then compare the regression coefficients; (2) train a model with A's samples and compare the errors of a holdout of A's and all of …
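A rough sketch of the first idea (comparing regression coefficients), with synthetic stand-ins for the A and B samples in which only the effect of one feature differs:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_df(slope_x1):
    X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
    y = slope_x1 * X["x1"] + 2.0 * X["x2"] + rng.normal(scale=0.1, size=500)
    return X.assign(y=y)

df_a, df_b = make_df(1.0), make_df(3.0)   # B differs from A only in the effect of x1

def fit_coefs(df, target="y"):
    X = df.drop(columns=[target])
    return pd.Series(LinearRegression().fit(X, df[target]).coef_, index=X.columns)

coefs_a, coefs_b = fit_coefs(df_a), fit_coefs(df_b)
print(pd.DataFrame({"A": coefs_a, "B": coefs_b, "diff": coefs_a - coefs_b}))

Whether such coefficient differences are meaningful still needs a significance check (for example against their standard errors), which this sketch leaves out.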
Category: Data Science

Organization method for sharing research within a company

Currently we are trying to organize a methodology for how different teams can share their projects with other teams. These projects can be papers, code, pptx files, or views on anything. Is there a known scheme, such as a data lake, or anything else that could be useful to our company for this? We recently found that two teams were creating the same project without knowing it. I am open to papers or examples that have already worked in real life.
Topic: methodology
Category: Data Science

How can I learn and apply the scientific method in machine learning?

Rigor Theory. I wish to learn the scientific method and how to apply it in machine learning. Specifically, how to verify that a model has captured the pattern in the data, and how to rigorously reach conclusions based on well-justified empirical evidence. Verification in Practice. My colleagues in both academia and industry tell me that measuring the accuracy of the model on test data is sufficient, but I am not convinced such criteria are enough. Data Science Books. I have picked up multiple data …
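One concrete tool that goes beyond a single test-set accuracy is a permutation test: compare the model's cross-validated score against scores obtained on label-shuffled copies of the data. A minimal sketch using scikit-learn's permutation_test_score (the dataset and model here are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# A small p-value means the observed score is unlikely if the labels carried no signal
score, perm_scores, p_value = permutation_test_score(
    model, X, y, cv=5, n_permutations=100, scoring="accuracy", random_state=0
)
print(f"accuracy={score:.3f}, permutation p-value={p_value:.3f}")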
Category: Data Science

Handling gaps in regression model

I'm facing a regression problem where I'm supposed to predict the delay of some trains. There is one peculiarity, however: a train is not considered delayed until it is more than 10 minutes late (its delay is 0 otherwise). Therefore, the distribution of the target looks like a normal distribution but with a peak at 0. I tried different approaches to solve the problem. First approach: I fitted some regressors on the raw data, but there are a lot of predictions in …
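One approach sometimes used for this kind of zero-inflated target is a two-stage ("hurdle") model: a classifier for delayed vs. not delayed, then a regressor fitted only on the delayed trains. A rough sketch with synthetic stand-ins for the features and the delay target:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic stand-in: most trains have delay 0, the rest a positive delay above 10 minutes
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
latent = 10 + 8 * X[:, 0] + rng.normal(scale=5, size=3000)
y = np.where(latent > 10, latent, 0.0)

clf = RandomForestClassifier(random_state=0).fit(X, y > 0)            # stage 1: delayed at all?
reg = RandomForestRegressor(random_state=0).fit(X[y > 0], y[y > 0])   # stage 2: how long, given delayed

# Combined prediction: P(delayed) times the predicted delay given a delay
pred = clf.predict_proba(X)[:, 1] * reg.predict(X)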
Category: Data Science

Splitting sentiment analysis training data into x-train and y-train for an RNN?

Suppose I have a dataset of comments from users across multiple websites, such that in each row there are two comments, one considered more 'negative' and one more 'positive', as indicated by their placement in the 'negative' and 'positive' columns. If I were to preprocess and vectorize the data, how would I split this into x-train and y-train data for a categorical crossentropy RNN? I thought at first to have my x-train be tuples of …
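One common way to frame this (not necessarily the only one) is to flatten each pair into two labeled examples, so that every comment becomes one x sample with a 0/1 class; a minimal Keras-style sketch with hypothetical column names and toy data:

import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

df = pd.DataFrame({
    "negative": ["terrible site, never again", "slow and buggy"],
    "positive": ["great experience overall", "fast and clean design"],
})

# Flatten pairs: each comment becomes one sample, labeled 0 (negative) or 1 (positive)
texts = df["negative"].tolist() + df["positive"].tolist()
labels = [0] * len(df) + [1] * len(df)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)
y_train = to_categorical(labels, num_classes=2)  # one-hot labels for categorical crossentropy

The samples should also be shuffled together before training so the two classes are not presented in separate blocks.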
Category: Data Science

Explaining the logic behind the pipeline method for cross-validation on imbalanced datasets

Reading the following article: https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html There is an explanation of how to use from imblearn.pipeline import make_pipeline in order to perform cross-validation on an imbalanced dataset while avoiding data leakage. Here I copy the code used in the notebook linked by the article:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=45)
rf = RandomForestClassifier(n_estimators=100, random_state=13)
imba_pipeline = make_pipeline(SMOTE(random_state=42), RandomForestClassifier(n_estimators=100, random_state=13))
cross_val_score(imba_pipeline, X_train, y_train, scoring='recall', cv=kf)
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, …
Category: Data Science

Regressing over tiny floats with Neural Networks

I am trying to regress over very small floats, of the magnitude [1e-2, 9e-3]; they're mostly in this range. Using a simple MSE (mean squared error) loss and backpropagating against it does not lead to very good results. The network usually gets the answer in the right neighbourhood but fails to achieve even decent precision. This suggests MSE penalizes small differences too lightly. I tried checking some articles and results published by people, but they don't seem to yield …
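One thing often worth trying in this situation (it may or may not resolve the asker's case) is rescaling the target so its values are of order 1 before applying MSE, and inverting the transform at prediction time, for example with scikit-learn's TransformedTargetRegressor:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = 1e-2 * (X[:, 0] + 0.5 * X[:, 1]) + 1e-3 * rng.normal(size=2000)  # tiny-magnitude targets

# Standardize y for fitting and invert the scaling when predicting, so the
# loss and its gradients are not dominated by the tiny absolute scale of y
model = TransformedTargetRegressor(
    regressor=MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
    transformer=StandardScaler(),
)
model.fit(X, y)
print(model.predict(X[:5]), y[:5])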
Category: Data Science

Methodology for parallelising linked data?

If I have some form of data in which each item can have inherent links to all other data in the set, but I wish to parallelise the work in order to reduce computation time or to reduce the size of any particular piece of data currently being worked on, is there a methodology for splitting it into chunks without reducing the validity of the data? For example, assume I have a grid of crime across the whole of a country. …
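A common pattern for spatially linked data like this is to split the grid into tiles but give each tile a halo (overlap) of neighbouring cells, so that computations near a tile's edge still see the linked context; a minimal numpy sketch (tile and halo sizes are arbitrary):

import numpy as np

def tiles_with_halo(grid, tile=100, halo=10):
    # Yield (row_slice, col_slice, view); the view includes a halo of
    # neighbouring cells, clipped at the grid boundary
    n_rows, n_cols = grid.shape
    for r in range(0, n_rows, tile):
        for c in range(0, n_cols, tile):
            rs = slice(max(r - halo, 0), min(r + tile + halo, n_rows))
            cs = slice(max(c - halo, 0), min(c + tile + halo, n_cols))
            yield rs, cs, grid[rs, cs]

# Example: a country-wide grid of crime counts processed tile by tile
crime_grid = np.random.default_rng(0).poisson(0.3, size=(1000, 1000))
for rs, cs, chunk in tiles_with_halo(crime_grid):
    pass  # each chunk could be dispatched to a separate worker process

How large the halo must be depends on how far the links in the data actually reach; if every point can influence every other point, chunking without some approximation will lose information.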
Category: Data Science

Can identifiers be used to train a model?

I recently participated in a machine learning competition where we were asked to decide whether a rider should accept a course or not (~2k riders and ~140k courses). It turned out that some of the winners used the rider's identifier (an integer unique to each rider) in their features, which was discarded in the default notebook, and it greatly improved their score. Is this legitimate? Can identifiers be used to train a model?
Category: Data Science

Predictive modeling when output affects future input

Assume I have a model that predicts the number of ice creams sold in a store. The model is trained on data from the last 5 years, keeping the last year as a validation set, and has produced very good results. We now put the model into production so that the CFO can create an estimate for the upcoming year's budget. The CFO now looks at the prediction for May, say 2000 ice creams, and thinks "Ooh... …
Category: Data Science

How to extract feature insights to change a classifier's decision?

I don't know if my question is specific enough, but here's what I mean. Suppose we have the high school grades of students who attended a Computer Science degree and whether or not they succeeded (given a certain criterion). I want to create an "adviser" which, given high school grades, points out which features (grades) don't fit (for example, are below a certain "important" range) so they can be adjusted to reach the objective (doing well in a Computer Science degree). Is this possible? …
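What is described sounds close to a counterfactual explanation: find the smallest change to a feature that flips the prediction. A crude brute-force sketch of that idea (the model, the 0-20 grade scale and the data are hypothetical stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in: three high-school grades (0-20 scale) -> success in the CS degree
rng = np.random.default_rng(0)
grades = rng.uniform(0, 20, size=(500, 3))
success = (0.5 * grades[:, 0] + 0.3 * grades[:, 1] + 0.2 * grades[:, 2] > 9).astype(int)
clf = LogisticRegression().fit(grades, success)

def minimal_increase(clf, student, feature, upper=20.0, step=0.5):
    # Smallest increase of one grade that flips the prediction to "success", if any
    for value in np.arange(student[feature], upper + step, step):
        candidate = student.copy()
        candidate[feature] = value
        if clf.predict(candidate.reshape(1, -1))[0] == 1:
            return round(value - student[feature], 1)
    return None

student = np.array([6.0, 8.0, 10.0])
for f in range(3):
    print(f"grade {f}: needs +{minimal_increase(clf, student, f)}")

Dedicated libraries such as DiCE or Alibi implement this kind of counterfactual search more carefully, including changing several features at once.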
Category: Data Science

Payment data prediction at test time

I have clients' payment data. I want to predict the probability of customers paying late, with the target classes being 0-30 days, 30-60 days, 60-90 days, and 90+ days, based on this paper. The features I have are as follows: Amount, Payment Terms, Diff in days, Paid Invoices bef order, Paid invoices late, Ratio of paid inv which were late, Sum of inv bef order, Sum inv late, Ratio of outs inv, Avg days late outs, Target. 14298.0 …
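In terms of framing, this is a standard multiclass probability prediction; a minimal sketch of that framing with a random forest (column names abbreviated and the data synthetic):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the invoice features listed above
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "amount": rng.lognormal(8, 1, 2000),
    "payment_terms": rng.choice([14, 30, 60], 2000),
    "ratio_paid_late": rng.uniform(0, 1, 2000),
    "avg_days_late_outstanding": rng.uniform(0, 60, 2000),
})
y = rng.choice(["0-30", "30-60", "60-90", "90+"], 2000, p=[0.5, 0.25, 0.15, 0.1])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Probability of each lateness bucket for unseen invoices at test time
print(pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_).head())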
Category: Data Science

Is removing poorly predicted data points a valid approach?

I'm getting my feet wet with data science and machine learning. Please bear with me as I try to explain my problem. I haven't been able to find anything about this method, but I suspect there's a simple name for what I'm trying to do. To give an idea about my level of knowledge, I've had some basic statistics training in my university studies (social sciences), and I work as a programmer at the moment. So when it comes to …
Category: Data Science

Alternatives to CRISP-DM for solo projects

I am wondering if there is a data science methodology/model for working that is less prescriptive and detailed than CRISP-DM, but still more of a framework than something generic such as Agile or kanban. My motivation for asking is that I have a number of projects where I will be the only one working on them and the outputs are clear to me; however, I would still like to ensure there is some rigour, process and, to an extent, credibility …
Category: Data Science

Custom thresholds on categorical classification

When assessing a binary classification task, it is possible to search for a particular threshold in order to get a better score on some metric (F1, recall, etc.) through numerous methods. Unfortunately, it looks like this method cannot be applied to a categorical classification task (more than two classes). I've thought about training a simple classifier (SVC, logistic regression, ..., tree) on top of an already trained model in order to find the best thresholds to apply to its outputs to maximize the similarity of results. My proposed workflow is to train …
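One simple way to extend threshold tuning beyond two classes (not necessarily the best one) is to scale each class's predicted probability by a per-class weight before taking the argmax, and to search those weights on a validation set; a rough sketch:

import numpy as np
from itertools import product
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=2000).fit(X_train, y_train).predict_proba(X_val)

# Grid-search one multiplicative weight per class; prediction = argmax of weighted probabilities
best_weights, best_score = None, -1.0
for w in product(np.linspace(0.5, 2.0, 7), repeat=proba.shape[1]):
    pred = np.argmax(proba * np.array(w), axis=1)
    score = f1_score(y_val, pred, average="macro")
    if score > best_score:
        best_weights, best_score = w, score
print("best per-class weights:", best_weights, "macro-F1:", round(best_score, 3))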
Category: Data Science

Reliability of human-level evaluation of the interpretability quality of a model

Christoph Molnar, in his book Interpretable Machine Learning, writes that Human level evaluation (simple task) is a simplified application level evaluation. The difference is that these experiments are not carried out with the domain experts, but with laypersons. This makes experiments cheaper (especially if the domain experts are radiologists) and it is easier to find more testers. An example would be to show a user different explanations and the user would choose the best one. (Chapter = Interpretability, section …
Category: Data Science

Machine Learning in Practice

I worked on a machine learning project where we dealt with relatively small data sets. I noticed that the way that we tried to increase performance was basically to try out a bunch of different models with different hyperparameters, try out a bunch of different sets of features, etc. Basically, it seemed like we approached the problem fairly randomly and that we had no real theoretical basis for anything we tried. This disillusioned me a fair amount and made me …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.