I have data from a store covering the products sold over more than 5 years. Each sale has a customer id, a date, and the quantity of the product. I want to build a machine learning model to predict which products will be sold in the next day(s) for each customer, given that I have N products (~2k) and M customers (~50). I am not able to formulate this problem. It's a regression task (probably), …
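To make the data concrete, here's a minimal sketch of one way to frame it as supervised regression: one row per (customer, product, day), with lagged quantities as features and the next day's quantity as the target. The file name and column names (customer_id, product_id, date, quantity) are assumptions.

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])   # hypothetical raw sales table

# Aggregate to one row per (customer, product, day)
daily = (sales.groupby(["customer_id", "product_id", pd.Grouper(key="date", freq="D")])
              ["quantity"].sum().reset_index())

daily = daily.sort_values("date")
g = daily.groupby(["customer_id", "product_id"])["quantity"]
daily["qty_lag_1"] = g.shift(1)          # quantity sold the previous day
daily["qty_lag_7"] = g.shift(7)          # quantity sold a week earlier
daily["target_next_day"] = g.shift(-1)   # what we want to predict

train = daily.dropna(subset=["qty_lag_1", "qty_lag_7", "target_next_day"])
X, y = train[["qty_lag_1", "qty_lag_7"]], train["target_next_day"]
```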
I come from a software development background, where we have separate servers of the same database (dev, test, prod). The reason is that we develop our apps against the dev DB, run tests against the test DB, and prod is prod. This creates a clear separation so we won't bring down prod while building our app. Do you train your models the same way? Have 3 environments of the same database and, as your …
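As an illustration of what I mean by three environments, here's a minimal sketch of a training job picking its database from an environment variable; the variable name APP_ENV and the connection strings are made up.

```python
import os

# Hypothetical connection strings for the three environments
DB_URLS = {
    "dev":  "postgresql://dev-db.internal/store",
    "test": "postgresql://test-db.internal/store",
    "prod": "postgresql://prod-db.internal/store",
}

env = os.environ.get("APP_ENV", "dev")   # default to dev, never prod
db_url = DB_URLS[env]
print(f"Training against {env} database: {db_url}")
```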
I am looking for tools that allow me to monitor machine learning models once they are in production. I would like to monitor: long-term changes: drift in the feature distributions relative to training time, which would suggest retraining the model; short-term changes: bugs in the features (radical changes of distribution); changes in the performance of the model with respect to a given metric. I have been looking around the Internet, but I don't see any …
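To show the kind of check I mean for the long-term case, here's a minimal sketch (not a tool, and the function and argument names are made up) that compares training vs. production feature distributions with a two-sample Kolmogorov-Smirnov test and flags shifted features.

```python
from scipy.stats import ks_2samp

def drift_report(train_df, prod_df, features, alpha=0.01):
    """Return features whose production distribution differs from training."""
    drifted = []
    for col in features:
        stat, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if p_value < alpha:
            drifted.append((col, stat, p_value))
    return drifted
```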
Consider a neural network $f(x) = w^T_2 \sigma(w^T_1 x)$, where $\sigma(\cdot)$ is an activation function such as ReLU, and $w_1 \in R^{d \times k}$, $w_2 \in R^{k \times o}$ are two weight matrices. I would like to compute the inner product between two initializations of the model's parameters, $\theta = (w_1, w_2)$ and $\theta' = (w'_1, w'_2)$. Should we stack all elements of the network's parameters into a single vector, i.e. $\theta, \theta'$ will each be a big vector with the number of entries equal to …
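Here's a minimal NumPy sketch of the "stack into a single vector" option, with small made-up dimensions; the dot product of the flattened vectors equals the sum of the Frobenius inner products of the individual matrices.

```python
import numpy as np

d, k, o = 10, 5, 3                                          # toy dimensions
w1, w2   = np.random.randn(d, k), np.random.randn(k, o)     # theta
w1p, w2p = np.random.randn(d, k), np.random.randn(k, o)     # theta'

# Flatten each parameter set into one vector of length d*k + k*o
theta       = np.concatenate([w1.ravel(), w2.ravel()])
theta_prime = np.concatenate([w1p.ravel(), w2p.ravel()])

inner = theta @ theta_prime   # equals (w1 * w1p).sum() + (w2 * w2p).sum()
```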
First, I was asked by my manager to make a plot showing produced vs. received items. It's a multistage process, and we are only in charge of one of the steps, which is design. I made a plot comparing received cases against items produced here in my country, items produced out of the country, total produced, and % of advancement. Later, in a meeting, she asked me to show the graph and table I made to the production supervisors, and she …
I have a data science project: predicting a customer's next purchase day. One year of customer behavioral data was split into 9 and 3 months for train and test. Using RFM analysis, I trained models with different classifiers, and the best one's results are as follows: accuracy of the XGB classifier on the training set: 0.93; accuracy on the test set: 0.68. This is a school project, and I was wondering: in real-world projects, how can we evaluate a model's …
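For reference, here's a minimal sketch (synthetic data, and a generic gradient-boosting classifier standing in for XGB) of checking more than raw accuracy when the train/test gap is that large: compare against a trivial baseline and look at per-class precision/recall.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9/3-month split (shuffle=False keeps time order)
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model train accuracy:", model.score(X_train, y_train))
print("model test accuracy:", model.score(X_test, y_test))
print(classification_report(y_test, model.predict(X_test)))
```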
I was recently asked this in a DS interview pertaining to product thinking: if a product feature is being rolled out but an A/B test cannot be performed for whatever reason, how can we measure the efficacy of the feature? My response was along the lines of an exploratory comparison of data pre- and post-rollout, but I was curious whether there are better methods for this. Thanks so much for …
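To make my pre/post answer concrete, here's a minimal sketch of the naive version: compare a daily metric before and after the rollout date with a two-sample t-test. The file, column name, and rollout date are hypothetical, and this ignores trend and seasonality (which is why methods like interrupted time series exist).

```python
import pandas as pd
from scipy.stats import ttest_ind

daily = pd.read_csv("daily_metric.csv", parse_dates=["date"])   # hypothetical metric table
rollout = pd.Timestamp("2023-06-01")                            # hypothetical rollout date

pre  = daily.loc[daily["date"] <  rollout, "conversion_rate"]
post = daily.loc[daily["date"] >= rollout, "conversion_rate"]

stat, p_value = ttest_ind(pre, post, equal_var=False)
print(f"pre mean={pre.mean():.4f}, post mean={post.mean():.4f}, p={p_value:.3f}")
```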
I am working on an ML model to be deployed in a product operating in many countries. The issue I am having is the following: should I train one model and serve it for all countries, or train a model per country and serve each model in its own country? I've faced this problem several times, and to me there's a trade-off in the learning: in the first case, the model has more data to learn from, and it'll be more robust …
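To show the two options I'm weighing, here's a minimal sketch with made-up toy data and a generic scikit-learn model: (a) one global model with country one-hot encoded as a feature, and (b) one model per country.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({                      # hypothetical toy data
    "country": ["FR", "FR", "DE", "DE", "US", "US"] * 10,
    "x1": range(60),
    "y":  [0, 1] * 30,
})

# (a) single global model, country one-hot encoded as an input feature
global_model = make_pipeline(
    make_column_transformer((OneHotEncoder(), ["country"]), remainder="passthrough"),
    LogisticRegression(),
).fit(df[["country", "x1"]], df["y"])

# (b) one model per country, trained only on that country's rows
per_country = {
    country: LogisticRegression().fit(group[["x1"]], group["y"])
    for country, group in df.groupby("country")
}
```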
I am thinking of "deploying" a machine learning model (the pickle is about 3 megabytes). After discussing it with my developer colleagues, they said it would be better if the model were packaged as a Python library instead of a microservice (like a REST API). I wanted to ask your view on this: a pickled model packaged in a dedicated library vs. a REST API, pros and cons? I was thinking that having it as a …
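For contrast with simply importing the pickled model from a package, here's a minimal sketch of the REST-API option (Flask purely as an example; the file name model.pkl and the JSON contract are assumptions).

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:          # hypothetical 3 MB pickled model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[1.2, 3.4, ...]]
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(port=8000)
```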
To validate a recommender model, a usual approach is to create a hold-out group that receives random suggestions (similar to an A/B testing setup). However, in healthcare applications this is not possible, as a random suggestion can put a patient's life at risk. Hence, what is a reasonable approach to validate the model?
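For illustration, here's a minimal sketch of one offline check on historical data (no random suggestions ever served): precision@k of the recommended items against what the patient actually went on to receive. The function and names are made up.

```python
def precision_at_k(recommended, actually_received, k=5):
    """Fraction of the top-k recommendations that appear in the historical record."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(actually_received))
    return hits / k

# Example: score recommendations against what happened historically
print(precision_at_k(["a", "b", "c", "d", "e"], ["c", "f", "a"], k=5))  # 0.4
```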
How do I correctly model an if condition that chooses which estimator/predictor (linear regression, GBT) to use in scikit-learn/spark-ml within a single pipeline?

    if feature_x < constant:
        result = pipeline1.predict(feature_vector)
    else:
        result = pipeline2.predict(feature_vector)

Other than modelling it as a custom transformer/predictor, is there an alternative way to model it in a pipeline?
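For reference, here's a minimal sketch of the "custom predictor" route in scikit-learn: a small meta-estimator that routes each row to one of two sub-models based on a threshold on one feature column. The class and parameter names (ThresholdRouter, feature_idx, threshold) are my own, and a production version would clone the sub-models in fit.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class ThresholdRouter(BaseEstimator, RegressorMixin):
    def __init__(self, low_model, high_model, feature_idx=0, threshold=0.0):
        self.low_model = low_model
        self.high_model = high_model
        self.feature_idx = feature_idx
        self.threshold = threshold

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        mask = X[:, self.feature_idx] < self.threshold
        self.low_model.fit(X[mask], y[mask])      # rows below the threshold
        self.high_model.fit(X[~mask], y[~mask])   # rows at or above it
        return self

    def predict(self, X):
        X = np.asarray(X)
        mask = X[:, self.feature_idx] < self.threshold
        preds = np.empty(len(X))
        preds[mask] = self.low_model.predict(X[mask])
        preds[~mask] = self.high_model.predict(X[~mask])
        return preds
```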
Let's say a model was trained on date $dt_1$ using the available labeled data, split into training and test sets, i.e. $train_{dt_1}$, $test_{dt_1}$. This model is then deployed in production and makes predictions on new incoming data. Some $X$ days pass, and a bunch of labeled data is collected between $dt_1$ and $dt_1 + X$ days; let's call it $Data_X$. In my current approach, I take random samples out of $Data_X$ (e.g. an 80/20 split), so …
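To make my current approach explicit, here's a minimal sketch (file names are hypothetical) of pooling the original labeled data with the newly collected $Data_X$ and redoing a random 80/20 split before refitting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

original  = pd.read_csv("labeled_dt1.csv")   # data available at dt1
new_batch = pd.read_csv("data_x.csv")        # labels collected between dt1 and dt1 + X

combined = pd.concat([original, new_batch], ignore_index=True)
train_df, test_df = train_test_split(combined, test_size=0.2, random_state=42)
```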
I am trying to build a price recommendation solution for clients in a scalable manner. I have two choices, as below. Professional service: a statistician builds a regression model, or some other kind of predictive model, tailored specifically to each client's data. Issue: in the long run there will be scalability problems, as one analyst cannot build models simultaneously for the hundreds of clients who want to come on board and use this service. Hiring 1 …
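To make the scalability contrast concrete, here's a minimal sketch of an automated per-client fitting loop instead of hand-built models; the data layout and column names (client_id, price) are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("client_sales.csv")   # hypothetical pooled table of all clients

models = {}
for client_id, client_df in data.groupby("client_id"):
    X = client_df.drop(columns=["client_id", "price"])   # client-specific features
    y = client_df["price"]
    models[client_id] = LinearRegression().fit(X, y)     # one model per client
```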
I have data and an R script that creates a report from the data. I can't expose the data to the internet, and I also can't expose my script to the internet or to the users. But I would like to take myself out of the loop and allow a couple of users (only three, but they generate reports weekly) to run the script and get the report themselves. I would like the community to suggest how …
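To show the kind of setup I'm imagining, here's a minimal sketch with several assumptions: an internal-only Flask app, an Rscript executable on PATH, and a report.R that writes report.pdf. The users only trigger the run and download the result; they never see the data or the code.

```python
import subprocess
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/run-report")
def run_report():
    # Runs the R script on the server; users only get the finished report back.
    subprocess.run(["Rscript", "report.R"], check=True)
    return send_file("report.pdf", as_attachment=True)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)   # keep this reachable on the internal network only
```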