Best practices for scoring hundreds of models on the same massive dataset?

I have 500+ models predicting various things and a massive database of 400m+ individuals with about 5,000 possible independent variables. Currently, my scoring process takes about 5 days: it chunks the 400m+ records into 100k-person pieces, spins up n threads, each with a particular subset of the 500+ models, and runs this way until all records are scored for all models. Each thread is a Python process which submits R code (i.e. loads an …
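For illustration, a minimal sketch of the chunk-and-parallelize pattern described above, using Python's multiprocessing on toy data; the 100k chunk size comes from the question, while the row/feature/model counts and the in-process matrix multiply are invented stand-ins for the real database pull and R calls:

    from multiprocessing import Pool
    import numpy as np

    CHUNK_SIZE = 100_000                               # 100k-person pieces, as above
    N_ROWS, N_FEATURES, N_MODELS = 1_000_000, 20, 5    # toy stand-ins for 400m/5,000/500+
    WEIGHTS = np.random.default_rng(0).normal(size=(N_MODELS, N_FEATURES))

    def score_chunk(start):
        # The real pipeline would pull rows [start, start + CHUNK_SIZE) from the
        # database and hand them to R; deterministic random data stands in here.
        rows = np.random.default_rng(start).normal(
            size=(min(CHUNK_SIZE, N_ROWS - start), N_FEATURES))
        return start, rows @ WEIGHTS.T                 # one score per row per model

    if __name__ == "__main__":
        with Pool(processes=4) as pool:                # worker processes, not threads
            for start, scores in pool.imap_unordered(
                    score_chunk, range(0, N_ROWS, CHUNK_SIZE)):
                print(start, scores.shape)             # persist the scores here

Because each chunk is independent, the same pattern scales out to a cluster scheduler just as well as to a local process pool.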
Topic: scoring
Category: Data Science

Ranking algorithm based on a handful of features

I am trying to determine an apt algorithm for a ranking problem that I am working on. I have social media metrics (engagement, sentiment, audience size, etc.) for several brands and am looking for a ranking/classification algorithm to rank them. I am not sure whether I have a dependent variable or label class for classical classification algorithms. The data is aggregated by brand, and the algorithm needs to rank the brands based on the metrics. Any ideas would …
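Without a label this is an unsupervised problem, and a common baseline is to min-max normalize each metric and rank brands by a weighted sum. A sketch with pandas, where the metric names, values, and weights are all invented for illustration:

    import pandas as pd

    # Toy brand-level metrics; names and numbers are invented.
    df = pd.DataFrame(
        {"engagement": [120, 340, 90],
         "sentiment": [0.6, 0.2, 0.9],
         "audience": [10_000, 250_000, 4_000]},
        index=["brand_a", "brand_b", "brand_c"])
    weights = {"engagement": 0.5, "sentiment": 0.3, "audience": 0.2}  # assumed

    normalized = (df - df.min()) / (df.max() - df.min())    # scale each metric to [0, 1]
    score = sum(normalized[col] * w for col, w in weights.items())
    print(score.rank(ascending=False).sort_values())        # rank 1 = top brand

The weights encode business judgment; if a trusted ranking ever becomes available, a supervised learning-to-rank model could replace them.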
Category: Data Science

R in production

Many of us are very familiar with using R in reproducible but very much targeted, ad-hoc analysis. Given that R is currently one of the best collections of cutting-edge scientific methods from world-class experts in each particular field, and given that plenty of libraries exist for data I/O in R, it seems very natural to extend its applications into production environments for live decision making. Therefore my questions are: has anyone here gone into production with pure R (I know of …
Category: Data Science

Ranking ATM based on Utilization and Economic Data (Scoring/Rank Model)

I have sample data for around 10 ATM locations along with their utilization counts (deposits, withdrawals, and others) for the past 3 months. I am planning to collect additional data, such as nearby places of commercial interest and other spots where there might be demand for cash. The data is collected within approximately 300 meters of each ATM, i.e., places of commercial interest near the ATM. I would like to build a 'Scoring/Rank Model' which can take all these inputs …
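One simple way to let the data choose the weights, rather than fixing them by hand, is to regress utilization on the location features and rank ATMs by fitted demand. The sketch below invents all feature names and values, and with only 10 ATMs the fit is illustrative at best:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n_atms = 10
    # Invented location features: nearby shops, foot-traffic index, median income.
    X = rng.normal(size=(n_atms, 3))
    utilization = rng.poisson(lam=200, size=n_atms)   # toy 3-month counts

    model = LinearRegression().fit(X, utilization)
    scores = model.predict(X)            # higher fitted value = more expected demand
    print(np.argsort(-scores))           # ATM indices, best first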
Category: Data Science

Scikit-learn with a custom scoring function using a 'feature'

I am trying to use a new metric called 'SERA' (Squared Error Relevance Area) as a custom scoring function for imbalanced regression, as described in this paper: https://link.springer.com/article/10.1007/s10994-020-05900-9. Here is what the paper says, in brief. To calculate SERA, a user-defined quantity known as 'relevance' is required for each feature-label pair. Relevance varies from 0 to 1: 0 for not relevant and 1 for highly relevant. This is the procedure for the calculation of SERA. Relevance …
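Going by the definition in the excerpt (squared error accumulated as the relevance cutoff sweeps from 0 to 1, with SERA the area under that curve), a sketch of the metric might look like the following; the toy targets, predictions, and relevance values are invented:

    import numpy as np

    def sera(y_true, y_pred, relevance, steps=101):
        # SER(t): squared error summed over samples whose relevance >= t.
        # SERA: area under SER(t) for t in [0, 1], via the trapezoidal rule.
        ts = np.linspace(0.0, 1.0, steps)
        sq_err = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
        ser = np.array([sq_err[relevance >= t].sum() for t in ts])
        return ((ser[:-1] + ser[1:]) / 2).sum() * (ts[1] - ts[0])

    y_true = np.array([1.0, 5.0, 10.0, 50.0])
    y_pred = np.array([1.5, 4.0, 12.0, 30.0])
    rel = np.array([0.0, 0.1, 0.4, 1.0])     # invented per-sample relevance
    print(sera(y_true, y_pred, rel))

Since relevance is derived from the target, one way to fit this into make_scorer is to recompute relevance from y_true inside the wrapped function instead of passing it as a separate argument.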
Category: Data Science

Approaches for matching leads to salesmen

I'm starting to tackle a new problem where we are trying to optimally match new leads (prospective customers) for our product to our sales representatives, in the hope of improving bottom-line metrics like conversion rate, average sale price, etc. We have a bunch of data from the leads when they fill out their info on web forms, and from 3rd-party data providers we use to enrich the core web-form data (we try and pull their soft credit score, …
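If a per-pair value can be modeled first (say, predicted conversion probability times expected sale price), the matching step itself can be posed as an assignment problem. A sketch with SciPy, where the value matrix and the per-rep capacity are invented:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(2)
    # value[i, j]: modeled value of routing lead i to rep j (invented numbers).
    value = rng.uniform(size=(6, 3))      # 6 leads, 3 sales reps

    capacity = 2                          # let each rep take up to 2 leads
    tiled = np.tile(value, (1, capacity)) # duplicate rep columns, one per slot
    lead_idx, rep_slot = linear_sum_assignment(tiled, maximize=True)
    assignment = {int(i): int(j) % value.shape[1] for i, j in zip(lead_idx, rep_slot)}
    print(assignment)                     # lead -> rep, maximizing total value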
Category: Data Science

Develop a Scorecard Model with Orange 3.30

I'm a big fan of Orange 3.30; I've been developing some collection strategies and some other CLI work in my current job, and everything has been OK with all the decisions I've been making. But now that we have some historical data to support it, it is time to create our behaviour score, and that requires a scorecard model. I've been reading a lot about Orange 3.30, but nothing seems to approach what I need. The main goal …
Category: Data Science

How can I adapt the accuracy metric for multiclass classification?

I have a multiclass problem with, e.g., 4 classes. I would like a custom metric to assess the model in which predictions are penalized less only when class 3 is predicted as class 2 or class 2 is predicted as class 3 (i.e. the classes in the middle). How can I do this by adapting the sklearn accuracy_score metric or similar? E.g. comparing: predicted_labels = [1,3,0,0,2..] actual = [0,0,2,1,3,3...]
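One way to express this is a credit matrix: exact hits score 1, the 2↔3 confusions earn partial credit (the 0.5 below is an assumption), and every other miss scores 0. The result can be wrapped with sklearn.metrics.make_scorer if it needs to plug into model selection:

    import numpy as np

    def soft_accuracy(y_true, y_pred, n_classes=4, discount=0.5):
        # Like accuracy_score, but confusing class 2 with class 3 (either
        # direction) earns `discount` instead of 0.
        credit = np.eye(n_classes)
        credit[2, 3] = credit[3, 2] = discount
        return credit[np.asarray(y_true), np.asarray(y_pred)].mean()

    actual    = [0, 0, 2, 1, 3, 3]
    predicted = [1, 3, 0, 0, 2, 3]
    print(soft_accuracy(actual, predicted))   # 1 hit + 1 half-credit over 6 = 0.25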
Category: Data Science

Right way to compare model scores for Next Best Action

I have around 15 classification models for different products, built in different ways (some are RF, some are gradient boosting; some were downsampled one way, others another; some are built on 12 months of history, some on 24), and I have to compare their scores to choose which product to offer. All models have target 1 for "customer bought the product" and 0 for "customer didn't buy the product". I have read about this …
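Raw scores from models trained on different samples, algorithms, and windows are generally not on a common scale, so calibrating each model's output into a probability first is one standard fix. A sketch with scikit-learn on toy data; in practice this would be repeated once per product model:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

    # Cross-validated isotonic calibration maps raw scores onto probabilities,
    # so P(buy) becomes comparable across differently built models.
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method="isotonic", cv=5)
    calibrated.fit(X, y)
    print(calibrated.predict_proba(X)[:5, 1])

Once every model emits a calibrated P(buy), the next-best-action choice can compare expected values directly, for example probability times product margin.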
Category: Data Science

How to interpret Sum of Squared Error in a classification task

I am working on an ANN. I have 2497 training examples, each of which is a vector of length 128, so the input size is 128. The number of neurons in the hidden layer is 64 and the number of output neurons is 6 (since there are six classes). My target vector looks something like this: [0 1 0 0 0 0]. This means that the example belongs to class 2. I have used sigmoid as the activation at all layers and sum of squared …
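Concretely, for one example the sum of squared error is the squared gap between the six sigmoid outputs and the one-hot target, summed over the output neurons; the output values below are invented:

    import numpy as np

    # One training example from the question: class 2 of 6, one-hot encoded.
    target = np.array([0, 1, 0, 0, 0, 0])
    output = np.array([0.10, 0.70, 0.05, 0.05, 0.05, 0.05])  # assumed sigmoid outputs

    sse = np.sum((target - output) ** 2)   # summed over the 6 output neurons
    print(sse)                             # 0 only for a perfect prediction

Averaged over all examples this is the quantity training drives down; it says nothing directly about accuracy, which has to be measured separately by taking the argmax of the outputs.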
Category: Data Science

Scikit-learn make_scorer custom metric problem for multiclass classification

I was doing a churn analysis using:

    randomcv = RandomizedSearchCV(estimator=clf, param_distributions=params_grid,
                                  cv=kfoldcv, n_iter=100, n_jobs=-1,
                                  scoring='roc_auc')

and everything was fine, but then I tried it with a custom scoring function this way:

    def gain_fn(y_true, y_prob):
        tp = np.where((y_prob >= 0.02) & (y_true == 1), 40000, 0)
        fp = np.where((y_prob >= 0.02) & (y_true == 0), -1000, 0)
        return np.sum([tp, fp])

    scorer_fn = make_scorer(gain_fn, greater_is_better=True, needs_proba=True)
    randomcv = RandomizedSearchCV(estimator=clf, param_distributions=params_grid,
                                  cv=kfoldcv, n_iter=100, n_jobs=-1,
                                  scoring=scorer_fn)

but I need to make a calculation, inside of gain_fn, with …
Category: Data Science

Random forest model scoring

We are using the random forest algorithm but are having some trouble understanding the scoring method it uses. Take, for example, the following confusion matrices (CMs) of the test set:

Threshold 45, CM: [[67969 48031] [3321 11120]], precision: 0.18799344051632602
Threshold 50, CM: [[77642 38358] [4785 9656]], precision: 0.2011080101632834
Threshold 55, CM: [[88825 27175] [6796 7645]], precision: 0.2195577254445159
Threshold 60, CM: [[100411 15589] [9629 4812]], precision …
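The reported numbers line up with the layout sklearn's confusion_matrix uses, [[TN, FP], [FN, TP]], so precision = TP / (TP + FP). A quick check, assuming that layout:

    # Confusion matrices from above, assumed to be [[TN, FP], [FN, TP]].
    cms = {45: [[67969, 48031], [3321, 11120]],
           50: [[77642, 38358], [4785, 9656]],
           55: [[88825, 27175], [6796, 7645]],
           60: [[100411, 15589], [9629, 4812]]}

    for threshold, ((tn, fp), (fn, tp)) in cms.items():
        precision = tp / (tp + fp)   # reproduces the reported values
        recall = tp / (tp + fn)      # falls as the threshold rises
        print(threshold, round(precision, 4), round(recall, 4))

So raising the threshold trades recall for precision, which is the usual threshold behaviour rather than anything specific to random forests.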
Category: Data Science

Having trouble scaling scores of logistic regression

I am constructing a credit scorecard using logistic regression, similar to the one shown here. However, when trying to convert the coefficients of the logistic regression into score representation (by scaling the values using the provided formula), I am getting numbers that don't make much sense. Formula used for calculating scores: $Score_i = (\beta_i \times WoE_i + \frac{\alpha}{n}) \times Factor + \frac{Offset}{n}$, where $\beta_i$ is the coefficient of the logistic regression (of variable $i$), $WoE_i$ is the weight of evidence of the corresponding …
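For reference, a sketch of that formula end to end, deriving Factor and Offset from the common points-to-double-the-odds convention; the 600-points/50:1-odds/20-PDO targets and all coefficients below are invented:

    import numpy as np

    pdo, base_score, base_odds = 20, 600, 50      # assumed scaling targets
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)

    beta = np.array([0.8, -1.2])    # toy logistic regression coefficients
    woe = np.array([0.35, -0.10])   # WoE of the applicant's bins
    alpha, n = -2.5, len(beta)      # intercept and number of variables

    points = (beta * woe + alpha / n) * factor + offset / n
    print(points, points.sum())     # per-variable points and the total score

A common source of nonsense scores is a sign-convention mismatch: the WoE definition (good over bad versus bad over good) has to agree with the direction of the fitted coefficients.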
Category: Data Science

Data science tools for easing the participation of a business into their scoring system

I'm working in a small company. The company sells products on a website, and a Python script runs every day to assign a score to each product based on a set of parameters (Google Analytics events, similar products' popularity, price, etc.). The problem is that the scoring outcome is not satisfying, and requiring developers to edit this script arbitrarily, based on business people's assumptions, is time-consuming and not a proper way to achieve what the business …
Category: Data Science

Scoring samples after repeated clusterings

I want to assign a score to all points in a group that I cluster several times. I want the score to indicate how consistently each point is grouped with the same individuals across runs. I suppose this idea already exists; however, I didn't find anything, only scores on global clusters, such as the mutual information score. I had an idea: for a point x, count each point y that is in the same cluster as x in two clusterings, or …
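A sketch of exactly that pairwise idea: stack the labels from every run, count how often each pair of points lands in the same cluster, and score each point by how consistent its pairings are (always together and always apart both count as stable). The label matrix is invented:

    import numpy as np

    # Rows = points, columns = repeated clustering runs; entries = cluster labels.
    labels = np.array([[0, 1, 2],
                       [0, 1, 2],
                       [0, 1, 0],
                       [1, 0, 1]])

    n, runs = labels.shape
    same = np.zeros((n, n))
    for r in range(runs):
        same += labels[:, r][:, None] == labels[:, r][None, :]
    same /= runs                           # fraction of runs where i, j co-cluster

    # A pair that is always together or always apart scores 1; a pair that
    # co-clusters in half of the runs is maximally unstable and scores 0.5.
    consistency = np.maximum(same, 1 - same)
    np.fill_diagonal(consistency, np.nan)  # ignore self-pairs
    print(np.nanmean(consistency, axis=1)) # one stability score per point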
Category: Data Science

Why is the F-measure preferred for classification tasks?

Why is the F-measure usually used for (supervised) classification tasks, whereas the G-measure (or Fowlkes–Mallows index) is generally used for (unsupervised) clustering tasks? The F-measure is the harmonic mean of the precision and recall; the G-measure (or Fowlkes–Mallows index) is the geometric mean of the precision and recall. The different means are: F1 (harmonic) $= 2\cdot\frac{precision\cdot recall}{precision + recall}$, Geometric $= \sqrt{precision\cdot recall}$, Arithmetic $= \frac{precision + recall}{2}$. The reason I ask is that I need …
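A quick numerical illustration of how the three means diverge once precision and recall are far apart (the 0.9/0.3 pair is invented):

    import numpy as np

    precision, recall = 0.9, 0.3

    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    g = np.sqrt(precision * recall)                     # geometric mean
    arithmetic = (precision + recall) / 2

    print(round(f1, 3), round(g, 3), round(arithmetic, 3))  # 0.45 0.52 0.6

The harmonic mean always sits lowest, so the F-measure punishes an imbalance between precision and recall hardest of the three.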
Category: Data Science

Standardizing binary decision with other scales (Like 1-5)

In the company I work for there are 2 different evaluation metrics for a song: yes/no (equivalent to like/dislike) and a 1-5 scale. Customers can use both to rate songs they like. I would like to create a model that predicts the next songs a user would probably like. Currently, I'm ignoring the binary data. I wonder if there's a good way of utilizing the binary data as labeled data (and not as a feature). I've thought about two possible solutions: …
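One simple option is to map the yes/no votes onto pseudo-ratings and then standardize per user, so both signal types end up on one scale; the 4.5/1.5 mapping below is purely an assumption:

    import pandas as pd

    ratings = pd.DataFrame({
        "user": ["u1", "u1", "u2", "u2", "u2"],
        "song": ["s1", "s2", "s1", "s3", "s4"],
        "kind": ["scale", "binary", "scale", "binary", "binary"],
        "value": [4.0, 1.0, 2.0, 0.0, 1.0],   # binary: 1 = like, 0 = dislike
    })

    # Assumed mapping: like -> 4.5, dislike -> 1.5, near the ends of the 1-5 scale.
    mask = ratings["kind"] == "binary"
    ratings.loc[mask, "value"] = ratings.loc[mask, "value"].map({1.0: 4.5, 0.0: 1.5})

    # Standardize per user so enthusiastic and reserved raters become comparable.
    ratings["z"] = ratings.groupby("user")["value"].transform(
        lambda v: (v - v.mean()) / v.std())
    print(ratings)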
Category: Data Science

What is the proper way to bin variables for calculating WoE during credit scoring?

I have read this article about developing a credit scorecard in Python, where it is stated that when binning the continuous variables, it needs to be ensured that:

1. Each bin has at least 5% of the observations.
2. Each bin is non-zero for both good and bad loans.
3. The WoE is distinct for each category; similar groups should be aggregated or binned together, because bins with similar WoE have almost the same …
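A sketch of computing WoE per bin and checking the first rule, using the usual WoE = ln(%good / %bad) convention; the data and the choice of equal-frequency binning are invented:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    df = pd.DataFrame({"income": rng.normal(50_000, 15_000, 1_000),
                       "bad": rng.integers(0, 2, 1_000)})  # toy loan outcomes

    df["bin"] = pd.qcut(df["income"], q=5)        # equal-frequency bins (assumed)

    grouped = df.groupby("bin", observed=True)["bad"].agg(["count", "sum"])
    grouped["good"] = grouped["count"] - grouped["sum"]
    dist_good = grouped["good"] / grouped["good"].sum()
    dist_bad = grouped["sum"] / grouped["sum"].sum()
    grouped["woe"] = np.log(dist_good / dist_bad)  # WoE per bin

    print((grouped["count"] / len(df) >= 0.05).all())  # rule 1: >= 5% per bin
    print(grouped[["count", "woe"]])

Rules 2 and 3 translate into similar checks: no zero cells in the good/bad counts, and adjacent bins with nearly equal WoE merged before scoring.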
Topic: scoring
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.