Estimating class prevalence in unlabelled data after predicting labels with a binary classifier

I'm looking to get an estimate of the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I am hoping to report this percentage as a 95% credible interval instead of as an exact estimate of rate, taking into account the model uncertainties. These are the steps I'm hoping to perform: Train a binary classifier on labelled training data. Use a labelled test set to estimate the specificity and sensitivity of …
Category: Data Science
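
A minimal sketch of how such an interval could be obtained, assuming Beta(1, 1) priors on sensitivity, specificity and the apparent positive rate, and the Rogan-Gladen correction for misclassification. None of this comes from the question itself: the function name and all counts below are hypothetical placeholders.

```python
import random

def prevalence_interval(tp, fn, tn, fp, n_pred_pos, n_total,
                        n_draws=20000, seed=0):
    """Monte Carlo 95% interval for the true prevalence.

    Assumptions (not from the original question): Beta(1, 1) priors and
    the Rogan-Gladen correction
        prevalence = (apparent + spec - 1) / (sens + spec - 1).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Posterior draws given the test-set confusion matrix
        sens = rng.betavariate(tp + 1, fn + 1)
        spec = rng.betavariate(tn + 1, fp + 1)
        # Posterior draw of the rate of predicted positives in the big dataset
        apparent = rng.betavariate(n_pred_pos + 1, n_total - n_pred_pos + 1)
        denom = sens + spec - 1
        if denom <= 0:
            continue  # classifier no better than chance: correction undefined
        p = (apparent + spec - 1) / denom
        draws.append(min(1.0, max(0.0, p)))  # clip to the valid range [0, 1]
    draws.sort()
    lo = draws[int(0.025 * len(draws))]
    hi = draws[int(0.975 * len(draws)) - 1]
    return lo, hi

# Hypothetical counts: labelled test set plus predictions on unlabelled data
lo, hi = prevalence_interval(tp=90, fn=10, tn=180, fp=20,
                             n_pred_pos=3000, n_total=10000)
print(f"95% interval for prevalence: [{lo:.3f}, {hi:.3f}]")
```

The interval widens as the labelled test set shrinks, because the uncertainty in sensitivity and specificity is propagated into every draw.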

Remove all characters following a certain character in a column of a dataset

I have a data set like the following, and the first column contains the groupings. However, some are labelled slightly differently. I need to remove all characters following the punctuation used (bracket, semicolon, comma). groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex" ) I would like this to present all of these just as Group1 (so they would all appear the same as the first two) - so to remove all characters in the string following the punctuation. I then need …
Category: Data Science
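
The question uses R, but the underlying idea is a single regular expression, sketched here in Python using the sample values from the question. The pattern `[(;,].*$` matches the first bracket, semicolon or comma and everything after it; the same pattern should also work with R's `sub()`.

```python
import re

groups = ["Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex"]

# Drop the first bracket, semicolon, or comma and everything after it.
cleaned = [re.sub(r"[(;,].*$", "", g) for g in groups]
print(cleaned)
```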

Model Undetermined Number of Labels

I'm looking for tutorials on how to build a TensorFlow model that generates predictions from input, for example, generating sentences from a paragraph, where the loss is computed by comparing against ground truth labels, or generating a number of predictions for objects found in an image. The main idea is having an undetermined number of predictions or labels.
Category: Data Science

Multiple activation functions with TensorFlow estimator DNNClassifier

I just want to know if it is possible to use tf.estimator.DNNClassifier with multiple different activation functions. I mean, could I use a DNNClassifier estimator which uses different activation functions for different layers? For example, if I have a three-layer model, could I use a sigmoid function for the first layer, a ReLU function for the second and a tanh function for the last? I would like to know if it isn't possible to do it …
Category: Data Science

Dataset split for image classification

I am trying to do image classification for 14 categories (around 1000 images for each category). I initially created two folders, one for training and one for validation. In this case, do I still need to set a validation split or a subset in the code? Or can I use all the files as train_ds and val_ds by deleting them? Folder names in the training and validation directories are the same. data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is …
Category: Data Science

Loading saved model fails

I've trained a model and saved it in .h5 format. When I try loading it, I receive this error: ValueError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_588/661726548.py in <module> 9 # returns a compiled model 10 # identical to the previous one ---> 11 reconstructed_model = keras.models.load_model("./custom_model.h5") ~\Anaconda3\lib\site-packages\keras\utils\traceback_utils.py in error_handler(*args, **kwargs) 65 except Exception as e: # pylint: disable=broad-except 66 filtered_tb = _process_traceback_frames(e.__traceback__) ---> 67 raise e.with_traceback(filtered_tb) from None 68 finally: 69 del filtered_tb ~\Anaconda3\lib\site-packages\keras\utils\generic_utils.py in class_and_config_for_serialized_keras_object(config, module_objects, custom_objects, printable_module_name) 560 …
Category: Data Science

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income B: Up to $500 C: $500-$700 … P: $5000-$6000 Q: More than $6000 It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values: {'A': 0, 'B': 1, …, 'Q': 17} so I can feed these values into the algorithm as integers. What's the proper way of preprocessing …
Category: Data Science
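
A minimal sketch of the integer mapping the question describes, assuming the income codes are consecutive letters from 'A' to 'Q' in increasing income order (the full list is elided in the question, so this is an assumption). Under that assumption 'Q' maps to 16 rather than 17. The sample column is hypothetical.

```python
import string

# Assumed stand-in for the question's income codes 'A' .. 'Q':
# consecutive letters, ordered from lowest to highest income.
codes = string.ascii_uppercase[: string.ascii_uppercase.index("Q") + 1]
ordinal_map = {code: rank for rank, code in enumerate(codes)}

incomes = ["A", "C", "Q", "B"]  # hypothetical sample column
encoded = [ordinal_map[c] for c in incomes]
print(encoded)
```

An integer encoding like this preserves the order but treats the steps between brackets as equally spaced, which tree-based models tolerate better than linear ones.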

How to add the Luong Attention Mechanism into CNN?

As I write my CNN model for an image binary classification below, I'm trying to add an attention layer to this model. I read from tf.keras.layers.Attention: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention But I still don't know exactly how to use it, any help is appreciated. model = keras.Sequential() model.add(Conv2D(filters = 64, kernel_size = (3, 3), activation = 'relu', padding='same', input_shape = ((256,256,3)))) model.add(MaxPooling2D(pool_size = (2, 2), strides=(2, 2))) model.add(Conv2D(filters = 128, kernel_size = (3, 3), activation = 'relu', padding='same')) model.add(MaxPooling2D(pool_size = (2, 2), strides=(2, …
Category: Data Science

Prediction issue with xgboost custom loss

I have an issue with xgboost custom objectives: I cannot get consistent forecasts. In other words, the scale of my forecasts is not in line with the values I would like to predict. I tried many custom losses, but I always get the same issue. import numpy as np import pandas as pd import xgboost as xgb from sklearn.datasets import make_regression n_samples_train = 500 n_samples_test = 100 n_features = 200 X, y = make_regression(n_samples_train, n_features,noise=10) X_test, y_test …
Category: Data Science

What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4,9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: it makes sense to me that for every unique feature (9), for each document (4), we calculate each term's frequency, so we get a matrix of shape (4,9). But what does not make sense to me is the IDF …
Category: Data Science
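
A plain-Python sketch of where the shapes come from, assuming the four-document corpus from the scikit-learn documentation and scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`. The tokenizer below only approximates the library's default, and the final L2 row normalisation is omitted because it does not change the shape.

```python
import math
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Lowercase and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})

# TF: one row per document, one column per term -> shape (4, 9)
tf = [[doc.count(t) for t in vocab] for doc in docs]

# IDF: one weight per term -> a vector of length 9, not a matrix
n_docs = len(docs)
df = [sum(1 for doc in docs if t in doc) for t in vocab]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: each TF column scaled by that term's IDF -> still (4, 9)
tfidf = [[count * w for count, w in zip(row, idf)] for row in tf]

print(len(tf), len(tf[0]))
print(len(idf))
print(len(tfidf), len(tfidf[0]))
```

The IDF step yields one number per vocabulary term, so multiplying it into the TF matrix rescales columns without adding or removing any.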

Temporal Fusion Transformer from PyTorch-Forecasting with Multiple Targets - 'list' error

New to PyTorch and the PyTorch Forecasting library and trying to predict multiple targets using the Temporal Fusion Transformer model. I have 7 targets in a list as my targets variable. I'm using MultiLoss as my loss function with a list of 7 CrossEntropy loss functions (1 per target variable) -- In the problem I'm trying to model, there are 7 possible outcomes per time step and I'm trying to find which option is most likely. I'm looking for a …
Category: Data Science

Using a neural network to learn regression in image processing

I have a camera system with some special optics that warp the field of view of the camera, dependent on two variables, $\theta_1$ and $\theta_2$. Given a specific configuration of these two variables, each pixel on my camera (which is 500x600 resolution) will see a specific coordinate on a screen in front of the camera. I can calculate this for each pixel, but it requires too many computations and is too slow. So, I want to learn a model that …
Category: Data Science

What enables transformers or very deep models to "plan" ahead for sequential decision making?

I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking if very deep models plan. Transformer models or models employed in applications like Dialogue Generation do not have a planning component but behave like they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please refer me to a few …
Category: Data Science

How do you do 1-vs-rest classifiers in XGBoost Library (Not Sklearn)?

I am working with a very large dataset that would benefit from using training continuation with the xgb_model parameter in xgb.train(). The label (Y) of the dataset itself has 4 classes and is highly imbalanced, so I would like to generate per-label PR curves for it to evaluate its performance, and would thus need to treat each class as its own binary problem using a one-vs-rest classifier. After a lot of reading I haven't found an equivalent to sklearn's OneVsRestClassifier in …
Category: Data Science
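
One way to get per-class precision and recall without a dedicated one-vs-rest wrapper is to binarise the labels class by class and score each class against its own probability column. The sketch below assumes a matrix of per-class probabilities is already available (for example from a multi-class model with a soft-probability objective); the labels, probabilities and helper name are all hypothetical.

```python
# Hypothetical labels for a 4-class problem and hypothetical per-class
# probabilities (one row per sample, one column per class).
y_true = [0, 1, 2, 3, 0, 1]
proba = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.3, 0.1, 0.1, 0.5],
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
]

def precision_recall(y_bin, scores, threshold):
    """Precision and recall for one binary (class-vs-rest) problem."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(pred, y_bin))
    fp = sum(p and not y for p, y in zip(pred, y_bin))
    fn = sum((not p) and y for p, y in zip(pred, y_bin))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

results = {}
for k in range(4):
    y_bin = [y == k for y in y_true]    # class k versus the rest
    scores = [row[k] for row in proba]  # probability assigned to class k
    results[k] = precision_recall(y_bin, scores, threshold=0.5)
print(results)
```

Sweeping `threshold` over a grid of values for each class traces out that class's PR curve.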

Is this XGBoost model tending to overfit?

Here is the list of hyperparameters that I used: params = { 'scale_pos_weight': [1.0], 'eta': [0.05, 0.1, 0.15, 0.9, 1.0], 'max_depth': [1, 2, 6, 10, 15, 20], 'gamma': [0.0, 0.4, 0.5, 0.7] } The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score that I got is: 0.530726530426833
Category: Data Science

Neural network / machine learning approach to model specific sequencing-classification problem in industry

I am working on a project which involves developing a machine learning/deep learning model for an application in a roll-to-roll industry. For a long time, I have been looking for similar problems as a way to get some guidance, but I was never able to find anything related. Basically, the problem can be seen as follows: An industrial machine is producing a roll of some material, which tends to have visible defects throughout the roll. I already have available a machine …
Category: Data Science

How to predict the sentiment of the entities from the tweet?

I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author. Objective 1: Get the most frequent entities from the tweets. Objective 2: Find out the sentiment/polarity of each author towards each of the entities. Sample Input: Assume we have only 3 tweets: Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not. Tweet2 by Author2: Empire Apples are very tasty. Tweet3 by Author3: Pink Pearl Apples are not tasty. Sample …
Category: Data Science

Random Forest Classifier Output

I used a RandomForestClassifier for my prediction model, but the output printed is either 0 or a decimal. What do I need to do for my model to show me 0s and 1s instead of decimals? Note: I used feature importance and removed the least important columns; the accuracy is still the same and the output hasn't changed much. Also, I have my estimators equal to 1000. Should I increase or decrease this? Edit: target col 1 0 0 1 output col 0.994 …
Category: Data Science
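
Decimal outputs usually mean the model is returning scores or probabilities (for example from `predict_proba`, or from a regressor) rather than class labels; scikit-learn's `RandomForestClassifier.predict` returns the labels directly. If only scores are available, they can be thresholded, as in this sketch. The score list is hypothetical apart from 0.994, and the 0.5 cut-off is an assumption.

```python
# Hypothetical scores such as those returned by predict_proba(...)[:, 1];
# only 0.994 appears in the question.
scores = [0.994, 0.0, 0.12, 0.87]

threshold = 0.5  # assumed cut-off; tune it for the problem at hand
labels = [1 if s >= threshold else 0 for s in scores]
print(labels)
```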
