Data Science

Estimating class prevalence in unlabelled data after predicting labels with a binary classifier

CadPat

June 5, 2022 03:03

I'm looking to get an estimate of the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I am hoping to report this percentage as a 95% credible interval instead of as an exact estimate of rate, taking into account the model uncertainties. These are the steps I'm hoping to perform: Train a binary classifier on labelled training data. Use a labelled test set to estimate the specificity and sensitivity of …
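One way the idea in this question could be sketched, assuming the sensitivity and specificity come from confusion-matrix counts on the labelled test set: draw both from Beta posteriors (uniform priors), draw the apparent positive rate in the unlabelled data the same way, and push each draw through the standard Rogan-Gladen correction. The function name, its arguments, and the counts in the usage line are hypothetical, and this is Monte Carlo propagation of uncertainty rather than a full Bayesian model.

```python
import random


def prevalence_interval(tp, fn, tn, fp, n_pred_pos, n_total, draws=20000, seed=0):
    """Approximate 95% interval for true prevalence (hypothetical helper).

    tp, fn, tn, fp: confusion-matrix counts from the labelled test set.
    n_pred_pos, n_total: predicted positives and size of the unlabelled data.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Beta posteriors under uniform Beta(1, 1) priors.
        sens = rng.betavariate(tp + 1, fn + 1)
        spec = rng.betavariate(tn + 1, fp + 1)
        apparent = rng.betavariate(n_pred_pos + 1, n_total - n_pred_pos + 1)
        denom = sens + spec - 1
        if denom <= 0:
            continue  # classifier no better than chance: correction undefined
        # Rogan-Gladen correction of the apparent prevalence.
        prev = (apparent + spec - 1) / denom
        # Crude clipping to the valid range [0, 1].
        samples.append(min(1.0, max(0.0, prev)))
    samples.sort()
    lower = samples[int(0.025 * len(samples))]
    upper = samples[int(0.975 * len(samples)) - 1]
    return lower, upper


# Hypothetical counts, for illustration only.
low, high = prevalence_interval(tp=90, fn=10, tn=95, fp=5,
                                n_pred_pos=3000, n_total=10000)
print(f"95% interval for prevalence: ({low:.3f}, {high:.3f})")
```

With a very large unlabelled dataset, the width of the interval is driven almost entirely by the size of the labelled test set, since that is where the sensitivity and specificity estimates come from.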

Topic: bayesian classification statistics machine-learning

Category: Data Science

Remove all characters following a certain character in a column of a dataset

NewtoR

June 5, 2022 02:05

I have a data set like the following, and the first column contains the groupings. However, some are labelled slightly differently. I need to remove all characters following the punctuation used (bracket, semicolon, comma). groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex" ) I would like this to present all of these just as Group1 (so they would all appear the same as the first two) - so to remove all characters in the string following the punctuation. I then need …

Topic: regex bioinformatics r

Category: Data Science

Model Undetermined Number of Labels

sakher

June 5, 2022 01:58

I'm looking for tutorials on how to build a TensorFlow model that generates predictions from input, for example, generating sentences from a paragraph, where the loss is determined by comparison with ground-truth labels, or generating a number of predictions for objects found in an image. The main idea is having an undetermined number of predictions or labels.

Topic: machine-learning-model prediction tensorflow machine-learning

Category: Data Science

Multiple activation functions with TensorFlow estimator DNNClassifier

David C.

June 5, 2022 01:05

I just want to know if it is possible to use tf.estimator.DNNClassifier with multiple different activation functions. That is, could I use a DNNClassifier estimator that uses different activation functions for different layers? For example, if I have a three-layer model, could I use a sigmoid function for the first layer, a ReLU function for the second, and a tanh function for the last? I would like to know if it isn't possible to do it …

Topic: tensorflow python machine-learning

Category: Data Science

dataset split for image classification

Hello-experts

June 5, 2022 00:06

I am trying to do image classification for 14 categories (around 1000 images per category). I initially created two folders, one for training and one for validation. In this case, do I still need to set a validation split or a subset in the code, or can I use all the files as train_ds and val_ds by deleting them? Folder names in the training and validation directories are the same. data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is …

Topic: validation overfitting image-classification dataset

Category: Data Science

Loading saved model fails

nmorsi

June 4, 2022 23:32

I've trained a model and saved it in .h5 format. When I try to load it, I get this error: ValueError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_588/661726548.py in <module> 9 # returns a compiled model 10 # identical to the previous one ---> 11 reconstructed_model = keras.models.load_model("./custom_model.h5") ~\Anaconda3\lib\site-packages\keras\utils\traceback_utils.py in error_handler(*args, **kwargs) 65 except Exception as e: # pylint: disable=broad-except 66 filtered_tb = _process_traceback_frames(e.__traceback__) ---> 67 raise e.with_traceback(filtered_tb) from None 68 finally: 69 del filtered_tb ~\Anaconda3\lib\site-packages\keras\utils\generic_utils.py in class_and_config_for_serialized_keras_object(config, module_objects, custom_objects, printable_module_name) 560 …

Topic: machine-learning-model keras tensorflow loss-function deep-learning

Category: Data Science

The difference between data science and algorithm development

אבנר יעקב

June 4, 2022 23:08

I see a lot of job opportunities in the field of data science, but I'm not sure of the difference between a data scientist and a deep learning algorithm developer. Can someone explain that to me?

Topic: deep-learning algorithms machine-learning

Category: Data Science

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

marcus

June 4, 2022 22:00

I have a categorical variable that measures the income of a family: A: no income B: Up to $500 C: $500-$700 … P: $5000-$6000 Q: More than $6000 It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values: {'A': 0, 'B': 1, …, 'Q': 17} so I can feed these values to the algorithm as integers. What's the proper way of preprocessing …
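The mapping described in this question can be sketched as follows. The sketch assumes the brackets are the 17 consecutive letters A through Q (the excerpt elides the middle ones), and `encode_income` is a hypothetical helper name.

```python
import string

# Assumption: income brackets are the consecutive letters A..Q (17 levels),
# listed from lowest to highest income.
levels = list(string.ascii_uppercase[:17])

# Zero-based rank of each bracket; the order of the letters is preserved.
income_to_rank = {level: rank for rank, level in enumerate(levels)}


def encode_income(values):
    """Map bracket letters to integer ranks (hypothetical helper)."""
    return [income_to_rank[v] for v in values]


print(encode_income(["A", "C", "Q"]))  # [0, 2, 16]
```

Under that assumption a zero-based mapping ends at 'Q': 16. Integer codes treat neighbouring brackets as equally spaced, which tree-based models largely ignore but linear models do not, so whether the encoding is appropriate depends on the algorithm.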

Topic: data-wrangling preprocessing dataset machine-learning

Category: Data Science

How to add the Luong Attention Mechanism into CNN?

xniwniw

June 4, 2022 21:17

As I write my CNN model for an image binary classification below, I'm trying to add an attention layer to this model. I read from tf.keras.layers.Attention: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention But I still don't know exactly how to use it, any help is appreciated. model = keras.Sequential() model.add(Conv2D(filters = 64, kernel_size = (3, 3), activation = 'relu', padding='same', input_shape = ((256,256,3)))) model.add(MaxPooling2D(pool_size = (2, 2), strides=(2, 2))) model.add(Conv2D(filters = 128, kernel_size = (3, 3), activation = 'relu', padding='same')) model.add(MaxPooling2D(pool_size = (2, 2), strides=(2, …

Topic: attention-mechanism keras convolutional-neural-network

Category: Data Science

Prediction issue with xgboost custom loss

phil

June 4, 2022 21:00

I have an issue with xgboost custom objectives: I cannot get consistent forecasts. In other words, the scale of my forecasts is not in line with the values I would like to predict. I tried many custom losses, but I always get the same issue. import numpy as np import pandas as pd import xgboost as xgb from sklearn.datasets import make_regression n_samples_train = 500 n_samples_test = 100 n_features = 200 X, y = make_regression(n_samples_train, n_features,noise=10) X_test, y_test …

Topic: prediction xgboost machine-learning

Category: Data Science

What are the differences between the below feature selection methods?

Niyaz

June 4, 2022 20:48

Do the two code snippets below do the same thing? If not, what are the differences? fs = RFE(estimator=RandomForestClassifier(), n_features_to_select=10) fs.fit(X, y) print(fs.support_) fs = RandomForestClassifier() fs.fit(X, y) print(fs.feature_importances_[:10])

Topic: scikit-learn feature-selection machine-learning

Category: Data Science

What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

Allan_Aj5

June 4, 2022 20:02

I am trying to understand what happens inside the IDF part of the TFIDF vectorizer. The official scikit-learn page says that the shape is (4,9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: it makes sense to me that for every unique feature (9), for each document (4), we calculate each term's frequency, so we get a matrix of shape (4,9). But what does not make sense to me is the IDF …
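A pure-Python sketch of the shapes involved may help here. It assumes the four-sentence corpus that, to my knowledge, the scikit-learn documentation uses, a tokenizer that keeps words of two or more characters, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1. The L2 row normalisation that TfidfVectorizer applies by default is left out, because it does not change the shape.

```python
import math
import re

# Assumed to match the corpus in the scikit-learn documentation example.
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(d) for d in corpus]
vocab = sorted({term for doc in docs for term in doc})

# TF: one row per document, one column per term -> shape (4, 9).
tf = [[doc.count(term) for term in vocab] for doc in docs]

# IDF: one weight per term, computed across all documents -> length 9.
n_docs = len(docs)
df = [sum(1 for doc in docs if term in doc) for term in vocab]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF: each column of TF is scaled by that term's IDF -> still (4, 9).
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

print(len(tf), len(tf[0]))        # 4 9
print(len(idf))                   # 9
print(len(tfidf), len(tfidf[0]))  # 4 9
```

The IDF is a single vector with one entry per feature, not a matrix. Multiplying it column-wise into the TF matrix is why the result keeps the (4, 9) shape.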

Topic: scikit-learn nlp

Category: Data Science

Temporal Fusion Transformer from PyTorch-Forecasting with Multiple Targets - 'list' error

John Herwig

June 4, 2022 19:22

New to PyTorch and the PyTorch Forecasting library and trying to predict multiple targets using the Temporal Fusion Transformer model. I have 7 targets in a list as my targets variable. I'm using MultiLoss as my loss function with a list of 7 CrossEntropy loss functions (1 per target variable) -- In the problem I'm trying to model, there are 7 possible outcomes per time step and I'm trying to find which option is most likely. I'm looking for a …

Topic: transformer forecasting pytorch lstm time-series

Category: Data Science

Using a neural network to learn regression in image processing

Jay Jackman

June 4, 2022 19:04

I have a camera system with some special optics that warp the field of view of the camera, dependent on two variables, $\theta_1$ and $\theta_2$. Given a specific configuration of these two variables, each pixel on my camera (which is 500x600 resolution) will see a specific coordinate on a screen in front of the camera. I can calculate this for each pixel, but it requires too many computations and is too slow. So, I want to learn a model that …

Topic: image-preprocessing regression neural-network

Category: Data Science

What enables transformers or very deep models "plan" ahead for sequential decision making?

Water Dragon

June 4, 2022 18:29

I was watching this amazing lecture by Oriol Vinyals. On one slide, there is a question asking whether very deep models plan. Transformer models, or models employed in applications like Dialogue Generation, do not have a planning component but behave as if they already have the dialogue planned. Dr. Vinyals mentioned that there are papers on "how transformers are building up knowledge to answer questions or do all sorts of very interesting analyses". Can anyone please refer to a few …

Topic: transformer reinforcement-learning deep-learning neural-network machine-learning

Category: Data Science

How do you do 1-vs-rest classifiers in XGBoost Library (Not Sklearn)?

Sebastian

June 4, 2022 18:02

I am working with a very large dataset that would benefit from using training continuation with the xgb_model parameter in xgb.train(). The label (Y) of the dataset has 4 classes and is highly imbalanced, so I would like to generate per-label PR curves to evaluate performance, and would thus need to treat each class as its own binary problem using a one-vs-rest classifier. After a lot of reading I haven't found an equivalent to sklearn's OneVsRestClassifier in …

Topic: xgboost multiclass-classification bigdata machine-learning

Category: Data Science

Is this XGBoost model tending to overfit?

Suvrodip Mukhopadhyay

June 4, 2022 17:38

Here is the list of hyperparameters that I used: params = { 'scale_pos_weight': [1.0], 'eta': [0.05, 0.1, 0.15, 0.9, 1.0], 'max_depth': [1, 2, 6, 10, 15, 20], 'gamma': [0.0, 0.4, 0.5, 0.7] } The dataset is imbalanced so I used scale_pos_weight parameter. After 5 fold cross validation the f1 score that I got is: 0.530726530426833

Topic: hyperparameter-tuning overfitting xgboost hyperparameter dataset

Category: Data Science

Neural network / machine learning approach to model specific sequencing-classification problem in industry

Kunis

June 4, 2022 17:27

I am working on a project that involves developing a machine learning/deep learning model for an application in a roll-to-roll industry. For a long time, I have been looking for similar problems as a way to get some guidance, but I was never able to find anything related. Basically, the problem can be seen as follows: An industrial machine is producing a roll of some material, which tends to have visible defects throughout the roll. I have already available a machine …

Topic: lstm deep-learning classification machine-learning

Category: Data Science

How to predict the sentiment of the entities from the tweet?

coding_ninza

June 4, 2022 17:05

I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author. Objective 1: Get the most frequent entities from the tweets. Objective 2: Find out the sentiment/polarity of each author towards each of the entities. Sample Input: Assume we have only 3 tweets: Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not. Tweet2 by Author2: Empire Apples are very tasty. Tweet3 by Author3: Pink Pearl Apples are not tasty. Sample …

Topic: spacy stanford-nlp sentiment-analysis language-model nlp

Category: Data Science

Random Forest Classifier Output

Pavan

June 4, 2022 16:42

I used a RandomForestClassifier for my prediction model, but the output printed is either 0 or a decimal. What do I need to do for my model to show me 0s and 1s instead of decimals? Note: I used feature importance and removed the least important columns; still, the accuracy is the same and the output hasn't changed much. Also, I have my estimators set to 1000. Should I increase or decrease this? Edit: target col 1 0 0 1 output col 0.994 …
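Decimal outputs like these usually mean the values are class probabilities (for example from predict_proba) or come from a regressor, since RandomForestClassifier.predict returns class labels directly. A minimal sketch of turning such scores into 0/1 labels with a cut-off is shown below. The helper name, the 0.5 threshold, and every score after 0.994 are hypothetical, chosen only for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Convert probability-like scores to 0/1 labels (hypothetical helper)."""
    return [1 if score >= threshold else 0 for score in scores]


# Only 0.994 appears in the question; the other scores are made up.
scores = [0.994, 0.03, 0.41, 0.77]
print(to_labels(scores))  # [1, 0, 0, 1]
```

The threshold need not be 0.5. With imbalanced classes it is often chosen from a validation set instead.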

Topic: prediction random-forest predictive-modeling machine-learning

Category: Data Science
