dataset split for image classification

I am trying to do image classification for 14 categories (around 1,000 images per category), and I initially created two separate folders for training and validation. In this case, do I still need to set a validation split or a subset in the code, or can I use the files as-is for train_ds and val_ds by removing those arguments? The class folder names in the training and validation directories are the same. data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is …
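If the images are already separated into training and validation directories with matching class sub-folders, a minimal sketch of loading each directory directly, without validation_split or subset, could look like this (directory names taken from the question; image size and batch size are illustrative assumptions):

    import tensorflow as tf

    # Hypothetical sizes chosen for illustration only.
    data_dir = 'trainingdatav1'
    data_val = 'Validationv1'

    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_dir,
        image_size=(224, 224),
        batch_size=32,
    )

    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_val,
        image_size=(224, 224),
        batch_size=32,
    )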
Category: Data Science

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income, B: up to $500, C: $500-$700, …, P: $5000-$6000, Q: more than $6000. It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values {'A': 0, 'B': 1, …, 'Q': 17} so I can feed them into the algorithm as integers. What's the proper way of preprocessing …
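A minimal sketch of the ordinal-mapping idea, assuming a pandas DataFrame with a hypothetical 'income' column and using scikit-learn's OrdinalEncoder so the category order is stated explicitly:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Hypothetical example data; 'income' holds the lettered brackets from the question.
    df = pd.DataFrame({'income': ['A', 'C', 'Q', 'B']})

    # Explicit order: 'A' < 'B' < 'C' < ... ; only a few letters shown for brevity.
    order = [['A', 'B', 'C', 'P', 'Q']]
    encoder = OrdinalEncoder(categories=order)
    df['income_encoded'] = encoder.fit_transform(df[['income']])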
Category: Data Science

Is this XGBoost model tending to overfit?

Here is the grid of hyperparameters that I used:

    params = {
        'scale_pos_weight': [1.0],
        'eta': [0.05, 0.1, 0.15, 0.9, 1.0],
        'max_depth': [1, 2, 6, 10, 15, 20],
        'gamma': [0.0, 0.4, 0.5, 0.7]
    }

The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score I got is 0.530726530426833.
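A sketch of how such a grid might be searched with 5-fold cross-validation and F1 scoring, assuming xgboost's scikit-learn wrapper and hypothetical arrays X, y for the (imbalanced) features and binary labels:

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # X, y are assumed to exist; 'eta' is xgboost's alias for learning_rate.
    search = GridSearchCV(
        estimator=XGBClassifier(),
        param_grid=params,   # the grid shown above
        scoring='f1',
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)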
Category: Data Science

CNN for subsets of a dataset - how to tune hyperparameters

I have a dataset and would like to train CNNs on subsets of the dataset of different sizes. I already have a CNN that classifies very well when I use the entire dataset. The question now is whether I should additionally optimize the parameters of the CNN for the subsets, regardless of whether I do data augmentation or not. Does it really make sense to change the CNN model for the subsets by using …
Category: Data Science

What is the difference between Pachyderm and Git?

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central, accessible location; it updates all depending datasets when data is added to or changed in a dataset; it can run any transformation, as long as it runs in Docker and accepts a file as input and outputs a file as result; and it versions all …
Category: Data Science

Dataset with Multiple Choice Questions for fine tuning

I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either MC questions based on a text but with no distractors, or classical quizzes that have no context before each question. I have code that generates distractors, and I can just plug …
Category: Data Science

Organizing datasets, dataset version control, MLOps and other questions

I am currently looking into structuring data and workflows for my end-to-end ML pipeline. I have multiple requirements, and ideally I am looking for one platform that can do it all: visualize and organize multiple datasets (ideally something like the Kaggle dataset web interface); do dataset exploration to quickly spot errors in data, biases in annotations, etc.; annotate images and potentially point clouds; provide commenting functionality for all features; keep track of who annotated what on what date; dataset …
Category: Data Science

How to build a model where multiple data points contribute to a result

I'm trying to figure out how to massage data and model the following scenario: customers at a restaurant rate the quality of the service between 1 and 10. I have data on individual interactions between the servers and customers, say the length of the interaction and the type of interaction (refilling a beverage, ordering, cleaning, etc.). The hypothesis is that each interaction contributes to the final score. I want to build a model that tells me, given an interaction, how it moves the score. My intuition …
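One possible framing, sketched here under hypothetical column names (not necessarily the asker's data layout): aggregate the interactions per rated visit and fit a linear model, so each coefficient estimates how much one interaction of that type moves the 1-10 score:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: one row per interaction, with the visit's final rating attached.
    interactions = pd.DataFrame({
        'visit_id': [1, 1, 2, 2, 2],
        'interaction_type': ['ordering', 'refill', 'ordering', 'cleaning', 'refill'],
        'rating': [8, 8, 5, 5, 5],
    })

    # One row per visit: counts of each interaction type as features.
    features = pd.crosstab(interactions['visit_id'], interactions['interaction_type'])
    target = interactions.groupby('visit_id')['rating'].first()

    model = LinearRegression().fit(features, target)
    # Each coefficient estimates how one extra interaction of that type moves the score.
    print(dict(zip(features.columns, model.coef_)))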
Category: Data Science

Extract all data of a month from different years

OK, I had a typo in this question before, which I have now corrected. My DataFrame (df_e) looks like this:

    0,Country,Latitude,Longitude,Altitude,Date,H2,Year,month,dates,a_diffH,H2a
    1,IN,28.58,77.2,212,1964-09-15,-57.6,1964,9,1964-09-15,-3.18,-54.42
    2,IN,28.58,77.2,212,1963-09-15,-120.0,1963,9,1963-09-15,-3.18,-116.82
    3,IN,28.58,77.2,212,1964-05-15,28.2,1964,5,1964-05-15,-3.18,31.38
    ...

I would like to save the data from the 9th month of the years 1963 and 1964 into a new DataFrame. For this I use the command:

    df.loc[df_e['H2a'].isin(['1963-09-15', '1964-09-15'])]

But the result is an empty DataFrame:

    Empty DataFrame
    Columns: [Country, Latitude, Longitude, Altitude, Date, H2, Year, month, dates, a_diffH, H2a]
    Index: []

Where is my mistake?
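A sketch of one way to select September 1963 and September 1964, assuming the Year, month, and dates columns shown above (the exact fix depends on which column actually holds the date values being matched):

    # Assumes df_e is the DataFrame shown above, with integer Year and month columns.
    sept_df = df_e.loc[df_e['Year'].isin([1963, 1964]) & (df_e['month'] == 9)]

    # Alternatively, matching the date strings against the 'dates' column:
    sept_df = df_e.loc[df_e['dates'].isin(['1963-09-15', '1964-09-15'])]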
Category: Data Science

Convert a time series dataset to supervised form for deep learning

I have a dataset like so, and I want to use it for time-series prediction with deep learning. I have this function to frame it as a supervised problem:

    from numpy import array  # needed for array(X), array(y)

    def to_supervised(train, n_input, n_out):
        # flatten data
        data = train
        X, y = list(), list()
        in_start = 0
        # slide over the series one time step at a time
        for _ in range(len(data)):
            in_end = in_start + n_input
            out_end = in_end + n_out
            # only keep windows that fit entirely inside the data
            if out_end <= len(data):
                x_input = data[in_start:in_end, 0]
                x_input = x_input.reshape((len(x_input)))
                X.append(x_input)
                y.append(data[in_end:out_end, 0])
            in_start += 1
        return array(X), array(y)

I am not sure about the functionality of this function. Do you have a replacement for it?
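A possible replacement, sketched under the assumption that train is a 2D NumPy array and the series of interest is column 0, using NumPy's sliding_window_view to build the same input/output windows:

    import numpy as np

    def to_supervised_windows(train, n_input, n_out):
        # Assumes train is a 2D array; the target series is column 0.
        series = train[:, 0]
        windows = np.lib.stride_tricks.sliding_window_view(series, n_input + n_out)
        X = windows[:, :n_input]   # first n_input steps are the inputs
        y = windows[:, n_input:]   # next n_out steps are the targets
        return X, y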
Topic: dataset
Category: Data Science

Which algorithm to use for transactional data

I'm given a dataset of transactions and asked to find insights for businesses. I'm extremely new to ML / data science and have only been experimenting with KMeans. The dataset has the following features: merchant ID, transaction date, military time, amount, card amount paid, merchant name, town, area code, client ID, age band, gender code, province, average income 3 months, card value spending, card tapped. Ignoring NULL data, what type of analysis can I do on this data? I have …
Category: Data Science

How to split train/test in recommender systems

I am working with the MovieLens10M dataset, predicting user ratings. If I want to fairly evaluate my algorithm, how should I split my training vs. test data? By default, I believe the data is split into train vs. test sets where 'test' contains movies previously unseen in the training set. If my model requires each movie to have been seen at least once in the training set, how should I split my data? Should I take all but N of …
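One common option, sketched with a hypothetical ratings DataFrame (columns userId, movieId, rating), is a per-user holdout: reserve up to N ratings per user for test, keep the rest in train, and drop any test row whose movie never appears in train:

    import pandas as pd

    # ratings is assumed to have columns: userId, movieId, rating.
    N = 5  # hold out up to N ratings per user for testing (illustrative)

    shuffled = ratings.sample(frac=1, random_state=42)
    test = shuffled.groupby('userId').head(N)
    train = shuffled.drop(test.index)

    # Keep only test rows whose movie also appears at least once in train.
    test = test[test['movieId'].isin(train['movieId'].unique())]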
Category: Data Science

seasonality in classification model

I am building a classification model to predict customer status one year from a given time. There seems to be some seasonality (for example, more changes occur in summer than in winter), so my dataset (mainly the labels) would change depending on how I define the prediction time (e.g., Jan 2020) and the predicted time (e.g., Jan 2021). Let's say there are 100 customers; I could make 1,200 entries (100 per month for every month in 2020, where labels are from …
Category: Data Science

Is the Dataset XiangyaDerm available anywhere?

I've searched far and wide; does anybody know how to access the XiangyaDerm dataset? They say in their paper that it is accessible. It has 150k images of skin lesions, which is far more than all other currently publicly available datasets combined (~106k). XiangyaDerm: A Clinical Image Dataset of Asian Race for Skin Disease Aided Diagnosis https://airl.csu.edu.cn/PDFs/LABELS2019_XiangyaDerm.pdf
Topic: dataset
Category: Data Science

Error Loading and Training on Tensorflow's 'Speech Commands Dataset'

I am trying to replicate the most basic version of this Google LEAF example. I am having problems loading the TensorFlow Speech Commands dataset. I load the datasets in as a TFRecord:

    tfds.load('speech_commands', download='true', shuffle_files='false')

I then map the train, test, and eval datasets through this preprocessing function:

    def preprocess(sample):
        audio = sample['audio']
        label = sample['label']
        # scale int16 audio samples to floats
        audio = tf.cast(audio, tf.float32) / tf.int16.max
        return audio, label

I then create my model and attempt to train on my train dataset: #Model …
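A sketch of how the loading step is often written with tensorflow_datasets, assuming the standard TFDS split names for speech_commands; note that download and shuffle_files are boolean flags, so the string values 'true'/'false' are simply treated as truthy rather than as settings:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Boolean flags rather than strings; split names follow the TFDS catalog.
    (train_ds, val_ds, test_ds), info = tfds.load(
        'speech_commands',
        split=['train', 'validation', 'test'],
        shuffle_files=True,
        with_info=True,
    )

    def preprocess(sample):
        audio = tf.cast(sample['audio'], tf.float32) / tf.int16.max
        return audio, sample['label']

    # padded_batch because the raw clips can have different lengths.
    train_ds = train_ds.map(preprocess).padded_batch(32)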
Category: Data Science

Sampling methods for Text datasets (NLP)

I am working on two text datasets: one has 68k text samples and the other has 100k. I have encoded the text into BERT embeddings, e.g.:

    Text sample: 'I am working on NLP' ==> BERT encoding ==> [0.98, 0.11, 0.12, ..., nth]
    # raw text: 68k samples
    # BERT encodings: shape [68000, 1024]

I want to try different custom NLP models on these embeddings, but the dataset is too large to test a model's performance quickly. To check different models quickly, the best …
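A minimal sketch of one common approach, assuming hypothetical arrays embeddings (shape [68000, 1024]) and labels: draw a stratified subsample so class proportions are preserved while experimenting:

    from sklearn.model_selection import train_test_split

    # Keep, say, 10% of the data as a quick experimentation subset (size is illustrative).
    X_small, _, y_small, _ = train_test_split(
        embeddings,
        labels,
        train_size=0.1,
        stratify=labels,
        random_state=42,
    )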
Category: Data Science

Explain forward filling and backward filling (data filling)

Can I understand it this way? Let me know if any statement is wrong or inaccurate. Reason for data filling: assume I have consecutive data (e.g., daily log data) and some values are missing. In order to do some calculation (e.g., a mean value), we first need to assign values to the missing parts (e.g., equal to existing data). Forward filling and backward filling are two data-filling methods; the difference is the filling direction? E.g., Tuesday data (missing) …
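A small sketch of the two directions with pandas, using an illustrative daily series with a missing Tuesday value:

    import pandas as pd

    s = pd.Series(
        [1.0, None, 3.0],
        index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),  # Mon, Tue, Wed
    )

    s.ffill()  # forward fill: Tuesday takes Monday's value (1.0)
    s.bfill()  # backward fill: Tuesday takes Wednesday's value (3.0)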
Topic: dataset
Category: Data Science

Train an LSTM on separate sequences of different lengths

My case is the following: I want to train a sequential classifier to recognize what action is being performed given sensor observations. My data consists of 10 executions of an assembly task by 10 different people. So, basically, each person performed the same task and I have the sensor measurements for each millisecond. That means that for each person I have a really big dataset with the corresponding measurements and the labels (which action is being performed) for each millisecond. …
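A minimal sketch of one common way to handle sequences of different lengths in Keras (an assumed setup with hypothetical shapes, not necessarily the asker's): pad the sequences to a common length and let a Masking layer tell the LSTM to ignore the padding:

    import numpy as np
    import tensorflow as tf

    # Hypothetical: a list of per-person sequences, each of shape (timesteps_i, n_features).
    n_features, n_classes = 6, 4
    sequences = [np.random.rand(100, n_features), np.random.rand(250, n_features)]

    padded = tf.keras.preprocessing.sequence.pad_sequences(
        sequences, padding='post', dtype='float32', value=0.0
    )

    model = tf.keras.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),       # per-timestep outputs
        tf.keras.layers.Dense(n_classes, activation='softmax') # one label per timestep
    ])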
Category: Data Science

How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values. I've tried scikit-learn's SimpleImputer with all its strategies, KNNImputer with several values of k, and IterativeImputer with various combinations of imputation order, estimators, and number of iterations. My question is how to evaluate the imputation techniques and choose the best one for my data. I can't run a baseline model and evaluate its performance because I'm not familiar with balancing the data and …
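One common way to compare imputers without a downstream model, sketched with a hypothetical numeric array X: hide a random sample of the observed values, impute them, and measure the reconstruction error against the true values:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    # X is assumed to be a numeric array with NaNs marking the real missing values.
    observed = ~np.isnan(X)
    hide = observed & (rng.random(X.shape) < 0.1)   # hide 10% of the observed cells

    X_holdout = X.copy()
    X_holdout[hide] = np.nan

    for imputer in [SimpleImputer(strategy='mean'), KNNImputer(n_neighbors=5)]:
        X_imputed = imputer.fit_transform(X_holdout)
        error = mean_squared_error(X[hide], X_imputed[hide])
        print(type(imputer).__name__, error)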
Category: Data Science

How to gather training data for simple voice commands?

I'm trying to build a machine learning model for recognizing simple voice commands like up, down, left, etc. On similar problems based on images, I'd just take the picture and assign a label to it. I can generate features and visualize them using librosa, and I hear CNNs are amazing at this task. So I was wondering how I'd gather training data for such audio-based systems, since I can't record an entire clip considering my commands are only going …
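For the feature side mentioned above, a minimal sketch (assuming a hypothetical one-second WAV recording of a single command) of turning a short clip into a log-mel spectrogram "image" that a CNN can take as input:

    import librosa
    import numpy as np

    # Hypothetical file: a short recording containing a single command like "up".
    audio, sr = librosa.load('up_001.wav', sr=16000, duration=1.0)

    # Pad/trim to exactly one second so every example has the same shape.
    audio = librosa.util.fix_length(audio, size=sr)

    # Log-mel spectrogram: a 2D array the CNN can treat like a single-channel image.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (64, time_frames)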
Category: Data Science
