dataset split for image classification

I am trying to do image classification for 14 categories (around 1,000 images per category), and I initially created two separate folders for training and validation. In this case, do I still need to set a validation split or a subset in the code, or can I use the files as-is for train_ds and val_ds by removing those arguments? The class folder names in the training and validation directories are the same. data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is …
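If the images are already separated into training and validation directories with matching class sub-folders, a minimal sketch of loading each directory directly, without validation_split or subset, could look like this (directory names taken from the question; image size and batch size are illustrative assumptions):

    import tensorflow as tf

    # Hypothetical sizes chosen for illustration only.
    data_dir = 'trainingdatav1'
    data_val = 'Validationv1'

    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_dir,
        image_size=(224, 224),
        batch_size=32,
    )

    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_val,
        image_size=(224, 224),
        batch_size=32,
    )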
Category: Data Science

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income, B: up to $500, C: $500-$700, …, P: $5000-$6000, Q: more than $6000. It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values {'A': 0, 'B': 1, …, 'Q': 17} so I can feed them into the algorithm as integers. What's the proper way of preprocessing …
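A minimal sketch of the ordinal-mapping idea, assuming a pandas DataFrame with a hypothetical 'income' column and using scikit-learn's OrdinalEncoder so the category order is stated explicitly:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Hypothetical example data; 'income' holds the lettered brackets from the question.
    df = pd.DataFrame({'income': ['A', 'C', 'Q', 'B']})

    # Explicit order: 'A' < 'B' < 'C' < ... ; only a few letters shown for brevity.
    order = [['A', 'B', 'C', 'P', 'Q']]
    encoder = OrdinalEncoder(categories=order)
    df['income_encoded'] = encoder.fit_transform(df[['income']])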
Category: Data Science

Is this XGBoost model tending to overfit?

Here is the grid of hyperparameters that I used:

    params = {
        'scale_pos_weight': [1.0],
        'eta': [0.05, 0.1, 0.15, 0.9, 1.0],
        'max_depth': [1, 2, 6, 10, 15, 20],
        'gamma': [0.0, 0.4, 0.5, 0.7]
    }

The dataset is imbalanced, so I used the scale_pos_weight parameter. After 5-fold cross-validation, the F1 score I got is 0.530726530426833.
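A sketch of how such a grid might be searched with 5-fold cross-validation and F1 scoring, assuming xgboost's scikit-learn wrapper and hypothetical arrays X, y for the (imbalanced) features and binary labels:

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # X, y are assumed to exist; 'eta' is xgboost's alias for learning_rate.
    search = GridSearchCV(
        estimator=XGBClassifier(),
        param_grid=params,   # the grid shown above
        scoring='f1',
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)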
Category: Data Science

CNN for subsets of a dataset - how to tune hyperparameters

I have a dataset and would like to train CNNs on subsets of the dataset of different sizes. I already have a CNN that classifies very well when I use the entire dataset. The question now is whether I should additionally optimize the parameters of the CNN for the subsets, regardless of whether I do data augmentation or not. Does it really make sense to change the CNN model for the subsets by using …
Category: Data Science

What is the difference between Pachyderm and Git?

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central, accessible location; it updates all depending datasets when data is added to or changed in a dataset; it can run any transformation, as long as it runs in Docker and accepts a file as input and outputs a file as result; and it versions all …
Category: Data Science

Dataset with Multiple Choice Questions for fine tuning

I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either MC questions based on a text but with no distractors, or classical quizzes that have no context before each question. I have code that generates distractors, and I can just plug …
Category: Data Science

Organizing datasets, dataset version control, MLOps and other questions

I am currently looking into structuring data and workflows for my end-to-end ML pipeline. I have multiple requirements, and ideally I am looking for one platform that can do it all: visualize and organize multiple datasets (ideally something like the Kaggle dataset web interface); do dataset exploration to quickly spot errors in data, biases in annotations, etc.; annotate images and potentially point clouds; provide commenting functionality for all features; keep track of who annotated what on what date; dataset …
Category: Data Science

How to build a model where multiple data points contribute to a result

I'm trying to figure out how to massage data and model the following scenario: customers at a restaurant rate the quality of the service between 1 and 10. I have data on individual interactions between the servers and customers, say the length of the interaction and the type of interaction (refilling a beverage, ordering, cleaning, etc.). The hypothesis is that each interaction contributes to the final score. I want to build a model that tells me, given an interaction, how it moves the score. My intuition …
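One possible framing, sketched here under hypothetical column names (not necessarily the asker's data layout): aggregate the interactions per rated visit and fit a linear model, so each coefficient estimates how much one interaction of that type moves the 1-10 score:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: one row per interaction, with the visit's final rating attached.
    interactions = pd.DataFrame({
        'visit_id': [1, 1, 2, 2, 2],
        'interaction_type': ['ordering', 'refill', 'ordering', 'cleaning', 'refill'],
        'rating': [8, 8, 5, 5, 5],
    })

    # One row per visit: counts of each interaction type as features.
    features = pd.crosstab(interactions['visit_id'], interactions['interaction_type'])
    target = interactions.groupby('visit_id')['rating'].first()

    model = LinearRegression().fit(features, target)
    # Each coefficient estimates how one extra interaction of that type moves the score.
    print(dict(zip(features.columns, model.coef_)))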
Category: Data Science

Extract all data of a month from different years

OK, I had a typo in this question before, which I have now corrected. My DataFrame (df_e) looks like this:

    0,Country,Latitude,Longitude,Altitude,Date,H2,Year,month,dates,a_diffH,H2a
    1,IN,28.58,77.2,212,1964-09-15,-57.6,1964,9,1964-09-15,-3.18,-54.42
    2,IN,28.58,77.2,212,1963-09-15,-120.0,1963,9,1963-09-15,-3.18,-116.82
    3,IN,28.58,77.2,212,1964-05-15,28.2,1964,5,1964-05-15,-3.18,31.38
    ...

I would like to save the data from the 9th month of the years 1963 and 1964 into a new DataFrame. For this I use the command:

    df.loc[df_e['H2a'].isin(['1963-09-15', '1964-09-15'])]

But the result is an empty DataFrame:

    Empty DataFrame
    Columns: [Country, Latitude, Longitude, Altitude, Date, H2, Year, month, dates, a_diffH, H2a]
    Index: []

Where is my mistake?
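A sketch of one way to select September 1963 and September 1964, assuming the Year, month, and dates columns shown above (the exact fix depends on which column actually holds the date values being matched):

    # Assumes df_e is the DataFrame shown above, with integer Year and month columns.
    sept_df = df_e.loc[df_e['Year'].isin([1963, 1964]) & (df_e['month'] == 9)]

    # Alternatively, matching the date strings against the 'dates' column:
    sept_df = df_e.loc[df_e['dates'].isin(['1963-09-15', '1964-09-15'])]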
Category: Data Science

Convert a time series dataset to supervised form for deep learning

I have a dataset like so, and I want to use it for time-series prediction with deep learning. I have this function to frame it as a supervised problem:

    from numpy import array  # needed for array(X), array(y)

    def to_supervised(train, n_input, n_out):
        # flatten data
        data = train
        X, y = list(), list()
        in_start = 0
        # slide over the series one time step at a time
        for _ in range(len(data)):
            in_end = in_start + n_input
            out_end = in_end + n_out
            # only keep windows that fit entirely inside the data
            if out_end <= len(data):
                x_input = data[in_start:in_end, 0]
                x_input = x_input.reshape((len(x_input)))
                X.append(x_input)
                y.append(data[in_end:out_end, 0])
            in_start += 1
        return array(X), array(y)

I am not sure about the functionality of this function. Do you have a replacement for it?
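A possible replacement, sketched under the assumption that train is a 2D NumPy array and the series of interest is column 0, using NumPy's sliding_window_view to build the same input/output windows:

    import numpy as np

    def to_supervised_windows(train, n_input, n_out):
        # Assumes train is a 2D array; the target series is column 0.
        series = train[:, 0]
        windows = np.lib.stride_tricks.sliding_window_view(series, n_input + n_out)
        X = windows[:, :n_input]   # first n_input steps are the inputs
        y = windows[:, n_input:]   # next n_out steps are the targets
        return X, y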
Topic: dataset
Category: Data Science

Which algorithm to use for transactional data

I'm given a dataset of transactions and asked to find insights for businesses. I'm extremely new to ML / data science and have only been experimenting with KMeans. The dataset has the following features: merchant ID, transaction date, military time, amount, card amount paid, merchant name, town, area code, client ID, age band, gender code, province, average income 3 months, card value spending, card tapped. Ignoring NULL data, what type of analysis can I do on this data? I have …
Category: Data Science

How to split train/test in recommender systems

I am working with the MovieLens10M dataset, predicting user ratings. If I want to fairly evaluate my algorithm, how should I split my training vs. test data? By default, I believe the data is split into train vs. test sets where 'test' contains movies previously unseen in the training set. If my model requires each movie to have been seen at least once in the training set, how should I split my data? Should I take all but N of …
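One common option, sketched with a hypothetical ratings DataFrame (columns userId, movieId, rating), is a per-user holdout: reserve up to N ratings per user for test, keep the rest in train, and drop any test row whose movie never appears in train:

    import pandas as pd

    # ratings is assumed to have columns: userId, movieId, rating.
    N = 5  # hold out up to N ratings per user for testing (illustrative)

    shuffled = ratings.sample(frac=1, random_state=42)
    test = shuffled.groupby('userId').head(N)
    train = shuffled.drop(test.index)

    # Keep only test rows whose movie also appears at least once in train.
    test = test[test['movieId'].isin(train['movieId'].unique())]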
Category: Data Science

seasonality in classification model

I am building a classification model to predict customer status one year from a given time. There seems to be some seasonality (for example, more changes occur in summer than in winter), so my dataset (mainly the labels) would change depending on how I define the prediction time (e.g., Jan 2020) and the predicted time (e.g., Jan 2021). Let's say there are 100 customers; I could make 1,200 entries (100 per month for every month in 2020, where labels are from …
Category: Data Science

Is the Dataset XiangyaDerm available anywhere?

I've searched far and wide; does anybody know how to access the XiangyaDerm dataset? They say in their paper that it is accessible. It has 150k images of skin lesions, which is far more than all other currently publicly available datasets combined (~106k). XiangyaDerm: A Clinical Image Dataset of Asian Race for Skin Disease Aided Diagnosis https://airl.csu.edu.cn/PDFs/LABELS2019_XiangyaDerm.pdf
Topic: dataset
Category: Data Science

Error Loading and Training on Tensorflow's 'Speech Commands Dataset'

I am trying to replicate the most basic version of this Google LEAF example. I am having problems loading the TensorFlow Speech Commands dataset. I load the datasets in as a TFRecord:

    tfds.load('speech_commands', download='true', shuffle_files='false')

I then map the train, test, and eval datasets through this preprocessing function:

    def preprocess(sample):
        audio = sample['audio']
        label = sample['label']
        # scale int16 audio samples to floats
        audio = tf.cast(audio, tf.float32) / tf.int16.max
        return audio, label

I then create my model and attempt to train on my train dataset: #Model …
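A sketch of how the loading step is often written with tensorflow_datasets, assuming the standard TFDS split names for speech_commands; note that download and shuffle_files are boolean flags, so the string values 'true'/'false' are simply treated as truthy rather than as settings:

    import tensorflow as tf
    import tensorflow_datasets as tfds

    # Boolean flags rather than strings; split names follow the TFDS catalog.
    (train_ds, val_ds, test_ds), info = tfds.load(
        'speech_commands',
        split=['train', 'validation', 'test'],
        shuffle_files=True,
        with_info=True,
    )

    def preprocess(sample):
        audio = tf.cast(sample['audio'], tf.float32) / tf.int16.max
        return audio, sample['label']

    # padded_batch because the raw clips can have different lengths.
    train_ds = train_ds.map(preprocess).padded_batch(32)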
Category: Data Science

Sampling methods for Text datasets (NLP)

I am working on two text datasets: one has 68k text samples and the other has 100k. I have encoded the text into BERT embeddings, e.g.:

    Text sample: 'I am working on NLP' ==> BERT encoding ==> [0.98, 0.11, 0.12, ..., nth]
    # raw text: 68k samples
    # BERT encodings: shape [68000, 1024]

I want to try different custom NLP models on these embeddings, but the dataset is too large to test a model's performance quickly. To check different models quickly, the best …
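A minimal sketch of one common approach, assuming hypothetical arrays embeddings (shape [68000, 1024]) and labels: draw a stratified subsample so class proportions are preserved while experimenting:

    from sklearn.model_selection import train_test_split

    # Keep, say, 10% of the data as a quick experimentation subset (size is illustrative).
    X_small, _, y_small, _ = train_test_split(
        embeddings,
        labels,
        train_size=0.1,
        stratify=labels,
        random_state=42,
    )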
Category: Data Science

Explain forward filling and backward filling (data filling)

Can I understand it this way? Let me know if any statement is wrong or inaccurate. Reason for data filling: assume I have consecutive data (e.g., daily log data) and some values are missing. In order to do some calculation (e.g., a mean value), we first need to assign values to the missing parts (e.g., equal to existing data). Forward filling and backward filling are two data-filling methods; the difference is the filling direction? E.g., Tuesday data (missing) …
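A small sketch of the two directions with pandas, using an illustrative daily series with a missing Tuesday value:

    import pandas as pd

    s = pd.Series(
        [1.0, None, 3.0],
        index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),  # Mon, Tue, Wed
    )

    s.ffill()  # forward fill: Tuesday takes Monday's value (1.0)
    s.bfill()  # backward fill: Tuesday takes Wednesday's value (3.0)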
Topic: dataset
Category: Data Science

Train an LSTM on separate sequences of different lengths

My case is the following: I want to train a sequential classifier to recognize what action is being performed given sensor observations. My data consists of 10 executions of an assembly task by 10 different people. So, basically, each person performed the same task and I have the sensor measurements for each millisecond. That means that for each person I have a really big dataset with the corresponding measurements and the labels (which action is being performed) for each millisecond. …
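A minimal sketch of one common way to handle sequences of different lengths in Keras (an assumed setup with hypothetical shapes, not necessarily the asker's): pad the sequences to a common length and let a Masking layer tell the LSTM to ignore the padding:

    import numpy as np
    import tensorflow as tf

    # Hypothetical: a list of per-person sequences, each of shape (timesteps_i, n_features).
    n_features, n_classes = 6, 4
    sequences = [np.random.rand(100, n_features), np.random.rand(250, n_features)]

    padded = tf.keras.preprocessing.sequence.pad_sequences(
        sequences, padding='post', dtype='float32', value=0.0
    )

    model = tf.keras.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),       # per-timestep outputs
        tf.keras.layers.Dense(n_classes, activation='softmax') # one label per timestep
    ])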
Category: Data Science

How to evaluate data imputation techniques

I have a dataset with 29 features, 8 of which have missing values. I've tried scikit-learn's SimpleImputer with all its strategies, KNNImputer with several values of k, and IterativeImputer with various combinations of imputation order, estimators, and number of iterations. My question is how to evaluate the imputation techniques and choose the best one for my data. I can't run a baseline model and evaluate its performance because I'm not familiar with balancing the data and …
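One common way to compare imputers without a downstream model, sketched with a hypothetical numeric array X: hide a random sample of the observed values, impute them, and measure the reconstruction error against the true values:

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    # X is assumed to be a numeric array with NaNs marking the real missing values.
    observed = ~np.isnan(X)
    hide = observed & (rng.random(X.shape) < 0.1)   # hide 10% of the observed cells

    X_holdout = X.copy()
    X_holdout[hide] = np.nan

    for imputer in [SimpleImputer(strategy='mean'), KNNImputer(n_neighbors=5)]:
        X_imputed = imputer.fit_transform(X_holdout)
        error = mean_squared_error(X[hide], X_imputed[hide])
        print(type(imputer).__name__, error)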
Category: Data Science

How to gather training data for simple voice commands?

I'm trying to build a machine learning model for recognizing simple voice commands like up, down, left, etc. On similar problems based on images, I'd just take the picture and assign a label to it. I can generate features and visualize them using librosa, and I hear CNNs are amazing at this task. So I was wondering how I'd gather training data for such audio-based systems, since I can't record an entire clip considering my commands are only going …
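For the feature side mentioned above, a minimal sketch (assuming a hypothetical one-second WAV recording of a single command) of turning a short clip into a log-mel spectrogram "image" that a CNN can take as input:

    import librosa
    import numpy as np

    # Hypothetical file: a short recording containing a single command like "up".
    audio, sr = librosa.load('up_001.wav', sr=16000, duration=1.0)

    # Pad/trim to exactly one second so every example has the same shape.
    audio = librosa.util.fix_length(audio, size=sr)

    # Log-mel spectrogram: a 2D array the CNN can treat like a single-channel image.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (64, time_frames)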
Category: Data Science
