Turning multiple binary columns into categorical (with fewer columns) with Python Pandas

I want to turn these categories into values of categorical columns. The values in each category are the binary columns currently present in the data frame. A11, A12, etc. are details of A1, so A11 == 1 necessarily implies A1 == 1, but the converse does not hold. The following conditions must be respected: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be 'A11' and we ignore 'A1' …
Category: Data Science

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …
Category: Data Science

Tool vs Python Script for Transforming Data in Mongo

We have a bunch of Mongo collections (data collected from APIs, web scraping, etc) that we need to transform to a cleaner data structure (standardized schema) on a monthly basis. Are there any good tools to help us manage this process, or would you recommend writing a Python script instead?
Category: Data Science

How to disentangle non-mutually exclusive items coded in same question?

I have to work with a dataset where people ostensibly had the option to check several answers to a question (e.g. "check all that apply"). But in the data, all of the selected options are stored in one variable, where they look like "option1_option2", instead of separate variables having been created for option 1, option 2, etc. Is there an easy way to create separate variables based on this? I don't know how to select criteria from these kinds …
Category: Data Science
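
One possible approach to the "option1_option2" question above, as a minimal sketch: pandas can split a delimited string column into one 0/1 indicator column per option with `Series.str.get_dummies`. The frame, the column names `respondent` and `q1`, and the answer values are made up for illustration, and the sketch assumes the option labels themselves never contain the `_` separator.

```python
import pandas as pd

# Hypothetical survey answers: each cell joins the selected options with "_".
responses = pd.DataFrame(
    {
        "respondent": [1, 2, 3],
        "q1": ["option1_option2", "option2", "option1_option3"],
    }
)

# Build one 0/1 indicator column per distinct option found in the joined strings.
indicators = responses["q1"].str.get_dummies(sep="_")

# Attach the indicators to the original frame under a "q1_" prefix.
result = responses.join(indicators.add_prefix("q1_"))
print(result)
```

If real option labels can contain underscores, a different separator or an explicit list of known labels would be needed instead.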

Missing value Imputation in dataset

I have two separate files for testing and training. In the training data, I am dropping rows that contain too many missing values. But in the test data, I cannot afford to drop the rows, so I have chosen to impute the missing values using a KNN approach. My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? As in, neighbors …
Category: Data Science

Performing EDA on a dataset with missing features

I'm new to DS. I want to perform EDA on a dataset where these are the missing-value counts per feature in my train and test sets:
train: Test_0 0, Test_1 31, Test_2 0, Test_3 141, Test_4 0, Test_5 0, Test_6 0, Test_7 0, Test_8 1045, Test_9 0, Test_10 0, Test_11 0, Test_12 0, Test_13 0, Test_14 0, Test_15 2967, Class 0 (dtype: int64)
test: Test_0 0, Test_1 7, Test_2 0, Test_3 46, Test_4 0, Test_5 0, Test_6 0, Test_7 0, Test_8 …
Category: Data Science

How can I calculate total days past due between billing events?

I am dealing with a dataframe with subscription events partitioned by username, subscription status, and relative timestamps. For each of the dates, there are changes in time when the subscription becomes past due and renews as such:
username  subscription_events_name           subscription_events_timestamp
A         subscription_charged_unsuccess     2021-01-08
A         subscription_past_due              2021-01-08
A         subscription_past_due              2021-01-15
A         subscription_charged_successfully  2021-01-16
A         subscription_renew                 2021-01-16
Say a customer enters past due status, and 15 days later their subscription is billed in full and they enter an active status. I want …
Category: Data Science
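
A rough sketch of one way to read the billing-events question above, using the sample rows from the excerpt. It assumes a past-due period starts at the first `subscription_past_due` event and ends at the next `subscription_charged_successfully` event; that rule and the helper name `total_days_past_due` are assumptions, since the excerpt is cut off before the exact requirement.

```python
import pandas as pd

# Sample rows copied from the excerpt.
events = pd.DataFrame(
    {
        "username": ["A"] * 5,
        "subscription_events_name": [
            "subscription_charged_unsuccess",
            "subscription_past_due",
            "subscription_past_due",
            "subscription_charged_successfully",
            "subscription_renew",
        ],
        "subscription_events_timestamp": pd.to_datetime(
            ["2021-01-08", "2021-01-08", "2021-01-15", "2021-01-16", "2021-01-16"]
        ),
    }
)


def total_days_past_due(user_events):
    """Sum the days between entering past due and the next successful charge."""
    # Stable sort keeps the original order of events that share a date.
    ordered = user_events.sort_values("subscription_events_timestamp", kind="stable")
    total = pd.Timedelta(0)
    past_due_since = None
    for name, ts in zip(
        ordered["subscription_events_name"],
        ordered["subscription_events_timestamp"],
    ):
        if name == "subscription_past_due" and past_due_since is None:
            past_due_since = ts  # a past-due period opens
        elif name == "subscription_charged_successfully" and past_due_since is not None:
            total += ts - past_due_since  # the period closes
            past_due_since = None
    return total.days


days_past_due = {
    user: total_days_past_due(group) for user, group in events.groupby("username")
}
print(days_past_due)
```

A period that is still open at the end of the data is ignored here; whether it should count up to some cutoff date is a separate decision.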

Scaling and handling highly correlated features in tabular data for regression

I am working on a regression problem, trying to predict a target variable with seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the correlation coefficients (Pearson r) below in my data. Note that I have included both the numerical predictor variables and the target variable. I am wondering about the following questions: We see that pv3 is …
Category: Data Science

PCA huge parts of missing data filling

I’m performing PCA on different time series and then using K-means clustering to try to group together common factors. The issue I’m facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …
Category: Data Science

Orange3 summarizing data, grouping data values

Is there a simple way in orange3 (not writing a Python script) to summarize data and group similar data values? For example, instead of plotting a scatter with lots of data points, I would like to plot just the average y value at every x value. In pandas, this is easily done with groupby().mean(). Is there a similar widget/feature I'm overlooking in orange?
Category: Data Science
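
For reference, the pandas operation mentioned in the Orange3 question above looks roughly like this. It does not address the Orange side of the question, and the column names `x` and `y` and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical scatter data with repeated x values.
points = pd.DataFrame(
    {
        "x": [1, 1, 2, 2, 2, 3],
        "y": [2.0, 4.0, 1.0, 2.0, 3.0, 5.0],
    }
)

# Collapse to one row per x value, holding the average y at that x.
mean_y = points.groupby("x", as_index=False)["y"].mean()
print(mean_y)
```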

How to prepare Audio-text data for speech recognition

I have gathered some raw audio from all the conferences, meetings, lectures and casual conversations that I was part of. Machine transcription (from Azure, AWS, etc.) did not give good results. I would transcribe it myself so as to have both data and labels (audio + text) for ML training. My question is whether to split the audio at silences into small (3-10 sec.) files and then transcribe each small file, or to keep a large file with timestamps in subtitle.srt format. What if I have a long …
Category: Data Science

How to improve regression neural network?

I am new to deep learning and data science and am trying to increase my knowledge by working on some hackathons. Currently, the hackathon project I am working on has the task of predicting the closing price of a cryptocurrency based on 48 parameters with ~1200 records. So far I have been able to achieve some good accuracy from the model, but my score is still very low. I have tried many things I know of, but it doesn't seem to be affecting the …
Category: Data Science

Downsampling audio files for use in Machine Learning

I'm trying to use the work (neural networks) done in this repo: https://github.com/jtkim-kaist/VAD It says this: "Note: To apply this toolkit to other speech data, the speech data should be sampled with 16kHz sampling frequency." I've got speech data at 48 kHz. I've read in places that reducing the sampling rate is a complicated process: you can't just remove every nth data point, you have to filter things... Is this necessary if I only intend to use the data in the Neural Network …
Category: Data Science
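
To illustrate the filtering concern raised in the downsampling question above, here is a small sketch comparing "keep every third sample" with SciPy's `resample_poly`, which applies a low-pass (anti-aliasing) filter before decimating. The 20 kHz test tone is synthetic and chosen only because it lies above the 8 kHz Nyquist limit of a 16 kHz signal; whether the linked toolkit is sensitive to aliasing is not something this sketch settles.

```python
import numpy as np
from scipy.signal import resample_poly

source_rate = 48_000  # Hz, the rate of the available recordings
target_rate = 16_000  # Hz, the rate the toolkit's note asks for

# One second of a synthetic 20 kHz tone, which a 16 kHz signal cannot represent.
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 20_000 * t)

# Naive decimation: keep every third sample, with no filtering.
naive = tone[::3]

# Polyphase resampling: low-pass filter, then keep every third sample.
filtered = resample_poly(tone, up=1, down=source_rate // target_rate)


def rms(signal):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(signal**2)))


print(len(naive), len(filtered))
print(rms(naive), rms(filtered))
```

The naive version keeps the tone at close to full strength, folded down to a lower frequency, while the filtered version suppresses most of it. For real speech the effect depends on how much energy the recordings carry above 8 kHz.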

Data preprocessing methods

Data cleaning; solving data imbalance (classification); data smoothing (decreasing noise); creating or deleting features from the original data; data transformation (Box-Cox, log transform); making the dataset stationary (time series); and other specific data preprocessing methods in NLP and computer vision (very specific ones). I am trying to research data preparation methods, and so far those are the things I could find. Do you think anything is missing? Thanks.
Category: Data Science

How to deal with highly skewed (on counts) dependent variables?

I am working on a binary classification problem, and the dataset includes several count variables, for example, how many times a customer defaulted on a broadband bill payment in the last 3 months. The problem is that these features are highly skewed. This is what the distribution of the above variable looks like (value, percentage): 0.0 98.175855, 1.0 1.275902, 2.0 0.348707, 3.0 0.199535. This is due to the nature of the event being evaluated during the construction of the …
Category: Data Science

Should I Impute target values?

I am new to data science and I am currently playing around a bit. Data exploration and preparation is really annoying, even though I use pandas. I managed to impute missing values in the independent variables: for numerical data by using the Imputer with the mean strategy, and for one categorical variable I used the LabelEncoder and afterwards imputed with the mode strategy. But now I face the issue that the dependent variable $y$ also contains missing values. Should I delete those lines …
Category: Data Science

PySpark: How do I specify dropna axis in PySpark transformation?

I would like to drop columns that contain only null values using dropna(). With Pandas you can do this by setting the keyword argument axis = 'columns' in dropna(). Here is an example in a GitHub post. How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an available keyword. Note: I do not want to transpose my dataframe for this to work. How would I drop the furniture column from …
Category: Data Science

How can we predict a value after several rows of data?

I have a regression problem in which I have several rows for each week (the number varies, i.e. one week might have 1800 rows and another might have 5000). My target is to predict a value at the end of each week's data. Here's an example of what I need: x is a feature, y is the target. Week 1 ; x1, x2, x3.. x90 Week 1 ; v1, v2, v3... v90 .... 100 more rows Week 1 ; …
Category: Data Science

R code making 1 column into multiple columns with their unique ID

Currently stuck on a data wrangling question in R. So far I've tried variations of this code using tidyverse package, columns 5 and 6 here were the rating and the user: df[,5:6] %>% pivot_wider(names_from = question, values_from = rating, names_sep = ".") %>% unnest(cols = everything())-> df_reformat Each column will be the question ID and the rows are the scores for each user, ideally clustered by group. Data structure needed: repID user Customer question 1 Customer question 2 .... Customer …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.