data-cleaning

Turning multiple binary columns into categorical (with fewer columns) with Python Pandas

Legna

June 3, 2022, 22:24

I want to turn these categories into values of categorical columns. The values in each category are the binary columns currently present in the data frame. A11, A12, etc. are details of A1, so A11 == 1 necessarily implies A1 == 1, but the converse does not hold. The following conditions apply: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be 'A11' and we ignore 'A1' …

Topic: categorical-encoding dataframe pandas python data-cleaning

Category: Data Science

Is it good practice to include data cleaning or feature engineering steps in an sklearn pipeline to create a scalable pipeline?

LazyEval

May 30, 2022, 00:03

I am working on implementing a scalable pipeline for cleaning my data and pre-processing it before modeling. I am pretty comfortable with the sklearn Pipeline object that I use for pre-processing but I am not sure if I should include data cleaning, data extraction and feature engineering steps that are typically more specific to the dataset I am working on. My general thinking is that the pre-processing phase would include operations on the data that need to be done after …

Topic: pipelines preprocessing scikit-learn python data-cleaning

Category: Data Science

Tool vs Python Script for Transforming Data in Mongo

Rob

May 29, 2022, 04:08

We have a bunch of Mongo collections (data collected from APIs, web scraping, etc) that we need to transform to a cleaner data structure (standardized schema) on a monthly basis. Are there any good tools to help us manage this process, or would you recommend writing a Python script instead?

Topic: data-cleaning

Category: Data Science

What degrees of freedom should one use when calculating the standard deviation for standardizing data?

Sahil Lohiya

May 24, 2022, 10:11

I am writing a function to standardize data, and I found that we can choose either ddof = 0 or ddof = 1. Which one should I choose, and why? Does it make any difference?

Topic: statistics data-cleaning

Category: Data Science
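
One way to see what the ddof choice asked about above changes is to compute both versions side by side. The sketch below is illustrative only: the helper names `std` and `standardize` and the sample values are made up here, and it uses plain Python rather than any particular library. With ddof = 0 the divisor is n (population standard deviation); with ddof = 1 it is n - 1 (sample standard deviation). The two sets of z-scores therefore differ only by the constant factor sqrt((n - 1) / n), which approaches 1 as n grows.

```python
import math

def std(values, ddof=0):
    """Standard deviation with divisor n - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))

def standardize(values, ddof=0):
    """Return z-scores (value - mean) / std using the given ddof."""
    mean = sum(values) / len(values)
    s = std(values, ddof)
    return [(v - mean) / s for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample values
z_population = standardize(data, ddof=0)  # divides by n
z_sample = standardize(data, ddof=1)      # divides by n - 1

# The ratio between the two versions is the same for every nonzero z-score
ratio = z_sample[0] / z_population[0]
print(std(data, 0), std(data, 1), ratio)
```

Because the two results differ only by a constant rescaling of every value, the choice mainly matters for small samples or when the exact scale of the standardized values is compared across datasets.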

How to disentangle non-mutually exclusive items coded in same question?

Bird

May 17, 2022, 18:24

I have to work with a dataset where people ostensibly had the option to check several options for a question (e.g. "check all that apply"). But in the data, all of the selected options are shown under one variable, where they look like "option1_option2", instead of separate variables having been created for option 1, option 2, etc. Is there an easy way to create separate variables based on this? I don't know how to select criteria from these kinds …

Topic: data-cleaning

Category: Data Science
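
The question above does not name a tool, so the following is only a minimal sketch in plain Python of one way to split combined answers such as "option1_option2" into separate 0/1 variables. It assumes the separator is an underscore and that no option label itself contains an underscore; the function name `split_multiselect` and the sample responses are hypothetical.

```python
def split_multiselect(responses, sep="_"):
    """Turn combined answers like 'a_b' into one 0/1 indicator per option."""
    # Collect every distinct option, in a stable sorted order
    options = sorted({opt for r in responses if r for opt in r.split(sep)})
    rows = []
    for r in responses:
        chosen = set(r.split(sep)) if r else set()
        rows.append({opt: int(opt in chosen) for opt in options})
    return options, rows

# Made-up responses; None stands for a respondent who checked nothing
responses = ["option1_option2", "option2", None, "option1_option3"]
options, rows = split_multiselect(responses)
for row in rows:
    print(row)
```

If the data already lives in a pandas Series, `Series.str.get_dummies(sep="_")` performs a similar split under the same separator assumption.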

Missing value Imputation in dataset

Bharathi A

May 15, 2022, 19:03

I have two separate files for testing and training. In the training data, I am dropping rows that contain too many missing values. But in the test data, I cannot afford to drop rows, so I have chosen to impute the missing values using a KNN approach. My question is: to impute missing values in the test data using KNN, is it enough to consider only the test data? As in, neighbors …

Topic: k-nn data-imputation data-cleaning machine-learning

Category: Data Science

Performing EDA on a dataset with missing features

user135735

May 15, 2022, 05:32

I'm new to DS. I want to perform EDA on a dataset where these are the missing-value counts for my train and test sets: train: Test_0 0 Test_1 31 Test_2 0 Test_3 141 Test_4 0 Test_5 0 Test_6 0 Test_7 0 Test_8 1045 Test_9 0 Test_10 0 Test_11 0 Test_12 0 Test_13 0 Test_14 0 Test_15 2967 Class 0 dtype: int64 test: Test_0 0 Test_1 7 Test_2 0 Test_3 46 Test_4 0 Test_5 0 Test_6 0 Test_7 0 Test_8 …

Topic: exploratory-factor-analysis visualization data-cleaning

Category: Data Science

How can I calculate total days past due between billing events?

datadummy

May 15, 2022, 02:05

I am dealing with a dataframe with subscription events partitioned by username, subscription status, and relative timestamps. For each of the dates, there are changes in time when the subscription becomes past due and renews as such:

username  subscription_events_name            subscription_events_timestamp
A         subscription_charged_unsuccess      2021-01-08
A         subscription_past_due               2021-01-08
A         subscription_past_due               2021-01-15
A         subscription_charged_successfully   2021-01-16
A         subscription_renew                  2021-01-16

Say a customer enters past due status, and 15 days later their subscription is billed in full and they enter an active status. I want …

Topic: pandas python data-cleaning

Category: Data Science

Scaling and handling highly correlated features in tabular data for regression

hAcKnRoCk

May 14, 2022, 07:38

I am working on a regression problem, trying to predict a target variable with seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and got the correlation coefficients (Pearson r) below for my data. Note that I have included both the numerical predictor variables and the target variable. I am wondering about the following questions: We see that pv3 is …

Topic: pearsons-correlation-coefficient preprocessing regression data-cleaning data-mining

Category: Data Science

PCA: filling large stretches of missing data

Simon Nicholls

May 11, 2022, 20:06

I’m performing PCA on different time series and then using k-means clustering to try to group together common factors. The issue I’m facing is that some of the factors come in and out of the time series. For example, I may have 12 years of data points in total; some factors may exist for the entire 12 years, but some may dip in and out (active for the first two years, inactive for three years, active for the rest …

Topic: pca data-cleaning k-means

Category: Data Science

Orange3 summarizing data, grouping data values

Sid

May 9, 2022, 00:01

Is there a simple way in Orange3 (without writing a Python script) to summarize data and group similar data values? For example, instead of plotting a scatter with lots of data points, I would like to plot just the average y value at every x value. In pandas, this is easily done with groupby().mean(). Is there a similar widget or feature I'm overlooking in Orange?

Topic: orange data dataset data-cleaning data-mining

Category: Data Science
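
For reference, the pandas operation mentioned in the question above can be sketched as follows. This shows only the `groupby().mean()` step on made-up data with assumed column names `x` and `y`; it says nothing about which Orange widget, if any, does the same.

```python
import pandas as pd

# Made-up data: several y measurements at each x value
df = pd.DataFrame({
    "x": [1, 1, 2, 2, 2, 3],
    "y": [2.0, 4.0, 1.0, 2.0, 3.0, 5.0],
})

# One row per distinct x, holding the mean of y at that x
mean_y = df.groupby("x", as_index=False)["y"].mean()
print(mean_y)
```

Plotting `mean_y` instead of `df` gives one point per x value rather than the full scatter.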

How to prepare Audio-text data for speech recognition

johnyc

May 8, 2022, 16:05

I have gathered some raw audio from all the conferences, meetings, lectures, and casual conversations that I was part of. The machine transcription (from Azure, AWS, etc.) did not offer good results. I would transcribe it so as to have both data and labels (audio + text) for ML training. My question is whether to have small (3-10 sec.) audio files (split at silence) and then transcribe each small file, or a large file with timestamps in subtitle.srt format. What if I have a long …

Topic: speech-to-text dataset data-cleaning

Category: Data Science

How to improve regression neural network?

Darkstar Dream

May 8, 2022, 08:02

I am new to deep learning and data science and am trying to increase my knowledge by working on some hackathons. Currently, the hackathon project I am working on has the task of predicting the closing price of a cryptocurrency based on 48 parameters with ~1200 records. So far I have been able to achieve some good accuracy from the model, but my score is still very low. I have tried many things I know of, but it doesn't seem to be affecting the …

Topic: hyperparameter-tuning regression deep-learning neural-network data-cleaning

Category: Data Science

Downsampling audio files for use in Machine Learning

Finn Maunsell

May 7, 2022, 17:05

I'm trying to use the work (neural networks) done in this repo: https://github.com/jtkim-kaist/VAD It says this: "Note: To apply this toolkit to other speech data, the speech data should be sampled with 16kHz sampling frequency." I've got speech data at 48 kHz. I've read in places that reducing the sampling rate is a complicated process: you can't just remove every nth data point, you have to filter things... Is this necessary if I only intend to use the data in the Neural Network …

Topic: audio-recognition data-cleaning

Category: Data Science

Data preprocessing methods

canP

May 6, 2022, 04:11

Data cleaning; solving data imbalance (classification); data smoothing (decreasing noise); creating and deleting features from the original data; data transformation (Box-Cox, log transform); making the dataset stationary (time series); and other specific data preprocessing methods in NLP and computer vision (very specific ones). I am trying to research data preparation methods, and so far those are the things I could find. Do you think anything is missing? Thanks.

Topic: data preprocessing data-cleaning data-mining machine-learning

Category: Data Science

How to deal with highly skewed (on counts) dependent variables?

Rohit Gavval

May 6, 2022, 04:08

I am working on a binary classification problem, and the dataset consists of several count variables: for example, how many times a customer defaulted on a broadband bill payment in the last 3 months. The problem is that these features are highly skewed. This is what the distribution of the above variable looks like: 0.0 98.175855 1.0 1.275902 2.0 0.348707 3.0 0.199535 This is due to the nature of the event being evaluated during the construction of the …

Topic: data-wrangling statistics data-cleaning machine-learning

Category: Data Science

Should I Impute target values?

Bestname

May 6, 2022, 03:58

I am new to data science and am currently playing around a bit. Data exploration and preparation are really annoying, even though I use pandas. I managed to impute missing values in the independent variables: for numerical data I used the Imputer with the mean strategy, and for one categorical variable I used the LabelEncoder and afterwards imputed with the mode strategy. But now I face the issue that the dependent variable $y$ also contains missing values. Should I delete those lines …

Topic: data-imputation preprocessing regression data-cleaning machine-learning

Category: Data Science

PySpark: How do I specify dropna axis in PySpark transformation?

DataBach

May 5, 2022, 16:06

I would like to drop columns that contain only null values using dropna(). With Pandas you can do this by setting the keyword argument axis = 'columns' in dropna(). Here is an example in a GitHub post. How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an available keyword. Note: I do not want to transpose my dataframe for this to work. How would I drop the furniture column from …

Topic: pyspark python data-cleaning

Category: Data Science

How can we predict a value after several rows of data?

Aneeq

May 4, 2022, 18:48

I have a regression problem in which each week has several rows (the number varies, i.e. one week might have 1800 rows and another might have 5000 rows). My goal is to predict a value at the end of each week's data. Here's an example of what I need, where x is a feature and y is the target: Week 1 ; x1, x2, x3.. x90 Week 1 ; v1, v2, v3... v90 .... 100 more rows Week 1 ; …

Topic: multi-instance-learning aggregation time-series data-cleaning machine-learning

Category: Data Science

R code making 1 column into multiple columns with their unique ID

codingc0nfusions

May 3, 2022, 23:55

Currently stuck on a data wrangling question in R. So far I've tried variations of this code using tidyverse package, columns 5 and 6 here were the rating and the user: df[,5:6] %>% pivot_wider(names_from = question, values_from = rating, names_sep = ".") %>% unnest(cols = everything())-> df_reformat Each column will be the question ID and the rows are the scores for each user, ideally clustered by group. Data structure needed: repID user Customer question 1 Customer question 2 .... Customer …

Topic: dplyr data-wrangling data-formats data-cleaning r

Category: Data Science
