data-wrangling

How to preprocess an ordered categorical variable to feed a machine learning algorithm?

marcus

June 4, 2022 22:00

I have a categorical variable that measures the income of a family: A: no income; B: up to $500; C: $500-$700; …; P: $5000-$6000; Q: more than $6000. It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values: {'A': 0, 'B': 1, …, 'Q': 17} so I can feed these values into the algorithm as integers. What's the proper way of preprocessing …

Topic: data-wrangling preprocessing dataset machine-learning

Category: Data Science
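
A minimal sketch of the integer mapping described in the question above, in plain Python. It assumes the labels are the contiguous letters A through Q (the question elides the middle ones), in which case zero-based codes run from 0 to 16. The names `levels`, `ordinal_code` and `encode` are illustrative and do not come from the question.

```python
import string

# Assumption: labels are the contiguous letters A..Q, ordered by income.
levels = list(string.ascii_uppercase[:17])  # 'A' .. 'Q'

# Map each label to its rank in the ordering.
ordinal_code = {label: rank for rank, label in enumerate(levels)}

def encode(values):
    """Replace each category label with its integer rank."""
    return [ordinal_code[v] for v in values]

encoded = encode(["A", "C", "Q"])
```

Integer codes treat the step between neighbouring brackets as equal, which may or may not suit brackets of unequal width; whether that matters depends on the algorithm.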

Giving each person in order their top choice which is still available in Google Sheets

Heleno Paiva

May 24, 2022 04:05

The problem I want to solve is my residential building's garage choices. There will be a random distribution of parking spaces. I thought that it would be better if each person writes down which spaces they want in order of preference, and then their priority of picking a parking slot is randomized. For instance: Person a chooses: p3, p5, p1, p2, p4 Person b chooses: p3, p1, p2, p4, p5 Person c chooses: p1, p3, p2, p5, p4 Person d …

Topic: data-wrangling excel classification

Category: Data Science
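
The allocation rule described in the question above (shuffle the picking order, then give each person their highest-ranked space that is still free) can be sketched as follows. This is plain Python rather than Google Sheets, and only persons a, b and c are included because person d's list is truncated in the excerpt. The function name `assign_spaces` is an illustrative choice.

```python
import random

# Preferences copied from the question for persons a, b and c
# (person d's list is truncated there, so it is left out).
preferences = {
    "a": ["p3", "p5", "p1", "p2", "p4"],
    "b": ["p3", "p1", "p2", "p4", "p5"],
    "c": ["p1", "p3", "p2", "p5", "p4"],
}

def assign_spaces(preferences, order):
    """Give each person, in the given order, their best still-free space."""
    taken = set()
    assignment = {}
    for person in order:
        for space in preferences[person]:
            if space not in taken:
                assignment[person] = space
                taken.add(space)
                break
    return assignment

# Randomise the picking priority, as the question proposes.
order = list(preferences)
random.shuffle(order)
result = assign_spaces(preferences, order)
```

With the fixed order a, b, c, person a gets p3, b falls back to p1, and c (whose first two choices are gone) gets p2.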

How to deal with highly skewed (on counts) dependent variables?

Rohit Gavval

May 6, 2022 04:08

I am working on a binary classification problem and the dataset contains several count variables, for example, how many times a customer defaulted on a broadband bill payment in the last 3 months. The problem is that these features are highly skewed. This is what the distribution of the above variable looks like: 0.0: 98.175855; 1.0: 1.275902; 2.0: 0.348707; 3.0: 0.199535. This is due to the nature of the event being evaluated during the construction of the …

Topic: data-wrangling statistics data-cleaning machine-learning

Category: Data Science

R code making 1 column into multiple columns with their unique ID

codingc0nfusions

May 3, 2022 23:55

Currently stuck on a data wrangling question in R. So far I've tried variations of this code using the tidyverse package (columns 5 and 6 here were the rating and the user): df[,5:6] %>% pivot_wider(names_from = question, values_from = rating, names_sep = ".") %>% unnest(cols = everything()) -> df_reformat Each column will be the question ID and the rows are the scores for each user, ideally clustered by group. Data structure needed: repID user Customer question 1 Customer question 2 .... Customer …

Topic: dplyr data-wrangling data-formats data-cleaning r

Category: Data Science

Filter for top 10 highest values of group in dataset (in R)

n.baes

April 25, 2022 04:07

Context: I am trying to find the top 10 highest values of count in my data frame, conditional on them falling within the years 1970-1979. My data frame looks as below: id lemma year count 1 word1 1970 737 2 word2 1971 767 3 word3 1972 988 etc... Attempt: #1970s df_n_maxcount_1970s <- df_n %>% filter(year < 1980) %>% slice_max(count, n=40) #1980s df_n_maxcount_1980s <- df_n %>% filter(year == 1980:1989) %>% slice_max(count, n=40) This has worked pretty well, but there's a level …

Topic: data-wrangling r

Category: Data Science

Data Wrangling and data cleaning

Inuraghe

April 20, 2022 13:06

I found some information about data wrangling, and the sources say different things. In this one, they say data cleaning is a subcategory of data wrangling (link). In this PDF, data wrangling is EDA and model building (link). Can someone explain to me what data wrangling is? What is the difference between data cleaning and data wrangling? If possible, can you add a reference?

Topic: data-wrangling data-cleaning

Category: Data Science

Group_by 2 variables and pivot_wider distribution based on 2 others

DataGuy23

March 17, 2022 10:00

Performing some calculations on a dataframe and stuck trying to calculate a few percentages. Trying to append 3 additional columns for %POS/NEG/NEU. E.g., the sum of the amount col for all observations w/ POS Direction in both Drew & A / total sum of all amounts for Drew. Name Rating Amount Price Rate Type Direction Drew A 455 99.54 4.5 white POS Drew A 655 88.44 5.3 white NEG Drew B 454 54.43 3.4 blue NEU Drew B 654 33.54 5.4 …

Topic: groupby dplyr data-wrangling data-cleaning r

Category: Data Science

Joining of Technical replicates with experimental data

LongStreak

March 12, 2022 18:03

I have a task in which I need to join data collected from non-destructive biological sensor analyses with data collected from various microbiological "wet-lab" methods, e.g. colony counting, on the observation/sample names, which represent various environmental conditions, for the purposes of generating machine learning models for the prediction of microbiological status based on the aforementioned sensor output. However, I am considering how to proceed with dealing with technical duplicates/repeats, i.e. additional plates from the same biological sample, re-runs/re-evaluation of samples …

Topic: data-leakage machine-learning-model data-wrangling machine-learning

Category: Data Science

Compare multiple values from a DataFrame against a single row from another

Ricardo Sanchez

February 25, 2022 01:04

I'm trying to compare address values for inaccuracies, for example, given multiple records like: Reference Apartment Address PostCode AS097 NaN 00 Name Road BH1 4HB AS097 Flat 1 Building Name 00 Name Road BH1 4HB AS097 Flat 2 Building Name 00 Name Rd BH1 4HB AS097 Flat 3 Building Name 00 Name Road BH1 4HB AS097 Flat 4 Building Name 00 Name Road BH1 4HB AS097 Flat 5 Building Name 00 Name Road BH1 4HX HO056 NaN 23 Street Road …

Topic: data-wrangling pandas python

Category: Data Science

What should I do with the NaN values on this stock quote data?

Yoyong

February 12, 2022 07:03

I concatenated 3 stock quote data-frames all with date-time indexes. However, they differ in starting dates so the resulting data-frame contains NaN values for the stock quotes with more recent starting dates. Should I just drop the rows with NaN and start the new data frame with the row where all have values or is there a way to fill them up? I'm planning on using the data to train a neural network that predicts future stock quotes.

Topic: data-wrangling data time-series

Category: Data Science
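
One of the two options raised in the question above, dropping the leading rows until every stock has a quote, can be sketched in plain Python. The rows below are hypothetical toy values made up for illustration (the question shows no data), and `trim_to_common_start` is an illustrative name. Whether trimming or filling is the better choice is left open here.

```python
import math

# Hypothetical toy rows: (date, quote_a, quote_b, quote_c). NaN marks
# dates before a stock's series starts. Values are made up for illustration.
rows = [
    ("2020-01-01", 10.0, math.nan, math.nan),
    ("2020-01-02", 10.5, 20.0, math.nan),
    ("2020-01-03", 10.7, 20.4, 30.0),
    ("2020-01-04", 10.6, 20.1, 30.5),
]

def trim_to_common_start(rows):
    """Drop leading rows until the first row where every quote is present."""
    for i, (_, *quotes) in enumerate(rows):
        if not any(math.isnan(q) for q in quotes):
            return rows[i:]
    return []

trimmed = trim_to_common_start(rows)
```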

Advantages to combining similarly-named columns for supervised ML?

v81

October 14, 2021 09:19

Is there any benefit to combining similarly named columns either for an improvement in accuracy or for speeding up training/prediction in case of logistic regression, random forest or neural network models? I have seen this done at times but wasn't sure if there was more than a heuristically-motivated reason for doing it. eg. Converting this: name col1 col2 col3 time gina 5 12 20 30 john 6 7 43 40 to this: name (col1,col2,col3) time gina (5,12,20) 30 john (6,7,43) …

Topic: data-wrangling supervised-learning accuracy

Category: Data Science

Data cleaning in Pandas, where the csv file has all data of each row in 1 field

PlatinumMaths

June 1, 2021 18:24

I have really messy data that looks like this: As you can see, all the data in each row is contained in 1 column, separated by semicolons. How do I arrange this data so that it is spread out over more columns? For example, category_id, category_id_lvl_0 etc. should be in separate columns, and the values in the rows underneath, i.e., the ones separated by semicolons, should fall under the corresponding columns category_id, category_id_lvl_0...

Topic: data-wrangling data-cleaning

Category: Data Science

What is a good way to handle nominal spatial data with a changing number of categories to use in prediction model?

Nander Vilar Castellar

April 23, 2021 10:45

For a project I'm going to be working with spatial data with a nominal attribute (land use). Every year the number of categories for this attribute changes because categories split or merge. I do have access to a chart that shows me how the categories are transformed into each other from one year to another. For the same spatial extent, I also have data for a bunch of other variables. I want to use these as explanatory variables for the …

Topic: data-wrangling geospatial predictive-modeling

Category: Data Science

Data wrangling dates

Luiscri

April 13, 2021 08:06

I have a feature with data creation dates. I have normalized them all to the same format and split them into 'day', 'month' and 'year' columns. But now I have a question: should I apply normalization or standardization to these columns, or does that not make sense for dates?

Topic: data-wrangling data-cleaning

Category: Data Science

Data wrangling for a big set of docx files advice!

mess1n

March 17, 2021 15:28

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm really in need of some wisdom on the best way to approach it. Essentially I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into …

Topic: similar-documents data-wrangling python

Category: Data Science

Export pandas to dictionary by combining multiple row values

sfactor

March 14, 2021 09:41

I have a pandas dataframe df that looks like this: name value1 value2 A 123 1 B 345 5 C 712 4 B 768 2 A 318 9 C 178 6 A 321 3 I want to convert this into a dictionary with name as the key and a list of dictionaries (value1 as key and value2 as value) for all rows that share that name. So, the output would look like this: { 'A': [{'123':1}, {'318':9}, {'321':3}], 'B': [{'345':5}, {'768':2}], 'C': [{'712':4}, …

Topic: data-wrangling pandas python

Category: Data Science
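
The grouping asked for in the question above can be sketched in plain Python. This assumes the rows are available as tuples rather than as a pandas dataframe, so it illustrates the target structure, not a pandas-specific solution. The names `rows`, `grouped` and `result` are illustrative.

```python
from collections import defaultdict

# Rows from the question as (name, value1, value2) tuples.
rows = [
    ("A", 123, 1), ("B", 345, 5), ("C", 712, 4), ("B", 768, 2),
    ("A", 318, 9), ("C", 178, 6), ("A", 321, 3),
]

grouped = defaultdict(list)
for name, value1, value2 in rows:
    # value1 becomes a string key, matching the quoted keys in the question.
    grouped[name].append({str(value1): value2})

result = dict(grouped)
```

Entries within each list keep the original row order, which matches the order shown for 'A' and 'B' in the question.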

What is the difference between 'if the data is of good quality' and 'if the data is tidy'?

user3508140

February 2, 2021 20:07

I'm doing the Data Analyst Nanodegree from Udacity. I'm still confused about the difference, even after going through the lecture a few times.

Topic: data-analysis data-wrangling data-cleaning

Category: Data Science

Mean across every several rows in pandas

Mirit

January 21, 2021 20:21

I have a table of features and labels where each row has a time stamp. Labels are categorical. They go in a batch where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same label in one batch is always the same. In the example below, every three rows has the same label. I would like to get a new table where Var 1 and Var 2 …

Topic: data-table data-wrangling sql pandas python

Category: Data Science

When to choose character instead of factor in R?

lupi5

December 13, 2020 08:41

I am currently working on a dataset that contains a name attribute, which holds a person's first name. After reading the csv file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not reflect any group membership, I am unsure whether to leave it as a factor. Is it necessary to convert name to character? Are there advantages to doing (or not doing) this? Does it even matter?

Topic: data-wrangling r

Category: Data Science

Similar values cleaning

miro_muras

December 3, 2020 10:31

Does anyone know an algorithm to identify account names that are similar enough to be potentially merged and imported as one? Duplicates with different values: "Geico val1 NaN" and "Geico NaN val2" =====>> "Geico val1 val2". Similar or almost exact: Geico, Gaico.

Topic: data-wrangling data pandas

Category: Data Science
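
For the "similar or almost exact" case in the question above, one possible starting point is the standard library's `difflib.SequenceMatcher`. The sketch below is one option among many string-similarity measures; the 0.75 threshold and the extra name "Allstate" are assumptions added for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similar_pairs(names, threshold=0.75):
    """Return pairs of distinct names whose similarity meets the threshold."""
    unique = sorted(set(names))
    return [
        (a, b) for a, b in combinations(unique, 2)
        if similarity(a, b) >= threshold
    ]

# "Allstate" is a hypothetical extra name, not from the question.
pairs = similar_pairs(["Geico", "Gaico", "Geico", "Allstate"])
```

Here "Geico" and "Gaico" share four of five characters in order, so they are flagged as a candidate pair; any flagged pair would still need a manual or rule-based check before merging.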
