How to preprocess an ordered categorical variable to feed a machine learning algorithm?

I have a categorical variable that measures the income of a family: A: no income; B: up to $500; C: $500-$700; …; P: $5000-$6000; Q: more than $6000. It seems odd to me that I have to get dummies for this variable, since it's ordered. I wonder if it's better to map the values: {'A': 0, 'B': 1, …, 'Q': 16} so I can feed these values to the algorithm as integers. What's the proper way of preprocessing …
Category: Data Science
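
The integer mapping described in the question above can be sketched in a few lines of Python. This is only an illustration, assuming the labels run through every letter from 'A' to 'Q' in increasing income order; the names ORDERED_LEVELS, LEVEL_TO_RANK and encode_income are made up for the example. With zero-based ranks and 17 labels, 'Q' maps to 16.

```python
import string

# Assumption: bracket labels are the 17 letters 'A'..'Q', lowest income first.
ORDERED_LEVELS = list(string.ascii_uppercase[:17])

# Zero-based rank of each label: {'A': 0, 'B': 1, ..., 'Q': 16}
LEVEL_TO_RANK = {label: rank for rank, label in enumerate(ORDERED_LEVELS)}


def encode_income(labels):
    """Replace each bracket label with its integer rank.

    An unknown label raises KeyError instead of being silently mapped.
    """
    return [LEVEL_TO_RANK[label] for label in labels]


encoded = encode_income(["A", "C", "Q", "B"])
print(encoded)
```

Integer ranks keep the order of the brackets but treat every step between neighbouring brackets as equal, which may or may not suit the model being trained.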

Giving each person in order their top choice which is still available in Google Sheets

The problem I want to solve is my residential building's garage choices. There will be a random distribution of parking spaces. I thought that it would be better if each person writes down which spaces they want in order of preference, and then their priority of picking a parking slot is randomized. For instance: Person a chooses: p3, p5, p1, p2, p4 Person b chooses: p3, p1, p2, p4, p5 Person c chooses: p1, p3, p2, p5, p4 Person d …
Category: Data Science
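
The procedure described above (shuffle the picking order, then give each person their highest-ranked space that is still free) can be sketched as follows. This is a Python illustration rather than a Google Sheets formula, and it uses only the three complete preference lists from the excerpt; the function name assign_spaces and the seed parameter are assumptions made for the example.

```python
import random


def assign_spaces(preferences, seed=None):
    """Give each person, in random order, their top choice that is still free.

    preferences maps a person to a list of spaces, most preferred first.
    Returns the picking order and the resulting person -> space assignment.
    """
    rng = random.Random(seed)      # a seed makes the draw reproducible
    order = list(preferences)
    rng.shuffle(order)             # randomize who picks first

    taken = set()
    assignment = {}
    for person in order:
        for space in preferences[person]:
            if space not in taken:
                assignment[person] = space
                taken.add(space)
                break              # stop at the first free choice
    return order, assignment


# Preference lists for persons a, b and c as given in the question.
preferences = {
    "a": ["p3", "p5", "p1", "p2", "p4"],
    "b": ["p3", "p1", "p2", "p4", "p5"],
    "c": ["p1", "p3", "p2", "p5", "p4"],
}

order, assignment = assign_spaces(preferences, seed=0)
print("picking order:", order)
print("assignment:", assignment)
```

Whoever is drawn first always receives their first choice; later pickers fall back to the next space on their list that nobody has taken.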

How to deal with highly skewed (on counts) dependent variables?

I am working on a binary classification problem and the dataset consists of several variables which are count variables. For example, how many times a customer defaulted on a broadband bill payment in the last 3 months. The problem is, these features are highly skewed. This is what the distribution of the above variable looks like: 0.0: 98.175855; 1.0: 1.275902; 2.0: 0.348707; 3.0: 0.199535. This is due to the nature of the event being evaluated during the construction of the …
Category: Data Science

R code making 1 column into multiple columns with their unique ID

Currently stuck on a data wrangling question in R. So far I've tried variations of this code using tidyverse package, columns 5 and 6 here were the rating and the user: df[,5:6] %>% pivot_wider(names_from = question, values_from = rating, names_sep = ".") %>% unnest(cols = everything())-> df_reformat Each column will be the question ID and the rows are the scores for each user, ideally clustered by group. Data structure needed: repID user Customer question 1 Customer question 2 .... Customer …
Category: Data Science

Filter for top 10 highest values of group in dataset (in R)

Context: I am trying to find the top 10 highest values of count in my data frame conditional on them falling within the years 1970-1979. My data frame looks as below: id lemma year count 1 word1 1970 737 2 word2 1971 767 3 word3 1972 988 etc... Attempt: #1970s df_n_maxcount_1970s <- df_n %>% filter(year < 1980) %>% slice_max(count, n=40) #1990s df_n_maxcount_1980s <- df_n %>% filter(year == 1980:1989) %>% slice_max(count, n=40) This has worked pretty well, but there's a level …
Category: Data Science

Data Wrangling and data cleaning

I found some information about data wrangling, and the sources say different things. In this one (link), they say data cleaning is a subcategory of data wrangling. In this PDF (link), data wrangling is EDA and model building. Can someone explain to me what data wrangling is? What is the difference between data cleaning and data wrangling? If possible, can you add a reference?
Category: Data Science

Group_by 2 variables and pivot_wider distribution based on 2 others

Performing some calculations on a dataframe and stuck trying to calculate a few percentages. Trying to append 3 additional columns for %POS/NEG/NEU, e.g., the sum of the Amount column for all observations with POS Direction for both Drew and A, divided by the total sum of all amounts for Drew.
Name  Rating  Amount  Price  Rate  Type   Direction
Drew  A       455     99.54  4.5   white  POS
Drew  A       655     88.44  5.3   white  NEG
Drew  B       454     54.43  3.4   blue   NEU
Drew  B       654     33.54  5.4   …
Category: Data Science

Joining of Technical replicates with experimental data

I have a task in which I need to join data collected from non-destructive biological sensor analyses with data collected from various microbiological "wet-lab" methods, e.g. colony counting, on the observation/sample names, which represent various environmental conditions, for the purposes of generating machine learning models for the prediction of microbiological status based on the aforementioned sensor output. However, I am considering how to proceed with dealing with technical duplicates/repeats, i.e. additional plates from the same biological sample, re-runs/re-evaluation of samples …
Category: Data Science

Compare multiple values from a DataFrame against a single row from another

I'm trying to compare address values for inaccuracies, for example, given multiple records like:
Reference Apartment Address PostCode
AS097 NaN 00 Name Road BH1 4HB
AS097 Flat 1 Building Name 00 Name Road BH1 4HB
AS097 Flat 2 Building Name 00 Name Rd BH1 4HB
AS097 Flat 3 Building Name 00 Name Road BH1 4HB
AS097 Flat 4 Building Name 00 Name Road BH1 4HB
AS097 Flat 5 Building Name 00 Name Road BH1 4HX
HO056 NaN 23 Street Road …
Category: Data Science

What should I do with the NaN values on this stock quote data?

I concatenated 3 stock quote data-frames all with date-time indexes. However, they differ in starting dates so the resulting data-frame contains NaN values for the stock quotes with more recent starting dates. Should I just drop the rows with NaN and start the new data frame with the row where all have values or is there a way to fill them up? I'm planning on using the data to train a neural network that predicts future stock quotes.
Category: Data Science

Advantages to combining similarly-named columns for supervised ML?

Is there any benefit to combining similarly named columns, either for an improvement in accuracy or for speeding up training/prediction, in the case of logistic regression, random forest or neural network models? I have seen this done at times but wasn't sure if there was more than a heuristically motivated reason for doing it. E.g., converting this:
name  col1  col2  col3  time
gina  5     12    20    30
john  6     7     43    40
to this:
name  (col1,col2,col3)  time
gina  (5,12,20)         30
john  (6,7,43)          …
Category: Data Science

Data cleaning in Pandas, where the csv file has all data of each row in 1 field

I have really messy data: all the data in each row is contained in 1 column, separated by semicolons. How do I arrange this data so that it is spread out over more columns? For example, category_id, category_id_lvl_0, etc. should be separate columns, and the semicolon-separated values in the rows underneath should fall under the corresponding columns category_id, category_id_lvl_0...
Category: Data Science

What is a good way to handle nominal spatial data with a changing number of categories to use in prediction model?

For a project I'm going to be working with spatial data with a nominal attribute (land use). Every year the number of categories for this attribute changes because categories split or merge. I do have access to a chart that shows me how the categories are transformed into each other from one year to another. For the same spatial extent, I also have data for a bunch of other variables. I want to use these as explanatory variables for the …
Category: Data Science

Data wrangling dates

I have a feature with data creation dates. I have normalized them all to the same format and split them into 'day', 'month' and 'year' columns. But now I have a question. Should I apply normalization or standardization to these columns, or does this not make sense for dates?
Category: Data Science

Data wrangling for a big set of docx files advice!

I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a week solid taking different approaches and nothing seems to be quite perfect. Just FYI, this is my first big (for me anyway) data science project, so I'm really in need of some wisdom on the best way to approach it. Essentially I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into …
Category: Data Science

Export pandas to dictionary by combining multiple row values

I have a pandas dataframe df that looks like this:
name  value1  value2
A     123     1
B     345     5
C     712     4
B     768     2
A     318     9
C     178     6
A     321     3
I want to convert this into a dictionary with name as the key and, as the value, a list of dictionaries (value1 as key and value2 as value) for all rows that share that name. So, the output would look like this: { 'A': [{'123':1}, {'318':9}, {'321':3}], 'B': [{'345':5}, {'768':2}], 'C': [{'712':4}, …
Category: Data Science
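
The grouping requested above can be sketched without pandas by iterating over the rows as plain tuples. This is one possible approach, not necessarily the one an answer on the original page proposed; the rows list copies the sample data from the question, and group_rows is a hypothetical helper name. With a real DataFrame, `df.itertuples(index=False, name=None)` should yield tuples of this shape.

```python
from collections import defaultdict

# Sample rows (name, value1, value2) copied from the question.
rows = [
    ("A", 123, 1),
    ("B", 345, 5),
    ("C", 712, 4),
    ("B", 768, 2),
    ("A", 318, 9),
    ("C", 178, 6),
    ("A", 321, 3),
]


def group_rows(rows):
    """Collect one {str(value1): value2} dict per row under each name."""
    grouped = defaultdict(list)
    for name, value1, value2 in rows:
        # The desired output uses string keys, so value1 is converted.
        grouped[name].append({str(value1): value2})
    return dict(grouped)


result = group_rows(rows)
print(result["A"])
print(result["B"])
```

Rows keep their original order within each name, which matches the ordering of the 'A' and 'B' lists shown in the question.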

Mean across every several rows in pandas

I have a table of features and labels where each row has a time stamp. Labels are categorical. They go in a batch where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same label in one batch is always the same. In the example below, every three rows have the same label. I would like to get a new table where Var 1 and Var 2 …
Category: Data Science

When to choose character instead of factor in R?

I am currently working on a dataset which contains a name attribute, which stands for a person's first name. After reading the csv file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not reflect any group membership, I am unsure whether to leave it as a factor. Is it necessary to convert name to character? Are there any advantages in doing (or not doing) this? Does it even matter?
Category: Data Science

Similar values cleaning

Does anyone know an algorithm to identify account names that are similar enough to be potentially merged and imported as one? Duplicates with different values: (Geico, val1, NaN) and (Geico, NaN, val2) =====>> (Geico, val1, val2). Similar or almost exact names: Geico, Gaico
Category: Data Science
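
One simple way to flag near-identical names like those in the question above is a string-similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.75 threshold, the function name similar_pairs and the extra name "Acme Corp" are assumptions chosen for illustration, and other similarity measures may suit real account data better.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.75):
    """Return (name1, name2, score) for every pair scoring at or above threshold.

    The score is SequenceMatcher.ratio() on lower-cased names, between 0 and 1.
    """
    pairs = []
    for left, right in combinations(names, 2):
        score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
        if score >= threshold:
            pairs.append((left, right, round(score, 2)))
    return pairs


# "Geico" and "Gaico" come from the question; "Acme Corp" is a made-up name.
candidates = similar_pairs(["Geico", "Gaico", "Acme Corp"])
print(candidates)
```

Pairs above the threshold are only candidates for merging; a manual review step is assumed before any records are actually combined.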
