Turning multiple binary columns into categorical ones (with fewer columns) with Python Pandas

I want to turn these categories into values of categorical columns. The values in each category are the current binary columns present in the data frame. We have: A11, A12, … is a detail of A1, so a value of A11 == 1 necessarily implies A1 == 1, but the inverse does not hold. The following conditions must be respected: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be equal to 'A11' and we ignore 'A1' …
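A minimal sketch of one way to do this, assuming hypothetical detail columns A11/A12/A13 and the 4-type limit from the question: a row-wise apply collects the detail flags that are set (ignoring the parent A1) and spreads them into type1..type4.

```python
import pandas as pd

# Hypothetical frame: A1 is the parent flag, A11/A12/A13 are its details.
df = pd.DataFrame({'A1':  [1, 1, 0],
                   'A11': [1, 0, 0],
                   'A12': [1, 1, 0],
                   'A13': [0, 1, 0]})

detail_cols = ['A11', 'A12', 'A13']   # assumed naming scheme
max_types = 4                         # at most 4 type columns per the question

# For each row, list the detail columns that are set, ignoring the parent A1.
types = df[detail_cols].apply(
    lambda row: [c for c in detail_cols if row[c] == 1], axis=1)

# Spread the lists into type1..type4 columns (missing slots become None).
for i in range(max_types):
    df[f'type{i+1}'] = types.apply(lambda lst: lst[i] if i < len(lst) else None)
```

The detail columns could then be dropped, leaving only the (at most 4) categorical type columns.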
Category: Data Science

Why do original values go missing when using the reindex method on a dataframe?

This is the original Dataframe: What I wanted: I wanted to convert the above data frame into this multi-indexed-column data frame: I managed to do it with this piece of code: # tols : original dataframe cols = pd.MultiIndex.from_product([['A','B'],['Y','X'],['P','Q']]) tols.set_axis(cols, axis = 1, inplace = False) What I tried: I tried to do this with the reindex method like this: cols = pd.MultiIndex.from_product([['A','B'],['Y','X'], ['P','Q']]) tols.reindex(cols, axis = 'columns') It resulted in an output like this …
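The difference can be sketched on a toy frame (the real `tols` is not shown, so the data here is made up): set_axis relabels the existing columns, while reindex aligns the new labels against the old ones, and since no old flat label matches any MultiIndex tuple, every value becomes NaN.

```python
import numpy as np
import pandas as pd

# Toy frame with 8 flat integer-labelled columns, standing in for `tols`.
tols = pd.DataFrame(np.arange(16).reshape(2, 8))

cols = pd.MultiIndex.from_product([['A', 'B'], ['Y', 'X'], ['P', 'Q']])

# set_axis RELABELS the existing columns, keeping the data.
relabelled = tols.set_axis(cols, axis=1)

# reindex ALIGNS on labels: none of the new tuples exist among the old
# columns, so every value comes back NaN.
realigned = tols.reindex(cols, axis='columns')
```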
Category: Data Science

Predicting Customer Activity Absence

Could you please assist me with the following question? I have a customer activity dataframe that looks like this: It contains at least 500,000 customers and a "time series" of 42 months. The ones and zeroes represent customer activity: if a customer was active during a particular month there will be a 1, if not a 0. I need to determine those customers that most likely (with a probability) will not be active during the next 6 months (July-December 2018). Could you …
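One common way to frame this, sketched on random stand-in data (the real frame is not shown): treat an earlier window of months as features, label each customer by inactivity over a held-out 6-month window, and start from a simple recency score before reaching for a classifier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical activity matrix: rows = customers, columns = 42 months (1 = active).
activity = pd.DataFrame(rng.integers(0, 2, size=(5, 42)),
                        columns=[f'm{i+1}' for i in range(42)])

# Supervised framing: features from months 1-36,
# label = "inactive for all of months 37-42".
X = activity.iloc[:, :36]
y = (activity.iloc[:, 36:].sum(axis=1) == 0).astype(int)

# A simple hand-rolled probability proxy before fitting any model:
# fraction of inactive months in the most recent half-year of features.
score = 1 - X.iloc[:, -6:].mean(axis=1)
```

A logistic regression or gradient-boosted classifier fitted on (X, y) would then give calibrated per-customer probabilities via predict_proba.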
Category: Data Science

RAM crashes in an XML-to-DataFrame conversion function

I have created the following function, which converts an XML file to a DataFrame. This function works well for files smaller than 1 GB; for anything larger the RAM (13 GB of Google Colab RAM) crashes. The same happens if I try it locally in a Jupyter Notebook (4 GB of laptop RAM). Is there a way to optimize the code? Code #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …
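A memory-flat alternative, assuming a hypothetical <root><row>…</row></root> layout (the real file's structure is not shown): xml.etree's iterparse streams elements as they complete, and clear() frees each one after it is consumed, so only one record lives in memory at a time.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

# Stand-in for a large file on disk; iterparse accepts any file-like object.
xml_data = io.BytesIO(
    b"<root><row><a>1</a><b>x</b></row><row><a>2</a><b>y</b></row></root>")

records = []
for event, elem in ET.iterparse(xml_data, events=('end',)):
    if elem.tag == 'row':
        records.append({child.tag: child.text for child in elem})
        elem.clear()              # drop the subtree we just consumed

df = pd.DataFrame(records)
```

Building the full ElementTree (ET.parse) holds the entire document in memory, which is what exhausts RAM on multi-GB files.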
Category: Data Science

How to match a word from column and compare with other column in pandas dataframe

I have the dataframe below Text Keywords Type It's a roll-on tube roll-on ball It is barrel barrel barr An unknown shape others it's a assembly assembly assembly it's a sealing assembly assembly factory its a roll-on double roll-on factory I have first found the keywords, and based on each keyword and its corresponding type, the result should be true or false. For example, when the keyword is roll-on, the type should be "ball" or "others"; when the keyword is …
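One way to sketch the check, with a made-up rule table mapping each keyword to its allowed types (the question's full rules are truncated): a row-wise membership test against the rule table.

```python
import pandas as pd

# Frame mirroring the question's Keywords/Type columns (Text omitted).
df = pd.DataFrame({'Keywords': ['roll-on', 'barrel', 'assembly', 'roll-on'],
                   'Type':     ['ball',    'barr',   'factory',  'factory']})

# Assumed rule table: which types are valid for each keyword.
allowed = {'roll-on':  {'ball', 'others'},
           'barrel':   {'barr'},
           'assembly': {'assembly'}}

# Row-wise check: is this row's Type in the allowed set for its Keyword?
df['valid'] = df.apply(
    lambda r: r['Type'] in allowed.get(r['Keywords'], set()), axis=1)
```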
Category: Data Science

Dataframe Python - Conditional Column based on multiple criteria

I want someone to correct my Python code. My goal is to write code that adds a new column based on a conditional function of two columns in my dataframe. I want to add a fees column, a numeric column whose value differs based on whether success is True or False and on the PSP column as well. Note that the data types are as below: success = boolean PSP = …
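A hedged sketch with np.select, using invented PSP names and fee values as placeholders for the real schedule: each condition pairs one (PSP, success) combination with its fee.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the two columns the condition needs.
df = pd.DataFrame({'success': [True, False, True, False],
                   'PSP':     ['Moneycard', 'Moneycard', 'UK_Card', 'UK_Card']})

# Hypothetical fee schedule: one rate per (PSP, success) combination.
conditions = [
    (df['PSP'] == 'Moneycard') & df['success'],
    (df['PSP'] == 'Moneycard') & ~df['success'],
    (df['PSP'] == 'UK_Card')   & df['success'],
    (df['PSP'] == 'UK_Card')   & ~df['success'],
]
fees = [5.0, 2.0, 3.0, 1.0]

# First matching condition wins; default catches unknown PSPs.
df['fees'] = np.select(conditions, fees, default=0.0)
```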
Category: Data Science

How to find the count of consecutive same string values in a pandas dataframe?

Assume that we have the following pandas dataframe: df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]}) input: col1 col2 start 0 A>G TCT 1000 1 C>T ACA 2000 2 C>T TCA 3000 3 G>T TCA 4000 4 C>T GCT 5000 5 A>G ACT 6000 6 A>G CTG 10000 7 A>G ATG 20000 8 C>A TCT 10000 9 C>T ACA 2000 10 C>T TCA 3000 11 C>T TCA 4000 What I want to get is the number of consecutive values in col1 and …
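The usual idiom: compare col1 with its shifted self so each run of equal consecutive values gets its own id, then group on that id. Sketched on the eight rows from the question's constructor:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A>G', 'C>T', 'C>T', 'G>T',
                            'C>T', 'A>G', 'A>G', 'A>G']})

# A run breaks wherever the value differs from the previous row;
# the cumulative sum of those breaks gives each run its own id.
run_id = (df['col1'] != df['col1'].shift()).cumsum()

# One output row per run: the run's value and its length.
runs = (df.groupby(run_id)['col1']
          .agg(value='first', count='size')
          .reset_index(drop=True))
```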
Category: Data Science

ML methods for vector correlation

I am dealing with a time series consisting of input flow sampled every 5 minutes over 441 days. My aim is to find any possible correlation in data coming from: the same day of the week; the same moment in time. I proceeded to sample by weekday and by hour. Then I computed the 63x63 correlation matrix for each of the weekdays and a 441x441 one for each hour, which in the second case is pretty impractical. I feel like this way …
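One way to make the day-vs-day comparison tractable, sketched on random stand-in data: pivot the series into one row per day and one column per 5-minute slot, then correlate whole days against each other instead of hour by hour.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical flow series: one reading every 5 minutes (288/day) for 21 days.
idx = pd.date_range('2023-01-01', periods=288 * 21, freq='5min')
flow = pd.Series(rng.random(len(idx)), index=idx)

# Reshape to one row per day, one column per time-of-day slot,
# then correlate days against each other (a 21x21 matrix here).
daily = flow.groupby([flow.index.date, flow.index.time]).first().unstack()
day_corr = daily.T.corr()
```

Grouping `daily` by weekday before correlating would give the same-day-of-week comparison directly.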
Category: Data Science

How do I read a .dat file whose structure I don't know?

Is there any way to at least read the text from the .dat file? I have its corresponding .mdf file, so I know what data and columns are in it. How do I figure out the contents of my .dat file? All I am currently getting is gibberish, even when opening it in binary mode. from asammdf import MDF dat_file = r"C:\Users\HPO2KOR\Desktop\Work\data1.dat" mdf_file = r"C:\Users\HPO2KOR\Desktop\Work\data1.mdf" mdf = MDF(mdf_file) df = mdf.to_dataframe() df.head() which …
Category: Data Science

Is there a way to make the window in df.rolling dynamic depending on which row it is calculating for?

I have a dataset of stock prices, and I want to add a column of 52-week lows for each day; however, for the rows which don't have 365 days above them, I just want the column to hold the rolling min over whatever rows do exist above. I was trying code like this, but it obviously doesn't work because it creates the column twice. for row in data.iterrows(): if row[0] < (data.index[0] + timedelta(days = 365)): data['52wkLow'] = …
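pandas can do this without iterrows: a time-based rolling window ('365D') shrinks automatically near the start of the series, and min_periods=1 lets those partial windows still produce a value. Sketched on random stand-in prices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical daily close prices over two years, on a DatetimeIndex.
idx = pd.date_range('2020-01-01', periods=730, freq='D')
data = pd.DataFrame({'Close': rng.uniform(50, 150, len(idx))}, index=idx)

# '365D' is a calendar window, so early rows use however much history exists;
# min_periods=1 keeps those partial windows from producing NaN.
data['52wkLow'] = data['Close'].rolling('365D', min_periods=1).min()
```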
Category: Data Science

ValueError: The truth value of a Series is ambiguous, after applying an if/else condition in Pandas data frames

I want to create a new variable called lower for the dataframe details after iterating through multiple data frames. list1 is a list of string values of a column named variable_name in details. vars_df is another data frame with 2 columns, namely variable_name and direction. Both columns contain string values. vars_df.shape = (19,2). Some values of variable_name in vars_df are present in list1 as well as in data_set. data_set.shape = (32,107), df.shape = (96,1). The following code, which aims to …
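The error means a whole boolean Series was used where Python expects a single True/False (e.g. `if some_series == 'x':`). A vectorized sketch on made-up stand-in frames (the real ones are not shown): merge the direction in, then branch element-wise with np.where instead of if/else.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for `details` and `vars_df`.
details = pd.DataFrame({'variable_name': ['age', 'income', 'score'],
                        'value':         [10, 20, 30]})
vars_df = pd.DataFrame({'variable_name': ['age', 'score'],
                        'direction':     ['lower', 'upper']})

# `if details['variable_name'] == 'age':` raises the ambiguity error because
# the comparison yields a Series, not one bool. Align rows, then branch
# element-wise instead:
merged = details.merge(vars_df, on='variable_name', how='left')
merged['lower'] = np.where(merged['direction'] == 'lower',
                           merged['value'], np.nan)
```

`.any()` / `.all()` are the right tools when a single bool over the whole Series really is intended.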
Category: Data Science

Add ID information from one dataframe to every row in another dataframe without a common key

I am trying to combine two dataframes into the second one, duplicating the first dataframe into every row of dataframe two. The first dataframe (df1) contains the identifying information, and the other dataframe (df2) contains data about that identifying information. They are in different dataframes since they come from different files. I am sure there is an easy way to do this, but I can't seem to find it. The basic problem is that there are no unique identifiers on df2 …
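With no shared key, a cross join (pandas >= 1.2) pairs every row of one frame with every row of the other; when df1 holds a single identifying row, that stamps its fields onto each df2 row. A sketch with invented columns:

```python
import pandas as pd

# df1 holds one row of identifying info; df2 holds the measurements.
df1 = pd.DataFrame({'site': ['S1'], 'operator': ['Ann']})
df2 = pd.DataFrame({'reading': [1.2, 3.4, 5.6]})

# how='cross' needs no key columns at all.
combined = df2.merge(df1, how='cross')
```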
Category: Data Science

Create new rows based on a value in a column

My dataset is generated like the example df = {'event':['A','B','C','D'], 'budget':['123','433','1000','1299'], 'duration_days':['6','3','4','2']} I need to create rows for each event based on the column 'duration_days'; if duration = 6 the event should have 6 rows: event budget duration_days A 123 6 A 123 6 A 123 6 A 123 6 A 123 6 A 123 6 B 433 3 B 433 3 B 433 3
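A vectorized sketch: Index.repeat duplicates each row label duration_days times, and .loc materialises the repeated rows. Note the durations are strings in the example dict, so they need casting first.

```python
import pandas as pd

df = pd.DataFrame({'event': ['A', 'B', 'C', 'D'],
                   'budget': ['123', '433', '1000', '1299'],
                   'duration_days': ['6', '3', '4', '2']})

# Values are strings in the source dict; repeat() needs integers.
n = df['duration_days'].astype(int)

# Each row label is repeated n times; loc expands them into full rows.
out = df.loc[df.index.repeat(n)].reset_index(drop=True)
```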
Category: Data Science

Replacing rows of dataframe with rows of another dataframe that have the same index

I have a dataframe that has rows with indices 0 to 128 and a smaller dataframe with indices 4, 8, 105, and 107. I made edits to the rows in the smaller dataframe and am now trying to replace rows indexed 4, 8, 105, and 107 in the large dataframe with rows indexed 4, 8, 105, and 107 in the smaller dataframe. Why can I not just do: bigDF[smallDF.index] = smallDF How would I accomplish this replacement? Thank you!
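The bracket form fails because `bigDF[smallDF.index]` is column selection, so pandas looks for columns named 4, 8, 105, 107. `.loc` selects and aligns rows by label instead. A minimal sketch with a single made-up column:

```python
import numpy as np
import pandas as pd

bigDF = pd.DataFrame({'x': np.arange(129)})            # indices 0..128
smallDF = pd.DataFrame({'x': [-4, -8, -105, -107]},    # "edited" rows
                       index=[4, 8, 105, 107])

# Row-label assignment: values align on both index and columns.
bigDF.loc[smallDF.index] = smallDF
```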
Category: Data Science

How to combine and separate test and train data for data cleaning?

I am working on an ML model for which I have been provided the data in 2 files, test.csv and train.csv. I want to perform data cleaning on both files together by concatenating them and then separating them. I know how to concatenate 2 dataframes, but after data cleaning how will I separate the two files? Please help me complete the code. CODE test = pd.read_csv('test.csv') train = pd.read_csv('train.csv') df = pd.concat([test, train]) //Data Cleaning steps //Separating them back to …
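One way to sketch it: pass keys= to concat so each block keeps a tag in the index, then slice the tags back out after cleaning, with no row counting. Tiny stand-in frames replace the CSV reads here.

```python
import pandas as pd

# Stand-ins for pd.read_csv('train.csv') / pd.read_csv('test.csv').
train = pd.DataFrame({'a': [1, 2, 3]})
test = pd.DataFrame({'a': [4, 5]})

# keys= tags each block in an outer index level.
df = pd.concat([train, test], keys=['train', 'test'])

# ... shared cleaning steps on df ...

# Split back by tag once cleaning is done.
train_clean = df.loc['train'].reset_index(drop=True)
test_clean = df.loc['test'].reset_index(drop=True)
```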
Category: Data Science

lookup and fill some value from one dataframe to another

I have 2 dataframes, df1 and df2, as below. df1 and df2 I would like to look up "result" in df1 and fill it into df2 by "Mode", in the format below. Note that "Mode" becomes my column names and the results are filled into the corresponding columns. Also note that "ID" from df2 may not necessarily equal "ID" from df1. For example, I am only interested in 4 IDs (A01, A03, A04 and A05, no A02) while df1 may contain more IDs. I tried to …
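A sketch with invented IDs, modes, and results: pivot df1 so Mode values become columns, then left-merge onto df2 so only df2's IDs survive (extra df1 IDs drop out, df2-only IDs get NaN).

```python
import pandas as pd

# Hypothetical long-format source: one row per (ID, Mode) with a result.
df1 = pd.DataFrame({'ID': ['A01', 'A01', 'A02', 'A03'],
                    'Mode': ['M1', 'M2', 'M1', 'M2'],
                    'result': [10, 20, 30, 40]})
df2 = pd.DataFrame({'ID': ['A01', 'A03', 'A04']})   # no A02, extra A04

# Pivot Mode values into columns, then keep only df2's IDs via a left merge.
wide = df1.pivot(index='ID', columns='Mode', values='result').reset_index()
out = df2.merge(wide, on='ID', how='left')
```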
Category: Data Science

How can I merge two datasets with similar words in Python?

For instance, I have a row value on dataset_1: "Entity" = Apple, and on dataset_2: "Entity" = iCloud Apple (Entity is the column). I need to merge one dataset into the other by the Entity column, but to do that they need to have exactly the same value, and Apple ≠ iCloud Apple. Both datasets are huge, so I can't do this manually, one by one.
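Exact merges need a fuzzy-matching step first. A sketch using the standard library's difflib (thefuzz or rapidfuzz scale better on large data); the entities, the 0.5 cutoff, and the helper name are all illustrative:

```python
import difflib
import pandas as pd

# Made-up datasets standing in for the two huge ones.
ds1 = pd.DataFrame({'Entity': ['Apple', 'Google', 'Netflix']})
ds2 = pd.DataFrame({'Entity': ['iCloud Apple', 'Google Cloud', 'Spotify'],
                    'value': [1, 2, 3]})

# Map each ds2 entity to its closest ds1 entity; the cutoff rejects
# low-similarity pairings rather than forcing a bad match.
def closest(name, candidates, cutoff=0.5):
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

ds2['Entity_key'] = ds2['Entity'].apply(
    lambda n: closest(n, ds1['Entity'].tolist()))

# Now an exact merge works on the matched key.
merged = ds1.merge(ds2, left_on='Entity', right_on='Entity_key')
```

The cutoff needs tuning per dataset; too low forces spurious matches, too high drops real ones.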
Category: Data Science

Plot multiple time series from single dataframe

I have a dataframe with multiple time series and columns with labels. My goal is to plot all time series in a single plot, where the labels should be used in the legend of the plot. The important point is that the x-data of the time series do not match each other, only their ranges roughly do. See this example: import pandas as pd import matplotlib.pyplot as plt df = pd.DataFrame([[1, 2, "A", "A"], [2, 3, "A", "A"], [3, 1, …
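A sketch with a made-up long-format frame: group by the label column and draw each group as its own line, so the mismatched x-grids never need aligning (the Agg backend keeps it headless).

```python
import matplotlib
matplotlib.use('Agg')             # headless backend for scripts/CI
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format frame: x, y, and one label column per series.
df = pd.DataFrame({'x': [1, 2, 3, 1.5, 2.5],
                   'y': [2, 3, 1, 4, 2],
                   'label': ['A', 'A', 'A', 'B', 'B']})

# One line per label; each group carries its own x-data.
fig, ax = plt.subplots()
for label, grp in df.groupby('label'):
    ax.plot(grp['x'], grp['y'], label=label)
ax.legend()
```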
Category: Data Science

Identify the existence of a wall like structure in a given int array

The idea is that I will query an API endpoint which will return an array consisting of a price value and a quantity value [price, quantity]. In this dataset there is a high possibility of structures of values where there is a sudden increase in quantity for a given price range compared to the rest of its surroundings, basically a wall-like structure. The example below shows a range of price values and a quantity value as a heatmap. …
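A first-pass sketch on invented quantities: flag price levels whose quantity is an outlier versus the overall distribution via a z-score threshold; a rolling/local baseline would be the natural refinement for "compared to its surroundings".

```python
import numpy as np

# Hypothetical order-book slice: one quantity per price level, with a
# "wall" (a run of unusually large quantities) in the middle.
quantity = np.array([3, 4, 2, 5, 3, 40, 45, 42, 4, 3, 2], dtype=float)

# Standardise, then flag levels sitting well above the mean.
z = (quantity - quantity.mean()) / quantity.std()
wall = np.flatnonzero(z > 1.0)        # indices of the wall levels
```

The 1.0 threshold is arbitrary here; on real data it would be tuned, or replaced by a rolling median/MAD comparison against a local neighbourhood.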
Category: Data Science

Should hexadecimal addresses of a dataset be cleaned?

I am working on fraud detection on blockchains. To be more specific, I fetched a large number of transactions that took place on the blockchain, labeled them as spam / non-spam using an appropriate API, and now I will train a model to detect fraud using SVM, etc. My question is about the preparation of the data. The fields I have are: hash, nonce, transaction_index, from_address, to_address, … The "from/to_address" fields are hexadecimal, like 0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd. My question …
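Rather than cleaning the hex strings as text, one common option (sketched on made-up addresses) is to treat each address as an entity and encode it, e.g. by its transaction frequency:

```python
import pandas as pd

# Hypothetical transactions: raw hex addresses are near-unique IDs, so they
# carry signal as entities, not as text to be parsed.
tx = pd.DataFrame({'from_address': ['0xaa', '0xbb', '0xaa', '0xaa', '0xcc']})

# Frequency encoding: replace each address by how often it appears.
tx['from_addr_freq'] = tx['from_address'].map(tx['from_address'].value_counts())
```

Other per-address aggregates (total value sent, spam rate of counterparties) follow the same map/groupby pattern.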
Category: Data Science
