Merging two datasets with different features for machine learning prediction

I'm trying to build a model that predicts real estate prices with XGBoost. My question is: can I combine two datasets to do it? First dataset: 13 features. Second dataset: 100 features. The difference between the two datasets is that the first contains real estate transactions from 2018 to 2021 with features like area and region, and the second also contains transactions, but from 2011 to 2016 and with more features like …
Category: Data Science
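One common approach, assuming the 13 features of the first dataset are (roughly) a subset of the second's 100, is to keep only the shared columns and stack the rows. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical stand-ins for the two transaction datasets
df_2018_2021 = pd.DataFrame({"area": [50, 80], "region": ["N", "S"], "price": [100, 200]})
df_2011_2016 = pd.DataFrame({"area": [60], "region": ["N"], "rooms": [3], "price": [150]})

# Keep only the features both datasets share, then stack the rows
shared = df_2018_2021.columns.intersection(df_2011_2016.columns)
combined = pd.concat([df_2018_2021[shared], df_2011_2016[shared]], ignore_index=True)
print(combined.shape)  # (3, 3)
```

Because the two periods do not overlap, it usually also helps to keep the transaction year as a feature so the model can account for market drift between 2011–2016 and 2018–2021.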

Turning multiple binary columns into categorical (with less columns) with Python Pandas

I want to turn these categories into values of categorical columns. The values in each category are the current binary columns present in the data frame. We have: A11, A12, … are details of A1, so A11 == 1 necessarily implies A1 == 1, but the inverse does not hold. The following conditions must be respected: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be equal to 'A11' and we ignore 'A1' …
Category: Data Science
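A sketch of one way to do this, assuming (as in the question) that only the most specific indicator columns should become type values and parents like A1 are ignored; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical binary indicator frame: A11/A12 are details of A1, B11 of B1
df = pd.DataFrame({"A1": [1, 1], "A11": [1, 0], "A12": [0, 1], "B1": [1, 0], "B11": [1, 0]})
detail_cols = ["A11", "A12", "B11"]  # most specific indicators; parents are skipped

def to_types(row, max_types=4):
    # Collect the names of the active detail columns, capped at max_types
    names = [c for c in detail_cols if row[c] == 1]
    names = names[:max_types] + [None] * (max_types - len(names))
    return pd.Series(names, index=[f"type{i + 1}" for i in range(max_types)])

types = df.apply(to_types, axis=1)
print(types["type1"].tolist())  # ['A11', 'A12']
```

The result can be joined back onto the original frame and the binary columns dropped.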

While using reindex method on any dataframe why do original values go missing?

This is the original DataFrame: What I wanted: I wanted to convert the above DataFrame into this multi-indexed column DataFrame: I managed to do it with this piece of code: # tols : original dataframe cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']]) tols.set_axis(cols, axis=1, inplace=False) What I tried: I tried to do the same with the reindex method: cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']]) tols.reindex(cols, axis='columns') but it resulted in an output like this …
Category: Data Science
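The two methods do different things, which likely explains the missing values. `set_axis` assigns new labels positionally, keeping the data; `reindex` looks the new labels *up* among the existing ones and fills non-matches with NaN. A small sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame with 8 flat integer columns, to be relabelled with a 3-level MultiIndex
tols = pd.DataFrame(np.arange(16).reshape(2, 8))
cols = pd.MultiIndex.from_product([["A", "B"], ["Y", "X"], ["P", "Q"]])

# set_axis assigns the new labels positionally -- the data is kept
relabelled = tols.set_axis(cols, axis=1)

# reindex searches for the new labels among the existing ones (0..7);
# none of the tuples match, so every column comes back as NaN
missing = tols.reindex(cols, axis="columns")
print(relabelled.iloc[0, 0], missing.isna().all().all())
```

So when the goal is to relabel existing columns, `set_axis` (or assigning to `df.columns`) is the right tool; `reindex` is for conforming a frame to labels it already has.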

How to groupby and sum values of only one column based on value of another column

I have a dataset with the columns Category, Product, Launch_Year, and one column per year (2010, 2011, 2012, …) holding that year's sales of the product. The goal is to create another column, Launch_Sum, that calculates the sum for the Category (not the Product) in each Launch_Year: test = pd.DataFrame({ 'Category':['A','A','A','B','B','B'], 'Product':['item1','item2','item3','item4','item5','item6'], 'Launch_Year':[2010,2012,2010,2012,2010,2011], '2010':[25,0,27,0,10,0], '2011':[50,0,5,0,20,39], '2012':[30,40,44,20,30,42] }) Category Product Launch_Year 2010 2011 2012 Launch_Sum (to be created) A item1 2010 25 50 30 52 A item2 …
Topic: groupby pandas
Category: Data Science
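Reading the expected output (52 for item1 = category A's 2010 column summed), Launch_Sum appears to be the category's total sales in the product's own launch year. Under that assumption, one sketch:

```python
import pandas as pd

test = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "B"],
    "Product": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "Launch_Year": [2010, 2012, 2010, 2012, 2010, 2011],
    "2010": [25, 0, 27, 0, 10, 0],
    "2011": [50, 0, 5, 0, 20, 39],
    "2012": [30, 40, 44, 20, 30, 42],
})

# Total sales per (Category, year): sum each year column within the category
cat_totals = test.groupby("Category")[["2010", "2011", "2012"]].sum()

# For each row, look up its category's total in its own launch year
test["Launch_Sum"] = [
    cat_totals.loc[cat, str(year)]
    for cat, year in zip(test["Category"], test["Launch_Year"])
]
print(test.loc[0, "Launch_Sum"])  # 52
```

If instead only products launched in that year should count, group by `["Category", "Launch_Year"]` before summing.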

Azure Cloud SQL - Querying large number of rows with Python

I have a Python Flask application that connects to an Azure Cloud SQL Database, and uses the Pandas read_sql method with SQLAlchemy to perform a select operation on a table and load it into a dataframe. recordsdf = pd.read_sql(recordstable.select(), connection) The recordstable has around 5000 records, and the function is taking around 10 seconds to execute (I have to pull all records every time). However, the exact same operation with the same data takes around 0.5 seconds when I'm selecting …
Category: Data Science
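Two things worth trying, sketched below against a local SQLite stand-in (the table and columns are hypothetical): passing plain SQL text instead of a SQLAlchemy selectable removes one layer of overhead, and `chunksize` streams rows in batches rather than one large fetch. For a remote Azure database, network round trips and driver settings (e.g. ODBC packet size) are usually the dominant cost, so profiling the query outside pandas is also worthwhile.

```python
import sqlite3

import pandas as pd

# Local stand-in for the remote table: read_sql works the same over any DBAPI connection
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE records (id INTEGER, val REAL);"
    "INSERT INTO records VALUES (1, 1.5), (2, 2.5), (3, 3.5);"
)

# Plain SQL text, fetched in batches and concatenated once at the end
chunks = pd.read_sql("SELECT * FROM records", conn, chunksize=2)
recordsdf = pd.concat(chunks, ignore_index=True)
print(len(recordsdf))  # 3
```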

Extract all data of a month from different years

Ok, I had a typo in this question before which I have now corrected: my database (df_e) looks like this:

0,Country,Latitude,Longitude,Altitude,Date,H2,Year,month,dates,a_diffH,H2a
1,IN,28.58,77.2,212,1964-09-15,-57.6,1964,9,1964-09-15,-3.18,-54.42
2,IN,28.58,77.2,212,1963-09-15,-120.0,1963,9,1963-09-15,-3.18,-116.82
3,IN,28.58,77.2,212,1964-05-15,28.2,1964,5,1964-05-15,-3.18,31.38
...

and I would like to save the data from the 9th month of the years 1963 and 1964 into a new df. For this I use the command: df.loc[df_e['H2a'].isin(['1963-09-15', '1964-09-15'])] But the result is: Empty DataFrame Columns: [Country, Latitude, Longitude, Altitude, Date, H2, Year, month, dates, a_diffH, H2a] Index: [] Where is my mistake?
Category: Data Science
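The likely mistake is that `isin` is applied to `H2a`, which holds numeric values, not dates, so nothing matches. Filtering on the month and Year columns (or the Date column) does what's wanted; a sketch with a small stand-in frame:

```python
import pandas as pd

# Minimal stand-in for df_e with the relevant columns
df_e = pd.DataFrame({
    "Date": pd.to_datetime(["1964-09-15", "1963-09-15", "1964-05-15"]),
    "H2": [-57.6, -120.0, 28.2],
})
df_e["Year"] = df_e["Date"].dt.year
df_e["month"] = df_e["Date"].dt.month

# Select September rows from 1963 and 1964
sept = df_e[(df_e["month"] == 9) & (df_e["Year"].isin([1963, 1964]))]
print(len(sept))  # 2
```

This also generalises: dropping the `Year` condition extracts the 9th month across all years at once.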

How do I get the divided values of two columns that are a result from a groupby method

I currently have a dataframe that was made by the following example code: df.groupby(['col1', 'col2', 'Count'])[['Sum']].agg('sum') which looks like this:

col1 col2 Count Sum
DOG HUSKY 600 1500
CAT CALICO 200 3000
BIRD BLUE JAY 1500 4500

I would like to create a new column which outputs the division of df['Sum'] by df['Count']. The expected data frame would look like this:

col1 col2 Count Sum Average
DOG HUSKY 600 1500 2.5
CAT CALICO 200 3000 15
BIRD BLUE JAY 1500 …
Category: Data Science
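Element-wise division of two columns is a single vectorized expression. A sketch on a flat frame (if the groupby result still carries col1/col2/Count as index levels, call `.reset_index()` first):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["DOG", "CAT", "BIRD"],
    "col2": ["HUSKY", "CALICO", "BLUE JAY"],
    "Count": [600, 200, 1500],
    "Sum": [1500, 3000, 4500],
})

# Element-wise division of one column by another
df["Average"] = df["Sum"] / df["Count"]
print(df["Average"].tolist())  # [2.5, 15.0, 3.0]
```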

Predicting Customer Activity Absence

Could you please assist me with the following question? I have a customer activity dataframe that looks like this: It contains at least 500,000 customers and a "timeseries" of 42 months. The ones and zeroes represent customer activity: if a customer was active during a particular month there is a 1, if not, a 0. I need to determine the customers that most likely (with a probability) will not be active during the next 6 months (July–December 2018). Could you …
Category: Data Science

RAM crashed for XML to DataFrame conversion function

I have created the following function, which converts an XML file to a DataFrame. The function works well for files smaller than 1 GB; for anything larger the RAM crashes (13 GB Google Colab RAM). The same happens if I try it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code? Code #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …
Category: Data Science
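The usual fix is to stream the file with `iterparse` instead of loading the whole tree, freeing each element as soon as its row has been extracted. A sketch assuming a flat `<row>` record layout (the tag names are hypothetical):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

import pandas as pd

# Small in-memory stand-in; for a real file pass its path to iterparse
xml_bytes = BytesIO(b"<root><row><a>1</a></row><row><a>2</a></row></root>")

# Stream element by element: each completed <row> becomes a dict,
# then .clear() releases its memory before the next one is parsed
rows = []
for event, elem in ET.iterparse(xml_bytes, events=("end",)):
    if elem.tag == "row":
        rows.append({child.tag: child.text for child in elem})
        elem.clear()

df = pd.DataFrame(rows)
print(len(df))  # 2
```

Peak memory then scales with one record plus the accumulated rows, not with the size of the XML tree.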

How to match a word from column and compare with other column in pandas dataframe

I have the below dataframe:

Text | Keywords | Type
It's a roll-on tube | roll-on | ball
It is barrel | barrel | barr
An unknown shape | | others
it's a assembly | assembly | assembly
it's a sealing assembly | assembly | factory
its a roll-on double | roll-on | factory

I have first found the keywords, and based on each keyword and its corresponding type it should return true or false. For example, when the keyword is roll-on, the type should be "ball" or "others"; when the keyword is …
Category: Data Science
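One way to sketch this is a rule table mapping each keyword to its allowed types, then a row-wise membership check. The rules below are hypothetical, inferred from the one example given in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Keywords": ["roll-on", "barrel", "assembly"],
    "Type": ["ball", "barr", "factory"],
})

# Hypothetical rule table: which Type values are valid for each keyword
allowed = {"roll-on": {"ball", "others"}, "barrel": {"barr"}, "assembly": {"assembly"}}

# True when the row's Type is among the keyword's allowed types
df["valid"] = [t in allowed.get(k, set()) for k, t in zip(df["Keywords"], df["Type"])]
print(df["valid"].tolist())  # [True, True, False]
```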

IterativeImputer Evaluation

I am having a hard time evaluating my imputation model. I used an iterative imputer to fill in the missing values in all four columns. As the estimator inside the iterative imputer I am using a random forest; here is my code for imputing: imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0) imp_mean.fit(my_data) my_data_filled= pd.DataFrame(imp_mean.transform(my_data)) my_data_filled.head() My problem is how to evaluate the model. How can I know whether the filled values are right? I used a describe function before …
Category: Data Science
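A standard way to evaluate an imputer when the true missing values are unknown is to hide some values you *do* know, impute them, and score the reconstruction. A sketch on synthetic data (here with the default BayesianRidge estimator for speed; the same masking idea applies with a random forest):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with a known relationship between the columns
rng = np.random.default_rng(0)
full = pd.DataFrame({"x": rng.normal(size=200)})
full["y"] = 2 * full["x"] + rng.normal(scale=0.1, size=200)

# Hide some values we actually know
masked = full.copy()
hidden = masked.sample(20, random_state=0).index
masked.loc[hidden, "y"] = np.nan

# Impute, then measure the error only on the hidden cells
imp = IterativeImputer(random_state=0)
filled = pd.DataFrame(imp.fit_transform(masked), columns=full.columns, index=full.index)
rmse = np.sqrt(((filled.loc[hidden, "y"] - full.loc[hidden, "y"]) ** 2).mean())
```

Repeating this over several random masks gives a distribution of errors rather than a single number.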

Applying a matching function for string and substring with missing values on a python dataframe

I have programmed the following functionality: the function returns True when the two strings match position by position except for a "*" value, and False when they differ in at least one character.

def matching(row1, row2):
    string = row1['number']
    sub_string = row2['number']
    flag = True
    i = 0
    if len(string) == len(sub_string):
        while i < len(string) and flag == True:
            if string[i] != "*" and sub_string[i] != "*":
                if string[i] != sub_string[i]:
                    flag = False
            i += 1
    else:
        flag = False
    return flag

Assuming I have a …
Topic: pandas python
Category: Data Science
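The loop above can be written more compactly. A sketch that takes the two strings directly rather than rows (adapting the signature is then a matter of passing `row1['number']` and `row2['number']`):

```python
def matching(string: str, sub_string: str) -> bool:
    """True when the strings have equal length and agree at every
    position where neither side has the '*' wildcard."""
    return len(string) == len(sub_string) and all(
        a == "*" or b == "*" or a == b for a, b in zip(string, sub_string)
    )

print(matching("12*4", "1234"), matching("1234", "1244"))  # True False
```

For pairing every row of one frame against every row of another, applying this over a cross join (`df1.merge(df2, how="cross")`) keeps the logic in one place.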

Need help on Time Series ARIMA Model

I'm working on forecasting daily volumes and have used a time series model to check the data for stationarity. However, I'm struggling to forecast the data with 90% accuracy. Right now the variation is extremely high and I'm just unable to bring it down. I've used a log transform on my data. Please find the link to the folder below, which contains the ipynb and csv files: https://drive.google.com/drive/folders/1QUJkTucLPIf2vjo2mRmoBU6be083dYpQ?usp=sharing Any help will be highly appreciated. Thanks, Rahul
Category: Data Science

Transitioning from a python script for data transformation to BigQuery

So I have a dataset spread over multiple and ever-growing excel files, all of which look like:

email order_ID order_date
[email protected] 1234 23-Mar-2021
[email protected] 1235 23-Mar-2021
[email protected] 1236 23-Mar-2021
[email protected] 1237 24-Mar-2021
[email protected] 1238 28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; emails are traded for user_IDs for anonymity, and returning customers are marked for further analyses):

user_ID order_ID order_date is_returning?
1 1234 23-Mar-2021 0
2 1235 23-Mar-2021 0
2 1236 23-Mar-2021 1
1 …
Category: Data Science
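In pandas the anonymisation step is essentially `factorize` plus `duplicated`; the same logic maps to BigQuery with `DENSE_RANK()` and a window count. A sketch with made-up emails (column name `is_returning` is assumed from the desired output):

```python
import pandas as pd

orders = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "b@x.com", "a@x.com"],
    "order_ID": [1234, 1235, 1236, 1237],
})

# Replace emails with anonymous sequential user IDs (first-seen order)
orders["user_ID"] = pd.factorize(orders["email"])[0] + 1

# A returning order is any order after the user's first one
orders["is_returning"] = orders.duplicated("email").astype(int)

# Private lookup table mapping user_ID back to email
users = orders[["user_ID", "email"]].drop_duplicates()
print(orders["user_ID"].tolist(), orders["is_returning"].tolist())
```

Persisting the `users` lookup table is what keeps the IDs stable as new excel files arrive.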

How to deal with errors of defining data types in pandas' read_csv ()?

I have a table with 118,000 rows and 80 columns. I would like to select 8 columns from the table. I am reading the file with the pandas pd.read_csv function: df = pd.read_csv(filename, header=None, sep='|', usecols=[1,3,4,5,37,40,51,76]) I would like to change the data type of each column inside read_csv using dtype={'5': np.float, '37': np.float, ....}, but this does not work. There is a message that column 5 has mixed types. The command print(df.dtypes) shows all columns of …
Topic: pandas python
Category: Data Science
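Part of the problem is likely the dtype keys: with `header=None` the column labels are the integers 0, 1, 2, …, so string keys like `'5'` match nothing. And a column reported as "mixed types" cannot be forced to float; reading it as `str` first, then cleaning, is the usual route. A small sketch:

```python
import io

import pandas as pd

# Stand-in for the pipe-separated file; the second row has a non-numeric cell
data = io.StringIO("1|a|3.5\n2|b|oops\n")

# With header=None the columns are the ints 0, 1, 2 -- dtype keys must be ints,
# and the mixed-type column is read as str so nothing is silently coerced
df = pd.read_csv(data, header=None, sep="|", usecols=[0, 2], dtype={0: int, 2: str})
print(df.dtypes.tolist())
```

After loading, `pd.to_numeric(df[2], errors="coerce")` converts the clean values and turns the bad ones into NaN for inspection.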

How to return the number of values that has a specific count

I would like to find how many values in a column occur a specific number of times. For example, based on the data frame below, I want to find how many values in the ID column are repeated exactly twice:

| ID |
| -------- |
| 000001 |
| 000001 |
| 000002 |
| 000002 |
| 000002 |
| 000003 |
| 000003 |

The output should look something like this: Number of ID's repeated twice: 2 The ID's …
Category: Data Science
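This is a "count of counts": `value_counts` tallies each ID, and a comparison over that tally counts how many IDs hit the target frequency. A sketch with the sample data:

```python
import pandas as pd

ids = pd.Series(["000001", "000001", "000002", "000002", "000002", "000003", "000003"])

# Count each ID, then count how many IDs occur exactly twice
counts = ids.value_counts()
repeated_twice = (counts == 2).sum()
print(f"Number of ID's repeated twice: {repeated_twice}")  # 2
```

`counts[counts == 2].index` additionally lists which IDs those are.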

Python: calculate the weighted average correlation coefficient

I am calculating the volatility (standard deviation) of returns of a portfolio of assets using the variance-covariance approach. Correlation coefficients and asset volatilities have been estimated from historical returns. Now what I'd like to do is compute the average correlation coefficient, that is the common correlation coefficient between all asset pairs that gives me the same overall portfolio volatility. I could of course take an iterative approach, but was wondering if there was something simpler / out of the box …
Category: Data Science
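There is a closed-form answer, so no iteration is needed. Writing the portfolio variance as the own-variance part plus a common correlation times the cross part, sigma_p^2 = sum(w_i^2 s_i^2) + rho_bar * [ (sum w_i s_i)^2 - sum(w_i^2 s_i^2) ], and solving for rho_bar gives the average correlation directly. A sketch with a hypothetical 3-asset portfolio:

```python
import numpy as np

# Hypothetical portfolio: weights, volatilities, correlation matrix
w = np.array([0.5, 0.3, 0.2])
vol = np.array([0.20, 0.15, 0.10])
corr = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.4], [0.1, 0.4, 1.0]])

cov = corr * np.outer(vol, vol)
port_var = w @ cov @ w

# Solve port_var = own + rho_bar * cross for rho_bar
own = np.sum((w * vol) ** 2)                       # sum of w_i^2 s_i^2
cross = np.sum(np.outer(w * vol, w * vol)) - own   # (sum w_i s_i)^2 - own
rho_bar = (port_var - own) / cross

# Sanity check: a flat correlation matrix with rho_bar reproduces the variance
flat = np.full((3, 3), rho_bar)
np.fill_diagonal(flat, 1.0)
assert np.isclose(w @ (flat * np.outer(vol, vol)) @ w, port_var)
```

Note this is the weighted average with weights w_i w_j s_i s_j over the off-diagonal pairs, which is exactly the "same overall portfolio volatility" definition in the question.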

Ordering a material science dataset (properties names, properties scalars, formulas)

I'm dealing with a materials science dataset and I'm in the following situation. The data is organized like this:

Chemical_Formula Property_name Property_Scalar
He Electrical conduc. 1
NO_2 Resistance 50
CuO3 Hardness ...
... ... ...
CuO3 Fluorescence 300
He Toxicity 39
NO2 Hardness 80
... ... ...

As you can see it is really messy, because the same chemical formula appears more than once in the dataset, each time referring to a different property. My question is, …
Category: Data Science
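This long format pivots naturally into one row per formula with one column per property. A sketch on a few of the sample rows (note the sample also mixes spellings like NO_2 and NO2, which are worth normalising before pivoting):

```python
import pandas as pd

df = pd.DataFrame({
    "Chemical_Formula": ["He", "NO2", "CuO3", "He", "NO2"],
    "Property_name": ["Electrical conduc.", "Resistance", "Fluorescence",
                      "Toxicity", "Hardness"],
    "Property_Scalar": [1, 50, 300, 39, 80],
})

# One row per formula, one column per property; missing combinations become NaN
wide = df.pivot_table(index="Chemical_Formula", columns="Property_name",
                      values="Property_Scalar")
print(wide.loc["He", "Toxicity"])  # 39.0
```

`pivot_table` averages duplicates by default; if duplicate (formula, property) pairs should instead be an error, use `df.pivot(...)`, which raises on them.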

Fill the null values in the dataframe with condition

traindf[traindf['Gender'] == 'female']['Age'].fillna(value=femage, inplace=True) I've tried to update the null values in the Age column of the dataframe with mean values. Here I tried to replace the null values in the Age column for the female gender with the female mean age, but the column doesn't get updated. Why?
Topic: pandas python
Category: Data Science
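The column doesn't update because the chained indexing `traindf[...]['Age']` produces a temporary copy, so `fillna(..., inplace=True)` fills that copy and discards it. Selecting the target cells with a single `.loc` and assigning back avoids this. A sketch with a tiny stand-in frame:

```python
import numpy as np
import pandas as pd

traindf = pd.DataFrame({
    "Gender": ["female", "female", "male"],
    "Age": [30.0, np.nan, np.nan],
})

# Female mean age, then fill only the missing female Age cells via .loc
femage = traindf.loc[traindf["Gender"] == "female", "Age"].mean()
mask = (traindf["Gender"] == "female") & traindf["Age"].isna()
traindf.loc[mask, "Age"] = femage
print(traindf["Age"].tolist())  # [30.0, 30.0, nan]
```

The same pattern repeated with a male mask and male mean fills the remaining nulls.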

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.