Merging two datasets with different features for machine learning prediction

I'm trying to build a model that predicts real estate prices with XGBoost. My question is: can I combine two datasets to do it? First dataset: 13 features. Second dataset: 100 features. The difference between the two datasets is that the first contains real estate transactions from 2018 to 2021 with features like area and region, and the second also contains transactions, but from 2011 to 2016 and with more features like …
Category: Data Science
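One common approach, assuming the 13 features of the first dataset are (roughly) a subset of the second's 100, is to keep only the shared columns and stack the rows. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical stand-ins for the two transaction datasets
df_2018_2021 = pd.DataFrame({"area": [50, 80], "region": ["N", "S"], "price": [100, 200]})
df_2011_2016 = pd.DataFrame({"area": [60], "region": ["N"], "rooms": [3], "price": [150]})

# Keep only the features both datasets share, then stack the rows
shared = df_2018_2021.columns.intersection(df_2011_2016.columns)
combined = pd.concat([df_2018_2021[shared], df_2011_2016[shared]], ignore_index=True)
print(combined.shape)  # (3, 3)
```

Because the two periods do not overlap, it usually also helps to keep the transaction year as a feature so the model can account for market drift between 2011–2016 and 2018–2021.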

Turning multiple binary columns into categorical (with less columns) with Python Pandas

I want to turn these categories into values of categorical columns. The values in each category are the current binary columns present in the data frame. We have: A11, A12, … are details of A1, so A11 == 1 necessarily implies A1 == 1, but the inverse does not hold. The following conditions must be respected: the maximum number of existing types is 4; if A11 == 1, the value of type1 should be equal to 'A11' and we ignore 'A1' …
Category: Data Science
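A sketch of one way to do this, assuming (as in the question) that only the most specific indicator columns should become type values and parents like A1 are ignored; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical binary indicator frame: A11/A12 are details of A1, B11 of B1
df = pd.DataFrame({"A1": [1, 1], "A11": [1, 0], "A12": [0, 1], "B1": [1, 0], "B11": [1, 0]})
detail_cols = ["A11", "A12", "B11"]  # most specific indicators; parents are skipped

def to_types(row, max_types=4):
    # Collect the names of the active detail columns, capped at max_types
    names = [c for c in detail_cols if row[c] == 1]
    names = names[:max_types] + [None] * (max_types - len(names))
    return pd.Series(names, index=[f"type{i + 1}" for i in range(max_types)])

types = df.apply(to_types, axis=1)
print(types["type1"].tolist())  # ['A11', 'A12']
```

The result can be joined back onto the original frame and the binary columns dropped.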

While using reindex method on any dataframe why do original values go missing?

This is the original DataFrame: What I wanted: I wanted to convert the above DataFrame into this multi-indexed column DataFrame: I managed to do it with this piece of code: # tols : original dataframe cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']]) tols.set_axis(cols, axis=1, inplace=False) What I tried: I tried to do the same with the reindex method: cols = pd.MultiIndex.from_product([['A','B'], ['Y','X'], ['P','Q']]) tols.reindex(cols, axis='columns') but it resulted in an output like this …
Category: Data Science
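The two methods do different things, which likely explains the missing values. `set_axis` assigns new labels positionally, keeping the data; `reindex` looks the new labels *up* among the existing ones and fills non-matches with NaN. A small sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame with 8 flat integer columns, to be relabelled with a 3-level MultiIndex
tols = pd.DataFrame(np.arange(16).reshape(2, 8))
cols = pd.MultiIndex.from_product([["A", "B"], ["Y", "X"], ["P", "Q"]])

# set_axis assigns the new labels positionally -- the data is kept
relabelled = tols.set_axis(cols, axis=1)

# reindex searches for the new labels among the existing ones (0..7);
# none of the tuples match, so every column comes back as NaN
missing = tols.reindex(cols, axis="columns")
print(relabelled.iloc[0, 0], missing.isna().all().all())
```

So when the goal is to relabel existing columns, `set_axis` (or assigning to `df.columns`) is the right tool; `reindex` is for conforming a frame to labels it already has.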

How to groupby and sum values of only one column based on value of another column

I have a dataset with the columns Category, Product, Launch_Year, and one column per year (2010, 2011, 2012, …) holding that year's sales of the product. The goal is to create another column, Launch_Sum, that calculates the sum for the Category (not the Product) in each Launch_Year: test = pd.DataFrame({ 'Category':['A','A','A','B','B','B'], 'Product':['item1','item2','item3','item4','item5','item6'], 'Launch_Year':[2010,2012,2010,2012,2010,2011], '2010':[25,0,27,0,10,0], '2011':[50,0,5,0,20,39], '2012':[30,40,44,20,30,42] }) Category Product Launch_Year 2010 2011 2012 Launch_Sum (to be created) A item1 2010 25 50 30 52 A item2 …
Topic: groupby pandas
Category: Data Science
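Reading the expected output (52 for item1 = category A's 2010 column summed), Launch_Sum appears to be the category's total sales in the product's own launch year. Under that assumption, one sketch:

```python
import pandas as pd

test = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "B"],
    "Product": ["item1", "item2", "item3", "item4", "item5", "item6"],
    "Launch_Year": [2010, 2012, 2010, 2012, 2010, 2011],
    "2010": [25, 0, 27, 0, 10, 0],
    "2011": [50, 0, 5, 0, 20, 39],
    "2012": [30, 40, 44, 20, 30, 42],
})

# Total sales per (Category, year): sum each year column within the category
cat_totals = test.groupby("Category")[["2010", "2011", "2012"]].sum()

# For each row, look up its category's total in its own launch year
test["Launch_Sum"] = [
    cat_totals.loc[cat, str(year)]
    for cat, year in zip(test["Category"], test["Launch_Year"])
]
print(test.loc[0, "Launch_Sum"])  # 52
```

If instead only products launched in that year should count, group by `["Category", "Launch_Year"]` before summing.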

Azure Cloud SQL - Querying large number of rows with Python

I have a Python Flask application that connects to an Azure Cloud SQL Database, and uses the Pandas read_sql method with SQLAlchemy to perform a select operation on a table and load it into a dataframe. recordsdf = pd.read_sql(recordstable.select(), connection) The recordstable has around 5000 records, and the function is taking around 10 seconds to execute (I have to pull all records every time). However, the exact same operation with the same data takes around 0.5 seconds when I'm selecting …
Category: Data Science
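Two things worth trying, sketched below against a local SQLite stand-in (the table and columns are hypothetical): passing plain SQL text instead of a SQLAlchemy selectable removes one layer of overhead, and `chunksize` streams rows in batches rather than one large fetch. For a remote Azure database, network round trips and driver settings (e.g. ODBC packet size) are usually the dominant cost, so profiling the query outside pandas is also worthwhile.

```python
import sqlite3

import pandas as pd

# Local stand-in for the remote table: read_sql works the same over any DBAPI connection
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE records (id INTEGER, val REAL);"
    "INSERT INTO records VALUES (1, 1.5), (2, 2.5), (3, 3.5);"
)

# Plain SQL text, fetched in batches and concatenated once at the end
chunks = pd.read_sql("SELECT * FROM records", conn, chunksize=2)
recordsdf = pd.concat(chunks, ignore_index=True)
print(len(recordsdf))  # 3
```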

Extract all data of a month from different years

Ok, I had a typo in this question before which I have now corrected: my database (df_e) looks like this:

0,Country,Latitude,Longitude,Altitude,Date,H2,Year,month,dates,a_diffH,H2a
1,IN,28.58,77.2,212,1964-09-15,-57.6,1964,9,1964-09-15,-3.18,-54.42
2,IN,28.58,77.2,212,1963-09-15,-120.0,1963,9,1963-09-15,-3.18,-116.82
3,IN,28.58,77.2,212,1964-05-15,28.2,1964,5,1964-05-15,-3.18,31.38
...

and I would like to save the data from the 9th month of the years 1963 and 1964 into a new df. For this I use the command: df.loc[df_e['H2a'].isin(['1963-09-15', '1964-09-15'])] But the result is: Empty DataFrame Columns: [Country, Latitude, Longitude, Altitude, Date, H2, Year, month, dates, a_diffH, H2a] Index: [] Where is my mistake?
Category: Data Science
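The likely mistake is that `isin` is applied to `H2a`, which holds numeric values, not dates, so nothing matches. Filtering on the month and Year columns (or the Date column) does what's wanted; a sketch with a small stand-in frame:

```python
import pandas as pd

# Minimal stand-in for df_e with the relevant columns
df_e = pd.DataFrame({
    "Date": pd.to_datetime(["1964-09-15", "1963-09-15", "1964-05-15"]),
    "H2": [-57.6, -120.0, 28.2],
})
df_e["Year"] = df_e["Date"].dt.year
df_e["month"] = df_e["Date"].dt.month

# Select September rows from 1963 and 1964
sept = df_e[(df_e["month"] == 9) & (df_e["Year"].isin([1963, 1964]))]
print(len(sept))  # 2
```

This also generalises: dropping the `Year` condition extracts the 9th month across all years at once.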

How do I get the divided values of two columns that are a result from a groupby method

I currently have a dataframe that was made by the following example code: df.groupby(['col1', 'col2', 'Count'])[['Sum']].agg('sum') which looks like this:

col1 col2 Count Sum
DOG HUSKY 600 1500
CAT CALICO 200 3000
BIRD BLUE JAY 1500 4500

I would like to create a new column which outputs the division of df['Sum'] by df['Count']. The expected data frame would look like this:

col1 col2 Count Sum Average
DOG HUSKY 600 1500 2.5
CAT CALICO 200 3000 15
BIRD BLUE JAY 1500 …
Category: Data Science
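Element-wise division of two columns is a single vectorized expression. A sketch on a flat frame (if the groupby result still carries col1/col2/Count as index levels, call `.reset_index()` first):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["DOG", "CAT", "BIRD"],
    "col2": ["HUSKY", "CALICO", "BLUE JAY"],
    "Count": [600, 200, 1500],
    "Sum": [1500, 3000, 4500],
})

# Element-wise division of one column by another
df["Average"] = df["Sum"] / df["Count"]
print(df["Average"].tolist())  # [2.5, 15.0, 3.0]
```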

Predicting Customer Activity Absence

Could you please assist me with the following question? I have a customer activity dataframe that looks like this: It contains at least 500,000 customers and a "timeseries" of 42 months. The ones and zeroes represent customer activity: if a customer was active during a particular month there is a 1, if not, a 0. I need to determine the customers that most likely (with a probability) will not be active during the next 6 months (July–December 2018). Could you …
Category: Data Science

RAM crashed for XML to DataFrame conversion function

I have created the following function, which converts an XML file to a DataFrame. The function works well for files smaller than 1 GB; for anything larger the RAM crashes (13 GB Google Colab RAM). The same happens if I try it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code? Code #Libraries import pandas as pd import xml.etree.cElementTree as ET #Function to convert XML file to Pandas Dataframe def xml2df(file_path): #Parsing XML File and …
Category: Data Science
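The usual fix is to stream the file with `iterparse` instead of loading the whole tree, freeing each element as soon as its row has been extracted. A sketch assuming a flat `<row>` record layout (the tag names are hypothetical):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

import pandas as pd

# Small in-memory stand-in; for a real file pass its path to iterparse
xml_bytes = BytesIO(b"<root><row><a>1</a></row><row><a>2</a></row></root>")

# Stream element by element: each completed <row> becomes a dict,
# then .clear() releases its memory before the next one is parsed
rows = []
for event, elem in ET.iterparse(xml_bytes, events=("end",)):
    if elem.tag == "row":
        rows.append({child.tag: child.text for child in elem})
        elem.clear()

df = pd.DataFrame(rows)
print(len(df))  # 2
```

Peak memory then scales with one record plus the accumulated rows, not with the size of the XML tree.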

How to match a word from column and compare with other column in pandas dataframe

I have the below dataframe:

Text | Keywords | Type
It's a roll-on tube | roll-on | ball
It is barrel | barrel | barr
An unknown shape | | others
it's a assembly | assembly | assembly
it's a sealing assembly | assembly | factory
its a roll-on double | roll-on | factory

I have first found the keywords, and based on each keyword and its corresponding type it should return true or false. For example, when the keyword is roll-on, the type should be "ball" or "others"; when the keyword is …
Category: Data Science
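One way to sketch this is a rule table mapping each keyword to its allowed types, then a row-wise membership check. The rules below are hypothetical, inferred from the one example given in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Keywords": ["roll-on", "barrel", "assembly"],
    "Type": ["ball", "barr", "factory"],
})

# Hypothetical rule table: which Type values are valid for each keyword
allowed = {"roll-on": {"ball", "others"}, "barrel": {"barr"}, "assembly": {"assembly"}}

# True when the row's Type is among the keyword's allowed types
df["valid"] = [t in allowed.get(k, set()) for k, t in zip(df["Keywords"], df["Type"])]
print(df["valid"].tolist())  # [True, True, False]
```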

IterativeImputer Evaluation

I am having a hard time evaluating my imputation model. I used an iterative imputer to fill in the missing values in all four columns. As the estimator inside the iterative imputer I am using a random forest; here is my code for imputing: imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0) imp_mean.fit(my_data) my_data_filled= pd.DataFrame(imp_mean.transform(my_data)) my_data_filled.head() My problem is how to evaluate the model. How can I know whether the filled values are right? I used a describe function before …
Category: Data Science
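A standard way to evaluate an imputer when the true missing values are unknown is to hide some values you *do* know, impute them, and score the reconstruction. A sketch on synthetic data (here with the default BayesianRidge estimator for speed; the same masking idea applies with a random forest):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with a known relationship between the columns
rng = np.random.default_rng(0)
full = pd.DataFrame({"x": rng.normal(size=200)})
full["y"] = 2 * full["x"] + rng.normal(scale=0.1, size=200)

# Hide some values we actually know
masked = full.copy()
hidden = masked.sample(20, random_state=0).index
masked.loc[hidden, "y"] = np.nan

# Impute, then measure the error only on the hidden cells
imp = IterativeImputer(random_state=0)
filled = pd.DataFrame(imp.fit_transform(masked), columns=full.columns, index=full.index)
rmse = np.sqrt(((filled.loc[hidden, "y"] - full.loc[hidden, "y"]) ** 2).mean())
```

Repeating this over several random masks gives a distribution of errors rather than a single number.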

Applying a matching function for string and substring with missing values on a python dataframe

I have programmed the following functionality: the function returns True when the two strings match position by position except for a "*" value, and False when they differ in at least one character.

def matching(row1, row2):
    string = row1['number']
    sub_string = row2['number']
    flag = True
    i = 0
    if len(string) == len(sub_string):
        while i < len(string) and flag == True:
            if string[i] != "*" and sub_string[i] != "*":
                if string[i] != sub_string[i]:
                    flag = False
            i += 1
    else:
        flag = False
    return flag

Assuming I have a …
Topic: pandas python
Category: Data Science
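The loop above can be written more compactly. A sketch that takes the two strings directly rather than rows (adapting the signature is then a matter of passing `row1['number']` and `row2['number']`):

```python
def matching(string: str, sub_string: str) -> bool:
    """True when the strings have equal length and agree at every
    position where neither side has the '*' wildcard."""
    return len(string) == len(sub_string) and all(
        a == "*" or b == "*" or a == b for a, b in zip(string, sub_string)
    )

print(matching("12*4", "1234"), matching("1234", "1244"))  # True False
```

For pairing every row of one frame against every row of another, applying this over a cross join (`df1.merge(df2, how="cross")`) keeps the logic in one place.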

Need help on Time Series ARIMA Model

I'm working on forecasting daily volumes and have used a time series model to check the data for stationarity. However, I'm struggling to forecast the data with 90% accuracy. Right now the variation is extremely high and I'm just unable to bring it down. I've used a log transform on my data. Please find the link to the folder below, which contains the ipynb and csv files: https://drive.google.com/drive/folders/1QUJkTucLPIf2vjo2mRmoBU6be083dYpQ?usp=sharing Any help will be highly appreciated. Thanks, Rahul
Category: Data Science

Transitioning from a python script for data transformation to BigQuery

So I have a dataset spread over multiple and ever-growing excel files, all of which look like:

email order_ID order_date
[email protected] 1234 23-Mar-2021
[email protected] 1235 23-Mar-2021
[email protected] 1236 23-Mar-2021
[email protected] 1237 24-Mar-2021
[email protected] 1238 28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; emails are traded for user_IDs for anonymity, and returning customers are marked for further analyses):

user_ID order_ID order_date is_returning?
1 1234 23-Mar-2021 0
2 1235 23-Mar-2021 0
2 1236 23-Mar-2021 1
1 …
Category: Data Science
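In pandas the anonymisation step is essentially `factorize` plus `duplicated`; the same logic maps to BigQuery with `DENSE_RANK()` and a window count. A sketch with made-up emails (column name `is_returning` is assumed from the desired output):

```python
import pandas as pd

orders = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "b@x.com", "a@x.com"],
    "order_ID": [1234, 1235, 1236, 1237],
})

# Replace emails with anonymous sequential user IDs (first-seen order)
orders["user_ID"] = pd.factorize(orders["email"])[0] + 1

# A returning order is any order after the user's first one
orders["is_returning"] = orders.duplicated("email").astype(int)

# Private lookup table mapping user_ID back to email
users = orders[["user_ID", "email"]].drop_duplicates()
print(orders["user_ID"].tolist(), orders["is_returning"].tolist())
```

Persisting the `users` lookup table is what keeps the IDs stable as new excel files arrive.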

How to deal with errors of defining data types in pandas' read_csv ()?

I have a table with 118,000 rows and 80 columns. I would like to select 8 columns from the table. I am reading the file with the pandas pd.read_csv function: df = pd.read_csv(filename, header=None, sep='|', usecols=[1,3,4,5,37,40,51,76]) I would like to change the data type of each column inside read_csv using dtype={'5': np.float, '37': np.float, ....}, but this does not work. There is a message that column 5 has mixed types. The command print(df.dtypes) shows all columns of …
Topic: pandas python
Category: Data Science
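Part of the problem is likely the dtype keys: with `header=None` the column labels are the integers 0, 1, 2, …, so string keys like `'5'` match nothing. And a column reported as "mixed types" cannot be forced to float; reading it as `str` first, then cleaning, is the usual route. A small sketch:

```python
import io

import pandas as pd

# Stand-in for the pipe-separated file; the second row has a non-numeric cell
data = io.StringIO("1|a|3.5\n2|b|oops\n")

# With header=None the columns are the ints 0, 1, 2 -- dtype keys must be ints,
# and the mixed-type column is read as str so nothing is silently coerced
df = pd.read_csv(data, header=None, sep="|", usecols=[0, 2], dtype={0: int, 2: str})
print(df.dtypes.tolist())
```

After loading, `pd.to_numeric(df[2], errors="coerce")` converts the clean values and turns the bad ones into NaN for inspection.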

How to return the number of values that has a specific count

I would like to find how many values in a column occur a specific number of times. For example, based on the data frame below, I want to find how many values in the ID column are repeated exactly twice:

| ID |
| -------- |
| 000001 |
| 000001 |
| 000002 |
| 000002 |
| 000002 |
| 000003 |
| 000003 |

The output should look something like this: Number of ID's repeated twice: 2 The ID's …
Category: Data Science
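This is a "count of counts": `value_counts` tallies each ID, and a comparison over that tally counts how many IDs hit the target frequency. A sketch with the sample data:

```python
import pandas as pd

ids = pd.Series(["000001", "000001", "000002", "000002", "000002", "000003", "000003"])

# Count each ID, then count how many IDs occur exactly twice
counts = ids.value_counts()
repeated_twice = (counts == 2).sum()
print(f"Number of ID's repeated twice: {repeated_twice}")  # 2
```

`counts[counts == 2].index` additionally lists which IDs those are.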

Python: calculate the weighted average correlation coefficient

I am calculating the volatility (standard deviation) of returns of a portfolio of assets using the variance-covariance approach. Correlation coefficients and asset volatilities have been estimated from historical returns. Now what I'd like to do is compute the average correlation coefficient, that is the common correlation coefficient between all asset pairs that gives me the same overall portfolio volatility. I could of course take an iterative approach, but was wondering if there was something simpler / out of the box …
Category: Data Science
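There is a closed-form answer, so no iteration is needed. Writing the portfolio variance as the own-variance part plus a common correlation times the cross part, sigma_p^2 = sum(w_i^2 s_i^2) + rho_bar * [ (sum w_i s_i)^2 - sum(w_i^2 s_i^2) ], and solving for rho_bar gives the average correlation directly. A sketch with a hypothetical 3-asset portfolio:

```python
import numpy as np

# Hypothetical portfolio: weights, volatilities, correlation matrix
w = np.array([0.5, 0.3, 0.2])
vol = np.array([0.20, 0.15, 0.10])
corr = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.4], [0.1, 0.4, 1.0]])

cov = corr * np.outer(vol, vol)
port_var = w @ cov @ w

# Solve port_var = own + rho_bar * cross for rho_bar
own = np.sum((w * vol) ** 2)                       # sum of w_i^2 s_i^2
cross = np.sum(np.outer(w * vol, w * vol)) - own   # (sum w_i s_i)^2 - own
rho_bar = (port_var - own) / cross

# Sanity check: a flat correlation matrix with rho_bar reproduces the variance
flat = np.full((3, 3), rho_bar)
np.fill_diagonal(flat, 1.0)
assert np.isclose(w @ (flat * np.outer(vol, vol)) @ w, port_var)
```

Note this is the weighted average with weights w_i w_j s_i s_j over the off-diagonal pairs, which is exactly the "same overall portfolio volatility" definition in the question.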

Ordering a material science dataset (properties names, properties scalars, formulas)

I'm dealing with a materials science dataset and I'm in the following situation. The data is organized like this:

Chemical_Formula Property_name Property_Scalar
He Electrical conduc. 1
NO_2 Resistance 50
CuO3 Hardness ...
... ... ...
CuO3 Fluorescence 300
He Toxicity 39
NO2 Hardness 80
... ... ...

As you can see it is really messy, because the same chemical formula appears more than once in the dataset, each time referring to a different property. My question is, …
Category: Data Science
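This long format pivots naturally into one row per formula with one column per property. A sketch on a few of the sample rows (note the sample also mixes spellings like NO_2 and NO2, which are worth normalising before pivoting):

```python
import pandas as pd

df = pd.DataFrame({
    "Chemical_Formula": ["He", "NO2", "CuO3", "He", "NO2"],
    "Property_name": ["Electrical conduc.", "Resistance", "Fluorescence",
                      "Toxicity", "Hardness"],
    "Property_Scalar": [1, 50, 300, 39, 80],
})

# One row per formula, one column per property; missing combinations become NaN
wide = df.pivot_table(index="Chemical_Formula", columns="Property_name",
                      values="Property_Scalar")
print(wide.loc["He", "Toxicity"])  # 39.0
```

`pivot_table` averages duplicates by default; if duplicate (formula, property) pairs should instead be an error, use `df.pivot(...)`, which raises on them.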

Fill the null values in the dataframe with condition

traindf[traindf['Gender'] == 'female']['Age'].fillna(value=femage, inplace=True) I've tried to update the null values in the Age column of the dataframe with mean values. Here I tried to replace the null values in the Age column for the female gender with the female mean age, but the column doesn't get updated. Why?
Topic: pandas python
Category: Data Science
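The column doesn't update because the chained indexing `traindf[...]['Age']` produces a temporary copy, so `fillna(..., inplace=True)` fills that copy and discards it. Selecting the target cells with a single `.loc` and assigning back avoids this. A sketch with a tiny stand-in frame:

```python
import numpy as np
import pandas as pd

traindf = pd.DataFrame({
    "Gender": ["female", "female", "male"],
    "Age": [30.0, np.nan, np.nan],
})

# Female mean age, then fill only the missing female Age cells via .loc
femage = traindf.loc[traindf["Gender"] == "female", "Age"].mean()
mask = (traindf["Gender"] == "female") & traindf["Age"].isna()
traindf.loc[mask, "Age"] = femage
print(traindf["Age"].tolist())  # [30.0, 30.0, nan]
```

The same pattern repeated with a male mask and male mean fills the remaining nulls.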

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.