Can someone explain how to create new features using feature interactions?

There is this notebook solving house prices (https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook?scriptVersionId=12846740), and it had this bit of code. Can anyone explain how the addition, multiplication and weights work?

    features['YrBltAndRemod'] = features['YearBuilt'] + features['YearRemodAdd']
    features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
    features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] +
                                     features['1stFlrSF'] + features['2ndFlrSF'])
    features['Total_Bathrooms'] = (features['FullBath'] + (0.5 * features['HalfBath']) +
                                   features['BsmtFullBath'] + (0.5 * features['BsmtHalfBath']))
    features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] +
                                  features['EnclosedPorch'] + features['ScreenPorch'] +
                                  features['WoodDeckSF'])
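As a rough illustration of what the quoted lines do, here is a minimal sketch on a tiny made-up DataFrame. Only the column names come from the excerpt; the two rows of values are invented for illustration. The `+` operator adds columns row by row, and multiplying by 0.5 acts as a weight, so a half bath counts as half of a full bath.

```python
import pandas as pd

# Toy data: values are invented; column names follow the excerpt above.
features = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [1000, 900],
    "2ndFlrSF": [500, 0],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Element-wise addition: each row's new value is the sum of that row's columns.
features["TotalSF"] = (
    features["TotalBsmtSF"] + features["1stFlrSF"] + features["2ndFlrSF"]
)

# Scalar multiplication acts as a weight: a half bath counts as 0.5 of a bath.
features["Total_Bathrooms"] = (
    features["FullBath"]
    + 0.5 * features["HalfBath"]
    + features["BsmtFullBath"]
    + 0.5 * features["BsmtHalfBath"]
)

print(features[["TotalSF", "Total_Bathrooms"]])
```

For the first toy row this gives 800 + 1000 + 500 = 2300 square feet and 2 + 0.5 + 1 + 0 = 3.5 bathrooms.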
Category: Data Science

How to load numerous files from Google Drive into Colab

I am trying to load 30k images (600 MB) from Google Drive into Google Colaboratory to process them further with Keras/PyTorch. I first mounted my Google Drive using:

    from google.colab import drive
    drive.mount('/content/gdrive')

Next, I unzipped the image archive using:

    !unzip -uq "/content/gdrive/My Drive/path.zip" -d "/content/gdrive/My Drive/path/"

Counting how many files are located in the directory using:

    len(os.listdir(path-to-train-images))

I only find 13k images (whereas I should find 30k). According to the output of unzip, the files appear …
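One way to narrow down a mismatch like this is to compare the number of file entries in the archive with the number of files actually on disk. The sketch below is only an illustration using the standard library; the helper names `count_zip_files` and `count_extracted_files` are hypothetical. Note that `os.listdir` only lists the top level of a directory, so the second helper walks subdirectories as well.

```python
import os
import zipfile


def count_zip_files(zip_path):
    """Count regular files (not directory entries) listed in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def count_extracted_files(root_dir):
    """Count files under root_dir, including files in subdirectories."""
    return sum(len(files) for _, _, files in os.walk(root_dir))
```

If the two counts agree but `len(os.listdir(...))` is smaller, the remaining files are probably in subdirectories; if they disagree, the extraction itself is incomplete.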
Category: Data Science

Kaggle Titanic submission score is higher than local accuracy score

This is the starter challenge, Titanic. The original question I posted on Kaggle is here. However, nobody has given any insightful advice, so I am turning to the powerful Stack Overflow community. Based on this notebook, we can download the ground truth for this challenge and get a perfect score. I tested it and it does give me 100% on the LB, which confirms that it is the ground truth as it claims. (side question here: how do I remove …
Category: Data Science

How to use Random Forest to reduce dimensions

I am working on the Boston competition on Kaggle and at the moment I am trying to use Random Forest to find the columns with the highest correlation with the target variable SalePrice. However, the implementation returned almost every single variable in the dataset:

       0   1     2     3     4    5    6  ...  252  253  254  255  256  257  258
    0  1  RL  65.0  8450  Pave  NaN  Reg  ...    0    1    0    0    1    0    1
    1  2  RL  80.0  9600  …
Category: Data Science

Why is SMOTE not used in prize-winning Kaggle solutions?

The Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers claiming that it boosts accuracy in imbalanced-data scenarios. But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used in Kaggle? I even see applied research …
Category: Data Science

Where can I practice multivariate outlier detection?

Can anyone provide me with a dataset, hopefully on Kaggle, where I can practice my skills in outlier analysis? I have been studying this topic for quite a while, but I can't find a case study to apply my knowledge to. Bonus points: if it had some categorical variables where I can practice various techniques for dealing with categorical variables and their correlation, it would be amazing. If that is not possible in the same dataset, it is also OK to guide me …
Topic: kaggle outlier
Category: Data Science

How to use the fillna method in a for loop

I am working on a housing dataset. In a list of columns (Garage, Fireplace, etc.), I have values called NA, which just means that the particular house in question does not have that feature (Garage, Fireplace). It doesn't mean that the value is missing/unknown. However, Python interprets this as NaN, which is wrong. To get around this, I want to replace the value NA with XX to help Python distinguish it from NaN values. Because there is a whole list …
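The replacement described above can be sketched as a plain loop over the affected columns. This is a minimal example on a toy DataFrame; the column names, the values, and the list name `absent_means_none` are hypothetical stand-ins, not taken from the original dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names and values are hypothetical stand-ins.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "FireplaceQu": [np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0],  # NaN here really is missing
})

# Columns where NaN means "the house has no such feature".
absent_means_none = ["GarageType", "FireplaceQu"]

for col in absent_means_none:
    # Assign the result back rather than relying on inplace=True.
    df[col] = df[col].fillna("XX")

print(df)
```

Only the listed columns are touched, so a genuinely missing value such as the one in `LotFrontage` stays NaN. Alternatively, `pd.read_csv` accepts `keep_default_na=False`, which stops the string NA from being parsed as NaN in the first place, though that applies to every column.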
Category: Data Science

Need help understanding how this Neural Network is working

This is a model I came across, and I need some help understanding how it works. It uses the South German Credit Prediction data set from Kaggle.

    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip
    with zipfile.ZipFile('SouthGermanCredit.zip', 'r') as zip_ref:
        zip_ref.extractall('./SouthGermanCredit/')

    from tensorflow.keras import regularizers

    batch_size=32
    learning_rate=1e-3
    trainX, testX, trainY, testY = train_test_split(features, labels, test_size=0.2, random_state=69)
    normalizer = preprocessing.Normalization()
    normalizer.adapt(np.array(trainX))
    model = tf.keras.Sequential([
        normalizer,
        layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(2),
        layers.Softmax()])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    model.fit(trainX, trainY, epochs=50, verbose=0, batch_size=batch_size)
    test_loss, test_acc = model.evaluate(testX, testY, …
Category: Data Science

How to use the Kaggle API in Google Colab to access a dataset directly?

I know that we can use Kaggle's API directly in Google Colab to download a dataset. The commands are:

    !mkdir .kaggle
    !echo '{"username":"somename","key":"apikey"}' > /root/.kaggle/kaggle.json
    !chmod 600 /root/.kaggle/kaggle.json
    !kaggle competitions download -c bluebook-for-bulldozers -p /content

But I need to repeat this process of creating the .kaggle file and passing the API key in the Google Colab GPU runtime every time. And sometimes the echo command fails, saying there is no file called .kaggle, but after about 2 minutes, without restarting the kernel, it works. It sounds …
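One thing worth noticing in the commands quoted above is that `mkdir .kaggle` creates the directory relative to the current working directory, while `echo` writes to `/root/.kaggle/`. If the working directory is not `/root`, the target directory may not exist yet. The sketch below shows one way to create the credentials file at an explicit path from Python; the function name `write_kaggle_credentials` and its `home` parameter are hypothetical and exist only for this illustration.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(username, key, home=None):
    """Create <home>/.kaggle/kaggle.json with owner-only permissions."""
    home_dir = Path(home) if home is not None else Path.home()
    config_dir = home_dir / ".kaggle"
    # exist_ok avoids an error when the directory is already there.
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "kaggle.json"
    config_path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(config_path, 0o600)  # same permissions as `chmod 600`
    return config_path
```

Calling `write_kaggle_credentials("somename", "apikey")` in a notebook cell would then replace the first three shell commands, since the directory is always created at the same location the file is written to.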
Category: Data Science

Changing the predicted variable from price to price/km due to better visual correlation

I'm working on a dataset of Uber rides from Kaggle. The important variables are pickup and drop-off coordinates, passenger count, datetime of pickup, distance and the final price. I'm currently in the exploration phase and just about to begin feature engineering. When I plot the different potential correlations, it feels odd to plot fare against some of the variables. For example, fare vs passenger count or fare vs hour doesn't make much sense to me, as the average …
Category: Data Science

Massive difference in accuracy of KNN depending on random_state

Pardon the noob question, but I am baffled by the following behavior. My model has MASSIVELY different results depending on the random seed. I want to train a KNN classifier on the famous Kaggle Titanic problem, where we attempt to predict survival. I focus only on the "Sex" feature, to make things easier. The problem is that changing the random seed changes the accuracy dramatically. For example, one random seed gives me a …
Category: Data Science

Kaggle notebook vs Google Colab

What are the major differences between a Kaggle notebook and a Google Colab notebook? To work on a dataset, my first step is to start a Kaggle notebook, but then I can't help wondering what the advantage of using a Colab notebook instead would be. I know a few differences, correct me if I'm mistaken about any: Kaggle has a console and Colab doesn't (but I still don't know what to do with the console). Kaggle notebook allows collaboration with other users on Kaggle's …
Category: Data Science

Why do feature engineering and filling NaNs reduce the score?

I used CatBoost for an InClass Kaggle competition. I have tried various strategies for filling NaN values, converting float binary variables to categorical, and adding new categorical features (from age, for example). I have tried generating new features from existing ones, and also tried removing irrelevant features (by correlation, feature importance, SHAP). But all of it only makes the score worse! Why? The best score came without any preprocessing, using only hyper-parameters found via random search.
Category: Data Science

Why does performance vary among the validation set, public test set and private test set?

When practicing with classical Kaggle competitions, such as Titanic, House pricing, and so on, I followed the traditional process that I learned from textbooks:

1. split the training data into a training set and a validation set (either 7:3 or by CV)
2. fit the model with the training set
3. evaluate the model performance with the validation set
4. combine the training set and validation set and re-train the model with the same parameters that were good on the validation set
5. predict the result of the test set

Something I could …
Category: Data Science

Keras ImageDataGenerator unable to find images

I'm trying to add image data to a Kaggle notebook so I can run a convolutional neural network, but I'm having trouble doing this via ImageDataGenerator. This is the link to my Kaggle notebook. These are my imports:

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from random import randint
    from sklearn.utils import shuffle
    from sklearn.preprocessing import MinMaxScaler
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import …
Category: Data Science

GloVe Embedding Matrix "could not broadcast input array from shape (0) into shape (300)"

I'm working on the Quora Question Pairs data set. I'm trying to build the embedding matrix for GloVe with the following code:

    EMBEDDING_DIM = 300
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

However, I get the following error:

    ValueError: could not broadcast input array from shape (0) into shape (300)

I searched the internet but couldn't find any tips. This is how I load GloVe:

    embeddings_index = …
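The error message suggests that at least one vector returned by `embeddings_index.get(word)` has length 0 instead of 300. The loading code is cut off in the excerpt, so the cause is not visible here. The sketch below only shows how such entries could be detected and skipped; the tiny `word_index` and `embeddings_index` dictionaries and the `skipped` list are hypothetical stand-ins for the real data.

```python
import numpy as np

EMBEDDING_DIM = 300

# Hypothetical stand-ins for the real tokenizer index and GloVe lookup.
word_index = {"good": 1, "bad": 2, "odd": 3}
embeddings_index = {
    "good": np.ones(EMBEDDING_DIM, dtype="float32"),
    "bad": np.array([], dtype="float32"),  # malformed entry with shape (0,)
}

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
skipped = []
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is None:
        continue  # word not in GloVe; its row stays all zeros
    if embedding_vector.shape != (EMBEDDING_DIM,):
        skipped.append(word)  # assigning this vector would raise ValueError
        continue
    embedding_matrix[i] = embedding_vector

print(skipped)
```

Inspecting the words collected in `skipped` may point to the lines of the GloVe file that were parsed incorrectly.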
Category: Data Science

Not able to download image dataset from Kaggle API

I want to download only the train dataset. Dataset link here. Kaggle API version: 1.5.12

    (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets files paultimothymooney/chest-xray-pneumonia
    max() arg is an empty sequence
    (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets download paultimothymooney/chest-xray-pneumonia -f chest_xray/train
    404 - Not Found
Topic: api kaggle python
Category: Data Science

Loading medical imaging data from multiple folders

I have a fairly basic mathematical and implementation-level understanding of ML algorithms and CNNs, and I am trying to think of an approach for this task: https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/data?select=test. The "data" section explains the task and also gives a preview of the dataset. Doubt on the general implementation approach: from what I understand, we have 4 input parameters: FLAIR, T1W, T1Gd, and T2W. Based on these 4 parameters, we have to compute the "MGMT status" (presence of MGMT), which is binary, i.e., takes …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.