kaggle

Can someone explain how to create new features using feature interactions?

Ishak

May 8, 2022, 18:44

There is a notebook that solves housing prices (https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook?scriptVersionId=12846740), and it has this bit of code. Can anyone explain how the addition, the multiplication, and the weights work?
    features['YrBltAndRemod']=features['YearBuilt']+features['YearRemodAdd']
    features['TotalSF']=features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
    features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] + features['1stFlrSF'] + features['2ndFlrSF'])
    features['Total_Bathrooms'] = (features['FullBath'] + (0.5 * features['HalfBath']) + features['BsmtFullBath'] + (0.5 * features['BsmtHalfBath']))
    features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] + features['EnclosedPorch'] + features['ScreenPorch'] + features['WoodDeckSF'])

Topic: feature-engineering regression kaggle

Category: Data Science
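
For the feature-interaction question above, a minimal sketch may help. It assumes a tiny made-up DataFrame (the values are invented for illustration; only the column names come from the quoted notebook code). It shows that `+` adds columns row by row, and that a weight of `0.5` makes a half bathroom count as half of a full one.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the question.
features = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [900, 1200],
    "2ndFlrSF": [700, 0],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Column arithmetic is element-wise: each row gets the sum of its own values.
features["TotalSF"] = (
    features["TotalBsmtSF"] + features["1stFlrSF"] + features["2ndFlrSF"]
)

# A weight of 0.5 makes a half bathroom count as half of a full bathroom.
features["Total_Bathrooms"] = (
    features["FullBath"]
    + 0.5 * features["HalfBath"]
    + features["BsmtFullBath"]
    + 0.5 * features["BsmtHalfBath"]
)

print(features[["TotalSF", "Total_Bathrooms"]])
```

In this toy data the first house gets a total area of 2400 and 3.5 bathrooms. Whether such combined columns help a model depends on the data, so they are usually checked against a validation score.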

How to load numerous files from google drive into colab

sir_olf

May 5, 2022, 06:02

I am trying to load 30k images (600 MB) from Google Drive into Google Colaboratory to further process them with Keras/PyTorch. I first mounted my Google Drive using:
    from google.colab import drive
    drive.mount('/content/gdrive')
Next, I unzipped the image file using:
    !unzip -uq "/content/gdrive/My Drive/path.zip" -d "/content/gdrive/My Drive/path/"
Counting how many files are located in the directory using:
    len(os.listdir(path-to-train-images))
I only find 13k images (whereas I should find 30k). According to the output of unzip, the files appear …

Topic: kaggle dataset google machine-learning

Category: Data Science
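
One way to narrow down the file-count mismatch described above is to compare the number of files inside the archive with the number found on disk. The sketch below uses only the standard library; the helper names `count_zip_files` and `count_extracted_files` are hypothetical, and it is a diagnostic aid rather than a fix.

```python
import os
import zipfile


def count_zip_files(zip_path):
    """Count regular files (not directory entries) inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


def count_extracted_files(root_dir):
    """Count files under root_dir, including files in subdirectories."""
    return sum(len(filenames) for _, _, filenames in os.walk(root_dir))
```

If both counts agree but `os.listdir` reports fewer, the images probably sit in nested folders, since `os.listdir` only lists one directory level. If the on-disk count is lower than the archive count, the extraction itself is the place to look.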

Kaggle Titanic submission score is higher than local accuracy score

Kenny

May 2, 2022, 01:02

This is the starter challenge, Titanic. The original question I posted on Kaggle is here. However, nobody there really gave any insightful advice, so I am turning to the powerful Stack Overflow community. Based on this Notebook, we can download the ground truth for this challenge and get a perfect score. I tested it, and it does give me 100% on the LB, which confirms that it is the ground truth as it claims. (side question here: how do I remove …

Topic: kaggle machine-learning

Category: Data Science

How to use Random Forest to reduce dimensions

Andros Adrianopolos

April 22, 2022, 00:04

I am working on the Boston competition on Kaggle and at the moment I am trying to use Random Forest to find the columns with the highest correlation with the target variable SalePrice. However, the implementation returned almost every single variable in the dataset: 0 1 2 3 4 5 6 ... 252 253 254 255 256 257 258 0 1 RL 65.0 8450 Pave NaN Reg ... 0 1 0 0 1 0 1 1 2 RL 80.0 9600 …

Topic: kaggle random-forest feature-selection python machine-learning

Category: Data Science

Why is SMOTE not used in prize-winning Kaggle solutions?

Carlos Mougan

February 21, 2022, 07:33

The Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method to tackle imbalanced datasets. There are many highly cited papers out there claiming that it boosts accuracy in unbalanced data scenarios. But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used in Kaggle? I even see applied research …

Topic: smote kaggle class-imbalance machine-learning

Category: Data Science

Where can I practice multivariate outlier detection?

Mina Ashraf

February 20, 2022, 17:57

Can anyone provide me with a dataset, hopefully on Kaggle, where I can practice my skills in outlier analysis? I have been studying this topic for quite a while, but I can't find a case study to apply my knowledge to. Bonus points: if it had some categorical variables where I could practice various techniques for dealing with categorical variables and their correlation, it would be amazing. If that is not possible in the same dataset, it is also OK to guide me …

Topic: kaggle outlier

Category: Data Science

How to use the fillna method in a for loop

Andros Adrianopolos

January 24, 2022, 15:38

I am working on a housing dataset. In a list of columns (Garage, Fireplace, etc.), I have values called NA, which just means that the particular house in question does not have that feature (Garage, Fireplace). It doesn't mean that the value is missing/unknown. However, Python interprets this as NaN, which is wrong. To get around this, I want to replace this value NA with XX to help Python distinguish it from NaN values. Because there is a whole list …

Topic: feature-engineering kaggle pandas python machine-learning

Category: Data Science
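
The `fillna` question above can be sketched as follows. The data is made up and the column names are hypothetical stand-ins for the asker's list; the point is only that the result of `fillna` has to be assigned back inside the loop, and that columns outside the list keep their real NaN values.

```python
import numpy as np
import pandas as pd

# Made-up data; the column names are hypothetical stand-ins for the real list.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "FireplaceQu": [np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0],  # a truly missing value, left alone
})

# Columns where NA means "the house has no such feature".
no_feature_cols = ["GarageType", "FireplaceQu"]

for col in no_feature_cols:
    # Assign the result back; fillna returns a new Series by default.
    df[col] = df[col].fillna("XX")

# The same replacement without an explicit loop:
# df[no_feature_cols] = df[no_feature_cols].fillna("XX")

print(df)
```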

Need help understanding how this Neural Network is working

Sharhad Bashar

January 23, 2022, 20:07

This is a model I came across, and I need some help understanding how it works. It uses South German Credit Prediction data set from Kaggle
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip
    with zipfile.ZipFile('SouthGermanCredit.zip', 'r') as zip_ref:
        zip_ref.extractall('./SouthGermanCredit/')
    from tensorflow.keras import regularizers
    batch_size=32
    learning_rate=1e-3
    trainX, testX, trainY, testY = train_test_split(features, labels, test_size=0.2, random_state=69)
    normalizer = preprocessing.Normalization()
    normalizer.adapt(np.array(trainX))
    model = tf.keras.Sequential([
        normalizer,
        layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(2),
        layers.Softmax()])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    model.fit(trainX, trainY, epochs=50, verbose=0, batch_size=batch_size)
    test_loss, test_acc = model.evaluate(testX, testY, …

Topic: keras tensorflow kaggle neural-network python

Category: Data Science

How to use the Kaggle API in Google Colab for directly using a dataset?

mozilla-firefox

January 6, 2022, 06:02

I know that we can use Kaggle's API directly in Google Colab to download the dataset. The commands are:
    !mkdir .kaggle
    !echo '{"username":"somename","key":"apikey"}' > /root/.kaggle/kaggle.json
    !chmod 600 /root/.kaggle/kaggle.json
    !kaggle competitions download -c bluebook-for-bulldozers -p /content
But I need to repeat this process of creating the .kaggle file and passing the API key in Google Colab (GPU) every time. And sometimes the echo command fails, saying there is no file called .kaggle, but after, say, 2 minutes, without restarting the kernel, it works. It sounds …

Topic: colab gpu kaggle python google

Category: Data Science
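
A possible explanation for the intermittent error in the commands above, offered as a guess: `!mkdir .kaggle` creates the folder relative to the current working directory, while the `echo` line writes to `/root/.kaggle`, which may not exist yet. The sketch below writes the credentials file from Python and creates the target directory first. The helper name `write_kaggle_credentials` is hypothetical, and the username and key are the placeholders from the question.

```python
import json
import os
import stat


def write_kaggle_credentials(username, key, config_dir="~/.kaggle"):
    """Write kaggle.json into config_dir, creating the directory if needed."""
    config_dir = os.path.expanduser(config_dir)
    os.makedirs(config_dir, exist_ok=True)  # no error if it already exists
    path = os.path.join(config_dir, "kaggle.json")
    with open(path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # same as chmod 600
    return path
```

Calling `write_kaggle_credentials("somename", "apikey")` once at the top of a notebook would replace the first three shell commands. It still has to run in every new session, because the Colab file system does not persist between sessions.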

Changing the predicted variable from price to price/km due to better visual correlation

Michael Mykhaylov

January 5, 2022, 03:49

I'm working on a dataset of Uber Rides from Kaggle. Of the important variables there are pickup and drop-off coordinates, passenger count, datetime of pickup, distance and the final price. I'm currently in the exploration phase and just about to begin feature engineering. When I'm plotting the different potential correlations, some of them just feel odd to plot fare against something. For example, fare vs passenger count or fare vs hour doesn't make much sense to me, as the average …

Topic: seaborn feature-engineering kaggle visualization

Category: Data Science

Massive difference in accuracy of KNN depending on random_state

Esoog

December 29, 2021, 14:36

Pardon the noob question, but I am baffled by the following behavior. My model has MASSIVELY different results based on the random seed. I want to train a KNN classifier on the famous Kaggle Titanic problem, where we attempt to predict whether a passenger survived. I focus only on the "Sex" feature to make things easier. The problem is that changing the random seed changes the accuracy dramatically. For example, one random seed gives me a …

Topic: k-nn kaggle machine-learning

Category: Data Science

Kaggle notebook vs. Google Colab

ashraf

December 16, 2021, 20:39

What are the major differences between a Kaggle notebook and a Google Colab notebook? To work on a dataset, my first step is to start a Kaggle notebook, but then I can't help wondering what the advantage of using a Colab notebook instead could be. I know a few differences; correct me if I'm mistaken about any: Kaggle has a console and Colab doesn't (but I still don't know what to do with the console). Kaggle notebook allows collaboration with other users on Kaggle's …

Topic: colab difference kaggle python

Category: Data Science

Why do feature engineering and filling NaNs reduce the score?

Dmitry Sokolov

December 6, 2021, 07:13

I used CatBoost for an InClass Kaggle competition. I have tried various strategies for filling NaN values, converting float binary variables to categorical, and adding new categorical features (from age, for example). I have also tried generating new features from existing ones and removing irrelevant features (by correlation, feature importance, SHAP). But all of it only makes the score worse! Why? The best score came without any preprocessing, using only hyperparameters found via random search.

Topic: catboost feature-engineering kaggle feature-selection data-cleaning

Category: Data Science

Difference between model score on test part and Kaggle public score

Dmitry Sokolov

November 13, 2021, 10:29

I tested my CatBoost model on part of the data and got a score of 0.92, but the Kaggle public score was 0.9. I found new hyperparameters via random search; the new model's score was 0.925, but the Kaggle score fell to 0.88. What should I do to validate the model correctly?

Topic: catboost validation score kaggle cross-validation

Category: Data Science
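
For the validation question above, one common approach is k-fold cross-validation instead of a single hold-out split. The sketch below uses scikit-learn on synthetic data. It swaps in `LogisticRegression` as a stand-in for the asker's CatBoost model, and the dataset sizes and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the competition's training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stand-in model; the question uses CatBoost.
model = LogisticRegression(max_iter=1000)

# Five stratified folds: every sample is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold scores:", scores)
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how noisy a single score is. A gain of 0.005 on one hold-out split may well be smaller than that spread, in which case it should not be read as a real improvement.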

Why does performance vary among the validation set, public test set, and private test set?

S.F. Yeh

November 10, 2021, 06:42

When practicing with classical Kaggle competitions, such as Titanic, House pricing, and so on, I followed the traditional process that I learned from a textbook:
1. Split the training data into a training set and a validation set (either by 7:3 or CV).
2. Fit the model with the training set.
3. Evaluate the model performance with the validation set.
4. Combine the training set and validation set and re-train the model with the same parameters that were good on the validation set.
5. Predict the result of the test set.
Something I could …

Topic: kaggle evaluation

Category: Data Science

LightGBM vs XGBoost vs CatBoost

David Masip

October 28, 2021, 14:52

I've seen that in Kaggle competitions people are using LightGBM where they used to use XGBoost. My question is: when would you rather use XGBoost instead of LightGBM? What about CatBoost?

Topic: catboost lightgbm xgboost kaggle machine-learning

Category: Data Science

Keras ImageDataGenerator unable to find images

Blake Lucey

October 15, 2021, 18:18

I'm trying to add image data to a Kaggle notebook so I can run a convolutional neural network, but I'm having trouble doing this via ImageDataGenerator. This is the link to my Kaggle notebook. These are my imports:
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from random import randint
    from sklearn.utils import shuffle
    from sklearn.preprocessing import MinMaxScaler
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import …

Topic: image-preprocessing cnn keras kaggle python

Category: Data Science

GloVe Embedding Matrix "could not broadcast input array from shape (0) into shape (300)"

J.Smith

October 10, 2021, 11:01

I'm working on the Quora Question Pairs data set. I'm trying to build an embedding matrix for GloVe with the following code:
    EMBEDDING_DIM = 300
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
However, I get the following error:
    ValueError: could not broadcast input array from shape (0) into shape (300)
I searched the internet but couldn't find any tips. This is how I load GloVe: embeddings_index = …

Topic: embeddings word-embeddings kaggle

Category: Data Science
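
The error message in the GloVe question above suggests that at least one vector in `embeddings_index` is empty, although the loading code is not shown, so this is an assumption. The sketch below reuses the question's variable names with toy stand-in data and adds a shape check, so that malformed vectors are recorded instead of raising.

```python
import numpy as np

EMBEDDING_DIM = 300

# Toy stand-ins for the question's word_index and embeddings_index.
word_index = {"good": 1, "bad": 2, "oov": 3}
embeddings_index = {
    "good": np.ones(EMBEDDING_DIM, dtype="float32"),
    "bad": np.array([], dtype="float32"),  # an empty, malformed vector
}

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
skipped = []

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is None:
        continue  # word not in GloVe: its row stays all zeros
    if embedding_vector.shape != (EMBEDDING_DIM,):
        skipped.append(word)  # wrong length: record it instead of crashing
        continue
    embedding_matrix[i] = embedding_vector

print("skipped words:", skipped)
```

Inspecting the skipped words would then point at the lines of the GloVe file that were parsed incorrectly.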

Not able to download image dataset from Kaggle api

Parth Sharma

October 1, 2021, 12:52

I want to download only the train dataset. Dataset link here. Kaggle API version - 1.5.12
    (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets files paultimothymooney/chest-xray-pneumonia
    max() arg is an empty sequence
    (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets download paultimothymooney/chest-xray-pneumonia -f chest_xray/train
    404 - Not Found

Topic: api kaggle python

Category: Data Science

Loading medical imaging data from multiple folders

satan 29

September 28, 2021, 14:53

I have a fairly basic mathematical and implementational understanding of ML algorithms and CNNs, and I am trying to think of an approach for this task: https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/data?select=test. The "data" section explains the task and also gives a preview of the dataset. Doubt on general implementation approach: From what I understand, we have 4 input parameters: FLAIR, T1W, T1Gd, and T2W. Based on these 4 parameters, we have to compute the "MGMT status" (presence of MGMT), which is binary, i.e., takes …

Topic: data kaggle visualization machine-learning

Category: Data Science
