This notebook tackles the house prices competition: https://www.kaggle.com/code/jesucristo/1-house-prices-solution-top-1/notebook?scriptVersionId=12846740. It contains the following bit of code. Can anyone explain how the addition, multiplication, and weights work? features['YrBltAndRemod']=features['YearBuilt']+features['YearRemodAdd'] features['TotalSF']=features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF'] features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] + features['1stFlrSF'] + features['2ndFlrSF']) features['Total_Bathrooms'] = (features['FullBath'] + (0.5 * features['HalfBath']) + features['BsmtFullBath'] + (0.5 * features['BsmtHalfBath'])) features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] + features['EnclosedPorch'] + features['ScreenPorch'] + features['WoodDeckSF'])
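For anyone reading the excerpt above, here is a minimal sketch of what such expressions do, assuming `features` is a pandas DataFrame. The two-row table is made up for illustration; only the column names come from the question. Arithmetic between columns is applied row by row, and the `0.5` factor simply counts each half bath as half a bathroom.

```python
import pandas as pd

# Made-up two-house table; the column names follow the question,
# the values are purely illustrative.
features = pd.DataFrame({
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# "+" and "*" between Series are element-wise: row i of the result is
# computed only from row i of each input column.
features["Total_Bathrooms"] = (
    features["FullBath"]
    + 0.5 * features["HalfBath"]
    + features["BsmtFullBath"]
    + 0.5 * features["BsmtHalfBath"]
)

total_bathrooms = features["Total_Bathrooms"].tolist()
print(total_bathrooms)  # [3.5, 1.5]
```

The first house has 2 + 0.5 + 1 + 0 = 3.5 bathrooms and the second has 1 + 0 + 0 + 0.5 = 1.5, so the new column is just a weighted sum computed per row.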
I am trying to load 30k images (600 MB) from Google Drive into Google Colaboratory to process them further with Keras/PyTorch. I first mounted my Google Drive using: from google.colab import drive drive.mount('/content/gdrive') Next, I unzipped the image file using: !unzip -uq "/content/gdrive/My Drive/path.zip" -d "/content/gdrive/My Drive/path/" Counting how many files are located in the directory using: len(os.listdir(path-to-train-images)) I only find 13k images (whereas I should find 30k). According to the output of unzip, the files appear …
This is the starter challenge, Titanic. The original question I posted on Kaggle is here. However, nobody really gives any insightful advice so I am turning to the powerful Stackoverflow community. Based on this Notebook, we can download the ground truth for this challenge and get a perfect score. I tested it and it does give me 100% on LB for the purpose of confirming it is the ground truth as it claims. (side question here: how do I remove …
I am working on the Boston competition on Kaggle and at the moment I am trying to use Random Forest to find the columns with the highest correlation with the target variable SalePrice. However, the implementation returned almost every single variable in the dataset: 0 1 2 3 4 5 6 ... 252 253 254 255 256 257 258 0 1 RL 65.0 8450 Pave NaN Reg ... 0 1 0 0 1 0 1 1 2 RL 80.0 9600 …
Synthetic Minority Over-sampling Technique (SMOTE) is a well-known method for tackling imbalanced datasets. There are many highly cited papers out there claiming that it boosts accuracy in unbalanced data scenarios. But when I look at Kaggle competitions, it is rarely used; to the best of my knowledge, there are no prize-winning Kaggle/ML competitions where it was used to achieve the best solution. Why is SMOTE not used on Kaggle? I even see applied research …
Can anyone provide me with a dataset, hopefully on Kaggle, where I can practice my skills in outlier analysis? I have been studying this topic for quite a while, but I can't find a case study to apply my knowledge? bonus points: if it had some categorical variables where I can practice various techniques for dealing with categorical variables and their correlation, it would be amazing. If not possible in the same dataset, it is ok also to guide me …
I am working on a housing dataset. In a list of columns (Garage, Fireplace, etc.), I have values called NA, which just means that the particular house in question does not have that feature (Garage, Fireplace). It doesn't mean that the value is missing/unknown. However, Python interprets this as NaN, which is wrong. To get around this, I want to replace this value NA with XX to help Python distinguish it from NaN values. Because there is a whole list …
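One possible approach to the NA question above, sketched under the assumption that the data is read with pandas `read_csv` (the excerpt does not say how it is loaded). The inline CSV and the column names `GarageType`, `FireplaceQu`, and `LotFrontage` are illustrative, not taken from the question. Turning off the default NA strings keeps the literal text `NA`, while empty cells are still treated as missing.

```python
import io
import pandas as pd

# Illustrative CSV: "NA" means "house has no such feature",
# an empty cell means the value is genuinely missing.
csv_text = (
    "Id,GarageType,FireplaceQu,LotFrontage\n"
    "1,NA,Gd,65\n"
    "2,Attchd,NA,\n"
)

# keep_default_na=False stops pandas from converting the string "NA" to NaN;
# na_values=[""] keeps empty cells as NaN.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

# Hypothetical list of columns where "NA" means "feature absent".
feature_cols = ["GarageType", "FireplaceQu"]
df[feature_cols] = df[feature_cols].replace("NA", "XX")

garage_values = df["GarageType"].tolist()
lot_missing = df["LotFrontage"].isna().tolist()
print(garage_values, lot_missing)
```

With this setup the "no garage" marker survives as `XX`, and only the genuinely empty `LotFrontage` cell ends up as NaN.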
This is a model I came across, and I need some help understanding how it works. It uses the South German Credit Prediction data set from Kaggle: !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip with zipfile.ZipFile('SouthGermanCredit.zip', 'r') as zip_ref: zip_ref.extractall('./SouthGermanCredit/') from tensorflow.keras import regularizers batch_size=32 learning_rate=1e-3 trainX, testX, trainY, testY = train_test_split(features, labels, test_size=0.2, random_state=69) normalizer = preprocessing.Normalization() normalizer.adapt(np.array(trainX)) model = tf.keras.Sequential([ normalizer, layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)), layers.Dropout(0.5), layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)), layers.Dropout(0.5), layers.Dense(2), layers.Softmax()]) model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False), metrics=['accuracy']) model.fit(trainX, trainY, epochs=50, verbose=0, batch_size=batch_size) test_loss, test_acc = model.evaluate(testX, testY, …
I know that we can use Kaggle's API directly in Google Colab to download a dataset. The commands are: !mkdir .kaggle !echo '{"username":"somename","key":"apikey"}' > /root/.kaggle/kaggle.json !chmod 600 /root/.kaggle/kaggle.json !kaggle competitions download -c bluebook-for-bulldozers -p /content But I need to repeat this process of creating the .kaggle file and passing the API key in Google Colab GPU every time. And sometimes the echo command fails, saying there is no file called .kaggle, but after, say, 2 minutes, without restarting the kernel, it works. It sounds …
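A small sketch related to the commands above, with the caveat that it is only a guess at the cause: `mkdir .kaggle` creates the directory relative to the current working directory, while the `echo` writes to `/root/.kaggle`, so the target directory may not exist yet. The helper name `write_kaggle_credentials` is hypothetical, and the demo writes into a temporary directory rather than the real home directory.

```python
import json
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir):
    """Create config_dir if needed and write kaggle.json with 600 permissions."""
    config_dir = Path(config_dir)
    # parents/exist_ok make this safe to re-run in every new session.
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    path.chmod(0o600)  # same effect as: chmod 600 kaggle.json
    return path


# Demo in a throwaway directory; the credentials are the placeholders
# from the question, not real ones.
with tempfile.TemporaryDirectory() as tmp:
    path = write_kaggle_credentials("somename", "apikey", Path(tmp) / ".kaggle")
    saved = json.loads(path.read_text())
    file_name = path.name

print(saved, file_name)
```

In a notebook, one would presumably pass `Path.home() / ".kaggle"` as `config_dir`, since the Kaggle CLI looks for `~/.kaggle/kaggle.json` by default. Creating the directory and the file in one step with an absolute path avoids depending on the working directory.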
I'm working on a dataset of Uber rides from Kaggle. The important variables are pickup and drop-off coordinates, passenger count, datetime of pickup, distance, and the final price. I'm currently in the exploration phase and just about to begin feature engineering. When I plot the different potential correlations, plotting fare against some of the variables just feels odd. For example, fare vs passenger count or fare vs hour doesn't make much sense to me, as the average …
Pardon the noob question, but I am baffled by the following behavior. My model has MASSIVELY different results depending on the random seed. I want to train a KNN classifier on the famous Kaggle Titanic problem, where we attempt to predict survival. I focus only on the "Sex" feature to make things easier. The problem is that changing the random seed changes the accuracy dramatically. For example, one random seed gives me a …
What are the major differences between a Kaggle notebook and a Google Colab notebook? To work on a dataset, my first step is to start a Kaggle notebook, but then I can't help wondering what the advantage of using a Colab notebook instead would be. I know a few differences; correct me if I'm mistaken about any: Kaggle has a console and Colab doesn't (but I still don't know what to do with the console). Kaggle notebook allows collaboration with other users on Kaggle's …
I used CatBoost for an InClass Kaggle competition. I have tried various strategies for filling NaN values, converting float binary variables to categorical, and adding new categorical features (from age, for example). I have tried generating new features from existing ones, and also tried removing irrelevant features (by correlation, feature importance, SHAP). But all of it only makes the score worse! Why? The best score came without any preprocessing, using only hyperparameters found via random search.
I tested my CatBoost model on part of the data and got a score of 0.92, but the Kaggle public score was 0.9. I found new hyperparameters via random search; the new model scored 0.925 locally, but the Kaggle score fell to 0.88. What should I do to validate the model correctly?
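One common way to get a steadier local estimate than a single hold-out split is k-fold cross-validation. The sketch below is not the asker's setup: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost, and the fold count and metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the real competition data is not available here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stand-in model; a CatBoost classifier could be dropped in the same way.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Five stratified folds: every row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean_score = scores.mean()
std_score = scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

The spread across folds gives a rough idea of how much a difference like 0.92 versus 0.925 could be noise. If the hyperparameter search is scored on the same folds, the reported mean will tend to be optimistic, which may be part of the gap to the public leaderboard.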
When practicing with classic Kaggle competitions, such as Titanic, House Prices, and so on, I followed the traditional process that I learned from textbooks: (1) split the training data into a training set and a validation set (either 7:3 or CV); (2) fit the model with the training set; (3) evaluate the model performance with the validation set; (4) combine the training set and validation set and re-train the model with the same parameters that were good on the validation set; (5) predict the result of the test set. Something I could …
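The process described above can be sketched roughly as follows, assuming scikit-learn. The data is synthetic, `LogisticRegression` is an arbitrary model choice, and the `params` dictionary stands for whatever settings looked good on the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: "labeled" plays the role of Kaggle's train.csv,
# "X_test" the role of the unlabeled test.csv.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_labeled, X_test, y_labeled, _ = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Split the labeled data 7:3 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_labeled, y_labeled, test_size=0.3, random_state=0
)

# Fit on the training set and evaluate on the validation set.
params = {"C": 1.0, "max_iter": 1000}  # assumed "good" settings
model = LogisticRegression(**params).fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)

# Re-train on training + validation data with the same parameters.
final_model = LogisticRegression(**params).fit(X_labeled, y_labeled)

# Predict the test set with the re-trained model.
test_predictions = final_model.predict(X_test)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Note that the final model is never evaluated directly; `val_accuracy` describes the model trained on less data, so it is only an approximate guide to how `final_model` will do.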
I've seen that in Kaggle competitions people now use LightGBM where they used to use XGBoost. My question is: when would you rather use XGBoost instead of LightGBM? What about CatBoost?
I'm trying to add image data to a Kaggle notebook so I can run a convolutional neural network, but I'm having trouble doing this via ImageDataGenerator. This is the link to my Kaggle notebook. These are my imports: import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) from random import randint from sklearn.utils import shuffle from sklearn.preprocessing import MinMaxScaler import tensorflow as tf from tensorflow import keras from tensorflow.keras.models import …
I'm working on the Quora Question Pairs data set. I'm trying to build the embedding matrix for GloVe with the following code: EMBEDDING_DIM = 300 embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM)) for word, i in word_index.items(): embedding_vector = embeddings_index.get(word) if embedding_vector is not None: embedding_matrix[i] = embedding_vector However, I get the following error: ValueError: could not broadcast input array from shape (0) into shape (300) I searched the internet but couldn't find any tips. This is how I load GloVe: embeddings_index = …
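The error message above suggests that at least one entry in `embeddings_index` is an empty array. Since the loading code is cut off, the cause cannot be confirmed here; the sketch below only shows a defensive version of the loop that skips vectors of the wrong shape and records which words were affected. The tiny dictionary and the dimension of 3 are made up for illustration.

```python
import numpy as np

EMBEDDING_DIM = 3  # 300 in the question; kept small for illustration

# Made-up index: "dog" has an empty vector, as a malformed line might produce.
embeddings_index = {
    "cat": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "dog": np.array([], dtype="float32"),
}
word_index = {"cat": 1, "dog": 2, "bird": 3}

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
skipped = []

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is None:
        continue  # word not in GloVe: row stays all zeros
    if embedding_vector.shape != (EMBEDDING_DIM,):
        skipped.append(word)  # wrong-sized vector: record it and skip
        continue
    embedding_matrix[i] = embedding_vector

print("skipped words:", skipped)
```

Inspecting `skipped` should point to the lines of the GloVe file that were parsed into empty or truncated vectors, which is where the actual fix probably belongs.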
I want to download only the train dataset (dataset link here). Kaggle API version: 1.5.12. (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets files paultimothymooney/chest-xray-pneumonia max() arg is an empty sequence (tensorflow) parthsharma@Parths-MacBook-Air ~ % kaggle datasets download paultimothymooney/chest-xray-pneumonia -f chest_xray/train 404 - Not Found
I have a fairly basic mathematical and implementational understanding of ML algorithms and CNNs, and I am trying to think of an approach for this task: https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/data?select=test. The "data" section explains the task and also gives a preview of the dataset. Question on the general implementation approach: From what I understand, we have 4 input parameters: FLAIR, T1W, T1Gd, and T2W. Based on these 4 parameters, we have to compute the "MGMT status" (presence of MGMT), which is binary, i.e., takes …