How to use Random Forest to reduce dimensions
I am working on the Boston housing competition on Kaggle, and at the moment I am trying to use Random Forest to find the columns with the highest correlation with the target variable SalePrice. However, the implementation returned almost every single variable in the dataset:
0 1 2 3 4 5 6 ... 252 253 254 255 256 257 258
0 1 RL 65.0 8450 Pave NaN Reg ... 0 1 0 0 1 0 1
1 2 RL 80.0 9600 Pave NaN Reg ... 0 1 0 0 1 0 1
2 3 RL 68.0 11250 Pave NaN IR1 ... 0 1 0 0 1 0 1
3 4 RL 60.0 9550 Pave NaN IR1 ... 0 0 0 0 1 0 1
4 5 RL 84.0 14260 Pave NaN IR1 ... 0 1 0 0 1 0 1
5 6 RL 85.0 14115 Pave NaN IR1 ... 0 1 0 0 1 0 1
6 7 RL 75.0 10084 Pave NaN Reg ... 0 1 0 0 1 0 1
7 8 RL NaN 10382 Pave NaN IR1 ... 0 1 0 0 1 0 1
8 9 RM 51.0 6120 Pave NaN Reg ... 0 0 0 0 1 0 1
9 10 RL 50.0 7420 Pave NaN Reg ... 0 1 0 0 1 0 1
10 11 RL 70.0 11200 Pave NaN Reg ... 0 1 0 0 1 0 1
11 12 RL 85.0 11924 Pave NaN IR1 ... 0 0 1 0 1 0 1
12 13 RL NaN 12968 Pave NaN IR2 ... 0 1 0 0 1 0 1
13 14 RL 91.0 10652 Pave NaN IR1 ... 0 0 1 0 1 0 1
14 15 RL NaN 10920 Pave NaN IR1 ... 0 1 0 0 1 0 1
15 16 RM 51.0 6120 Pave NaN Reg ... 0 1 0 0 1 0 1
16 17 RL NaN 11241 Pave NaN IR1 ... 0 1 0 0 1 0 1
17 18 RL 72.0 10791 Pave NaN Reg ... 0 1 0 0 1 0 1
18 19 RL 66.0 13695 Pave NaN Reg ... 0 1 0 0 1 0 1
19 20 RL 70.0 7560 Pave NaN Reg ... 0 0 0 0 1 0 1
20 21 RL 101.0 14215 Pave NaN IR1 ... 0 0 1 0 1 0 1
21 22 RM 57.0 7449 Pave Grvl Reg ... 0 1 0 0 1 0 1
22 23 RL 75.0 9742 Pave NaN Reg ... 0 1 0 0 1 0 1
23 24 RM 44.0 4224 Pave NaN Reg ... 0 1 0 0 1 0 1
24 25 RL NaN 8246 Pave NaN IR1 ... 0 1 0 0 1 0 1
25 26 RL 110.0 14230 Pave NaN Reg ... 0 1 0 0 1 0 1
26 27 RL 60.0 7200 Pave NaN Reg ... 0 1 0 0 1 0 1
27 28 RL 98.0 11478 Pave NaN Reg ... 0 1 0 0 1 0 1
28 29 RL 47.0 16321 Pave NaN IR1 ... 0 1 0 0 1 0 1
29 30 RM 60.0 6324 Pave NaN IR1 ... 0 1 0 0 1 1 0
... ... .. ... ... ... ... ... ... .. .. .. .. .. .. ..
1430 1431 RL 60.0 21930 Pave NaN IR3 ... 0 1 0 0 1 0 1
1431 1432 RL NaN 4928 Pave NaN IR1 ... 0 1 0 0 1 0 1
Not only that, but some of these columns are also returning NaN values, even though I already took care of NaN values before returning anything.
Caveat: I am running Random Forest right after one-hot encoding my categorical variables, so that is part of the reason the output has such a high dimension.
Here is my implementation so far:
I have gathered the names of my categorical, continuous, and binary variables in separate lists:
categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir']
ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
'PoolQC', 'OverallQual', 'OverallCond']
numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtUnfSF','TotalBsmtSF', '1stFlrSF', '2ndFlrSf', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
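Since these lists are typed by hand, a quick sanity check against `df.columns` catches typos before they surface as a `KeyError` later (for instance, the real dataset spells the second-floor column `2ndFlrSF`, not `2ndFlrSf`). A minimal sketch, using a toy stand-in for the real training DataFrame:

```python
import pandas as pd

# toy stand-in for the real training DataFrame (assumed columns)
df = pd.DataFrame({'LotArea': [8450, 9600],
                   'MSZoning': ['RL', 'RM'],
                   'SalePrice': [208500, 181500]})

# '2ndFlrSf' is a deliberate typo for '2ndFlrSF' to show what the check catches
expected = ['LotArea', 'MSZoning', '2ndFlrSf']
missing = [c for c in expected if c not in df.columns]
print(missing)  # any names listed here would raise a KeyError if used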
I've created a function named feature_encoding(df, categorical_list), and the following code is from that function. Here, I loop over every categorical variable in categorical_columns, one-hot encode it, and insert the encoded columns back into the data frame:
for col in categorical_list:
    # one-hot encode this categorical column
    OHE_sdf = pd.get_dummies(df[col], prefix=col)
    # drop the old categorical column from the original df
    df.drop(col, axis=1, inplace=True)
    # attach the one-hot encoded columns to the original dataframe
    # (no ignore_index, so the column names are preserved)
    df = pd.concat([df, OHE_sdf], axis=1)
return df
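As a side note, pandas can do this whole encoding in a single call: get_dummies accepts a columns= argument, encodes only those columns, and leaves the remaining columns and their names untouched. A minimal sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL'],
                   'Street': ['Pave', 'Pave', 'Grvl'],
                   'LotArea': [8450, 9600, 11250]})

# one-hot encode only the listed categorical columns in one call;
# dummy columns are named like 'MSZoning_RL', other columns are untouched
encoded = pd.get_dummies(df, columns=['MSZoning', 'Street'])
print(sorted(encoded.columns))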
Here, I am encoding my ranked values (for example: Excellent, Good, Average) as integers:
df['Utilities'] = df['Utilities'].replace(['AllPub', 'NoSeWa'], [2, 1]) # Utilities
df['ExterQual'] = df['ExterQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [4, 3, 2, 1]) # Exterior Quality
df['LandSlope'] = df['LandSlope'].replace(['Gtl', 'Mod', 'Sev'], [3, 2, 1]) # Land Slope
df['ExterCond'] = df['ExterCond'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0]) # Exterior Condition
df['HeatingQC'] = df['HeatingQC'].replace(['Ex', 'Gd', 'TA', 'Fa', 'Po'], [4, 3, 2, 1, 0]) # Heating Quality and Condition
df['KitchenQual'] = df['KitchenQual'].replace(['Ex', 'Gd', 'TA', 'Fa'], [3, 2, 1, 0]) # Kitchen Quality
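Since many of these columns share the same Ex/Gd/TA/Fa/Po scale, a single shared mapping keeps the integer codes consistent across columns (note that in the snippet above, KitchenQual maps 'Ex' to 3 while ExterQual maps it to 4). A minimal sketch, assuming the quality columns all use the standard five labels:

```python
import pandas as pd

# one shared mapping so identical labels get identical codes everywhere
quality_map = {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0}
quality_cols = ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual']  # extend as needed

# toy stand-in for the real DataFrame
df = pd.DataFrame({'ExterQual': ['Gd', 'TA'], 'ExterCond': ['TA', 'Fa'],
                   'HeatingQC': ['Ex', 'Gd'], 'KitchenQual': ['TA', 'TA']})
for col in quality_cols:
    df[col] = df[col].map(quality_map)
print(df['ExterQual'].tolist())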
Some of the columns had values abbreviated as NA, which meant something like "No pavement", but pandas interpreted the abbreviation as NaN. To avoid this, I replaced it in each of these columns with the placeholder XX:
# Replace the NA values of each column with XX to stop pandas from listing them as NaN
na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
for i in na_data:
    df[i] = df[i].fillna('XX')
# Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())
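After these replacements, it is worth confirming that nothing slipped through: any column still containing NaN at this point would explain the NaN values showing up in the selected features later. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

# toy stand-in for the cleaned DataFrame
df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0],
                   'Alley': [np.nan, 'Grvl', np.nan]})
df['Alley'] = df['Alley'].fillna('XX')
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# list any columns that still contain NaN
still_nan = df.columns[df.isna().any()].tolist()
print(still_nan)  # an empty list means the cleanup covered everything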
And finally, this is my Random Forest implementation to find correlated variables:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# drop the target from the features so it cannot leak into the selection,
# and use a regressor since SalePrice is continuous
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('SalePrice', axis=1), df['SalePrice'], test_size=0.3, random_state=42)

sel = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42))
sel.fit(x_train, y_train)
selected_feat = x_train.columns[sel.get_support()]
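One thing worth knowing here: SelectFromModel keeps every feature whose importance is at or above a threshold, which defaults to the mean importance; when importance is spread thinly across hundreds of one-hot dummy columns, many features clear that bar, so "almost everything" gets selected. Passing an explicit max_features (with the threshold disabled) caps the selection directly. A minimal sketch on synthetic regression data as a stand-in for the real features, using a regressor since SalePrice is continuous:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in: 20 features, only 3 actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=3, random_state=42)

# keep at most 5 features; threshold=-np.inf disables the mean-importance cutoff
sel = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=42),
                      max_features=5, threshold=-np.inf)
sel.fit(X, y)
n_selected = int(sel.get_support().sum())
print(n_selected)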
I apologize for such a wordy post. I wanted to be as clear in my question as possible. If you'd like to see the entire .py file, it is in the same repository as the hyperlinked dataset.
Topic kaggle random-forest feature-selection python machine-learning
Category Data Science