Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6

Strategy and techniques for Kaggle competitions, focusing on RandomForestRegressor and fastai deep learning models, including hyperparameter optimization and preprocessing.

Series: Kaggle Competition - Deep Dive Tabular Data


Advanced Missing Value Analysis in Tabular Data, Part 1
Decision Tree Feature Selection Methodology, Part 2
RandomForestRegressor Performance Analysis, Part 3
Statistical Interpretation of Tabular Data, Part 4
Addressing the Out-of-Domain Problem in Feature Selection, Part 5
Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6
Hyperparameter Optimization in Deep Learning for Kaggle, Part 7

Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6

For the final submission, we train several models and combine their predictions in the form of a weighted ensemble prediction. Estimators from the model types covered below are included, and the number of iterations reported for each model is the final number used for the submission.

The hyperparameter optimization for each of them is described in the following sections.

Creating Estimators Optimized For Kaggle

So far, the focus has been on fitting estimators for interpretability rather than for the lowest RMSE value. The Kaggle competition we submit our final predictions to, however, scores each submission solely by its RMSE on the test set. This makes it necessary to create estimators through hyperparameter tuning: we start with few iterations and check the resulting RMSE values, then add iterations for as long as our hardware can handle them within a reasonable duration of roughly five minutes, and stop adding iterations earlier if the RMSE values stop improving.
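
This procedure can be sketched as a small loop with a time budget and an early-stopping check. The sketch below is only an illustration of the idea; the helper evaluate_rmse, the candidate list, and the patience value are assumptions and not the exact setup used for the submission.

import time

def budgeted_search(candidates, evaluate_rmse, time_budget_s=300, patience=3):
    # Evaluate candidate settings until the time budget is used up,
    # or stop early once RMSE has not improved for `patience` candidates in a row.
    start = time.time()
    best_rmse, best_params, stale = float("inf"), None, 0
    for params in candidates:
        if time.time() - start > time_budget_s:
            break  # stay within the roughly five minute budget
        rmse = evaluate_rmse(params)  # fit on the training set, score on the validation set
        if rmse < best_rmse:
            best_rmse, best_params, stale = rmse, params, 0
        else:
            stale += 1
            if stale >= patience:
                break  # RMSE stopped improving despite more iterations
    return best_params, best_rmse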

RandomForestRegressor Optimization

Using a manually created test harness, the RMSE values for each parameter combination on the training and validation sets are appended to the lists m_rmsel and m_rmselv respectively, and it is these two lists that the function returns.

def rf2(
    xs_final=xs_final,
    y=y,
    valid_xs_final=valid_xs_final,
    valid_y=valid_y,
    nestimators=[60, 50, 40, 30, 20],
    max_samples=[200, 300, 400, 500, 600, 700],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    from itertools import product

    m_rmsel = []
    m_rmselv = []
    setups = product(nestimators, max_samples)
    for ne in setups:
        mt = RandomForestRegressor(
            n_jobs=-1,
            n_estimators=ne[0],
            max_samples=ne[1],
            max_features=max_features,
            min_samples_leaf=min_samples_leaf,
            oob_score=True,
            random_state=seed,
        ).fit(xs_final, y)
        m_rmsel.append((m_rmse(mt, xs_final, y), ne[0], ne[1]))
        m_rmselv.append((m_rmse(mt, valid_xs_final, valid_y), ne[0], ne[1]))
    return m_rmsel, m_rmselv

We run the manual hyperparameter optimization and assign the outputs to m_rmset and m_rmsev respectively.

m_rmset, m_rmsev = rf2()

The evaluation is done by creating a DataFrame from the results and grouping it by the validation RMSE, using the pandas .groupby method together with the aggregation method .agg, which keeps the minimum n_estimators and max_samples within each RMSE group. Since the group keys are sorted in ascending order, the first row of the resulting grouped_opt DataFrame holds the parameter combination with the lowest validation RMSE, and that is the combination we choose.

dfm_rmsev = pd.DataFrame(m_rmsev, columns=["m_rmsev", "n_estimators", "max_samples"])
grouped_opt = dfm_rmsev.groupby(by="m_rmsev").agg(min)
grouped_opt.iloc[:5, :]
n_estimators max_samples
m_rmsev
0.138596 60 600
0.139147 50 600
0.139720 40 600
0.140007 60 700
0.140081 30 600

To avoid hard-coding a parameter combination that might not be the optimal one for a given execution of the code, we read the optimal values for n_estimators and max_samples directly from the index positions in grouped_opt that hold them.
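
As a brief illustration of that lookup (the names best_n_estimators and best_max_samples are only used here for clarity; function rff below reads the same positions as its default arguments):

# the first row of grouped_opt corresponds to the lowest validation RMSE
best_n_estimators = grouped_opt.iloc[0, 0]
best_max_samples = grouped_opt.iloc[0, 1]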

Function rff fits a RandomForestRegressor with the optimal parameter values found by the hyperparameter optimization procedure outlined above, regardless of which execution run produced them.

def rff(
    xs,
    y,
    n_estimators=grouped_opt.iloc[0, 0],
    max_samples=grouped_opt.iloc[0, 1],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        random_state=seed,
    ).fit(xs, y)

Final RandomForestRegressor RMSE Values

Executing function rff, we get the RMSE values of the fitted estimator on the training and validation sets.

m = rff(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.124334, 0.138596)

tabular_learner - Deep Learning Model

While dropping garagearea led to a slightly better RandomForestRegressor score on the validation set, the improvement was marginal. Let’s see what the results are using neural networks.

The original csv files are imported, and we show how to apply the preprocessing steps using the TabularPandas function from the fastai library.

Creating the DataFrames for fitting the deep learning model.

nn_t = base + "/" + "my_competitions/kaggle_competition_house_prices/data/train.csv"
nn_v = base + "/" + "my_competitions/kaggle_competition_house_prices/data/test.csv"
dfnn_t = pd.read_csv(nn_t, low_memory=False).clean_names()
dfnn_v = pd.read_csv(nn_v, low_memory=False).clean_names()
print(len(dfnn_v))
dfnn_v.columns[:3]
1459
Index(['id', 'mssubclass', 'mszoning'], dtype='object')

We assign the ordered categorical columns to the data, as we did before for the tree-based models in a previous part (see Deep Dive Tabular Data Part 1). A short sketch of the underlying pandas operation follows the output below.

dfnn_t = cu(dfnn_t, uset, usetna)
dfnn_v = cu(dfnn_v, uset, usetna)
dfnn_t = tl(dfnn_t)
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagequal Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
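
The helper functions cu and tl come from Part 1 of the series. As a minimal sketch of what assigning one such ordered categorical column looks like in plain pandas (the column name and category order are taken from the output above; everything else is illustrative):

import pandas as pd

# ordered quality scale for 'extercond', from worst to best
quality_levels = ["Po", "Fa", "TA", "Gd", "Ex"]
ordered_dtype = pd.CategoricalDtype(categories=quality_levels, ordered=True)
dfnn_t["extercond"] = dfnn_t["extercond"].astype(ordered_dtype)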

Applying the log function to the dependent variable saleprice.
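
The code for this step is not shown in this part; a minimal version, assuming numpy is imported as np, would be:

# natural log of the sale price, matching the log-scale values shown further below
dfnn_t["saleprice"] = np.log(dfnn_t["saleprice"])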

We only use the columns that were left in the dataset after analyzing the contribution of each column in the previous section.

dfnn_tf = dfnn_t[
    xs_final_ext.columns.tolist() + ["saleprice"]
]  # _tf stands for train and final (train dataset from kaggle)
dfnn_vf = dfnn_v[
    xs_final_ext.columns.tolist()
]  # _vf stands for validation final (test dataset from kaggle)
print(len(dfnn_vf))
dfnn_tf.sample(n=3, random_state=seed)
1459
overallqual grlivarea yearbuilt garagecars 1stflrsf ... lotfrontage fireplaces 2ndflrsf totrmsabvgrd saleprice
892 6 1068 1963 1 1068 ... 70.0 0 0 6 11.947949
1105 8 2622 1994 2 1500 ... 98.0 2 1122 9 12.691580
413 5 1028 1927 2 1028 ... 56.0 1 0 5 11.652687

3 rows × 19 columns

Verify that the number of columns in dfnn_tf is correct.

len(dfnn_tf.columns)
19

Testing Of Different Values For Parameter max_card

Values in the range between 2 and 100 are tested. The output is hidden for readability.

for i in range(2, 101):
    contnn, catnn = cont_cat_split(dfnn_tf, max_card=i, dep_var="saleprice")
#    print(f"{len(contnn)}, {i}: {contnn}")

Looking at the above output, and given that it is hard to find a column in the dataset that can clearly be identified as having continuous values, only columns with more than 100 unique values are assigned type continuous. The final continuous columns are printed below. The commented-out print statement produces output of the form:

x, y: z

x := number of columns assigned type continuous, given threshold value y
y := the max_card value; columns with more than y unique values are assigned type continuous
z := list of names of the columns assigned type continuous

Example given below:

>>> 9, 100: ['grlivarea', 'yearbuilt', '1stflrsf', 'garageyrblt', 'totalbsmtsf',
                'bsmtfinsf1', 'lotarea', 'lotfrontage', '2ndflrsf']

Creating and displaying the continuous and categorical columns using max_card 100.

contnn, catnn = cont_cat_split(dfnn_tf, max_card=100, dep_var="saleprice")
catnn
['overallqual',
 'garagecars',
 'fullbath',
 'fireplacequ',
 'centralair',
 'yearremodadd',
 'garagecond',
 'fireplaces',
 'totrmsabvgrd']
contnn
['grlivarea',
 'yearbuilt',
 '1stflrsf',
 'garageyrblt',
 'totalbsmtsf',
 'bsmtfinsf1',
 'lotarea',
 'lotfrontage',
 '2ndflrsf']

Print the number of unique values for all columns in the categorical subset.

dfnn_tf[catnn].nunique().sort_values(ascending=False)
yearremodadd    61
totrmsabvgrd    12
overallqual     10
                ..
fullbath         4
fireplaces       4
centralair       2
Length: 9, dtype: int64

Run TabularPandas Function

Since none of the boolean columns that indicate whether a value was missing in a row of a column are present in the final training dataset, we do not add these columns to the tabular object created below (FillMissing(add_col=False)). Doing this now helps keep the training and test data compatible in case the test data has missing values in columns where the training data does not.

procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn = TabularPandas(
    dfnn_tf,
    procsnn,
    catnn,
    contnn,
    splits=(train_s, valid_s),
    y_names="saleprice",
)

Create Dataloaders Object

The dataloaders object holds the training and validation sets, created from the preprocessed TabularPandas object.

dls = tonn.dataloaders(1024)
x_nnt, y = dls.train.xs, dls.train.y
x_val_nnt, y_val = dls.valid.xs, dls.valid.y
y.min(), y.max()
(10.46024227142334, 13.534473419189453)

Calculate the RMSE values using the datasets from the dataloaders object.

m2 = rff(x_nnt, y)
m_rmse(m2, x_nnt, y), m_rmse(m2, x_val_nnt, y_val)
(0.124612, 0.135281)

Create tabular_learner estimator

Create the tabular_learner object using the dataloaders object from the previous step. The range of the dependent variable saleprice is passed explicitly via y_range, chosen to closely bracket the observed minimum and maximum from the previous output.

learn = tabular_learner(dls, y_range=(10.45, 13.55), n_out=1, loss_func=F.mse_loss)
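
Passing y_range makes fastai squeeze the raw network output through a sigmoid scaled to that interval. A minimal sketch of this mapping (the function name below is illustrative, PyTorch imported as torch):

import torch

def sigmoid_range(x, low, high):
    # map raw model output to the interval (low, high)
    return torch.sigmoid(x) * (high - low) + low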

Preprocessing Of The Kaggle Test Dataset

A look at the columns of the DataFrame that holds the independent variables, as given by the Kaggle test dataset. This is the dataset that the final predictions need to be made on.

dfnn_vf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   overallqual   1459 non-null   int64   
 1   grlivarea     1459 non-null   int64   
 2   yearbuilt     1459 non-null   int64   
 3   garagecars    1458 non-null   float64 
 4   1stflrsf      1459 non-null   int64   
 5   fullbath      1459 non-null   int64   
 6   garageyrblt   1381 non-null   float64 
 7   totalbsmtsf   1458 non-null   float64 
 8   fireplacequ   1459 non-null   category
 9   bsmtfinsf1    1458 non-null   float64 
 10  lotarea       1459 non-null   int64   
 11  centralair    1459 non-null   object  
 12  yearremodadd  1459 non-null   int64   
 13  garagecond    1459 non-null   category
 14  lotfrontage   1232 non-null   float64 
 15  fireplaces    1459 non-null   int64   
 16  2ndflrsf      1459 non-null   int64   
 17  totrmsabvgrd  1459 non-null   int64   
dtypes: category(2), float64(5), int64(10), object(1)
memory usage: 185.8+ KB

Looking at a random sample containing 5 rows of the DataFrame.

dfnn_vf.sample(n=5, random_state=seed)
overallqual grlivarea yearbuilt garagecars 1stflrsf ... garagecond lotfrontage fireplaces 2ndflrsf totrmsabvgrd
1321 4 864 1950 1.0 864 ... TA 60.0 0 0 5
836 8 2100 2007 3.0 958 ... TA 82.0 2 1142 8
413 5 990 1994 1.0 990 ... TA 65.0 0 0 5
522 8 1342 2006 2.0 1342 ... TA 48.0 1 0 6
1035 6 2422 1954 2.0 2422 ... TA 102.0 2 0 6

5 rows × 18 columns

We apply the same procs we used for the training dataset during the call to TabularPandas, followed by creating the dataloaders object and assigning the independent variables to variable x_valid.

Since there is no dependent variable in this dataset, there is no .y part, which is why we omit the parameter y_names. Because we also pass no value for splits, the dataset is not split into training and validation data: all rows end up in dlsv.train.xs.

To get predictions on the Kaggle test data with the fitted estimator, we take the TabularPandas object used for training (tonn), call its .new method, and pass it the Kaggle test data from the dataloaders object dlsv via dlsv.train.xs. Calling .process() applies the same preprocessing as for the training data, and a dataloaders object holding the processed test data is then created for predictions.

procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn_vf = TabularPandas(dfnn_vf, procsnn, catnn, contnn)
dlsv = tonn_vf.dataloaders(1024)
x_valid = dlsv.train.xs
tonn_vfs = tonn.new(dlsv.train.xs)
tonn_vfs.process()
tonn_vfs.items.head()
tonn_vfs_dl = dls.valid.new(tonn_vfs)
tonn_vfs_dl.show_batch()
overallqual garagecars fullbath fireplacequ centralair yearremodadd garagecond fireplaces totrmsabvgrd grlivarea yearbuilt 1stflrsf garageyrblt totalbsmtsf bsmtfinsf1 lotarea lotfrontage 2ndflrsf
0 5 2 2 #na# #na# #na# #na# 1 3 -1.215545 -0.340839 -0.654589 -0.653021 -0.370719 0.063433 0.363854 0.567329 -0.775266
1 6 2 2 #na# #na# #na# #na# 1 4 -0.323504 -0.439751 0.433328 -0.769719 0.639182 1.063523 0.897789 0.615966 -0.775266
2 5 3 3 #na# #na# #na# #na# 2 4 0.294491 0.844006 -0.574169 0.747364 -0.266790 0.773368 0.809397 0.275533 0.891941
3 6 3 3 #na# #na# #na# #na# 2 5 0.243012 0.876977 -0.579218 0.786203 -0.271295 0.357968 0.031786 0.470066 0.837236
4 8 3 3 #na# #na# #na# #na# 1 3 -0.424442 0.679386 0.310173 0.552805 0.528496 -0.387169 -0.971584 -1.232092 -0.775266
5 6 3 3 #na# #na# #na# #na# 2 5 0.348114 0.712356 -0.988706 0.591644 -0.639604 -0.965220 0.036564 0.324165 1.346189
6 6 3 3 #na# #na# #na# #na# 1 4 -0.616098 0.679386 0.076580 0.552805 0.275484 1.089883 -0.370756 -0.064899 -0.775266
7 6 3 3 #na# #na# #na# #na# 2 5 -0.043400 0.876977 -0.923342 0.786203 -0.580829 -0.965220 -0.285948 -0.259431 0.832477
8 7 3 2 #na# #na# #na# #na# 2 3 -0.298774 0.613678 0.463439 0.474945 0.573757 0.434900 0.072399 0.810498 -0.775266
9 4 3 2 #na# #na# #na# #na# 1 2 -1.244439 -0.044803 -0.689749 -0.302924 -0.370719 0.801959 -0.285948 0.081001 -0.775266
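
Once the tabular_learner has been trained, which is not part of this section, predictions for this dataloader can be obtained with get_preds. A minimal sketch, where np.exp undoes the log transform that was applied to saleprice:

preds, _ = learn.get_preds(dl=tonn_vfs_dl)
# back to the original sale price scale
final_preds = np.exp(preds.numpy()).squeeze()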

Entire Series:

Advanced Missing Value Analysis in Tabular Data, Part 1
Decision Tree Feature Selection Methodology, Part 2
RandomForestRegressor Performance Analysis, Part 3
Statistical Interpretation of Tabular Data, Part 4
Addressing the Out-of-Domain Problem in Feature Selection, Part 5
Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6
Hyperparameter Optimization in Deep Learning for Kaggle, Part 7