Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6

Strategy and techniques for Kaggle competitions, focusing on RandomForestRegressor and fastai deep learning models, including hyperparameter optimization and preprocessing.

Tags: kaggle-competition-strategies model-optimization fastai hyperparameter-tuning tabular-data

Category: Tabular Data

Series: Kaggle Competition - Deep Dive Tabular Data

Advanced Missing Value Analysis in Tabular Data, Part 1
Decision Tree Feature Selection Methodology, Part 2
RandomForestRegressor Performance Analysis, Part 3
Statistical Interpretation of Tabular Data, Part 4
Addressing the Out-of-Domain Problem in Feature Selection, Part 5
Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6
Hyperparameter Optimization in Deep Learning for Kaggle, Part 7

Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6

For the final submission, we train several models and combine their predictions into a weighted ensemble prediction. The ensemble includes estimators of the following model types. The iteration counts listed further below are the final counts used for the submission.

RandomForestRegressor from sklearn
XGBRegressor from xgboost
tabular_learner (deep learning model) from fastai
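
As a minimal sketch of what a weighted ensemble prediction means here, the snippet below averages per-model predictions with one weight per model. The predictions and weights are made-up placeholder values, not results from this series, and weighted_ensemble is a hypothetical helper name.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one weighted average per row.

    predictions: dict mapping model name -> list of predictions (same length).
    weights: dict mapping model name -> non-negative weight.
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(next(iter(predictions.values())))
    combined = []
    for i in range(n_rows):
        row = sum(weights[name] * preds[i] for name, preds in predictions.items())
        combined.append(row / total)  # normalize so the weights sum to 1
    return combined


# Placeholder log-scale predictions for three rows (illustrative values only).
preds = {
    "random_forest": [12.0, 11.5, 12.4],
    "xgboost": [12.2, 11.7, 12.6],
    "tabular_learner": [12.1, 11.6, 12.5],
}
w = {"random_forest": 0.25, "xgboost": 0.5, "tabular_learner": 0.25}
ensemble = weighted_ensemble(preds, w)
```

Dividing by the sum of the weights means the weights do not have to add up to 1 beforehand.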

The hyperparameter optimization for each of them is:

RandomForestRegressor
- Manual loop with 30 iterations using parameters:
  - nestimators - number of estimators to use.
  - max_samples - maximum number of samples to use for training a single base estimator (tree).
tabular_learner
- Manual loop with 20 iterations using parameters:
  - lr (learning rate) - values tested depend on lr_find output.
  - epochs - number of epochs to train.
XGBRegressor
- RandomizedSearchCV from sklearn with 1400 iterations and 8-fold cross-validation for each iteration, using a parameter distribution dictionary.
- For details, see section ‘XGBRegressor Optimization’.
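
To get a feeling for the cost of these settings, the number of model fits can be counted with a few lines. This is only a back-of-the-envelope sketch. The two grids are the default lists used for the RandomForestRegressor loop in this part, and the XGBRegressor numbers are the ones stated in the list above.

```python
from itertools import product

# RandomForestRegressor: the manual loop visits every combination once.
nestimators = [60, 50, 40, 30, 20]
max_samples = [200, 300, 400, 500, 600, 700]
rf_fits = len(list(product(nestimators, max_samples)))

# XGBRegressor: each sampled candidate is fitted once per fold.
xgb_candidates = 1400
xgb_folds = 8
xgb_cv_fits = xgb_candidates * xgb_folds

print(rf_fits, xgb_cv_fits)  # 30 11200
```

With its default refit=True, RandomizedSearchCV fits the best candidate once more on the full training data after the search.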

Creating Estimators Optimized For Kaggle

So far, the focus has been on fitting estimators for interpretability and not for the lowest RMSE value. The Kaggle competition we want to submit our final predictions to, however, scores each submission only on its RMSE value on the test set. We therefore need estimators that are the result of hyperparameter tuning. We start with few iterations, check the resulting RMSE values, and build up to as many iterations as our hardware can handle within a reasonable duration of roughly 5 minutes. We stop adding iterations to the hyperparameter optimization procedure earlier if the RMSE values stop improving despite the increased number of iterations.
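
The stopping rule described above can be sketched as a small loop. This is an illustration under assumptions, not the code used in this series: grow_iterations, its parameters, and the evaluate callback are hypothetical names, and fake_search is a toy stand-in for a real hyperparameter search.

```python
import time


def grow_iterations(evaluate, start=5, factor=2, max_seconds=300.0, min_gain=1e-4):
    """Increase the iteration count until RMSE stalls or a run exceeds the budget.

    evaluate: hypothetical callable taking n_iter and returning the best
    validation RMSE found with that many iterations.
    """
    n_iter = start
    best_rmse = float("inf")
    history = []
    while True:
        started = time.perf_counter()
        rmse = evaluate(n_iter)
        elapsed = time.perf_counter() - started
        history.append((n_iter, rmse))
        improved = best_rmse - rmse > min_gain
        best_rmse = min(best_rmse, rmse)
        # Stop when more iterations no longer help or a run takes too long.
        if not improved or elapsed > max_seconds:
            break
        n_iter *= factor
    return best_rmse, history


# Toy stand-in for a real search: RMSE improves with n_iter, then plateaus.
def fake_search(n_iter):
    return max(0.138, 0.2 - 0.002 * n_iter)


best, hist = grow_iterations(fake_search)
```

In the toy run, the loop tries 5, 10, 20, 40 and 80 iterations and stops at 80, because the RMSE no longer improves over the run with 40 iterations.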

RandomForestRegressor Optimization

Using a manually created test harness, the RMSE values for each iteration on the training and validation set are appended to the lists m_rmsel and m_rmselv respectively, together with the parameter values used. The function returns these two lists.

def rf2(
    xs_final=xs_final,
    y=y,
    valid_xs_final=valid_xs_final,
    valid_y=valid_y,
    nestimators=[60, 50, 40, 30, 20],
    max_samples=[200, 300, 400, 500, 600, 700],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    from itertools import product

    m_rmsel = []
    m_rmselv = []
    setups = product(nestimators, max_samples)
    for ne in setups:
        mt = RandomForestRegressor(
            n_jobs=-1,
            n_estimators=ne[0],
            max_samples=ne[1],
            max_features=max_features,
            min_samples_leaf=min_samples_leaf,
            oob_score=True,
            random_state=seed,
        ).fit(xs_final, y)
        m_rmsel.append((m_rmse(mt, xs_final, y), ne[0], ne[1]))
        m_rmselv.append((m_rmse(mt, valid_xs_final, valid_y), ne[0], ne[1]))
    return m_rmsel, m_rmselv
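
The helper m_rmse is not defined in this part. A minimal sketch of what such a helper could look like is given below, assuming it computes the root mean squared error of the model's predictions and rounds it to six decimals, which is consistent with the values shown in this part. MeanModel is a toy class that exists only to exercise the function.

```python
import math


def r_mse(pred, y):
    """Root mean squared error of two equally long sequences, rounded to 6 decimals."""
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, y)]
    return round(math.sqrt(sum(squared_errors) / len(squared_errors)), 6)


def m_rmse(m, xs, y):
    """RMSE of the predictions of model m on inputs xs against targets y."""
    return r_mse(m.predict(xs), y)


class MeanModel:
    """Toy model that always predicts a constant, used only to exercise m_rmse."""

    def __init__(self, value):
        self.value = value

    def predict(self, xs):
        return [self.value for _ in xs]


print(m_rmse(MeanModel(12.0), [[0], [1]], [11.0, 13.0]))  # 1.0
```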

We run the manual hyperparameter optimization and assign the outputs to m_rmset and m_rmsev respectively.

m_rmset, m_rmsev = rf2()

The evaluation is done by creating a DataFrame from m_rmsev and grouping it by the validation RMSE with the pandas .groupby method, followed by aggregation method .agg using the minimum. Because .groupby sorts its keys in ascending order, the first row of the resulting grouped_opt DataFrame holds the parameter combination with the lowest validation RMSE, and that is the combination we choose.

dfm_rmsev = pd.DataFrame(m_rmsev, columns=["m_rmsev", "n_estimators", "max_samples"])
grouped_opt = dfm_rmsev.groupby(by="m_rmsev").agg("min")
grouped_opt.iloc[:5, :]

	n_estimators	max_samples
m_rmsev
0.138596	60	600
0.139147	50	600
0.139720	40	600
0.140007	60	700
0.140081	30	600
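
Picking the best combination amounts to taking the row with the smallest validation RMSE. The sketch below does this without pandas, using only the five rows displayed above as input.

```python
# (validation RMSE, n_estimators, max_samples) rows copied from the table above.
rows = [
    (0.138596, 60, 600),
    (0.139147, 50, 600),
    (0.139720, 40, 600),
    (0.140007, 60, 700),
    (0.140081, 30, 600),
]

# The row with the lowest validation RMSE holds the parameters to use.
best_rmse, best_n_estimators, best_max_samples = min(rows, key=lambda r: r[0])
print(best_n_estimators, best_max_samples)  # 60 600
```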

To avoid using a parameter combination that is not the optimal one for the given execution of the code, we do not hard-code the values. Instead, the optimal values for n_estimators and max_samples are read directly from grouped_opt by their index positions.

Function rff fits a RandomForestRegressor with the optimal parameter values found by the hyperparameter optimization procedure outlined above, regardless of the execution number.

def rff(
    xs,
    y,
    n_estimators=grouped_opt.iloc[0, 0],
    max_samples=grouped_opt.iloc[0, 1],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        random_state=seed,
    ).fit(xs, y)

Final RandomForestRegressor RMSE Values

Executing function rff, we get the RMSE values for the fitted estimator on the training and validation set.

m = rff(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)

(0.124334, 0.138596)

tabular_learner - Deep Learning Model

While dropping garagearea resulted in a slightly higher accuracy using RandomForestRegressor on the validation set, the increase was marginal. Let’s see what the results are using neural networks.

The original csv files are imported, and we show how to apply the preprocessing steps using TabularPandas from the fastai library.

First, we create the DataFrames for fitting the deep learning model.

nn_t = base + "/" + "my_competitions/kaggle_competition_house_prices/data/train.csv"
nn_v = base + "/" + "my_competitions/kaggle_competition_house_prices/data/test.csv"
dfnn_t = pd.read_csv(nn_t, low_memory=False).clean_names()
dfnn_v = pd.read_csv(nn_v, low_memory=False).clean_names()
print(len(dfnn_v))
dfnn_v.columns[:3]

1459

Index(['id', 'mssubclass', 'mszoning'], dtype='object')

We assign the ordered categorical columns to the data, as we did for the tree-based models in a previous part. See Deep Dive Tabular Data, Part 1.

dfnn_t = cu(dfnn_t, uset, usetna)
dfnn_v = cu(dfnn_v, uset, usetna)
dfnn_t = tl(dfnn_t)

extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagequal Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')

Applying the log function to the dependent variable saleprice.
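
The helper that applies the transformation is not shown in this part. The sketch below illustrates the idea, assuming the natural logarithm is used. The three prices are chosen so that their logarithms reproduce the saleprice values in the sample of dfnn_tf displayed further below.

```python
import math

# Prices on the original scale (chosen to match the sample displayed below).
sale_prices = [154500.0, 325000.0, 115000.0]

# The model is trained on the log of the price.
log_prices = [math.log(p) for p in sale_prices]

# Predictions on the log scale are mapped back with exp to get prices again.
restored = [math.exp(lp) for lp in log_prices]
```

Since the target is on the log scale, an RMSE calculated on it measures relative rather than absolute price errors.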

Only use the columns that were left in the dataset after analyzing the contribution of each of the columns in the previous section.

dfnn_tf = dfnn_t[
    xs_final_ext.columns.tolist() + ["saleprice"]
]  # _tf stands for train and final (train dataset from kaggle)
dfnn_vf = dfnn_v[
    xs_final_ext.columns.tolist()
]  # _vf stands for validation final (test dataset from kaggle)
print(len(dfnn_vf))
dfnn_tf.sample(n=3, random_state=seed)

	overallqual	grlivarea	yearbuilt	garagecars	1stflrsf	...	lotfrontage	fireplaces	2ndflrsf	totrmsabvgrd	saleprice
892	6	1068	1963	1	1068	...	70.0	0	0	6	11.947949
1105	8	2622	1994	2	1500	...	98.0	2	1122	9	12.691580
413	5	1028	1927	2	1028	...	56.0	1	0	5	11.652687

3 rows × 19 columns

Verify that the number of columns in dfnn_tf is correct.

len(dfnn_tf.columns)

Testing Of Different Values For Parameter max_card

Values in the range between 2 and 100 are tested. Output is hidden, for readability.

for i in range(2, 101):
    contnn, catnn = cont_cat_split(dfnn_tf, max_card=i, dep_var="saleprice")
#    print(f"{len(contnn)}, {i}: {contnn}")

Looking at the output of the loop, and given that it is hard to find a column in the dataset that can clearly be identified as having continuous values, we set max_card to 100. With this value, an integer column needs more than 100 unique values to be assigned as continuous, while float columns are always treated as continuous by cont_cat_split. The final continuous columns are printed below. The output has the following format:

(x,y,z)

x := Number of type continuous columns, given threshold value y
y := Threshold max_card, a column needs more than y unique values to be assigned type continuous
z := List of names of columns assigned type continuous

Example given below:

>>> 9, 100: ['grlivarea', 'yearbuilt', '1stflrsf', 'garageyrblt', 'totalbsmtsf',
                'bsmtfinsf1', 'lotarea', 'lotfrontage', '2ndflrsf']
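
To make the splitting rule concrete, here is a small re-implementation in plain Python. It is a sketch of the rule as we understand it from cont_cat_split, not fastai's code: split_cont_cat is a hypothetical name, columns are passed as a dict of lists instead of a DataFrame, and the toy data is made up.

```python
def split_cont_cat(columns, max_card, dep_var=None):
    """Split column names into (continuous, categorical).

    columns: dict mapping column name -> list of values (None marks missing).
    A column counts as continuous if it holds floats, or if it holds only
    integers and has more than max_card unique values.
    """
    cont, cat = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the dependent variable is in neither list
        present = [v for v in values if v is not None]
        is_float = any(isinstance(v, float) for v in present)
        is_int = bool(present) and all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        )
        if is_float or (is_int and len(set(present)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat


toy = {
    "grlivarea": list(range(900, 1100)),  # integer, 200 unique values
    "overallqual": [5, 6, 7, 8] * 50,     # integer, 4 unique values
    "lotfrontage": [60.0, None, 70.0],    # float because of missing values
    "centralair": ["Y", "N", "Y"],        # strings are never continuous
    "saleprice": [12.1, 11.9, 12.4],      # dependent variable, skipped
}
cont, cat = split_cont_cat(toy, max_card=100, dep_var="saleprice")
print(cont, cat)  # ['grlivarea', 'lotfrontage'] ['overallqual', 'centralair']
```

The float rule matters for integer-like columns that contain missing values, since pandas stores those as float.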

Creating and displaying the continuous and categorical columns using max_card 100.

contnn, catnn = cont_cat_split(dfnn_tf, max_card=100, dep_var="saleprice")
catnn

['overallqual',
 'garagecars',
 'fullbath',
 'fireplacequ',
 'centralair',
 'yearremodadd',
 'garagecond',
 'fireplaces',
 'totrmsabvgrd']

contnn

['grlivarea',
 'yearbuilt',
 '1stflrsf',
 'garageyrblt',
 'totalbsmtsf',
 'bsmtfinsf1',
 'lotarea',
 'lotfrontage',
 '2ndflrsf']

Print the number of unique values for each of the categorical columns.

dfnn_tf[catnn].nunique().sort_values(ascending=False)

yearremodadd    61
totrmsabvgrd    12
overallqual     10
                ..
fullbath         4
fireplaces       4
centralair       2
Length: 9, dtype: int64

Run TabularPandas Function

None of the boolean columns that indicate whether a value was missing in a row of a column are present in the final training dataset. We therefore pass add_col=False to FillMissing, so that no such indicator columns are added to the tabular object created below. Doing this now helps keep the training and test data compatible if the test data has missing values in columns where the training data doesn't.

procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn = TabularPandas(
    dfnn_tf,
    procsnn,
    catnn,
    contnn,
    splits=(train_s, valid_s),
    y_names="saleprice",
)
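
What FillMissing(add_col=False) and Normalize do to a continuous column can be sketched in plain Python. The sketch assumes median filling and standardization with the mean and standard deviation of the training data. The function names and the toy values are hypothetical.

```python
import statistics


def fit_stats(train_values):
    """Learn the fill value and scaling statistics from the training column only."""
    present = [v for v in train_values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in train_values]
    return {
        "median": median,
        "mean": statistics.fmean(filled),
        "std": statistics.pstdev(filled),
    }


def transform(values, stats):
    """Fill missing values and standardize, without adding an indicator column."""
    filled = [stats["median"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]


train_lotfrontage = [60.0, 80.0, None, 70.0]
test_lotfrontage = [None, 90.0]

stats = fit_stats(train_lotfrontage)
train_scaled = transform(train_lotfrontage, stats)
test_scaled = transform(test_lotfrontage, stats)  # reuses the training statistics
```

Because no indicator column is created, training and test data keep the same set of columns, no matter where the missing values occur.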

Create Dataloaders Object

The dataloaders object holds all training and validation sets with the preprocessed TabularPandas object as input.

dls = tonn.dataloaders(1024)
x_nnt, y = dls.train.xs, dls.train.y
x_val_nnt, y_val = dls.valid.xs, dls.valid.y
y.min(), y.max()

(10.46024227142334, 13.534473419189453)

We fit the RandomForestRegressor on the preprocessed data sets from the dataloaders object and calculate its RMSE values on the training and validation set.

m2 = rff(x_nnt, y)
m_rmse(m2, x_nnt, y), m_rmse(m2, x_val_nnt, y_val)

(0.124612, 0.135281)

Create tabular_learner estimator

Create the tabular_learner object using the dataloaders object from the previous step. The range of the dependent variable saleprice is restricted with y_range to (10.45, 13.55), which just covers the minimum and maximum of the training targets shown above.

learn = tabular_learner(dls, y_range=(10.45, 13.55), n_out=1, loss_func=F.mse_loss)
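
To our understanding, y_range is enforced by passing the last layer's output through a sigmoid that is scaled to the given interval. The sketch below shows that mapping in plain Python. It is an illustration of the idea, not fastai's implementation.

```python
import math


def sigmoid_range(x, low, high):
    """Map any real x into the open interval (low, high) with a scaled sigmoid."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low


low, high = 10.45, 13.55
# Very negative, zero, and very positive raw network outputs.
outputs = [sigmoid_range(x, low, high) for x in (-20.0, 0.0, 20.0)]
```

A raw output of 0 lands in the middle of the interval at 12.0, and extreme raw outputs approach but never reach the bounds. This is why the interval is chosen slightly wider than the observed target values.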

Preprocessing Of The Kaggle Test Dataset

A look at the columns of the DataFrame that holds the independent variables, as given by the Kaggle test dataset. This is the dataset that the final predictions need to be made on.

dfnn_vf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   overallqual   1459 non-null   int64   
 1   grlivarea     1459 non-null   int64   
 2   yearbuilt     1459 non-null   int64   
 3   garagecars    1458 non-null   float64 
 4   1stflrsf      1459 non-null   int64   
 5   fullbath      1459 non-null   int64   
 6   garageyrblt   1381 non-null   float64 
 7   totalbsmtsf   1458 non-null   float64 
 8   fireplacequ   1459 non-null   category
 9   bsmtfinsf1    1458 non-null   float64 
 10  lotarea       1459 non-null   int64   
 11  centralair    1459 non-null   object  
 12  yearremodadd  1459 non-null   int64   
 13  garagecond    1459 non-null   category
 14  lotfrontage   1232 non-null   float64 
 15  fireplaces    1459 non-null   int64   
 16  2ndflrsf      1459 non-null   int64   
 17  totrmsabvgrd  1459 non-null   int64   
dtypes: category(2), float64(5), int64(10), object(1)
memory usage: 185.8+ KB

Looking at a random sample containing 5 rows of the DataFrame.

dfnn_vf.sample(n=5, random_state=seed)

	overallqual	grlivarea	yearbuilt	garagecars	1stflrsf	...	garagecond	lotfrontage	fireplaces	2ndflrsf	totrmsabvgrd
1321	4	864	1950	1.0	864	...	TA	60.0	0	0	5
836	8	2100	2007	3.0	958	...	TA	82.0	2	1142	8
413	5	990	1994	1.0	990	...	TA	65.0	0	0	5
522	8	1342	2006	2.0	1342	...	TA	48.0	1	0	6
1035	6	2422	1954	2.0	2422	...	TA	102.0	2	0	6

5 rows × 18 columns

We apply the same procs that we used for the training dataset in the call to TabularPandas, then create the dataloaders object and assign the independent variables to variable x_valid.

Since there is no dependent variable in this dataset, there is no .y part, which is why parameter y_names is omitted. Because no value is passed for splits, the dataset is not split into training and validation data, and all rows are part of dlsv.train.xs.

To get predictions on the Kaggle test data with the fitted estimator, we call method .new on tonn, the TabularPandas object used for training, and pass it dlsv.train.xs, which holds the Kaggle test data. Calling .process applies the procs to the data, and dls.valid.new creates a dataloader with the test data that can be used for predictions.

procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn_vf = TabularPandas(dfnn_vf, procsnn, catnn, contnn)
dlsv = tonn_vf.dataloaders(1024)
x_valid = dlsv.train.xs
tonn_vfs = tonn.new(dlsv.train.xs)
tonn_vfs.process()
tonn_vfs.items.head()
tonn_vfs_dl = dls.valid.new(tonn_vfs)
tonn_vfs_dl.show_batch()

	overallqual	garagecars	fullbath	fireplacequ	centralair	yearremodadd	garagecond	fireplaces	totrmsabvgrd	grlivarea	yearbuilt	1stflrsf	garageyrblt	totalbsmtsf	bsmtfinsf1	lotarea	lotfrontage	2ndflrsf
0	5	2	2	#na#	#na#	#na#	#na#	1	3	-1.215545	-0.340839	-0.654589	-0.653021	-0.370719	0.063433	0.363854	0.567329	-0.775266
1	6	2	2	#na#	#na#	#na#	#na#	1	4	-0.323504	-0.439751	0.433328	-0.769719	0.639182	1.063523	0.897789	0.615966	-0.775266
2	5	3	3	#na#	#na#	#na#	#na#	2	4	0.294491	0.844006	-0.574169	0.747364	-0.266790	0.773368	0.809397	0.275533	0.891941
3	6	3	3	#na#	#na#	#na#	#na#	2	5	0.243012	0.876977	-0.579218	0.786203	-0.271295	0.357968	0.031786	0.470066	0.837236
4	8	3	3	#na#	#na#	#na#	#na#	1	3	-0.424442	0.679386	0.310173	0.552805	0.528496	-0.387169	-0.971584	-1.232092	-0.775266
5	6	3	3	#na#	#na#	#na#	#na#	2	5	0.348114	0.712356	-0.988706	0.591644	-0.639604	-0.965220	0.036564	0.324165	1.346189
6	6	3	3	#na#	#na#	#na#	#na#	1	4	-0.616098	0.679386	0.076580	0.552805	0.275484	1.089883	-0.370756	-0.064899	-0.775266
7	6	3	3	#na#	#na#	#na#	#na#	2	5	-0.043400	0.876977	-0.923342	0.786203	-0.580829	-0.965220	-0.285948	-0.259431	0.832477
8	7	3	2	#na#	#na#	#na#	#na#	2	3	-0.298774	0.613678	0.463439	0.474945	0.573757	0.434900	0.072399	0.810498	-0.775266
9	4	3	2	#na#	#na#	#na#	#na#	1	2	-1.244439	-0.044803	-0.689749	-0.302924	-0.370719	0.801959	-0.285948	0.081001	-0.775266
