Strategy and techniques for Kaggle competitions, focusing on RandomForestRegressor and fastai deep learning models, including hyperparameter optimization and preprocessing.
Advanced Missing Value Analysis in Tabular Data, Part 1
Decision Tree Feature Selection Methodology, Part 2
RandomForestRegressor Performance Analysis, Part 3
Statistical Interpretation of Tabular Data, Part 4
Addressing the Out-of-Domain Problem in Feature Selection, Part 5
Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6
Hyperparameter Optimization in Deep Learning for Kaggle, Part 7
For the final submission, we train several models and combine their predictions in the form of a weighted ensemble prediction. Estimators from the following model types are included. Number of iterations marks the final number of iterations used for the submission.
RandomForestRegressor from sklearn
XGBRegressor from xgboost
tabular_learner (deep learning model) from fastai
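A weighted ensemble prediction is a weighted average of the per-model predictions. The sketch below is a minimal, dependency-free illustration of that idea. The function name `weighted_ensemble`, the weights and the example predictions are assumptions made for illustration, not the values used for the submission.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model prediction lists into one weighted average per row."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    # Normalize the weights so they sum to 1, then average row by row.
    norm = [w / total for w in weights]
    return [
        sum(w * p[i] for w, p in zip(norm, predictions))
        for i in range(n_rows)
    ]


# Hypothetical log-price predictions of three models for two houses.
rf_preds = [12.0, 11.5]
xgb_preds = [12.2, 11.7]
nn_preds = [12.1, 11.6]
blended = weighted_ensemble([rf_preds, xgb_preds, nn_preds], weights=[1, 2, 1])
```

Normalizing the weights inside the function means they can be given on any scale, for example as inverse validation RMSE values.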
The hyperparameter optimization for each of them is:
RandomForestRegressor
    nestimators - number of estimators to use.
    max_samples - maximum number of samples to use for training a single base estimator (tree).
tabular_learner
    lr (learning rate) - values tested depend on lr_find output.
    epochs - number of epochs to train.
XGBRegressor
    RandomizedSearchCV from sklearn with 1400 iterations and 8-fold cross-validation for each iteration, using a parameter distribution dictionary.
So far, the focus has been on fitting estimators for interpretability, not for the lowest RMSE value. The Kaggle competition we submit our final predictions to, however, scores each submission solely on its RMSE value on the test set. We therefore tune hyperparameters, starting with few iterations and checking the resulting RMSE values, then building up to as many iterations as our hardware can handle within roughly 5 minutes. We stop adding iterations earlier if the RMSE values stop improving despite the larger number of iterations.
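The stopping rule described above (grow the number of iterations until either the time budget is exhausted or the RMSE stops improving) could be sketched as follows. The names `grow_search`, `run_search`, `min_gain` and the toy search function are hypothetical. The 300-second default mirrors the 5-minute budget mentioned in the text.

```python
import time


def grow_search(run_search, budgets, max_seconds=300.0, min_gain=1e-4):
    """Run run_search(n_iter) for growing budgets; return (best_rmse, best_n_iter)."""
    best_rmse, best_n = float("inf"), None
    for n_iter in budgets:
        start = time.perf_counter()
        rmse = run_search(n_iter)
        elapsed = time.perf_counter() - start
        improved = best_rmse - rmse > min_gain
        if rmse < best_rmse:
            best_rmse, best_n = rmse, n_iter
        # Stop when more iterations no longer help or a run takes too long.
        if not improved or elapsed > max_seconds:
            break
    return best_rmse, best_n


def toy_search(n_iter):
    # Stand-in for a real hyperparameter search: RMSE flattens out at 0.13.
    return max(0.13, 0.20 - 0.001 * n_iter)


best_rmse, best_n = grow_search(toy_search, budgets=[10, 20, 50, 100, 200])
```

In practice `run_search` would wrap a call such as fitting a RandomizedSearchCV with `n_iter` iterations and returning the validation RMSE.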
Using a manually created test harness, the RMSE values on the training and
validation set for each parameter combination are appended to the lists m_rmsel
and m_rmselv respectively, and the function returns both lists.
def rf2(
    xs_final=xs_final,
    y=y,
    valid_xs_final=valid_xs_final,
    valid_y=valid_y,
    nestimators=[60, 50, 40, 30, 20],
    max_samples=[200, 300, 400, 500, 600, 700],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    from itertools import product

    m_rmsel = []  # (training RMSE, n_estimators, max_samples) per setup
    m_rmselv = []  # (validation RMSE, n_estimators, max_samples) per setup
    # Try every combination of n_estimators and max_samples.
    for n_est, max_samp in product(nestimators, max_samples):
        mt = RandomForestRegressor(
            n_jobs=-1,
            n_estimators=n_est,
            max_samples=max_samp,
            max_features=max_features,
            min_samples_leaf=min_samples_leaf,
            oob_score=True,
            random_state=seed,
        ).fit(xs_final, y)
        m_rmsel.append((m_rmse(mt, xs_final, y), n_est, max_samp))
        m_rmselv.append((m_rmse(mt, valid_xs_final, valid_y), n_est, max_samp))
    return m_rmsel, m_rmselv
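The helper m_rmse comes from an earlier part of the series and is not shown here. Below is a minimal sketch of what such a helper might compute, assuming it takes a fitted model with a predict method and returns the rounded root mean squared error. The names `r_mse_sketch`, `m_rmse_sketch` and the `MeanModel` stand-in are hypothetical.

```python
import math


def r_mse_sketch(pred, y):
    """Root mean squared error between predictions and targets, rounded to 6 digits."""
    squared = [(p - t) ** 2 for p, t in zip(pred, y)]
    return round(math.sqrt(sum(squared) / len(squared)), 6)


def m_rmse_sketch(model, xs, y):
    """RMSE of model.predict(xs) against y."""
    return r_mse_sketch(model.predict(xs), y)


class MeanModel:
    """Toy stand-in for a fitted estimator: always predicts the training mean."""

    def fit(self, xs, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


xs_toy = [[1], [2], [3], [4]]
y_toy = [1.0, 2.0, 3.0, 4.0]
toy_rmse = m_rmse_sketch(MeanModel().fit(xs_toy, y_toy), xs_toy, y_toy)
```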
We run the manual hyperparameter optimization and assign the outputs to
m_rmset and m_rmsev respectively.
m_rmset, m_rmsev = rf2()
The evaluation is done by creating a DataFrame and grouping it by m_rmsev with
the pandas .groupby method, followed by the aggregation method .agg with the
minimum. Since groupby sorts the group keys in ascending order, the first row of
the resulting grouped_opt DataFrame holds the lowest validation RMSE, and we
choose the parameter combination found there.
dfm_rmsev = pd.DataFrame(m_rmsev, columns=["m_rmsev", "n_estimators", "max_samples"])
grouped_opt = dfm_rmsev.groupby(by="m_rmsev").agg("min")
grouped_opt.iloc[:5, :]
| m_rmsev | n_estimators | max_samples |
|---|---|---|
| 0.138596 | 60 | 600 |
| 0.139147 | 50 | 600 |
| 0.139720 | 40 | 600 |
| 0.140007 | 60 | 700 |
| 0.140081 | 30 | 600 |
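Because each entry of m_rmsev is a tuple that starts with the validation RMSE, the best combination can also be read off without a groupby. The small sketch below uses the five rows of the table above as sample data; the variable names are hypothetical.

```python
# (validation RMSE, n_estimators, max_samples), copied from the table above.
sample_results = [
    (0.139147, 50, 600),
    (0.140007, 60, 700),
    (0.138596, 60, 600),
    (0.140081, 30, 600),
    (0.139720, 40, 600),
]

# Tuples compare element by element, so min() picks the lowest RMSE first.
best_rmse, best_n_estimators, best_max_samples = min(sample_results)

# Full ranking, best first, equivalent to sorting the table by m_rmsev.
ranking = sorted(sample_results)
```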
To avoid using a parameter combination that is not the optimal one for the given
execution of the code, we assign the optimal values for n_estimators and
max_samples directly from the first row of grouped_opt instead of hard-coding
them.
Function rff fits a RandomForestRegressor with the optimal parameter values
found by the hyperparameter optimization procedure outlined above, regardless of
the execution.
def rff(
    xs,
    y,
    n_estimators=grouped_opt.iloc[0, 0],
    max_samples=grouped_opt.iloc[0, 1],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        random_state=seed,
    ).fit(xs, y)
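One detail worth keeping in mind: Python evaluates default argument values once, when the def statement runs, so rff picks up whatever grouped_opt holds at that moment. A small sketch of this behaviour, with hypothetical names:

```python
best_params = {"n_estimators": 60}


def fit_with_default(n_estimators=best_params["n_estimators"]):
    """The default is captured when the function is defined."""
    return n_estimators


best_params["n_estimators"] = 50  # a later search finds a different optimum
captured = fit_with_default()  # still uses the value captured at def time
explicit = fit_with_default(best_params["n_estimators"])  # pass it to stay current
```

If the search is rerun, redefining the function or passing the values explicitly keeps the fitted estimator in sync with the latest results.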
Executing rff and evaluating the fitted estimator gives the RMSE values on the training and validation set.
m = rff(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.124334, 0.138596)
While dropping garagearea resulted in a slightly higher accuracy using
RandomForestRegressor on the validation set, the increase was marginal. Let’s
see what the results are using neural networks.
The original csv files are imported, and we show how to apply the preprocessing
steps using TabularPandas from the fastai library.
First, we create the DataFrames for fitting the deep learning model.
nn_t = base + "/" + "my_competitions/kaggle_competition_house_prices/data/train.csv"
nn_v = base + "/" + "my_competitions/kaggle_competition_house_prices/data/test.csv"
dfnn_t = pd.read_csv(nn_t, low_memory=False).clean_names()
dfnn_v = pd.read_csv(nn_v, low_memory=False).clean_names()
print(len(dfnn_v))
dfnn_v.columns[:3]
1459
Index(['id', 'mssubclass', 'mszoning'], dtype='object')
Assigning the ordered categorical columns to the data, as we did before for the
tree-based models in a previous part. See Deep Dive Tabular Data Part 1.
dfnn_t = cu(dfnn_t, uset, usetna)
dfnn_v = cu(dfnn_v, uset, usetna)
dfnn_t = tl(dfnn_t)
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagequal Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
The tl call above applies the log function to the dependent variable saleprice.
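Training on the log of saleprice means that predictions come out on the log scale and have to be transformed back before they can be read as prices. Below is a minimal sketch of the transform and its inverse, assuming the natural logarithm. The function names and the example price are hypothetical.

```python
import math


def to_log(price):
    """Map a sale price to the log scale used for training."""
    return math.log(price)


def from_log(log_price):
    """Map a log-scale prediction back to a price."""
    return math.exp(log_price)


example_price = 154500.0  # hypothetical sale price
log_price = to_log(example_price)
round_trip = from_log(log_price)
```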
Only use the columns that were left in the dataset after analyzing the contribution of each of the columns in the previous section.
dfnn_tf = dfnn_t[
    xs_final_ext.columns.tolist() + ["saleprice"]
]  # _tf stands for train and final (train dataset from kaggle)
dfnn_vf = dfnn_v[
    xs_final_ext.columns.tolist()
]  # _vf stands for validation final (test dataset from kaggle)
print(len(dfnn_vf))
dfnn_tf.sample(n=3, random_state=seed)
1459
| | overallqual | grlivarea | yearbuilt | garagecars | 1stflrsf | ... | lotfrontage | fireplaces | 2ndflrsf | totrmsabvgrd | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 6 | 1068 | 1963 | 1 | 1068 | ... | 70.0 | 0 | 0 | 6 | 11.947949 | 
| 1105 | 8 | 2622 | 1994 | 2 | 1500 | ... | 98.0 | 2 | 1122 | 9 | 12.691580 | 
| 413 | 5 | 1028 | 1927 | 2 | 1028 | ... | 56.0 | 1 | 0 | 5 | 11.652687 | 
3 rows × 19 columns
Verify that the number of columns in dfnn_tf is correct.
len(dfnn_tf.columns)
19
Values of max_card in the range between 2 and 100 are tested with cont_cat_split. The output is hidden for readability.
for i in range(2, 101):
    contnn, catnn = cont_cat_split(dfnn_tf, max_card=i, dep_var="saleprice")
#    print(f"{len(contnn)}, {i}: {contnn}")
Judging from that output, it is hard to find a column in the dataset that can be clearly identified as having continuous values, so only columns with more than 100 unique values are assigned as being continuous (float columns are treated as continuous by cont_cat_split regardless of their number of unique values). The final continuous columns are printed below. The output has the following format:
(x,y,z)
x := Number of type continuous columns, given threshold value y
y := Minimum for number of unique values, for a column to be assigned type continuous
z := List of names of columns assigned type continuous
Example given below:
>>> 9, 100: ['grlivarea', 'yearbuilt', '1stflrsf', 'garageyrblt', 'totalbsmtsf',
                'bsmtfinsf1', 'lotarea', 'lotfrontage', '2ndflrsf']
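The thresholding rule can be illustrated without fastai. The sketch below is a simplified restatement of the rule, assuming that integer columns with more than max_card unique values, and all float columns, are treated as continuous. It is not fastai's implementation, and the function name and toy columns are hypothetical.

```python
def split_by_cardinality(columns, max_card, dep_var=None):
    """Split a dict of column name -> list of values into (continuous, categorical)."""
    cont, cat = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the target is neither a continuous nor a categorical input
        present = [v for v in values if v is not None]
        is_float = any(isinstance(v, float) for v in present)
        is_int = bool(present) and all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        )
        if is_float or (is_int and len(set(present)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat


toy = {
    "grlivarea": list(range(800, 1000)),  # 200 distinct integers
    "overallqual": [5, 6, 7, 8] * 50,  # 4 distinct integers
    "lotfrontage": [60.0, 70.0, None] * 10,  # floats with missing values
    "centralair": ["Y", "N"] * 100,  # strings
    "saleprice": [12.0] * 200,  # dependent variable, skipped
}
cont_toy, cat_toy = split_by_cardinality(toy, max_card=100, dep_var="saleprice")
```

Lowering max_card moves more integer columns into the continuous group, which is the effect the loop over the values 2 to 100 explores.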
Creating and displaying the continuous and categorical columns using max_card
100.
contnn, catnn = cont_cat_split(dfnn_tf, max_card=100, dep_var="saleprice")
catnn
['overallqual',
 'garagecars',
 'fullbath',
 'fireplacequ',
 'centralair',
 'yearremodadd',
 'garagecond',
 'fireplaces',
 'totrmsabvgrd']
contnn
['grlivarea',
 'yearbuilt',
 '1stflrsf',
 'garageyrblt',
 'totalbsmtsf',
 'bsmtfinsf1',
 'lotarea',
 'lotfrontage',
 '2ndflrsf']
Print the number of unique values for each of the categorical columns.
dfnn_tf[catnn].nunique().sort_values(ascending=False)
yearremodadd    61
totrmsabvgrd    12
overallqual     10
                ..
fullbath         4
fireplaces       4
centralair       2
Length: 9, dtype: int64
None of the boolean columns that indicate whether a value was missing in a given row of a column are present in the final training dataset. We therefore pass add_col=False to FillMissing, so that no such indicator columns are created in the tabular object below. Doing this now keeps the training and test data compatible if the test data has missing values in columns where the training data has none.
procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn = TabularPandas(
    dfnn_tf,
    procsnn,
    catnn,
    contnn,
    splits=(train_s, valid_s),
    y_names="saleprice",
)
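FillMissing replaces missing values in continuous columns, by default with the median, and with add_col=False it does not create the extra boolean indicator column. The sketch below illustrates the idea in plain Python, with the fill value computed on the training column and reused for the test column. The function names and toy values are hypothetical, and this is not fastai's implementation.

```python
from statistics import median


def fit_fill_value(train_values):
    """Median of the non-missing training values."""
    return median(v for v in train_values if v is not None)


def fill_missing(values, fill_value, add_col=False):
    """Replace None with fill_value; optionally also return a was-missing flag column."""
    filled = [fill_value if v is None else v for v in values]
    if add_col:
        return filled, [v is None for v in values]
    return filled


train_lotfrontage = [60.0, 80.0, 70.0, 65.0, 75.0]  # no missing values in training
test_lotfrontage = [82.0, None, 48.0]  # missing value only in the test data

fill_value = fit_fill_value(train_lotfrontage)
test_filled = fill_missing(test_lotfrontage, fill_value)
```

With add_col=True the test data would gain a flag column that the training data never had, which is the incompatibility the text describes.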
The dataloaders object is created from the preprocessed TabularPandas object and holds the training and validation sets. The batch size is set to 1024.
dls = tonn.dataloaders(1024)
x_nnt, y = dls.train.xs, dls.train.y
x_val_nnt, y_val = dls.valid.xs, dls.valid.y
y.min(), y.max()
(10.46024227142334, 13.534473419189453)
Fit a RandomForestRegressor with rff on the datasets taken from the dataloaders object and calculate the RMSE values on the training and validation set.
m2 = rff(x_nnt, y)
m_rmse(m2, x_nnt, y), m_rmse(m2, x_val_nnt, y_val)
(0.124612, 0.135281)
Create the tabular_learner object using the dataloaders object from
the previous step. The output range for the dependent variable saleprice is
restricted via y_range to an interval just wider than the observed minimum and
maximum (10.46 and 13.53).
learn = tabular_learner(dls, y_range=(10.45, 13.55), n_out=1, loss_func=F.mse_loss)
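With y_range set, fastai squashes the final activation into the given interval with a scaled sigmoid, so predictions cannot fall outside it. Below is a minimal sketch of that mapping in plain Python. The function name mirrors fastai's sigmoid_range, but this is a scalar re-implementation for illustration only.

```python
import math


def sigmoid_range(x, low, high):
    """Squash any real number x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low


low, high = 10.45, 13.55  # the y_range passed to tabular_learner
outputs = [sigmoid_range(x, low, high) for x in (-5.0, 0.0, 5.0)]
midpoint = sigmoid_range(0.0, low, high)
```

Choosing the interval slightly wider than the observed targets matters because the sigmoid only approaches its bounds asymptotically.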
A look at the columns of the DataFrame that holds the independent variables, as given by the Kaggle test dataset. This is the dataset that the final predictions need to be made on.
dfnn_vf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   overallqual   1459 non-null   int64   
 1   grlivarea     1459 non-null   int64   
 2   yearbuilt     1459 non-null   int64   
 3   garagecars    1458 non-null   float64 
 4   1stflrsf      1459 non-null   int64   
 5   fullbath      1459 non-null   int64   
 6   garageyrblt   1381 non-null   float64 
 7   totalbsmtsf   1458 non-null   float64 
 8   fireplacequ   1459 non-null   category
 9   bsmtfinsf1    1458 non-null   float64 
 10  lotarea       1459 non-null   int64   
 11  centralair    1459 non-null   object  
 12  yearremodadd  1459 non-null   int64   
 13  garagecond    1459 non-null   category
 14  lotfrontage   1232 non-null   float64 
 15  fireplaces    1459 non-null   int64   
 16  2ndflrsf      1459 non-null   int64   
 17  totrmsabvgrd  1459 non-null   int64   
dtypes: category(2), float64(5), int64(10), object(1)
memory usage: 185.8+ KB
Looking at a random sample containing 5 rows of the DataFrame.
dfnn_vf.sample(n=5, random_state=seed)
| | overallqual | grlivarea | yearbuilt | garagecars | 1stflrsf | ... | garagecond | lotfrontage | fireplaces | 2ndflrsf | totrmsabvgrd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1321 | 4 | 864 | 1950 | 1.0 | 864 | ... | TA | 60.0 | 0 | 0 | 5 | 
| 836 | 8 | 2100 | 2007 | 3.0 | 958 | ... | TA | 82.0 | 2 | 1142 | 8 | 
| 413 | 5 | 990 | 1994 | 1.0 | 990 | ... | TA | 65.0 | 0 | 0 | 5 | 
| 522 | 8 | 1342 | 2006 | 2.0 | 1342 | ... | TA | 48.0 | 1 | 0 | 6 | 
| 1035 | 6 | 2422 | 1954 | 2.0 | 2422 | ... | TA | 102.0 | 2 | 0 | 6 | 
5 rows × 18 columns
We apply the same procs we used for the training dataset in the call to
TabularPandas, then create the dataloaders object and assign the independent
variables to the variable x_valid.
Since there is no dependent variable in this dataset, there is no .y part, which
is why the parameter y_names is omitted. Because no value is passed for splits,
the dataset is not split into training and validation data, and all rows end up
in dlsv.train.xs.
To get predictions for the Kaggle test data from the fitted estimator, we call
method .new on the TabularPandas object used for training (tonn) and pass it
dlsv.train.xs, which holds the Kaggle test data despite the name train. The data
is then processed with .process, and a dataloader for the test data is created
from dls.valid for predictions.
procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn_vf = TabularPandas(dfnn_vf, procsnn, catnn, contnn)
dlsv = tonn_vf.dataloaders(1024)
x_valid = dlsv.train.xs
tonn_vfs = tonn.new(dlsv.train.xs)
tonn_vfs.process()
tonn_vfs.items.head()
tonn_vfs_dl = dls.valid.new(tonn_vfs)
tonn_vfs_dl.show_batch()
| | overallqual | garagecars | fullbath | fireplacequ | centralair | yearremodadd | garagecond | fireplaces | totrmsabvgrd | grlivarea | yearbuilt | 1stflrsf | garageyrblt | totalbsmtsf | bsmtfinsf1 | lotarea | lotfrontage | 2ndflrsf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 2 | 2 | #na# | #na# | #na# | #na# | 1 | 3 | -1.215545 | -0.340839 | -0.654589 | -0.653021 | -0.370719 | 0.063433 | 0.363854 | 0.567329 | -0.775266 | 
| 1 | 6 | 2 | 2 | #na# | #na# | #na# | #na# | 1 | 4 | -0.323504 | -0.439751 | 0.433328 | -0.769719 | 0.639182 | 1.063523 | 0.897789 | 0.615966 | -0.775266 | 
| 2 | 5 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 4 | 0.294491 | 0.844006 | -0.574169 | 0.747364 | -0.266790 | 0.773368 | 0.809397 | 0.275533 | 0.891941 | 
| 3 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | 0.243012 | 0.876977 | -0.579218 | 0.786203 | -0.271295 | 0.357968 | 0.031786 | 0.470066 | 0.837236 | 
| 4 | 8 | 3 | 3 | #na# | #na# | #na# | #na# | 1 | 3 | -0.424442 | 0.679386 | 0.310173 | 0.552805 | 0.528496 | -0.387169 | -0.971584 | -1.232092 | -0.775266 | 
| 5 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | 0.348114 | 0.712356 | -0.988706 | 0.591644 | -0.639604 | -0.965220 | 0.036564 | 0.324165 | 1.346189 | 
| 6 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 1 | 4 | -0.616098 | 0.679386 | 0.076580 | 0.552805 | 0.275484 | 1.089883 | -0.370756 | -0.064899 | -0.775266 | 
| 7 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | -0.043400 | 0.876977 | -0.923342 | 0.786203 | -0.580829 | -0.965220 | -0.285948 | -0.259431 | 0.832477 | 
| 8 | 7 | 3 | 2 | #na# | #na# | #na# | #na# | 2 | 3 | -0.298774 | 0.613678 | 0.463439 | 0.474945 | 0.573757 | 0.434900 | 0.072399 | 0.810498 | -0.775266 | 
| 9 | 4 | 3 | 2 | #na# | #na# | #na# | #na# | 1 | 2 | -1.244439 | -0.044803 | -0.689749 | -0.302924 | -0.370719 | 0.801959 | -0.285948 | 0.081001 | -0.775266 | 
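Deriving the test object from the training TabularPandas object matters because a step like Normalize should apply the mean and standard deviation learned on the training set, not statistics recomputed on the test set. Below is a minimal sketch of that idea in plain Python. The function names and toy values are hypothetical, and this is not fastai's implementation.

```python
from statistics import mean, pstdev


def fit_normalizer(train_values):
    """Return (mean, std) computed on the training column only."""
    return mean(train_values), pstdev(train_values)


def normalize(values, stats, eps=1e-7):
    """Standardize values with previously fitted training statistics."""
    mu, sigma = stats
    return [(v - mu) / (sigma + eps) for v in values]


train_grlivarea = [1000.0, 1500.0, 2000.0]
test_grlivarea = [1500.0, 2500.0]

stats = fit_normalizer(train_grlivarea)  # fitted once, on training data
test_scaled = normalize(test_grlivarea, stats)  # reused unchanged for test data
```

A test value equal to the training mean maps to zero, and values outside the training range simply map to larger magnitudes instead of being rescaled to fit.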