Strategy and techniques for Kaggle competitions, focusing on RandomForestRegressor and fastai deep learning models, including hyperparameter optimization and preprocessing.
Advanced Missing Value Analysis in Tabular Data, Part 1
Decision Tree Feature Selection Methodology, Part 2
RandomForestRegressor Performance Analysis, Part 3
Statistical Interpretation of Tabular Data, Part 4
Addressing the Out-of-Domain Problem in Feature Selection, Part 5
Kaggle Challenge Strategy: RandomForestRegressor and Deep Learning, Part 6
Hyperparameter Optimization in Deep Learning for Kaggle, Part 7
For the final submission, we train several models and combine their predictions in the form of a weighted ensemble prediction. Estimators from the following model types are included. Number of iterations marks the final number of iterations used for the submission.
- RandomForestRegressor from sklearn
- XGBRegressor from xgboost
- tabular_learner (deep learning model) from fastai
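To make the idea of a weighted ensemble prediction concrete, here is a minimal, dependency-free sketch. The function name `weighted_ensemble`, the example predictions and the weights are all hypothetical and only serve as an illustration. They are not the values used for the submission.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one weighted average per row."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    # Normalize so the weights sum to 1, then average row by row.
    norm = [w / total for w in weights]
    return [
        sum(w * p[i] for w, p in zip(norm, predictions))
        for i in range(n_rows)
    ]


# Hypothetical log-price predictions from three models for two houses.
rf_preds = [11.90, 12.70]
xgb_preds = [12.00, 12.60]
nn_preds = [11.95, 12.65]
ensemble = weighted_ensemble([rf_preds, xgb_preds, nn_preds], weights=[1, 2, 1])
```

Here the second model counts twice as much as the other two. In practice the weights would be chosen based on each model's validation RMSE.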
The hyperparameters optimized for each of them are:
- RandomForestRegressor
  - nestimators - number of estimators to use.
  - max_samples - maximum number of samples to use for training a single base estimator (tree).
- tabular_learner
  - lr (learning rate) - values tested depend on the lr_find output.
  - epochs - number of epochs to train.
- XGBRegressor
  - RandomizedSearchCV from sklearn with 1400 iterations and 8-fold cross-validation for each iteration, using a parameter distribution dictionary.

So far, the focus has been on fitting estimators for interpretability, not for the lowest RMSE value. The Kaggle competition we want to submit our final predictions to, however, scores each submission solely on its RMSE value on the test set. We therefore need estimators that are the result of hyperparameter tuning. We start with a few iterations and check the resulting RMSE values, then build up to as many iterations as our hardware can handle within a reasonable duration of roughly 5 minutes. We stop adding iterations earlier if the RMSE values stop improving despite the increased number of iterations.
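The stopping rule described above can be sketched as a small loop. This is only an illustration under assumptions: `tune_with_budget`, its parameters and the `evaluate` callback (which is expected to run a search with the given number of iterations and return the validation RMSE) are hypothetical names, not part of the code used in this series.

```python
import time


def tune_with_budget(evaluate, start_iters=10, max_seconds=300.0, min_gain=1e-4):
    """Double the iteration count until the time budget is spent or RMSE stalls."""
    n_iters = start_iters
    best_rmse, best_iters = float("inf"), None
    history = []
    started = time.monotonic()
    while True:
        rmse = evaluate(n_iters)
        history.append((n_iters, rmse))
        gain = best_rmse - rmse  # improvement over the best run so far
        if rmse < best_rmse:
            best_rmse, best_iters = rmse, n_iters
        if gain < min_gain:
            break  # RMSE stopped improving
        if time.monotonic() - started > max_seconds:
            break  # budget is checked between runs, not during a run
        n_iters *= 2
    return best_iters, best_rmse, history


# Hypothetical stand-in for a real search: RMSE falls, then plateaus at 0.138.
def fake_search(n_iters):
    return max(0.138, 0.2 - 0.001 * n_iters)


best_iters, best_rmse, history = tune_with_budget(fake_search)
```

With the stand-in search, the loop stops at the first run that no longer improves on the best RMSE and reports the smaller iteration count that reached it.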
Using a manually created test harness, the RMSE values for each iteration on the training and validation set are appended to the lists m_rmsel and m_rmselv respectively, and it is these lists that are returned by the function.
def rf2(
    xs_final=xs_final,
    y=y,
    valid_xs_final=valid_xs_final,
    valid_y=valid_y,
    nestimators=[60, 50, 40, 30, 20],
    max_samples=[200, 300, 400, 500, 600, 700],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    from itertools import product
    m_rmsel = []
    m_rmselv = []
    setups = product(nestimators, max_samples)
    for ne in setups:
        mt = RandomForestRegressor(
            n_jobs=-1,
            n_estimators=ne[0],
            max_samples=ne[1],
            max_features=max_features,
            min_samples_leaf=min_samples_leaf,
            oob_score=True,
            random_state=seed,
        ).fit(xs_final, y)
        m_rmsel.append((m_rmse(mt, xs_final, y), ne[0], ne[1]))
        m_rmselv.append((m_rmse(mt, valid_xs_final, valid_y), ne[0], ne[1]))
    return m_rmsel, m_rmselv
We run the manual hyperparameter optimization and assign the outputs to m_rmset and m_rmsev respectively.
m_rmset, m_rmsev = rf2()
The evaluation is done by creating a DataFrame and then using the pandas .groupby method along with the aggregation method .agg, where we aggregate by the minimum over each m_rmsev value. Because .groupby sorts the group keys in ascending order by default, the first row of the resulting grouped_opt DataFrame holds the lowest validation RMSE. We choose the parameter combination found in that first row.
dfm_rmsev = pd.DataFrame(m_rmsev, columns=["m_rmsev", "n_estimators", "max_samples"])
grouped_opt = dfm_rmsev.groupby(by="m_rmsev").agg("min")
grouped_opt.iloc[:5, :]
m_rmsev | n_estimators | max_samples |
---|---|---|
0.138596 | 60 | 600 |
0.139147 | 50 | 600 |
0.139720 | 40 | 600 |
0.140007 | 60 | 700 |
0.140081 | 30 | 600 |
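As a cross-check that does not depend on pandas, the same selection can be sketched with plain tuples. Python compares tuples element by element, so `min` on `(rmse, n_estimators, max_samples)` tuples returns the entry with the lowest RMSE. The list below only contains the five rows shown in the table above, in shuffled order. The names `best_rmse`, `best_n_estimators`, `best_max_samples` and `ranked` are introduced here for illustration.

```python
# Validation results as (rmse, n_estimators, max_samples) tuples,
# copied from the five rows of the table above.
m_rmsev = [
    (0.140007, 60, 700),
    (0.139147, 50, 600),
    (0.138596, 60, 600),
    (0.140081, 30, 600),
    (0.139720, 40, 600),
]

# Tuples are compared by their first element first, so this is the lowest RMSE.
best_rmse, best_n_estimators, best_max_samples = min(m_rmsev)

# Sorting gives the same ranking as the grouped DataFrame.
ranked = sorted(m_rmsev)
```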
To avoid using a parameter combination that is not the optimal one for the given execution of the code, we do not hard-code the values. Instead, we assign the optimal values for n_estimators and max_samples directly by position from the first row of grouped_opt.
Function rff will fit a RandomForestRegressor with the optimal parameter values, as found by the hyperparameter optimization procedure outlined above, regardless of execution number.
def rff(
    xs,
    y,
    n_estimators=grouped_opt.iloc[0, 0],
    max_samples=grouped_opt.iloc[0, 1],
    max_features=0.5,
    min_samples_leaf=5,
    **kwargs,
):
    return RandomForestRegressor(
        n_jobs=-1,
        n_estimators=n_estimators,
        max_samples=max_samples,
        max_features=max_features,
        min_samples_leaf=min_samples_leaf,
        oob_score=True,
        random_state=seed,
    ).fit(xs, y)
Executing function rff, we get the RMSE values for the fitted estimator.
m = rff(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
(0.124334, 0.138596)
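The helper m_rmse is defined in an earlier part of the series and is not repeated here. For readers who land on this part first, the following is a minimal sketch of what such a helper could look like, assuming it follows the common pattern of computing the root mean squared error of a model's predictions and rounding to six decimals. `MeanModel` is a hypothetical stand-in for a fitted estimator.

```python
import math


def r_mse(pred, y):
    """Root mean squared error, rounded to six decimals."""
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, y)]
    return round(math.sqrt(sum(squared_errors) / len(squared_errors)), 6)


def m_rmse(m, xs, y):
    """RMSE of a fitted model m, which must expose a predict method."""
    return r_mse(m.predict(xs), y)


class MeanModel:
    """Hypothetical stand-in for a fitted estimator: always predicts the mean."""

    def fit(self, xs, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


xs = [[1], [2], [3], [4]]
y = [11.0, 12.0, 12.0, 13.0]
model = MeanModel().fit(xs, y)
score = m_rmse(model, xs, y)
```

Since the dependent variable is on a log scale, an RMSE computed this way measures errors in log price.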
While dropping garagearea resulted in a slightly higher accuracy using RandomForestRegressor on the validation set, the increase was marginal. Let’s see what the results are using neural networks.
The original csv files are imported, and we show how to apply the preprocessing steps using the TabularPandas class from the fastai library.
We start by creating the DataFrames for fitting the deep learning model.
nn_t = base + "/" + "my_competitions/kaggle_competition_house_prices/data/train.csv"
nn_v = base + "/" + "my_competitions/kaggle_competition_house_prices/data/test.csv"
dfnn_t = pd.read_csv(nn_t, low_memory=False).clean_names()
dfnn_v = pd.read_csv(nn_v, low_memory=False).clean_names()
print(len(dfnn_v))
dfnn_v.columns[:3]
1459
Index(['id', 'mssubclass', 'mszoning'], dtype='object')
Assigning the ordered categorical columns to the data, as we did before for the tree-based models in a previous part. See Deep Dive Tabular Data Part 1.
dfnn_t = cu(dfnn_t, uset, usetna)
dfnn_v = cu(dfnn_v, uset, usetna)
dfnn_t = tl(dfnn_t)
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagequal Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
extercond Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
heatingqc Index(['Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
fireplacequ Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
garagecond Index(['FM', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], dtype='object')
Applying the log function to the dependent variable saleprice.
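The helper that applies the log transformation comes from an earlier part and is not shown here. As a minimal sketch, assuming it is equivalent to taking the natural logarithm of saleprice, the transformation and its inverse look like this. The prices are made-up example values.

```python
import math

sale_prices = [154500.0, 250000.0, 100000.0]  # hypothetical prices in dollars

# Forward transform: the models are trained on log prices.
log_prices = [math.log(p) for p in sale_prices]

# Inverse transform: predictions on the log scale are mapped back to dollars.
restored = [math.exp(v) for v in log_prices]
```

The inverse step matters at submission time, assuming the submission file expects prices in the original unit rather than on the log scale.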
Only use the columns that were left in the dataset after analyzing the contribution of each of the columns in the previous section.
dfnn_tf = dfnn_t[
    xs_final_ext.columns.tolist() + ["saleprice"]
]  # _tf stands for train and final (train dataset from kaggle)
dfnn_vf = dfnn_v[
    xs_final_ext.columns.tolist()
]  # _vf stands for validation final (test dataset from kaggle)
print(len(dfnn_vf))
dfnn_tf.sample(n=3, random_state=seed)
1459
| | overallqual | grlivarea | yearbuilt | garagecars | 1stflrsf | ... | lotfrontage | fireplaces | 2ndflrsf | totrmsabvgrd | saleprice |
---|---|---|---|---|---|---|---|---|---|---|---|
892 | 6 | 1068 | 1963 | 1 | 1068 | ... | 70.0 | 0 | 0 | 6 | 11.947949 |
1105 | 8 | 2622 | 1994 | 2 | 1500 | ... | 98.0 | 2 | 1122 | 9 | 12.691580 |
413 | 5 | 1028 | 1927 | 2 | 1028 | ... | 56.0 | 1 | 0 | 5 | 11.652687 |
3 rows × 19 columns
Verify that the number of columns in dfnn_tf is correct.
len(dfnn_tf.columns)
19
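Values of max_card in the range between 2 and 100 are tested with cont_cat_split, to see how the split into continuous and categorical columns changes. Output is hidden for readability.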
for i in range(2, 101):
    contnn, catnn = cont_cat_split(dfnn_tf, max_card=i, dep_var="saleprice")
    # print(f"{len(contnn)}, {i}: {contnn}")
Judging by the (hidden) output above, it is hard to find a column in the dataset that can be clearly identified as having continuous values. We therefore only assign columns with more than 100 unique values to the continuous group. The final continuous columns are printed below. Each line of the output has the format x, y: z, where
x := Number of columns assigned type continuous, given threshold value y
y := Threshold (max_card) for the number of unique values above which a column is assigned type continuous
z := List of names of columns assigned type continuous
Example given below:
>>> 9, 100: ['grlivarea', 'yearbuilt', '1stflrsf', 'garageyrblt', 'totalbsmtsf',
'bsmtfinsf1', 'lotarea', 'lotfrontage', '2ndflrsf']
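To illustrate how a cardinality threshold separates the two groups, here is a simplified, dependency-free sketch. It follows the rule as we understand it from fastai's cont_cat_split: float columns are always treated as continuous, integer columns only if they have more than max_card unique values, and everything else is categorical. `split_cont_cat` and the toy data are hypothetical and this is not the library code.

```python
def split_cont_cat(columns, max_card, dep_var=None):
    """Split column names into continuous and categorical lists.

    columns maps a column name to a list of values (None marks a missing value).
    """
    cont, cat = [], []
    for name, values in columns.items():
        if name == dep_var:
            continue  # the dependent variable belongs to neither group
        present = [v for v in values if v is not None]
        is_float = any(isinstance(v, float) for v in present)
        is_int = bool(present) and all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        )
        if is_float or (is_int and len(set(present)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat


toy = {
    "grlivarea": [1068, 2622, 1028, 990],     # integers, 4 unique values
    "fireplaces": [0, 2, 1, 0],               # integers, 3 unique values
    "lotfrontage": [70.0, 98.0, 56.0, None],  # floats are always continuous
    "centralair": ["Y", "Y", "N", "Y"],       # strings are always categorical
    "saleprice": [11.9, 12.7, 11.7, 11.8],    # dependent variable, skipped
}
cont, cat = split_cont_cat(toy, max_card=3, dep_var="saleprice")
```

With max_card=3, grlivarea lands in the continuous group because it has four unique values, while fireplaces with three unique values stays categorical.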
Creating and displaying the continuous and categorical columns using max_card=100.
contnn, catnn = cont_cat_split(dfnn_tf, max_card=100, dep_var="saleprice")
catnn
['overallqual',
'garagecars',
'fullbath',
'fireplacequ',
'centralair',
'yearremodadd',
'garagecond',
'fireplaces',
'totrmsabvgrd']
contnn
['grlivarea',
'yearbuilt',
'1stflrsf',
'garageyrblt',
'totalbsmtsf',
'bsmtfinsf1',
'lotarea',
'lotfrontage',
'2ndflrsf']
Print the number of unique values for each of the categorical columns.
dfnn_tf[catnn].nunique().sort_values(ascending=False)
yearremodadd 61
totrmsabvgrd 12
overallqual 10
..
fullbath 4
fireplaces 4
centralair 2
Length: 9, dtype: int64
None of the boolean columns that indicate whether a value was missing in a row of a column are present in the final training dataset. We therefore keep FillMissing from adding such columns to the tabular object created below by passing add_col=False. Doing this now helps make the training and test data compatible if the test data has missing values in columns where the training data doesn’t.
procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn = TabularPandas(
    dfnn_tf,
    procsnn,
    catnn,
    contnn,
    splits=(train_s, valid_s),
    y_names="saleprice",
)
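The following dependency-free sketch illustrates the general idea behind filling missing values and normalizing a continuous column: the fill value, mean and standard deviation are learned from the training data and then reused for the test data. It is written in the spirit of FillMissing (median fill) and Normalize, but it is not fastai's implementation. `fit_stats`, `transform`, the use of the population standard deviation and the example values are assumptions made for illustration.

```python
import statistics


def fit_stats(train_values):
    """Learn fill value, mean and standard deviation from the training column."""
    present = [v for v in train_values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in train_values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population std, an assumption
    return {"median": median, "mean": mean, "std": std}


def transform(values, stats):
    """Fill missing values and standardize with statistics from training."""
    filled = [stats["median"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]


train_lotfrontage = [70.0, 98.0, 56.0, 60.0, 66.0]  # no missing values
test_lotfrontage = [80.0, None, 62.0]               # one missing value

stats = fit_stats(train_lotfrontage)
test_scaled = transform(test_lotfrontage, stats)
```

Because the sketch learns a fill value for every column, a missing value in the test data can be handled even when the training column had none.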
The dataloaders object is created from the preprocessed TabularPandas object with a batch size of 1024. It holds the training and validation sets.
dls = tonn.dataloaders(1024)
x_nnt, y = dls.train.xs, dls.train.y
x_val_nnt, y_val = dls.valid.xs, dls.valid.y
y.min(), y.max()
(10.46024227142334, 13.534473419189453)
Calculate the RMSE values of a RandomForestRegressor fitted on the preprocessed data sets taken from the dataloaders object.
m2 = rff(x_nnt, y)
m_rmse(m2, x_nnt, y), m_rmse(m2, x_val_nnt, y_val)
(0.124612, 0.135281)
Create the tabular_learner object using the dataloaders object from the previous step. The range of the dependent variable saleprice is restricted via y_range to (10.45, 13.55), which just covers the minimum and maximum values of y shown above.
learn = tabular_learner(dls, y_range=(10.45, 13.55), n_out=1, loss_func=F.mse_loss)
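As we understand it, setting y_range makes the model pass its final activation through a sigmoid that is scaled to the given interval, so predictions can never fall outside of it. A minimal sketch of that mapping, with hypothetical raw activations, is given below.

```python
import math


def sigmoid_range(x, low, high):
    """Map any real number x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low


y_range = (10.45, 13.55)
raw_outputs = [-8.0, 0.0, 8.0]  # hypothetical raw activations of the last layer
scaled = [sigmoid_range(x, *y_range) for x in raw_outputs]
```

A raw activation of 0 maps to the midpoint of the interval, and very large or very small activations approach but never reach the bounds. This is why the bounds are chosen slightly wider than the observed minimum and maximum of the dependent variable.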
A look at the columns of the DataFrame that holds the independent variables, as given by the Kaggle test dataset. This is the dataset that the final predictions need to be made on.
dfnn_vf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 overallqual 1459 non-null int64
1 grlivarea 1459 non-null int64
2 yearbuilt 1459 non-null int64
3 garagecars 1458 non-null float64
4 1stflrsf 1459 non-null int64
5 fullbath 1459 non-null int64
6 garageyrblt 1381 non-null float64
7 totalbsmtsf 1458 non-null float64
8 fireplacequ 1459 non-null category
9 bsmtfinsf1 1458 non-null float64
10 lotarea 1459 non-null int64
11 centralair 1459 non-null object
12 yearremodadd 1459 non-null int64
13 garagecond 1459 non-null category
14 lotfrontage 1232 non-null float64
15 fireplaces 1459 non-null int64
16 2ndflrsf 1459 non-null int64
17 totrmsabvgrd 1459 non-null int64
dtypes: category(2), float64(5), int64(10), object(1)
memory usage: 185.8+ KB
Looking at a random sample containing 5 rows of the DataFrame.
dfnn_vf.sample(n=5, random_state=seed)
| | overallqual | grlivarea | yearbuilt | garagecars | 1stflrsf | ... | garagecond | lotfrontage | fireplaces | 2ndflrsf | totrmsabvgrd |
---|---|---|---|---|---|---|---|---|---|---|---|
1321 | 4 | 864 | 1950 | 1.0 | 864 | ... | TA | 60.0 | 0 | 0 | 5 |
836 | 8 | 2100 | 2007 | 3.0 | 958 | ... | TA | 82.0 | 2 | 1142 | 8 |
413 | 5 | 990 | 1994 | 1.0 | 990 | ... | TA | 65.0 | 0 | 0 | 5 |
522 | 8 | 1342 | 2006 | 2.0 | 1342 | ... | TA | 48.0 | 1 | 0 | 6 |
1035 | 6 | 2422 | 1954 | 2.0 | 2422 | ... | TA | 102.0 | 2 | 0 | 6 |
5 rows × 18 columns
We apply the same procs we used for the training dataset in the call to TabularPandas, followed by creating the dataloaders object and assigning the independent variables to the variable x_valid.
Since there is no dependent variable in this dataset, there is no .y part, and we omit the parameter y_names for that reason. Not passing a value for splits means the dataset is not split into training and validation data, so all rows will be part of dlsv.train.xs.
To get predictions on the Kaggle test data from the fitted estimator, we take tonn, the TabularPandas object used for training, call its .new method and pass it dlsv.train.xs. Despite the name "train", this holds the Kaggle test data from the dataloaders object dlsv. The data is then processed, and a dataloader holding the test data is created for predictions.
procsnn = [Categorify, FillMissing(add_col=False), Normalize]
tonn_vf = TabularPandas(dfnn_vf, procsnn, catnn, contnn)
dlsv = tonn_vf.dataloaders(1024)
x_valid = dlsv.train.xs
tonn_vfs = tonn.new(dlsv.train.xs)
tonn_vfs.process()
tonn_vfs.items.head()
tonn_vfs_dl = dls.valid.new(tonn_vfs)
tonn_vfs_dl.show_batch()
| | overallqual | garagecars | fullbath | fireplacequ | centralair | yearremodadd | garagecond | fireplaces | totrmsabvgrd | grlivarea | yearbuilt | 1stflrsf | garageyrblt | totalbsmtsf | bsmtfinsf1 | lotarea | lotfrontage | 2ndflrsf |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 2 | 2 | #na# | #na# | #na# | #na# | 1 | 3 | -1.215545 | -0.340839 | -0.654589 | -0.653021 | -0.370719 | 0.063433 | 0.363854 | 0.567329 | -0.775266 |
1 | 6 | 2 | 2 | #na# | #na# | #na# | #na# | 1 | 4 | -0.323504 | -0.439751 | 0.433328 | -0.769719 | 0.639182 | 1.063523 | 0.897789 | 0.615966 | -0.775266 |
2 | 5 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 4 | 0.294491 | 0.844006 | -0.574169 | 0.747364 | -0.266790 | 0.773368 | 0.809397 | 0.275533 | 0.891941 |
3 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | 0.243012 | 0.876977 | -0.579218 | 0.786203 | -0.271295 | 0.357968 | 0.031786 | 0.470066 | 0.837236 |
4 | 8 | 3 | 3 | #na# | #na# | #na# | #na# | 1 | 3 | -0.424442 | 0.679386 | 0.310173 | 0.552805 | 0.528496 | -0.387169 | -0.971584 | -1.232092 | -0.775266 |
5 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | 0.348114 | 0.712356 | -0.988706 | 0.591644 | -0.639604 | -0.965220 | 0.036564 | 0.324165 | 1.346189 |
6 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 1 | 4 | -0.616098 | 0.679386 | 0.076580 | 0.552805 | 0.275484 | 1.089883 | -0.370756 | -0.064899 | -0.775266 |
7 | 6 | 3 | 3 | #na# | #na# | #na# | #na# | 2 | 5 | -0.043400 | 0.876977 | -0.923342 | 0.786203 | -0.580829 | -0.965220 | -0.285948 | -0.259431 | 0.832477 |
8 | 7 | 3 | 2 | #na# | #na# | #na# | #na# | 2 | 3 | -0.298774 | 0.613678 | 0.463439 | 0.474945 | 0.573757 | 0.434900 | 0.072399 | 0.810498 | -0.775266 |
9 | 4 | 3 | 2 | #na# | #na# | #na# | #na# | 1 | 2 | -1.244439 | -0.044803 | -0.689749 | -0.302924 | -0.370719 | 0.801959 | -0.285948 | 0.081001 | -0.775266 |