Expert Analysis in Lasso Regression: The Datacamp Concrete Challenge

This article documents my approach to the Datacamp Concrete Challenge, completed within an hour. It offers detailed explanations for each step, emphasizing my proficiency with the Lasso regression model in a time-constrained environment.


Sub-One-Hour Regression Challenge

The Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

  1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
  2. The coefficients \(\beta_{0}\), \(\beta_{1}\) … \(\beta_{8}\), to use in the following formula:

\[Concrete \ Strength = \beta_{0} + \beta_{1} \cdot cement + \beta_{2} \cdot slag + \beta_{3} \cdot fly\_ash + \beta_{4} \cdot water + \beta_{5} \cdot superplasticizer + \beta_{6} \cdot coarse\_aggregate + \beta_{7} \cdot fine\_aggregate + \beta_{8} \cdot age\]
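For reference later on, the formula can be written as a small helper function. This is a minimal sketch with an illustrative name; it only assumes the eight inputs and the coefficients are passed as sequences in the order given above.

import numpy as np

def concrete_strength(features, betas):
    # features: cement, slag, fly_ash, water, superplasticizer,
    #           coarse_aggregate, fine_aggregate, age (in that order)
    # betas: intercept beta_0 followed by beta_1 ... beta_8
    return betas[0] + np.dot(betas[1:], features)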

The Data

The team has already tested more than a thousand samples (source):

Compressive strength data:

Acknowledgments: I-Cheng Yeh, “Modeling of strength of high-performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

Imports

import janitor  # pyjanitor adds DataFrame helpers such as get_features_targets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split, KFold

seed = 42
df = pd.read_csv("data/concrete_data.csv")
df.head()
cement slag fly_ash water superplasticizer coarse_aggregate fine_aggregate age strength
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075

Creating Dependent & Independent Variable Splits

# pyjanitor's get_features_targets splits the DataFrame into features and target
indep, dep = df.get_features_targets(
    target_column_names=["strength"],
)

Create Train & Test Split

We split the independent variables and the dependent variable into train and test data. Wrapping this step in a function makes the code more reproducible and more robust against changes in the inputs, and it lets us change the train/test ratio without adding more code. It works for any pandas.DataFrame, with the dependent variable specified by its column name.

def trntst(df, dep_col, test_size):
    # Split off the dependent column, convert both parts to NumPy arrays and
    # create a reproducible train/test split using the global seed.
    indep, dep = df.get_features_targets(target_column_names=[dep_col])
    indep, dep = indep.to_numpy(), dep.to_numpy()
    indep_trn, indep_tst, dep_trn, dep_tst = train_test_split(
        indep, dep, test_size=test_size, random_state=seed
    )
    # Flatten the dependent arrays from shape (n, 1) to shape (n,).
    dep_trn, dep_tst = dep_trn.ravel(), dep_tst.ravel()
    return indep_trn, indep_tst, dep_trn, dep_tst

Use the function defined above to create four subsets of the original DataFrame: indep_trn and indep_tst hold the train and test splits of the independent variables, and dep_trn and dep_tst hold those of the dependent variable.

indep_trn, indep_tst, dep_trn, dep_tst = trntst(df, "strength", 0.2)

Model Setup & Cross Validation

We use the defaults of sklearn.model_selection.KFold (5 folds) for cross validation and the Lasso regression model LassoCV from sklearn.linear_model, which performs cross validation to find the best value of the \(\ell_1\) regularization parameter \(\alpha\) when fitted to the data. That is, it iterates over different values of \(\alpha\) and over subsets of the training data.

Lasso In-Depth

Lasso is an acronym for Least Absolute Shrinkage and Selection Operator. It was first introduced in (Tibshirani 1996). In its default configuration, with \(\alpha = 1\), it is trained with the \(\ell_1\)-norm as regularizer: the sum of the absolute values of the coefficients \(w\) is added to the ordinary least squares term. The complete objective is minimised with respect to \(w\) by coordinate descent. The objective function to minimise is

\[\min_{w} \; \frac{1}{2n} \|X w - y\|_2^2 + \alpha \|w\|_1, \qquad \text{with } \|x\|_2 = \Big(\sum_{i=1}^n |x_i|^2\Big)^{1/2} \ (\text{Euclidean norm}), \quad \|x\|_1 = \sum_{i=1}^{n} |x_i| \ (\ell_1\text{-norm})\]

Here \(n\) is the number of observations, \(X\) the matrix of the independent variables, \(w\) the coefficient vector and \(y\) the dependent variable. The model solves the minimisation problem with respect to \(w\) under the added penalty term \(\alpha\|w\|_1\), where \(\alpha \ge 0\) is a constant and \(\|w\|_1\) is the \(\ell_1\)-norm of \(w\). Larger values of \(\alpha\) shrink the coefficients more strongly, giving coefficients that are either exactly zero or close to zero (sparse coefficients). The lower bound \(\alpha = 0\) eliminates the penalty term completely and reduces the model to ordinary least squares linear regression. Hence the model is used not only for predictions, but also as a feature selector when \(\alpha > 0\).

A special case where ordinary least squares linear regression fails and Lasso is used instead is the following. Consider \(X\) as a matrix \(X \in \mathbb{R}^{n \times p}\) with the number of rows given by \(n\) (number of observations) and the number of columns given by \(p\) (number of regressors). Lasso is frequently used when there are far more regressors than observations, that is when \(p \gg n\). As with a general regression model, the assumption is that the observations are either independent or that the \(y_i\)s are conditionally independent given the \(x_{ij}\)s.
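To make the shrinkage behaviour tangible, the short sketch below (my own illustration on synthetic data, not the concrete data set) fits scikit-learn's plain Lasso for a few values of \(\alpha\) and counts how many coefficients are driven to exactly zero.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem: 20 regressors, only 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

for alpha in (0.01, 0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    n_zero = int((lasso.coef_ == 0).sum())
    print(f"alpha={alpha:>5}: {n_zero} of {lasso.coef_.size} coefficients are exactly zero")

With growing \(\alpha\), more coefficients are typically set to exactly zero, which is the feature-selection effect described above.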

Final Coefficients And Intercept

We get the final coefficients of the model from the fitted estimator's .coef_ attribute and the intercept from .intercept_.

kfold = KFold()  # 5 folds by default
model = LassoCV(cv=kfold)  # cross-validated search over a path of alpha values
model.fit(indep_trn, dep_trn)
coeffs = model.coef_
intercept = model.intercept_
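As a quick sanity check, not part of the challenge itself, the regularisation strength LassoCV settled on and the fit on the held-out test data can be inspected (the exact numbers depend on the split and are omitted here):

print(f"selected alpha: {model.alpha_:.4f}")
print(f"R^2 on the test set: {model.score(indep_tst, dep_tst):.3f}")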

We map the coefficients to the independent variables. The coefficients of the final model are the output of the following function.

def final_params(indep, coeffs, intercept):
    # Map each coefficient to its column name and append the intercept.
    return dict(
        zip(indep.columns.tolist() + ["intercept"], np.append(coeffs, intercept))
    )

Answer To Challenge Question Part 2

All the project leader has to do now is plug the values of the input variables into the formula together with these coefficients. The output below is the answer to the second part of the question posed in the Datacamp challenge.

final_params(indep, coeffs, intercept)
{'cement': 0.11084314870584021,
 'slag': 0.09778986094861224,
 'fly_ash': 0.07512685155415182,
 'water': -0.21209276166636368,
 'superplasticizer': 0.0,
 'coarse_aggregate': 0.0,
 'fine_aggregate': 0.009994803080481951,
 'age': 0.11255389788043503,
 'intercept': 18.996750218740445}
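As a cross-check of the formula, the coefficients can be plugged in manually for the first sample and compared with the estimator's own prediction; this sketch assumes the objects defined earlier (indep, coeffs, intercept, model) are still in scope.

sample = indep.iloc[0].to_numpy()                    # first concrete mix in the data
manual = intercept + np.dot(coeffs, sample)          # formula with beta_0 ... beta_8
predicted = model.predict(sample.reshape(1, -1))[0]  # estimator's prediction
print(manual, predicted)                             # the two values should agree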

Answer To Challenge Question Part 1

The first part of the question is answered using .groupby, a powerful pandas data grouping method. We select only the columns mentioned in the first part of the question, that is, age and strength. We group by age, which by default sorts the resulting DataFrameGroupBy object by the group keys. Conveniently, the first five groups already cover the ages we are supposed to compute the average (mean) concrete strength for (1, 7, 14 and 28 days), with the 3-day samples included as well. Calling .mean on the groupby object returns the values we are looking for.

grouped = df[["age", "strength"]].groupby("age")
gm = grouped.mean()[0:5]  # mean strength for the five youngest age groups
gm
strength
age
1 9.452716
3 18.981082
7 26.050623
14 28.751038
28 36.748480
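The challenge only asks for the ages 1, 7, 14 and 28 days, so the 3-day group could also be dropped by selecting the requested ages explicitly:

grouped.mean().loc[[1, 7, 14, 28]]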