This article documents my approach to the DataCamp Concrete Challenge, completed within an hour. It explains each step in detail and shows how I work with a Lasso regression model under time constraints.
Provide your project leader with a formula that estimates the compressive strength. The deliverable has two parts: the average strength of the samples for a set of curing durations, and the coefficients of a formula based on the amounts of each ingredient and the age of the concrete. The team has already tested more than a thousand samples (source).
Acknowledgments: I-Cheng Yeh, “Modeling of strength of high-performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
import janitor
seed = 42
df = pd.read_csv("data/concrete_data.csv")
df.head()
|   | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.052780 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |
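Before modelling, it is worth confirming the size of the table; the UCI version of this dataset contains 1,030 samples, which matches the "more than a thousand" mentioned above.

# Number of rows and columns; roughly (1030, 9) for this dataset.
df.shape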
# pyjanitor's get_features_targets splits the DataFrame into features
# (indep) and the target column (dep); we keep indep as a DataFrame so
# its column names can be reused later.
indep, dep = df.get_features_targets(
    target_column_names=["strength"],
)
We split the independent variables and the dependent variable into train and test sets. Wrapping this step in a function makes the code more reproducible and more robust against changes in the inputs, and it lets us change the train/test ratio without writing additional code. The function works for any pandas.DataFrame, provided the dependent variable is given by its column name.
def trntst(df, dep_col, test_size):
    # Split the DataFrame into features (indep) and the target (dep).
    indep, dep = df.get_features_targets(target_column_names=[dep_col])
    indep, dep = indep.to_numpy(), dep.to_numpy()
    # Hold out `test_size` of the rows for evaluation.
    indep_trn, indep_tst, dep_trn, dep_tst = train_test_split(
        indep, dep, test_size=test_size, random_state=seed
    )
    # Flatten the targets from shape (n, 1) to (n,), as sklearn expects.
    dep_trn, dep_tst = dep_trn.ravel(), dep_tst.ravel()
    return indep_trn, indep_tst, dep_trn, dep_tst
Use the function defined above to create four subsets of the original DataFrame. The train/test subsets for the independent variables are indep_trn and indep_tst; for the dependent variable they are dep_trn and dep_tst.
indep_trn, indep_tst, dep_trn, dep_tst = trntst(df, "strength", 0.2)
We use the defaults of sklearn.model_selection.KFold (5 folds) for cross-validation, together with LassoCV from sklearn.linear_model, which performs cross-validation to find the best value for the regularization strength $\alpha$.

Lasso is an acronym for Least Absolute Shrinkage and Selection Operator. It was first introduced in the paper by Tibshirani (1996). It is trained with an L1 prior as regularizer, minimizing the objective

$$\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

With $\alpha = 0$ the model reduces to ordinary least squares; larger values of $\alpha$ shrink more coefficients to exactly zero, which is why some inputs can drop out of the final formula entirely.

We get the final coefficients of the model from the coef_ attribute of the fitted estimator and the intercept from .intercept_.
# 5-fold cross-validation (the KFold default) to select alpha.
kfold = KFold()
model = LassoCV(cv=kfold)
fitted = model.fit(indep_trn, dep_trn)

# Coefficients and intercept of the model refit on the training set
# with the selected alpha.
coeffs = model.coef_
intercept = model.intercept_
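As a quick check that goes beyond the challenge requirements, we can inspect which $\alpha$ the cross-validation selected, plot the cross-validation curve, and score the model on the held-out test set. LassoCV exposes the candidate values in alphas_, the per-fold errors in mse_path_, and the chosen value in alpha_; score returns $R^2$. A minimal sketch:

print(f"selected alpha: {model.alpha_:.4f}")
print(f"test R^2: {model.score(indep_tst, dep_tst):.3f}")

# Mean cross-validated MSE for each candidate alpha, with the selected
# alpha marked by a dashed vertical line.
plt.semilogx(model.alphas_, model.mse_path_.mean(axis=1))
plt.axvline(model.alpha_, linestyle="--")
plt.xlabel("alpha")
plt.ylabel("mean CV MSE")
plt.show()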
We map each coefficient to its independent variable. The function below returns the coefficients of the final model, together with the intercept, as a dictionary keyed by column name.
def final_params(indep, coeffs, intercept):
    # Map each feature name to its coefficient and append the intercept.
    names = indep.columns.tolist() + ["intercept"]
    return dict(zip(names, np.append(coeffs, intercept)))
All we have to do now is plug in the values for each of the input variables. The output below answers the second part of the question posed in the DataCamp challenge.
final_params(indep, coeffs, intercept)
{'cement': 0.11084314870584021,
 'slag': 0.09778986094861224,
 'fly_ash': 0.07512685155415182,
 'water': -0.21209276166636368,
 'superplasticizer': 0.0,
 'coarse_aggregate': 0.0,
 'fine_aggregate': 0.009994803080481951,
 'age': 0.11255389788043503,
 'intercept': 18.996750218740445}
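To turn these numbers into the requested formula, a small helper can plug a candidate mix into the coefficients. The function estimate_strength below is my own hypothetical helper, not part of the challenge; for rows the model has seen, it should agree with model.predict up to floating-point error.

def estimate_strength(params, **ingredients):
    # Evaluate the linear formula: intercept + sum(coefficient * value).
    # `params` is the dict returned by final_params; `ingredients` maps
    # column names (cement, slag, ..., age) to the values of a mix.
    return params["intercept"] + sum(
        coef * ingredients.get(name, 0.0)
        for name, coef in params.items()
        if name != "intercept"
    )

params = final_params(indep, coeffs, intercept)
# The first sample from the table above: quantities in kg per cubic
# metre of mixture, age in days.
estimate_strength(
    params,
    cement=540.0, slag=0.0, fly_ash=0.0, water=162.0,
    superplasticizer=2.5, coarse_aggregate=1040.0,
    fine_aggregate=676.0, age=28,
)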
The first part of the question is answered using .groupby, a powerful pandas data grouping method.

We select only the two columns mentioned in the first part of the question, age and strength, and group by age. By default, groupby sorts the resulting DataFrameGroupBy object by the keys of the grouping column. Conveniently, the first five rows are exactly the curing durations we are asked to compute the average (mean) strength for, so calling .mean on the grouped object and keeping the first five rows returns the values we are looking for.
grouped = df[["age", "strength"]].groupby("age")
gm = grouped.mean()[0:5]
gm
| age | strength |
|---|---|
| 1 | 9.452716 |
| 3 | 18.981082 |
| 7 | 26.050623 |
| 14 | 28.751038 |
| 28 | 36.748480 |
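As a final sanity check (my own addition, not required by the challenge), any of these group means can be reproduced by filtering the original DataFrame directly:

# Mean strength of all samples cured for 28 days; this should match the
# last row of the grouped table above.
df.loc[df["age"] == 28, "strength"].mean()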