Key Tasks in Structured Data Machine Learning Projects

An essential guide to the six fundamental tasks in machine learning projects involving structured data, emphasizing aspects like predictive modeling and hyperparameter optimization.

Tags: structured-data-ml machine-learning-process predictive-modeling hyperparameter-tuning project-guide

Category: machine-learning-concepts

Summary

A predictive modeling project on tabular data can be divided into six main tasks, described below with reference to Python tools such as scikit-learn. These tasks cover the prototyping phase of the project.

Task No. 1 | Define Problem

Understand the fundamentals: the goals of the project and the input data, including its variables, its structure, and how it was gathered.
Define the model’s prediction target and evaluation metric.

Task No. 2 | Analyze Data

Perform Exploratory Data Analysis (EDA) to understand the raw data, including unique values, missing data, data types, and data distribution.
Create visualizations to analyze univariate and bivariate distributions, correlations, and skewness of variables.
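
The checks listed above can be sketched with pandas. This is a minimal illustration rather than a complete EDA: the table and its column names (age, income, segment) are hypothetical stand-ins for the project's raw data, and plotting is left out so the example stays self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the project's raw data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200).astype(float),
    "income": rng.lognormal(mean=10, sigma=1, size=200),  # right-skewed
    "segment": rng.choice(["a", "b", "c"], size=200),
})
df.loc[::20, "age"] = np.nan  # inject missing values in every 20th row

# One row per column: data type, unique values, missing values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "n_missing": df.isna().sum(),
})

# Skewness and pairwise correlations for the numeric columns only.
numeric = df.select_dtypes(include="number")
skewness = numeric.skew()
correlations = numeric.corr()

print(summary)
print(skewness)
print(correlations)
```

A strongly positive skewness, as expected for the simulated income column, is one signal that a variable may benefit from a transformation later on.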

Task No. 3 | Evaluate Candidate Estimators: Baseline

Establish baseline scores using regression/classification models that need little preprocessing of the input data, such as RandomForestRegressor or RandomForestClassifier.
Perform feature selection using feature_importances_ or permutation_importance.
Handle missing values and convert data types as needed.
Be mindful of data leakage and use proper train/test splits and cross-validation techniques.
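
A baseline along these lines might look like the sketch below, which uses scikit-learn. The data is synthetic (make_regression with artificially injected missing values), and the median imputation, fold count, and forest size are illustrative assumptions, not recommendations. Placing the imputer inside the pipeline means it is refit on each training fold, which is one way to avoid leaking information from validation data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a tabular dataset, with about 5% missing values.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Hold out a test set before any fitting takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Imputation lives inside the pipeline, so each CV fold fits its own imputer.
baseline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="r2")
print("baseline R^2 per fold:", cv_scores.round(3))

# Rank features by permutation importance on the held-out data.
baseline.fit(X_train, y_train)
perm = permutation_importance(baseline, X_test, y_test,
                              n_repeats=5, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]
print("features ranked by permutation importance:", ranking)

# Impurity-based importances from the fitted forest, for comparison.
impurity_importances = baseline[-1].feature_importances_
```

Permutation importance is computed here on the held-out split, so it reflects what the model relies on for unseen data, whereas feature_importances_ is derived from the training data alone.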

Task No. 4 | Feature Engineering

Perform feature engineering: transform variable distributions and consider categorical embeddings. For model selection, consider AutoML libraries such as TPOT.
Test smaller subsets of independent variables and analyze prediction results.
Return to the steps under Task No. 3 as necessary to re-score the candidate estimators on the engineered features.
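
As one possible illustration of transforming a variable's distribution, the sketch below log-transforms a right-skewed numeric column and one-hot encodes a categorical column with a scikit-learn ColumnTransformer. The column names are hypothetical, and one-hot encoding is used here as a simpler substitute for the categorical embeddings mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical table with one skewed numeric and one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=500),
    "segment": rng.choice(["a", "b", "c"], size=500),
})

preprocess = ColumnTransformer(
    transformers=[
        # log1p compresses the long right tail of the income column.
        ("num", FunctionTransformer(np.log1p), ["income"]),
        # One column per category; unseen categories encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
features = preprocess.fit_transform(df)

skew_before = df["income"].skew()
skew_after = pd.Series(features[:, 0]).skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Such a transformer can be placed in front of an estimator in a pipeline, so that the same preprocessing is applied consistently when the candidates from Task No. 3 are scored again.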

Task No. 5 | Improve Results

Design a test harness to select models with the best scores from Task No. 3.
Customize the training metric, if needed, and use hyperparameter optimization techniques such as grid search or random search.
Use proper cross-validation methods and try ensembles of estimators with custom weights.
Iterate over Tasks No. 2 to 4 as needed.
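
A small test harness combining these ideas could look like the following sketch. The data is synthetic, and the parameter grid, the choice of mean absolute error as the metric, the Ridge model as second ensemble member, and the ensemble weights are all illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

# One shared splitter so every candidate is scored on the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
metric = "neg_mean_absolute_error"  # scikit-learn maximizes scores, hence "neg"

# Exhaustive grid search over a deliberately tiny grid.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    scoring=metric,
    cv=cv,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV score:", round(search.best_score_, 3))

# Weighted ensemble of the tuned forest and a linear model.
# The weights are arbitrary and would normally be tuned as well.
ensemble = VotingRegressor(
    estimators=[("rf", search.best_estimator_), ("ridge", Ridge(alpha=1.0))],
    weights=[1, 2],
)
ensemble_scores = cross_val_score(ensemble, X, y, scoring=metric, cv=cv)
print("ensemble CV score:", round(ensemble_scores.mean(), 3))
```

RandomizedSearchCV accepts a similar set of arguments and samples a fixed number of parameter combinations instead of trying all of them. Scores obtained on the same folds that were used for tuning tend to be optimistic, so a final check on untouched data is advisable.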

Task No. 6 | Present Results

Finalize the model, make predictions, and document the process.
Present the work and explain how the final solution addresses the problem defined at the beginning.
Acknowledge limitations and areas for further improvement.
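
Finalizing the model often means refitting the chosen estimator on all available training data and saving it for later predictions. The sketch below shows one way to do this with joblib; the file name, the estimator settings, and the synthetic data are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: the first 250 rows are "training" data, the rest are "new".
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X_train, y_train, X_new = X[:250], y[:250], X[250:]

# Refit the selected configuration on all training data.
final_model = RandomForestRegressor(n_estimators=50, random_state=0)
final_model.fit(X_train, y_train)

# Persist the fitted model and load it back, as a later session would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "final_model.joblib"  # hypothetical file name
    joblib.dump(final_model, model_path)
    reloaded = joblib.load(model_path)

predictions = reloaded.predict(X_new)
print("first predictions:", np.round(predictions[:3], 2))
```

Recording the library versions and random seeds alongside the saved file makes the documented process easier to reproduce.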