The Integration of MLflow in This Project

MLflow is central to this project's experiment tracking, artifact management, and reproducible model development. It is integrated through Hydra configurations, S3 synchronization scripts, and Python modeling code that leverages MLflow's Pythonic API.

Reasons for Using MLflow

MLflow unifies local experimentation with remote artifact sharing: it provides parameter logging, metric tracking, and a model registry, which keeps results comparable across collaborators. Although MLflow itself is language-agnostic, this project relies on its Python API throughout.

Implementation Details

All MLflow runs store metadata in local mlruns/ directories, which are then synchronized to S3. Configuration YAML files (push_mlruns_s3.yaml and pull_mlruns_s3.yaml) in configs/utility_functions define the AWS bucket name, prefix, local tracking directory, and the exact sync commands. Below is an excerpt from push_mlruns_s3.py showing how the CLI command is executed:

# Excerpt: sync_command and logger are defined earlier in the script.
if replace_remote:
    # --delete removes S3 objects that no longer exist locally,
    # so require explicit confirmation before a destructive sync.
    logger.info("Remote objects not present locally will be deleted from S3.")
    confirm = input("Proceed with remote deletion? [y/N]: ")
    if confirm.lower() not in ("y", "yes"):
        logger.info("Aborted by user.")
        return
    sync_command += ["--delete"]

# check=True raises CalledProcessError if the aws CLI exits non-zero.
subprocess.run(sync_command, check=True)
logger.info("Push complete.")

The local directory path and remote URI both come from Hydra-based YAML entries, ensuring every collaborator uses consistent settings. The scripts in dependencies/io shield developers from manual synchronization procedures.
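
For illustration, a config along these lines would supply those values; the key names here are assumptions rather than the project's actual schema:

# Hypothetical shape of configs/utility_functions/push_mlruns_s3.yaml.
bucket_name: my-mlflow-bucket
prefix: mlruns
local_tracking_dir: ./mlruns
replace_remote: false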

Experiment Tracking and Model Runs

The modules dependencies/modeling/rf_optuna_trial.py and dependencies/modeling/ridge_optuna_trial.py demonstrate how MLflow is set up:

import mlflow

# Point MLflow at the local file store that the S3 scripts synchronize.
mlflow.set_tracking_uri("file:./mlruns")

# Create the experiment on first use, then make it the active one.
existing = mlflow.get_experiment_by_name(experiment_name)
if existing is None:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
...
# One nested run per Optuna trial, recording its parameters and metrics.
with mlflow.start_run(run_name=f"trial_{trial.number}", nested=True):
    mlflow.log_params(final_params)
    mlflow.log_metrics({"rmse": rmse, "r2": r2})

Each run is named trial_X or final_model and records key results such as RMSE, R², and the hyperparameters that produced them. Final-model runs additionally call mlflow.sklearn.log_model, which stores the fitted model together with its Python environment details.
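
To make the nesting concrete, here is a self-contained sketch of the same pattern on synthetic data; the dataset, experiment name, trial count, and search space are illustrative, not the project's:

import mlflow
import mlflow.sklearn
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same file-based store the project uses; set_experiment creates it if missing.
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("rf_demo")

# Synthetic regression data standing in for the project's dataset.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, 2.0, 0.5, 0.0, -1.0]) + 0.1 * rng.random(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
    }
    # Nested run per trial, mirroring the pattern shown above.
    with mlflow.start_run(run_name=f"trial_{trial.number}", nested=True):
        model = RandomForestRegressor(**params, random_state=0).fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = mean_squared_error(y_test, pred) ** 0.5
        mlflow.log_params(params)
        mlflow.log_metrics({"rmse": rmse, "r2": r2_score(y_test, pred)})
    return rmse

# Parent run gives the nested trial runs something to nest under.
with mlflow.start_run(run_name="final_model"):
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=5)
    best = RandomForestRegressor(**study.best_params, random_state=0).fit(X_train, y_train)
    mlflow.sklearn.log_model(best, "model")  # also captures the Python environment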

Pitfalls Addressed

Two failure modes are guarded against explicitly. The confirmation prompt before a sync with --delete prevents accidental removal of remote runs, and sourcing every bucket name, prefix, and tracking path from the shared Hydra YAML files keeps collaborators from drifting onto inconsistent settings.

Comparison with Other Tools

MLflow is lighter-weight to operate than Kubernetes-based platforms such as Kubeflow or specialized SaaS solutions. It integrates cleanly with plain Python scripts and with Hydra's hierarchical configurations, which makes it a good fit for this project's iterative experiment pipelines.
