MLflow is central to this project’s experiment tracking, artifact management, and reproducible model development. It is integrated through Hydra configurations, S3 synchronization scripts, and Python modeling code that leverages MLflow’s Pythonic API.
MLflow unifies local experimentation and remote artifact sharing. It supports parameter logging, metric tracking, and a model registry for consistent collaboration. Although MLflow itself is language-agnostic, this project uses it exclusively through its Python API.
All MLflow runs store metadata in local mlruns/ directories, which are then synchronized to S3. Configuration YAML files (push_mlruns_s3.yaml and pull_mlruns_s3.yaml) in configs/utility_functions define the AWS bucket name, prefix, local tracking directory, and the exact sync commands. Below is an excerpt from push_mlruns_s3.py showing how the CLI command is executed:
if replace_remote:
    logger.info("Remote objects not present locally will be deleted from S3.")
    confirm = input("Proceed with remote deletion? [y/N]: ")
    if confirm.lower() not in ("y", "yes"):
        logger.info("Aborted by user.")
        return
    sync_command += ["--delete"]

subprocess.run(sync_command, check=True)
logger.info("Push complete.")
The local directory path and remote URI both come from Hydra-based YAML entries, ensuring every collaborator uses consistent settings. The scripts in dependencies/io shield developers from manual synchronization procedures.
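To make the config-to-command flow concrete, here is a minimal, hypothetical sketch of how push_mlruns_s3.py might assemble the sync command from its Hydra config. The key names (bucket_name, prefix, local_dir) and the config path are assumptions for illustration, not the project's actual values.

import subprocess

import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="../../configs/utility_functions", config_name="push_mlruns_s3")
def main(cfg: DictConfig) -> None:
    # Build the `aws s3 sync <local> <remote>` command from config values
    # (assumed keys: bucket_name, prefix, local_dir).
    remote_uri = f"s3://{cfg.bucket_name}/{cfg.prefix}"
    sync_command = ["aws", "s3", "sync", cfg.local_dir, remote_uri]
    subprocess.run(sync_command, check=True)

if __name__ == "__main__":
    main()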
Experiment Tracking and Model Runs

The modules dependencies/modeling/rf_optuna_trial.py and dependencies/modeling/ridge_optuna_trial.py demonstrate how MLflow is set up:
mlflow.set_tracking_uri("file:./mlruns")
existing = mlflow.get_experiment_by_name(experiment_name)
if existing is None:
    mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)
...
with mlflow.start_run(run_name=f"trial_{trial.number}", nested=True):
    mlflow.log_params(final_params)
    mlflow.log_metrics({"rmse": rmse, "r2": r2})
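For context, the following is a minimal, hypothetical sketch of how such nested trial runs can be driven by an Optuna study; the experiment name, dataset, and search space are illustrative assumptions rather than the project's actual code.

import mlflow
import numpy as np
import optuna
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("ridge_demo")  # assumed experiment name

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-3, 10.0, log=True)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, preds)))
    r2 = float(r2_score(y_test, preds))
    # Each trial becomes a nested child of the study-level parent run.
    with mlflow.start_run(run_name=f"trial_{trial.number}", nested=True):
        mlflow.log_params({"alpha": alpha})
        mlflow.log_metrics({"rmse": rmse, "r2": r2})
    return rmse

# Parent run wraps the whole study, so trials group together in the MLflow UI.
with mlflow.start_run(run_name="ridge_optuna_study"):
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)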
Each run is named trial_X or final_model, storing critical information like RMSE, R², and hyperparameters. Final model runs also log environment details via mlflow.sklearn.log_model.
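A minimal, hypothetical sketch of that final-model logging step is shown below; the run name, artifact path, and model are illustrative, with only the mlflow.sklearn.log_model call itself taken from the project.

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = Ridge(alpha=1.0).fit(X, y)

mlflow.set_tracking_uri("file:./mlruns")
with mlflow.start_run(run_name="final_model"):
    signature = infer_signature(X, model.predict(X))
    # Alongside the pickled estimator, log_model writes environment files
    # (such as conda.yaml and requirements.txt) into the run's artifacts.
    mlflow.sklearn.log_model(model, "model", signature=signature)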
MLflow is lighter to maintain than large Kubernetes-based platforms (e.g., Kubeflow) or specialized SaaS solutions. It integrates seamlessly with Python scripts and with Hydra's hierarchical configurations, making it well suited to this project's iterative experiment pipelines.
⸻