The dvc.yaml file is the central point for defining a DVC-based pipeline. It specifies each stage, along with the command, dependencies, and outputs. In this project, the entire pipeline—spanning raw data ingestion, transformations, feature engineering, and modeling—is consolidated into a single dvc.yaml.
This design aligns with best practices for MLOps, ensuring a single source of truth and avoiding version control conflicts that could arise from handling multiple dvc.yaml files.
Following the DRY principle (Don’t Repeat Yourself), no Python script exists solely for a single stage. Instead, each transformation script is atomic, data-version-agnostic, and standardized in terms of input and output. The dvc.yaml file not only executes every step but also serves as the authoritative source for overrides, dependencies, and outputs for all stages. Because it is tracked by Git, any version of dvc.yaml generated by dependencies/templates/generate_dvc_yaml_core.py—which uses a Jinja2 template (templates/dvc/generate_dvc.yaml.j2)—can be restored or inspected at any time.
The following sections explain why a single-file approach is beneficial, how it addresses stage-ordering challenges, and how it enhances reproducibility.
⸻
DVC supports either splitting a pipeline across several pipeline files or defining every stage in a single file. In most cases, however, a single dvc.yaml provides a streamlined, linear flow of stages. This makes it easier for new contributors or stakeholders to see the pipeline’s progression at a glance. Each stage is labeled clearly (for example, v0_download_and_save_data, v0_sanitize_column_names, v1_drop_description_columns, v2_median_profit, etc.), and the order is explicit, reducing confusion about the overall structure.
Stage names follow a simple convention:
Version prefix
v0, v1, v2, …, v13, corresponding to the value of data_version_input in configs/data_versions/base.yaml.
Transformation name
The transformation being applied, for example sanitize_column_names or median_profit.
One motivation for designing transformations in this manner is the clarity it provides. Each stage can reference exactly one file that defines the transformation logic and its typed dataclass. Files within dependencies/transformations/ follow a consistent structure, which simplifies maintenance and review.
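To illustrate the shape these files take, here is a minimal, hypothetical sketch of one transformation module; the file name, dataclass fields, and column names are illustrative assumptions rather than code from the repository.

```python
# Hypothetical example of a file in dependencies/transformations/ (e.g. median_profit.py).
# The dataclass fields and column names are illustrative, not the project's actual API.
from dataclasses import dataclass

import pandas as pd


@dataclass
class MedianProfitParams:
    """Typed parameters for the transformation, populated from Hydra overrides."""
    group_column: str = "sector"          # assumed column name
    value_column: str = "profit"          # assumed column name
    output_column: str = "median_profit"  # column added by the transform


def transform(df: pd.DataFrame, params: MedianProfitParams) -> pd.DataFrame:
    """Atomic, data-version-agnostic step: DataFrame in, DataFrame out."""
    medians = df.groupby(params.group_column)[params.value_column].transform("median")
    return df.assign(**{params.output_column: medians})
```

Because every module exposes the same DataFrame-in, DataFrame-out contract, a stage in dvc.yaml only needs to know which module to call and which typed parameters to pass.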
⸻
A pipeline with many transformations can become unwieldy in YAML form. Writing each stage by hand is repetitive and error-prone, particularly when the only differences between stages are the script name and a few Hydra overrides.
Solution: use a Jinja2 template to generate dvc.yaml programmatically. Each transformation’s configuration is enumerated in Python, and the template is then rendered automatically to produce the full pipeline file.
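As a concrete illustration, the sketch below shows one way such a generator could be written; the embedded template, StageSpec fields, command format, and paths are assumptions made for the example, not the contents of the actual generate_dvc_yaml_core.py or generate_dvc.yaml.j2.

```python
# Hedged sketch of the template-driven approach; the template string, StageSpec fields,
# and paths below are illustrative assumptions, not the repository's actual files.
from dataclasses import dataclass

from jinja2 import Template

DVC_TEMPLATE = Template(
    """\
stages:
{% for s in stages %}
  {{ s.name }}:
    cmd: python {{ s.script }} data_version_input={{ s.version }}
    deps:
      - {{ s.script }}
      - {{ s.input_path }}
    outs:
      - {{ s.output_path }}
{% endfor %}
""",
    trim_blocks=True,
    lstrip_blocks=True,
)


@dataclass
class StageSpec:
    """One entry per pipeline stage, enumerated in Python instead of hand-written YAML."""
    name: str
    script: str
    version: str
    input_path: str
    output_path: str


def render_dvc_yaml(stages: list[StageSpec]) -> str:
    """Render every stage into a single dvc.yaml string."""
    return DVC_TEMPLATE.render(stages=stages)


if __name__ == "__main__":
    example = [
        StageSpec(
            name="v1_drop_description_columns",
            script="dependencies/transformations/drop_description_columns.py",
            version="v1",
            input_path="data/v0/data.parquet",   # assumed data layout
            output_path="data/v1/data.parquet",  # assumed data layout
        ),
    ]
    print(render_dvc_yaml(example))
```

Because the rendered file is plain text and tracked by Git, adding or changing a stage spec and regenerating dvc.yaml produces an ordinary, reviewable diff.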
Another challenge involves long-running or computationally expensive jobs. Combining multiple transformations into a single stage forfeits the granularity that makes DVC so powerful. Splitting transformations into individual stages allows DVC to rerun only the stages affected by a change, not the entire pipeline, and it enables more precise tracking of each component.
⸻
A well-structured dvc.yaml file serves as the backbone of any maintainable MLOps pipeline. By separating each stage explicitly and using a template-based approach, duplication is minimized and reproducibility is maximized. Further details on the template logic and stage architecture can be found in the “Master Doc” or by visiting the GitHub repository. Additional articles in this series describe how code is managed to preserve modularity across all pipeline components.
⸻