Exploring dvc.yaml: The Engine of a Reproducible Pipeline

The dvc.yaml file is the central point for defining a DVC-based pipeline. It specifies each stage, along with the command, dependencies, and outputs. In this project, the entire pipeline—spanning raw data ingestion, transformations, feature engineering, and modeling—is consolidated into a single dvc.yaml.


This design aligns with best practices for MLOps, ensuring a single source of truth and avoiding version control conflicts that could arise from handling multiple dvc.yaml files.

Following the DRY principle (Don’t Repeat Yourself), no Python script exists solely for a single stage. Instead, each transformation script is atomic, data-version-agnostic, and standardized in its inputs and outputs. The dvc.yaml file not only executes every step but also serves as the authoritative source for the overrides, dependencies, and outputs of all stages. The file is generated by dependencies/templates/generate_dvc_yaml_core.py from a Jinja2 template (templates/dvc/generate_dvc.yaml.j2), and because it is tracked by Git, any generated version can be restored or inspected at any time.
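To make the shape of a stage concrete, here is a minimal sketch that models one stage in Python and prints it as a dvc.yaml fragment. The keys stages, cmd, deps, and outs are standard dvc.yaml fields; the script path, data paths, and Hydra-style overrides are hypothetical and only stand in for whatever the project actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Stage:
    """One dvc.yaml stage: a name, a command, its dependencies and outputs."""

    name: str
    cmd: str
    deps: list[str] = field(default_factory=list)
    outs: list[str] = field(default_factory=list)


def render_stage(stage: Stage) -> str:
    """Render a stage as an indented fragment of the `stages:` mapping."""
    lines = [f"  {stage.name}:", f"    cmd: {stage.cmd}"]
    for key, values in (("deps", stage.deps), ("outs", stage.outs)):
        if values:
            lines.append(f"    {key}:")
            lines.extend(f"      - {value}" for value in values)
    return "\n".join(lines)


# Hypothetical values: the script, the overrides and the data paths are invented.
stage = Stage(
    name="v1_drop_description_columns",
    cmd="python scripts/run_step.py transformations=drop_description_columns",
    deps=["scripts/run_step.py", "data/v0/v0.csv"],
    outs=["data/v1/v1.csv"],
)

print("stages:\n" + render_stage(stage))
```

Listing the script itself under deps is a common choice, since it lets DVC notice code changes as well as data changes.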

The following sections explain why a single-file approach is beneficial, how it addresses stage-ordering challenges, and how it enhances reproducibility.

1. Why a Single dvc.yaml?

DVC supports both splitting a pipeline across several dvc.yaml files and defining all stages in one file. In most cases, however, a single dvc.yaml provides a streamlined, linear flow of stages. This makes it easier for new contributors or stakeholders to see the pipeline’s progression at a glance. Each stage is labeled clearly (for example, v0_download_and_save_data, v0_sanitize_column_names, v1_drop_description_columns, v2_median_profit), and the order is explicit, reducing confusion about the overall structure.

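Stage names follow a simple convention. Judging by the examples above, each name combines a data-version prefix (v0, v1, v2) with a short description of the transformation.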
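As an illustration, the stage names quoted above can be split into a version number and a step name. This is a sketch that assumes the pattern v<number>_<snake_case_step>, which is inferred from the examples rather than taken from the project; the function name parse_stage_name is hypothetical.

```python
from __future__ import annotations

import re

# Assumed pattern: "v" + data version + "_" + snake_case step name.
STAGE_NAME = re.compile(r"^v(?P<version>\d+)_(?P<step>[a-z0-9_]+)$")


def parse_stage_name(name: str) -> tuple[int, str]:
    """Split a stage name into (data version, step name) or raise ValueError."""
    match = STAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"stage name does not follow the assumed convention: {name!r}")
    return int(match["version"]), match["step"]


names = [
    "v0_download_and_save_data",
    "v0_sanitize_column_names",
    "v1_drop_description_columns",
    "v2_median_profit",
]

for name in names:
    version, step = parse_stage_name(name)
    print(f"data version {version}: {step}")
```

A check like this could run in a test suite to catch stage names that drift from the convention.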

Why Each Transformation Is Atomic

One motivation for designing transformations in this manner is the clarity it provides. Each stage can reference exactly one file that defines the transformation logic and its typed dataclass. Files within dependencies/transformations/ follow a consistent structure, which simplifies maintenance and review.
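A transformation file in this style might look like the sketch below: one typed dataclass for the parameters and one function for the logic. The names DropDescriptionColumnsConfig, suffix, and the sample columns are hypothetical, and plain lists of dictionaries are used in place of a DataFrame to keep the example dependency-free.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DropDescriptionColumnsConfig:
    """Typed parameters for the transformation (hypothetical field name)."""

    suffix: str = "_description"


def drop_description_columns(
    rows: list[dict[str, object]],
    config: DropDescriptionColumnsConfig,
) -> list[dict[str, object]]:
    """Return new rows without the columns whose name ends with config.suffix."""
    return [
        {key: value for key, value in row.items() if not key.endswith(config.suffix)}
        for row in rows
    ]


# Invented sample data, independent of any particular data version.
rows = [
    {"id": 1, "drg_description": "example text", "profit": 10.0},
    {"id": 2, "drg_description": "more text", "profit": 12.5},
]

cleaned = drop_description_columns(rows, DropDescriptionColumnsConfig())
print(cleaned)
```

The function knows nothing about file paths or version labels, so the same file can serve any stage that supplies its input and output locations from outside.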

2. Challenges & Solutions

A pipeline with many transformations can become unwieldy in YAML form. Writing each stage by hand is repetitive and error-prone, particularly when stages differ only in a script name or a few Hydra overrides.

Solution: Use a Jinja2 template to generate dvc.yaml programmatically. Each transformation’s configuration is enumerated in Python and then rendered into the template automatically.
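The idea can be sketched without any third-party dependency. The project uses Jinja2; the example below swaps in string.Template from the Python standard library, so it shows the enumerate-then-render pattern rather than the actual template. The stage list, script path, and data paths are all hypothetical.

```python
from __future__ import annotations

from string import Template

# Stand-in for the Jinja2 template: one block of YAML per stage.
STAGE_TEMPLATE = Template(
    "  ${name}:\n"
    "    cmd: python ${script} transformations=${transformation}\n"
    "    deps:\n"
    "      - ${script}\n"
    "      - ${input_path}\n"
    "    outs:\n"
    "      - ${output_path}\n"
)

# Hypothetical enumeration of stages; only these entries differ between stages.
STAGES = [
    {
        "name": "v1_drop_description_columns",
        "transformation": "drop_description_columns",
        "script": "scripts/run_step.py",
        "input_path": "data/v0/v0.csv",
        "output_path": "data/v1/v1.csv",
    },
    {
        "name": "v2_median_profit",
        "transformation": "median_profit",
        "script": "scripts/run_step.py",
        "input_path": "data/v1/v1.csv",
        "output_path": "data/v2/v2.csv",
    },
]


def generate_dvc_yaml(stages: list[dict[str, str]]) -> str:
    """Render every stage through the template and prepend the top-level key."""
    body = "".join(STAGE_TEMPLATE.substitute(stage) for stage in stages)
    return "stages:\n" + body


print(generate_dvc_yaml(STAGES))
```

Adding a stage then means adding one entry to the list instead of hand-writing another YAML block.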

Another challenge involves long-running or computationally expensive jobs. Combining multiple transformations into a single stage forfeits the granularity that makes DVC so powerful. Splitting transformations into individual stages allows DVC to rerun only the stages affected by a change, not the entire pipeline, and it enables more precise tracking of each component.
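The benefit of fine-grained stages can be illustrated with a small graph walk: given the stage that changed, collect it and everything downstream of it. This is a simplified model, not DVC's implementation; DVC decides what to rerun by comparing content hashes recorded in dvc.lock, so it may skip even more. The dependency graph below is hypothetical and simply chains the stage names mentioned earlier.

```python
from __future__ import annotations

from collections import deque

# Hypothetical graph: stage -> stages that consume its outputs.
DOWNSTREAM = {
    "v0_download_and_save_data": ["v0_sanitize_column_names"],
    "v0_sanitize_column_names": ["v1_drop_description_columns"],
    "v1_drop_description_columns": ["v2_median_profit"],
    "v2_median_profit": [],
}


def stages_to_rerun(changed: str, downstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk: the changed stage plus every stage downstream of it."""
    seen = [changed]
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


affected = stages_to_rerun("v1_drop_description_columns", DOWNSTREAM)
untouched = [name for name in DOWNSTREAM if name not in affected]
print("rerun: ", affected)
print("cached:", untouched)
```

If the four transformations were merged into one stage, any change would force all of them to run again.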

3. Key Benefits

  1. Minimal Compute Cost: In a pipeline with 15 stages, a change to one stage reruns only that stage and the stages downstream of it; cached outputs of unaffected stages are reused.
  2. Versioning: Because Git tracks every change to dvc.yaml, reverting to a previous commit restores the exact pipeline configuration, including its transformations and data versions.
  3. Clarity & Maintainability: Each transformation is isolated, and the entire pipeline’s structure is defined in one place, making it easier for teams and other stakeholders to understand and maintain.

Conclusion

A well-structured dvc.yaml file serves as the backbone of any maintainable MLOps pipeline. By separating each stage explicitly and using a template-based approach, duplication is minimized and reproducibility is maximized. Further details on the template logic and stage architecture can be found in the “Master Doc” or by visiting the GitHub repository. Additional articles in this series describe how code is managed to preserve modularity across all pipeline components.

Video: Exploring dvc.yaml: The Engine of a Reproducible Pipeline