Challenge

Legacy data infrastructure managed through shell scripts, cron jobs, and manual monitoring. No centralized visibility into data workflows; failures discovered by business users (“my report is broken”). Dependency management handled via schedules alone; frequent cascading failures. Oncall rotations burned out; no effective runbooks.

Approach

Technical Architecture:

Migration Strategy:

  1. Pilot phase: Migrate 20% of workflows (2 weeks)
  2. Core platform setup: Establish production infrastructure (3 weeks)
  3. Wave migration: Migrate remaining 80% (8 weeks, 3 teams)
  4. Legacy cleanup: Decommission old infrastructure (2 weeks)

Process Improvements:

Team

Results

Delivery Metrics

Technical Impact

Operational Impact

Business Impact

Key Decisions

  1. Airflow over custom orchestration - Open source; large community; ownership benefits
  2. Wave migration approach - Reduced risk; allowed team to learn and iterate
  3. Containerized operators - Isolated from infrastructure; easier to test and maintain
  4. Mandatory code review - Prevented DAG antipatterns; improved quality upfront