Challenge
The product team needed real-time customer event data to power personalization and retention features. The existing batch pipeline had 24-hour latency, causing the loss of time-sensitive opportunities. There was no single source of truth for events: multiple siloed data sources created inconsistency and eroded trust in the data.
Approach
Technical Architecture:
- Apache Kafka cluster (6 brokers) handling event ingestion with 3-way replication
- Spark Streaming for real-time aggregations and sessionization
- Schema Registry for event schema evolution and validation
- Multi-topic design: raw events, processed events, aggregations
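The multi-topic design can be sketched as configuration data. This is a minimal illustration only: the topic names, partition counts, and retention periods below are assumptions, not the project's actual settings; only the 6-broker / 3-way-replication figures come from the architecture above.

```python
# Hypothetical topic layout mirroring the raw -> processed -> aggregated tiers.
# Partition counts and retention periods are illustrative assumptions.
BROKER_COUNT = 6   # cluster size from the architecture notes
REPLICATION = 3    # 3-way replication, as specified

DAY_MS = 24 * 3600 * 1000
TOPICS = {
    "events.raw":        {"partitions": 24, "replication": REPLICATION, "retention_ms": 7 * DAY_MS},
    "events.processed":  {"partitions": 24, "replication": REPLICATION, "retention_ms": 30 * DAY_MS},
    "events.aggregated": {"partitions": 12, "replication": REPLICATION, "retention_ms": 90 * DAY_MS},
}

def valid_topic(cfg, broker_count=BROKER_COUNT):
    # A topic's replication factor cannot exceed the number of brokers.
    return 1 <= cfg["replication"] <= broker_count

assert all(valid_topic(cfg) for cfg in TOPICS.values())
```

Keeping topic definitions in reviewable configuration like this is one way a self-service publishing framework can let teams add events without hand-editing the cluster.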
Project Management:
- Built internal “event streaming guild” with representatives from product, analytics, ML
- Created self-service event publishing framework allowing teams to add events
- Implemented a 4-week rotating “driver” model for platform on-call
- Established SLA: 99.9% uptime, <500ms end-to-end latency (p99)
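The <500ms p99 latency target above can be verified with a simple nearest-rank percentile. A quick sketch; the helper names are hypothetical:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def meets_latency_sla(latencies_ms, threshold_ms=500, p=99):
    # True when the p99 end-to-end latency is under the 500 ms target.
    return percentile(latencies_ms, p) < threshold_ms

# 1 slow event in 100 still passes a p99 check; 2 in 100 does not.
assert meets_latency_sla([100] * 99 + [900])
assert not meets_latency_sla([100] * 98 + [900, 900])
```

In practice the percentile would be computed by the monitoring stack over a sliding window rather than in application code; this only shows what the SLA asserts.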
Team
- Size: 8 people (1 PM, 3 platform engineers, 2 data engineers, 1 analyst, 1 DBA)
- Duration: 4 months (2 months build + 2 months hardening)
- Related teams: 15+ product teams as consumers
Results
Delivery Metrics
- On-time: ✓ MVP launched week 8 (planned week 8)
- Budget: $450K actual vs $420K planned (+7% variance)
- Scope: 100% - All planned features shipped
Technical Impact
- Event latency: <500ms p99 (vs 24-hour batch refresh)
- Platform throughput: Scaled from 0 to 200M events/day in 3 weeks without degradation
- System reliability: 99.92% uptime in first 6 months
- Cluster efficiency: 78% resource utilization with auto-scaling
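As a sanity check on the reliability figures, an uptime percentage translates into a downtime budget with simple arithmetic (the 180-day figure below approximates the six-month window):

```python
def downtime_minutes(uptime_pct, days):
    # Minutes of allowed (or observed) downtime for a given uptime percentage.
    return (1 - uptime_pct / 100) * days * 24 * 60

# The 99.9% SLA allows ~43.2 minutes of downtime per 30-day month;
# the achieved 99.92% over ~180 days corresponds to ~207 minutes total.
print(round(downtime_minutes(99.9, 30), 1))    # 43.2
print(round(downtime_minutes(99.92, 180), 1))  # 207.4
```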
Business Impact
- 4 new personalization features launched powered by real-time data
- Churn reduction: 3.2% improvement in at-risk customer retention
- Revenue impact: $2.4M incremental ARR from retention + new features
- Time-to-market: Reduced feature delivery from 6 weeks to 2 weeks
Key Decisions
- Chose self-managed Kafka over a managed streaming service - Cost savings ($200K/yr) justified the operational complexity; the team gained platform ownership
- Guild governance model - Enabled decentralized adoption while maintaining quality standards
- Enforced schema validation - Avoided downstream data quality issues; caught 12 schema conflicts in first 3 months
- Invested in observability upfront - Kafka + Spark monitoring prevented 4 major incidents
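The schema-validation decision amounts to rejecting incompatible schema changes at publish time. A minimal sketch of one such check, backward compatibility, using simplified dict-based schemas (field names are hypothetical):

```python
def backward_compatible(old_schema, new_schema):
    """New readers must still handle old data: any field added in the
    new schema must carry a default; removing fields is allowed."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    return all(
        f["name"] in old_fields or "default" in f
        for f in new_schema["fields"]
    )

v1 = {"fields": [{"name": "user_id"}, {"name": "event_type"}]}
v2 = {"fields": [{"name": "user_id"}, {"name": "event_type"},
                 {"name": "session_id", "default": None}]}  # added with default: OK
v3 = {"fields": [{"name": "user_id"}, {"name": "event_type"},
                 {"name": "region"}]}                       # added without default: conflict

assert backward_compatible(v1, v2)
assert not backward_compatible(v1, v3)
```

A production Schema Registry performs a fuller check (field types, aliases, transitive compatibility modes); this only illustrates the gatekeeping idea that caught those conflicts before they reached consumers.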