Challenge
The product team needed real-time customer event data to power personalization and retention features. The existing batch pipeline had 24-hour latency, causing the loss of time-sensitive opportunities. There was no single source of truth for events: multiple siloed data sources created inconsistency and eroded trust in the data.
Approach
Technical Architecture:
- Apache Kafka cluster (6 brokers) handling event ingestion with 3-way replication
- Spark Streaming for real-time aggregations and sessionization
- Schema Registry for event schema evolution and validation
- Multi-topic design: raw events, processed events, aggregations
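The multi-topic design can be sketched as configuration data. This is a minimal illustration only: the topic names, partition counts, and retention periods below are assumptions, not the project's actual settings; only the 6-broker / 3-way-replication figures come from the architecture above.

```python
# Hypothetical topic layout mirroring the raw -> processed -> aggregated tiers.
# Partition counts and retention periods are illustrative assumptions.
BROKER_COUNT = 6   # cluster size from the architecture notes
REPLICATION = 3    # 3-way replication, as specified

DAY_MS = 24 * 3600 * 1000
TOPICS = {
    "events.raw":        {"partitions": 24, "replication": REPLICATION, "retention_ms": 7 * DAY_MS},
    "events.processed":  {"partitions": 24, "replication": REPLICATION, "retention_ms": 30 * DAY_MS},
    "events.aggregated": {"partitions": 12, "replication": REPLICATION, "retention_ms": 90 * DAY_MS},
}

def valid_topic(cfg, broker_count=BROKER_COUNT):
    # A topic's replication factor cannot exceed the number of brokers.
    return 1 <= cfg["replication"] <= broker_count

assert all(valid_topic(cfg) for cfg in TOPICS.values())
```

Keeping topic definitions in reviewable configuration like this is one way a self-service publishing framework can let teams add events without hand-editing the cluster.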
Project Management:
- Built internal “event streaming guild” with representatives from product, analytics, ML
- Created self-service event publishing framework allowing teams to add events
- Implemented a 4-week rotating “driver” model for platform on-call
- Established SLA: 99.9% uptime, <500ms end-to-end latency (p99)
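The <500ms p99 latency target above can be verified with a simple nearest-rank percentile. A quick sketch; the helper names are hypothetical:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def meets_latency_sla(latencies_ms, threshold_ms=500, p=99):
    # True when the p99 end-to-end latency is under the 500 ms target.
    return percentile(latencies_ms, p) < threshold_ms

# 1 slow event in 100 still passes a p99 check; 2 in 100 does not.
assert meets_latency_sla([100] * 99 + [900])
assert not meets_latency_sla([100] * 98 + [900, 900])
```

In practice the percentile would be computed by the monitoring stack over a sliding window rather than in application code; this only shows what the SLA asserts.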
Team
- Size: 8 people (1 PM, 3 platform engineers, 2 data engineers, 1 analyst, 1 DBA)
- Duration: 4 months (2 months build + 2 months hardening)
- Related teams: 15+ product teams as consumers
Results
Delivery Metrics
- On-time: ✓ MVP launched week 8 (planned week 8)
- Budget: $450K actual vs $420K planned (+7% variance)
- Scope: 100% - All planned features shipped
Technical Impact
- Event latency: <500ms p99 (vs 24-hour batch refresh)
- Platform throughput: Scaled from 0 to 200M events/day in 3 weeks without degradation
- System reliability: 99.92% uptime in first 6 months
- Cluster efficiency: 78% resource utilization with auto-scaling
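As a sanity check on the reliability figures, an uptime percentage translates into a downtime budget with simple arithmetic (the 180-day figure below approximates the six-month window):

```python
def downtime_minutes(uptime_pct, days):
    # Minutes of allowed (or observed) downtime for a given uptime percentage.
    return (1 - uptime_pct / 100) * days * 24 * 60

# The 99.9% SLA allows ~43.2 minutes of downtime per 30-day month;
# the achieved 99.92% over ~180 days corresponds to ~207 minutes total.
print(round(downtime_minutes(99.9, 30), 1))    # 43.2
print(round(downtime_minutes(99.92, 180), 1))  # 207.4
```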
Business Impact
- 4 new personalization features launched powered by real-time data
- Churn reduction: 3.2% improvement in at-risk customer retention
- Revenue impact: $2.4M incremental ARR from retention + new features
- Time-to-market: Reduced feature delivery from 6 weeks to 2 weeks
Key Decisions
- Chose self-managed Kafka over a managed streaming service - Cost savings ($200K/yr) justified the operational complexity; the team gained platform ownership
- Guild governance model - Enabled decentralized adoption while maintaining quality standards
- Enforced schema validation - Avoided downstream data quality issues; caught 12 schema conflicts in first 3 months
- Invested in observability upfront - Kafka + Spark monitoring prevented 4 major incidents
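The schema-validation decision amounts to rejecting incompatible schema changes at publish time. A minimal sketch of one such check, backward compatibility, using simplified dict-based schemas (field names are hypothetical):

```python
def backward_compatible(old_schema, new_schema):
    """New readers must still handle old data: any field added in the
    new schema must carry a default; removing fields is allowed."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    return all(
        f["name"] in old_fields or "default" in f
        for f in new_schema["fields"]
    )

v1 = {"fields": [{"name": "user_id"}, {"name": "event_type"}]}
v2 = {"fields": [{"name": "user_id"}, {"name": "event_type"},
                 {"name": "session_id", "default": None}]}  # added with default: OK
v3 = {"fields": [{"name": "user_id"}, {"name": "event_type"},
                 {"name": "region"}]}                       # added without default: conflict

assert backward_compatible(v1, v2)
assert not backward_compatible(v1, v3)
```

A production Schema Registry performs a fuller check (field types, aliases, transitive compatibility modes); this only illustrates the gatekeeping idea that caught those conflicts before they reached consumers.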