Rewriting a monolith into event-driven microservices sounds great on a whiteboard. Here is how we actually do it in production — incrementally, safely, and without a feature freeze.
Arjun Mehta
Principal Systems Architect
The promise of event-driven architecture is compelling: loosely coupled services that scale independently, react to changes in real time, and can be developed by autonomous teams. The reality of getting there from a running monolith is far messier. Over the past three years, we have guided eight organizations through this migration, and the single most important lesson is that big-bang rewrites fail. Every successful migration we have delivered follows an incremental strangler fig pattern that runs old and new systems in parallel for months.
The migration begins with event capture at the boundaries of the existing monolith. Rather than refactoring internal code, we instrument the monolith to emit domain events at key state transitions — order placed, payment processed, user registered. These events are published to a message broker like Apache Kafka or Amazon EventBridge, and initially nothing consumes them. This is intentional. The first milestone is simply proving that the event stream is complete and accurate by reconciling events against the monolith database. Getting the event schema right at this stage prevents painful downstream migrations later.
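To make the boundary instrumentation concrete, here is a minimal sketch of emitting a versioned event envelope at a state transition. All names (`emit_event`, `place_order`, the `event_log` list standing in for a real Kafka or EventBridge producer) are hypothetical illustrations, not the actual client API:

```python
import uuid
from datetime import datetime, timezone

# Stand-in for a real broker client; in production this would be something
# like confluent_kafka.Producer or a boto3 EventBridge client.
event_log: list[dict] = []

def emit_event(event_type: str, payload: dict) -> dict:
    """Build a versioned event envelope and publish it to the broker."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,                 # e.g. "order.placed"
        "schema_version": 1,                      # versioned from day one
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    event_log.append(event)                       # broker.produce(...) in production
    return event

# Instrumented at the monolith boundary: the existing code path runs
# unchanged, then the domain event is emitted after the state transition.
def place_order(order_id: str, total_cents: int) -> None:
    # ... existing monolith logic: write to the orders table, etc. ...
    emit_event("order.placed", {"order_id": order_id, "total_cents": total_cents})

place_order("ord-42", 1999)
```

The envelope carries a `schema_version` field from the very first event, which is what makes the later reconciliation and schema-registry work tractable.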
Once the event stream is validated, we begin extracting the first bounded context into a standalone service. The choice of which context to extract first is critical. We look for domains with clear boundaries, minimal write-path coupling to other contexts, and high operational pain in the monolith — frequently changing business logic, scaling bottlenecks, or team ownership conflicts. The new service consumes events from the broker and maintains its own read model, but the monolith remains the system of record for writes. This read-path extraction is low risk because the monolith continues to handle all mutations.
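A read-model consumer for the extracted context might look like the following sketch. The class and event shapes are hypothetical; the two properties that matter are that the service only builds local read state (the monolith still owns writes) and that handling is idempotent, since brokers may redeliver events:

```python
class OrderReadModel:
    """Consumes order events and maintains a local read model.

    The monolith remains the system of record for writes; this
    service answers queries from its own projection of the stream.
    """

    def __init__(self) -> None:
        self.orders: dict[str, dict] = {}
        self.seen: set[str] = set()   # processed event ids, for idempotency

    def handle(self, event: dict) -> None:
        if event["event_id"] in self.seen:   # redelivered event: no-op
            return
        self.seen.add(event["event_id"])
        if event["event_type"] == "order.placed":
            p = event["payload"]
            self.orders[p["order_id"]] = {
                "status": "placed",
                "total_cents": p["total_cents"],
            }
        elif event["event_type"] == "order.shipped":
            self.orders[event["payload"]["order_id"]]["status"] = "shipped"

model = OrderReadModel()
evt = {
    "event_id": "e1",
    "event_type": "order.placed",
    "payload": {"order_id": "ord-7", "total_cents": 500},
}
model.handle(evt)
model.handle(evt)   # redelivery is silently absorbed
```

Because the projection can always be rebuilt by replaying the stream from the beginning, a bug in this read model is recoverable without touching the monolith.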
The write-path migration is where things get interesting. Shifting write responsibility from the monolith to the new service requires careful coordination to avoid data inconsistency. We use a dual-write period where both systems process writes, and a reconciliation job continuously compares their state. Discrepancies are logged, investigated, and resolved before we cut over. The cutover itself uses a feature flag that shifts traffic gradually — 1 percent, 5 percent, 25 percent, 100 percent — with automatic rollback if error rates exceed thresholds. This graduated approach means the blast radius of any issue is contained.
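The two mechanical pieces of this phase can be sketched briefly: deterministic traffic bucketing for the graduated flag, and a field-by-field comparison for the reconciliation job. Both function names and the row shapes are illustrative assumptions, not our actual tooling:

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 100]   # percent of traffic on the new service

def routes_to_new_service(entity_id: str, rollout_percent: int) -> bool:
    """Hash the entity into a stable 0-99 bucket so the same order
    always lands on the same side of the flag during a partial rollout."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent

def reconcile(monolith_row: dict, service_row: dict) -> list[str]:
    """Compare the two systems' state during the dual-write period.

    Any differing field is a discrepancy to log and investigate
    before advancing to the next rollout stage."""
    return [k for k in monolith_row if monolith_row[k] != service_row.get(k)]
```

Hashing the entity ID, rather than picking a random number per request, matters: it keeps all writes for a given order on one system, so the reconciliation job never flags a row that was legitimately split across the two.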
Operational maturity for event-driven systems demands investment in tooling that monoliths never needed. Dead letter queues for failed event processing, distributed tracing that follows an event through multiple services, schema registries that enforce backward compatibility on event payloads, and replay mechanisms for reprocessing historical events after bug fixes. We build this operational foundation in parallel with the migration itself, because an event-driven architecture without observability is a distributed monolith with worse debuggability. By the end of a typical engagement, teams have not just a new architecture but a new operational muscle for running distributed systems.
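Of the tooling listed above, the dead letter queue is the simplest to illustrate. This is a minimal in-memory sketch of the pattern, assuming a bounded retry count and a `dead_letter_queue` list standing in for a real DLQ topic:

```python
from typing import Callable

MAX_ATTEMPTS = 3
dead_letter_queue: list[dict] = []

def process_with_dlq(event: dict, handler: Callable[[dict], None]) -> bool:
    """Retry a handler a bounded number of times, then park the event
    on the DLQ with failure context instead of blocking the stream."""
    last_error = ""
    for _attempt in range(MAX_ATTEMPTS):
        try:
            handler(event)
            return True
        except Exception as exc:          # capture context for triage
            last_error = repr(exc)
    dead_letter_queue.append({
        "event": event,
        "error": last_error,
        "attempts": MAX_ATTEMPTS,
    })
    return False
```

The same parked events are the input to the replay mechanism: once the handler bug is fixed, the DLQ entries are fed back through `process_with_dlq`, which is why keeping handlers idempotent (as in the read-model sketch) is non-negotiable.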
Arjun Mehta
Principal Systems Architect at LUMorion
Writes about engineering best practices and building production systems at scale.