A Zero-Trust Odyssey: Building a Banking App on Event Sourcing in Record Time
There is, I have discovered, a rather exotic pleasure in simultaneously grappling with an architecture that demands near-total infallibility[1] while also racing against a deadline that practically begs for things to break. This was precisely the predicament in which my team and I found ourselves—charged with building a brand-new banking application (yes, real money, real regulators, real everything) using a zero-trust architecture, underpinned by an event-sourcing model, and all deployed on the volatile wonderland that is AWS. We were armed with Go, Kafka, Kubernetes, Istio, and precisely enough caffeine to cause mild heart palpitations. This is the story of how we sprinted through a labyrinth in record time without (mostly) getting lost.
Why Zero-Trust & Event Sourcing? (Because I Wanted It All)
Starting a new venture means you’re free from the baggage of legacy systems. You can also be free from the “reasonable” constraints people typically impose. For me, that meant an event-sourced ledger—where every deposit, withdrawal, or account closure is recorded as an immutable event in a stream—alongside a zero-trust stance that refused to trust any service call by default. A tall order, sure, but it felt like the right way to handle real financial data in a world where breaches happen almost daily.
The Context: Infinite Complexity Meets Zero-Trust
Zero-Trust is one of those phrases that conjures images of paranoid system admins glaring suspiciously at every packet that even thinks about crossing a firewall. In practice, it means each microservice, each request, and each piece of data has to be treated as if it might be carrying a concealed weapon[2]. No assumptions allowed, no "trusted internal networks"—because we simply trust no one.
At the same time, I wanted to build an event-sourced system for all account ledgers—meaning everything is an immutable chain of events. In your typical monolithic bank system, you might track an account’s current balance by updating it in place. Not so with event sourcing: we record every deposit, every withdrawal, every “customer rage-quits and closes account” action as a discrete, time-sequenced event. Then, the system’s “truth” emerges from replaying these events in order. Simple in concept, devilishly tricky in practice.
The big question: Could we scale this approach without, you know, losing data or completely bungling the transaction ordering? Because if there’s one domain where dropping an event is a cardinal sin, it’s finance. And we had to do it all in a fraction of the time we’d usually like to spend on vetting, testing, and triple-verifying.
Event Sourcing: The Idea of an Immutable Ledger
Banking transactions must be absolutely consistent and auditable. With event sourcing, every state change is a discrete, timestamped record. You want to see the entire transaction history? Replay the log. You need an audit trail for compliance? It's already there in chronological order. No partial updates, no hidden manipulations. It's the ideal approach for a ledger—so long as you can handle the complexity of distributed event processing[3].
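To make the ledger idea concrete, here is a minimal sketch in Go (our service language) of "the truth emerges from replaying events." The event shape and field names are illustrative, not our production schema.

```go
package main

import (
	"fmt"
	"time"
)

// EventType and LedgerEvent are illustrative; real schemas carried far more.
type EventType string

const (
	Deposited     EventType = "deposited"
	Withdrew      EventType = "withdrew"
	AccountClosed EventType = "account_closed"
)

type LedgerEvent struct {
	AccountID   string
	Type        EventType
	AmountCents int64
	At          time.Time
}

// Balance is never stored anywhere. It is derived on demand by folding the
// immutable event stream in order, which is the whole point of the model.
func Balance(events []LedgerEvent) int64 {
	var cents int64
	for _, ev := range events {
		switch ev.Type {
		case Deposited:
			cents += ev.AmountCents
		case Withdrew:
			cents -= ev.AmountCents
		}
	}
	return cents
}

func main() {
	history := []LedgerEvent{
		{"acct-42", Deposited, 10_000, time.Now()},
		{"acct-42", Withdrew, 2_500, time.Now()},
	}
	fmt.Println(Balance(history)) // 7500: the "truth", replayed from the log
}
```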
The Kafka Conundrum: Ordering, Exactly-Once, & Other Unicorns
We harnessed Kafka as the central nervous system for our events. Kafka is typically lauded for its throughput and partitioned log model. But once you mix in exactly-once semantics—especially across multiple consumers and topics—things get downright labyrinthine[4].
- Event Ordering: If a deposit event arrives before a withdrawal event, that's presumably fine, unless the deposit was actually supposed to land after the withdrawal, in which case replaying them out of order leaves the account balance (and any overdraft decision) wrong. So we meticulously partitioned data by account ID, ensuring all events for a given account landed on the same Kafka partition for deterministic ordering (see the sketch after this list).
- Exactly-Once Processing: There's a common joke that "exactly-once" is more mythical than Bigfoot. Yet Kafka's transactions and idempotency keys can approach this ideal. We insisted on using idempotency keys in every message. If a duplicate event slid through, maybe because a consumer retried a half-finished transaction, our handlers would detect that, politely discard it, and proceed as if nothing had happened. This required a global store of processed keys, but it prevented double withdrawals, which, as you can imagine, are seldom welcomed by your customers.
- Retries & Fault Tolerance: Because network flakiness is an inevitability, we wrote a mountain of retry logic that was always mindful of those idempotency keys. The big secret? You can retry as many times as you like if your processing is idempotent, like hitting the "walk" button at a traffic light over and over[5]. No harm done; eventually you cross.
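A rough sketch of both tricks in Go follows, using the segmentio/kafka-go client as a stand-in for whichever producer library you favour. The topic name, event fields, and the in-process key store are illustrative; in the real system the store of processed keys lived outside the service and was durable.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"sync"

	"github.com/segmentio/kafka-go"
)

// LedgerEvent is a hypothetical event shape; field names are illustrative.
type LedgerEvent struct {
	AccountID      string `json:"account_id"`
	IdempotencyKey string `json:"idempotency_key"`
	Type           string `json:"type"` // "deposit", "withdrawal", ...
	AmountCents    int64  `json:"amount_cents"`
}

// publish keys every message by account ID, so the hash balancer routes all
// events for one account to the same partition and their order is preserved.
func publish(ctx context.Context, w *kafka.Writer, ev LedgerEvent) error {
	payload, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(ev.AccountID),
		Value: payload,
	})
}

// processed stands in for the shared, durable store of handled idempotency keys.
var processed sync.Map

// handle discards duplicate deliveries, which is what makes retries harmless.
func handle(ev LedgerEvent) {
	if _, seen := processed.LoadOrStore(ev.IdempotencyKey, struct{}{}); seen {
		return // a retry or redelivery, not a new transaction; drop it politely
	}
	log.Printf("applying %s of %d cents to %s", ev.Type, ev.AmountCents, ev.AccountID)
	// ... update the projection / read model here ...
}

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "ledger-events", // hypothetical topic name
		Balancer: &kafka.Hash{},   // same key -> same partition -> deterministic order
	}
	defer w.Close()

	ev := LedgerEvent{AccountID: "acct-42", IdempotencyKey: "dep-0001", Type: "deposit", AmountCents: 2_500}
	if err := publish(context.Background(), w, ev); err != nil {
		log.Fatal(err)
	}
	handle(ev)
	handle(ev) // duplicate delivery: silently ignored
}
```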
Embracing Zero-Trust: The Sidecar Ballet in Istio
We used Istio as our service mesh to bring the concept of zero-trust from theory into reality. For the uninitiated, Istio can be seen as a behemoth that orchestrates:
- mTLS (Mutual Transport Layer Security) for every microservice call, so nobody can eavesdrop or impersonate anything else.
- RBAC (Role-Based Access Control) to ensure each request is properly authenticated and authorized.
- Sidecar Proxies that intercept every request, from “service A calls service B” down to “service A calls itself by accident.”
So, in effect, each microservice is guarded by a vigilant gatekeeper (the sidecar), which enforces the cryptographic handshake and checks policies before letting a single packet pass. The real excitement kicked in when we discovered how easy it was to misconfigure RBAC in such a way that you end up locking out your own services[6]. A flurry of Slack messages about "Why is the transaction processor returning 403?" ensued. Ultimately, we wrote a half-dozen test harnesses to confirm that each new policy did exactly what we intended, because in a zero-trust world, a single slip-up can become an indefinite lockdown.
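For flavour, here is the shape of one of those harnesses, boiled down to a plain Go test that calls services through the mesh and asserts the verdict a policy change is supposed to produce. The service URLs and expectations below are made up for illustration, not our real endpoints.

```go
package policytest

import (
	"net/http"
	"testing"
)

// TestLedgerPolicyVerdicts is run after every RBAC change: each case states
// which call should be allowed (200) and which should be refused (403).
func TestLedgerPolicyVerdicts(t *testing.T) {
	cases := []struct {
		name string
		url  string
		want int
	}{
		{"orchestrator may read ledger", "http://ledger.bank.svc.cluster.local:8080/accounts/acct-42", http.StatusOK},
		{"notifier may not trigger replay", "http://ledger.bank.svc.cluster.local:8080/admin/replay", http.StatusForbidden},
	}
	for _, c := range cases {
		resp, err := http.Get(c.url)
		if err != nil {
			t.Fatalf("%s: %v", c.name, err)
		}
		resp.Body.Close()
		if resp.StatusCode != c.want {
			t.Errorf("%s: got %d, want %d", c.name, resp.StatusCode, c.want)
		}
	}
}
```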
K8s, Self-Healing, & the Dream of Downtime-Free Deploys
Kubernetes is an ideal fit for big distributed systems—partly because it’s designed to handle container orchestration, and partly because it has that magical self-healing property. If a service crashes, K8s will politely spin it back up as if to say, “There, there, I’ve got you.” But you must feed it the right configuration. We leaned heavily on:
- Deployment Replicas so each microservice had enough clones to handle rolling updates without downtime.
- Readiness & Liveness Probes to ensure that if a container started misbehaving or froze, K8s would swiftly kill and replace it—no tears shed.
- Fault Tolerance courtesy of Kafka’s offset-based replay. A consumer that died mid-transaction resumed from the last committed offset, which drastically reduced data-loss nightmares.
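The probe targets themselves were nothing fancy. A hedged sketch in Go, with illustrative endpoint paths and startup steps:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup work (connecting to Kafka, replaying the
// projection) is done; until then Kubernetes keeps traffic away.
var ready atomic.Bool

func main() {
	// Liveness: "the process is not wedged". Kept deliberately dumb so a
	// flaky downstream dependency doesn't trigger a restart storm.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: "this replica can actually serve requests right now".
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		// ... connect to Kafka, warm caches, then: ...
		ready.Store(true)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The Deployment then points its livenessProbe at /healthz and its readinessProbe at /readyz, so a rolling update never routes traffic to a replica that isn't ready yet.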
The net result? If you design your microservices to be stateless (beyond Kafka’s event store) and keep external dependencies loosely coupled, you can bounce back from failure so quickly that customers might not even notice. This was crucial for a banking system, because if anything goes more than slightly awry, it’s headlines time.
The Maze of Observability: Monitoring & Tracing Across Multiple Regions
The hidden challenge: debugging. When microservices talk to each other across multiple AWS regions—particularly under the watchful eye of a service mesh—there’s a real risk of losing track of who said what, when, and why. We combated this with:
- Prometheus & Grafana: We exported metrics from each service, covering CPU usage, memory consumption, request latencies, even "number of deposits vs. withdrawals per minute." This gave us handy dashboards for diagnosing hotspots (a sketch of the instrumentation follows this list).
- Distributed Tracing with Jaeger: Each request carried a unique trace ID that was passed along from microservice to microservice. So if a transaction got stuck, we could see precisely which service caused the holdup. This was especially useful when a bug manifested in the event replayer and we had to figure out why a deposit took 30 seconds instead of 3, and which data center it happened in.
- AWS IAM & Fine-Grained Access: Because this was zero-trust, we also needed to ensure that only authorized roles could read logs or metrics. That meant careful IAM policies so that random microservices didn't start sniffing each other's internals. It's a delicate dance, writing an IAM policy that's strict enough to be safe but not so strict that you end up in a ticket labyrinth requesting ephemeral override tokens.
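A cut-down sketch of the Prometheus side in Go, using the standard client_golang library; the metric name and port are illustrative:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// transactionsTotal counts ledger events by type; a Grafana panel applying
// rate() over it yields the "deposits vs. withdrawals per minute" view.
var transactionsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ledger_transactions_total",
		Help: "Ledger events processed, labelled by type.",
	},
	[]string{"type"},
)

func main() {
	prometheus.MustRegister(transactionsTotal)

	// Wherever an event is applied:
	transactionsTotal.WithLabelValues("deposit").Inc()
	transactionsTotal.WithLabelValues("withdrawal").Inc()

	// Prometheus scrapes this endpoint; Grafana charts the result.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```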
Doing It All… Fast
Perhaps the most head-spinning aspect was the time crunch. Normally, you’d want to do extended load testing, thorough pen-testing, and multiple staging cycles, right? We definitely tried, but we had a fraction of the usual runway. So we:
- Automated Everything: From CI/CD pipelines that built Docker images and deployed them to Kubernetes, to automated Kafka topic creation scripts, to security scanning on every commit. We had zero time for manual tinkering.
- Parallelized Development: Different squads owned specific microservices (e.g., Payment Orchestrator, Ledger Updater, Notification Manager) and hammered them out concurrently, ensuring consistent contract definitions via an OpenAPI spec. This let us code in parallel without stepping on each other's toes—most of the time, anyway.
- Feature Flags: We used them liberally. If a feature wasn't 100% tested, we could disable it in production and turn it on later. This was crucial to avoid blocking the entire release over one uncertain piece of functionality.
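We won't pretend our flag service was quite this simple, but a minimal sketch of the check in Go, driven here by a hypothetical FEATURES environment variable and an invented flag name, shows the shape:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// flags is parsed once at startup from a hypothetical FEATURES environment
// variable, e.g. FEATURES="instant-transfers,new-onboarding".
var flags = parseFlags(os.Getenv("FEATURES"))

func parseFlags(raw string) map[string]bool {
	out := map[string]bool{}
	for _, f := range strings.Split(raw, ",") {
		if f = strings.TrimSpace(f); f != "" {
			out[f] = true
		}
	}
	return out
}

// Enabled reports whether a flag is switched on for this deployment.
func Enabled(name string) bool { return flags[name] }

func main() {
	if Enabled("instant-transfers") { // flag name is illustrative
		fmt.Println("instant transfers live")
	} else {
		fmt.Println("instant transfers dark until FEATURES says otherwise")
	}
}
```

An untested code path simply stays dark until the flag flips, which is what let us ship the rest of the release on schedule.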
Did we almost lose our minds? Absolutely. But the sense of collective adrenaline was also somewhat exhilarating. It’s like trying to build a race car while simultaneously driving it around the track, hoping all the wheels stay on.
Final Thoughts: A Triumph of Organized Paranoia
Some folks say "paranoia" like it's a bad thing, but in a zero-trust environment handling real financial transactions, paranoia[7] becomes the guiding principle that keeps you afloat. Every microservice is suspicious until proven otherwise, every event must be validated and replayable, and everything, down to the last container, needs to be ephemeral and bulletproof. It's a tall order, but somehow we delivered.
Of course, none of this was a solo effort—far from it. We had specialists for InfoSec, for Kafka, for K8s. We had code reviewers who spotted the smallest oversight in an Istio policy, testers who hammered the system with synthetic transaction loads, and architects who woke up in the middle of the night scribbling flowcharts about event ordering. That synergy was what let us build, test, and deploy an entire banking ecosystem at warp speed.
In the end, we were left with a system that was simultaneously robust (thanks to event sourcing and K8s self-healing) and paranoid (thanks to zero-trust everything). And yes, we definitely encountered bumps along the way—like the time half the microservices refused to talk because a single Istio policy snippet was reversed[8]. But we recovered, we learned, and we delivered a new kind of bank infrastructure: one that's suspicious by design, unstoppable by resilience, and presumably more stable than many of the humans building it.
Footnotes
[1] Near-total infallibility here meaning: we must never lose a transaction or misplace so much as a penny, under penalty of regulatory nightmares and outraged tweets.
[2] Metaphorically. Though sometimes it feels literal if you've read enough security incident reports.
[3] Especially if you're also paranoid about security, because event logs and streams can reveal patterns an attacker might exploit if you're not cautious.
[4] The usual disclaimers about "guaranteed ordering" apply: i.e., "guaranteed so long as you carefully manage partitions, avoid cross-partition queries, and recite your favorite incantation thrice before bed."
[5] Unless your local municipality is particularly vindictive about jammed traffic signals.
[6] Which ironically fulfills the "zero-trust" ethos a bit too well, making your system think everything is suspect—even legitimate calls.
[7] I like to think of it as constructive paranoia: constantly imagining worst-case scenarios and then building guardrails against them.
[8] The dreaded "deny: all" rule typed where we meant "allow: all". The result: total meltdown.
© Alexander Cannon – All disclaimers disclaimable, completed in record time with only minimal hair loss.