TL;DR

Every service in our platform was sharing one of two credentials: an atlasAdmin MongoDB user or the RDS master password. I replaced both with per-service scoped roles — MongoDB via Atlas Kubernetes Operator + ESO, RDS via IRSA + RDS Proxy — without touching application code and without a single minute of downtime. Along the way I considered Teleport and HashiCorp Boundary, and chose not to use either. Here’s the full reasoning.


The Problem Nobody Talks About

There’s a credential pattern that’s embarrassingly common in microservice platforms, and almost nobody admits to it until something goes wrong: every service shares the same database admin credential.

In our case, it looked like this. MongoDB Atlas had a single user — admin, atlasAdmin role — and its password lived in AWS Secrets Manager as one secret, referenced by every service that touched MongoDB. RDS was the same story: the Pulumi provisioning stack grabbed the master credentials and wrote them verbatim into ~25 service secrets.

Why does this happen? Because it’s zero friction. When you’re moving fast and spinning up a new service, the path of least resistance is to hand it the credential that already works. The “we’ll scope this properly later” decision accumulates across every team until you have a platform where any single compromised pod has full admin access to all customer data across all tenants. Credential rotation means coordinating a simultaneous restart of every service. There’s no audit trail distinguishing which service dropped a collection versus which one just ran a query.

That’s the silent blast radius. And it’s completely fixable — it just requires someone to sit down and actually design the access model before implementing it.


Requirements Before Touching Code

Before writing a single line of Pulumi or YAML, I wrote an RFC and an ADR. Not because I love documentation — because scoping DB access across 25+ services touches every team simultaneously, and you do not want to discover your assumptions were wrong mid-migration.

The non-negotiables I landed on:

  • Per-service scoped roles — a compromised pod may only reach its own schema
  • Automated rotation — no human-coordinated rotation events, ever
  • Zero-downtime migration — parallel-role approach; old credential stays valid for 2 weeks post-cutover per service
  • Developer experience must not degrade — local dev runs identically; staging/prod access via aws sso login (already the daily habit)
  • SOC2 CC6.x evidence — per-operation audit trail to Datadog, quarterly access review runbook



Two Databases, Two Patterns, One Delivery Plane

The platform runs MongoDB Atlas and RDS PostgreSQL. They’re fundamentally different systems, so the management plane differs — but I was deliberate about keeping the delivery plane identical: every service gets a Kubernetes Secret mounted as env vars. The source of that secret differs; the shape doesn’t.
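
To make "the shape doesn't differ" concrete, here is a minimal sketch of how a service might read both sets of connection settings from environment variables. Every variable name (MONGODB_HOST, MONGODB_USERNAME, MONGODB_PASSWORD, PGHOST, PGPORT, PGDATABASE, PGUSER) and both dataclasses are assumptions for illustration, not names taken from the platform.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class MongoSettings:
    host: str
    username: str
    password: str


@dataclass(frozen=True)
class PostgresSettings:
    host: str
    port: int
    database: str
    username: str  # no password field: IAM auth supplies the credential


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[MongoSettings, PostgresSettings]:
    """Read DB settings from env vars populated by a mounted K8s Secret."""
    mongo = MongoSettings(
        host=env["MONGODB_HOST"],          # hypothetical variable names
        username=env["MONGODB_USERNAME"],
        password=env["MONGODB_PASSWORD"],
    )
    postgres = PostgresSettings(
        host=env["PGHOST"],
        port=int(env.get("PGPORT", "5432")),
        database=env["PGDATABASE"],
        username=env["PGUSER"],
    )
    return mongo, postgres
```

The service code is the same whichever backend produced the Secret; only the set of keys differs.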

MongoDB: Atlas Operator + ESO + Fixed Username

The Atlas Kubernetes Operator manages Atlas users as CRDs. The key insight that makes the whole pattern work: the operator needs a K8s Secret containing the password before it creates the Atlas user, and the service needs that same secret to connect. One secret, two consumers.

Pulumi generates password → Secrets Manager (prod/mongo-connectors-password)
       │
       ▼
ESO syncs → K8s Secret (atlas-connectors-password)
       │                    │
       ▼                    ▼
Atlas Operator        Service pod
creates Atlas user    reads password
connectors_prod_rw    as env var

Username convention: <service>_<env>_<role> — for example connectors_prod_rw. This never changes. Only the password rotates (every 90 days). Service configuration is static; the only thing that moves is the secret value.
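
A short sketch of how the convention could be enforced and turned into an AtlasDatabaseUser manifest. The allowed environments (dev, staging, prod), the allowed roles (ro, rw), the service-name pattern, and the project name are assumptions; the manifest fields (projectRef, username, passwordSecretRef, roles) follow the Atlas Kubernetes Operator's AtlasDatabaseUser resource. The operator typically also expects the password Secret to carry the atlas.mongodb.com/type: credentials label, which ESO would need to set on the Secret it creates.

```python
import re

# Assumption: lowercase service names, three environments, two role suffixes.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9]*_(dev|staging|prod)_(ro|rw)$")


def db_username(service: str, env: str, role: str) -> str:
    """Build a <service>_<env>_<role> username and reject anything off-convention."""
    name = f"{service}_{env}_{role}"
    if not USERNAME_RE.fullmatch(name):
        raise ValueError(f"username {name!r} does not follow <service>_<env>_<role>")
    return name


def atlas_database_user(service: str, env: str, role: str, database: str, project: str) -> dict:
    """Return an AtlasDatabaseUser manifest scoped to a single database."""
    username = db_username(service, env, role)
    return {
        "apiVersion": "atlas.mongodb.com/v1",
        "kind": "AtlasDatabaseUser",
        # Kubernetes object names cannot contain underscores.
        "metadata": {"name": username.replace("_", "-")},
        "spec": {
            "projectRef": {"name": project},
            "username": username,
            # Same Secret the service pod reads: one secret, two consumers.
            "passwordSecretRef": {"name": f"atlas-{service}-password"},
            "roles": [
                {
                    "roleName": "readWrite" if role == "rw" else "read",
                    "databaseName": database,
                }
            ],
        },
    }
```

Calling atlas_database_user("connectors", "prod", "rw", "connectors", "my-project") yields a user named connectors_prod_rw that references atlas-connectors-password, matching the diagram above.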

Rotation propagation was the detail I almost missed. When ESO syncs a new password, the Atlas Operator reconciles and updates the Atlas user — but the running pod is still holding the old password in memory. The answer was already deployed: Reloader (Stakater), which runs cluster-wide via an ArgoCD ApplicationSet with clusters: {}. It watches K8s Secrets and triggers rolling restarts when values change. Zero manual steps on rotation.
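
Reloader normally decides which workloads to restart from annotations on the workload itself. The sketch below adds the per-secret annotation (secret.reloader.stakater.com/reload) to a Deployment manifest held as a Python dict. Whether this platform uses the per-secret annotation or Reloader's auto mode isn't stated, so treat the choice as an assumption.

```python
import copy

RELOAD_ANNOTATION = "secret.reloader.stakater.com/reload"


def with_secret_reload(deployment: dict, secret_name: str) -> dict:
    """Return a copy of a Deployment manifest that Reloader will restart
    when the named Secret changes. The input manifest is not modified."""
    patched = copy.deepcopy(deployment)
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    # The annotation value is a comma-separated list of Secret names.
    watched = [s for s in annotations.get(RELOAD_ANNOTATION, "").split(",") if s]
    if secret_name not in watched:
        watched.append(secret_name)
    annotations[RELOAD_ANNOTATION] = ",".join(watched)
    return patched
```

Applied to the connectors Deployment with atlas-connectors-password, a rotated password results in a rolling restart without anyone touching the service.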

MongoDB credential rotation — Secrets Manager → ESO → K8s Secret → Atlas Operator + Reloader

RDS: IRSA + RDS Proxy

RDS has a feature MongoDB doesn’t: native IAM database authentication. A pod with the right IAM role can connect without a password — it generates a 15-minute token on connection. The iam_database_authentication_enabled flag was already set on all our prod RDS instances. The IRSA infrastructure (KubernetesServiceAccount component) was already built for S3 and SQS access.

The catch: the token is only valid for 15 minutes. RDS checks it when a connection is opened, so a long-lived pool that opens new connections later fails unless the application generates a fresh token each time. Enter RDS Proxy: it sits between the pod and RDS, handles token refresh transparently, and the application just sees a normal PostgreSQL connection string. No app code changes. No credential in any Secret.

Pod (IRSA role) → assumes IAM role → RDS Proxy → RDS
                                      ↑ handles 15-min token refresh
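
For contrast, this is roughly the logic each service would need to carry if it handled the 15-minute token itself. It is a sketch, not code from the platform: IamTokenCache and the 60-second refresh margin are assumptions, and the generate callable stands in for a real token generator such as boto3's RDS client method generate_db_auth_token.

```python
import time
from typing import Callable, Optional

TOKEN_TTL_SECONDS = 15 * 60   # RDS IAM auth tokens are valid for 15 minutes
REFRESH_MARGIN_SECONDS = 60   # assumption: refresh one minute before expiry


class IamTokenCache:
    """Hand out a cached token and regenerate it shortly before it expires."""

    def __init__(self, generate: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._generate = generate
        self._clock = clock
        self._token: Optional[str] = None
        self._issued_at = 0.0

    def get(self) -> str:
        now = self._clock()
        too_old = now - self._issued_at >= TOKEN_TTL_SECONDS - REFRESH_MARGIN_SECONDS
        if self._token is None or too_old:
            self._token = self._generate()
            self._issued_at = now
        return self._token
```

A connection pool would call get() every time it opens a new connection. Multiply that by every service and driver in the platform and the appeal of doing it once, in the proxy, is clear.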

The management plane: Pulumi creates the per-service PostgreSQL role and GRANT statements in postgresql_provision/provision.py. ESO delivers the connection metadata (host, port, db name) — but not the password, because there isn’t one.
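
The contents of provision.py aren't shown here, so the following is only a sketch of the kind of statements a per-service role needs. The function name, the privilege set, and the iam_auth flag are assumptions. rds_iam is the role RDS provides to enable IAM authentication for a PostgreSQL user; whether it belongs on the role depends on how the proxy authenticates to the database.

```python
import re
from typing import List

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _quote(name: str) -> str:
    """Allow only simple lowercase identifiers, then double-quote them."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def provision_statements(role: str, schema: str,
                         read_only: bool = False, iam_auth: bool = True) -> List[str]:
    """Return SQL that creates a role limited to one schema."""
    r, s = _quote(role), _quote(schema)
    privileges = "SELECT" if read_only else "SELECT, INSERT, UPDATE, DELETE"
    statements = [f"CREATE ROLE {r} LOGIN"]
    if iam_auth:
        statements.append(f"GRANT rds_iam TO {r}")
    statements += [
        f"GRANT USAGE ON SCHEMA {s} TO {r}",
        f"GRANT {privileges} ON ALL TABLES IN SCHEMA {s} TO {r}",
        # Covers tables created later by the role that runs this statement.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {s} GRANT {privileges} ON TABLES TO {r}",
    ]
    return statements
```

The role can reach its own schema and nothing else, which is the "compromised pod may only reach its own schema" requirement expressed as SQL.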

One thing I noted in the ADR but deferred: EKS Pod Identity is a cleaner alternative to IRSA for new clusters. Instead of annotating every service account with an IAM role ARN, you create an EksPodIdentityAssociation at the cluster layer — the binding lives outside Kubernetes manifests, trust policies don’t need per-cluster OIDC conditions, and AWS’s tooling handles credential delivery via a DaemonSet. The application code doesn’t change; it’s a pure infrastructure refactor. If I were building this from scratch on a new cluster, I’d use Pod Identity from day one. For an existing platform, IRSA works and the migration cost isn’t worth it until you have dedicated headspace.
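
The difference shows up most clearly in the IAM trust policy. The sketch below builds both as plain dicts; the function names and arguments are illustrative. The IRSA policy has to name the cluster's OIDC provider and the exact service account, while the Pod Identity policy trusts the pods.eks.amazonaws.com service principal and is identical for every cluster.

```python
def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy for IRSA. oidc_issuer is the issuer without the https:// prefix."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Per-cluster conditions: these change for every cluster.
                f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


def pod_identity_trust_policy() -> dict:
    """Trust policy for EKS Pod Identity: no cluster-specific fields at all."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }
```

With Pod Identity, the binding between role and service account moves out of the trust policy and into the association created at the cluster layer.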


The Decision I Almost Got Wrong

Early on I considered having each service deploy its own AtlasDatabaseUser CRD via Helm — self-service, decoupled, teams own their DB users. It seemed elegant until I traced the dependency chain.

The Atlas Operator requires a K8s Secret with the password to exist before the CRD is applied. If the Helm chart creates the CRD, it also needs to create the Secret. If it creates the Secret, it needs a password. Where does the password come from? Either Helm generates it (regenerates on every helm upgrade, causing password churn on every deployment) or it pulls from Secrets Manager (which must be pre-populated — a centralized step anyway).

Per-service Helm doesn’t eliminate the centralized provisioning requirement. It just hides it and couples DB privilege changes to service deployments, where reviewers are focused on application code, not IAM scope creep.

Centralized is the right answer. All AtlasDatabaseUser CRDs live in the infra/argocd repo with CODEOWNERS enforcement on the atlas/users/ path. DB privilege changes are explicit, separately reviewed, and independently auditable. This is a feature, not bureaucracy.


Break-Glass Without New Tooling

The hardest part of any least-privilege design is the break-glass path — how does the on-call engineer get full access during a production incident without undermining the entire security model?

I looked at what was already running. There was a GitHub Actions workflow (.github/workflows/on_callers.yaml) that polls PagerDuty every hour and updates Slack usergroups with the current on-call person. Between them, its scripts already called both the PagerDuty API and the Okta API.

The break-glass path turned out to be a third step in the existing workflow:

PagerDuty (prod-db-admin schedule) → polled hourly by GHA
  step 1: get-pagerduty-on-callers  → on_callers.json  ✅ exists
  step 2: set-slack-on-callers       → Slack groups     ✅ exists
  step 3: set-okta-db-admins         → sentra-db-admins ← to build

The slack_assign_on_call_groups.py script is the exact template: same logic, same structure, just targeting the Okta groups API instead of the Slack SDK. One new script, one new GHA step. The on-call engineer gets atlasAdmin and break_glass_admin (RDS) automatically once the hourly poll picks up the start of their shift, and loses both after it ends. No commands to run. Every access fires a P1 Datadog alert.
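
The core of that third step is a set difference between who is in the group now and who is on call. The sketch below is not the real script: the function names are hypothetical, and the two callables stand in for whatever Okta group-membership add and remove calls the script ends up using.

```python
from typing import Callable, Iterable, List, Tuple


def plan_group_changes(current_members: Iterable[str],
                       on_callers: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (users to add, users to remove) so the group matches the on-call list."""
    current, desired = set(current_members), set(on_callers)
    return sorted(desired - current), sorted(current - desired)


def sync_group(current_members: Iterable[str], on_callers: Iterable[str],
               add_user: Callable[[str], None],
               remove_user: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Apply the plan through injected callables (the real ones would call Okta)."""
    to_add, to_remove = plan_group_changes(current_members, on_callers)
    for user_id in to_add:       # grant first so a handoff never leaves the group empty
        add_user(user_id)
    for user_id in to_remove:
        remove_user(user_id)
    return to_add, to_remove
```

Because the function only acts on the difference, running it every hour is idempotent: when nobody's shift has changed, it makes no API calls.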

Break-glass access — PagerDuty on-call schedule drives Okta group membership and Atlas/RDS admin grants


Did I Consider Teleport and HashiCorp Boundary?

Yes. Both are purpose-built for privileged access management and worth evaluating seriously.

Teleport handles database access natively — it issues short-lived certificates, provides session recording, and integrates with your IdP for MFA-gated access. The UX is genuinely good: tsh db connect <db-name> and you’re in. Audit logs are first-class. If you’re starting from scratch with no existing tooling, Teleport is a strong choice.

HashiCorp Boundary takes a different approach — it’s a network-layer proxy that brokers access to targets. You authenticate through Boundary and it proxies your connection. Clean separation of network access from credential management.

Why didn’t I use either? Three reasons:

  1. We already had the pieces. PagerDuty, Okta, GHA, Atlas Operator, ESO, Reloader — every component of the break-glass and credential delivery flow was already running. Adding Teleport or Boundary means operating one more system, training the team on one more tool, and creating a dependency on one more SLA.

  2. New tooling for an already-solved problem. The actual security properties I needed — scoped credentials, automated rotation, JIT break-glass with audit trail — are achievable with what exists. Teleport would improve the UX of the break-glass path, but it wouldn’t change the threat model meaningfully.

  3. Migration cost vs. incremental value. Retrofitting Teleport into an existing Kubernetes platform isn’t trivial. It requires deploying the Teleport operator, configuring database services for each RDS instance and Atlas cluster, and migrating the existing access patterns. That’s months of work for a platform team that already has a defined migration path.

If I were designing a greenfield platform today, Teleport would be in the architecture conversation from day one. Retrofitting it into an existing system with working alternatives is a different calculation.

The honest principle: reach for existing tooling before introducing new dependencies. You can always add Teleport in v3 when the team has bandwidth and the ROI is clear.


The v2 Path: MongoDB Without Passwords

The current design uses username + password for MongoDB service connections — scoped, rotated, delivered via K8s Secret. It’s secure. But it’s not as clean as the RDS IRSA model where there’s no stored credential at all.

MongoDB Atlas supports OIDC Workload Identity Federation: a service pod presents a short-lived OIDC token tied to its workload identity, Atlas validates that token against a configured identity provider, and the connection is established without a password. No Secret. No rotation event. No Reloader restart.

The catch is code changes. Every service needs its MongoDB driver upgraded (PyMongo ≥ 4.7, Node.js driver ≥ 6.3) and its connection string updated to use authMechanism=MONGODB-OIDC. That’s a migration across every MongoDB-consuming service — worthwhile, but not v1 work.

On EKS, the concrete setup looks like this. The pod gets a projected service account token mounted at a known path, with the audience set to mongodb:

volumes:
  - name: mongo-token
    projected:
      sources:
        - serviceAccountToken:
            audience: mongodb      # must match the Atlas IdP audience
            expirationSeconds: 86400
            path: token
containers:
  - volumeMounts:
      - name: mongo-token
        mountPath: /var/run/secrets/mongodb

The application reads that token file and passes it as the OIDC callback — no password, no secret:

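# PyMongo ≥ 4.7
from pymongo import MongoClient
from pymongo.auth_oidc import OIDCCallback, OIDCCallbackContext, OIDCCallbackResult

TOKEN_PATH = "/var/run/secrets/mongodb/token"

class ServiceAccountTokenCallback(OIDCCallback):
    # PyMongo calls fetch() whenever it needs a token, so re-read the file
    # each time to pick up the kubelet's refreshed copy.
    def fetch(self, context: OIDCCallbackContext) -> OIDCCallbackResult:
        with open(TOKEN_PATH) as token_file:
            return OIDCCallbackResult(access_token=token_file.read().strip())

# uri is the service's Atlas connection string, with no username or password in it
client = MongoClient(
    uri,
    authMechanism="MONGODB-OIDC",
    authMechanismProperties={"OIDC_CALLBACK": ServiceAccountTokenCallback()},
)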

Atlas validates the token against the configured OIDC identity provider (your EKS cluster’s OIDC issuer URL), maps the sub claim (system:serviceaccount:<ns>:<sa>) to an AtlasDatabaseUser, and grants the connection. No password anywhere in the chain.
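
When a mapping doesn't work, the first thing to check is what the token actually claims. The helper below is a debugging sketch only: it decodes the JWT payload without verifying the signature, so it must never be used to make an access decision. The function name is an assumption.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying its signature (debugging only)."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT: expected three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Run against the mounted token file, the sub claim should read system:serviceaccount:<ns>:<sa> and the aud claim should contain the audience configured in the projected volume.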

The upgrade path is non-destructive: services migrate one at a time, the Atlas Operator CRD changes from passwordSecretRef to oidcAuthType, and the ESO ExternalSecret for that service’s MongoDB password gets removed. Full parity with RDS in terms of stored credentials: zero.


Lessons from Designing This

A few things I’d tell myself at the start:

Write the ADR before the RFC. I wrote the RFC first and iterated on it extensively. The ADR forced me to commit to specific decisions and document rejected alternatives — that discipline would have shortened the RFC process.

Validate what actually exists before designing what you need. I assumed Atlas Identity Federation was configured. It wasn’t. I assumed RDS IAM auth needed to be enabled. It was already on. Checking the actual state of your infrastructure before writing the design doc saves you from building on false assumptions.

The management plane and delivery plane are separate problems. MongoDB and RDS have different management tools — that’s fine. What matters is that the delivery interface to services is identical. Uniform delivery enables uniform service configuration, which enables uniform operations.

The full RFC and ADR for this design are available at github.com/OrenOren1/db-access-least-privilege — genericised for reuse.


Conclusion

Shared database admin credentials are a solved problem. The tooling to fix them (Kubernetes operators, ESO, IRSA, IAM DB auth) is mature, well-documented, and probably already running in your cluster. The work is in the design: defining the role taxonomy, tracing the credential delivery chain, building the migration strategy, and getting the governance model right.

The hardest part isn’t the technology. It’s convincing your team to treat “the admin credential works fine” as the security debt it actually is — before something forces the conversation.