Scaling authentication from 100 → 10,000 users

The jump from 100 to 10,000 users is the easiest order-of-magnitude scale-up for most SaaS products. Your database is still small, your request rate is still under 100 RPS, and most of your data still fits in memory. But a few specific things start to creak, and they’re the ones that take a production incident to find if you don’t anticipate them.

1. Session-table queries

At 100 users, you have maybe 300 session rows. At 10,000, you might have 100,000 — one per active device per user, plus old expired sessions nobody’s cleaned up.

What to do:

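Index on (user_id, expires_at) for “is this session valid” lookups that filter by user. Lookups by session token should already be covered by the primary key or a unique index on the token.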
Index on expires_at for cleanup sweeps
Run a nightly cron that deletes expired sessions. Without this, the table grows forever.
Consider moving to Redis/Valkey if the session lookup is on your hot path. At 10k users it’s fine in Postgres; at 100k you’ll want an in-memory store.
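
The two indexes and the cleanup job above might look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere. The table layout, the column names (token, user_id, expires_at), and the batch size are assumptions for illustration, not a schema from this post.

```python
import sqlite3
import time


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical sessions table plus the two indexes."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS sessions (
            token      TEXT PRIMARY KEY,
            user_id    INTEGER NOT NULL,
            expires_at INTEGER NOT NULL  -- Unix timestamp in seconds
        );
        -- Serves per-user validity checks.
        CREATE INDEX IF NOT EXISTS idx_sessions_user_expires
            ON sessions (user_id, expires_at);
        -- Serves the cleanup sweep.
        CREATE INDEX IF NOT EXISTS idx_sessions_expires
            ON sessions (expires_at);
        """
    )


def delete_expired(conn: sqlite3.Connection, now=None, batch_size: int = 1000) -> int:
    """Delete expired sessions in small batches and return how many were removed."""
    now = int(time.time()) if now is None else now
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM sessions WHERE token IN ("
            " SELECT token FROM sessions WHERE expires_at < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Deleting in batches is a design choice rather than a requirement: it keeps each transaction short if the first sweep has a large backlog to clear. A nightly cron entry would simply call delete_expired once.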

2. Password hash CPU

Argon2id at commonly recommended parameters takes roughly 100-500 ms per hash, depending on hardware. At 100 users logging in once a day, that’s nothing. At 10,000 with usage spikes, you can end up CPU-bound during Monday morning logins.

What to do:

Separate auth traffic onto its own pool of workers (or a dedicated service). Login spikes shouldn’t crash your API.
Tune Argon2 parameters to your hardware. Target 300ms per hash on a production-sized instance.
Cache a successful authentication for the lifetime of the session, not just for the current request, so a logged-in user doesn’t trigger another password hash.
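
To size that auth worker pool, a back-of-the-envelope calculation helps: at 300 ms per hash, a single core completes only about three logins per second. The sketch below turns that into a function. The names and the 50% headroom default are assumptions for illustration, and it ignores everything a login does besides hashing.

```python
import math


def hashes_per_second(cores: int, seconds_per_hash: float) -> float:
    """Upper bound on password hashes per second for a pool of CPU cores."""
    return cores / seconds_per_hash


def cores_needed(
    peak_logins_per_second: float,
    seconds_per_hash: float,
    headroom: float = 0.5,
) -> int:
    """Cores required so hashing uses at most `headroom` of the pool at peak.

    headroom=0.5 (an assumed default) means the pool is sized to be
    half busy during the worst login spike.
    """
    busy_cores = peak_logins_per_second * seconds_per_hash
    # Round first so floating-point noise does not push the result up a core.
    return math.ceil(round(busy_cores / headroom, 9))


# Example: a spike of 8 logins per second at 0.25 s per hash keeps
# 2 cores fully busy, so a pool sized for 50% utilization needs 4.
spike_cores = cores_needed(peak_logins_per_second=8, seconds_per_hash=0.25)
```

Measure seconds_per_hash on the instance type you actually run in production. The same Argon2 parameters can cost very different amounts of time on different hardware.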

3. IdP rate limits

If you use a third-party IdP (Google, Okta), they enforce per-client rate limits on JWKS fetches, token exchanges, and userinfo lookups. At 100 users you never hit them. At 10k, if you’re fetching JWKS on every request, you will.

What to do:

Cache the JWKS response with a 1-hour TTL. Refetch on cache miss or when you see a new kid.
Don’t call /userinfo on every request — the ID token already has the data you need.
Batch backchannel calls if you’re doing offline token refresh.
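
The JWKS caching rule above can be sketched as a small in-process cache. Everything here is illustrative: the JwksCache class, the injected fetch callable (assumed to return a dict mapping kid to key), and the min_refetch_seconds cooldown are not part of any IdP SDK. The cooldown is an extra assumption, added so that tokens carrying a bogus kid can’t force a fetch on every request.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Cache signing keys by kid; refetch on expiry or on an unknown kid."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, dict]],  # hypothetical: returns {kid: jwk}
        ttl_seconds: float = 3600.0,           # the 1-hour TTL from the text
        min_refetch_seconds: float = 60.0,     # assumed cooldown for unknown kids
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._min_refetch = min_refetch_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = self._fetch()
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        """Return the key for `kid`, or None if the IdP does not publish it."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # Cold or expired cache: always refetch.
            self._refresh()
        elif kid not in self._keys and now - self._fetched_at >= self._min_refetch:
            # Unknown kid: the IdP may have rotated keys, so refetch once
            # the cooldown has passed.
            self._refresh()
        return self._keys.get(kid)
```

In a real service, fetch would wrap an HTTP GET of the IdP’s JWKS URL. In a multi-threaded server you would also want a lock around _refresh so concurrent requests don’t all refetch at once.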

4. Email deliverability

At 100 users, your email deliverability is perfect because you barely send any. At 10k, password resets and magic links hit Gmail’s volume filters. Random blocks start.

What to do:

Authenticate your sending domain (SPF, DKIM, DMARC). Table stakes.
Warm up your sending reputation if you’re using a new sending domain.
Monitor bounce/complaint rates via your ESP. A sudden spike often means a mailing list received the magic-link email and Gmail marked it as spam.

5. Failed login alerting

At 100 users, 20 failed logins in an hour is obviously one confused person. At 10k, it might be a real credential-stuffing attack. Without per-account rate limiting and alerting, you won’t notice.

What to do:

Rate limit per account, not just per IP (see previous posts).
Alert on aggregate failure rate anomalies (via Datadog, Grafana, or whatever you use).
Lock the account and notify its owner once a threshold is crossed.
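
Per-account limiting can be sketched as a sliding window of failure timestamps. This is a minimal in-memory version that assumes a single process. The FailedLoginTracker name and the default threshold and window are placeholders, not recommendations. With more than one app server, the counters would have to live in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class FailedLoginTracker:
    """Sliding-window count of failed logins, keyed by account (not by IP)."""

    def __init__(
        self,
        max_failures: int = 10,         # placeholder threshold
        window_seconds: float = 900.0,  # placeholder window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_failures
        self._window = window_seconds
        self._clock = clock
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def _prune(self, timestamps: Deque[float], now: float) -> None:
        # Drop failures that have aged out of the window.
        while timestamps and now - timestamps[0] >= self._window:
            timestamps.popleft()

    def record_failure(self, account_id: str) -> bool:
        """Record one failed login; return True if the threshold is now reached."""
        now = self._clock()
        timestamps = self._failures[account_id]
        timestamps.append(now)
        self._prune(timestamps, now)
        return len(timestamps) >= self._max

    def is_locked(self, account_id: str) -> bool:
        """True while the account has max_failures or more recent failures."""
        timestamps = self._failures.get(account_id)
        if not timestamps:
            return False
        self._prune(timestamps, self._clock())
        return len(timestamps) >= self._max

    def reset(self, account_id: str) -> None:
        """Clear the counter, for example after a successful login."""
        self._failures.pop(account_id, None)
```

When record_failure returns True, that is the point to lock the account, notify its owner, and emit a metric your alerting can watch in aggregate.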

None of this is rocket science. All of it is easy to defer. Do the index work and cron job this week — they’re the cheap, high-value ones.
