Scaling authentication from 100 → 10,000 users

The jump from 100 to 10,000 users spans two orders of magnitude, and it is the easiest scale-up a SaaS will see. Your DB is fine, your RPS is manageable, your memory is plentiful. But three specific things break if you ignore them, and all three are easy to fix proactively.

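1. Password hashing dominates CPU during login spikes

At 100 users, your password hashing is invisible. At 10,000 on a Monday morning, you might have 200+ logins in the first hour, most of them packed into the same 15 minutes. Argon2id at OWASP-recommended parameters takes 100-500ms per hash. 200 hashes × 300ms = 60 CPU-seconds. Averaged over 15 minutes that is modest, but logins arrive in bursts: ten logins landing in the same second keep a single core busy for 3 seconds, and on a single-core instance every other request in your app waits behind them.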

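Fix: either separate your auth traffic onto its own worker pool, or tune Argon2 parameters to your instance size. Measure: run time echo -n "password" | argon2 somesalt -id -t 3 -m 16 -p 4 on your production-sized box (the reference CLI reads the password from stdin and takes the salt as its first argument; -m 16 means 2^16 KiB, or 64 MiB). Target 300ms per hash. If your server can only do 3 hashes per second, you have a capacity of 10,800 hashes per hour, which is fine for 10k users but not for 100k.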
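The capacity arithmetic in this section can be sketched as a few helper functions. This is a back-of-the-envelope model, not a benchmark: the function names are made up for illustration, and it assumes each hash occupies one core for its full duration.

```python
def hashes_per_hour(hash_ms: float, cores: int = 1) -> float:
    """Upper bound on hashes per hour if the cores did nothing else."""
    return cores * (1000.0 / hash_ms) * 3600.0


def cpu_seconds(logins: int, hash_ms: float) -> float:
    """Total CPU time spent hashing for a number of logins."""
    return logins * hash_ms / 1000.0


def average_utilization(logins: int, hash_ms: float,
                        window_s: float, cores: int = 1) -> float:
    """Fraction of available CPU spent hashing, averaged over the window."""
    return cpu_seconds(logins, hash_ms) / (window_s * cores)


def burst_stall_seconds(simultaneous_logins: int, hash_ms: float,
                        cores: int = 1) -> float:
    """How long the cores stay busy when several logins arrive at once."""
    return simultaneous_logins * hash_ms / 1000.0 / cores


if __name__ == "__main__":
    print(cpu_seconds(200, 300))                # CPU-seconds for the Monday spike
    print(average_utilization(200, 300, 900))   # averaged over 15 minutes
    print(burst_stall_seconds(10, 300))         # ten logins in the same instant
    print(hashes_per_hour(300))                 # hourly ceiling at 300ms per hash
```

The average figure tends to look harmless; the burst figure is the one that explains why requests stall, which is the argument for a separate worker pool.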

2. Session lookups start to add up

At 100 users, your sessions table has 200 rows, the query is a single primary-key lookup, and it’s sub-millisecond. At 10,000 users, if your indexing is wrong, it might be a few milliseconds. Multiplied by every authenticated request, that’s a real slice of your latency budget.

Fix:

Covering index on (user_id, expires_at) for “is this session valid”
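Partial index restricted to live sessions for hot-path queries. Postgres rejects WHERE expires_at > NOW() in an index predicate because NOW() is not immutable, so filter on a stored column instead (for example, a hypothetical revoked_at IS NULL)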
Nightly cleanup DELETE FROM sessions WHERE expires_at < NOW() - INTERVAL '7 days'
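The index and cleanup steps above might look roughly like the following. To keep the sketch runnable without a database server, it uses Python's built-in sqlite3 as a stand-in for Postgres; the table layout, the column names, and the choice to store expires_at as Unix seconds are all assumptions made for illustration.

```python
import sqlite3
import time
from typing import Optional

SEVEN_DAYS = 7 * 24 * 3600


def open_store() -> sqlite3.Connection:
    """Create an in-memory sessions table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions ("
        " token TEXT PRIMARY KEY,"
        " user_id INTEGER NOT NULL,"
        " expires_at INTEGER NOT NULL)"  # Unix seconds
    )
    # Composite index so "does this user have a live session" is answered
    # from the index alone.
    conn.execute(
        "CREATE INDEX sessions_user_expiry ON sessions (user_id, expires_at)"
    )
    return conn


def has_live_session(conn: sqlite3.Connection, user_id: int,
                     now: Optional[int] = None) -> bool:
    """True if the user has at least one unexpired session."""
    now = int(time.time()) if now is None else now
    row = conn.execute(
        "SELECT 1 FROM sessions WHERE user_id = ? AND expires_at > ? LIMIT 1",
        (user_id, now),
    ).fetchone()
    return row is not None


def cleanup(conn: sqlite3.Connection, now: Optional[int] = None) -> int:
    """Delete sessions that expired more than seven days ago."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "DELETE FROM sessions WHERE expires_at < ?", (now - SEVEN_DAYS,)
    )
    conn.commit()
    return cur.rowcount
```

In Postgres the same cleanup would run from a nightly job; passing the current time in as a parameter, as done here, also makes the logic easy to test.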

At 10k users you’re fine in Postgres. At 100k, consider moving the session lookup to Redis. Don’t do it preemptively; measure first.

3. IdP rate limits bite

If you use a third-party IdP (Google, Okta), they enforce per-OAuth-app rate limits:

JWKS endpoint: fetching the public keys to verify ID tokens. If you fetch on every request, you’ll hit the limit.
Token endpoint: exchanging auth codes. Not usually hit, but batch flows can.
UserInfo endpoint: this one bites if you call /userinfo on every request instead of using the ID token claims.

Fix:

Cache JWKS with a 1-hour TTL. Refetch on a cache miss or when you see a kid you don’t recognize.
Stop calling /userinfo on every request. The ID token already has the user’s identity — decode it once on login, store the claims in your session.
Batch calls when doing migrations or bulk operations.
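The JWKS caching rule above can be sketched as a small class. This is a minimal illustration, not a drop-in client: JwksCache, the fetch callable, and the min_refetch_s cooldown are hypothetical names, and the cooldown is an extra assumption added so that tokens carrying random kid values cannot force a refetch on every request.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches JWKS keys by kid with a TTL and refetch-on-unknown-kid."""

    def __init__(self, fetch: Callable[[], dict], ttl_s: float = 3600.0,
                 min_refetch_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch                  # returns a parsed JWKS document
        self._ttl_s = ttl_s                  # 1-hour TTL by default
        self._min_refetch_s = min_refetch_s  # cooldown for unknown-kid refetch
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        jwks = self._fetch()
        self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        """Return the JWK for this kid, or None if the IdP does not list it."""
        now = self._clock()
        age = None if self._fetched_at is None else now - self._fetched_at
        if age is None or age >= self._ttl_s:
            # Cache is empty or stale: refetch.
            self._refresh()
        elif kid not in self._keys and age >= self._min_refetch_s:
            # Unknown kid, possibly a key rotation: refetch, but not too often.
            self._refresh()
        return self._keys.get(kid)
```

In use, fetch would wrap an HTTP GET of the IdP's published JWKS URL, and the returned key would be handed to whatever JWT library verifies the ID token signature.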

Bonus: email deliverability

At 1,000 users, you send maybe 1,000 emails per month. At 10,000, you might send 20,000 (password resets, magic links, verifications, digests). Gmail’s volume filter notices. If your sending domain reputation is weak, deliverability drops.

Fix: authenticate your sending domain (SPF, DKIM, DMARC), warm it up over weeks, and monitor your bounce/complaint rate via your ESP. Separate transactional (auth) from marketing sending.

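Bonus bonus: failed-login alerting

At 100 users, 20 failed logins in an hour is obviously abnormal. At 10,000, the same 20 disappear into the background noise of forgotten passwords, and they might be the start of credential stuffing. Add alerting: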

Rate threshold: failed logins > 100/hour → alert
Account threshold: 10 failed attempts on one account in a day → lock and notify user
Geographic anomaly: successful login from an unusual country → require step-up auth

None of these are sophisticated. All are cheap. Skip them and you’ll find out the hard way during an attack.
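The rate and account thresholds above can be sketched with two sliding windows. This is an in-memory illustration only: FailedLoginMonitor and the action strings are hypothetical, a real deployment would keep the counters in shared storage, and the geographic check is left out because it needs an IP-to-country source.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional

HOUR = 3600
DAY = 24 * HOUR


class FailedLoginMonitor:
    """Tracks failed logins globally (per hour) and per account (per day)."""

    def __init__(self, global_per_hour: int = 100,
                 per_account_per_day: int = 10) -> None:
        self.global_per_hour = global_per_hour
        self.per_account_per_day = per_account_per_day
        self._global: Deque[float] = deque()
        self._by_account: Dict[str, Deque[float]] = defaultdict(deque)

    @staticmethod
    def _trim(events: Deque[float], cutoff: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()

    def record_failure(self, account: str,
                       now: Optional[float] = None) -> List[str]:
        """Record one failed login and return the actions it triggers."""
        now = time.time() if now is None else now

        self._global.append(now)
        self._trim(self._global, now - HOUR)

        per_account = self._by_account[account]
        per_account.append(now)
        self._trim(per_account, now - DAY)

        actions: List[str] = []
        if len(self._global) > self.global_per_hour:
            actions.append("alert")            # more than 100 failures/hour
        if len(per_account) >= self.per_account_per_day:
            actions.append("lock_and_notify")  # 10 failures on one account/day
        return actions
```

The caller decides what "alert" and "lock_and_notify" mean in practice, for example paging on-call and emailing the account owner.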

The pattern

None of the 10k-user problems are exotic. All of them have one-hour fixes. The challenge is noticing them before they cause an incident. Do the fixes proactively and your auth system stays boring all the way to 100k.
