
fix(coderd/externalauth): detect rate-limit 403 and narrow isFailedRefresh#24334

Merged
mafredri merged 7 commits into main from fix/rate-limit-403-detection
Apr 28, 2026
Conversation

@mafredri
Member

@mafredri mafredri commented Apr 14, 2026

ValidateToken treated all 403 responses as "token invalid," including
GitHub rate limits (X-RateLimit-Remaining: 0, Retry-After). When a
rate-limited 403 followed a successful token refresh, the token was
marked invalid and the retry loop exhausted, returning
InvalidTokenError. With PR 1 the token is now saved before
validation, but the caller still sees a false "invalid token" error.

isFailedRefresh included http.StatusForbidden in the status code
fallthrough (Switch 2), destroying tokens when any unknown error code
arrived with a 403. No known provider returns 403 from the token
endpoint.

Stacked on #24332.

This PR:

  • Splits the 401 || 403 check in ValidateToken into separate
    branches. On 403, inspects X-RateLimit-Remaining and Retry-After
    headers. If either indicates a rate limit, returns (true, nil, nil)
    (optimistically valid). Plain 403 without rate-limit headers preserves
    the existing (false, nil, nil) behavior.
  • Adds incorrect_client_credentials and invalid_client to
    isFailedRefresh Switch 1 (error codes), making those permanent
    failure classifications explicit rather than relying on the status
    code fallthrough.
  • Removes http.StatusForbidden from Switch 2 (status codes). Unknown
    error codes with 403 are now treated as transient (safe default: retry
    rather than destroy).
Implementation plan

External Auth Token Resilience: Plan

Research: external-auth-token-resilience-research.md

Problem

Coder's external auth token refresh for GitHub Apps silently discards
successfully refreshed tokens when post-refresh validation fails due to
API rate limits. GitHub rotates refresh tokens on use ("Once you use a
refresh token, that refresh token and the old user access token will no
longer work." -- GitHub docs, Refreshing user access tokens). When the
new token is not saved, the old refresh token in the DB is stale on
GitHub's side. The next refresh attempt fails permanently, destroying
the token and forcing manual re-authentication.

Secondary issues compound this: ValidateToken treats all 403 responses
as "token invalid" (including rate limits), and isFailedRefresh treats
403 as a permanent failure, destroying tokens on transient conditions.

Decisions

D1: Save refreshed token before validating

  • Chosen: Move the DB save from after validation to after
    TokenSource.Token() succeeds.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D2: Distinguish rate-limit 403, separate change

  • Chosen: Detect rate-limit headers in ValidateToken and
    isFailedRefresh 403 paths. Ship as a follow-up, not bundled with D1.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D2a: Return (true, nil, nil) for rate-limited ValidateToken

  • Chosen: When rate-limit headers indicate the 403 is a rate limit,
    return (true, nil, nil) (optimistically valid) rather than an error.
  • Classification: Agent-recommended. Identified during trace phase.
    Research Approach 2 originally proposed returning an error, but trace
    revealed this would cause provisioner failJob regression (line 678:
    non-InvalidTokenError errors fail the build). Returning
    (true, nil, nil) avoids that regression. The token was just issued
    by the IDP; the validation endpoint is failing for a transient reason.
  • Trade-off: If a token is genuinely revoked AND the account is
    simultaneously rate-limited, the revoked token is treated as valid
    during the rate-limit window. Recovery happens when the rate limit
    lifts (next validation correctly returns 403 without rate-limit
    headers).

D3: Narrow isFailedRefresh fallthrough alongside D2

  • Chosen: Add incorrect_client_credentials and invalid_client
    to the known error code list. Remove http.StatusForbidden from the
    status code fallthrough.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D4: Defer RateLimitError type

  • Chosen: Defer. System is correct after D1+D2+D3. UX polish is
    separate scope.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

Implementation Flow

Two PRs. PR 1 is a standalone bug fix. PR 2 depends on PR 1 (stacked)
or lands after PR 1 merges. They can be reviewed in parallel but must
merge in order.

Each stage follows red-green TDD: write a failing test first (red),
then implement the minimum code to make it pass (green), then refactor.
The stages below are ordered accordingly: test stages precede or are
interleaved with implementation stages.

PR 1: Save refreshed token before validating (D1)

Scope: coderd/externalauth/externalauth.go, tests.

Stage 1 (RED): Write failing test for the data loss scenario

Add a test that reproduces Finding 1's chain:

  1. Configure a GitHub-type provider with an expired token and a valid
    refresh token.
  2. Mock TokenSource to return a new token (simulating successful
    refresh with rotated refresh token).
  3. Mock ValidateToken endpoint to return 403 (simulating rate limit).
  4. Call RefreshToken. It should return an error (validation failed).
  5. Assert: the DB contains the NEW access token and NEW refresh token
    (not the old ones). This is the critical assertion -- the whole
    point of the fix.
  6. Call RefreshToken again (simulating next caller). Assert: the
    refreshed token is read from DB, TokenSource does NOT attempt
    another refresh (token is not expired), validation is retried.

Artifact: New test TestRefreshToken/SaveBeforeValidate (or similar).
Test fails on current code (step 5 fails: DB still has old tokens).

Verification: Test compiles and runs. Step 5 assertion fails (red).

Stage 2 (GREEN): Move the DB save

In RefreshToken (line 164), after TokenSource.Token() succeeds
(line 194) and GenerateTokenExtra completes (line 259), save the
token to the DB immediately when the access token changed. This is the
existing UpdateExternalAuthLink call currently at line 288, moved
earlier.

The early save must assign the returned DB row back to
externalAuthLink (as the current save does at line 303:
externalAuthLink = updatedAuthLink). This ensures the condition at
line 288 (token.AccessToken != externalAuthLink.OAuthAccessToken)
becomes false, skipping the now-redundant write.

If the early save fails (DB error), return immediately with the error
(same pattern as current line 300-301). The refresh succeeded on the
IDP side but the token couldn't be persisted. The caller gets a
non-InvalidTokenError, which provisioner treats as failJob and
other callers treat as HTTP 500. This is the correct behavior: the
system is in a degraded state and should surface it.

Note: UpdateExternalAuthLink clears oauth_refresh_failure_reason
(externalauth.sql line 47). This is correct: the refresh succeeded.

Note: UpdateExternalAuthLink has no optimistic lock (writes
unconditionally by provider_id + user_id). For providers that rotate
refresh tokens (GitHub), only one concurrent refresh can succeed, so
races are self-resolving. For providers that don't rotate, both callers
get valid tokens and last-writer-wins is benign.

After saving, proceed to validation as before. If validation fails, the
token is already persisted. If validation succeeds, the save at line
288 is skipped (access token unchanged).
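
The reordered control flow can be sketched with stand-in types. This is a simplified model of Stage 2, not the actual RefreshToken code; link, deps, and the function-valued fields are stand-ins for the real externalauth and database types.

```go
package main

import (
	"errors"
	"fmt"
)

type link struct{ AccessToken, RefreshToken string }

type deps struct {
	refresh  func() (link, error)     // stands in for TokenSource.Token()
	save     func(link) (link, error) // stands in for db.UpdateExternalAuthLink
	validate func(link) bool          // stands in for ValidateToken
}

func refreshToken(d deps, stored link) (link, error) {
	tok, err := d.refresh()
	if err != nil {
		return stored, err
	}
	// Save immediately after a successful refresh, before validation, and
	// assign the returned row back so the later "token changed?" check is false.
	if tok.AccessToken != stored.AccessToken {
		stored, err = d.save(tok)
		if err != nil {
			return stored, err // refresh succeeded on the IDP side, persist failed
		}
	}
	if !d.validate(stored) {
		// The token is already persisted; a later caller can retry validation.
		return stored, errors.New("token failed to validate")
	}
	return stored, nil
}

func main() {
	saved := false
	d := deps{
		refresh:  func() (link, error) { return link{"new-at", "new-rt"}, nil },
		save:     func(l link) (link, error) { saved = true; return l, nil },
		validate: func(link) bool { return false }, // simulate a rate-limited 403
	}
	got, err := refreshToken(d, link{"old-at", "old-rt"})
	fmt.Println(saved, got.RefreshToken, err != nil) // prints "true new-rt true"
}
```

The key property: even when validation fails, the rotated refresh token has already been written, so the Finding 1 data loss cannot occur.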

Artifact: Modified RefreshToken function with the save moved before
the validate: label.

Verification: Stage 1 test now passes (green). Existing
TestRefreshToken subtests pass. The Updates test (line 309 of test
file) exercises the post-refresh validation path and should also pass.

Stage 3 (RED/GREEN): Concurrent behavior test

Write a test (or extend Stage 1's test) that verifies: caller A saves
new token, caller B's stale refresh fails, caller B's destructive
UpdateExternalAuthLinkRefreshToken is a no-op due to the optimistic
lock (WHERE oauth_refresh_token = @old_oauth_refresh_token).
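
The optimistic-lock semantics being asserted can be modeled in a few lines. The map-backed "DB" and the function name are illustrative only; the real write is the SQL UPDATE with the WHERE clause above.

```go
package main

import "fmt"

// clearRefreshToken models UpdateExternalAuthLinkRefreshToken's optimistic
// lock: the destructive write applies only when the stored refresh token
// still equals the caller's old value (WHERE oauth_refresh_token = @old).
func clearRefreshToken(db map[string]string, key, oldRefreshToken string) bool {
	if db[key] != oldRefreshToken {
		return false // stale caller: the winner already rotated the token; no-op
	}
	db[key] = "" // destructive write wins only against the expected old value
	return true
}

func main() {
	db := map[string]string{"github/user1": "rt-new"} // caller A already saved rt-new
	// Caller B, still holding the stale rt-old, attempts the destructive update.
	applied := clearRefreshToken(db, "github/user1", "rt-old")
	fmt.Println(applied, db["github/user1"]) // prints "false rt-new": the no-op preserves the winner's token
}
```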

Artifact: Concurrent test added or existing one verified.

Verification: Run make test RUN=TestRefreshToken with race detector.

PR 2: Rate-limit 403 detection and isFailedRefresh narrowing (D2+D2a+D3)

Depends on PR 1. Can be stacked on top of PR 1's branch or land after
PR 1 merges.

Scope: coderd/externalauth/externalauth.go, coderd/promoauth/,
tests.

Stage 1 (RED): Write failing tests for rate-limit detection

New tests for ValidateToken:

  • Mock endpoint returning 403 with X-RateLimit-Remaining: 0. Assert
    return is (true, nil, nil). Currently fails (returns
    (false, nil, nil)).
  • Mock endpoint returning 403 with Retry-After: 60. Assert return is
    (true, nil, nil). Currently fails.
  • Mock endpoint returning 403 with no rate-limit headers. Assert return
    is (false, nil, nil) (existing behavior preserved). Currently
    passes.

New tests for isFailedRefresh:

  • Mock a RetrieveError with ErrorCode: "incorrect_client_credentials"
    and StatusCode: 200. Assert isFailedRefresh returns true.
    Currently passes (caught by StatusOK fallthrough), but the test makes
    the behavior explicit.
  • Mock a RetrieveError with ErrorCode: "invalid_client" and
    StatusCode: 401. Assert returns true. Currently passes
    (StatusUnauthorized fallthrough).
  • Mock a RetrieveError with ErrorCode: "unknown_code" and
    StatusCode: 403. Assert returns false (transient). Currently
    fails (returns true via StatusForbidden fallthrough).

Artifact: New tests, three failing (red).

Verification: Tests compile. Three fail as expected.

Stage 2 (GREEN): Detect rate limits in ValidateToken

In ValidateToken, before the StatusForbidden check at line 348,
inspect the response for rate-limit indicators:

  • Primary rate limits: X-RateLimit-Remaining: 0 header.
  • Secondary rate limits: Retry-After header present.

If either indicator is present on a 403, return (true, nil, nil) --
optimistically treat the token as valid (Decision D2a). The IDP just
issued the token; the validation endpoint is failing for a known
transient reason.

For 403 responses with neither X-RateLimit-Remaining: 0 nor
Retry-After header, preserve the existing (false, nil, nil) behavior
(likely a genuine token revocation or permission error).

The existing promoauth/github.go already parses X-RateLimit-*
headers (lines 31-71) but only for 401 responses. The header-parsing
logic can be reused or extracted.

Artifact: Modified ValidateToken.

Verification: Rate-limit tests pass (green). Existing
ValidateFailure test still passes.

Stage 3 (GREEN): Expand isFailedRefresh and remove 403

Add to the ErrorCode switch (Switch 1):

  • "incorrect_client_credentials" (GitHub: wrong client_id/secret,
    HTTP 200)
  • "invalid_client" (RFC 6749 Section 5.2: client auth failed,
    HTTP 400/401)

Remove http.StatusForbidden from the StatusCode switch (Switch 2).
No known provider returns 403 from the token endpoint (Finding 12). The
403 case was speculative defense that causes real harm (destroys tokens
on rate-limited refresh attempts).

After expanding Switch 1, all known permanent error codes are explicit.
The 403 removal only affects unknown error codes arriving with 403
status, which will now be treated as transient (function returns
false). This is the safe default: retry rather than destroy.

Artifact: Expanded Switch 1, narrowed Switch 2.

Verification: All three failing tests pass (green). Existing
RefreshRetries test passes.

Stage 4: Integration verification

Run the full external auth test suite. Verify that PR 1 + PR 2 together
handle the complete Finding 1 chain: expired token, successful refresh,
rate-limited validation, token saved, no destruction, recovery when rate
limit lifts.

Artifact: All tests green.

Verification: make test RUN=TestRefreshToken, make test RUN=TestExternalAuth, race detector enabled.

Research document

External Auth Token Resilience: Research Document

Status: Frozen as of 2026-04-14. Changes require plan owner approval.

Problem Context

What exists

Coder's external auth subsystem manages OAuth2 tokens for Git providers
(GitHub, GitLab, Gitea, etc.). The core logic lives in
coderd/externalauth/externalauth.go. Tokens are stored in
external_auth_links (primary key: (provider_id, user_id)) with fields
for access token, refresh token, expiry, and a cached failure reason.

The token lifecycle has three operations:

  1. RefreshToken (line 164): attempts oauth2 refresh if token is expired,
    then validates the new token.
  2. ValidateToken (line 325): sends a GET to the provider's validate URL
    with the access token to confirm it's still good.
  3. isFailedRefresh (line 1256): classifies refresh errors as permanent
    vs. transient.

Multiple callers trigger RefreshToken:

  • Provisioner server during workspace builds (provisionerdserver.go:677)
  • Agent external-auth handler (workspaceagents.go:2101)
  • Template version auth check (templateversions.go:403)
  • User external auth list (coderd/externalauth.go:384)

The externalAuthByID handler (the "Test validate" UI) calls
ValidateToken directly without going through RefreshToken.

What's wrong

A user with both a GitHub App and a GitHub OAuth App configured reports:

"token expired and refreshing failed 2 hours ago with: oauth2:
'incorrect_client_credentials' 'The client_id and/or client_secret
passed are incorrect.'"

And separately:

  • Tokens appear to be destroyed when GitHub rate limits are hit
  • "Test validate" errors out during rate-limited states
  • Manual device re-auth fixes it, but only for 2-3 hours
  • Must re-auth 5-6 times per day

Why it matters

Token destruction is irreversible without manual re-authentication. If
transient conditions (rate limits, temporary server errors) are
misclassified as permanent failures, users are forced into a re-auth loop
that disrupts development workflow.


Finding 1: Rate limit during post-refresh validation silently discards a successfully refreshed token (PRIMARY)

Category: Verified

This is the most critical finding. RefreshToken (lines 194-299)
performs refresh and validation as a two-step sequence. A rate limit
during validation causes the refreshed token to be silently lost, which
then corrupts the refresh token state on GitHub's side.

The chain:

Step 1. GitHub App token expires (~8 hours).

Step 2. TokenSource.Token() (line 194) contacts GitHub's token
endpoint, successfully refreshes. Returns a new access_token and a
new refresh_token. GitHub rotates refresh tokens on use: the old
refresh token is now invalid on GitHub's side.

Step 3. ValidateToken (line 269) is called with the new token against
https://api.github.com/user. This endpoint is rate limited. GitHub
returns 403.

// line 348
if res.StatusCode == http.StatusUnauthorized || res.StatusCode == http.StatusForbidden {
    return false, nil, nil
}

Step 4. The 1-second retry loop (line 264-282) exhausts. Every retry
also gets 403 (rate limits last minutes to hours).

// line 273-285
if !valid {
    if c.Type == string(codersdk.EnhancedExternalAuthProviderGitHub) && r.Wait(retryCtx) {
        goto validate
    }
    return externalAuthLink, InvalidTokenError("token failed to validate")
}

Step 5. Returns InvalidTokenError("token failed to validate") at line
285. Lines 288-299 are never reached. The new token (with new
access_token AND new refresh_token) is never saved to the DB:

// line 288-299 -- NEVER EXECUTED when validation fails
if token.AccessToken != externalAuthLink.OAuthAccessToken {
    updatedAuthLink, err := db.UpdateExternalAuthLink(ctx, database.UpdateExternalAuthLinkParams{
        // ...
        OAuthAccessToken:  token.AccessToken,
        OAuthRefreshToken: token.RefreshToken,
        OAuthExpiry:       token.Expiry,
        // ...
    })

Step 6. DB still has: expired access_token + old refresh_token (which
GitHub already invalidated in step 2).

Step 7. Next call to RefreshToken reads the stale DB record.
TokenSource.Token() sees expired token, attempts refresh with the old
(now invalid) refresh token. GitHub returns an error.

Step 8. The oauth2 library's auth style probing (the AuthStyleCache is
in-memory and may be empty after restarts) first tries Basic auth.
GitHub receives Basic auth with the old refresh token and returns an
error. The specific error code depends on how GitHub classifies a stale
refresh token sent via Basic auth. This can produce
incorrect_client_credentials or bad_refresh_token depending on
GitHub's internal handling.

Step 9. isFailedRefresh returns true (either via the ErrorCode match
for bad_refresh_token or via the StatusCode fallthrough for
http.StatusOK). Refresh token is wiped from DB. Error is cached in
oauth_refresh_failure_reason.

Step 10. All subsequent calls see: expired token, no refresh token,
cached failure reason. Returns: "token expired and refreshing failed
2 hours ago with: ".

This is a data loss bug. A successful refresh (step 2) is discarded
because a transient rate limit on a completely different endpoint
(/user vs /login/oauth/access_token) causes the validation to fail.
The refresh token rotation means there is no recovery without manual
re-authentication.

Influences: All approaches. This is the primary bug to fix.

Finding 2: ValidateToken treats GitHub 403 (rate limit) as "token invalid"

Category: Verified

ValidateToken (line 348-351):

if res.StatusCode == http.StatusUnauthorized || res.StatusCode == http.StatusForbidden {
    // The token is no longer valid!
    return false, nil, nil
}

This returns (false, nil, nil) -- "not valid, no error." GitHub uses
HTTP 403 for at least three distinct conditions:

  • Token revoked / insufficient permissions
  • Primary rate limit exceeded (X-RateLimit-Remaining: 0)
  • Secondary rate limit exceeded (includes Retry-After header)

The code does not inspect X-RateLimit-Remaining, X-RateLimit-Reset,
Retry-After, or the response body ("API rate limit exceeded" message) to
distinguish these cases. All three are treated as "token invalid."

The 1-second retry loop (line 264-282) was added for GitHub read-replica
lag after token refresh (comment at line 274-278). It was not designed
for rate limits and is too short.

Influences: Finding 1 (this is the mechanism that triggers the
silent token discard), Approaches 1, 2, 3.

Finding 3: isFailedRefresh treats 403 as permanent token failure

Category: Verified

isFailedRefresh (line 1267-1284):

switch oauthErr.ErrorCode {
case "bad_refresh_token",        // Github
    "invalid_grant",             // Gitlab & Spec
    "unauthorized_client",       // Gitea & Spec
    "unsupported_grant_type":    // Spec
    return true
}

switch oauthErr.Response.StatusCode {
case http.StatusBadRequest, http.StatusUnauthorized, http.StatusForbidden, http.StatusOK:
    // Status codes that indicate the request was processed, and rejected.
    return true
case http.StatusInternalServerError, http.StatusTooManyRequests:
    // These do not indicate a failed refresh, but could be a temporary issue.
    return false
}

When isFailedRefresh returns true, the refresh token is permanently
destroyed (line 209-222). The code correctly handles 429 as transient,
but GitHub also returns 403 for rate limits. A 403 during refresh falls
through to the status code switch where 403 returns true, causing
permanent token destruction.

The comment says "Status codes that indicate the request was processed,
and rejected." This assumption is incorrect for GitHub's 403 rate limits.

Additionally, unknown error codes (like incorrect_client_credentials)
fall through the ErrorCode switch to the status code match. GitHub
returns token endpoint errors with HTTP 200, so http.StatusOK matches
and classifies them as permanent. This is a catch-all that is too broad.

Influences: Approaches 1, 2, 3.

Finding 4: Device auth and token refresh have an asymmetric client_secret requirement

Category: Verified

formatDeviceTokenURL (line 640-651, called by ExchangeDeviceCode at line 599) only sends client_id:

tok.RawQuery = url.Values{
    "client_id":   {c.ClientID},
    "device_code": {deviceCode},
    "grant_type":  {"urn:ietf:params:oauth:grant-type:device_code"},
}.Encode()

Token refresh goes through oauth2.TokenSource, which uses the
oauth2.Config constructed at line 706-708 with ClientSecret.
Device auth always works; refresh can fail if credentials are wrong.

GitHub App user-to-server tokens expire (~8 hours) and include a refresh
token. GitHub OAuth App tokens do not expire and have no refresh token.

Influences: Finding 1 (explains why device re-auth works but tokens
break again after expiry).

Finding 5: No rate limit awareness in the token lifecycle

Category: Verified

promoauth/github.go (line 31-34) only tracks rate limit headers for
401 (unauthenticated) responses:

if resp.StatusCode != http.StatusUnauthorized {
    return rateLimits{}, false
}

Authenticated request rate limits (5000/hour primary bucket) are
invisible to Prometheus. No code in coderd/externalauth/ reads
X-RateLimit-Remaining, X-RateLimit-Reset, or Retry-After from validation
or refresh responses.

No code throttles, backs off, or queues token validation/refresh
operations. Multiple callers independently trigger validation:

  • gitsync polls every 10 seconds (defaultInterval = 10 * time.Second in worker.go:23, uses its own TokenResolver)
  • Agent handler on every git credential request
  • Provisioner on every workspace build
  • Template version check during workspace creation (polled every 1000ms)
  • User visiting external auth page

Influences: Finding 1 (this is why rate limits are hit so often).

Finding 6: Singleflight was NOT implemented despite PR claim

Category: Verified

PR #22904 description claims singleflight deduplication was added. The
actual commit (53e52ae) only added the optimistic lock
(OldOauthRefreshToken WHERE clause). grep -rn singleflight coderd/externalauth/ returns zero results.

This means concurrent callers (provisioner + agent + template version
check) each create a separate TokenSource instance (line 194) and can
all independently attempt refresh. With GitHub's refresh token rotation,
this creates a race: the first caller's refresh invalidates the old
token, and the second caller's refresh with the now-stale token fails
with bad_refresh_token.

The optimistic lock on UpdateExternalAuthLinkRefreshToken prevents the
losing caller from clobbering a winner's NEW token. But it does not
prevent the losing caller from attempting the refresh in the first place
(which wastes a token endpoint request and can hit the 2000/hr token
endpoint rate limit).

Influences: Finding 1 (concurrent callers increase rate limit
pressure and can trigger the token discard chain).

Finding 7: oauth2 library auth style probing doubles token endpoint traffic

Category: Verified

GitHub's endpoint does not set AuthStyle:

var GitHub = oauth2.Endpoint{
    AuthURL:       "https://github.com/login/oauth/authorize",
    TokenURL:      "https://github.com/login/oauth/access_token",
    DeviceAuthURL: "https://github.com/login/device/code",
}

The AuthStyleCache is in-memory per oauth2.Config. On every Coder
server restart (or first refresh per Config), the oauth2 library probes:
first sends Basic auth (rejected by GitHub with
incorrect_client_credentials), then retries with form-encoded params.
(Recalled from oauth2 library documentation; not empirically tested this session.)
This is normally transparent but doubles token endpoint traffic toward
the 2000/hr limit.

The device flow (ExchangeDeviceCode) uses http.DefaultClient
directly, bypassing the oauth2.Config. It never populates the
AuthStyleCache. So every first refresh after a device auth exchange
triggers probing.

Influences: Finding 1 (contributes to token endpoint rate limit
pressure), explains why incorrect_client_credentials can appear even
with correct credentials (it's the first probe attempt, normally masked
by the retry).

Finding 8: "Test validate" calls ValidateToken directly

Category: Verified

externalAuthByID handler (coderd/externalauth.go line 62-63):

eg.Go(func() (err error) {
    res.Authenticated, res.User, err = config.ValidateToken(ctx, link.OAuthToken())
    return err
})

Calls ValidateToken without attempting a refresh first. If rate limited
(403), returns Authenticated: false. This explains the "Test validate
also errors out" symptom -- it hits the same rate-limited
/user endpoint.

Influences: Approach 3.

Finding 9: PR #15608 introduced the "destroy on failed refresh" behavior

Category: Verified

Commit 78f9f43 (PR #15608) introduced the pattern of zeroing the
refresh token on failure. The commit message:

"Once a token refresh fails, we remove the oauth_refresh_token from
the database. This will prevent the token from hitting the IDP for
subsequent refresh attempts."

The concern was legitimate: prevent repeated failed refreshes from
exhausting rate limits. But the implementation conflates "this refresh
failed permanently" with "this refresh failed right now."

PR #19339 (commit 4926410) added oauth_refresh_failure_reason to
cache and display the original error, producing the exact error format
the user reported.

Influences: Approaches 1, 2, 3.

Finding 10: gitsync has its own rate limit handling

Category: Verified

coderd/x/gitsync/gitsync.go uses a per-group
atomic.Pointer[gitprovider.RateLimitError] to short-circuit remaining
rows when a rate limit is hit. The worker has a 30-second tick timeout
and backs off on errors (120s default, 10min for no-token). This shows
the codebase already has rate-limit-aware patterns not present in the
core externalauth package.

Influences: Approach 2 precedent.

Finding 11: Repercussions of each token failure mode per caller

Category: Verified

Provisioner AcquireJob (provisionerdserver.go:648-689)

Scenario | Code path | User effect
Non-InvalidToken error | failJob(...) at line 679 | Build fails
Blocks for extended period | Hangs at line 677 | Build stuck "Pending"; worker blocked
InvalidTokenError | continue at line 683 | Token silently omitted; Terraform sees access_token=""

The optional flag on coder_external_auth only gates the UI submit
button. If bypassed (API call, auto-start), the build proceeds with an
empty token.

Agent external-auth handler (workspaceagents.go:2056-2128)

Scenario | Code path | User effect
Non-InvalidToken error | HTTP 500 at line 2102 | Git clone fails
Blocks for extended period | Hangs at line 2101 | Git operations stall
InvalidTokenError | HTTP 200 + auth URL | Agent told to re-auth

Template version auth check (templateversions.go:386-416)

Scenario | Code path | User effect
Non-InvalidToken error | HTTP 500 at line 404 | Create Workspace errors
Blocks for extended period | Hangs at line 403 | Loading spinner
InvalidTokenError | Authenticated = false | UI prompts re-auth

listUserExternalAuths (coderd/externalauth.go:370-406)

Iterates ALL links sequentially. Any error sets
Authenticated=false + ValidateError. Blocking stalls the whole page.


Approaches

Approach 1: Save the refreshed token before validating

Description: Persist the refreshed token to the DB immediately after
a successful TokenSource.Token() call, BEFORE calling ValidateToken.
This prevents the data loss bug in Finding 1 where a successfully
refreshed token is silently discarded because validation fails.

Concrete precedent: The code at line 288-299 already saves the token
when validation succeeds. This approach moves the save earlier. The
optimistic lock on UpdateExternalAuthLinkRefreshToken (Finding 6)
provides a model for safe concurrent writes.

Strongest argument for: Fixes the primary bug directly. A successful
refresh (which consumed the old refresh token on GitHub's side) is never
silently lost. Even if validation fails afterward, the new token is
in the DB and can be retried. This is the smallest change that
eliminates the data loss.

Strongest argument against: Saves a token that hasn't been validated
yet. However, validation is a liveness check, not a security gate.
The original ValidateURL comment (PR #9996, commit 45b53c2) says: "ensures an access
token is valid before returning it to the user. If omitted, tokens will
not be validated before being returned." It's optional. No commit or PR
in the history motivates validation as a defense against bad IDP
responses.

The only known validation failure after refresh is GitHub read-replica
lag (the Australian customer incident at line 274-278), which is
transient -- the token IS valid, the API just can't confirm it yet.
Saving first is strictly better for this case.

If a token somehow doesn't work: it stays in the DB, fails validation
on the next call, and the user is prompted to re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.

Makes easy: Fixing the immediate data loss. No new error types, no
new caller handling, no rate-limit detection needed for this specific
fix.

Makes hard: Nothing. This change is additive and does not conflict
with the other approaches. It should be done regardless of what else is
chosen.

Changes to existing codebase: Move db.UpdateExternalAuthLink from
after validation (line 288) to after TokenSource.Token() succeeds
(around line 258). If validation subsequently fails, the token is still
in the DB. If validation succeeds and the token was already saved, no
second write is needed.

Approach 2: Distinguish rate limit 403 from token-invalid 403

Description: Inspect GitHub API response headers and body to
distinguish rate-limit 403 from token-revoked 403 before classifying the
result. Changes both ValidateToken (line 348) and isFailedRefresh
(line 1278) to check for rate limit signals before treating 403 as
permanent.

Concrete precedent: GitHub's documentation specifies that rate limit
responses include X-RateLimit-Remaining: 0 header and response body
containing "message": "API rate limit exceeded". The promoauth
package already parses these headers (for metrics only). The gitsync
package has gitprovider.RateLimitError handling.

Strongest argument for: Prevents the rate-limit 403 from triggering
Finding 1's chain in the first place. ValidateToken would return an
error (not (false, nil, nil)) for rate limits, letting the caller
distinguish "can't verify right now" from "token is dead."

Strongest argument against: GitHub-specific. Other providers signal
rate limits differently. Also, even with correct 403 classification,
the 1-second retry window is still inadequate for rate limits.

Makes easy: Fixing the classification bug. Tokens that are actually
revoked are still promptly invalidated.

Makes hard: Generalizing. Each provider needs its own rate limit
detection.

Changes to existing codebase: ValidateToken checks
X-RateLimit-Remaining before treating 403 as invalid. isFailedRefresh
checks for rate limit indicators before the status code fallthrough.
New tests for 403-with-rate-limit-headers.

Approach 3: Add rate-limit-aware error type for callers

Description: Return a specific RateLimitError from token
operations when a rate limit is detected. Each caller handles it
according to its own constraints: provisioner fails fast with a clear
message, agent enters listen mode, template version check shows "try
later."

Concrete precedent: The gitsync package's
atomic.Pointer[gitprovider.RateLimitError] pattern. The device auth
flow's explicit 429 handling (line 562-575).

Strongest argument for: Gives each caller enough information to
make the right decision. A provisioner build shouldn't wait 60 minutes
for a rate limit. An agent in listen mode already has the infrastructure
to poll.

Strongest argument against: Requires changes at every caller site.
The provisioner, agent handler, template version check, and user auth
list all need new error branches.

Makes easy: Accurate error reporting. Users see "rate limited" not
"please re-authenticate."

Makes hard: Coordination. Without shared rate limit state, each
caller still independently discovers the rate limit.

Finding 12: The status code fallthrough in isFailedRefresh can be safely replaced with explicit error codes

Category: Verified

The status code fallthrough (Switch 2) was added in a single commit
(PR 15608) as a catch-all. Analysis of what each status code catches:

| Status | What it catches beyond Switch 1 | Risk if removed |
| ------ | ------------------------------- | --------------- |
| 400 | invalid_client, invalid_request, invalid_scope (RFC codes not in Switch 1) | Misses spec-defined permanent errors |
| 401 | invalid_client via Basic auth (RFC requires 401 for this) | Misses the RFC-mandated status for bad client credentials |
| 403 | No spec basis; no known provider verified to return 403 from the token endpoint; catches rate limits on API endpoints | Stops destroying tokens on rate-limited refresh |
| 200 | GitHub-specific: all token errors arrive as HTTP 200 | Misses incorrect_client_credentials and future GitHub error codes |

The safe replacement: expand Switch 1 with RFC 6749 Section 5.2 error
codes and GitHub-specific codes, then narrow or remove Switch 2.

Codes to add to Switch 1:

  • "incorrect_client_credentials" (GitHub: wrong client_id/secret)
  • "invalid_client" (RFC 6749 Section 5.2: client auth failed)
  • "invalid_request" (RFC 6749 Section 5.2: malformed request)

After adding these, Switch 2 only catches truly unknown codes. The 403
case can be removed (no spec basis, causes rate limit misclassification).
The 200 case becomes nearly redundant for GitHub (all known codes are
now in Switch 1). The 400/401 cases become redundant for spec providers.

Switch 2 could be kept as a lower-confidence fallback (log a warning
rather than silently treating unknown codes as permanent), or removed
entirely with the understanding that genuinely unknown permanent errors
will be retried until the token expires naturally.

Influences: Approach 2.

Finding 13: Saving before validating is safe

Category: Verified

The only realistic scenario where a refresh returns a "bad" token is
GitHub read-replica lag (the Australian customer incident documented at
line 274-278). In that case the token IS valid; GitHub's read replicas
haven't propagated it yet. Saving first is strictly better: the token
works, validation just can't confirm it yet.

For a genuinely bad token (provider bug): the bad token in the DB still
fails validation on the next call, prompting re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.

GenerateTokenExtra (line 259) is purely local (no network, no
validation dependency). All data needed for the DB save is available
after line 262.

Concurrency is safe: the optimistic lock on
UpdateExternalAuthLinkRefreshToken prevents a losing concurrent
refresher from overwriting the winner's saved token.

UpdateExternalAuthLink clears oauth_refresh_failure_reason, which
is correct: the refresh succeeded (IDP issued a token); only the
validation failed (different endpoint, different concern).
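
The reordering invariant can be sketched with stubs; the names below
are illustrative, and the real flow lives in RefreshToken.

```go
package main

import "fmt"

// link stands in for the external auth link row.
type link struct {
	accessToken  string
	refreshToken string
}

// saveThenValidate persists the freshly refreshed token before running
// the liveness check, so a failed or rate-limited validation can no
// longer discard a working token.
func saveThenValidate(db *link, refreshed link, validate func() bool) bool {
	*db = refreshed // the early DB save
	return validate()
}

func main() {
	db := link{accessToken: "old", refreshToken: "old-r"}
	ok := saveThenValidate(&db, link{accessToken: "new", refreshToken: "new-r"}, func() bool {
		return false // e.g. read-replica lag: cannot confirm yet
	})
	fmt.Println(ok, db.accessToken, db.refreshToken) // false new new-r
}
```

Even when the validation stub fails, the new refresh token survives in
the row, matching the "same outcome as today, but the new refresh token
is preserved" argument above.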

Influences: Approach 1.


Decisions

D1: Save refreshed token before validating

Question: How to prevent the data loss in Finding 1 where a
successfully refreshed token is silently discarded when post-refresh
validation fails?

Options considered: (a) Save before validating, (b) Extend the
validation retry window, (c) Skip validation after refresh entirely.

Chosen: (a) Save before validating.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended based on Finding 1 (data loss chain),
Finding 13 (safety analysis), and the ValidateURL comment (PR #9996,
commit 45b53c2) showing validation is an optional liveness check.
Human agreed: "I agree with your recommendations."

D2: Distinguish rate-limit 403 from token-invalid 403, separate change

Question: Should rate-limit 403 responses be classified differently
from token-revocation 403 responses in ValidateToken and isFailedRefresh?

Options considered: (a) Do it in the same change as D1, (b) Do it
as a separate follow-up change, (c) Don't do it.

Chosen: (b) Separate follow-up change.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended separating because D1 is a clear bug
fix with minimal risk, while D2 changes error semantics at every caller
site. Mixing them makes review harder and rollback riskier. Human
agreed.

D3: Narrow isFailedRefresh status code fallthrough alongside D2

Question: Should the status code fallthrough in isFailedRefresh be
narrowed?

Options considered: (a) Expand Switch 1 with known error codes and
remove 403 from Switch 2, (b) Remove Switch 2 entirely, (c) Keep as-is.

Chosen: (a) Expand Switch 1, remove 403 from Switch 2.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended adding incorrect_client_credentials
and invalid_client to Switch 1, removing http.StatusForbidden from
Switch 2 (no known provider returns 403 from token endpoint per
Finding 12). Same concern as D2 (403 misclassification) applied to the
refresh path. Human agreed.

D4: Defer RateLimitError type for callers

Question: Should a dedicated RateLimitError type be added for
caller-specific handling?

Options considered: (a) Do it now, (b) Defer.

Chosen: (b) Defer.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended deferring because with D1+D2+D3 the
system is correct (no data loss, no false InvalidTokenError for rate
limits). The remaining gap is UX polish ("Failed to refresh" vs. "rate
limited, retry after X"). Touches every caller site for marginal gain.
Human agreed.

Trace document

External Auth Token Resilience: Trace Document

Traces each causal claim in the implementation flow against the code.

Change 1, Stage 1: Move the DB save

Claim: After TokenSource.Token() succeeds (line 194) and
GenerateTokenExtra completes (line 259), save the token to the DB.

Code: Line 194 is token, err := c.TokenSource(ctx, existingToken).Token().
Line 259 is extra, err := c.GenerateTokenExtra(token). Lines 262-268
have a clean insertion point between the GenerateTokenExtra error
check and the retry setup. All required variables (token, extra,
externalAuthLink, c.ID) are in scope.

Holds.


Claim: If validation succeeds after the early save, the existing
save at line 288 is skipped because token.AccessToken == externalAuthLink.OAuthAccessToken.

Code: Line 288: if token.AccessToken != externalAuthLink.OAuthAccessToken.
The early save uses UpdateExternalAuthLink which returns the updated
row (:one query). The implementation must assign the result back to
externalAuthLink (as the current code does at line 303). Then the
condition at 288 is false, skipping the write.

Holds -- with requirement that the early save assigns the DB result
back to externalAuthLink.

Change 1, Stage 2: Test for data loss scenario

Claim: TokenSource can be mocked in tests.

Code: testutil.OAuth2Config has TokenSourceFunc (test file line
101-117). The FakeIDP also controls refresh via oidctest.WithRefresh.

Holds.


Claim (corrected): The Updates test (line 309) exercises the
post-refresh validation path.

Code: The Updates test sets link.OAuthExpiry = expired (a past
time), which causes TokenSource.Token() to attempt a real refresh.
After refresh, ValidateToken is called. This exercises the full
refresh-then-validate path.

Note: an earlier draft referenced ValidateRetryGitHub (line 247),
which tests non-expired token validation retry without refresh.
The plan was corrected to reference Updates instead.

Holds (after correction).

Change 1, Stage 3: Concurrent behavior

Claim: Caller A saves new token, caller B's stale refresh fails,
caller B's destructive write is a no-op.

Code: The UpdateExternalAuthLinkRefreshToken WHERE clause:
AND oauth_refresh_token = @old_oauth_refresh_token. The
@old_oauth_refresh_token comes from
externalAuthLink.OAuthRefreshToken (line 221), which is the value at
read-time. After caller A saves via UpdateExternalAuthLink (which
updates unconditionally by provider_id + user_id), the DB has A's new
refresh token. Caller B's WHERE clause matches against the original
token, which no longer matches the DB row. The write is a no-op.

Holds.
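
The no-op behavior is effectively a compare-and-swap, sketched here
against a map standing in for the database row (illustrative, not the
real sqlc query).

```go
package main

import "fmt"

// updateRefreshToken emulates the optimistic lock of
// UpdateExternalAuthLinkRefreshToken: the write applies only if the
// stored refresh token still equals the value the caller read.
func updateRefreshToken(db map[string]string, key, oldToken, newToken string) bool {
	if db[key] != oldToken {
		return false // stale caller: the WHERE clause matches no row
	}
	db[key] = newToken
	return true
}

func main() {
	db := map[string]string{"link": "rt-v1"}
	// Caller A wins the race and saves its new refresh token.
	fmt.Println(updateRefreshToken(db, "link", "rt-v1", "rt-v2")) // true
	// Caller B read rt-v1 earlier; its destructive write is a no-op.
	fmt.Println(updateRefreshToken(db, "link", "rt-v1", "")) // false
	fmt.Println(db["link"])                                  // rt-v2
}
```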

Change 2, Stage 1: Rate limit detection in ValidateToken

Claim (revised): For rate-limited 403, return (true, nil, nil).

Code: ValidateToken receives the raw HTTP response from
c.InstrumentedOAuth2Config.Do() (line 343). The promoauth
instrumentation only adds metrics counters
(instrumentedTripper.RoundTrip, promoauth/oauth2.go line 281-298). It
does not modify headers or body. Rate-limit headers
(X-RateLimit-Remaining, X-RateLimit-Reset) are available on res.

Returning (true, nil, nil) means the caller treats the token as
valid. Back in RefreshToken, line 273 if !valid is false, so
execution proceeds to line 288 (the DB save, which with Change 1 is
already done). The caller gets a successful result with the token.

This avoids a regression identified during trace: returning an error
would flow through line 270-271 as a non-InvalidTokenError. The
provisioner (line 678) treats non-InvalidToken errors as failJob.
Currently rate-limited validation returns InvalidTokenError which the
provisioner skips (line 682-683). Returning (true, nil, nil) is better
than both.

Holds.

Change 2, Stage 3: Remove 403 from isFailedRefresh

Claim: After removing http.StatusForbidden, a provider returning
403 with an unknown error code from the token endpoint is treated as
transient.

Code: isFailedRefresh at line 1256. With 403 removed from Switch
2: Switch 1 (error codes) -- unknown code, no match. Switch 2 (status
codes) -- 403 removed, no match. Falls through to line 1288:
return false.

isFailedRefresh returning false skips the destructive DB write at
line 209. Execution continues to line 256:
return externalAuthLink, InvalidTokenError(fmt.Sprintf("refresh token: %s", err.Error())).

This returns InvalidTokenError to callers. Provisioner skips the
token (line 682-683). Agent returns auth URL. Template version check
marks as unauthenticated. The refresh token survives in the DB. Next
call retries the refresh.

Holds.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

mafredri

This comment was marked as resolved.


@mafredri force-pushed the fix/rate-limit-403-detection branch from 079cfb9 to 07f0c70 on April 15, 2026 13:23


@mafredri force-pushed the fix/rate-limit-403-detection branch from 07f0c70 to 346317b on April 16, 2026 11:01
Base automatically changed from fix/save-refreshed-token-before-validation to main April 18, 2026 11:28
@mafredri force-pushed the fix/rate-limit-403-detection branch from 346317b to f051fad on April 23, 2026 04:09
@mafredri marked this pull request as ready for review on April 23, 2026 04:09


…fresh

ValidateToken treated all 403 responses as "token invalid," including
GitHub rate limits. isFailedRefresh included 403 in the status code
fallthrough, destroying tokens on rate-limited refresh attempts.

Detect rate-limit indicators (X-RateLimit-Remaining: 0, Retry-After)
on 403 responses in ValidateToken and return optimistically valid.
Add incorrect_client_credentials and invalid_client to isFailedRefresh
error code switch, remove 403 from the status code fallthrough.
Add end-to-end test for RefreshToken with rate-limited validation,
boundary test for non-zero X-RateLimit-Remaining, fix GitHub casing
in comment, and update ValidateToken doc comment to document
optimistic return on rate-limited 403.
…omment

Add Unauthorized_WithRateLimitHeaders test to lock the invariant that
401 always means revocation regardless of rate-limit headers. Fix doc
comment to say valid=true instead of (true, nil, nil) since the
provider-confirmed case returns (true, &user, nil).
@mafredri force-pushed the fix/rate-limit-403-detection branch from d3c147a to 018db10 on April 27, 2026 14:58

… docs

Handle 429 (Too Many Requests) in ValidateToken the same as
rate-limited 403. GitHub can return either status for rate limits;
before this change 403 returned optimistic success but 429 fell
through to the error path, causing provisioner failJob.

Add refresh token assertion to SaveBeforeValidate_RateLimited.
Narrow isRateLimited doc to acknowledge secondary limit edge cases.
Replace "Switch 1"/"Switch 2" jargon with descriptive names.
Expand isRateLimited doc comment to explain why OR (not AND) is
correct, why CDN/WAF false positives are not a practical concern,
and what secondary rate limit patterns are not covered. Prevents
the same design question from arising in future reviews.

Based on investigation of Cloudflare, AWS, Azure, and nginx docs
plus GitHub secondary rate limit behavior across five independent
sources (octokit, go-github-ratelimit, hub4j, community reports).

Update ValidateToken doc to cover the 429 path added in the previous
commit. Use "rate-limited response (403 with rate-limit headers or
429)" instead of counting cases.

Fix misleading comment on test fixture: oauth2.Token.Valid() requires
AccessToken != "" AND not expired, not just non-zero expiry.

Filed coder/internal#1503 to track the observability gap (DEREM-2). Resolving the thread.

🤖 Generated with the help of Coder Agents on behalf of @mafredri.

@mafredri requested review from hugodutka and johnstcn on April 28, 2026 10:17
Comment thread coderd/externalauth/externalauth.go Outdated
Comment thread coderd/externalauth/externalauth_internal_test.go Outdated
Comment thread coderd/externalauth/externalauth.go Outdated
Comment on lines +1305 to +1316
func isRateLimited(resp *http.Response) bool {
	if resp == nil {
		return false
	}
	if resp.Header.Get("Retry-After") != "" {
		return true
	}
	if resp.Header.Get("X-RateLimit-Remaining") == "0" {
		return true
	}
	return false
}

Suggestion for follow-up: could be a useful coderd/util function IsRateLimited(*http.Response) (bool, time.Duration)


Agreed it could be useful if more callers need the check. Currently only used in one place, so extracting it now would be premature.

🤖 Posted using /amend-review skill via Coder Agents.

Trim isRateLimited doc comment per reviewer nit. Remove line number
from test comment (drifts). Convert ValidateToken status code
if-chain to switch.
@mafredri force-pushed the fix/rate-limit-403-detection branch from 92d7ae4 to 84afdbc on April 28, 2026 10:53
@mafredri merged commit 1926b7e into main on Apr 28, 2026
26 checks passed
@mafredri deleted the fix/rate-limit-403-detection branch on April 28, 2026 15:03
@github-actions bot locked and limited conversation to collaborators on Apr 28, 2026