
fix(coderd/externalauth): detect rate-limit 403 and narrow isFailedRefresh#24334

Merged
mafredri merged 7 commits into main from fix/rate-limit-403-detection
Apr 28, 2026
Conversation

@mafredri
Member

@mafredri mafredri commented Apr 14, 2026

ValidateToken treated all 403 responses as "token invalid," including
GitHub rate limits (X-RateLimit-Remaining: 0, Retry-After). When a
rate-limited 403 followed a successful token refresh, the token was
marked invalid and the retry loop exhausted, returning
InvalidTokenError. With PR 1 the token is now saved before
validation, but the caller still sees a false "invalid token" error.

isFailedRefresh included http.StatusForbidden in the status code
fallthrough (Switch 2), destroying tokens when any unknown error code
arrived with a 403. No known provider returns 403 from the token
endpoint.

Stacked on #24332.

This PR:

  • Splits the 401 || 403 check in ValidateToken into separate
    branches. On 403, inspects X-RateLimit-Remaining and Retry-After
    headers. If either indicates a rate limit, returns (true, nil, nil)
    (optimistically valid). Plain 403 without rate-limit headers preserves
    the existing (false, nil, nil) behavior.
  • Adds incorrect_client_credentials and invalid_client to
    isFailedRefresh Switch 1 (error codes), making those permanent
    failure classifications explicit rather than relying on the status
    code fallthrough.
  • Removes http.StatusForbidden from Switch 2 (status codes). Unknown
    error codes with 403 are now treated as transient (safe default: retry
    rather than destroy).
Implementation plan

External Auth Token Resilience: Plan

Research: external-auth-token-resilience-research.md

Problem

Coder's external auth token refresh for GitHub Apps silently discards
successfully refreshed tokens when post-refresh validation fails due to
API rate limits. GitHub rotates refresh tokens on use ("Once you use a
refresh token, that refresh token and the old user access token will no
longer work." -- GitHub docs, Refreshing user access tokens). When the
new token is not saved, the old refresh token in the DB is stale on
GitHub's side. The next refresh attempt fails permanently, destroying
the token and forcing manual re-authentication.

Secondary issues compound this: ValidateToken treats all 403 responses
as "token invalid" (including rate limits), and isFailedRefresh treats
403 as a permanent failure, destroying tokens on transient conditions.

Decisions

D1: Save refreshed token before validating

  • Chosen: Move the DB save from after validation to after
    TokenSource.Token() succeeds.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D2: Distinguish rate-limit 403, separate change

  • Chosen: Detect rate-limit headers in ValidateToken and
    isFailedRefresh 403 paths. Ship as a follow-up, not bundled with D1.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D2a: Return (true, nil, nil) for rate-limited ValidateToken

  • Chosen: When rate-limit headers indicate the 403 is a rate limit,
    return (true, nil, nil) (optimistically valid) rather than an error.
  • Classification: Agent-recommended. Identified during trace phase.
    Research Approach 2 originally proposed returning an error, but trace
    revealed this would cause provisioner failJob regression (line 678:
    non-InvalidTokenError errors fail the build). Returning
    (true, nil, nil) avoids that regression. The token was just issued
    by the IDP; the validation endpoint is failing for a transient reason.
  • Trade-off: If a token is genuinely revoked AND the account is
    simultaneously rate-limited, the revoked token is treated as valid
    during the rate-limit window. Recovery happens when the rate limit
    lifts (next validation correctly returns 403 without rate-limit
    headers).

D3: Narrow isFailedRefresh fallthrough alongside D2

  • Chosen: Add incorrect_client_credentials and invalid_client
    to the known error code list. Remove http.StatusForbidden from the
    status code fallthrough.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

D4: Defer RateLimitError type

  • Chosen: Defer. System is correct after D1+D2+D3. UX polish is
    separate scope.
  • Classification: Human ratified agent recommendation.
  • Full deliberation: Research document, Decisions section.

Implementation Flow

Two PRs. PR 1 is a standalone bug fix. PR 2 depends on PR 1 (stacked)
or lands after PR 1 merges. They can be reviewed in parallel but must
merge in order.

Each stage follows red-green TDD: write a failing test first (red),
then implement the minimum code to make it pass (green), then refactor.
The stages below are ordered accordingly: test stages precede or are
interleaved with implementation stages.

PR 1: Save refreshed token before validating (D1)

Scope: coderd/externalauth/externalauth.go, tests.

Stage 1 (RED): Write failing test for the data loss scenario

Add a test that reproduces Finding 1's chain:

  1. Configure a GitHub-type provider with an expired token and a valid
    refresh token.
  2. Mock TokenSource to return a new token (simulating successful
    refresh with rotated refresh token).
  3. Mock ValidateToken endpoint to return 403 (simulating rate limit).
  4. Call RefreshToken. It should return an error (validation failed).
  5. Assert: the DB contains the NEW access token and NEW refresh token
    (not the old ones). This is the critical assertion -- the whole
    point of the fix.
  6. Call RefreshToken again (simulating next caller). Assert: the
    refreshed token is read from DB, TokenSource does NOT attempt
    another refresh (token is not expired), validation is retried.

Artifact: New test TestRefreshToken/SaveBeforeValidate (or similar).
Test fails on current code (step 5 fails: DB still has old tokens).

Verification: Test compiles and runs. Step 5 assertion fails (red).

Stage 2 (GREEN): Move the DB save

In RefreshToken (line 164), after TokenSource.Token() succeeds
(line 194) and GenerateTokenExtra completes (line 259), save the
token to the DB immediately when the access token changed. This is the
existing UpdateExternalAuthLink call currently at line 288, moved
earlier.

The early save must assign the returned DB row back to
externalAuthLink (as the current save does at line 303:
externalAuthLink = updatedAuthLink). This ensures the condition at
line 288 (token.AccessToken != externalAuthLink.OAuthAccessToken)
becomes false, skipping the now-redundant write.

If the early save fails (DB error), return immediately with the error
(same pattern as current line 300-301). The refresh succeeded on the
IDP side but the token couldn't be persisted. The caller gets a
non-InvalidTokenError, which provisioner treats as failJob and
other callers treat as HTTP 500. This is the correct behavior: the
system is in a degraded state and should surface it.

Note: UpdateExternalAuthLink clears oauth_refresh_failure_reason
(externalauth.sql line 47). This is correct: the refresh succeeded.

Note: UpdateExternalAuthLink has no optimistic lock (writes
unconditionally by provider_id + user_id). For providers that rotate
refresh tokens (GitHub), only one concurrent refresh can succeed, so
races are self-resolving. For providers that don't rotate, both callers
get valid tokens and last-writer-wins is benign.

After saving, proceed to validation as before. If validation fails, the
token is already persisted. If validation succeeds, the save at line
288 is skipped (access token unchanged).
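
The reordered control flow can be sketched with stand-in types. This is a simplified model of Stage 2, not the actual RefreshToken code; link, deps, and the function-valued fields are stand-ins for the real externalauth and database types.

```go
package main

import (
	"errors"
	"fmt"
)

type link struct{ AccessToken, RefreshToken string }

type deps struct {
	refresh  func() (link, error)     // stands in for TokenSource.Token()
	save     func(link) (link, error) // stands in for db.UpdateExternalAuthLink
	validate func(link) bool          // stands in for ValidateToken
}

func refreshToken(d deps, stored link) (link, error) {
	tok, err := d.refresh()
	if err != nil {
		return stored, err
	}
	// Save immediately after a successful refresh, before validation, and
	// assign the returned row back so the later "token changed?" check is false.
	if tok.AccessToken != stored.AccessToken {
		stored, err = d.save(tok)
		if err != nil {
			return stored, err // refresh succeeded on the IDP side, persist failed
		}
	}
	if !d.validate(stored) {
		// The token is already persisted; a later caller can retry validation.
		return stored, errors.New("token failed to validate")
	}
	return stored, nil
}

func main() {
	saved := false
	d := deps{
		refresh:  func() (link, error) { return link{"new-at", "new-rt"}, nil },
		save:     func(l link) (link, error) { saved = true; return l, nil },
		validate: func(link) bool { return false }, // simulate a rate-limited 403
	}
	got, err := refreshToken(d, link{"old-at", "old-rt"})
	fmt.Println(saved, got.RefreshToken, err != nil) // prints "true new-rt true"
}
```

The key property: even when validation fails, the rotated refresh token has already been written, so the Finding 1 data loss cannot occur.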

Artifact: Modified RefreshToken function with the save moved before
the validate: label.

Verification: Stage 1 test now passes (green). Existing
TestRefreshToken subtests pass. The Updates test (line 309 of test
file) exercises the post-refresh validation path and should also pass.

Stage 3 (RED/GREEN): Concurrent behavior test

Write a test (or extend Stage 1's test) that verifies: caller A saves
new token, caller B's stale refresh fails, caller B's destructive
UpdateExternalAuthLinkRefreshToken is a no-op due to the optimistic
lock (WHERE oauth_refresh_token = @old_oauth_refresh_token).
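
The optimistic-lock semantics being asserted can be modeled in a few lines. The map-backed "DB" and the function name are illustrative only; the real write is the SQL UPDATE with the WHERE clause above.

```go
package main

import "fmt"

// clearRefreshToken models UpdateExternalAuthLinkRefreshToken's optimistic
// lock: the destructive write applies only when the stored refresh token
// still equals the caller's old value (WHERE oauth_refresh_token = @old).
func clearRefreshToken(db map[string]string, key, oldRefreshToken string) bool {
	if db[key] != oldRefreshToken {
		return false // stale caller: the winner already rotated the token; no-op
	}
	db[key] = "" // destructive write wins only against the expected old value
	return true
}

func main() {
	db := map[string]string{"github/user1": "rt-new"} // caller A already saved rt-new
	// Caller B, still holding the stale rt-old, attempts the destructive update.
	applied := clearRefreshToken(db, "github/user1", "rt-old")
	fmt.Println(applied, db["github/user1"]) // prints "false rt-new": the no-op preserves the winner's token
}
```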

Artifact: Concurrent test added or existing one verified.

Verification: Run make test RUN=TestRefreshToken with race detector.

PR 2: Rate-limit 403 detection and isFailedRefresh narrowing (D2+D2a+D3)

Depends on PR 1. Can be stacked on top of PR 1's branch or land after
PR 1 merges.

Scope: coderd/externalauth/externalauth.go, coderd/promoauth/,
tests.

Stage 1 (RED): Write failing tests for rate-limit detection

New tests for ValidateToken:

  • Mock endpoint returning 403 with X-RateLimit-Remaining: 0. Assert
    return is (true, nil, nil). Currently fails (returns
    (false, nil, nil)).
  • Mock endpoint returning 403 with Retry-After: 60. Assert return is
    (true, nil, nil). Currently fails.
  • Mock endpoint returning 403 with no rate-limit headers. Assert return
    is (false, nil, nil) (existing behavior preserved). Currently
    passes.

New tests for isFailedRefresh:

  • Mock a RetrieveError with ErrorCode: "incorrect_client_credentials"
    and StatusCode: 200. Assert isFailedRefresh returns true.
    Currently passes (caught by StatusOK fallthrough), but the test makes
    the behavior explicit.
  • Mock a RetrieveError with ErrorCode: "invalid_client" and
    StatusCode: 401. Assert returns true. Currently passes
    (StatusUnauthorized fallthrough).
  • Mock a RetrieveError with ErrorCode: "unknown_code" and
    StatusCode: 403. Assert returns false (transient). Currently
    fails (returns true via StatusForbidden fallthrough).

Artifact: New tests, three failing (red).

Verification: Tests compile. Three fail as expected.

Stage 2 (GREEN): Detect rate limits in ValidateToken

In ValidateToken, before the StatusForbidden check at line 348,
inspect the response for rate-limit indicators:

  • Primary rate limits: X-RateLimit-Remaining: 0 header.
  • Secondary rate limits: Retry-After header present.

If either indicator is present on a 403, return (true, nil, nil) --
optimistically treat the token as valid (Decision D2a). The IDP just
issued the token; the validation endpoint is failing for a known
transient reason.

For 403 responses with neither X-RateLimit-Remaining: 0 nor
Retry-After header, preserve the existing (false, nil, nil) behavior
(likely a genuine token revocation or permission error).

The existing promoauth/github.go already parses X-RateLimit-*
headers (lines 31-71) but only for 401 responses. The header-parsing
logic can be reused or extracted.

Artifact: Modified ValidateToken.

Verification: Rate-limit tests pass (green). Existing
ValidateFailure test still passes.

Stage 3 (GREEN): Expand isFailedRefresh and remove 403

Add to the ErrorCode switch (Switch 1):

  • "incorrect_client_credentials" (GitHub: wrong client_id/secret,
    HTTP 200)
  • "invalid_client" (RFC 6749 Section 5.2: client auth failed,
    HTTP 400/401)

Remove http.StatusForbidden from the StatusCode switch (Switch 2).
No known provider returns 403 from the token endpoint (Finding 12). The
403 case was speculative defense that causes real harm (destroys tokens
on rate-limited refresh attempts).

After expanding Switch 1, all known permanent error codes are explicit.
The 403 removal only affects unknown error codes arriving with 403
status, which will now be treated as transient (function returns
false). This is the safe default: retry rather than destroy.

Artifact: Expanded Switch 1, narrowed Switch 2.

Verification: All three failing tests pass (green). Existing
RefreshRetries test passes.

Stage 4: Integration verification

Run the full external auth test suite. Verify that PR 1 + PR 2 together
handle the complete Finding 1 chain: expired token, successful refresh,
rate-limited validation, token saved, no destruction, recovery when rate
limit lifts.

Artifact: All tests green.

Verification: make test RUN=TestRefreshToken, make test RUN=TestExternalAuth, race detector enabled.

Research document

External Auth Token Resilience: Research Document

Status: Frozen as of 2026-04-14. Changes require plan owner approval.

Problem Context

What exists

Coder's external auth subsystem manages OAuth2 tokens for Git providers
(GitHub, GitLab, Gitea, etc.). The core logic lives in
coderd/externalauth/externalauth.go. Tokens are stored in
external_auth_links (primary key: (provider_id, user_id)) with fields
for access token, refresh token, expiry, and a cached failure reason.

The token lifecycle has three operations:

  1. RefreshToken (line 164): attempts oauth2 refresh if token is expired,
    then validates the new token.
  2. ValidateToken (line 325): sends a GET to the provider's validate URL
    with the access token to confirm it's still good.
  3. isFailedRefresh (line 1256): classifies refresh errors as permanent
    vs. transient.

Multiple callers trigger RefreshToken:

  • Provisioner server during workspace builds (provisionerdserver.go:677)
  • Agent external-auth handler (workspaceagents.go:2101)
  • Template version auth check (templateversions.go:403)
  • User external auth list (coderd/externalauth.go:384)

The externalAuthByID handler (the "Test validate" UI) calls
ValidateToken directly without going through RefreshToken.

What's wrong

A user with both a GitHub App and a GitHub OAuth App configured reports:

"token expired and refreshing failed 2 hours ago with: oauth2:
'incorrect_client_credentials' 'The client_id and/or client_secret
passed are incorrect.'"

And separately:

  • Tokens appear to be destroyed when GitHub rate limits are hit
  • "Test validate" errors out during rate-limited states
  • Manual device re-auth fixes it, but only for 2-3 hours
  • Must re-auth 5-6 times per day

Why it matters

Token destruction is irreversible without manual re-authentication. If
transient conditions (rate limits, temporary server errors) are
misclassified as permanent failures, users are forced into a re-auth loop
that disrupts development workflow.


Finding 1: Rate limit during post-refresh validation silently discards a successfully refreshed token (PRIMARY)

Category: Verified

This is the most critical finding. RefreshToken (lines 194-299)
performs refresh and validation as a two-step sequence. A rate limit
during validation causes the refreshed token to be silently lost, which
then corrupts the refresh token state on GitHub's side.

The chain:

Step 1. GitHub App token expires (~8 hours).

Step 2. TokenSource.Token() (line 194) contacts GitHub's token
endpoint, successfully refreshes. Returns a new access_token and a
new refresh_token. GitHub rotates refresh tokens on use: the old
refresh token is now invalid on GitHub's side.

Step 3. ValidateToken (line 269) is called with the new token against
https://api.github.com/user. This endpoint is rate limited. GitHub
returns 403.

// line 348
if res.StatusCode == http.StatusUnauthorized || res.StatusCode == http.StatusForbidden {
    return false, nil, nil
}

Step 4. The 1-second retry loop (line 264-282) exhausts. Every retry
also gets 403 (rate limits last minutes to hours).

// line 273-285
if !valid {
    if c.Type == string(codersdk.EnhancedExternalAuthProviderGitHub) && r.Wait(retryCtx) {
        goto validate
    }
    return externalAuthLink, InvalidTokenError("token failed to validate")
}

Step 5. Returns InvalidTokenError("token failed to validate") at line
285. Lines 288-299 are never reached. The new token (with new
access_token AND new refresh_token) is never saved to the DB:

// line 288-299 -- NEVER EXECUTED when validation fails
if token.AccessToken != externalAuthLink.OAuthAccessToken {
    updatedAuthLink, err := db.UpdateExternalAuthLink(ctx, database.UpdateExternalAuthLinkParams{
        // ...
        OAuthAccessToken:  token.AccessToken,
        OAuthRefreshToken: token.RefreshToken,
        OAuthExpiry:       token.Expiry,
        // ...
    })

Step 6. DB still has: expired access_token + old refresh_token (which
GitHub already invalidated in step 2).

Step 7. Next call to RefreshToken reads the stale DB record.
TokenSource.Token() sees expired token, attempts refresh with the old
(now invalid) refresh token. GitHub returns an error.

Step 8. The oauth2 library's auth style probing (the AuthStyleCache is
in-memory and may be empty after restarts) first tries Basic auth.
GitHub receives Basic auth with the old refresh token and returns an
error. The specific error code depends on how GitHub classifies a stale
refresh token sent via Basic auth. This can produce
incorrect_client_credentials or bad_refresh_token depending on
GitHub's internal handling.

Step 9. isFailedRefresh returns true (either via the ErrorCode match
for bad_refresh_token or via the StatusCode fallthrough for
http.StatusOK). Refresh token is wiped from DB. Error is cached in
oauth_refresh_failure_reason.

Step 10. All subsequent calls see: expired token, no refresh token,
cached failure reason. Returns: "token expired and refreshing failed
2 hours ago with: ".

This is a data loss bug. A successful refresh (step 2) is discarded
because a transient rate limit on a completely different endpoint
(/user vs /login/oauth/access_token) causes the validation to fail.
The refresh token rotation means there is no recovery without manual
re-authentication.

Influences: All approaches. This is the primary bug to fix.

Finding 2: ValidateToken treats GitHub 403 (rate limit) as "token invalid"

Category: Verified

ValidateToken (line 348-351):

if res.StatusCode == http.StatusUnauthorized || res.StatusCode == http.StatusForbidden {
    // The token is no longer valid!
    return false, nil, nil
}

This returns (false, nil, nil) -- "not valid, no error." GitHub uses
HTTP 403 for at least three distinct conditions:

  • Token revoked / insufficient permissions
  • Primary rate limit exceeded (X-RateLimit-Remaining: 0)
  • Secondary rate limit exceeded (includes Retry-After header)

The code does not inspect X-RateLimit-Remaining, X-RateLimit-Reset,
Retry-After, or the response body ("API rate limit exceeded" message) to
distinguish these cases. All three are treated as "token invalid."

The 1-second retry loop (line 264-282) was added for GitHub read-replica
lag after token refresh (comment at line 274-278). It was not designed
for rate limits and is too short.

Influences: Finding 1 (this is the mechanism that triggers the
silent token discard), Approaches 1, 2, 3.

Finding 3: isFailedRefresh treats 403 as permanent token failure

Category: Verified

isFailedRefresh (line 1267-1284):

switch oauthErr.ErrorCode {
case "bad_refresh_token",        // Github
    "invalid_grant",             // Gitlab & Spec
    "unauthorized_client",       // Gitea & Spec
    "unsupported_grant_type":    // Spec
    return true
}

switch oauthErr.Response.StatusCode {
case http.StatusBadRequest, http.StatusUnauthorized, http.StatusForbidden, http.StatusOK:
    // Status codes that indicate the request was processed, and rejected.
    return true
case http.StatusInternalServerError, http.StatusTooManyRequests:
    // These do not indicate a failed refresh, but could be a temporary issue.
    return false
}

When isFailedRefresh returns true, the refresh token is permanently
destroyed (line 209-222). The code correctly handles 429 as transient,
but GitHub also returns 403 for rate limits. A 403 during refresh falls
through to the status code switch where 403 returns true, causing
permanent token destruction.

The comment says "Status codes that indicate the request was processed,
and rejected." This assumption is incorrect for GitHub's 403 rate limits.

Additionally, unknown error codes (like incorrect_client_credentials)
fall through the ErrorCode switch to the status code match. GitHub
returns token endpoint errors with HTTP 200, so http.StatusOK matches
and classifies them as permanent. This is a catch-all that is too broad.

Influences: Approaches 1, 2, 3.

Finding 4: Device auth and token refresh have an asymmetric client_secret requirement

Category: Verified

formatDeviceTokenURL (line 640-651, called by ExchangeDeviceCode at line 599) only sends client_id:

tok.RawQuery = url.Values{
    "client_id":   {c.ClientID},
    "device_code": {deviceCode},
    "grant_type":  {"urn:ietf:params:oauth:grant-type:device_code"},
}.Encode()

Token refresh goes through oauth2.TokenSource, which uses the
oauth2.Config constructed at line 706-708 with ClientSecret.
Device auth always works; refresh can fail if credentials are wrong.

GitHub App user-to-server tokens expire (~8 hours) and include a refresh
token. GitHub OAuth App tokens do not expire and have no refresh token.

Influences: Finding 1 (explains why device re-auth works but tokens
break again after expiry).

Finding 5: No rate limit awareness in the token lifecycle

Category: Verified

promoauth/github.go (line 31-34) only tracks rate limit headers for
401 (unauthenticated) responses:

if resp.StatusCode != http.StatusUnauthorized {
    return rateLimits{}, false
}

Authenticated request rate limits (5000/hour primary bucket) are
invisible to Prometheus. No code in coderd/externalauth/ reads
X-RateLimit-Remaining, X-RateLimit-Reset, or Retry-After from validation
or refresh responses.

No code throttles, backs off, or queues token validation/refresh
operations. Multiple callers independently trigger validation:

  • gitsync polls every 10 seconds (defaultInterval = 10 * time.Second in worker.go:23, uses its own TokenResolver)
  • Agent handler on every git credential request
  • Provisioner on every workspace build
  • Template version check during workspace creation (polled every 1000ms)
  • User visiting external auth page

Influences: Finding 1 (this is why rate limits are hit so often).

Finding 6: Singleflight was NOT implemented despite PR claim

Category: Verified

PR #22904 description claims singleflight deduplication was added. The
actual commit (53e52ae) only added the optimistic lock
(OldOauthRefreshToken WHERE clause). grep -rn singleflight coderd/externalauth/ returns zero results.

This means concurrent callers (provisioner + agent + template version
check) each create a separate TokenSource instance (line 194) and can
all independently attempt refresh. With GitHub's refresh token rotation,
this creates a race: the first caller's refresh invalidates the old
token, and the second caller's refresh with the now-stale token fails
with bad_refresh_token.

The optimistic lock on UpdateExternalAuthLinkRefreshToken prevents the
losing caller from clobbering a winner's NEW token. But it does not
prevent the losing caller from attempting the refresh in the first place
(which wastes a token endpoint request and can hit the 2000/hr token
endpoint rate limit).

Influences: Finding 1 (concurrent callers increase rate limit
pressure and can trigger the token discard chain).

Finding 7: oauth2 library auth style probing doubles token endpoint traffic

Category: Verified

GitHub's endpoint does not set AuthStyle:

var GitHub = oauth2.Endpoint{
    AuthURL:       "https://github.com/login/oauth/authorize",
    TokenURL:      "https://github.com/login/oauth/access_token",
    DeviceAuthURL: "https://github.com/login/device/code",
}

The AuthStyleCache is in-memory per oauth2.Config. On every Coder
server restart (or first refresh per Config), the oauth2 library probes:
first sends Basic auth (rejected by GitHub with
incorrect_client_credentials), then retries with form-encoded params.
(Recalled from oauth2 library documentation; not empirically tested this session.)
This is normally transparent but doubles token endpoint traffic toward
the 2000/hr limit.

The device flow (ExchangeDeviceCode) uses http.DefaultClient
directly, bypassing the oauth2.Config. It never populates the
AuthStyleCache. So every first refresh after a device auth exchange
triggers probing.

Influences: Finding 1 (contributes to token endpoint rate limit
pressure), explains why incorrect_client_credentials can appear even
with correct credentials (it's the first probe attempt, normally masked
by the retry).

Finding 8: "Test validate" calls ValidateToken directly

Category: Verified

externalAuthByID handler (coderd/externalauth.go line 62-63):

eg.Go(func() (err error) {
    res.Authenticated, res.User, err = config.ValidateToken(ctx, link.OAuthToken())
    return err
})

Calls ValidateToken without attempting a refresh first. If rate limited
(403), returns Authenticated: false. This explains the "Test validate
also errors out" symptom -- it hits the same rate-limited
/user endpoint.

Influences: Approach 3.

Finding 9: PR #15608 introduced the "destroy on failed refresh" behavior

Category: Verified

Commit 78f9f43 (PR #15608) introduced the pattern of zeroing the
refresh token on failure. The commit message:

"Once a token refresh fails, we remove the oauth_refresh_token from
the database. This will prevent the token from hitting the IDP for
subsequent refresh attempts."

The concern was legitimate: prevent repeated failed refreshes from
exhausting rate limits. But the implementation conflates "this refresh
failed permanently" with "this refresh failed right now."

PR #19339 (commit 4926410) added oauth_refresh_failure_reason to
cache and display the original error, producing the exact error format
the user reported.

Influences: Approaches 1, 2, 3.

Finding 10: gitsync has its own rate limit handling

Category: Verified

coderd/x/gitsync/gitsync.go uses a per-group
atomic.Pointer[gitprovider.RateLimitError] to short-circuit remaining
rows when a rate limit is hit. The worker has a 30-second tick timeout
and backs off on errors (120s default, 10min for no-token). This shows
the codebase already has rate-limit-aware patterns not present in the
core externalauth package.

Influences: Approach 2 precedent.

Finding 11: Repercussions of each token failure mode per caller

Category: Verified

Provisioner AcquireJob (provisionerdserver.go:648-689)

Scenario | Code path | User effect
Non-InvalidToken error | failJob(...) at line 679 | Build fails
Blocks for extended period | Hangs at line 677 | Build stuck "Pending"; worker blocked
InvalidTokenError | continue at line 683 | Token silently omitted; Terraform sees access_token=""

The optional flag on coder_external_auth only gates the UI submit
button. If bypassed (API call, auto-start), the build proceeds with an
empty token.

Agent external-auth handler (workspaceagents.go:2056-2128)

Scenario | Code path | User effect
Non-InvalidToken error | HTTP 500 at line 2102 | Git clone fails
Blocks for extended period | Hangs at line 2101 | Git operations stall
InvalidTokenError | HTTP 200 + auth URL | Agent told to re-auth

Template version auth check (templateversions.go:386-416)

Scenario | Code path | User effect
Non-InvalidToken error | HTTP 500 at line 404 | Create Workspace errors
Blocks for extended period | Hangs at line 403 | Loading spinner
InvalidTokenError | Authenticated = false | UI prompts re-auth

listUserExternalAuths (coderd/externalauth.go:370-406)

Iterates ALL links sequentially. Any error sets
Authenticated=false + ValidateError. Blocking stalls the whole page.


Approaches

Approach 1: Save the refreshed token before validating

Description: Persist the refreshed token to the DB immediately after
a successful TokenSource.Token() call, BEFORE calling ValidateToken.
This prevents the data loss bug in Finding 1 where a successfully
refreshed token is silently discarded because validation fails.

Concrete precedent: The code at line 288-299 already saves the token
when validation succeeds. This approach moves the save earlier. The
optimistic lock on UpdateExternalAuthLinkRefreshToken (Finding 6)
provides a model for safe concurrent writes.

Strongest argument for: Fixes the primary bug directly. A successful
refresh (which consumed the old refresh token on GitHub's side) is never
silently lost. Even if validation fails afterward, the new token is
in the DB and can be retried. This is the smallest change that
eliminates the data loss.

Strongest argument against: Saves a token that hasn't been validated
yet. However, validation is a liveness check, not a security gate.
The original ValidateURL comment (PR #9996, commit 45b53c2) says: "ensures an access
token is valid before returning it to the user. If omitted, tokens will
not be validated before being returned." It's optional. No commit or PR
in the history motivates validation as a defense against bad IDP
responses.

The only known validation failure after refresh is GitHub read-replica
lag (the Australian customer incident at line 274-278), which is
transient -- the token IS valid, the API just can't confirm it yet.
Saving first is strictly better for this case.

If a token somehow doesn't work: it stays in the DB, fails validation
on the next call, and the user is prompted to re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.

Makes easy: Fixing the immediate data loss. No new error types, no
new caller handling, no rate-limit detection needed for this specific
fix.

Makes hard: Nothing. This change is additive and does not conflict
with the other approaches. It should be done regardless of what else is
chosen.

Changes to existing codebase: Move db.UpdateExternalAuthLink from
after validation (line 288) to after TokenSource.Token() succeeds
(around line 258). If validation subsequently fails, the token is still
in the DB. If validation succeeds and the token was already saved, no
second write is needed.

Approach 2: Distinguish rate limit 403 from token-invalid 403

Description: Inspect GitHub API response headers and body to
distinguish rate-limit 403 from token-revoked 403 before classifying the
result. Changes both ValidateToken (line 348) and isFailedRefresh
(line 1278) to check for rate limit signals before treating 403 as
permanent.

Concrete precedent: GitHub's documentation specifies that rate limit
responses include X-RateLimit-Remaining: 0 header and response body
containing "message": "API rate limit exceeded". The promoauth
package already parses these headers (for metrics only). The gitsync
package has gitprovider.RateLimitError handling.

Strongest argument for: Prevents the rate-limit 403 from triggering
Finding 1's chain in the first place. ValidateToken would return an
error (not (false, nil, nil)) for rate limits, letting the caller
distinguish "can't verify right now" from "token is dead."

Strongest argument against: GitHub-specific. Other providers signal
rate limits differently. Also, even with correct 403 classification,
the 1-second retry window is still inadequate for rate limits.

Makes easy: Fixing the classification bug. Tokens that are actually
revoked are still promptly invalidated.

Makes hard: Generalizing. Each provider needs its own rate limit
detection.

Changes to existing codebase: ValidateToken checks
X-RateLimit-Remaining before treating 403 as invalid. isFailedRefresh
checks for rate limit indicators before the status code fallthrough.
New tests for 403-with-rate-limit-headers.

Approach 3: Add rate-limit-aware error type for callers

Description: Return a specific RateLimitError from token
operations when a rate limit is detected. Each caller handles it
according to its own constraints: provisioner fails fast with a clear
message, agent enters listen mode, template version check shows "try
later."

Concrete precedent: The gitsync package's
atomic.Pointer[gitprovider.RateLimitError] pattern. The device auth
flow's explicit 429 handling (line 562-575).

Strongest argument for: Gives each caller enough information to
make the right decision. A provisioner build shouldn't wait 60 minutes
for a rate limit. An agent in listen mode already has the infrastructure
to poll.

Strongest argument against: Requires changes at every caller site.
The provisioner, agent handler, template version check, and user auth
list all need new error branches.

Makes easy: Accurate error reporting. Users see "rate limited" not
"please re-authenticate."

Makes hard: Coordination. Without shared rate limit state, each
caller still independently discovers the rate limit.

Finding 12: The status code fallthrough in isFailedRefresh can be safely replaced with explicit error codes

Category: Verified

The status code fallthrough (Switch 2) was added in a single commit
(PR 15608) as a catch-all. Analysis of what each status code catches:

| Status | What it catches beyond Switch 1 | Risk if removed |
| ------ | ------------------------------- | --------------- |
| 400 | invalid_client, invalid_request, invalid_scope (RFC codes not in Switch 1) | Misses spec-defined permanent errors |
| 401 | invalid_client via Basic auth (RFC requires 401 for this) | Misses the RFC-mandated status for bad client credentials |
| 403 | No spec basis; no known provider verified to return 403 from the token endpoint; catches rate limits on API endpoints | Stops destroying tokens on rate-limited refresh |
| 200 | GitHub-specific: all token errors arrive as HTTP 200 | Misses incorrect_client_credentials and future GitHub error codes |

The safe replacement: expand Switch 1 with RFC 6749 Section 5.2 error
codes and GitHub-specific codes, then narrow or remove Switch 2.

Codes to add to Switch 1:

  • "incorrect_client_credentials" (GitHub: wrong client_id/secret)
  • "invalid_client" (RFC 6749 Section 5.2: client auth failed)
  • "invalid_request" (RFC 6749 Section 5.2: malformed request)

After adding these, Switch 2 only catches truly unknown codes. The 403
case can be removed (no spec basis, causes rate limit misclassification).
The 200 case becomes nearly redundant for GitHub (all known codes are
now in Switch 1). The 400/401 cases become redundant for spec providers.

Switch 2 could be kept as a lower-confidence fallback (log a warning
rather than silently treating unknown codes as permanent), or removed
entirely with the understanding that genuinely unknown permanent errors
will be retried until the token expires naturally.

Influences: Approach 2.

Finding 13: Saving before validating is safe

Category: Verified

The only realistic scenario where a refresh returns a "bad" token is
GitHub read-replica lag (the Australian customer incident documented at
line 274-278). In that case the token IS valid; GitHub's read replicas
haven't propagated it yet. Saving first is strictly better: the token
works, validation just can't confirm it yet.

For a genuinely bad token (provider bug): the bad token in the DB still
fails validation on the next call, prompting re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.

GenerateTokenExtra (line 259) is purely local (no network, no
validation dependency). All data needed for the DB save is available
after line 262.

Concurrency is safe: the optimistic lock on
UpdateExternalAuthLinkRefreshToken prevents a losing concurrent
refresher from overwriting the winner's saved token.

UpdateExternalAuthLink clears oauth_refresh_failure_reason, which
is correct: the refresh succeeded (IDP issued a token); only the
validation failed (different endpoint, different concern).
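
The reordering invariant can be sketched with stubs; the names below
are illustrative, and the real flow lives in RefreshToken.

```go
package main

import "fmt"

// link stands in for the external auth link row.
type link struct {
	accessToken  string
	refreshToken string
}

// saveThenValidate persists the freshly refreshed token before running
// the liveness check, so a failed or rate-limited validation can no
// longer discard a working token.
func saveThenValidate(db *link, refreshed link, validate func() bool) bool {
	*db = refreshed // the early DB save
	return validate()
}

func main() {
	db := link{accessToken: "old", refreshToken: "old-r"}
	ok := saveThenValidate(&db, link{accessToken: "new", refreshToken: "new-r"}, func() bool {
		return false // e.g. read-replica lag: cannot confirm yet
	})
	fmt.Println(ok, db.accessToken, db.refreshToken) // false new new-r
}
```

Even when the validation stub fails, the new refresh token survives in
the row, matching the "same outcome as today, but the new refresh token
is preserved" argument above.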

Influences: Approach 1.


Decisions

D1: Save refreshed token before validating

Question: How to prevent the data loss in Finding 1 where a
successfully refreshed token is silently discarded when post-refresh
validation fails?

Options considered: (a) Save before validating, (b) Extend the
validation retry window, (c) Skip validation after refresh entirely.

Chosen: (a) Save before validating.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended based on Finding 1 (data loss chain),
Finding 13 (safety analysis), and the ValidateURL comment (PR #9996,
commit 45b53c2) showing validation is an optional liveness check.
Human agreed: "I agree with your recommendations."

D2: Distinguish rate-limit 403 from token-invalid 403, separate change

Question: Should rate-limit 403 responses be classified differently
from token-revocation 403 responses in ValidateToken and isFailedRefresh?

Options considered: (a) Do it in the same change as D1, (b) Do it
as a separate follow-up change, (c) Don't do it.

Chosen: (b) Separate follow-up change.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended separating because D1 is a clear bug
fix with minimal risk, while D2 changes error semantics at every caller
site. Mixing them makes review harder and rollback riskier. Human
agreed.

D3: Narrow isFailedRefresh status code fallthrough alongside D2

Question: Should the status code fallthrough in isFailedRefresh be
narrowed?

Options considered: (a) Expand Switch 1 with known error codes and
remove 403 from Switch 2, (b) Remove Switch 2 entirely, (c) Keep as-is.

Chosen: (a) Expand Switch 1, remove 403 from Switch 2.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended adding incorrect_client_credentials
and invalid_client to Switch 1, removing http.StatusForbidden from
Switch 2 (no known provider returns 403 from token endpoint per
Finding 12). Same concern as D2 (403 misclassification) applied to the
refresh path. Human agreed.

D4: Defer RateLimitError type for callers

Question: Should a dedicated RateLimitError type be added for
caller-specific handling?

Options considered: (a) Do it now, (b) Defer.

Chosen: (b) Defer.

Classification: Human ratified agent recommendation.

Reasoning: Agent recommended deferring because with D1+D2+D3 the
system is correct (no data loss, no false InvalidTokenError for rate
limits). The remaining gap is UX polish ("Failed to refresh" vs. "rate
limited, retry after X"). Touches every caller site for marginal gain.
Human agreed.

Trace document

External Auth Token Resilience: Trace Document

Traces each causal claim in the implementation flow against the code.

Change 1, Stage 1: Move the DB save

Claim: After TokenSource.Token() succeeds (line 194) and
GenerateTokenExtra completes (line 259), save the token to the DB.

Code: Line 194 is token, err := c.TokenSource(ctx, existingToken).Token().
Line 259 is extra, err := c.GenerateTokenExtra(token). Lines 262-268
have a clean insertion point between the GenerateTokenExtra error
check and the retry setup. All required variables (token, extra,
externalAuthLink, c.ID) are in scope.

Holds.


Claim: If validation succeeds after the early save, the existing
save at line 288 is skipped because token.AccessToken == externalAuthLink.OAuthAccessToken.

Code: Line 288: if token.AccessToken != externalAuthLink.OAuthAccessToken.
The early save uses UpdateExternalAuthLink which returns the updated
row (:one query). The implementation must assign the result back to
externalAuthLink (as the current code does at line 303). Then the
condition at 288 is false, skipping the write.

Holds -- with requirement that the early save assigns the DB result
back to externalAuthLink.

Change 1, Stage 2: Test for data loss scenario

Claim: TokenSource can be mocked in tests.

Code: testutil.OAuth2Config has TokenSourceFunc (test file line
101-117). The FakeIDP also controls refresh via oidctest.WithRefresh.

Holds.


Claim (corrected): The Updates test (line 309) exercises the
post-refresh validation path.

Code: The Updates test sets link.OAuthExpiry = expired (a past
time), which causes TokenSource.Token() to attempt a real refresh.
After refresh, ValidateToken is called. This exercises the full
refresh-then-validate path.

Note: an earlier draft referenced ValidateRetryGitHub (line 247),
which tests non-expired token validation retry without refresh.
The plan was corrected to reference Updates instead.

Holds (after correction).

Change 1, Stage 3: Concurrent behavior

Claim: Caller A saves new token, caller B's stale refresh fails,
caller B's destructive write is a no-op.

Code: The UpdateExternalAuthLinkRefreshToken WHERE clause:
AND oauth_refresh_token = @old_oauth_refresh_token. The
@old_oauth_refresh_token comes from
externalAuthLink.OAuthRefreshToken (line 221), which is the value at
read-time. After caller A saves via UpdateExternalAuthLink (which
updates unconditionally by provider_id + user_id), the DB has A's new
refresh token. Caller B's WHERE clause matches against the original
token, which no longer matches the DB row. The write is a no-op.

Holds.
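
The no-op behavior is effectively a compare-and-swap, sketched here
against a map standing in for the database row (illustrative, not the
real sqlc query).

```go
package main

import "fmt"

// updateRefreshToken emulates the optimistic lock of
// UpdateExternalAuthLinkRefreshToken: the write applies only if the
// stored refresh token still equals the value the caller read.
func updateRefreshToken(db map[string]string, key, oldToken, newToken string) bool {
	if db[key] != oldToken {
		return false // stale caller: the WHERE clause matches no row
	}
	db[key] = newToken
	return true
}

func main() {
	db := map[string]string{"link": "rt-v1"}
	// Caller A wins the race and saves its new refresh token.
	fmt.Println(updateRefreshToken(db, "link", "rt-v1", "rt-v2")) // true
	// Caller B read rt-v1 earlier; its destructive write is a no-op.
	fmt.Println(updateRefreshToken(db, "link", "rt-v1", "")) // false
	fmt.Println(db["link"])                                  // rt-v2
}
```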

Change 2, Stage 1: Rate limit detection in ValidateToken

Claim (revised): For rate-limited 403, return (true, nil, nil).

Code: ValidateToken receives the raw HTTP response from
c.InstrumentedOAuth2Config.Do() (line 343). The promoauth
instrumentation only adds metrics counters
(instrumentedTripper.RoundTrip, promoauth/oauth2.go line 281-298). It
does not modify headers or body. Rate-limit headers
(X-RateLimit-Remaining, X-RateLimit-Reset) are available on res.

Returning (true, nil, nil) means the caller treats the token as
valid. Back in RefreshToken, line 273 if !valid is false, so
execution proceeds to line 288 (the DB save, which with Change 1 is
already done). The caller gets a successful result with the token.

This avoids a regression identified during trace: returning an error
would flow through line 270-271 as a non-InvalidTokenError. The
provisioner (line 678) treats non-InvalidToken errors as failJob.
Currently rate-limited validation returns InvalidTokenError which the
provisioner skips (line 682-683). Returning (true, nil, nil) is better
than both.

Holds.

Change 2, Stage 3: Remove 403 from isFailedRefresh

Claim: After removing http.StatusForbidden, a provider returning
403 with an unknown error code from the token endpoint is treated as
transient.

Code: isFailedRefresh at line 1256. With 403 removed from Switch
2: Switch 1 (error codes) -- unknown code, no match. Switch 2 (status
codes) -- 403 removed, no match. Falls through to line 1288:
return false.

isFailedRefresh returning false skips the destructive DB write at
line 209. Execution continues to line 256:
return externalAuthLink, InvalidTokenError(fmt.Sprintf("refresh token: %s", err.Error())).

This returns InvalidTokenError to callers. Provisioner skips the
token (line 682-683). Agent returns auth URL. Template version check
marks as unauthenticated. The refresh token survives in the DB. Next
call retries the refresh.

Holds.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

mafredri

This comment was marked as resolved.


@mafredri force-pushed the fix/rate-limit-403-detection branch from 079cfb9 to 07f0c70 on April 15, 2026 13:23


@mafredri force-pushed the fix/rate-limit-403-detection branch from 07f0c70 to 346317b on April 16, 2026 11:01
Base automatically changed from fix/save-refreshed-token-before-validation to main April 18, 2026 11:28
@mafredri force-pushed the fix/rate-limit-403-detection branch from 346317b to f051fad on April 23, 2026 04:09
@mafredri marked this pull request as ready for review on April 23, 2026 04:09


…fresh

ValidateToken treated all 403 responses as "token invalid," including
GitHub rate limits. isFailedRefresh included 403 in the status code
fallthrough, destroying tokens on rate-limited refresh attempts.

Detect rate-limit indicators (X-RateLimit-Remaining: 0, Retry-After)
on 403 responses in ValidateToken and return optimistically valid.
Add incorrect_client_credentials and invalid_client to isFailedRefresh
error code switch, remove 403 from the status code fallthrough.
Add end-to-end test for RefreshToken with rate-limited validation,
boundary test for non-zero X-RateLimit-Remaining, fix GitHub casing
in comment, and update ValidateToken doc comment to document
optimistic return on rate-limited 403.
…omment

Add Unauthorized_WithRateLimitHeaders test to lock the invariant that
401 always means revocation regardless of rate-limit headers. Fix doc
comment to say valid=true instead of (true, nil, nil) since the
provider-confirmed case returns (true, &user, nil).
@mafredri force-pushed the fix/rate-limit-403-detection branch from d3c147a to 018db10 on April 27, 2026 14:58

… docs

Handle 429 (Too Many Requests) in ValidateToken the same as
rate-limited 403. GitHub can return either status for rate limits;
before this change 403 returned optimistic success but 429 fell
through to the error path, causing provisioner failJob.

Add refresh token assertion to SaveBeforeValidate_RateLimited.
Narrow isRateLimited doc to acknowledge secondary limit edge cases.
Replace "Switch 1"/"Switch 2" jargon with descriptive names.
Expand isRateLimited doc comment to explain why OR (not AND) is
correct, why CDN/WAF false positives are not a practical concern,
and what secondary rate limit patterns are not covered. Prevents
the same design question from arising in future reviews.

Based on investigation of Cloudflare, AWS, Azure, and nginx docs
plus GitHub secondary rate limit behavior across five independent
sources (octokit, go-github-ratelimit, hub4j, community reports).

Update ValidateToken doc to cover the 429 path added in the previous
commit. Use "rate-limited response (403 with rate-limit headers or
429)" instead of counting cases.

Fix misleading comment on test fixture: oauth2.Token.Valid() requires
AccessToken != "" AND not expired, not just non-zero expiry.

Filed coder/internal#1503 to track the observability gap (DEREM-2). Resolving the thread.

🤖 Generated with the help of Coder Agents on behalf of @mafredri.

@mafredri requested review from hugodutka and johnstcn on April 28, 2026 10:17
Comment thread coderd/externalauth/externalauth.go Outdated
Comment thread coderd/externalauth/externalauth_internal_test.go Outdated
Comment thread coderd/externalauth/externalauth.go Outdated
Comment on lines +1305 to +1316
func isRateLimited(resp *http.Response) bool {
	if resp == nil {
		return false
	}
	if resp.Header.Get("Retry-After") != "" {
		return true
	}
	if resp.Header.Get("X-RateLimit-Remaining") == "0" {
		return true
	}
	return false
}

Suggestion for follow-up: could be a useful coderd/util function IsRateLimited(*http.Response) (bool, time.Duration)


Agreed it could be useful if more callers need the check. Currently only used in one place, so extracting it now would be premature.

🤖 Posted using /amend-review skill via Coder Agents.

Trim isRateLimited doc comment per reviewer nit. Remove line number
from test comment (drifts). Convert ValidateToken status code
if-chain to switch.
@mafredri force-pushed the fix/rate-limit-403-detection branch from 92d7ae4 to 84afdbc on April 28, 2026 10:53
@mafredri merged commit 1926b7e into main on Apr 28, 2026
26 checks passed
@mafredri deleted the fix/rate-limit-403-detection branch on April 28, 2026 15:03
@github-actions bot locked and limited conversation to collaborators on Apr 28, 2026