fix(coderd/externalauth): detect rate-limit 403 and narrow isFailedRefresh#24334
Merged
Conversation
Base automatically changed from fix/save-refreshed-token-before-validation to main on April 18, 2026 11:28.
…fresh

`ValidateToken` treated all 403 responses as "token invalid," including GitHub rate limits. `isFailedRefresh` included 403 in the status code fallthrough, destroying tokens on rate-limited refresh attempts. Detect rate-limit indicators (`X-RateLimit-Remaining: 0`, `Retry-After`) on 403 responses in `ValidateToken` and return optimistically valid. Add `incorrect_client_credentials` and `invalid_client` to the `isFailedRefresh` error code switch; remove 403 from the status code fallthrough.

Add an end-to-end test for `RefreshToken` with rate-limited validation, a boundary test for non-zero `X-RateLimit-Remaining`, fix GitHub casing in a comment, and update the `ValidateToken` doc comment to document the optimistic return on rate-limited 403.

…omment

Add an `Unauthorized_WithRateLimitHeaders` test to lock the invariant that 401 always means revocation regardless of rate-limit headers. Fix the doc comment to say `valid=true` instead of `(true, nil, nil)`, since the provider-confirmed case returns `(true, &user, nil)`.
… docs

Handle 429 (Too Many Requests) in `ValidateToken` the same as rate-limited 403. GitHub can return either status for rate limits; before this change 403 returned optimistic success but 429 fell through to the error path, causing provisioner `failJob`. Add a refresh token assertion to `SaveBeforeValidate_RateLimited`. Narrow the `isRateLimited` doc to acknowledge secondary limit edge cases. Replace "Switch 1"/"Switch 2" jargon with descriptive names.

Expand the `isRateLimited` doc comment to explain why OR (not AND) is correct, why CDN/WAF false positives are not a practical concern, and which secondary rate limit patterns are not covered. This prevents the same design question from arising in future reviews. Based on investigation of Cloudflare, AWS, Azure, and nginx docs plus GitHub secondary rate limit behavior across five independent sources (octokit, go-github-ratelimit, hub4j, community reports).

Update the `ValidateToken` doc to cover the 429 path added in the previous commit. Use "rate-limited response (403 with rate-limit headers or 429)" instead of counting cases. Fix a misleading comment on a test fixture: `oauth2.Token.Valid()` requires `AccessToken != ""` AND not expired, not just a non-zero expiry.
Member (Author)

Filed coder/internal#1503 to track the observability gap (DEREM-2). Resolving the thread.

johnstcn approved these changes on Apr 28, 2026.
Comment on lines +1305 to +1316
```go
func isRateLimited(resp *http.Response) bool {
	if resp == nil {
		return false
	}
	if resp.Header.Get("Retry-After") != "" {
		return true
	}
	if resp.Header.Get("X-RateLimit-Remaining") == "0" {
		return true
	}
	return false
}
```
Member

Suggestion for follow-up: could be a useful coderd/util function `IsRateLimited(*http.Response) (bool, time.Duration)`
Member (Author)

Agreed it could be useful if more callers need the check. It is currently used in only one place, so extracting it now would be premature.
🤖 Posted using /amend-review skill via Coder Agents.
Trim the `isRateLimited` doc comment per reviewer nit. Remove the line number from a test comment (it drifts). Convert the `ValidateToken` status code if-chain to a switch.
`ValidateToken` treated all 403 responses as "token invalid," including GitHub rate limits (`X-RateLimit-Remaining: 0`, `Retry-After`). When a rate-limited 403 followed a successful token refresh, the token was marked invalid and the retry loop exhausted, returning `InvalidTokenError`. With PR 1 the token is now saved before validation, but the caller still sees a false "invalid token" error.

`isFailedRefresh` included `http.StatusForbidden` in the status code fallthrough (Switch 2), destroying tokens when any unknown error code arrived with a 403. No known provider returns 403 from the token endpoint.

Stacked on #24332.

This PR:

- Splits the combined `401 || 403` check in `ValidateToken` into separate branches. On 403, inspects the `X-RateLimit-Remaining` and `Retry-After` headers. If either indicates a rate limit, returns `(true, nil, nil)` (optimistically valid). A plain 403 without rate-limit headers preserves the existing `(false, nil, nil)` behavior.
- Adds `incorrect_client_credentials` and `invalid_client` to `isFailedRefresh` Switch 1 (error codes), making those permanent-failure classifications explicit rather than relying on the status code fallthrough.
- Removes `http.StatusForbidden` from Switch 2 (status codes). Unknown error codes with 403 are now treated as transient (safe default: retry rather than destroy).
Implementation plan
External Auth Token Resilience: Plan
Research: external-auth-token-resilience-research.md
Problem
Coder's external auth token refresh for GitHub Apps silently discards
successfully refreshed tokens when post-refresh validation fails due to
API rate limits. GitHub rotates refresh tokens on use ("Once you use a
refresh token, that refresh token and the old user access token will no
longer work." -- GitHub docs, Refreshing user access tokens). When the
new token is not saved, the old refresh token in the DB is stale on
GitHub's side. The next refresh attempt fails permanently, destroying
the token and forcing manual re-authentication.
Secondary issues compound this: `ValidateToken` treats all 403 responses as "token invalid" (including rate limits), and `isFailedRefresh` treats 403 as a permanent failure, destroying tokens on transient conditions.
Decisions

D1: Save refreshed token before validating
Persist the token as soon as `TokenSource.Token()` succeeds.

D2: Distinguish rate-limit 403, separate change
Changes the `ValidateToken` and `isFailedRefresh` 403 paths. Ship as a follow-up, not bundled with D1.

D2a: Return `(true, nil, nil)` for rate-limited `ValidateToken`
Return `(true, nil, nil)` (optimistically valid) rather than an error. Research Approach 2 originally proposed returning an error, but a trace revealed this would cause a provisioner `failJob` regression (line 678: non-`InvalidTokenError` errors fail the build). Returning `(true, nil, nil)` avoids that regression. The token was just issued by the IDP; the validation endpoint is failing for a transient reason. Tradeoff: if a token is revoked while the provider is simultaneously rate-limited, the revoked token is treated as valid during the rate-limit window. Recovery happens when the rate limit lifts (the next validation correctly returns 403 without rate-limit headers).

D3: Narrow isFailedRefresh fallthrough alongside D2
Add `incorrect_client_credentials` and `invalid_client` to the known error code list. Remove `http.StatusForbidden` from the status code fallthrough.

D4: Defer RateLimitError type
Separate scope.
Implementation Flow
Two PRs. PR 1 is a standalone bug fix. PR 2 depends on PR 1 (stacked)
or lands after PR 1 merges. They can be reviewed in parallel but must
merge in order.
Each stage follows red-green TDD: write a failing test first (red),
then implement the minimum code to make it pass (green), then refactor.
The stages below are ordered accordingly: test stages precede or are
interleaved with implementation stages.
PR 1: Save refreshed token before validating (D1)

Scope: `coderd/externalauth/externalauth.go`, tests.

Stage 1 (RED): Write failing test for the data loss scenario

Add a test that reproduces Finding 1's chain:

1. Seed the DB with an expired access token and a refresh token.
2. Mock `TokenSource` to return a new token (simulating a successful refresh with a rotated refresh token).
3. Mock the `ValidateToken` endpoint to return 403 (simulating a rate limit).
4. Call `RefreshToken`. It should return an error (validation failed).
5. Assert: the DB now contains the NEW access and refresh tokens (not the old ones). This is the critical assertion -- the whole point of the fix.
6. Call `RefreshToken` again (simulating the next caller). Assert: the refreshed token is read from the DB, `TokenSource` does NOT attempt another refresh (the token is not expired), and validation is retried.

Artifact: New test `TestRefreshToken/SaveBeforeValidate` (or similar). The test fails on current code (step 5 fails: the DB still has the old tokens).

Verification: The test compiles and runs. The step 5 assertion fails (red).
Stage 2 (GREEN): Move the DB save

In `RefreshToken` (line 164), after `TokenSource.Token()` succeeds (line 194) and `GenerateTokenExtra` completes (line 259), save the token to the DB immediately when the access token changed. This is the existing `UpdateExternalAuthLink` call currently at line 288, moved earlier.

The early save must assign the returned DB row back to `externalAuthLink` (as the current save does at line 303: `externalAuthLink = updatedAuthLink`). This ensures the condition at line 288 (`token.AccessToken != externalAuthLink.OAuthAccessToken`) becomes false, skipping the now-redundant write.

If the early save fails (DB error), return immediately with the error (same pattern as current lines 300-301). The refresh succeeded on the IDP side but the token couldn't be persisted. The caller gets a non-`InvalidTokenError`, which the provisioner treats as `failJob` and other callers treat as HTTP 500. This is the correct behavior: the system is in a degraded state and should surface it.

Note: `UpdateExternalAuthLink` clears `oauth_refresh_failure_reason` (externalauth.sql line 47). This is correct: the refresh succeeded.

Note: `UpdateExternalAuthLink` has no optimistic lock (it writes unconditionally by `provider_id + user_id`). For providers that rotate refresh tokens (GitHub), only one concurrent refresh can succeed, so races are self-resolving. For providers that don't rotate, both callers get valid tokens and last-writer-wins is benign.

After saving, proceed to validation as before. If validation fails, the token is already persisted. If validation succeeds, the save at line 288 is skipped (access token unchanged).

Artifact: Modified `RefreshToken` function with the save moved before the `validate:` label.

Verification: The Stage 1 test now passes (green). Existing `TestRefreshToken` subtests pass. The `Updates` test (line 309 of the test file) exercises the post-refresh validation path and should also pass.
Stage 3 (RED/GREEN): Concurrent behavior test

Write a test (or extend Stage 1's test) that verifies: caller A saves the new token, caller B's stale refresh fails, and caller B's destructive `UpdateExternalAuthLinkRefreshToken` is a no-op due to the optimistic lock (`WHERE oauth_refresh_token = @old_oauth_refresh_token`).

Artifact: Concurrent test added or existing one verified.

Verification: Run `make test RUN=TestRefreshToken` with the race detector.

PR 2: Rate-limit 403 detection and isFailedRefresh narrowing (D2+D2a+D3)

Depends on PR 1. Can be stacked on top of PR 1's branch or land after PR 1 merges.

Scope: `coderd/externalauth/externalauth.go`, `coderd/promoauth/`, tests.
Stage 1 (RED): Write failing tests for rate-limit detection

New tests for `ValidateToken`:
- 403 with `X-RateLimit-Remaining: 0`. Assert the return is `(true, nil, nil)`. Currently fails (returns `(false, nil, nil)`).
- 403 with `Retry-After: 60`. Assert the return is `(true, nil, nil)`. Currently fails.
- 403 with neither header. Assert the return is `(false, nil, nil)` (existing behavior preserved). Currently passes.

New tests for `isFailedRefresh`:
- `RetrieveError` with `ErrorCode: "incorrect_client_credentials"` and `StatusCode: 200`. Assert `isFailedRefresh` returns `true`. Currently passes (caught by the StatusOK fallthrough), but the test makes the behavior explicit.
- `RetrieveError` with `ErrorCode: "invalid_client"` and `StatusCode: 401`. Assert it returns `true`. Currently passes (StatusUnauthorized fallthrough).
- `RetrieveError` with `ErrorCode: "unknown_code"` and `StatusCode: 403`. Assert it returns `false` (transient). Currently fails (returns `true` via the StatusForbidden fallthrough).

Artifact: New tests, three failing (red).

Verification: Tests compile. Three fail as expected.
Stage 2 (GREEN): Detect rate limits in ValidateToken

In `ValidateToken`, before the `StatusForbidden` check at line 348, inspect the response for rate-limit indicators:
- `X-RateLimit-Remaining: 0` header.
- `Retry-After` header present.

If either indicator is present on a 403, return `(true, nil, nil)` -- optimistically treat the token as valid (Decision D2a). The IDP just issued the token; the validation endpoint is failing for a known transient reason.

For 403 responses with neither `X-RateLimit-Remaining: 0` nor a `Retry-After` header, preserve the existing `(false, nil, nil)` behavior (likely a genuine token revocation or permission error).

The existing `promoauth/github.go` already parses `X-RateLimit-*` headers (lines 31-71) but only for 401 responses. The header-parsing logic can be reused or extracted.

Artifact: Modified `ValidateToken`.

Verification: Rate-limit tests pass (green). The existing `ValidateFailure` test still passes.

Stage 3 (GREEN): Expand isFailedRefresh and remove 403

Add to the `ErrorCode` switch (Switch 1):
- `"incorrect_client_credentials"` (GitHub: wrong client_id/secret, HTTP 200)
- `"invalid_client"` (RFC 6749 Section 5.2: client auth failed, HTTP 400/401)

Remove `http.StatusForbidden` from the `StatusCode` switch (Switch 2). No known provider returns 403 from the token endpoint (Finding 12). The 403 case was speculative defense that causes real harm (it destroys tokens on rate-limited refresh attempts).

After expanding Switch 1, all known permanent error codes are explicit. The 403 removal only affects unknown error codes arriving with 403 status, which will now be treated as transient (the function returns `false`). This is the safe default: retry rather than destroy.

Artifact: Expanded Switch 1, narrowed Switch 2.
Verification: All three failing tests pass (green). The existing `RefreshRetries` test passes.

Stage 4: Integration verification
Run the full external auth test suite. Verify that PR 1 + PR 2 together
handle the complete Finding 1 chain: expired token, successful refresh,
rate-limited validation, token saved, no destruction, recovery when rate
limit lifts.
Artifact: All tests green.
Verification: `make test RUN=TestRefreshToken`, `make test RUN=TestExternalAuth`, race detector enabled.

Research document
External Auth Token Resilience: Research Document
Status: Frozen as of 2026-04-14. Changes require plan owner approval.
Problem Context
What exists
Coder's external auth subsystem manages OAuth2 tokens for Git providers
(GitHub, GitLab, Gitea, etc.). The core logic lives in
`coderd/externalauth/externalauth.go`. Tokens are stored in `external_auth_links` (primary key: `(provider_id, user_id)`) with fields for access token, refresh token, expiry, and a cached failure reason.
The token lifecycle has three operations:
- `RefreshToken`: refreshes an expired token, then validates the new token.
- `ValidateToken`: calls the provider API with the access token to confirm it's still good.
- `isFailedRefresh`: classifies refresh errors as permanent vs. transient.
Multiple callers trigger `RefreshToken`:
- Provisioner AcquireJob (`provisionerdserver.go:677`)
- Agent external-auth handler (`workspaceagents.go:2101`)
- Template version auth check (`templateversions.go:403`)
- `listUserExternalAuths` (`coderd/externalauth.go:384`)

The `externalAuthByID` handler (the "Test validate" UI) calls `ValidateToken` directly without going through `RefreshToken`.

What's wrong
A user with both a GitHub App and a GitHub OAuth App configured reports:
And separately:
Why it matters
Token destruction is irreversible without manual re-authentication. If
transient conditions (rate limits, temporary server errors) are
misclassified as permanent failures, users are forced into a re-auth loop
that disrupts development workflow.
Finding 1: Rate limit during post-refresh validation silently discards a successfully refreshed token (PRIMARY)
Category: Verified
This is the most critical finding. `RefreshToken` (lines 194-299) performs refresh and validation as a two-step sequence. A rate limit
during validation causes the refreshed token to be silently lost, which
then corrupts the refresh token state on GitHub's side.
The chain:

Step 1. The GitHub App token expires (~8 hours).

Step 2. `TokenSource.Token()` (line 194) contacts GitHub's token endpoint and successfully refreshes. It returns a new access_token and a new refresh_token. GitHub rotates refresh tokens on use: the old refresh token is now invalid on GitHub's side.

Step 3. `ValidateToken` (line 269) is called with the new token against https://api.github.com/user. This endpoint is rate limited. GitHub returns 403.

Step 4. The 1-second retry loop (lines 264-282) exhausts. Every retry also gets 403 (rate limits last minutes to hours).

Step 5. Returns `InvalidTokenError("token failed to validate")` at line 285. Lines 288-299 are never reached. The new token (with new access_token AND new refresh_token) is never saved to the DB.

Step 6. The DB still has: expired access_token + old refresh_token (which GitHub already invalidated in step 2).

Step 7. The next call to `RefreshToken` reads the stale DB record. `TokenSource.Token()` sees the expired token and attempts a refresh with the old (now invalid) refresh token. GitHub returns an error.

Step 8. The oauth2 library's auth style probing (the `AuthStyleCache` is in-memory and may be empty after restarts) first tries Basic auth. GitHub receives Basic auth with the old refresh token and returns an error. The specific error code depends on how GitHub classifies a stale refresh token sent via Basic auth. This can produce `incorrect_client_credentials` or `bad_refresh_token` depending on GitHub's internal handling.

Step 9. `isFailedRefresh` returns true (either via the `ErrorCode` match for `bad_refresh_token` or via the `StatusCode` fallthrough for `http.StatusOK`). The refresh token is wiped from the DB. The error is cached in `oauth_refresh_failure_reason`.

Step 10. All subsequent calls see: expired token, no refresh token, cached failure reason. Returns: "token expired and refreshing failed 2 hours ago with: ".

This is a data loss bug. A successful refresh (step 2) is discarded because a transient rate limit on a completely different endpoint (`/user` vs `/login/oauth/access_token`) causes the validation to fail. The refresh token rotation means there is no recovery without manual re-authentication.
Influences: All approaches. This is the primary bug to fix.
Finding 2: ValidateToken treats GitHub 403 (rate limit) as "token invalid"
Category: Verified
`ValidateToken` (lines 348-351) returns `(false, nil, nil)` -- "not valid, no error." GitHub uses HTTP 403 for at least three distinct conditions (rate limiting, token revocation, and missing permissions).
The code does not inspect X-RateLimit-Remaining, X-RateLimit-Reset,
Retry-After, or the response body ("API rate limit exceeded" message) to
distinguish these cases. All three are treated as "token invalid."
The 1-second retry loop (line 264-282) was added for GitHub read-replica
lag after token refresh (comment at line 274-278). It was not designed
for rate limits and is too short.
Influences: Finding 1 (this is the mechanism that triggers the
silent token discard), Approaches 1, 2, 3.
Finding 3: isFailedRefresh treats 403 as permanent token failure
Category: Verified
`isFailedRefresh` (lines 1267-1284): when it returns true, the refresh token is permanently destroyed (lines 209-222). The code correctly handles 429 as transient, but GitHub also returns 403 for rate limits. A 403 during refresh falls through to the status code switch, where 403 returns `true`, causing permanent token destruction.
The comment says "Status codes that indicate the request was processed,
and rejected." This assumption is incorrect for GitHub's 403 rate limits.
Additionally, unknown error codes (like `incorrect_client_credentials`) fall through the `ErrorCode` switch to the status code match. GitHub returns token endpoint errors with HTTP 200, so `http.StatusOK` matches and classifies them as permanent. This catch-all is too broad.
Influences: Approaches 1, 2, 3.
Finding 4: Device auth and token refresh have an asymmetric client_secret requirement
Category: Verified
`formatDeviceTokenURL` (lines 640-651, called by `ExchangeDeviceCode` at line 599) only sends `client_id`. Token refresh goes through `oauth2.TokenSource`, which uses the `oauth2.Config` constructed at lines 706-708 with `ClientSecret`. Device auth always works; refresh can fail if credentials are wrong.
GitHub App user-to-server tokens expire (~8 hours) and include a refresh
token. GitHub OAuth App tokens do not expire and have no refresh token.
Influences: Finding 1 (explains why device re-auth works but tokens
break again after expiry).
Finding 5: No rate limit awareness in the token lifecycle
Category: Verified
`promoauth/github.go` (lines 31-34) only tracks rate limit headers for 401 (unauthenticated) responses. Authenticated request rate limits (the 5000/hour primary bucket) are invisible to Prometheus. No code in `coderd/externalauth/` reads
or refresh responses.
No code throttles, backs off, or queues token validation/refresh
operations. Multiple callers independently trigger validation (one of them on a fixed schedule: `defaultInterval = 10 * time.Second` in worker.go:23, using its own TokenResolver).

Influences: Finding 1 (this is why rate limits are hit so often).
Finding 6: Singleflight was NOT implemented despite PR claim
Category: Verified
PR #22904's description claims singleflight deduplication was added. The actual commit (53e52ae) only added the optimistic lock (the `OldOauthRefreshToken` WHERE clause). `grep -rn singleflight coderd/externalauth/` returns zero results.

This means concurrent callers (provisioner + agent + template version check) each create a separate `TokenSource` instance (line 194) and can all independently attempt refresh. With GitHub's refresh token rotation, this creates a race: the first caller's refresh invalidates the old token, and the second caller's refresh with the now-stale token fails with `bad_refresh_token`.

The optimistic lock on `UpdateExternalAuthLinkRefreshToken` prevents the losing caller from clobbering a winner's NEW token. But it does not prevent the losing caller from attempting the refresh in the first place (which wastes a token endpoint request and can hit the 2000/hr token endpoint rate limit).
Influences: Finding 1 (concurrent callers increase rate limit
pressure and can trigger the token discard chain).
Finding 7: oauth2 library auth style probing doubles token endpoint traffic
Category: Verified
GitHub's endpoint does not set `AuthStyle`. The `AuthStyleCache` is in-memory per `oauth2.Config`. On every Coder server restart (or the first refresh per Config), the oauth2 library probes: it first sends Basic auth (rejected by GitHub with `incorrect_client_credentials`), then retries with form-encoded params. (Recalled from oauth2 library documentation; not empirically tested this session.)

This is normally transparent but doubles token endpoint traffic toward the 2000/hr limit.

The device flow (`ExchangeDeviceCode`) uses `http.DefaultClient` directly, bypassing the `oauth2.Config`. It never populates the `AuthStyleCache`, so every first refresh after a device auth exchange triggers probing.
Influences: Finding 1 (contributes to token endpoint rate limit
pressure), explains why
incorrect_client_credentialscan appear evenwith correct credentials (it's the first probe attempt, normally masked
by the retry).
Finding 8: "Test validate" calls ValidateToken directly
Category: Verified
The `externalAuthByID` handler (coderd/externalauth.go lines 62-63) calls `ValidateToken` without attempting a refresh first. If rate limited (403), it returns `Authenticated: false`. This explains the "Test validate also errors out" symptom -- it hits the same rate-limited `/user` endpoint.

Influences: Approach 3.
Finding 9: PR #15608 introduced the "destroy on failed refresh" behavior
Category: Verified
Commit 78f9f43 (PR #15608) introduced the pattern of zeroing the
refresh token on failure. The commit message:
The concern was legitimate: prevent repeated failed refreshes from
exhausting rate limits. But the implementation conflates "this refresh
failed permanently" with "this refresh failed right now."
PR #19339 (commit 4926410) added `oauth_refresh_failure_reason` to cache and display the original error, producing the exact error format
the user reported.
Influences: Approaches 1, 2, 3.
Finding 10: gitsync has its own rate limit handling
Category: Verified
`coderd/x/gitsync/gitsync.go` uses a per-group `atomic.Pointer[gitprovider.RateLimitError]` to short-circuit remaining rows when a rate limit is hit. The worker has a 30-second tick timeout
and backs off on errors (120s default, 10min for no-token). This shows
the codebase already has rate-limit-aware patterns not present in the
core `externalauth` package.

Influences: Approach 2 precedent.
Finding 11: Repercussions of each token failure mode per caller
Category: Verified
Provisioner AcquireJob (provisionerdserver.go:648-689)
- Non-`InvalidTokenError` errors: `failJob(...)` at line 679.
- `InvalidTokenError`: `continue` at line 683; the build proceeds with `access_token=""`.

The `optional` flag on `coder_external_auth` only gates the UI submit button. If bypassed (API call, auto-start), the build proceeds with an empty token.
Agent external-auth handler (workspaceagents.go:2056-2128)
Template version auth check (templateversions.go:386-416)
- Any failure surfaces as `Authenticated = false`.

listUserExternalAuths (coderd/externalauth.go:370-406)
- Iterates ALL links sequentially. Any error sets `Authenticated=false` + `ValidateError`. Blocking stalls the whole page.

Approaches
Approach 1: Save the refreshed token before validating
Description: Persist the refreshed token to the DB immediately after a successful `TokenSource.Token()` call, BEFORE calling `ValidateToken`. This prevents the data loss bug in Finding 1 where a successfully refreshed token is silently discarded because validation fails.
Concrete precedent: The code at line 288-299 already saves the token
when validation succeeds. This approach moves the save earlier. The
optimistic lock on `UpdateExternalAuthLinkRefreshToken` (Finding 6) provides a model for safe concurrent writes.
Strongest argument for: Fixes the primary bug directly. A successful
refresh (which consumed the old refresh token on GitHub's side) is never
silently lost. Even if validation fails afterward, the new token is
in the DB and can be retried. This is the smallest change that
eliminates the data loss.
Strongest argument against: Saves a token that hasn't been validated
yet. However, validation is a liveness check, not a security gate.
The original `ValidateURL` comment (PR #9996, commit 45b53c2) says: "ensures an access token is valid before returning it to the user. If omitted, tokens will
not be validated before being returned." It's optional. No commit or PR
in the history motivates validation as a defense against bad IDP
responses.
The only known validation failure after refresh is GitHub read-replica
lag (the Australian customer incident at line 274-278), which is
transient -- the token IS valid, the API just can't confirm it yet.
Saving first is strictly better for this case.
If a token somehow doesn't work: it stays in the DB, fails validation
on the next call, and the user is prompted to re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.
Makes easy: Fixing the immediate data loss. No new error types, no
new caller handling, no rate-limit detection needed for this specific
fix.
Makes hard: Nothing. This change is additive and does not conflict
with the other approaches. It should be done regardless of what else is
chosen.
Changes to existing codebase: Move `db.UpdateExternalAuthLink` from after validation (line 288) to after `TokenSource.Token()` succeeds (around line 258). If validation subsequently fails, the token is still
in the DB. If validation succeeds and the token was already saved, no
second write is needed.
Approach 2: Distinguish rate limit 403 from token-invalid 403
Description: Inspect GitHub API response headers and body to
distinguish rate-limit 403 from token-revoked 403 before classifying the
result. Changes both `ValidateToken` (line 348) and `isFailedRefresh` (line 1278) to check for rate limit signals before treating 403 as permanent.
Concrete precedent: GitHub's documentation specifies that rate limit responses include the `X-RateLimit-Remaining: 0` header and a response body containing `"message": "API rate limit exceeded"`. The `promoauth` package already parses these headers (for metrics only). The `gitsync` package has `gitprovider.RateLimitError` handling.
gitprovider.RateLimitErrorhandling.Strongest argument for: Prevents the rate-limit 403 from triggering
Finding 1's chain in the first place.
ValidateTokenwould return anerror (not
(false, nil, nil)) for rate limits, letting the callerdistinguish "can't verify right now" from "token is dead."
Strongest argument against: GitHub-specific. Other providers signal
rate limits differently. Also, even with correct 403 classification,
the 1-second retry window is still inadequate for rate limits.
Makes easy: Fixing the classification bug. Tokens that are actually
revoked are still promptly invalidated.
Makes hard: Generalizing. Each provider needs its own rate limit
detection.
Changes to existing codebase: `ValidateToken` checks `X-RateLimit-Remaining` before treating 403 as invalid. `isFailedRefresh` checks for rate limit indicators before the status code fallthrough. New tests for 403-with-rate-limit-headers.
Approach 3: Add rate-limit-aware error type for callers
Description: Return a specific `RateLimitError` from token operations when a rate limit is detected. Each caller handles it
according to its own constraints: provisioner fails fast with a clear
message, agent enters listen mode, template version check shows "try
later."
Concrete precedent: The gitsync package's `atomic.Pointer[gitprovider.RateLimitError]` pattern. The device auth flow's explicit 429 handling (lines 562-575).
Strongest argument for: Gives each caller enough information to
make the right decision. A provisioner build shouldn't wait 60 minutes
for a rate limit. An agent in listen mode already has the infrastructure
to poll.
Strongest argument against: Requires changes at every caller site.
The provisioner, agent handler, template version check, and user auth
list all need new error branches.
Makes easy: Accurate error reporting. Users see "rate limited" not
"please re-authenticate."
Makes hard: Coordination. Without shared rate limit state, each
caller still independently discovers the rate limit.
Finding 12: The status code fallthrough in isFailedRefresh can be safely replaced with explicit error codes
Category: Verified
The status code fallthrough (Switch 2) was added in a single commit (PR 15608) as a catch-all. Analysis of what each status code catches:
- `invalid_client`, `invalid_request`, `invalid_scope` (RFC codes not in Switch 1)
- `invalid_client` via Basic auth (the RFC requires 401 for this)
- `incorrect_client_credentials` and future GitHub error codes

The safe replacement: expand Switch 1 with RFC 6749 Section 5.2 error codes and GitHub-specific codes, then narrow or remove Switch 2.

Codes to add to Switch 1:
- `"incorrect_client_credentials"` (GitHub: wrong client_id/secret)
- `"invalid_client"` (RFC 6749 Section 5.2: client auth failed)
- `"invalid_request"` (RFC 6749 Section 5.2: malformed request)

After adding these, Switch 2 only catches truly unknown codes. The 403 case can be removed (no spec basis; it causes rate limit misclassification).
The 200 case becomes nearly redundant for GitHub (all known codes are
now in Switch 1). The 400/401 cases become redundant for spec providers.
Switch 2 could be kept as a lower-confidence fallback (log a warning
rather than silently treating unknown codes as permanent), or removed
entirely with the understanding that genuinely unknown permanent errors
will be retried until the token expires naturally.
Influences: Approach 2.
Finding 13: Saving before validating is safe
Category: Verified
The only realistic scenario where a refresh returns a "bad" token is
GitHub read-replica lag (the Australian customer incident documented at
line 274-278). In that case the token IS valid; GitHub's read replicas
haven't propagated it yet. Saving first is strictly better: the token
works, validation just can't confirm it yet.
For a genuinely bad token (provider bug): the bad token in the DB still
fails validation on the next call, prompting re-auth. Same outcome as
today, but the new refresh token is preserved instead of being lost.
`GenerateTokenExtra` (line 259) is purely local (no network, no validation dependency). All data needed for the DB save is available after line 262.
Concurrency is safe: the optimistic lock on `UpdateExternalAuthLinkRefreshToken` prevents a losing concurrent refresher from overwriting the winner's saved token.
`UpdateExternalAuthLink` clears `oauth_refresh_failure_reason`, which is correct: the refresh succeeded (the IdP issued a token); only the validation failed (different endpoint, different concern).
Influences: Approach 1.
Decisions
D1: Save refreshed token before validating
Question: How to prevent the data loss in Finding 1 where a
successfully refreshed token is silently discarded when post-refresh
validation fails?
Options considered: (a) Save before validating, (b) Extend the
validation retry window, (c) Skip validation after refresh entirely.
Chosen: (a) Save before validating.
Classification: Human ratified agent recommendation.
Reasoning: Agent recommended based on Finding 1 (data loss chain),
Finding 13 (safety analysis), and the ValidateURL comment (PR #9996,
commit 45b53c2) showing validation is an optional liveness check.
Human agreed: "I agree with your recommendations."
D2: Distinguish rate-limit 403 from token-invalid 403, separate change
Question: Should rate-limit 403 responses be classified differently
from token-revocation 403 responses in `ValidateToken` and `isFailedRefresh`?
Options considered: (a) Do it in the same change as D1, (b) Do it
as a separate follow-up change, (c) Don't do it.
Chosen: (b) Separate follow-up change.
Classification: Human ratified agent recommendation.
Reasoning: Agent recommended separating because D1 is a clear bug
fix with minimal risk, while D2 changes error semantics at every caller
site. Mixing them makes review harder and rollback riskier. Human
agreed.
D3: Narrow isFailedRefresh status code fallthrough alongside D2
Question: Should the status code fallthrough in isFailedRefresh be
narrowed?
Options considered: (a) Expand Switch 1 with known error codes and
remove 403 from Switch 2, (b) Remove Switch 2 entirely, (c) Keep as-is.
Chosen: (a) Expand Switch 1, remove 403 from Switch 2.
Classification: Human ratified agent recommendation.
Reasoning: Agent recommended adding `incorrect_client_credentials` and
`invalid_client` to Switch 1, and removing `http.StatusForbidden` from
Switch 2 (no known provider returns 403 from the token endpoint per
Finding 12). The same concern as D2 (403 misclassification) applied to the
refresh path. Human agreed.
D4: Defer RateLimitError type for callers
Question: Should a dedicated `RateLimitError` type be added for
caller-specific handling?
Options considered: (a) Do it now, (b) Defer.
Chosen: (b) Defer.
Classification: Human ratified agent recommendation.
Reasoning: Agent recommended deferring because with D1+D2+D3 the
system is correct (no data loss, no false InvalidTokenError for rate
limits). The remaining gap is UX polish ("Failed to refresh" vs. "rate
limited, retry after X"). Touches every caller site for marginal gain.
Human agreed.
Trace document
External Auth Token Resilience: Trace Document
Traces each causal claim in the implementation flow against the code.
Change 1, Stage 1: Move the DB save
Claim: After `TokenSource.Token()` succeeds (line 194) and `GenerateTokenExtra` completes (line 259), save the token to the DB.
Code: Line 194 is `token, err := c.TokenSource(ctx, existingToken).Token()`. Line 259 is `extra, err := c.GenerateTokenExtra(token)`. Lines 262-268 have a clean insertion point between the `GenerateTokenExtra` error check and the retry setup. All required variables (`token`, `extra`, `externalAuthLink`, `c.ID`) are in scope.
Holds.
Claim: If validation succeeds after the early save, the existing save at line 288 is skipped because `token.AccessToken == externalAuthLink.OAuthAccessToken`.
Code: Line 288: `if token.AccessToken != externalAuthLink.OAuthAccessToken`. The early save uses `UpdateExternalAuthLink`, which returns the updated row (`:one` query). The implementation must assign the result back to `externalAuthLink` (as the current code does at line 303). Then the condition at 288 is false, skipping the write.
Holds -- with requirement that the early save assigns the DB result back to `externalAuthLink`.
Change 1, Stage 2: Test for data loss scenario
Claim: `TokenSource` can be mocked in tests.
Code: `testutil.OAuth2Config` has `TokenSourceFunc` (test file line 101-117). The `FakeIDP` also controls refresh via `oidctest.WithRefresh`.
Holds.
Claim (corrected): The `Updates` test (line 309) exercises the post-refresh validation path.
Code: The `Updates` test sets `link.OAuthExpiry = expired` (a past time), which causes `TokenSource.Token()` to attempt a real refresh. After refresh, `ValidateToken` is called. This exercises the full refresh-then-validate path.
Note: an earlier draft referenced `ValidateRetryGitHub` (line 247), which tests non-expired token validation retry without refresh. The plan was corrected to reference `Updates` instead.
Holds (after correction).
Change 1, Stage 3: Concurrent behavior
Claim: Caller A saves a new token, caller B's stale refresh fails, caller B's destructive write is a no-op.
Code: The `UpdateExternalAuthLinkRefreshToken` WHERE clause: `AND oauth_refresh_token = @old_oauth_refresh_token`. The `@old_oauth_refresh_token` comes from `externalAuthLink.OAuthRefreshToken` (line 221), which is the value at read-time. After caller A saves via `UpdateExternalAuthLink` (which updates unconditionally by `provider_id + user_id`), the DB has A's new refresh token. Caller B's WHERE clause matches against the original token, which no longer matches the DB row. The write is a no-op.
Holds.
Change 2, Stage 1: Rate limit detection in ValidateToken
Claim (revised): For rate-limited 403, return `(true, nil, nil)`.
Code: `ValidateToken` receives the raw HTTP response from `c.InstrumentedOAuth2Config.Do()` (line 343). The `promoauth` instrumentation only adds metrics counters (`instrumentedTripper.RoundTrip`, promoauth/oauth2.go line 281-298). It does not modify headers or body. Rate-limit headers (`X-RateLimit-Remaining`, `X-RateLimit-Reset`) are available on `res`.
Returning `(true, nil, nil)` means the caller treats the token as valid. Back in `RefreshToken`, line 273 `if !valid` is false, so execution proceeds to line 288 (the DB save, which with Change 1 is already done). The caller gets a successful result with the token.
This avoids a regression identified during trace: returning an error would flow through line 270-271 as a non-`InvalidTokenError`. The provisioner (line 678) treats non-`InvalidToken` errors as `failJob`. Currently rate-limited validation returns `InvalidTokenError`, which the provisioner skips (line 682-683). Returning `(true, nil, nil)` is better than both.
Holds.
Change 2, Stage 3: Remove 403 from isFailedRefresh
Claim: After removing `http.StatusForbidden`, a provider returning 403 with an unknown error code from the token endpoint is treated as transient.
Code: `isFailedRefresh` is at line 1256. With 403 removed from Switch 2: Switch 1 (error codes) -- unknown code, no match. Switch 2 (status codes) -- 403 removed, no match. Falls through to line 1288: `return false`.
`isFailedRefresh` returning `false` skips the destructive DB write at line 209. Execution continues to line 256: `return externalAuthLink, InvalidTokenError(fmt.Sprintf("refresh token: %s", err.Error()))`.
This returns `InvalidTokenError` to callers. Provisioner skips the token (line 682-683). Agent returns the auth URL. Template version check marks the link as unauthenticated. The refresh token survives in the DB. Next call retries the refresh.
Holds.