Why 50 failed downloads don't retry at once

A flaky host returning 503s is a problem. 50 downloads slamming it back in lockstep every 2 seconds is a worse problem. Here's the retry strategy we use to avoid the second one.

At a glance: 60s max backoff, four jitter modes, 20 errors remembered per download.

The naive retry is a trap

The first retry loop anyone writes looks roughly the same. Wait a second, try again. Two seconds, try again. Double the delay each time -- 1, 2, 4, 8, 16 -- and call it done. Most tutorials stop there.

This works fine for one download. Put 50 downloads through the same loop at the same time and all of them retry at second 1, then all of them at second 3, then all of them at second 7. A host that was already overloaded gets re-hammered by 50 identical requests inside the same millisecond, over and over, and the retries actively make things worse than no retries at all. The problem isn't that you retried; it's that the whole pool retried in lockstep.
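To make the lockstep concrete, here is a minimal Python sketch. It is an illustration, not Veloxar's code: the function name and the 1-second base delay are assumptions. It computes when each of 50 downloads would retry under plain doubling.

```python
from itertools import accumulate


def naive_retry_times(base=1.0, attempts=3):
    """Seconds after the first failure at which each retry fires."""
    delays = [base * 2 ** i for i in range(attempts)]  # 1, 2, 4
    return list(accumulate(delays))                    # 1, 3, 7


# Hypothetical pool: 50 downloads that all failed at t=0.
schedules = [naive_retry_times() for _ in range(50)]

# Every download computes the same schedule, so each retry wave
# sends all 50 requests at the same moment.
distinct = {tuple(s) for s in schedules}
print(distinct)  # {(1.0, 3.0, 7.0)}
```

Fifty downloads, one schedule: nothing in the loop can ever pull them apart.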

Thundering herds aren't abstract

We ran into this early. A file host started rate-limiting us with 429 responses. The app would back off, wait, retry -- and every queued download to that host would retry in the same window, get 429 again, back off again. The queue took longer than it would have with no backoff at all, because each synchronized retry made the host angrier and stretched the penalty window further.

The industry name for this is thundering herd. Every pool of parallel workers that retries on failure has some version of the problem. The fix is to add randomness to the backoff -- jitter -- so retries spread out instead of stacking.

Three real strategies, and one for tests

Veloxar's RetryManager ships four jitter modes. One of them, none, does no randomization at all -- we keep it around for tests that need deterministic timing and for the vanishingly rare case where determinism matters more than throughput. It isn't what you want for real downloads. The three that do matter:

Full. Pick a delay uniformly at random in [0, cap), where cap is the exponential delay for the current attempt (1, 2, 4, 8, ... seconds), not the overall maximum backoff. Total randomization. Spreads retries widely, but the average delay works out to half the cap, so you end up waiting less than the exponential model says you should. Fine for some hosts, aggressive for others.

Equal. Half of the delay is a fixed floor taken from the exponential value, the other half is random: (cap/2) + random(0, cap/2), with cap defined as in Full. A compromise: you always wait at least cap/2, with spread on top. Safer default than Full for hosts you don't want to annoy.

Decorrelated. The delay for attempt N is random(baseDelay, previousDelay * 3). The next attempt's ceiling grows from whatever the previous delay actually turned out to be. This one is borrowed from AWS's writeup on exponential backoff and is our default for host-level retries.
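The two stateless modes, plus none, fit in a few lines of Python. This is a sketch under stated assumptions rather than RetryManager's actual code: the function names, the mode strings, and the 1-second base delay are all made up for illustration. Decorrelated is left out here because it needs the previous delay as input.

```python
import random

BASE_DELAY = 1.0    # seconds; assumed, the base is not stated in this post
MAX_BACKOFF = 60.0  # seconds; the overall ceiling


def exponential_delay(attempt):
    """Plain doubling for attempt 0, 1, 2, ..., clamped to the maximum."""
    return min(MAX_BACKOFF, BASE_DELAY * 2 ** attempt)


def jittered_delay(mode, attempt, rng):
    """Delay in seconds for one retry under a stateless jitter mode."""
    cap = exponential_delay(attempt)
    if mode == "none":
        return cap                                  # deterministic, for tests
    if mode == "full":
        return rng.uniform(0.0, cap)                # anywhere up to the cap
    if mode == "equal":
        return cap / 2 + rng.uniform(0.0, cap / 2)  # fixed floor plus spread
    raise ValueError(f"unknown jitter mode: {mode}")


rng = random.Random(7)  # seeded only so the demo is repeatable
for mode in ("none", "full", "equal"):
    print(mode, [round(jittered_delay(mode, n, rng), 2) for n in range(5)])
```

The none row prints the bare exponential ladder; the other two rows show how much of that ladder each mode gives up in exchange for spread.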

Why decorrelated wins the common case

With Full and Equal, two downloads that started their retries around the same time both draw from the same distribution. The variance helps, but 50 downloads drawing from random(0, 32s) will still produce two or three that land in the same 100-millisecond window. Decorrelated jitter anchors each retry to its own history, so two downloads that happened to collide on attempt 2 will almost certainly diverge on attempt 3 instead of staying synchronized.
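A rough sanity check on that collision estimate, as back-of-the-envelope arithmetic in Python. It assumes the 32-second range is cut into fixed 100 ms bins and that draws are independent and uniform, which simplifies what "the same window" means.

```python
from math import comb

downloads = 50
range_ms = 32_000   # random(0, 32s)
window_ms = 100

bins = range_ms // window_ms   # 320 bins of 100 ms each
pairs = comb(downloads, 2)     # 1225 possible pairs of downloads

# Any given pair lands in the same bin with probability 1/bins,
# so the expected number of colliding pairs is pairs / bins.
expected_colliding_pairs = pairs / bins

print(bins, pairs, round(expected_colliding_pairs, 2))  # 320 1225 3.83
```

So a handful of collisions per retry wave is what you should expect on average, even with the full 32 seconds to spread across.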

The formula is delay = min(cap, random(baseDelay, previousDelay * 3)). The previous delay -- not a fixed exponent -- determines the next ceiling. With a baseDelay of 1 second, a retry that waited 4 seconds can next wait anywhere from 1 to 12 seconds; one that waited 6 can next wait from 1 to 18. Two downloads that aligned will almost certainly separate on the next roll.
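Here is the decorrelated step as a small Python sketch. The helper names are hypothetical, the 1-second base is inferred from the "1 to 12" example, and the 60-second value is the maximum backoff this post uses.

```python
import random

BASE_DELAY = 1.0    # seconds; inferred from the "1 to 12" example
MAX_BACKOFF = 60.0  # seconds


def decorrelated_bounds(previous_delay):
    """Range the next delay can fall in, given the previous one."""
    return BASE_DELAY, min(MAX_BACKOFF, previous_delay * 3)


def decorrelated_delay(previous_delay, rng):
    """delay = min(cap, random(baseDelay, previousDelay * 3))"""
    return min(MAX_BACKOFF, rng.uniform(BASE_DELAY, previous_delay * 3))


print(decorrelated_bounds(4.0))   # (1.0, 12.0)
print(decorrelated_bounds(6.0))   # (1.0, 18.0)
print(decorrelated_bounds(30.0))  # (1.0, 60.0), clamped by the maximum

# Two hypothetical downloads that collided: both just waited 4 seconds.
rng_a, rng_b = random.Random(1), random.Random(2)
delay_a = delay_b = 4.0
for _ in range(3):
    delay_a = decorrelated_delay(delay_a, rng_a)
    delay_b = decorrelated_delay(delay_b, rng_b)
    print(f"{delay_a:.2f}s vs {delay_b:.2f}s")
```

Each download feeds its own last delay back in, so once two of them draw different values their ranges differ too, and they keep drifting apart.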

We cap everything at 60 seconds. Longer than that and the retry becomes indistinguishable from "we gave up" to the user, even if the code is technically still trying. If a host needs more than a minute between attempts, that's a signal to stop retrying and surface the failure, not to keep the spinner going for five minutes.

Some errors aren't worth retrying at all

Jitter is for recoverable failures. A 503 or a timeout means the server is temporarily unhappy; retrying makes sense. A 404 or a malformed URL means the world is never going to change. Retrying those is wasted work and wasted patience.

Veloxar's retry classifier splits errors into two buckets:

Retry: connection timeout, lost connection, DNS failure, 5xx server errors, connection unavailable.
Don't retry: 4xx client errors, user-cancelled, bad URL, decode failure.
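The two buckets translate into a very small function. What follows is a hedged Python sketch: the enum, its member names, and should_retry are invented for illustration and are not Veloxar's API.

```python
from enum import Enum, auto


class Failure(Enum):
    """Hypothetical failure kinds mirroring the two lists above."""
    CONNECTION_TIMEOUT = auto()
    LOST_CONNECTION = auto()
    DNS_FAILURE = auto()
    CONNECTION_UNAVAILABLE = auto()
    HTTP_STATUS = auto()      # carries a status code alongside
    USER_CANCELLED = auto()
    BAD_URL = auto()
    DECODE_FAILURE = auto()


_RETRYABLE = {
    Failure.CONNECTION_TIMEOUT,
    Failure.LOST_CONNECTION,
    Failure.DNS_FAILURE,
    Failure.CONNECTION_UNAVAILABLE,
}


def should_retry(failure, status=None):
    """True only for failures in the 'Retry' bucket."""
    if failure is Failure.HTTP_STATUS:
        # 5xx is retried; 4xx and anything else is surfaced to the user.
        return status is not None and 500 <= status <= 599
    return failure in _RETRYABLE


print(should_retry(Failure.HTTP_STATUS, 503))  # True
print(should_retry(Failure.HTTP_STATUS, 403))  # False
print(should_retry(Failure.DNS_FAILURE))       # True
print(should_retry(Failure.BAD_URL))           # False
```

Keeping the retryable set as an explicit allowlist means a new failure kind defaults to "don't retry" until someone decides otherwise.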

The list is short on purpose. We don't want to be clever about 403s. A 403 might mean "session expired and a retry will fix it," but it also might mean "we got blocked and 50 more retries will get us blocked harder." We pick the second interpretation and surface the error so the user can react, instead of burning cycles in the dark.

When it still fails, we remember why

If jitter spreads retries out and the retries still don't work, the download fails and the user has to be shown an error. The usual choice is to show them the last error the loop saw. I have yet to work on a system where that was the right answer.

The "last error wins" pattern is almost always a regression waiting to happen. Say a download fails with a rate limit and retries three times, and the host's clearance cookie expires between retries 2 and 3, so the final error becomes "Cloudflare challenge failed." The real story -- rate limit, rate limit, cookie expired -- is invisible. You fix the Cloudflare bug, the download still fails, and you have no idea you're chasing the wrong thing.

Every download in Veloxar keeps a timestamped error history, capped at 20 entries. When a retry loop gives up, you see the whole arc: first error at one timestamp, same error a few seconds later, a different error on the last attempt. 20 entries is enough to see the pattern. More than that and the history becomes its own problem -- logs that balloon silently are a worse bug than the one they were logging.
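A bounded history like this is cheap to build. Below is a minimal Python sketch, assuming that the oldest entry is the one dropped once the cap is reached (the post only says the history is capped); the class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone

MAX_HISTORY = 20  # the cap described above


@dataclass
class ErrorEntry:
    at: datetime
    message: str


@dataclass
class ErrorHistory:
    # deque(maxlen=...) silently discards the oldest entry when full.
    entries: deque = field(default_factory=lambda: deque(maxlen=MAX_HISTORY))

    def record(self, message, at=None):
        """Append one timestamped error; defaults to the current UTC time."""
        self.entries.append(ErrorEntry(at or datetime.now(timezone.utc), message))

    def arc(self):
        """Messages in the order they happened, oldest first."""
        return [entry.message for entry in self.entries]


history = ErrorHistory()
history.record("rate limited")
history.record("rate limited")
history.record("Cloudflare challenge failed")
print(history.arc())
# ['rate limited', 'rate limited', 'Cloudflare challenge failed']
```

Showing arc() instead of only the last entry is what keeps the two rate-limit errors visible next to the final Cloudflare one.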

In practice

You queue 50 links before you walk away. A few hosts are having a bad day. Some downloads fail their first attempt; Veloxar waits a different amount of time per download and retries. Most of them succeed on attempt 2 or 3.

Retries stagger across seconds instead of lining up inside the same millisecond, so the UI never turns into a wall of synchronized "retrying..." spinners. A download also doesn't give up after one bad minute; the retry stays alive across multiple rolls of backoff. And when something really isn't going to work, the error history shows the whole arc, so you can tell what actually broke instead of guessing from the last line.